Monday, January 04, 2016

Configure Standalone Spark on Windows 10

Background

It's been almost two years since I wrote a blog post; hopefully the next ones will be much more frequent. This post is about my experience of setting up Spark as a standalone instance on a Windows 10 64-bit machine. I got back to a bit of programming after a long gap, and it was quite evident that I struggled a bit in configuring the system. Someone else coming from a .Net background and new to the Java way of working might face similar difficulties to those I faced over a day to get Spark up and running.

 

What is Spark?

Spark is an execution engine which is gaining popularity due to its ability to perform in-memory parallel processing. It claims to be up to 100 times faster than Hadoop MapReduce processing. It also fits well into the distributed computing paradigm of the big data world. One of the positives of Spark is that it can be run in standalone mode without having to set up nodes in a cluster. This also means that we do not need to set up a Hadoop cluster to get started with Spark. Spark is written in Scala and supports Scala, Java, Python and R as of writing this post in January 2016. Currently it is one of the most popular projects among the different tools used as part of the Hadoop ecosystem.

 

What is the problem in installing Spark in standalone mode on a Windows machine?

I started by downloading a copy of the Spark 1.5.2 distribution (released Nov 9, 2015) from the Apache website. I chose the version which is pre-built for Hadoop 2.6 and later. If you prefer, you can also download the source code and build the whole package. After extracting the contents of the downloaded file, I tried running the spark-shell command from the command prompt. If everything is installed successfully, we should get a Scala shell to execute our commands. Unfortunately, on a Windows 10 64-bit machine, Spark does not start very well. This seems to be a known issue, as there are multiple resources on the internet which talk about it.
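For reference, on a working setup the whole process is just extract-and-run; a minimal sketch, assuming the archive was extracted to C:\spark (the path is an example, not from the post):

```shell
REM Extract spark-1.5.2-bin-hadoop2.6.tgz to C:\spark, then:
cd C:\spark
bin\spark-shell
REM On a healthy install this drops you into a Scala REPL (scala> prompt)
REM with the SparkContext (sc) and SQLContext (sqlContext) already created
```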

When the spark-shell command is executed, multiple errors are reported on the console. The error I received showed problems with the creation of the SqlContext, accompanied by a big stack trace which was difficult to understand.

Personally, this is one thing I do not like about Java. In my past experience I always found it very difficult to debug issues, as the error messages often point to something which may not be the real source of the problem. I wish Java-based tools and applications will become easier to deploy in future. In one sense it is good that it makes us aware of many of the internals, but on the other hand sometimes you just want to install the stuff and get started with it without spending days configuring it.

I was referring to the Pluralsight course related to Apache Spark fundamentals. The getting started and installation module of the course was helpful as a first step in resolving the issue. As suggested in the course, I changed the verbosity of Spark's output from INFO to ERROR, and the amount of information on the console reduced a lot. With this change, I was immediately able to see the error related to the missing winutils.exe, a utility required specifically on Windows systems. This is reported as issue SPARK-2356 in the Spark issue tracker.
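The verbosity change is made through Spark's log4j configuration; a sketch of the relevant change, assuming the log4j.properties file is created by copying conf\log4j.properties.template inside the Spark installation (this template ships with the 1.x distributions):

```properties
# conf/log4j.properties (copied from conf/log4j.properties.template)
# Change the root logger level from INFO to ERROR to quiet the console
log4j.rootCategory=ERROR, console
```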

After copying the winutils.exe file from the Pluralsight course into the Spark installation's bin folder, I started getting a permissions error for the tmp/hive folder. As recommended in different online posts, I tried changing the permissions using chmod, setting them to 777. This did not fix the issue. I tried running the command with administrative privileges; still no luck. I updated the PATH environment variable to point to the Spark\bin directory. As suggested, I added SPARK_HOME and HADOOP_HOME to the environment variables. Initially I had put the winutils.exe file in the Spark\bin folder; I then moved it out to a dedicated directory named Winutils and updated the HADOOP_HOME environment variable to point to this directory. Still no luck.
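For reference, the permission change is done through winutils itself rather than a Unix chmod; a sketch, assuming winutils.exe is on the PATH or in the current directory:

```shell
REM Run from an administrative command prompt
winutils.exe chmod 777 \tmp\hive
```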

As many people had experienced the same problem with the latest version of Spark, 1.5.2, I thought of trying an older version. Even with 1.5.1 I had the same issue. I went back to the 1.4.2 version released in November 2014, and that seemed to create the SqlContext correctly. But that version is more than a year old, so there was no point sticking to an outdated release.

At this stage I was contemplating the option of getting the source code and building it from scratch. Having read in multiple posts about setting the JAVA_HOME environment variable, I thought of trying this approach. I downloaded the Java 7 SDK and created the environment variable to point to the location where the JDK was installed. Even this did not solve the problem.

 

Use the right version of Winutils

As a last option, I decided to download winutils.exe from a different source. The downloaded contents included winutils.exe and some other DLLs, such as hadoop.dll, as shown in the figure below.

[Figure: winutils.exe with Hadoop DLLs]

After putting these contents in the Winutils directory and running the spark-shell command, everything was in place and the SqlContext was successfully created.

I am not really sure which step fixed the issue. Was it the JDK and the setting of the JAVA_HOME environment variable? Or was it the update of winutils.exe along with the other DLLs? All this setup was quite time consuming. I hope this is helpful for people trying to set up a standalone instance of Spark on Windows 10 machines.

While I was trying to get Spark up and running, I found the following links, which might be helpful in case you face similar issues.

The last one was really helpful; from there I took the idea of separating winutils.exe into a different folder, and also of installing the JDK and Scala. But setting the Scala environment variables was not required, as I was able to get the Scala prompt without a Scala installation.

Conclusion

Following are the steps I followed for installing a standalone instance of Spark on a Windows 10 64-bit machine:

  • Install the JDK (version 6 or higher)
  • Download the Spark distribution
  • Download the correct version of winutils.exe and its DLLs
  • Set the JAVA_HOME, SPARK_HOME and HADOOP_HOME environment variables

Note: When running the chmod command to set 777 permissions on the tmp/hive directory, make sure to run the command prompt with administrative privileges.
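The steps above can be sketched as a single session; all paths below are examples (where you extract Spark, place winutils, or install the JDK is up to you), not taken from the post:

```shell
REM 1. Install a JDK and point JAVA_HOME at it
setx JAVA_HOME "C:\Program Files\Java\jdk1.7.0_79"

REM 2. Extract the Spark distribution, e.g. to C:\spark
setx SPARK_HOME "C:\spark"

REM 3. Place a 64-bit winutils.exe (with its DLLs) under C:\Winutils\bin
setx HADOOP_HOME "C:\Winutils"

REM 4. From an administrative command prompt, open up the Hive scratch dir
C:\Winutils\bin\winutils.exe chmod 777 \tmp\hive

REM 5. Start the shell (from a new prompt so the variables take effect)
C:\spark\bin\spark-shell
```

Note that setx writes the variables permanently but does not update the current console session, hence the fresh prompt in the last step.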

17 comments:

  1. Hi, many thanks for this post. I am still fighting with it. I have Windows 10 but 32-bit, and I downloaded Spark 1.6 built for Hadoop 2.6.
    I guess my problem is that I am not able to find the right winutils, and I am also starting to think of building my own Spark version. If I make some progress I will share it with you.

    1. You need to ensure that the chmod command is executed from an administrative command prompt. One of my colleagues also found a "hard" dependency on the location of winutils, or rather tmp/hive: it needs to be on C:\. I do not know if this can be bypassed using some configuration settings.

  2. Anonymous, 6:22 PM

    Hi Nilesh,

    Thanks for such a detailed post.

    I've tried these instructions but still struggling to get "spark-shell" to run cleanly. Could you please help me here?


    Here are the details of the folders and env variables created/modified in the process:
    1. Downloaded spark's prebuilt image (spark-1.6.0-bin-hadoop2.4) and placed it in "C:\", so the structure looks like "C:\spark\bin"
    Created an env variable SPARK_HOME with the "C:\spark\" as value.

    2. Downloaded Winutils from the link mentioned in your post and placed (and extracted) in "C:\"

    Now, the env variable HADOOP_HOME was pointing to "C:\Winutils\".

    3. Installed Scala and created SCALA_HOME and made it point to Scala's directory.

    4. I already had a working version of java 8 installed with the required env variables in place.

    I can't seem to figure out where exactly I am going wrong. I'll try going through the Pluralsight course that you've mentioned, but would appreciate it if you could have a look at this and see if these seem familiar?

    Thanks in advance.
    Lalit

    1. Hi Lalit,
      Apologies for the late reply. Did you try changing the verbosity of the log to ERROR? That was helpful for me in pinpointing the issue with access rights on the tmp/hive folder.
      The other thing which could help is to ensure you are running the chmod commands from an administrative command prompt. Hope this helps.

    2. It is not working for me. I did everything you mentioned:

      JDK-> jdk1.8.0_91
      Downloaded Spark distribution->spark-1.6.1 with hadoop 2.6
      Downloaded correct version of Winutils.exe
      Set environment variables for JAVA_HOME, SPARK_HOME & HADOOP_HOME
      I am using Windows 10, 64-bit

    3. Hi Aamir,
      Did you try running the chmod command in administrative mode
      winutils.exe chmod 777 \tmp\hive?
      Please try changing the verbosity of the logger which will help to get more relevant information.
      If these two options don't help to fix the issue, please share the stack trace of the error you are getting after making the two changes.

  3. I don’t know how should I give you thanks! I am totally stunned by your article. You saved my time. Thanks a million for sharing this article.

    1. Glad that you found it helpful.

  4. I was really missing your posts. This was quite a great article to get spark running.

  5. Hi,
    The alternative link you provided for downloading winutils works for me (Windows 10, 64-bit). From what I can see, the winutils on the official links is 32-bit, as I got an error when I tried running the command
    winutils.exe chmod 777 \tmp\hive

    After downloading from your link, that command worked.
    HTH, Garry

  6. Anonymous, 12:20 PM

    Hi, thank you very much for your post. I keep getting a 'the specified path can not be found' error when I try to launch spark-shell. It would suggest that my env. variables are wrongly set, but they seem correct after 20 checks! Any pointers would be appreciated, thanks in advance.

    1. You can try running the spark-shell command from the directory where the Spark bin folder is stored on disk. For the individual environment variables like JAVA_HOME, try opening a command prompt and running the "java" command without any arguments. If the variables are set correctly, you should get the list of options which can be provided with the java command.
      Hope this helps.

  7. Anonymous, 5:22 AM

    Were you able to get standalone Spark to work with multiple workers? I'm trying version 2.0 on Windows Server, and setting SPARK_WORKER_INSTANCES doesn't seem to work.

    1. I did not try running multiple worker instances. What is the error you are getting?

  9. Anonymous, 2:26 PM

    spark-shell is working fine but pyspark is not working. Can anyone tell me what the problem might be?

    1. I have not yet tried the Python integration with Spark. Can you provide some more details about what you have tried so far, and maybe the exact error that you are getting?

      My guess is that it might be related to PATH variables.

      Maybe some answers on StackOverflow might help:
      https://stackoverflow.com/questions/23256536/importing-pyspark-in-python-shell
      https://stackoverflow.com/questions/38798816/pyspark-command-not-recognised

