Sunday, October 01, 2017

Submit Apache Spark job from Command Line to HDInsight cluster


This is the 3rd part of the Step by Step guide to run Apache Spark on HDInsight cluster. The first part covered provisioning the cluster, and the 2nd part covered submitting jobs from the IntelliJ IDE. In this post we will use the Azure CLI to interact with the HDInsight cluster.


Azure CLI

This prerequisite was mentioned in the first part of this series.

Movielens dataset

We will use this public dataset for doing some analysis. You can download the dataset to your computer.

The resource names in this post differ slightly from those used in the earlier posts. Due to some changes to my subscription, I had to recreate some resources while creating the HDInsight cluster. The default container is now named ng-spark-2017 instead of ng-spark-2017-08-18t14-24-10-259z. Similarly, the storage account name has changed from ngstorageaccount to ngsparkstorageaccount.

Use CLI to interact with Azure resources

Assuming that the Azure CLI is installed successfully, we can log in to the Azure subscription by running the following command from our favorite terminal.

az login

You will be presented with a URL and a code. Open the URL in a browser, enter the code, and you should be logged in to the Azure subscription.

We need to upload the jar and the files from the MovieLens dataset to the blob storage account. Requests to the storage account have to be authenticated, and Azure provides two access keys per account for this purpose. We can query these keys using the list keys command.

az storage account keys list \
--resource-group ngresourcegroup \
--account-name ngsparkstorageaccount

list keys

The result of the command returns the keys as shown above. We can now use one of the keys to query the available containers.
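Copying the key by hand gets tedious after a while. As a minimal sketch (not part of the original walkthrough), the key can be captured into a shell variable using the CLI's global --query and --output options. The function name get_storage_key and the variable STORAGE_KEY are illustrative names of my own; the resource group and account names are the ones used in this post.

```shell
# Sketch: fetch the first storage access key instead of copying it from the output.
# Assumes you are already logged in via `az login`.
get_storage_key() {
  local resource_group="$1" account="$2"
  az storage account keys list \
    --resource-group "$resource_group" \
    --account-name "$account" \
    --query "[0].value" \
    --output tsv
}

# Usage (kept as comments because it needs a live subscription):
# STORAGE_KEY="$(get_storage_key ngresourcegroup ngsparkstorageaccount)"
# az storage container list \
#   --account-name ngsparkstorageaccount \
#   --account-key "$STORAGE_KEY"
```

The JMESPath expression [0].value picks the value of the first key in the returned list; [1].value would pick the second one.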

az storage container list \
--account-name ngsparkstorageaccount \
--account-key <<your key here>>

list containers

I tried both keys and the result is the same. As seen in the above screenshot, we have ng-spark-2017 as the container. This is the default container specified at the time of creating the cluster. Now we upload the jar for our application to this blob storage container.

az storage blob upload \
--account-name ngsparkstorageaccount \
--account-key <<your key here>> \
--file target/learning-spark-1.0.jar \
--name learning-spark-1.0.jar \
--container-name ng-spark-2017

upload jar

If you get an error, check the path of the jar file. With this step we have successfully uploaded the jar to the blob storage account named ngsparkstorageaccount, into a container named ng-spark-2017, with the filename learning-spark-1.0.jar. You will find these items highlighted in the above screenshot. The next step is to upload the files from the MovieLens dataset to blob storage.

az storage blob upload-batch \
--destination ng-spark-2017/movielense_dataset \
--source ml-latest \
--account-name ngsparkstorageaccount \
--account-key <<your key here>> 

The command is almost identical to the previous one, with the slight difference that we are using the batch mode (upload-batch) to transfer the files. Also note that instead of the file and name parameters we are using the source & destination parameters. This command will take some time to complete, depending on your internet speed. If you are a bit tired of typing lengthy commands, step back and take a coffee break.
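Once the batch upload finishes, it can be reassuring to list what actually landed in the container. The following is a rough sketch using az storage blob list with its --prefix option. The function name list_blobs_under is my own, and I am assuming the files ended up under the movielense_dataset/ prefix given as the destination above.

```shell
# Sketch: print the names of the blobs stored under a given prefix.
list_blobs_under() {
  local account="$1" key="$2" container="$3" prefix="$4"
  az storage blob list \
    --account-name "$account" \
    --account-key "$key" \
    --container-name "$container" \
    --prefix "$prefix" \
    --query "[].name" \
    --output tsv
}

# Usage (needs a real key, so it is left as a comment):
# list_blobs_under ngsparkstorageaccount "<<your key here>>" ng-spark-2017 movielense_dataset/
```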

Login to Head node

Once the artefacts are transferred to blob storage, we can issue the spark-submit command. We start by logging into the head node of the cluster. Once again, the exact command for connecting to the head node is available via the Azure portal. Navigate to the cluster dashboard and you will find an option to connect using secure shell. Execute the command shown on the portal in the terminal.


You will need to remember the passphrase which was used to generate the RSA key in the 1st part of the series.

ssh prompt

Once the passphrase is validated, you will be connected to the head node of the cluster. The name of the head node is added to the known hosts list, which allows future connections using the same credentials.

ssh success


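At times when you try to connect to the head node using ssh, you might get an error. Run the following command, which removes the stored keys for a host from the known hosts list (the host name of the head node goes after -R)

ssh-keygen -R

Rerun the ssh command specified above and you should be able to log in.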
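For reference, here is a small sketch of how the host name passed to ssh-keygen -R could be derived. It assumes the default HDInsight SSH endpoint naming of the cluster name followed by -ssh.azurehdinsight.net; the portal shows the exact host for your cluster. The function name head_node_host and the cluster name my-spark-cluster are hypothetical.

```shell
# Sketch: build the SSH host name of the head node from the cluster name.
# Assumes the default "<cluster>-ssh.azurehdinsight.net" endpoint.
head_node_host() {
  printf '%s-ssh.azurehdinsight.net\n' "$1"
}

# Usage (my-spark-cluster is a placeholder, not the cluster from this post):
# ssh-keygen -R "$(head_node_host my-spark-cluster)"
```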

Submit spark commands

Let's start by issuing the command which executes a simple Spark program.

spark-submit \
--class com.nileshgule.PairRDDExample \
--master yarn \
--deploy-mode cluster \
--executor-memory 2g \
--name PairRDDExample \
--conf "" \

Note the syntax used for accessing the jar file here. We are using the Windows Azure Storage Blob (WASB) mechanism for accessing the files stored in blob storage. You can refer to the post on understanding WASB and Hadoop storage in Azure for more details.
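To make the WASB syntax concrete, here is a minimal sketch of how such a URL is put together: the container name, then the storage account, then the path of the blob inside the container. The helper name wasb_url is my own. The values are the container, account and jar name used earlier in this post, so the result is what the URL for the uploaded jar would be expected to look like.

```shell
# Sketch: compose a WASB URL as wasb://<container>@<account>.blob.core.windows.net/<path>
wasb_url() {
  local container="$1" account="$2" blob_path="$3"
  printf 'wasb://%s@%s.blob.core.windows.net/%s\n' "$container" "$account" "$blob_path"
}

wasb_url ng-spark-2017 ngsparkstorageaccount learning-spark-1.0.jar
# prints wasb://ng-spark-2017@ngsparkstorageaccount.blob.core.windows.net/learning-spark-1.0.jar
```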

spark submit

The program runs successfully. It uses an in-memory collection and does not really interact with the Hadoop resources. Let's trigger another Spark job which actually reads the files from the MovieLens dataset.

spark-submit \
--packages com.databricks:spark-csv_2.10:1.5.0 \
--class com.nileshgule.movielens.UserAnalysis \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--executor-memory 2G \
--executor-cores 6 \
--name UserAnalysis \
--conf "" \
wasb:// \
wasb:// \

user analysis results

The program takes some time to finish. I am referring to ratings.csv and movies.csv as the input files. In this example we can see that, using WASB, we are able to access files in Azure blob storage like normal files.

Verify job execution

There are multiple ways to verify the output using the options available from the HDInsight cluster dashboard. We will use one of them this time around. The cluster dashboard link takes us to a blade with 5 options as shown below.

portal cluster dashboard

Spark history server and Yarn are easiest. We look at the Spark history server link.

spark history server

This shows the status of recent jobs. You can drill down into the details of each job, which takes us to the Spark history UI. The Spark history UI provides rich information about different aspects of the internals of Spark. Those details are well documented and are outside the scope of this blog post.


As we saw during the course of this post, it is quite easy to use command line tools to connect to the HDInsight cluster and submit Spark jobs from the head node. We also had the chance to scratch the surface of the Azure CLI when uploading files from the local machine to blob storage. I have tried to run Spark workloads on other cloud platforms. Mind you, it is not so easy. Microsoft being Microsoft makes it very easy for developers to use their products. I hope the 3 posts in this series have been helpful. Until next time, Happy Programming.

Monday, September 11, 2017

Submit Apache Spark job from IntelliJ to HDInsight cluster


This is the 2nd part of the Step by Step guide to run Apache Spark on HDInsight cluster. In the first part we saw how to provision the HDInsight Spark cluster with Spark 1.6.3 on Azure. In this post we will see how to use the IntelliJ IDEA IDE to submit the Spark job.


IntelliJ IDEA installed on your machine. I am using the Community edition, but the same works with the Ultimate edition as well.

Azure Toolkit for IntelliJ: this is an IntelliJ plugin which provides different functionalities that help to create, test and deploy Azure applications using IntelliJ. If you are an Eclipse user, there is a similar plugin named Azure Toolkit for Eclipse. I have not used it personally. I prefer to work with IntelliJ as it makes me feel more at home, since I come from a C# background, having used Visual Studio for more than a decade and a half.

Steps to submit Spark job using IntelliJ

1. Login to Azure account

If the plugin is installed correctly, you should see an option to sign in to Azure as shown below

Azure toolkit

Select Azure Sign in option to login to the Azure account. This will bring up the Azure Sign in dialog.

2. Sign in to Azure in Interactive mode

Interactive login

Use the default option of Interactive and click on Sign in. I have not yet tested the automated authentication method.

Azure login dialog

Provide the details of the Windows Live account which has access to the Azure subscription. If the login is successful, your subscription details will be pulled into the IDE as shown below

Select Subscription

As I have only one subscription, I can click on the Select button to access the different services associated with the subscription.

3. Explore Azure resources (Optional)

We can verify that the different resources available under this subscription can be accessed using the Azure Explorer. Navigate to Azure Explorer from the Tools menu by selecting the Tool Windows option.

Azure explorer menu

Selecting this option brings up the Azure Explorer Sidebar. We can access things like Container Registry, Docker hosts, HDInsight cluster, Redis Caches, Storage Accounts, Virtual Machines and Web Apps.

Azure explorer sidebar

In the above screenshot we can see the storage account named ngstorageaccount associated with my Azure subscription.

4. Submit Spark job

Let's move on to the most important part, which is submitting the Spark job to the cluster. From the sidebar, navigate to the Projects pane. Select the file containing the main method. In fact any file would do, as we need to specify the details in the dialog box that appears. Right click and select Submit Spark Application to HDInsight from the context menu. The option is available right at the bottom of the context menu.

right click submit

This brings up a dialog box where we can specify the details related to the job. Select the appropriate options from the drop down list. I have selected the Spark cluster, location of the jar file and name of the Main class as com.nileshgule.MapToDoubleExample. Note that the fully qualified name needs to be supplied here.

submit spark job

We can provide the runtime parameters like driverMemory, driverCores etc. I chose to go with default values. The widget provides us options to pass command line arguments, reference jars and any reference files if required.

Once we submit the job, the Spark Submission pane loads up at the bottom of the screen and reports the progress of job execution. Assuming everything goes fine, you should see output which says that The Spark application completed successfully. The output will also contain additional information with a link to the YARN UI and also the detailed job log copied from the cluster to a local directory.

Spark Submission pane

5. Verify job execution

In order to verify that the job was successfully executed on the cluster, we can click the link to the YARN UI. This will bring up a login prompt as shown below.

admin password

Note that this is using the SSL port 443. Provide the credentials which were specified when we created the admin user for the cluster in the provisioning step. Hope you remember the password that was supplied at the time of creating the cluster. Once the admin user is authenticated, you will be presented with the YARN application container details.

Yarn application container

This page gives various details like the status reported by the application master, a link to the Spark history server via the tracking URL, a link to the logs and many other details. You can get more details about the job by navigating to the different links provided on this page.


We saw that it is quite easy to trigger a Spark job from the IntelliJ IDE directly onto the HDInsight cluster. I have demonstrated very few capabilities of the Azure plugin in this post. Feel free to explore other options. In the next part we will see how we can submit Spark jobs from the head node of the cluster using the command line interface. Hope you found this information useful. Until next time, Happy Programming.

Friday, September 08, 2017

Step by Step guide to run Apache Spark on HDInsight cluster


Recently I have been experimenting a bit with cloud technologies including Amazon Web Services (AWS) and also Microsoft Azure. I have an MSDN subscription which entitles me to free credits to utilize Azure resources. I was trying to run some big data workloads on Microsoft Azure’s offering in the Big Data space called HDInsight. HDInsight offers multiple types of clusters including Hadoop, HBase, Storm, Spark, R Server, Kafka and Interactive Hive. As of this writing, Kafka & Hive clusters are in preview state. I decided to try the Spark cluster as I am currently exploring different features of Spark. This post is about the different steps required to run Spark jobs using an HDInsight cluster. I will start with provisioning the HDInsight cluster, and in upcoming posts extend it to show executing Spark jobs from within the IntelliJ IDE and also from the Azure command line.

Pre-requisites for running the code examples

If you wish to have a look at the source code, clone or download the repo learning spark from my GitHub repository.
As mentioned in the file of the repo, following dependencies must be satisfied in order to build the project

  • Java 8 is installed on your laptop / PC
  • JAVA_HOME environment variable is set correctly
  • Apache Spark 1.6.3 with Hadoop 2.6 is installed
  • SPARK_HOME environment variable is set correctly
  • Maven is installed
  • It is also good to have an IDE. I prefer to use the IntelliJ IDEA community edition. You can use another IDE like Eclipse or Visual Studio Code. As we will see later in the post, IntelliJ or Eclipse will give us the benefit of running code on the Azure cloud directly from the IDE using a plugin. If you prefer to use VS Code, I will also demonstrate how to run Spark using the Azure Command Line Interface (CLI).

Additional dependencies

MovieLens dataset : MovieLens is a publicly available sample dataset for performing analytics queries and for big data processing.
Azure Subscription : Apart from building the project from source code, you also need to have Microsoft Azure subscription in order to submit the jobs to remote cluster. I assume all the pre-requisites mentioned here are fulfilled.

Build the project

The codebase contains some basic Spark programs like WordCount and examples of PairRDD, MapToDouble, caching etc. There are also programs using the MovieLens dataset which are meant more for running in the cluster scenario. Open the project in your favorite IDE and build it. I prefer to do it from the terminal, using the mvn clean package command in the root of the source code directory. It might take some time to download all the dependencies from various repositories if this is the first time you are running Maven. If all the environment variables were set correctly, you should have a successful build and the package should be created under the target folder.

Run Spark locally

maven build output

We can test the Spark program by running it locally using the command

spark-submit --class com.nileshgule.MapToDoubleExample \
--master local \
--deploy-mode client \
--executor-memory 2g \
--name MapToDouble \
--conf "" \

Note that we are specifying the master as local in the above execution. We are also setting the limit for executor memory to 2 GB. Once again, if everything was set up correctly, you should see output similar to the screenshot below

spark output local

Provision HDInsight cluster

There are multiple steps required in order to run Spark on an HDInsight cluster. Each of these steps can take some time. First of all we need to provision the cluster resources. This can be done in two ways. The easiest is to log in to the Azure web portal. The alternative is to use the Azure CLI. As of this writing, Azure CLI 2.0 does not support provisioning an HDInsight cluster. You can use the older version. I had version 2.0 installed, so I was forced to use the web portal method. Refer to the Azure documentation for details on provisioning different types of HDInsight clusters. I will be using the ssh based approach to connect to the head node in the cluster. Before we provision the cluster, I need to generate an RSA key pair. On my Mac I can generate the key by executing the command

ssh-keygen -t rsa

Provide the file location (a default will be prompted, and you can keep it as is) and the passphrase. Remember the passphrase as it will be required later. With this prerequisite done, log in to the Azure portal with your credentials. From the dashboard, navigate to the Add new resource screen and click on the Data + Analytics section. HDInsight is the first option on the right. For quick access you can also search for HDInsight directly in the search bar.

search HD Insight

1 - Basic settings

basic config spark version

The first thing we need to ensure is that the cluster name is unique. In my case I am using the MSDN subscription. If you have multiple subscriptions, you will need to choose the one for billing purposes. Next we need to select the type of cluster. We select the cluster type as Spark. The version is dependent on the type of cluster. In the case of a Spark cluster we need to select the appropriate Spark version. I chose Spark 1.6.3 as that is the one I am currently experimenting with. You can choose one of the other available versions if you wish to.

Next we need to provide credentials for logging into the cluster. We need an admin account and also the sshuser account which can enable us to submit the Spark jobs remotely. Provide the admin user password. Make sure to uncheck the use same password as cluster login checkbox. Instead we will use the public key we generated as the key by either selecting the file or pasting the contents of the file. When I created the cluster using the same password as admin user for the sshuser, I was unable to login to the head node.

In the SSH authentication type, select the PUBLIC KEY option. You can select the file or paste the contents of the .pub file from the location where you saved it when the RSA key was generated in the earlier step. In my case the file is stored in my home directory at ~/.ssh/

basic configuration

The penultimate step is to specify the resource group. I chose to reuse an existing one. You can choose one of the existing resource groups or create a new one. The final step is to choose the location or region where the cluster will be created. In my case it was prefilled with East Asia.

2 - Storage

In this step I need to provide the storage details including Storage Account Settings and Meta Store Settings. The metastore settings are optional. I selected an existing storage account named ngstorageaccount and a default container named ng-spark-2017-08-18t14-24-10-259z. Ideally this should have a meaningful name. I ended up reusing the default container name that Azure created for me the first time. I do not need the additional storage accounts and Data Lake store access, so I leave them blank. For the moment I do not wish to persist the metadata outside of the cluster. In a production scenario, it might be a good idea to store metadata outside of the cluster.

storage options

3 - Summary

Summary blade provides the summary of our choices made so far. It gives us last opportunity to customize the settings before Azure takes over and provisions the cluster resources for us.


I will make a slight change to the cluster nodes configuration by editing the cluster size. The default number of worker nodes is 4. I don’t intend to run heavy workloads at the moment. So I reduced it down to 2 worker nodes. Also the hardware configuration of each worker node is D4 V2 type. I changed it to a scaled down version with D12 V2 type.

resize head node

I did not modify any of the advanced settings and clicked the Create button after a final review of all the settings. It will take 15 to 20 minutes to provision all the resources of the cluster. With this setup we are ready to run some Spark jobs on the cluster. If everything goes fine, you should see an HDInsight cluster created as shown below

available cluster

In part 2 of this series, I will demonstrate how to use the IntelliJ IDE to submit a Spark job to the Azure HDInsight cluster. Part 3 will focus on running Spark jobs using the Azure CLI. Till then, happy programming.

Friday, August 04, 2017

How Travis CI saved my time?


Some time back I created an Ansible playbook to install software and set up my MacBook Pro. I put the code for this on GitHub. I wanted the ability to keep running this playbook without really having to test it on my Mac every time. This is when I came across Travis CI. It is a continuous integration server which helps to test and deploy code with confidence. The best part is that it is completely free for open source projects. I set up the build to run whenever there is any change to my GitHub repository. There is also a daily build which runs as a cron job. Setting up the CI build is very simple and it integrates nicely with GitHub.

The breaking build

The builds had been running successfully for the past few weeks. Recently there was a failure reported by Travis CI after executing the automatically scheduled build. I had not changed anything in the code for a few days, so it was surprising to see a broken build. There are cases when the Travis CI build fails due to timeout issues. These occur when the playbook tries to install Java or the IntelliJ IDE. These are known issues and they get resolved when the build is triggered again. I thought it was one of those timeout issues and restarted the build. Unfortunately it did not resolve the build failure.

The build failure was caused by one of the tasks within the Ansible playbook. The task tries to install a Visual Studio Code extension named vscode-yaml-validation. Below is the screenshot of the Travis CI build output log

Travis CI

This particular extension has been unpublished from the VS Code Marketplace. When you go to the Marketplace, you get an info message which says “This extension is now unpublished from Marketplace. You can choose to uninstall it”. If I did not have the Travis CI build running, it would have been difficult to identify that this extension is no longer available in marketplace.

When the Ansible playbook runs, it checks whether the extension is already installed and installs it only if it is missing. The step would therefore be skipped on my Mac, as I already have the extension installed. This is where Travis CI shines. It allows me to run the setup on a fresh Mac every time the build is triggered. It is as if the setup playbook is being run for the first time.
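The playbook task itself is not shown in this post, so here is only a rough shell equivalent of the "install if missing" logic, using the code --list-extensions and code --install-extension options of the VS Code command line. The function name ensure_extension and the extension id in the usage comment are hypothetical.

```shell
# Sketch: install a VS Code extension only when it is not already listed.
# This mirrors the idempotent check described above; it is not the playbook's code.
ensure_extension() {
  local extension_id="$1"
  if code --list-extensions | grep -qixF "$extension_id"; then
    echo "skipped: $extension_id already installed"
  else
    code --install-extension "$extension_id"
  fi
}

# Usage (some.extension-id is a placeholder):
# ensure_extension some.extension-id
```

On a machine that already has the extension, the install branch never runs, which is exactly why an unpublished extension goes unnoticed there but fails on a clean build machine.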

The fix is quite simple for this unpublished extension. I need to remove it from the list of my VS code extensions in the Ansible setup. This should be available shortly on GitHub repo.


I have found Travis CI very useful. It allows me to test the Ansible playbook on remote infrastructure. I do not have to test the changes on my main Mac. It also acts as an automatic regression tool. One of my favorite features of Travis CI is its ability to support multiple operating systems and multiple versions of dependencies. For one of the repositories where I contribute to an Ansible role, we are targeting two versions of Ansible, 2.0.2 and 2.3.0. We can parameterize the build to run with each of those versions of Ansible. The builds run in parallel, which can greatly reduce the amount of time required to find any breaking changes due to recent commits. This allows us to build a matrix of different operating systems and specific versions of libraries or dependencies. Below is a screenshot of the multiple versions of Ansible that I talked about

Travis multi version

I intend to enable continuous integration for all my active projects hosted on GitHub. Since it is free, I would also recommend that you make good use of it, especially if you have projects on GitHub. I am not aware of any other free hosting environment which allows you to provision a machine with Mac OS installed on it. For me personally, that is one of the biggest distinguishing factors in choosing Travis CI.


Saturday, July 15, 2017

My developer toolkit 2017 (Mac)

Back in December 2010 I blogged about the powertools I was using with Windows. Soon I will do a revamp of the Windows powertools which I am using on my Windows 10 PC. This post is about the list of tools that I use for my day-to-day activities on my Mac.

Terminal Utilities

Compared to the default terminal, I prefer to use iTerm2. iTerm2 also integrates with Oh My Zsh, which makes life a lot easier while working with the terminal. Refer to the Oh My Zsh cheatsheet for more details. I love all the aliases related to Git, like gst for git status, gcmsg for git commit -m and many other git commands. I use the Avit theme with Oh My Zsh, which gives a nice look and feel to the different commands and their outputs on the terminal.

While working with tabbed windows on the terminal it can be quite confusing and hard to remember what you were doing with which terminal window. Tabset helps to name the tabs and also give different colors to them. Display of the tab name on the right hand corner is quite helpful for me.

Code Editors

The more I use VS Code, the more I like its features. It is very elegant and has a nice ecosystem of themes and plugins for enhancing its capabilities. No wonder more and more people (even those who hated Microsoft) are using VS Code. If you don’t want the full featured IDE of Visual Studio 2015 / 2017 but still need the better part of code editing, go for VS Code.

Before I started using VS Code, I was a fan of Atom. I like its simplicity. It also integrates very well with GitHub (it is actually created by GitHub). They call it the hackable text editor for the 21st century. Many people complain about the slowness of the Atom editor. I have not faced any issues so far. Maybe the files I was dealing with were within the bounds of Atom.

When I moved from Windows 10 to Mac, I started using Sublime Text 3. I find it similar to Atom in many ways. Sublime is the oldest editor among VS Code, Atom & Sublime. As a result it has more features, themes and plugins.

When working with Java, I use IntelliJ IDEA from JetBrains. It is one of the best IDEs I have come across (obviously after Visual Studio). Coming from the .Net world, I find IntelliJ much easier to adapt to than Eclipse. The dark theme of IntelliJ makes me feel at home.

One of the plugins which I find very helpful in IntelliJ is the Key Promoter. It tells you how many times you have used the mouse when there is a keyboard shortcut available for a command. I feel this is really needed for all developers who want to get better at keyboard shortcuts.

Although I prefer to work with the terminal while using Git & GitHub repositories, I find the GitHub Desktop tool handy when I want to do some GUI related work.

All the 3 text editors seem to have quite a few things in common. Especially plugins and themes are mostly ported from one editor to the other. I like the Material Theme and Monokai. Best part is all 3 editors are cross platform and work with Windows and Mac. That definitely reduces the learning curve.

Virtualization software

Docker allows you to spin up lightweight containers as compared to full blown virtual images. If you don’t want to mess around with your laptop but want to try out some new tool, Docker is a good way to test it.

Not all things can be done via Docker containers. Sometimes you still need to use a virtual image. I tend to use VirtualBox for such cases.

Vagrant allows you to configure virtual machines in a declarative manner. It uses VirtualBox as the provider for creating virtual machines.

General utilities

I use Ansible to automate the installation of software as much as possible on my MacBook Pro. Refer to my post on Setup MacBook almost at the speed of light to know about it in more detail.

This is my favorite note-taking app. It works across all my devices including iPhone, iPad Pro, MacBook and Windows PC. I like the simplicity of the tool. It resembles a physical notebook. The organization of notes into workbooks and pages is something I like very much when it comes to note taking. I tried other alternatives like Evernote.

I use KeePass on Windows 10. There is a nice port of it available on Mac called MacPass. Use it if you want to store all your passwords in one place.

Todoist is one of my favourite task organizers. It works cross platform and has support for mobile devices as well. It is very simple to use and has a minimalistic UI. I have tried others like Evernote but prefer Todoist for its simplicity.

Dropbox is my preferred way of syncing documents across devices. I also use Google Drive and Microsoft OneDrive for different documents.

On Windows, I am a big fan of Open Live Writer for writing blog posts. Unfortunately it works only with Windows. On Mac I found Blogo. Although it is not as feature rich as Live Writer, it serves the purpose.

I mostly read ebooks in PDF format. Acrobat Reader allows syncing of ebooks across devices using Adobe Cloud.

CheatSheet is one of the best free utilities I have ever come across. It displays all the keyboard shortcuts for the application that you are currently running. You don’t need to remember each and every shortcut. Just hold the hotkey for CheatSheet (by default the command key) and you will see all the relevant keyboard shortcuts. Nowadays I have started holding the Alt key on the Windows keyboard, hoping to see keyboard shortcuts when I work on Windows 10 :)

Spectacle is another nice little utility which allows you to resize & position the application windows with keyboard shortcuts. I also have a secondary monitor attached to my laptop. Spectacle is very helpful in moving windows across screens. Even if you don’t have multiple screens, you can still use spectacle to great effect to resize and position the windows.

f.lux is a utility which works with both Windows and Mac. It automatically adjusts the color temperature of the screen based on the time of the day.

Battery Related utilities

Has a very simplistic UI. Provides notifications on the battery levels. I find it useful to be notified when the battery is fully charged.

There are a couple of other battery related apps that I am currently testing before picking the one for my needs. These include coconutBattery, Battery Health and Battery Guardian.


The tools that we use keep changing every year. I am sure there are many more tools and utilities out there which would help to make our life simpler and make it easier to work with machines. I would be interested in knowing about such tools.

Tuesday, July 11, 2017

The ‘Yes’ command


Recently I was working on a personal project involving GitHub and Travis CI. This short post is about my experience of hacking some of the options with Travis CI. To start with, Travis CI gives free access to all public repositories on GitHub to run continuous integration. I used this to set up a CI build for my Mac Dev Setup project. Once again, the initial work on the Travis build definition was done by Jeff Geerling. I am extending his good work to incorporate my changes.

One of the best parts of Travis CI is the option to choose Mac OS to run the CI build. You can refer to the link for more details. I started off with the xcode7.3 version of the image. This is the default image at the time of writing this blog. The build was working fine with this version of the image. So I thought of upgrading to the next OSX image version, labelled xcode8. This build was successful without any changes.

Problem with Homebrew uninstallation

I thought that it was quite easy to just change the image version and the builds would work without any problem. Unfortunately not. I skipped the xcode8.1 & xcode8.2 versions and tried to jump to xcode8.3 directly. The build failed with a timeout. Looking at the build log, I found that the build was waiting for confirmation of the removal of the Homebrew package and was expecting an input to the prompt in the form of y/N. Look at line number 391 in the below screenshot.

So I thought of downgrading the image version to 8.2. It was still the same. Hmm. Something had changed between the 8 version and the others. So I reverted back to the least version after 8 which was 8.1. As part of the initial setup, I was uninstalling the Homebrew package and from 8.1 image version, the installer expects a confirmation. Not sure why it doesn’t do it in the earlier versions.

I was running a script by downloading it and running a Ruby command as
ruby -e "$(curl -fsSL"

Ruby does not provide an option to pass ‘Y’ as the default answer to any of the prompts. At least I did not find such an option using my Google search skills. I started looking for ways to invoke the command silently and pass ‘Y’ as the default answer to the prompt.

‘Yes’ command to the rescue

There were multiple solutions available online. But I liked the one provided by a strangely named command called yes. It can supply y or n as input to any prompt. The pipelining of commands and utilities in Unix / Linux based systems helped here, as I could pipe the yes command into the ruby script which I was using. The final command looks like
yes | ruby -e "$(curl -fsSL"

Note that the default answer from the yes command is y. If you wish to answer n, you can change the syntax to yes n :). It is quite ironic to use the yes command to reply no. But that’s how the author of this command designed it. They could have created a complementary no command which would respond with n. I even did a Google search to check if there is such a command. Unfortunately it does not exist.

With the help of the wonderful yes command, my build is running fine now. I don’t know if there is a better way of supplying an answer to the prompt on Travis CI. If you know of one, let me know via the comments.
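
Here is a small sketch of how yes behaves when piped into something that asks for confirmation. The confirm_removal function is a hypothetical stand-in for the real uninstall script, which I am not reproducing here; it only assumes that the script reads a single line of input.

```shell
# yes prints "y" (or its argument) on repeat until the reader closes the pipe.
yes | head -n 3      # prints y three times
yes n | head -n 2    # prints n twice

# Hypothetical stand-in for an installer that asks for confirmation.
confirm_removal() {
  printf 'Remove package? [y/N] '
  read -r answer
  case "$answer" in
    y|Y) echo "removed" ;;
    *)   echo "kept" ;;
  esac
}

yes | confirm_removal      # answers y: prints "Remove package? [y/N] removed"
yes n | confirm_removal    # answers n: prints "Remove package? [y/N] kept"
```

Because yes keeps printing until the other side stops reading, it also answers scripts that prompt more than once.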

Monday, July 03, 2017

Setup MacBook almost at the speed of light


I bought a new MacBook recently. It is always fascinating to set up a new machine. But it is also a pain to look up all the tools that you have on your old machine and port them to the new one. Some time back I started learning about Ansible, which helps to automate routine tasks. I came across a blog by Jeff Geerling, who is the author of the book Ansible for DevOps. Jeff and many others had used Ansible to set up their machines. I took inspiration from their blogs to automate the process of setting up a new MacBook Pro. Here is my experience.

Why Ansible?

Ansible is very easy to understand. It uses human-readable YAML syntax to describe the different actions which need to be performed. The Ansible actions executed as part of a playbook are idempotent: running them again has no side effects on a machine that is already set up. The same playbook can be run multiple times, and only the changes are applied incrementally.
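
The idea of idempotency can be sketched with a small shell function. This is only an illustration of the concept and not how Ansible works internally; ensure_line, the temporary file and the sample line are hypothetical names chosen for the example.

```shell
# ensure_line FILE LINE: append LINE to FILE only if it is not already there.
# Reports "changed" or "ok", loosely mirroring how Ansible summarizes a task.
ensure_line() {
  file=$1
  line=$2
  if [ -f "$file" ] && grep -qxF -- "$line" "$file"; then
    echo "ok"
  else
    printf '%s\n' "$line" >> "$file"
    echo "changed"
  fi
}

demo_file=$(mktemp)                              # throwaway file for the demo
ensure_line "$demo_file" 'export EDITOR=atom'    # first run: changed
ensure_line "$demo_file" 'export EDITOR=atom'    # second run: ok
```

The second call leaves the file untouched, which is the property that makes it safe to run the same playbook again and again.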

How did I use Ansible?

I started off by cloning the Git repository of Geerlingguy, which is a good starting point. Jeff Geerling has done a very good job of laying out the framework for an initial set of tools. Jeff used Homebrew as the package manager for installing packages. For the UI applications, Homebrew Cask is used. I added some of the applications which did not exist in Jeff Geerling's original repo.

It is very easy to get started. The repo follows the best practices from the Ansible world and organizes the different topics into the structure shown in the image below.

Let's start by looking at some of the important files & folders from this repo.

The files directory contains additional files required for configuring specific tools. Jeff Geerling had custom options / configurations for Sublime and the terminal. I added the dotfile for Oh My Zsh. We will talk about Oh My Zsh a bit later in this post.

The roles directory contains the Ansible Roles required for executing different tasks as part of the playbook. Here again, Jeff Geerling's original repo had roles for managing dot files, home-brew, mas and command line tools. I added the role for managing Atom packages.

The tasks folder contains the list of tasks or actions which need to be performed during installation. These are organized into multiple files like ansible-setup, extra-packages-setup etc. I added a file for oh-my-zsh-setup.

The important files in the complete structure are default.config.yml and main.yml. The main.yml file is the glue that binds all the things together. It is like the main program in a programming language like C# or Java. It contains references to the runtime variables, the roles used and the order in which the tasks need to be executed.
The default.config.yml file contains all the variables used by the tasks. It contains the list of tools & applications to be installed or uninstalled as part of the playbook. One of the advantages of this approach is that the installed applications get moved to the Applications folder as part of the tasks. If we install the applications manually, sometimes we need to move them from Downloads or another folder to Applications.

Apart from the applications themselves, I also needed some additional libraries / tools. There were some packages which I was not using, so I deleted them. Below are some of the additions / enhancements I made to meet my needs.

I made 2 major modifications to Jeff Geerling's repo. I added the automatic configuration of Oh My Zsh and also of Atom plugins & themes. The steps below were needed to make these modifications.

Setup Oh My Zsh

I like to use Oh My Zsh as it enhances the default terminal with a better experience. It uses zsh as an alternative to the default shell. Oh My Zsh is a community driven framework for managing zsh configurations. It has lots of Themes & Plugins and makes working on the terminal a really enjoyable experience. A bit of Google searching brought me to the GitHub repo for setting up Oh My Zsh by Remy Van Elst. I copied the file into the files directory. In the same way, I added the oh-my-zsh-setup.yml file to the tasks directory. The last step was to add an include statement to the main.yml file so that oh-my-zsh-setup.yml is part of the tasks definition.

Setup Atom plugins

Over the last few months, I had been using Atom as my text editor. I used multiple Atom plugins and Themes. Atom has very good support for installing plugins using command line options. I especially like the Material UI theme, which is supported by multiple editors including Atom & Sublime. I really like the minimalistic design of the Atom editor.
It would be nice to have these plugins installed as part of the machine setup as well. Fortunately there is an Ansible role by Hiroaki Nakamura which allows exactly this. You provide a list of Atom Themes & Plugins to this role and your machine will have all of them installed using Ansible. This is awesome. No need to go & search for plugins in the Atom UI. After the initial set of plugins, I have used the playbook to add new ones effortlessly.

To use the role, I added the role definition to the requirements.yml file. This file contains the list of roles which need to be downloaded. As a pre-condition, all the roles listed here are downloaded before running any tasks. The hnakamur.atom-packages role expects a variable named atom_packages_packages. It makes no distinction between Themes & Plugins, so I listed all the Atom plugins & themes here. The last step was to include this role in the main.yml file.

Setup Visual Studio Code plugins & Themes

I have just started using the Visual Studio Code editor. I was able to install VS Code using the default apps method from Jeff Geerling's playbook task. Similar to Atom or Sublime, VS Code has rich support for Plugins & Themes. I found an Ansible Role by Gantsign named ansible-role-visual-studio-code. Looking at the readme file, it seems to be made for Ubuntu. The role also installs the VS Code editor. In my case I already had the editor installed using Homebrew Cask. I needed just the ability to install the plugins & themes.
From the code available within the repo, I found the part which is required to install a VS Code extension. The above role does a good job of installing VS Code & extensions for multiple users of the system. Mine is a single user laptop & I did not need such functionality.

I ended up creating a file in the tasks folder named visual-studio-code-extensions-setup.yml. This file contains only one task, which installs the extensions. The task wraps the command “code --install-extension extensionName”. The extension name is a placeholder in the above command and needs to be built dynamically. The default.config.yml defines a list of extensions in a variable named visual_studio_code_extensions. The extension name uses a specific format and it took me some time to get the hang of it. If we install the extension using the VS Code IDE, it works perfectly fine with just the extension name. But when we try to install the same extension using the command line, we need to prefix the publisher name. For example, the csharp extension is published by Microsoft, so we need to provide the fully qualified name ms-vscode.csharp.

The list of extensions can be specified as

- steoates.autoimport
- PeterJausovec.vscode-docker
- ms-vscode.csharp

But this looks very clumsy. This is where the simplicity and flexibility of Ansible & YAML can be beneficial. We can define custom lists or dictionaries which can split the publisher & the extension name. I used this approach to define the extensions as
- extensionName: autoimport
  publisher: steoates
- extensionName: vscode-docker
  publisher: PeterJausovec
- extensionName: csharp
  publisher: ms-vscode

The tasks file then concatenates the publisher & the extension.
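
The concatenation can be sketched in plain shell. This is not the actual Ansible task from my tasks file, just an illustration of how each publisher and extension name pair from the list above turns into the fully qualified name; install_command is a hypothetical helper that only prints the command instead of running it.

```shell
# install_command PUBLISHER EXTENSION_NAME: build the command line for one extension.
install_command() {
  printf 'code --install-extension %s.%s\n' "$1" "$2"
}

# Publisher and extension name pairs from the list above.
while read -r publisher extension_name; do
  install_command "$publisher" "$extension_name"
done <<'EOF'
steoates autoimport
PeterJausovec vscode-docker
ms-vscode csharp
EOF
```

Printing the commands first makes it easy to check the generated names before anything is installed.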

Next steps

There are still some manual steps involved in setting up the machine. As of now, I have not found a way to use Ansible to install applications from the Apple App Store. I had to manually install one app, Blogo, which I used for writing this blog post. I am still looking for ways to automate this. There might be a way to invoke a shell command from Ansible which would allow installing App Store apps. I have not tried it so far. A better way, in my opinion, would be to have an Ansible Role which can take a list of apps to be installed and silently install them.

[Update]: I received a tweet from Jeff Geerling saying that there is a mas role defined within his playbook which can be used to install apps from the App Store by specifying the email & password linked to the Apple ID account. I will try this approach and update the contents accordingly.


At the end of the exercise, my laptop was set up and looked like below

All the applications that you see, as well as additional ones which are not visible in this screen (like VS Code), were installed using the Ansible playbook. It took just one single command to get these apps installed. The setup can be replicated on any other MacBook with minimal changes. Automating the installation steps has saved me a lot of time to do useful stuff (like writing this blog :)). You can also set up your Mac using similar steps. If you wish to do so, refer to the readme file available at my GitHub repo. I intend to keep updating this repo with the changes that I make to my dev environment. Feel free to contribute to this repo.
