Background

I have been compiling step-by-step documentation, using help guides, blog posts and insights from previous exercises. Now that Spark 2.0 is out, I figured it was a good opportunity to update my 1.6 documentation and to make it available to others.

The plan is to leverage a feature in AWS that allows you to replicate an existing server setup: we will fully configure a Master server, take an image of it, and use that image to deploy the workers.

AWS/EC2 Setup

Key and Connection setup from laptop

Next, you need to add a passphrase to the private key (the “Key Pair” file) downloaded from AWS.

For ease, we’ll use spark as the Key Passphrase

AWS provides the step-by-step instructions for this part:
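
If you are connecting from macOS or Linux rather than PuTTY, here is a minimal equivalent sketch, assuming the key pair was saved as spark.pem and the AMI’s default ubuntu user:

# Restrict permissions so ssh will accept the key
chmod 400 spark.pem
# Add the passphrase to the existing private key (enter "spark" when prompted)
ssh-keygen -p -f spark.pem
# Open a terminal session to the server
ssh -i spark.pem ubuntu@MY_PUBLIC_DNS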

Master server configuration

Terminal session

  • Start a new terminal session using the instructions in the Key and Connection setup from laptop section.

  • The initial passphrase you need to enter when prompted with “Passphrase for key "imported-openssh-key":” is: spark

Install Java

Tip: The terminal commands inside the boxes can be copied and pasted into your terminal session. In PuTTY, right-click acts as the “paste” command.

The latest version of Java needs to be installed

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
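
To confirm the install, an optional check; this should report a 1.8 build:

java -version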

Install Scala

Spark is based on Scala, so we need to install it early in the process

cd /home/ubuntu/
sudo wget http://www.scala-lang.org/files/archive/scala-2.11.8.tgz
sudo mkdir /usr/local/src/scala
sudo tar -xvf scala-2.11.8.tgz -C /usr/local/src/scala

We’ll also need to tell Ubuntu where we installed Scala. To do this, we update a configuration file called ‘.bashrc’

For editing, we’ll use an application called ‘vi’. This application is a very pared-down text editor. (If you would rather not edit interactively, a shell alternative is shown after these steps.)

  • Open the bashrc file

    vi .bashrc
  • Use the arrow keys to go to the bottom of the file

  • Press the {Insert} key

  • Type:

    export SCALA_HOME=/usr/local/src/scala/scala-2.11.8
    export PATH=$SCALA_HOME/bin:$PATH
  • Press {Esc}

  • Type :wq and Enter (to save and close)

  • Ask Ubuntu to read the new bashrc file

    . .bashrc
  • Verify that the new Scala version is recognized. The command below should return something like Scala code runner version 2.11.8 …

    scala -version
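
If you prefer to skip vi, the same two lines can be appended from the shell instead; a minimal equivalent of the steps above:

echo 'export SCALA_HOME=/usr/local/src/scala/scala-2.11.8' >> ~/.bashrc
echo 'export PATH=$SCALA_HOME/bin:$PATH' >> ~/.bashrc
. ~/.bashrc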

Install Spark

First, we will install git

sudo apt-get install git

Now, we’ll start by updating our Ubuntu server and then download the latest Spark installation files

sudo apt-get update
sudo apt-get upgrade
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0.tgz
tar -xvzf spark-2.0.0.tgz
cd spark-2.0.0

Building Spark to support ‘sparklyr’

We have to build Spark on the server; the build process changed going from Spark 1.6 to 2.0. Spark recommends using a tool called ‘Maven’ to build Spark, but I’ve been more successful using ‘SBT’

To enable ‘hive’ support, we add the two options at the end; this is different from the original SparkR installation.

sudo build/sbt package -Phive -Phive-thriftserver

Side comment - In my original notes I have a note that says ‘Go watch paint dry’, since it took the VM on my laptop 78 minutes to complete. In AWS, using the m4.large instance type, it should take around 4 minutes.
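
Once the build finishes, an optional sanity check from inside the spark-2.0.0 directory; this just prints the Spark version banner:

bin/spark-submit --version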

Install R

We need to update the package list so that Ubuntu installs the latest R version. Here is the reference: https://cran.r-project.org/bin/linux/ubuntu/

sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base
sudo apt-get install gdebi-core
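
A quick optional check that the CRAN version of R was picked up:

R --version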

Start the Master

  • We’ll start the master service and close our terminal session

    sudo spark-2.0.0/sbin/start-master.sh
    exit
  • Navigate to http://MY_PUBLIC_DNS:8080

  • Note the Spark Master URL; it should be something like: spark://ip-[MY_PRIVATE_IP but with dashes]:7077

Create the AWS Image

Now that we have completed the necessary setup, we will take a snapshot of the current state of our Master server. We will use this image to easily deploy the worker servers.

  • Go to the AWS console: https://console.aws.amazon.com

  • Select EC2

  • Click on Instances

  • Right-click on the instance for the Master

  • Select Image and then Create Image

  • Image Name: spark

  • Click Create Image

Install RStudio

The steps below install the current version. To find updated instructions go to: https://www.rstudio.com/products/rstudio/download-server/ and select Debian/Ubuntu

  • Start a new terminal session

  • Download and install RStudio Server

    wget https://download2.rstudio.org/rstudio-server-0.99.903-amd64.deb
    sudo gdebi rstudio-server-0.99.903-amd64.deb
    sudo adduser rstudio
  • Install the prerequisites needed for ‘devtools’ to work. This step won’t be needed once ‘sparklyr’ is on CRAN

    sudo apt-get -y install libcurl4-gnutls-dev
    sudo apt-get -y install libssl-dev
  • Start the Master server

    sudo spark-2.0.0/sbin/start-master.sh
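
Optionally, confirm the RStudio service is healthy; it listens on port 8787 by default, so http://MY_PUBLIC_DNS:8787 should show a login page for the rstudio user:

sudo rstudio-server verify-installation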

Spark Workers

Launch the workers

The AMI we created earlier will be used to deploy the workers.

  • Go to the AWS console: https://console.aws.amazon.com

  • Select EC2

  • Click on AMIs

  • Right-click on the spark AMI

  • Select Launch

  • Step 2 - Instance Type: m4.large [You can select a different server size; a smaller server will run more slowly but may be more cost effective]

  • Step 3 - Number of instances: 3 [A different number of instances can be selected]

  • Step 4 - Storage: Size 20 GiB

  • Step 5 - Name: worker

  • Step 6 - Select an existing group | Name: spark

  • Click Launch

  • After clicking Launch, the “Select an existing key pair or create a new key pair” screen will appear; select:

    Choose an existing key pair

    Key pair name: spark

  • Launch Instance

Starting and connecting the workers

This part is a little repetitive: you will need to follow these steps for each of the workers that were deployed. Additionally, if you stop the instances in AWS, you will need to follow these steps again:

  • Go to the AWS console: https://console.aws.amazon.com

  • Select EC2

  • Click on Instances

  • Select a worker and note the Public DNS

  • Start a new terminal session that connects to that worker

  • Start the slave service and close the terminal session (use the Master’s private IP with dots, not the dashed hostname form)

    sudo spark-2.0.0/sbin/start-slave.sh spark://[MY_PRIVATE_IP]:7077
    # Example: sudo spark-2.0.0/sbin/start-slave.sh spark://172.31.1.80:7077
    exit
  • Navigate to http://MY_PUBLIC_DNS:8080 (the Master’s public DNS); the new worker(s) should be listed under Workers

Connect RStudio to Spark via ‘sparklyr’
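
Before running the script below, install ‘sparklyr’ from the R console. A sketch of that step, assuming the GitHub install route (this is what the ‘devtools’ prerequisites above were for; once ‘sparklyr’ is on CRAN, a plain install.packages("sparklyr") should do):

install.packages("devtools")
devtools::install_github("rstudio/sparklyr")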

library(sparklyr)
library(readr)
library(dplyr)

# Connect to the Spark Master noted earlier; spark_home assumes the
# build location used in this guide (/home/ubuntu/spark-2.0.0)
my_context <- spark_connect(master = "spark://[MY_PRIVATE_IP but with dashes]:7077",
                            spark_home = "/home/ubuntu/spark-2.0.0")

csv_file <- "https://www.huduser.gov/portal/datasets/hads/hads2013n_ASCII.zip"
download.file(url = csv_file, destfile = "hads.zip")
unzip("hads.zip")  # extracts thads2013n.txt

csv_hads <- read_csv("thads2013n.txt", n_max = 5000)

# Copy the local data frame into the Spark cluster
spark_hads <- copy_to(my_context, csv_hads)

avg_beds <- spark_hads %>%
  group_by(AGE1) %>%
  summarise(count = n(),
            avg_beds = mean(BEDRMS))

print(avg_beds)

spark_disconnect(my_context)