Background

I have been compiling step-by-step documentation, using help guides, blog posts and insights from previous exercises. Now that Spark 2.0 is out, I figured it was a good opportunity to update my 1.6 documentation and to make it available to others.

The plan is to leverage a feature in AWS that allows you to replicate an existing server setup: we will fully configure a Master server, take an image of it, and use that image to deploy the workers.

AWS/EC2 Setup

Key and Connection setup from laptop

Next, you need to add a passphrase to the private key (the “Key Pair” file) downloaded from AWS.

For ease, we’ll use spark as the Key Passphrase

AWS provides the step-by-step instructions for this part:
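
If you are connecting from macOS or Linux rather than PuTTY, here is a minimal equivalent sketch, assuming the key pair was saved as spark.pem and the AMI’s default ubuntu user:

# Restrict permissions so ssh will accept the key
chmod 400 spark.pem
# Add the passphrase to the existing private key (enter "spark" when prompted)
ssh-keygen -p -f spark.pem
# Open a terminal session to the server
ssh -i spark.pem ubuntu@MY_PUBLIC_DNS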

Master server configuration

Terminal session

  • Start a new terminal session using the instructions in the Key and Connection setup from laptop section.

  • The initial passphrase you need to enter when prompted with “Passphrase for key "imported-openssh-key":” is: spark

Install Java

Tip: The terminal commands inside the boxes can be copied and pasted into your terminal session. In PuTTY, right-click acts as the “paste” command.

The latest version of Java needs to be installed

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
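
To confirm the install, an optional check; this should report a 1.8 build:

java -version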

Install Scala

Spark is based on Scala, so we need to install it early in the process

cd /home/ubuntu/
sudo wget http://www.scala-lang.org/files/archive/scala-2.11.8.tgz
sudo mkdir /usr/local/src/scala
sudo tar -xvf scala-2.11.8.tgz -C /usr/local/src/scala

We’ll also need to tell Ubuntu where we installed Scala. To do this, we update a configuration file called ‘.bashrc’

For editing, we’ll use an application called ‘vi’. This application is a very pared-down text editor. (If you would rather not edit interactively, a shell alternative is shown after these steps.)

  • Open the bashrc file

    vi .bashrc
  • Use the arrow keys to go to the bottom of the file

  • Press the {Insert} key

  • Type:

    export SCALA_HOME=/usr/local/src/scala/scala-2.11.8
    export PATH=$SCALA_HOME/bin:$PATH
  • Press {Esc}

  • Type :wq and Enter (to save and close)

  • Ask Ubuntu to read the new bashrc file

    . .bashrc
  • Verify that the new Scala version is recognized. The command below should return something like Scala code runner version 2.11.8 …

    scala -version
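
If you prefer to skip vi, the same two lines can be appended from the shell instead; a minimal equivalent of the steps above:

echo 'export SCALA_HOME=/usr/local/src/scala/scala-2.11.8' >> ~/.bashrc
echo 'export PATH=$SCALA_HOME/bin:$PATH' >> ~/.bashrc
. ~/.bashrc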

Install Spark

First, we will install git

sudo apt-get install git

Now, we’ll start by updating our Ubuntu server and then download the latest Spark installation files

sudo apt-get update
sudo apt-get upgrade
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0.tgz
tar -xvzf spark-2.0.0.tgz
cd spark-2.0.0

Building Spark to support ‘sparklyr’

We have to build Spark on the server; the build process changed going from Spark 1.6 to 2.0. Spark recommends using a tool called ‘Maven’ to build Spark, but I’ve been more successful using ‘SBT’

To enable ‘hive’ support, we add the two options at the end; this is different from the original SparkR installation.

sudo build/sbt package -Phive -Phive-thriftserver

Side comment - In my original notes I have a note that says ‘Go watch paint dry’, since it took the VM on my laptop 78 minutes to complete. In AWS, using the m4.large instance type, it should take around 4 minutes.
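
Once the build finishes, an optional sanity check from inside the spark-2.0.0 directory; this just prints the Spark version banner:

bin/spark-submit --version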

Install R

We need to update the package list so that Ubuntu installs the latest R version. Here is the reference: https://cran.r-project.org/bin/linux/ubuntu/

sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base
sudo apt-get install gdebi-core
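
A quick optional check that the CRAN version of R was picked up:

R --version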

Start the Master

  • We’ll start the master service and close our terminal session

    sudo spark-2.0.0/sbin/start-master.sh
    exit
  • Navigate to http://MY_PUBLIC_DNS:8080

  • Note the Spark Master URL; it should be something like: spark://ip-[MY_PRIVATE_IP but with dashes]:7077

Create the AWS Image

Now that we have completed the necessary setup, we will take a snapshot of the current state of our Master server. We will use this image to easily deploy the worker servers.

  • Go to the AWS console: https://console.aws.amazon.com

  • Select EC2

  • Click on Instances

  • Right-click on the instance for the Master

  • Select Image and then Create Image

  • Image Name: spark

  • Click Create Image

Install RStudio

The steps below install the current version. To find updated instructions go to: https://www.rstudio.com/products/rstudio/download-server/ and select Debian/Ubuntu

  • Start a new terminal session

  • Download and install RStudio Server

    wget https://download2.rstudio.org/rstudio-server-0.99.903-amd64.deb
    sudo gdebi rstudio-server-0.99.903-amd64.deb
    sudo adduser rstudio
  • Install the prerequisites needed for ‘devtools’ to work. This step won’t be needed once ‘sparklyr’ is on CRAN

    sudo apt-get -y install libcurl4-gnutls-dev
    sudo apt-get -y install libssl-dev
  • Start the Master server

    sudo spark-2.0.0/sbin/start-master.sh
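
Optionally, confirm the RStudio service is healthy; it listens on port 8787 by default, so http://MY_PUBLIC_DNS:8787 should show a login page for the rstudio user:

sudo rstudio-server verify-installation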

Spark Workers

Launch the workers

The AMI we created earlier will be used to deploy the workers.

  • Go to the AWS console: https://console.aws.amazon.com

  • Select EC2

  • Click on AMIs

  • Right-click on the spark AMI

  • Select Launch

  • Step 2 - Instance Type: m4.large [You can select a different server size; a smaller server will run more slowly but may be more cost effective]

  • Step 3 - Number of instances: 3 [A different number of instances can be selected]

  • Step 4 - Storage: Size 20 GiB

  • Step 5 - Name: worker

  • Step 6 - Select an existing group | Name: spark

  • Click Launch

  • After clicking Launch, the “Select an existing key pair or create a new key pair” screen will appear; select:

    Choose an existing key pair

    Key pair name: spark

  • Launch Instance

Starting and connecting the workers

This part is a little repetitive: you will need to follow these steps for each of the workers that were deployed. Additionally, if you stop the instances in AWS, you will need to follow these steps again:

  • Go to the AWS console: https://console.aws.amazon.com

  • Select EC2

  • Click on Instances

  • Select a worker and note the Public DNS

  • Start a new terminal session that connects to that worker

  • Start the slave service and close the terminal session (use the Master’s private IP with dots, not the dashed hostname form)

    sudo spark-2.0.0/sbin/start-slave.sh spark://[MY_PRIVATE_IP]:7077
    # Example: sudo spark-2.0.0/sbin/start-slave.sh spark://172.31.1.80:7077
    exit
  • Navigate to http://MY_PUBLIC_DNS:8080 (the Master’s public DNS); the new worker(s) should be listed under Workers

Connect RStudio to Spark via ‘sparklyr’
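
Before running the script below, install ‘sparklyr’ from the R console. A sketch of that step, assuming the GitHub install route (this is what the ‘devtools’ prerequisites above were for; once ‘sparklyr’ is on CRAN, a plain install.packages("sparklyr") should do):

install.packages("devtools")
devtools::install_github("rstudio/sparklyr")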

library(sparklyr)
library(readr)
library(dplyr)

# Connect to the Spark Master noted earlier; spark_home assumes the
# build location used in this guide (/home/ubuntu/spark-2.0.0)
my_context <- spark_connect(master = "spark://[MY_PRIVATE_IP but with dashes]:7077",
                            spark_home = "/home/ubuntu/spark-2.0.0")

csv_file <- "https://www.huduser.gov/portal/datasets/hads/hads2013n_ASCII.zip"
download.file(url = csv_file, destfile = "hads.zip")
unzip("hads.zip")  # extracts thads2013n.txt

csv_hads <- read_csv("thads2013n.txt", n_max = 5000)

# Copy the local data frame into the Spark cluster
spark_hads <- copy_to(my_context, csv_hads)

avg_beds <- spark_hads %>%
  group_by(AGE1) %>%
  summarise(count = n(),
            avg_beds = mean(BEDRMS))

print(avg_beds)

spark_disconnect(my_context)