I have been compiling step-by-step documentation, using help guides, blog posts and insights from previous exercises. Now that Spark 2.0 is out, I figured it was a good opportunity to update my 1.6 documentation and to make it available to others.
The plan is to leverage a feature in AWS that allows you to replicate an existing server setup.
Go to the AWS console: https://console.aws.amazon.com
Select EC2
Click on Instances
Step 1 - Amazon Machine Image (AMI): Ubuntu
Step 2 - Instance Type: m4.large
Step 3 - No changes
Step 4 - Storage: Size 20 GiB
Step 5 - No changes
Step 6 - Security Group Name: spark
Click Add Rule, select Type: All Traffic | Source: My IP
Click Add Rule, select Type: All Traffic | Source: Custom and type the first two numbers of the IP address followed by ‘.0.0/16’. (So, if your Shiny server’s internal IP address is 172.31.2.200, you’d enter 172.31.0.0/16. This gives every server in your VPC access to each other.)
After clicking Launch the “Select existing pair or create a new pair” screen will appear, select:
Create a new pair
Key pair name: spark
Click Download Key Pair
Save the file
Click Launch Instances
Go to the Instances section in the EC2 Dashboard section and click on the new instance
Copy the Public DNS address into a text editor on your laptop; from here on we’ll refer to it as MY_PUBLIC_DNS
Copy the Private IP address into a text editor on your laptop; from here on we’ll refer to it as MY_PRIVATE_IP
Next, you need to add a passphrase to the private key (the “Key Pair” file) downloaded from AWS.
For ease, we’ll use spark as the Key Passphrase
AWS provides the step-by-step instructions for this part:
Windows - http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html
Complete sections Converting Your Private Key Using PuTTYgen and Starting a PuTTY Session
Non-Windows - http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html
Only go through the instructions in the To connect to your instance using SSH section
Start a new terminal session using the instructions in the Key and Connection setup from laptop section.
The initial password you need to enter when prompted with ‘Passphrase for key “imported-openssh-key”:’ is: spark
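For reference, on macOS or Linux the passphrase and connection steps look roughly like this (a sketch that assumes the key file was saved as spark.pem and the default ubuntu user):
chmod 600 spark.pem
# Add the spark passphrase to the existing key
ssh-keygen -p -f spark.pem
# Connect to the instance; you'll be prompted for the passphrase
ssh -i spark.pem ubuntu@MY_PUBLIC_DNS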
Tip: The terminal commands inside the boxes can be copied and pasted into your terminal session. In PuTTY you can use right-click as the “paste” command.
The latest version of Java needs to be installed
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
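Before moving on, it’s worth confirming the install worked; this should report something like java version “1.8.0_…”:
java -version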
Spark is based on Scala, so we need to install it early in the process
sudo wget http://www.scala-lang.org/files/archive/scala-2.11.8.tgz
sudo mkdir /usr/local/src/scala
sudo tar -xvf scala-2.11.8.tgz -C /usr/local/src/scala
We’ll also need to tell Ubuntu where we installed Scala. To do this, we update a configuration file called ‘.bashrc’
For editing, we’ll use an application called ‘vi’. This application is a very pared-down file editor.
Open the bashrc file
vi .bashrc
Use the arrow keys to go to the bottom of the file
Press the {Insert} key
Type:
export SCALA_HOME=/usr/local/src/scala/scala-2.11.8
export PATH=$SCALA_HOME/bin:$PATH
Press {Esc}
Type :wq and Enter (to save and close)
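If you’d rather not use vi, the same two lines can be appended straight from the shell instead (a minimal equivalent of the edit above):
echo 'export SCALA_HOME=/usr/local/src/scala/scala-2.11.8' >> ~/.bashrc
echo 'export PATH=$SCALA_HOME/bin:$PATH' >> ~/.bashrc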
Ask Ubuntu to read the new bashrc file
. .bashrc
Verify that the new Scala version is recognized. The command below should return something like Scala code runner version 2.11.8 …
scala -version
First, we will install git
sudo apt-get install git
Now we’ll update our Ubuntu server and then download the latest Spark installation files (note that ‘update’ has to run before ‘upgrade’ so the upgrade sees the latest package lists)
sudo apt-get update
sudo apt-get upgrade
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0.tgz
tar -xvzf spark-2.0.0.tgz
cd spark-2.0.0
We have to build Spark on the server; this changed going from Spark 1.6 to 2.0. The Spark documentation recommends a tool called ‘Maven’ to build Spark, but I’ve been more successful using ‘SBT’
To enable ‘hive’ support, we’ll add the two options at the end; this is different from the original SparkR installation.
sudo build/sbt package -Phive -Phive-thriftserver
Side comment - In my original notes I have a note that says ‘Go watch paint dry’, since it took the VM on my laptop 78 minutes to complete. In AWS, using the m4.large instance type, it should take around 4 minutes.
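Once the build finishes, a quick sanity check is to ask the freshly built Spark for its version; run this from inside the spark-2.0.0 directory and it should print a 2.0.0 banner (assuming the sbt build laid the jars out where the launcher scripts expect them):
bin/spark-submit --version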
We need to update the package list so that Ubuntu installs the latest R version. Here is the reference: https://cran.r-project.org/bin/linux/ubuntu/
sudo sh -c 'echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base
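As with Java, a quick version check confirms the newer R was picked up:
R --version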
sudo apt-get install gdebi-core
We’ll start the master service and close our terminal session
sudo spark-2.0.0/sbin/start-master.sh
exit
Navigate to http://MY_PUBLIC_DNS:8080
Note the Spark Master URL; it should be something like: spark://ip-[MY_PRIVATE_IP but with dashes]:7077
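If you prefer to confirm from the terminal, the master’s web UI embeds that same URL in its HTML; a rough one-liner to pull it out (assuming curl is installed on your laptop):
curl -s http://MY_PUBLIC_DNS:8080 | grep -o 'spark://[^"<]*'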
Now that we have completed the necessary setup, we will take a snapshot of the current state of our Master server. We will use this image to easily deploy the worker servers.
Go to the AWS console: https://console.aws.amazon.com
Select EC2
Click on Instances
Right-click on the instance for the Master
Select Image and then Create Image
Image Name: spark
Click Create Image
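For reference, the same snapshot can be taken from the AWS CLI, if you have it installed and configured (the instance ID below is a placeholder; use your Master’s instance ID):
aws ec2 create-image --instance-id i-0123456789abcdef0 --name spark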
The steps below install the current version. To find updated instructions go to: https://www.rstudio.com/products/rstudio/download-server/ and select Debian/Ubuntu
Start a new terminal session
Download and install RStudio Server
wget https://download2.rstudio.org/rstudio-server-0.99.903-amd64.deb
sudo gdebi rstudio-server-0.99.903-amd64.deb
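RStudio Server ships with a built-in check you can run to confirm the install went through cleanly:
sudo rstudio-server verify-installation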
sudo adduser rstudio
Install the prerequisites to get ‘devtools’ to work. This step won’t be needed once ‘sparklyr’ is on CRAN
sudo apt-get -y install libcurl4-gnutls-dev
sudo apt-get -y install libssl-dev
Start the Master server (creating the AMI reboots the instance by default, so the service needs to be started again)
sudo spark-2.0.0/sbin/start-master.sh
The AMI we created earlier will be used to deploy the workers.
Go to the AWS console: https://console.aws.amazon.com
Select EC2
Click on AMIs
Right-click on the spark AMI
Select Launch
Step 2 - Instance Type: m4.large [You can select a different server size; a smaller server will run slower but may be more cost effective]
Step 3 - Number of instances: 3 [A different number of instances can be selected]
Step 4 - Storage: Size 20 GiB
Step 5 - Name: worker
Step 6 - Select an existing group | Name: spark
Click Launch
After clicking Launch the “Select existing pair or create a new pair” screen will appear, select:
Choose an existing key pair
Key pair name: spark
Launch Instance
This part is a little repetitive. You will need to follow these steps for each of the workers that were deployed, and if you stop the instances in AWS, you will need to follow them again (a scripted alternative is sketched after these steps):
Go to the AWS console: https://console.aws.amazon.com
Select EC2
Click on Instances
Select a worker and note the Public DNS
Start a new terminal session that connects to that worker
Start the slave service and close the terminal session (use the Master’s private IP with dots here, not the dashed form that appears in the hostname)
sudo spark-2.0.0/sbin/start-slave.sh spark://[MY_PRIVATE_IP]:7077
#sudo spark-2.0.0/sbin/start-slave.sh spark://172.31.1.80:7077
exit
Navigate to http://MY_PUBLIC_DNS:8080; the new node(s) should be listed
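Since this has to be repeated per worker, here is a rough loop you could run from your laptop instead of connecting to each one by hand (it assumes each worker’s Public DNS is listed, one per line, in a file called workers.txt and that spark.pem is your key file):
for host in $(cat workers.txt); do
  ssh -i spark.pem ubuntu@$host "sudo spark-2.0.0/sbin/start-slave.sh spark://[MY_PRIVATE_IP]:7077"
done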
Install ‘devtools’ and then the ‘sparklyr’ package from GitHub
install.packages("devtools")
devtools::install_github("rstudio/sparklyr")
Connect to ‘Spark’
library(sparklyr)
# Use the Spark Master URL noted earlier; 172.31.1.184 is this example's private IP
my_context <- spark_connect(master = "spark://172.31.1.184:7077", spark_home = "/home/ubuntu/spark-2.0.0", version = "2.0.0")
Run a test script
library(readr)
library(dplyr)
csv_file <- "https://www.huduser.gov/portal/datasets/hads/hads2013n_ASCII.zip"
download.file(url = csv_file, destfile = "hads.zip")
# The archive has to be extracted before read_csv can find thads2013n.txt
unzip("hads.zip")
csv_hads <- read_csv("thads2013n.txt", n_max = 5000)
spark_hads <- copy_to(my_context, csv_hads)
avg_beds <- spark_hads %>%
  group_by(AGE1) %>%
  summarise(count = n(),
            avg_beds = mean(BEDRMS))
print(avg_beds)
spark_disconnect(my_context)