Amazon Web Services

Amazon Web Services (AWS) is a secure cloud services platform, offering compute power, database storage, content delivery and other functionality to help businesses scale and grow. Customers can leverage AWS cloud products and solutions to build sophisticated applications with increased flexibility, scalability and reliability.

Quick Launch

Create an AWS account.
Log into your AWS Management Console.
Click on the Launch a virtual machine link.
Quick launch an EC2 Instance of Windows or Linux with default settings.
Navigate to the EC2 Management Console.
Select Instances from the menu options.
Wait until the instance state is “running.”
Click the Connect button and follow the instructions:
- Connecting to Windows.
- Connecting to Linux.

Beware of AWS Free Tier limits. There are several surprise charges that will accrue if the instance does not meet the Free Tier limits exactly. The free tier allotment for Linux and Microsoft Windows EC2 instances is counted separately; you can run 750 hours of a Linux t2.micro or t1.micro instance plus 750 hours of a Windows t2.micro or t1.micro instance each month for the first 12 months.

Google Cloud Platform

Google Cloud Platform (GCP), offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search and YouTube. The beauty of GCP is in its simplicity. There are not as many knobs and buttons as in AWS, but the same or similar results are still attainable. The simplicity of GCP’s design requires more familiarity with certain concepts but gets everything done in fewer screens, and GCP even provides helpful tutorials. On top of all this, GCP is transparent with billing.

Quick Launch

Create a GCP account.
Log into your GCP Console.
Go to the Compute Engine dashboard.
Select Create Instance from the menu options.
Select the Quickstart option and follow the tutorial.
For what follows, a g1-small Ubuntu Zesty 17.04 30G machine was used.
Wait for the instance to start; a green check mark appears next to the instance name when it is ready.
Select the desired Connect drop-down option and follow instructions:
- Connecting to Windows.
- Connecting to Linux.

RStudio Server

RStudio is an integrated development environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. RStudio Server enables you to provide a browser-based interface to a version of R running on a remote Linux server. Once connected to the server, enter the following commands.

sudo apt-get update
sudo apt-get install r-base
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-1.1.419-amd64.deb
sudo gdebi rstudio-server-1.1.419-amd64.deb
rm rstudio-server-1.1.419-amd64.deb
sudo apt-get install libcurl4-gnutls-dev # installs curl-config for RCurl package
sudo apt-get install libssl-dev # installs openssl-dev for PKI and rsconnect package
sudo apt-get install r-cran-car # Companion to Applied Regression functions
sudo apt-get install r-cran-ggplot2 # For enhanced graphical capabilities
sudo apt-get install r-cran-dplyr # For enhanced data manipulation
sudo apt-get install r-cran-tidyr # For enhanced data manipulation

By default, RStudio Server runs on port 8787 and accepts connections from all remote clients. After installation you should therefore be able to access the server by navigating a web browser to http://public_dns_name:8787, where public_dns_name is either the AWS IPv4 Public IP described above, or the GCP External IP address for your VM Instance. RStudio will prompt for a username and password, so you must first add a user with a password and root permissions.

sudo adduser rstudio
sudo usermod -aG sudo rstudio
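
Before opening the browser, it can help to confirm that something is actually listening on port 8787. The following is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available on the server. The name port_open is a hypothetical helper, not part of RStudio Server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 seconds.
port_open() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# 8787 is the default RStudio Server port mentioned above.
if port_open localhost 8787; then
  echo "Port 8787 is open: try http://public_dns_name:8787"
else
  echo "Nothing is answering on port 8787 yet"
fi
```

If the port is open locally but the browser still cannot connect, the cloud firewall or security group rules are the next thing to check.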

RMarkdown

RMarkdown produces high quality documents, reports, presentations and dashboards that are fully reproducible. RMarkdown weaves together narrative text and code from multiple languages (R, Python, SQL, etc.) to produce elegantly formatted static and dynamic outputs in HTML, PDF, MS Word, Beamer, HTML5 slides, Tufte-style handouts, books, dashboards, shiny applications, scientific articles, websites, and more. To install, run the following script in the R Console.

install.packages("rmarkdown") # Make dynamic documents
install.packages("RCurl") # For publishing markdown docs
install.packages("PKI") # For publishing markdown docs
install.packages("rsconnect") # For publishing markdown docs
install.packages("plotly") # For advanced visualizations
install.packages("shiny") # For interactive visualizations

LaTeX

LaTeX is a markup language mainly used to create technical or scientific articles, papers, reports, books or PhD theses. Both the Pandoc document converter and the LaTeX markup language are required to create PDF outputs. To install them, connect to the server and enter the following commands.

sudo apt-get install haskell-platform
sudo apt-get install texlive-full
sudo apt-get install texmaker

Troubleshooting

RStudio sessions sometimes hang in a code execution loop. Clearing the workspace will stop the code from running even when restarting the session fails. Note that this will clear all your projects.

sudo rstudio-server active-sessions
sudo rstudio-server force-suspend-session <PID>
sudo rstudio-server active-sessions
su - rstudio  # requires password
whoami
sudo rm -r ~/.rstudio # requires password
exit # user logout

Jupyter Notebook Server

Conda is an open source package management system and environment management system. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language. Anaconda is a set of over two hundred packages including conda, numpy, scipy, ipython notebook, and more. Miniconda is a smaller alternative to Anaconda that is just conda and its dependencies.

curl -O https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
bash Anaconda3-5.0.1-Linux-x86_64.sh

Go through the four prompts by typing Ctrl+v, then typing “yes”, hitting Enter, then typing “yes” again. Failing to enter the last “yes” that prepends the Anaconda install location to PATH will raise errors that require editing .bashrc to prepend the Anaconda3 install location to PATH.

source ~/.bashrc
conda list
rm Anaconda3-5.0.1-Linux-x86_64.sh
conda update conda 
conda update anaconda
conda update jupyter
jupyter notebook --generate-config
jupyter notebook password
nohup jupyter notebook --ip='*' --no-browser >/dev/null 2>&1 &
jupyter notebook list
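
The flags passed on the command line above can also be stored in the config file that jupyter notebook --generate-config creates, so the server can be started without them. This is a minimal sketch, assuming the classic Notebook server, whose options live under NotebookApp. The function name write_jupyter_settings is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append remote-access settings to a Jupyter config file.
# Option names assume the classic Notebook server (NotebookApp).
write_jupyter_settings() {
  local config_file=$1
  cat >> "$config_file" <<'EOF'
c.NotebookApp.ip = '*'              # listen on all interfaces, same as --ip='*'
c.NotebookApp.open_browser = False  # same as --no-browser
c.NotebookApp.port = 8888           # default Jupyter port
EOF
}

# Usage (path created by `jupyter notebook --generate-config`):
#   write_jupyter_settings ~/.jupyter/jupyter_notebook_config.py
#   nohup jupyter notebook >/dev/null 2>&1 &
```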

Jupyter Notebooks are not limited to the Python 3 programming language. Jupyter Notebooks can handle Python 2 and other programming languages as well. The following installs additional Kernels for Python 2, Bash, and R. After installing these Kernels, each of the options will be available when creating a new notebook.

conda create -n ipykernel_py2 python=2 ipykernel # Install Python 2 Kernel
source activate ipykernel_py2 # Install Python 2 Kernel
python -m ipykernel install --user # Install Python 2 Kernel
source deactivate # Install Python 2 Kernel
pip install bash_kernel # Install Bash Kernel
python -m bash_kernel.install # Install Bash Kernel
conda install -c r r-irkernel # Install R Kernel

Troubleshooting

To stop a running server, get the Process Identification Number (PID) and then kill the PID.

netstat -tulpn # PID/python will be listed
kill <PID>
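
Picking the PID out of the netstat listing by eye can be replaced with a small filter. The sketch below assumes the usual Linux netstat -tulpn column layout (local address in column 4, state in column 6, PID/program in column 7). The helper name pid_for_port and the sample line are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `netstat -tulpn` output on stdin and print the PID
# of the process listening on the given TCP port.
pid_for_port() {
  local port=$1
  awk -v suffix=":${port}" '
    $6 == "LISTEN" && $4 ~ (suffix "$") {
      split($7, parts, "/")   # column 7 looks like "1234/python"
      print parts[1]
    }'
}

# Illustrative input line (not captured from a real server):
sample='tcp        0      0 0.0.0.0:8888            0.0.0.0:*               LISTEN      1234/python'
echo "$sample" | pid_for_port 8888
```

On a live server the same filter would be fed from netstat -tulpn directly. Running netstat with sudo may be needed to see PIDs of processes owned by other users.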

If the error conda: command not found (or something similar) is encountered, prepend the Anaconda install location to PATH.

# TEMPORARY SOLUTION
export PATH=~/anaconda3/bin:$PATH
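
The export above lasts only for the current shell session. A more permanent variant is to write the same line into .bashrc, which is what the installer's last prompt would have done. This is a sketch under the assumption that Anaconda was installed to the default ~/anaconda3 location. The helper name persist_path_entry is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a PATH export to an rc file once, skipping duplicates.
persist_path_entry() {
  local rc_file=$1 dir=$2
  local line="export PATH=${dir}:\$PATH"
  # -F: fixed string, -x: match the whole line, -q: quiet
  grep -Fxq "$line" "$rc_file" 2>/dev/null || echo "$line" >> "$rc_file"
}

# Usage, assuming the default install location used above:
#   persist_path_entry ~/.bashrc "$HOME/anaconda3/bin"
#   source ~/.bashrc
```

The duplicate check keeps .bashrc from growing if the helper is run more than once.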

Docker Image Containers

Containers are a way to package software in a format that can run isolated on a shared operating system. Unlike VMs, containers do not bundle a full operating system - only libraries and settings required to make the software work are needed. This makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it’s deployed.

Docker is the world’s leading software container platform. Developers use Docker to eliminate “works on my machine” problems when collaborating on code with co-workers. Operators use Docker to run and manage apps side-by-side in isolated containers to get better compute density. Enterprises use Docker to build agile software delivery pipelines to ship new features faster, more securely and with confidence for Linux, Windows Server, and Linux-on-mainframe apps.

Docker Cloud on AWS

Docker for AWS is installed with a CloudFormation template that configures Docker in swarm mode, running on instances backed by custom AMIs. It is the easiest way to get started, and requires the least amount of work. All you need to do is run the CloudFormation template, answer “Which SSH key to use?” in the Specify Details section, check the acknowledgment in the Review section, and you are good to go. The CloudFormation template will create everything that you need from scratch: a new VPC, subnets, gateways, and everything else needed in order to run Docker for AWS. Allowing Docker for AWS to create the VPC allows Docker to optimize the environment.

Beware of AWS Free Tier limits. The free tier allotment for Linux and Microsoft Windows EC2 instances is counted separately; you can run 750 hours of a Linux t2.micro or t1.micro instance plus 750 hours of a Windows t2.micro or t1.micro instance each month for the first 12 months. Therefore, launching a swarm with anything more than one manager in the swarm will not fall within the Free Tier. There are also several surprise charges that will accrue if the instance does not meet the Free Tier limits exactly.
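
The arithmetic behind the one-manager limit can be sketched as follows. It assumes that the 750 hours are summed across all instances of the same operating system family and that every instance runs around the clock for a 31-day month. The helper name instance_hours is hypothetical.

```shell
#!/usr/bin/env bash
# Free Tier allowance per OS family per month, as stated above.
FREE_HOURS=750

# Hypothetical helper: instance-hours used by N instances running 24h for D days.
instance_hours() {
  local instances=$1 days=$2
  echo $(( instances * 24 * days ))
}

for n in 1 2; do
  hours=$(instance_hours "$n" 31)
  if [ "$hours" -le "$FREE_HOURS" ]; then
    echo "$n instance(s): $hours h, within $FREE_HOURS h"
  else
    echo "$n instance(s): $hours h, exceeds $FREE_HOURS h by $(( hours - FREE_HOURS )) h"
  fi
done
```

Under these assumptions a single instance uses 744 hours and fits the allowance, while two instances use 1488 hours and do not.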

Logging into Nodes

Instructions for connecting to an AWS Linux Instance can be found here. Connections to Docker will be through SSH (using PuTTY if logging in from a Windows computer). Your Host Name will be in the format user_name@public_dns_name, where user_name is docker (rather than root or ec2-user) and public_dns_name is the Public DNS (IPv4) of the Instance you are logging into. To find the Public DNS, open the Instances menu option in the AWS EC2 Dashboard and look at the Description tab of the instance you are trying to access. Note: you must create a rule allowing “All traffic” from “My IP” under the Inbound tab of the “Docker-SwarmWideSG-…” Security Group.
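
From a Linux or macOS terminal, the host name format described above translates into an ssh command along these lines. The key file path and the DNS name are placeholders, and docker_node_ssh is a hypothetical helper that only prints the command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the SSH command for a Docker for AWS node.
# The login user is "docker", as noted above.
docker_node_ssh() {
  local key_file=$1 public_dns=$2
  printf 'ssh -i %s docker@%s\n' "$key_file" "$public_dns"
}

# Placeholder key path and DNS name; substitute your own values.
docker_node_ssh ~/.ssh/my-key.pem ec2-203-0-113-10.compute-1.amazonaws.com
```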

Docker Cloud on GCP

There are several options for installing Docker on GCP instances which support nested virtualization. Google Container Engine, powered by Kubernetes, supports the common Docker container format. Compute Engine provides several public VM images that you can use to create instances and run your container workloads. Some of these public VM images have a minimalistic container-optimized operating system that includes newer versions of Docker, rkt, or Kubernetes preinstalled. When you run container workloads on Compute Engine, you have the freedom to employ whatever container technologies and orchestration tools you need. The following commands install Docker on an Ubuntu Zesty 17.04 instance using the Docker repository.

sudo apt-get remove docker docker-engine docker.io
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

(Optional) Verify key has fingerprint 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88.

sudo apt-key fingerprint 0EBFCD88
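
Comparing forty hex digits by eye is error-prone, and the spacing that apt-key prints may differ from the fingerprint as written above. A minimal sketch of an automated comparison follows. It strips whitespace before matching, and fingerprint_matches is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Fingerprint quoted above, with the spaces removed.
EXPECTED=9DC858229FC7DD38854AE2D88D81803C0EBFCD88

# Hypothetical helper: read `apt-key fingerprint` output on stdin and succeed
# only if the output, ignoring spaces and tabs, contains the expected fingerprint.
fingerprint_matches() {
  tr -d ' \t' | grep -Fq "$EXPECTED"
}

# Usage:
#   sudo apt-key fingerprint 0EBFCD88 | fingerprint_matches && echo "key OK"
```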

Continue the installation process.

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce
sudo docker run hello-world

Logging into Nodes

Instructions for connecting to a GCP Linux Instance can be found here. Connections to Docker will be through SSH on your browser. Note: you must Create a firewall rule allowing Ingress to “All instances in the Network” from your IP address in order to “Allow all” (or selected) Protocols and ports access to the instance.

RStudio Server Image

RStudio Server enables you to provide a browser-based interface to a version of R running on a remote Linux server. Once the download is finished, RStudio Server will launch in the background. To connect to it, open a web browser and navigate to http://public_dns_name:8787, where public_dns_name is the IPv4 Public IP described above. You will see the RStudio welcome screen. Log in using “rstudio” as both the username and the password.

sudo docker run -d -p 8787:8787 -e ROOT=TRUE -e USER=rstudio -e PASSWORD=rstudio rocker/rstudio

Anaconda Python Image

Anaconda is the leading open data science platform powered by Python. The open source version of Anaconda is a high performance distribution and includes over 100 of the most popular Python packages for data science. Additionally, it provides access to over 720 Python and R packages that can easily be installed using the conda dependency and environment manager, which is included in Anaconda. You can download and run this image using the following commands:

docker pull continuumio/anaconda3
docker run -i -t continuumio/anaconda3 /bin/bash

Alternatively, you can start a Jupyter Notebook server and interact with Anaconda via your browser:

docker run -i -t -p 8888:8888 continuumio/anaconda3 /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser"

You can then view the Jupyter Notebook by opening http://public_dns_name:8888 in your browser, where public_dns_name is either the AWS IPv4 Public IP described above, or the GCP External IP address for your VM Instance.

DataQuest Data Science Image

Log into the Node using SSH (through PuTTY if necessary). Create a directory, download the Docker Data Science image, and set file permissions.

sudo mkdir -p /home/me/notebooks/
sudo docker run -d -p 8888:8888 -v /home/me/notebooks:/home/ds/notebooks dataquestio/python3-starter
sudo docker ps # to get <container_hash>
sudo docker exec -it <container_hash> bash
sudo chmod 777 /home/ds/notebooks/

Alternatively, permissions can be set from within Jupyter. Open a web browser and navigate to http://public_dns_name:8888, where public_dns_name is either the AWS IPv4 Public IP described above, or the GCP External IP address for your VM Instance. This will open the Jupyter interface. Start a “New Terminal” and set file permissions.

sudo chmod 777 /home/ds/notebooks/

Update DataQuest Image

pip install --upgrade pip
touch requirements.txt
pip freeze > requirements.txt
pip install -U $(pip freeze | awk '{split($0, a, "=="); print a[1]}')

After the mass update, you need to revert notebook, ipython, and ipykernel to the original DataQuest versions. The first package is for accessing the Jupyter interface and the other two work to keep new kernels active. The pre-installed versions can be found in the requirements.txt file.

grep -r 'notebook==\|ipython==\|ipykernel==' requirements.txt

At the time of this document (October 2017), they were as follows:

pip install notebook==4.0.6 ipython==4.0.0 ipykernel==4.1.1
pip freeze > requirements.txt
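
The pinned versions do not have to be typed by hand. The sketch below reads them from the requirements.txt saved before the mass update, matching whole package names so that, for example, ipython does not also pick up a package whose name merely starts with ipython. The helper name pinned_specs is hypothetical. It must run before the final pip freeze overwrites requirements.txt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the pinned "name==version" lines for the given
# packages from a pip requirements file, matching whole package names only.
pinned_specs() {
  local req_file=$1; shift
  local pkg
  for pkg in "$@"; do
    grep -E "^${pkg}==" "$req_file" || true   # skip packages that are not pinned
  done
}

# Usage, run before requirements.txt is overwritten by the final pip freeze:
#   pip install $(pinned_specs requirements.txt notebook ipython ipykernel)
```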

Cloudera Hadoop QuickStart Image

Cloudera CDH (Cloudera’s Distribution Including Apache Hadoop) is to Hadoop what Anaconda is to Python. Hadoop (part of the Apache project) is a Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. A Hadoop Distributed File System (HDFS) stores data in multiple nodes (file systems). The data is accessed through a MapReduce framework that splits the input data set into independent chunks and processes the chunks in a completely parallel manner. The outputs of the maps (parallel outputs) are then sorted and passed as input to the reduce tasks. Hive (for structured data) and Pig (for semi-structured data) can be used to query Hadoop data through MapReduce: Hive with a SQL-like language (HiveQL) and Pig with its own scripting language (Pig Latin). Impala is native to Hadoop and allows users to query Hadoop data directly using SQL. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Hue is a web user interface (UI).

sudo docker pull cloudera/quickstart:latest
sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i -d -p 8888:8888 -p 7180:7180 cloudera/quickstart /usr/bin/docker-quickstart /home/cloudera/cloudera-manager --express
sudo docker ps # to get <container_hash>
sudo docker attach <container_hash>

Hue is a web user interface (UI) for Hadoop. After installation you should be able to access it by navigating a web browser to http://public_dns_name:8888. Cloudera Manager is an application for managing CDH clusters. After installation you should be able to access it by navigating a web browser to http://public_dns_name:7180. For both applications, public_dns_name is either the AWS IPv4 Public IP described above, or the GCP External IP address for your VM Instance. Both Hue and Cloudera Manager will prompt for a username and password, which are both “cloudera” by default.

Note: The QuickStart Docker Container image is no longer updated or maintained. Also, a container dies when you exit the shell, but you can disconnect and leave the container running by typing Ctrl+p followed by Ctrl+q.

Spark, Mesos, Jupyter, Python Image

Apache Spark is an open source framework that combines an engine for distributing programs across clusters of machines with an elegant model for writing programs atop it. Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. With Mesos clusters, any version of Spark drivers and executors can run in Docker containers. Using a Jupyter Notebook makes it easy to write programs to access the Spark clusters. Finally, Python (in the form of PySpark) has less of a learning curve and is considered easier to use, less verbose, and more readable than Scala (Spark’s native language).

sudo docker run -d -p 8888:8888 --user root -e GRANT_SUDO=yes jupyter/pyspark-notebook start-notebook.sh --NotebookApp.token=''
sudo docker ps # to get <container_hash>
sudo docker exec -it <container_hash> bash
pip install pyspark --upgrade
pip install findspark --upgrade

Declaring -d will run the container in “detached” mode in the background instead of the default foreground mode. This allows the user to continue using the command line. Omitting -d will require Ctrl+C to get back to the command line. Omitting the ...token='' option will result in a token being assigned automatically and a message similar to the one below:

Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=…

To start using Spark through the Jupyter Notebook interface, open a web browser and navigate to http://public_dns_name:8888, where public_dns_name is either the AWS IPv4 Public IP described above, or the GCP External IP address for your VM Instance.

PySpark Shell

cd ../../usr/local/spark
./bin/pyspark

Remove Exited Containers

sudo docker rm $(sudo docker ps -a -f status=exited -q)

Removing Docker Images

sudo docker images -a # get <image_id>
sudo docker rmi -f <image_id>

Louis Aslett AMI

The fastest way to get started with a cloud Data Science environment is the one-click solution fashioned by Louis Aslett, an Assistant Professor in the Department of Mathematical Sciences at Durham University. Aslett created an Amazon Machine Image (AMI) specifically targeted at R and RStudio Server with the goal of making it a one-minute job to get going for anyone with an AWS account. Many common tools and dependencies are built-in. AMIs also include a web interface (Jupyter) which enables support for Julia (and Python). More information can be found on his page. The downside is that it has a limited number of installed packages and you need root permissions, which you will not have, to make changes.

System Info

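The instance used for this document is a g1-small (1 vCPU, 1.7 GB memory) machine with RStudio Server, RMarkdown, pandoc, LaTeX and Jupyter Notebook Server installed. The commands below show the kernel version (uname -a), CPU details (lscpu), block devices (lsblk), and free disk space in human-readable units (df -h).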

uname -a

## Linux ubuntu 4.13.0-38-generic #43-Ubuntu SMP Wed Mar 14 15:20:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

lscpu

## Architecture:        x86_64
## CPU op-mode(s):      32-bit, 64-bit
## Byte Order:          Little Endian
## CPU(s):              1
## On-line CPU(s) list: 0
## Thread(s) per core:  1
## Core(s) per socket:  1
## Socket(s):           1
## NUMA node(s):        1
## Vendor ID:           GenuineIntel
## CPU family:          6
## Model:               63
## Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
## Stepping:            0
## CPU MHz:             2300.000
## BogoMIPS:            4600.00
## Hypervisor vendor:   KVM
## Virtualization type: full
## L1d cache:           32K
## L1i cache:           32K
## L2 cache:            256K
## L3 cache:            46080K
## NUMA node0 CPU(s):   0
## Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti retpoline fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt

lsblk

## NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
## sda       8:0    0   30G  0 disk 
## ├─sda1    8:1    0 29.9G  0 part /
## ├─sda14   8:14   0    4M  0 part 
## └─sda15   8:15   0  106M  0 part /boot/efi

df -h

## Filesystem      Size  Used Avail Use% Mounted on
## udev            837M     0  837M   0% /dev
## tmpfs           169M   11M  159M   6% /run
## /dev/sda1        29G   15G   15G  51% /
## tmpfs           845M     0  845M   0% /dev/shm
## tmpfs           5.0M     0  5.0M   0% /run/lock
## tmpfs           845M     0  845M   0% /sys/fs/cgroup
## /dev/sda15      105M  3.4M  102M   4% /boot/efi