Introduction

Two powerful tools in the data scientist’s toolbox are RSelenium (for automated browsing) and Amazon Web Services EC2 instances (for hosting and running your R scripts remotely).

However, using these tools in combination presents a unique challenge not faced when using each in isolation. For beginners especially, the process can be confusing, and clear answers can be difficult to find online.

Here, I lay out a step-by-step guide on setting up an AWS EC2 instance for the specific purpose of using it to host scraping/testing scripts that depend on the RSelenium package.


Setting up an EC2 Instance

The obvious first step in this process is to launch your own EC2 instance. Big thanks to Andrew Collier for originally helping me through this process; many of the following steps come from his tutorial on Remote Computing.


Getting on AWS

  1. If you don’t already have an Amazon Web Services account, make one.

  2. Once you have an AWS account, sign in to the console.


Launching an EC2 Instance

  1. In your AWS console, click the EC2 link under “Compute”.

  2. Click the “Launch Instance” button.

  3. Select the Ubuntu Server option. If you plan on using the rest of this tutorial, it is critical that you select the Ubuntu image and not one of the plethora of other options.

  4. Select the “General purpose” instance type that is eligible for the free tier.

  5. Click “Review and Launch”

  6. Press “Launch”


Creating Keys

  1. In the drop down, select “Create a new key pair”

  2. Name the key pair — the name should be descriptive, but simple and easy to remember.

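  3. Download the key pair and store it somewhere safe on your machine.

  4. Open a terminal and restrict the your-key-name.pem file’s permissions so that only you can read it; SSH will refuse to use a private key that other users can read. Assuming the key is stored in ~/.ssh (adjust the path otherwise), run the following command: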

chmod 400 ~/.ssh/your-key-name.pem
  5. Back in the AWS console, press the “Launch Instances” button.

  6. Press “View Instances”.
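
To see what the chmod 400 step in the list above actually changes, here is a minimal sketch that applies it to a throwaway file. The temporary file is only a stand-in for your-key-name.pem and contains no key material.

```shell
#!/bin/sh
# Demonstrate the effect of `chmod 400` on a stand-in for the key file.
# demo_key is a hypothetical placeholder, not a real private key.
demo_key="$(mktemp)"

chmod 400 "$demo_key"

# The first ten characters of `ls -l` are the file type and permission bits.
perms="$(ls -l "$demo_key" | cut -c1-10)"
echo "permissions: $perms"   # owner may read; nobody else has any access

# Remove the throwaway file again.
rm -f "$demo_key"
```

Running the same ls -l check against your real key file is a quick way to confirm the permissions before you try to connect.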


Setting up R on EC2 Instance

We’ll now need to connect to the EC2 instance and install R so that we can run R scripts on the machine. This can be done easily on Unix-based machines using Terminal, but Windows users will have to connect to their EC2 instance via an SSH client like PuTTY.

Using Unix commands:

  1. Under the description tab of your EC2 instance in the AWS console, copy the “Public DNS” name. It should be similar to the following:
ec2-52-43-168-220.us-west-2.compute.amazonaws.com
  2. Launch Terminal and connect to the remote instance using SSH. For an Ubuntu instance the login is ubuntu@your_public_dns, so the command should look something like the following, with your own key location/name and public DNS substituted. If asked whether you’re sure you want to continue connecting, enter ‘yes’.
ssh -i your-key-name.pem ubuntu@ec2-52-43-168-220.us-west-2.compute.amazonaws.com
  3. Install R on the remote machine:
sudo apt-get update
sudo apt-get install r-base-core
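
Typing the full key path and public DNS name for every connection gets tedious. One option on your local machine is a host alias in ~/.ssh/config. The sketch below writes such an entry to a scratch file so that nothing existing is overwritten. The alias rselenium-ec2 is made up for illustration, and the host name and key path are the placeholder values used above.

```shell
#!/bin/sh
# Write an SSH host alias to a scratch file for review.
# "rselenium-ec2" is a hypothetical alias; HostName and IdentityFile are
# the placeholder values from this tutorial and must be replaced with yours.
config_file="$(mktemp)"

cat > "$config_file" <<'EOF'
Host rselenium-ec2
    HostName ec2-52-43-168-220.us-west-2.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/your-key-name.pem
EOF

cat "$config_file"
```

After reviewing the entry, you could append it to ~/.ssh/config; `ssh rselenium-ec2` should then connect with those settings.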

Install R Packages

  1. Before we can install the RSelenium package from within R, we need to install both the XML package and the RCurl package externally.
sudo apt-get install r-cran-xml
sudo apt-get install r-cran-rcurl
  2. To launch R in the command line of your EC2 instance, simply enter the R command.
R
  3. Installing R packages for the first time on your EC2 instance requires the creation of a personal library to hold the R package files. Run the install.packages() command as usual from within R, and enter ‘y’ at both prompts.
install.packages("RSelenium")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning in install.packages("RSelenium") :
  'lib = "/usr/local/lib/R/site-library"' is not writable
Would you like to use a personal library instead?  (y/n) y
Would you like to create a personal library
~/R/x86_64-pc-linux-gnu-library/3.0
to install packages into?  (y/n) y
  4. You’ll be prompted to select a CRAN mirror. Choose one arbitrarily, or pick the mirror closest to the location of your EC2 instance. The RSelenium package should now install properly.

  5. Load the RSelenium package to ensure that it was correctly installed.

library("RSelenium")
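
If you would rather skip the interactive prompts, for example when rebuilding an instance, the library location and mirror can be named explicitly. The sketch below generates a small helper R script for that purpose. The file name install_rselenium.R and the choice of the cloud.r-project.org mirror are assumptions for illustration, not part of the setup described above.

```shell
#!/bin/sh
# Generate a helper R script that installs RSelenium without prompts.
# install_rselenium.R is a hypothetical file name.
script_path="${TMPDIR:-/tmp}/install_rselenium.R"

cat > "$script_path" <<'EOF'
# Use the per-user library so no write access to the system library is needed.
user_lib <- Sys.getenv("R_LIBS_USER")
dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)
install.packages("RSelenium", lib = user_lib,
                 repos = "https://cloud.r-project.org")
EOF

echo "wrote $script_path; run it on the instance with: Rscript $script_path"
```

The generated script creates the same personal library that the two ‘y’ answers would have created, then installs into it.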

Installing a Headless Browser for RSelenium

Unfortunately, our EC2 instance isn’t quite ready to use RSelenium. While the default browser of both the RSelenium package and Ubuntu instances is Firefox, using RSelenium on a remote EC2 instance requires a headless browser, which Firefox is not.

We’ll be using PhantomJS. Thanks to Julio Napurí for his helpful Gist file on this very topic.

  1. Install or Update system software:
sudo apt-get update
sudo apt-get install build-essential chrpath libssl-dev libxft-dev
  2. Install packages PhantomJS depends on:
sudo apt-get install libfreetype6 libfreetype6-dev
sudo apt-get install libfontconfig1 libfontconfig1-dev
  3. Download PhantomJS itself:
cd ~
export PHANTOM_JS="phantomjs-1.9.8-linux-x86_64"
wget https://bitbucket.org/ariya/phantomjs/downloads/$PHANTOM_JS.tar.bz2
sudo tar xvjf $PHANTOM_JS.tar.bz2
  4. Move PhantomJS to /usr/local/share/ and create a symlink:
sudo mv $PHANTOM_JS /usr/local/share
sudo ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin
  5. Ensure that PhantomJS installed correctly:
phantomjs --version
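
The move-and-symlink step above is what makes the phantomjs command available by name from any directory. To see the mechanism without touching /usr/local, the same pattern can be tried in a scratch directory. In this sketch, mytool is a made-up stand-in for the PhantomJS binary.

```shell
#!/bin/sh
# Reproduce the move-and-symlink pattern in a scratch directory.
# "mytool" is a hypothetical stand-in for the phantomjs binary.
scratch="$(mktemp -d)"
mkdir -p "$scratch/share/mytool-1.0/bin" "$scratch/bin"

# A tiny executable playing the role of bin/phantomjs.
printf '#!/bin/sh\necho "1.0"\n' > "$scratch/share/mytool-1.0/bin/mytool"
chmod +x "$scratch/share/mytool-1.0/bin/mytool"

# Same shape as the `sudo ln -sf ... /usr/local/bin` command above:
# the link is created inside the target directory under the binary's name.
ln -sf "$scratch/share/mytool-1.0/bin/mytool" "$scratch/bin"

# With the link directory on PATH, the tool resolves by name alone.
PATH="$scratch/bin:$PATH"
tool_version="$(mytool)"
echo "mytool reports version $tool_version"
```

On the instance, /usr/local/bin plays the role of the scratch bin directory and is normally already on the PATH of an interactive shell.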

Installing Java on EC2 Instance

In order for the startServer() function to be able to start the Selenium server, our EC2 instance will need Java installed.

  1. Install Java from the command line:
sudo apt-get update
sudo apt-get install default-jre
sudo apt-get install default-jdk
  2. Check that Java was properly installed:
java -version
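
At this point R, PhantomJS, and Java should all be installed. Rather than checking each one by hand, a short script can report which of them are on the PATH. This is a minimal sketch; check_tool is a helper name invented here.

```shell
#!/bin/sh
# Report which of the tools installed so far are available on PATH.
# check_tool is a hypothetical helper name.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1 -> $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in R Rscript phantomjs java; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing tool(s) missing"
```

If anything is reported as missing, revisit the corresponding installation section before moving on.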

Downloading the Standalone Selenium Server

With everything above taken care of, your EC2 instance is now ready to use RSelenium.

First, load R on the command line:

R

The first time you use RSelenium, you’ll need to call checkForServer() to download the Standalone Selenium Server file:

library(RSelenium)
checkForServer()
[1] "DOWNLOADING STANDALONE SELENIUM SERVER. THIS MAY TAKE SEVERAL MINUTES"
trying URL 'http://selenium-release.storage.googleapis.com/2.53/selenium-server-standalone-2.53.1.jar'
Content type 'application/java-archive' length 21231092 bytes (20.2 Mb)
opened URL
==================================================
downloaded 20.2 Mb

Using RSelenium

Once the Standalone Selenium Server is downloaded, your EC2 instance is ready to use RSelenium!

It is crucial that when you call remoteDriver(), you set browserName = "phantomjs". RSelenium cannot function on an EC2 instance without a headless browser, so leaving the default of Firefox will break your script.

Let’s run some simple tasks using Selenium to test its functionality:

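# Start the Selenium server, then give it a few seconds to come up
# (the 5-second pause is an assumed value; adjust if the connection fails).
startServer()
Sys.sleep(5)

# Connect with the headless PhantomJS browser instead of the Firefox default.
browser <- remoteDriver(browserName = "phantomjs")
browser$open()

# Load a random Wikipedia article and read its title.
browser$navigate("https://en.wikipedia.org/wiki/Special:Random")
header <- browser$findElement(using = 'css selector', "#firstHeading")
header$getElementText()

# Search for "Data Science" and read the title of the resulting page.
search <- browser$findElement(using = 'css selector', "#searchInput")
search$sendKeysToElement(list("Data Science", key = 'enter'))
header <- browser$findElement(using = 'css selector', "#firstHeading")
header$getElementText()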

Sending Files to Your EC2 Instance

If you have an RSelenium script on your local machine that you want to move to your EC2 instance, there are a few ways to connect and transfer files to your newly created EC2 instance:

If you’re on a Unix-based OS, you can connect via the command line using SSH as before.

If you’re on a Windows OS, check out an SSH client like PuTTY.

If you’re on a Unix-based OS but prefer a GUI over the command line, check out this great tutorial.
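
For the command-line route, scp copies files over the same SSH connection and accepts the same -i key option. The sketch below, meant for your local machine, prints the command instead of running it so the pieces can be checked first. The key path and host are the placeholders used earlier, and it assumes a my_R_scripts directory already exists in the remote home directory.

```shell
#!/bin/sh
# Print (rather than run) the scp command for uploading a script, so the
# pieces can be checked first. All values below are placeholders.
key="$HOME/.ssh/your-key-name.pem"
host="ec2-52-43-168-220.us-west-2.compute.amazonaws.com"
local_file="my_script.R"
remote_dir="my_R_scripts"   # assumed to exist already, relative to the remote home

scp_cmd="scp -i $key $local_file ubuntu@$host:$remote_dir/"
echo "$scp_cmd"
```

Once the printed command looks right, run it directly in your terminal to perform the upload.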


Automating Scripts on an EC2 Instance

Most likely, the reason you want an EC2 instance capable of running RSelenium scripts in the first place is to automate scraping/testing on a remote server.


To make an Rscript executable on your EC2 instance, follow these steps:

  1. Locate the Rscript command on your EC2 instance:
which Rscript
/usr/bin/Rscript
  2. Prepend the result with “#!”, and add it as the first line of your R script:
#!/usr/bin/Rscript
  3. In Terminal, navigate to the location of your R script:
cd ~/my_R_scripts
  4. Make the R script executable:
chmod u+x my_script.R
  5. Test executing the R script:
./my_script.R
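
The shebang and chmod steps above can also be scripted, which helps when several scripts need the same treatment. The following is a minimal sketch that works on a throwaway copy. The example script body is invented, and the fallback path /usr/bin/Rscript is simply the location shown above.

```shell
#!/bin/sh
# Prepend an Rscript shebang to an R script and make it executable.
# The script created here is a throwaway example standing in for my_script.R.
workdir="$(mktemp -d)"
script="$workdir/my_script.R"
printf 'cat("hello from R\\n")\n' > "$script"

# Use the Rscript found on PATH; fall back to the path shown above if absent.
rscript_path="$(command -v Rscript || echo /usr/bin/Rscript)"

# Write the shebang line followed by the original contents.
{ printf '#!%s\n' "$rscript_path"; cat "$script"; } > "$script.tmp"
mv "$script.tmp" "$script"
chmod u+x "$script"

head -n 1 "$script"
```

Point the script variable at your own file to apply the same steps to a real script; check first that it does not already start with a shebang line.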

In order to automate the running of the script, it will need to be added to the user crontab.

  1. Open the crontab file in the terminal editor
crontab -e
  2. Add a cron job for your script to the cron file. An example to run a script every 10 minutes is below, but if you’re new to cron job syntax, read the following tutorial.
*/10 * * * * ~/my_R_scripts/my_script.R
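
One thing to watch for: cron typically runs jobs with a minimal PATH (often just /usr/bin:/bin), which would not include /usr/local/bin, where the PhantomJS symlink was created. The sketch below writes a crontab fragment to a scratch file for review. It sets PATH explicitly and appends the script's output to a log file so failures are visible. The log file name my_script.log is an assumption for illustration.

```shell
#!/bin/sh
# Write a crontab fragment to a scratch file for review.
# my_script.log is a hypothetical log file name.
cron_file="$(mktemp)"

cat > "$cron_file" <<'EOF'
# Make /usr/local/bin (where the phantomjs symlink lives) visible to cron jobs.
PATH=/usr/local/bin:/usr/bin:/bin
# Every 10 minutes: run the script and append stdout and stderr to a log.
*/10 * * * * $HOME/my_R_scripts/my_script.R >> $HOME/my_R_scripts/my_script.log 2>&1
EOF

cat "$cron_file"
```

The printed lines can be pasted into the editor opened by crontab -e in place of the single-line entry.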