Two powerful tools in the data scientist’s toolbox are RSelenium (for automated browsing) and Amazon Web Services EC2 instances (for hosting and running your R scripts remotely).
However, using these tools in combination presents a unique challenge not faced when using each in isolation. For beginners especially, the process can be confusing, and clear answers can be difficult to find online.
Here, I lay out a step-by-step guide on setting up an AWS EC2 instance for the specific purpose of using it to host scraping/testing scripts that depend on the RSelenium package.
The obvious first step in this process is to launch your own EC2 instance. Big thanks to Andrew Collier for originally helping me through this process; many of the following steps come from his tutorial on Remote Computing.
If you don’t already have an Amazon Web Services account, make one.
Once you have an AWS account, sign in to the console.
In your AWS console, click the EC2 link under “Compute”.
Click the “Launch Instance” button.
Select the Ubuntu Server option. If you plan on using the rest of this tutorial, it is critical that you select the Ubuntu image and not one of the plethora of other options.
Select the “General purpose” instance type that is eligible for the free tier.
Click “Review and Launch”
Press “Launch”
In the drop down, select “Create a new key pair”
Name the key pair — the name should be descriptive, but simple and easy to remember.
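Download the key pair and store it somewhere safe on your machine.
Open a terminal and navigate to the directory that contains your key file (the command below assumes it is stored in ~/.ssh). Restrict the your-key-name.pem file’s permissions so that only you can read it, since SSH refuses to use a private key that other users can read: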
chmod 400 ~/.ssh/your-key-name.pem
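As a quick sanity check, the permission change can be verified before you try to connect. The following is only a sketch: `key_is_private` is a hypothetical helper name, and the key path is assumed to match the one used in the chmod command above.

```shell
# Hypothetical helper: succeeds only if the file's mode is 400 or 600,
# i.e. group and others have no access at all.
key_is_private() {
  # GNU stat (Linux) takes -c; BSD/macOS stat takes -f
  mode=$(stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null)
  [ "$mode" = "400" ] || [ "$mode" = "600" ]
}

# Assumed location of the key, as in the chmod command above
KEY="$HOME/.ssh/your-key-name.pem"

if key_is_private "$KEY"; then
  echo "key permissions look fine"
else
  echo "permissions too open or key not found; run: chmod 400 $KEY"
fi
```

If the check fails, rerunning the chmod command is usually all that is needed.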
Back in the AWS console, press the “Launch Instances” button.
Press “View Instances”
We’ll now need to connect to the EC2 instance and install R so that we can run R scripts on the machine. This can be done easily on Unix-based machines from a terminal, but Windows users will have to connect to their EC2 instance via an SSH client like PuTTY.
Using Unix commands (substitute your own instance’s public DNS, shown in the EC2 console, for the example hostname below):
ec2-52-43-168-220.us-west-2.compute.amazonaws.com
ssh -i your-key-name.pem ubuntu@ec2-52-43-168-220.us-west-2.compute.amazonaws.com
sudo apt-get update
sudo apt-get install r-base-core
sudo apt-get install r-cran-xml
sudo apt-get install r-cran-rcurl
R
install.packages("RSelenium")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning in install.packages("RSelenium") :
'lib = "/usr/local/lib/R/site-library"' is not writable
Would you like to use a personal library instead? (y/n) y
Would you like to create a personal library
~/R/x86_64-pc-linux-gnu-library/3.0
to install packages into? (y/n) y
You’ll be prompted to select a CRAN mirror. Choose one arbitrarily, or pick the mirror closest to the location of your EC2 instance. The RSelenium package should now install properly.
Load the RSelenium package to ensure that it was correctly installed.
library("RSelenium")
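The same check can be made from the shell without opening an interactive R session, which is handy when provisioning is scripted. This is a rough sketch: `r_pkg_installed` is a hypothetical helper name, and it relies only on base R's `installed.packages()`.

```shell
# Hypothetical helper: exit status 0 if the named R package is installed,
# 1 if it is not, and 2 if Rscript itself cannot be found.
r_pkg_installed() {
  command -v Rscript >/dev/null 2>&1 || return 2
  Rscript -e "quit(status = if ('$1' %in% rownames(installed.packages())) 0 else 1)"
}

if r_pkg_installed RSelenium; then
  echo "RSelenium is installed"
else
  echo "RSelenium is not available yet"
fi
```

Note that this only confirms the package is present in one of R's libraries; loading it with library() remains the more thorough test.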
Unfortunately, our EC2 instance isn’t quite ready to use RSelenium. While the default browser of both the RSelenium package and Ubuntu instances is Firefox, using RSelenium on a remote EC2 instance requires a headless browser, which Firefox is not.
We’ll be using PhantomJS. Thanks to Julio Napurí for his helpful Gist file on this very topic.
sudo apt-get update
sudo apt-get install build-essential chrpath libssl-dev libxft-dev
sudo apt-get install libfreetype6 libfreetype6-dev
sudo apt-get install libfontconfig1 libfontconfig1-dev
cd ~
export PHANTOM_JS="phantomjs-1.9.8-linux-x86_64"
wget https://bitbucket.org/ariya/phantomjs/downloads/$PHANTOM_JS.tar.bz2
sudo tar xvjf $PHANTOM_JS.tar.bz2
sudo mv $PHANTOM_JS /usr/local/share
sudo ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin
phantomjs --version
In order for the startServer() function to be able to start the Selenium server, our EC2 instance will need Java installed.
sudo apt-get update
sudo apt-get install default-jre
sudo apt-get install default-jdk
java -version
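Before moving on, it can be useful to confirm in one pass that everything installed so far is actually on the PATH. A minimal sketch, in which `missing_commands` is a hypothetical helper name and the list of commands simply mirrors the tools installed in the steps above:

```shell
# Hypothetical helper: print each command name that cannot be found on the PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# The tools installed in the preceding steps
missing=$(missing_commands R Rscript phantomjs java)

if [ -z "$missing" ]; then
  echo "All prerequisites found"
else
  # Left unquoted on purpose so the names are printed on one line
  echo "Still missing:" $missing
fi
```

Anything reported as missing points back to the corresponding installation step.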
With everything above taken care of, your EC2 instance is now ready to use RSelenium.
First, load R on the command line:
R
The first time you use RSelenium, you’ll need to call checkForServer() to download the Standalone Selenium Server file:
library(RSelenium)
checkForServer()
[1] "DOWNLOADING STANDALONE SELENIUM SERVER. THIS MAY TAKE SEVERAL MINUTES"
trying URL 'http://selenium-release.storage.googleapis.com/2.53/selenium-server-standalone-2.53.1.jar'
Content type 'application/java-archive' length 21231092 bytes (20.2 Mb)
opened URL
==================================================
downloaded 20.2 Mb
Once the Standalone Selenium Server is downloaded, your EC2 instance is ready to use RSelenium!
It is crucial that when you call remoteDriver(), you set the browserName argument to "phantomjs". RSelenium cannot function on an EC2 instance without a headless browser, so leaving the default of Firefox will break your script.
Let’s run some simple tasks using Selenium to test its functionality:
startServer()
browser <- remoteDriver(browserName = "phantomjs")
browser$open()
browser$navigate("https://en.wikipedia.org/wiki/Special:Random")
header <- browser$findElement(using = 'css selector', "#firstHeading")
header$getElementText()
search <- browser$findElement(using = 'css selector', "#searchInput")
search$sendKeysToElement(list("Data Science", key = 'enter'))
header <- browser$findElement(using = 'css selector', "#firstHeading")
header$getElementText()
If you have an RSelenium script on your local machine that you want to move to your EC2 instance, there are a few ways to connect and transfer files to your newly created EC2 instance:
If you’re on a Unix-based OS, you can connect via the command line using SSH as before.
If you’re on a Windows OS, check out an SSH client like PuTTY.
If you’re on a Unix-based OS but prefer a GUI over the command line, check out this great tutorial.
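For the command-line route, files are typically copied with scp, which accepts the same -i key option as ssh. The sketch below is a dry run that only prints the command. The key path, hostname, and script name are placeholders reused from elsewhere in this guide, and the remote directory is an assumption that must already exist on the instance.

```shell
# Placeholders: substitute your own key, public DNS and file names
KEY="$HOME/.ssh/your-key-name.pem"
HOST="ec2-52-43-168-220.us-west-2.compute.amazonaws.com"
SCRIPT="my_script.R"
REMOTE_DIR="my_R_scripts"   # assumed to exist in the ubuntu user's home directory

# Build the command and print it instead of running it
UPLOAD_CMD="scp -i $KEY $SCRIPT ubuntu@$HOST:$REMOTE_DIR/"
echo "$UPLOAD_CMD"
```

Once the printed command looks right, paste it into the terminal to perform the actual transfer.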
Most likely, the reason you want an EC2 instance capable of running RSelenium scripts is to automate scraping or testing on a remote server.
To make an R script executable on your EC2 instance, follow these steps:
which Rscript
/usr/bin/Rscript
#!/usr/bin/Rscript
cd ~/my_R_scripts
chmod u+x my_script.R
./my_script.R
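Put together, the steps above might look like the following sketch, which writes a small script with the shebang line and marks it executable. The file name wiki_header.R is hypothetical (chosen so that an existing my_script.R is not overwritten), the R code reuses the calls from the test session earlier in this guide, and the shebang assumes `which Rscript` reported /usr/bin/Rscript.

```shell
# Directory used in the steps above; override SCRIPT_DIR to write elsewhere
SCRIPT_DIR="${SCRIPT_DIR:-$HOME/my_R_scripts}"
mkdir -p "$SCRIPT_DIR"

# The quoted 'EOF' stops the shell from expanding the $ in the R code
cat > "$SCRIPT_DIR/wiki_header.R" <<'EOF'
#!/usr/bin/Rscript
library(RSelenium)
startServer()
browser <- remoteDriver(browserName = "phantomjs")
browser$open()
browser$navigate("https://en.wikipedia.org/wiki/Special:Random")
header <- browser$findElement(using = "css selector", "#firstHeading")
print(header$getElementText())
browser$close()
EOF

# Allow the owner to execute the script directly
chmod u+x "$SCRIPT_DIR/wiki_header.R"
echo "wrote $SCRIPT_DIR/wiki_header.R"
```

The script can then be run as ./wiki_header.R from inside that directory, just like my_script.R above.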
In order to automate the running of the script, it will need to be added to the user crontab.
crontab -e
*/10 * * * * ~/my_R_scripts/my_script.R
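Cron jobs run without a terminal, so any error messages from the script are easy to miss. One common approach, sketched here with a hypothetical log file (cron.log) and a hypothetical helper (`add_cron_line`), is to redirect the script's output to a file and to install the entry non-interactively instead of through the editor.

```shell
# Same schedule as above, with stdout and stderr appended to an assumed log file
CRON_LINE='*/10 * * * * ~/my_R_scripts/my_script.R >> ~/my_R_scripts/cron.log 2>&1'

# Hypothetical helper: append a line to the user crontab unless it is already there.
add_cron_line() {
  crontab -l 2>/dev/null | grep -qxF -- "$1" && return 0
  { crontab -l 2>/dev/null; printf '%s\n' "$1"; } | crontab -
}

# Shown for review only; uncomment the last line to install the entry
echo "$CRON_LINE"
# add_cron_line "$CRON_LINE"
```

Running `crontab -l` afterwards shows whether the entry was added.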