Data, data, big not small, how can R increase efficiency for all?

Mark Cherrie
28/09/17

Background

  • I am interested in the relationship between population-level ultraviolet radiation (UVR) exposure and health. During my PhD I applied the following method:

    • Downloaded 457 daily METEOSAT satellite-derived .tiff images (June 2002 to 31st of July 2003, for Europe) held on the JRC File Transfer Protocol (FTP) server
    • Imported these files into ArcMap 10.0 and used the model builder to batch process them
    • Generated exposure estimates for postcode sectors in Geospatial Modelling Environment
    • Performed analysis with 1958 Birth cohort health outcomes in Stata 11

Ch-ch-ch-ch-changes

  • The JRC data product was not produced past 2004; a new Terra/Aqua satellite-derived product is held on a JAXA FTP server
    • The JAXA data is stored in binary format and the FTP is updated daily
    • Interested in a larger time period (6,481 days; 1st of Jan 2000 to today)
    • Interested in multiple outputs (i.e. UVA and UVB)
    • Interested in exposure at a smaller scale, i.e. postcode units

Is this familiar?

  • Has the data you use or want to use grown in:

    • Volume (e.g. global instead of regional)
    • Velocity (e.g. updated daily instead of bulk downloads)
    • Variety (e.g. photos, videos, audio recordings, email messages, documents, books, presentations, tweets instead of structured data)
  • Even if the data you are working with has not changed, there is a greater need to document the steps taken in the collection and subsequent analysis so that the experiment can be replicated

How can R help?

  • End-to-end batch processing

    • The act of programming a computer to execute a sequence of commands, scheduled to run at certain times, so that the desired output is generated with minimal human input.
  • Reproducible research

    • Data and code organised in a way that another individual/group could replicate the results entirely, including the figures and tables.

R popularity

Aim

  • This workshop will give you all the commands and the file structure to run a batch processing example, to help you develop your own
    • The aim of this example is to extract up-to-date mean UVB radiation for local authorities in Scotland
      • I assume that you are using a Windows PC
      • I recommend having a look at R basics and this introduction to data analysis before modifying this example for your own purpose

Not the Aim

Cartoon: xkcd.com

Setup

  • First things first, install R, RStudio and Git Bash.
  • My installation paths are:
    • R: “C:/PROGRA~1/R/R-33~1.2/bin/x64/R.exe”
    • GitBash:“C:/PROGRA~1/Git/bin/sh.exe”
    • Check that yours match!
  • Download the batch processing folder from my GitHub: click 'Clone or download' > 'Download ZIP', unzip it, and move it somewhere safe (and with enough space)

Setup

IMPORTANT

  • Double click the 'batchprocessing.Rproj' file; this will start up the project in RStudio
  • When it has loaded you will see all the files and folders in the Files pane (bottom right)
  • Now click on each file (.r, .sh and .bat files) and change “C:/Users/mcherrie” to the location of the unzipped folder

Accessing data in R

Large volumes of data often reside in a database or web server.

  • Databases: RPostgreSQL and mongolite (see the sketch after this list).

  • Web server:

    • The command download.file will do the job in simple cases.
    • The packages RCurl or httr are good for when you have to deal with cookies, redirects, authentication, etc.
    • An alternative is the downloader package, which is easier to set up because it has no external dependencies, but it has less functionality.
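  • For the database route, here is a minimal sketch with RPostgreSQL; the connection details and table name are placeholders, not part of this workshop:
# Minimal sketch of pulling data from PostgreSQL; all connection details and
# the table name are placeholders
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), host = "localhost", dbname = "uvr",
                 user = "me", password = "secret")
uv <- dbGetQuery(con, "SELECT * FROM daily_uvb LIMIT 10")
dbDisconnect(con)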

Accessing data in R

  • It is also possible to use the above to access any website's API, with help from this guide.
  • R packages have already been written to retrieve data from the most common data providers, for example: Facebook, Twitter, Google
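  • As a small illustration of calling a web API with httr (the GitHub endpoint here is just an example, not part of this workshop's pipeline):
# Minimal sketch of a web API call with httr; the endpoint is illustrative only
library(httr)
resp <- GET("https://api.github.com/repos/tidyverse/ggplot2")
stop_for_status(resp)                 # fail loudly if the request did not succeed
info <- content(resp, as = "parsed")  # parse the JSON body into an R list
info$stargazers_count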

Step 1: Download file from the internet

  • The JAXA data is held on an FTP server with no restrictions on access, so download.file is the best method.
  • We will use download.file to save each file to our /rawdata folder.
  • It's good practice to check the performance of functions (benchmarking); this also comes in handy when we tune the scheduling later.

Step 1: Download file from the internet

  • To time how long code runs in R we can wrap the main code in the following commands:
# Start the clock!
ptm <- proc.time()
# Insert command to be evaluated
# output to a text file
sink("timeout.txt")
# Stop the clock
proc.time() - ptm
# close connection to the text file
sink()
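  • For a quick interactive check, base R's system.time() reports the same user/system/elapsed figures without the sink() bookkeeping:
# One-off timing of a single expression; Sys.sleep(2) is just a stand-in for the real command
system.time(Sys.sleep(2))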

Step 1: Download file from the internet

# Build the string where the data resides: the URL stem, then function variables to pinpoint a specific folder

FTP <- paste0("ftp://apollo.eorc.jaxa.jp/pub/JASMES/Global_05km/",
              UVRtype, "/", temporal, "/",
              substring(date, 1, 6), "/")

# Now we need to build the string that pinpoints a specific file and downloads it to /rawdata

searchFTP <- paste0(FTP, sat, "02SSH_A", date, "Av1_v811_7200_3601_", UVRtype, "__8b.gz")
download.file(searchFTP, destfile = paste0(getwd(), "/rawdata/", sub(FTP, "", searchFTP, fixed = TRUE)))
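  • The values of UVRtype, temporal, date and sat are set earlier in download.R; the assignments below are illustrative placeholders only (check the JAXA FTP listing for the real folder and file-name components):
# Illustrative placeholders only -- confirm against the JAXA FTP listing
UVRtype  <- "uvb"                              # product folder / file-name tag
temporal <- "daily"                            # temporal resolution folder
date     <- format(Sys.Date() - 2, "%Y%m%d")   # e.g. "20170926"
sat      <- "MOD"                              # satellite/sensor prefix in the file name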

Step 2: Perform analysis on the file

  • A bash script ('convert.sh') that first unzips the file in /rawdata, then runs the conversion from binary to text format (via the Perl script 'go.pl') and outputs the text file to /convert
  • An R script ('subset.r') that takes the text file in /convert, subsets the data to the UK, creates a csv file and outputs it to /subset
  • An R script ('raster.r') that takes the csv file from /subset, creates an inverse distance weighted raster image and outputs it to /raster
  • An R script ('extract.r') that takes the raster file from /raster (keeping a copy in /rasterrepository) and takes the mean for each polygon in /auxdata (i.e. local authorities here); a rough sketch of this step follows below
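  • A rough sketch of the extract step; the file and field names here are hypothetical, so see extract.r in the repository for the real version:
# Hypothetical file and field names -- extract.r in the repository may differ
library(raster)
library(rgdal)
uv <- raster("raster/uvb_latest.tif")                    # the interpolated UVB surface
la <- readOGR("auxdata", "scotland_local_authorities")   # local authority polygons
mean_uvb <- extract(uv, la, fun = mean, na.rm = TRUE)    # mean UVB per polygon
write.csv(data.frame(la = la$NAME, mean_uvb = mean_uvb),
          "output/mean_uvb_by_la.csv", row.names = FALSE)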

Step 3: Create the batch files

  • For each processing script file we need to create a batch file (.bat)
  • The format is as follows:
    • '@echo off'
    • The installation path of the executable that will run the code
    • 'CMD BATCH'
    • The path of the processing script file
  • So the first batch file we create looks like this:
@echo off
"C:/PROGRA~1/R/R-33~1.2/bin/x64/R.exe" CMD BATCH C:/Users/mcherrie/batchprocessing/download.R

Step 4: Create the batch arguments

  • Here we want to specify the timing details for when the batch files will run; the full rundown of all the available arguments is available here
    • We are interested in running the batch files daily, as the FTP is only updated daily
    • However, if you need a finer timescale you can use the recurrence argument 'minute' with /mo (modifier) set to the number of minutes you require (see the sketch below)
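  • As a rough illustration (not part of the repository's runbats.R), a task scheduled every 30 minutes could look like:
# Sketch: schedule the download batch file every 30 minutes instead of daily
task_name <- "download"
bat_loc <- "C:\\Users\\mcherrie\\batchprocessing\\download.bat"
system(sprintf("schtasks /create /sc minute /mo 30 /tn %s /tr \"%s\"",
               task_name, bat_loc))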

Step 4: Create the batch arguments

  • We can create a file called runbats.R; here are the arguments for the download stage:
# download
recurrence <- "daily"
task_name <- "download"
bat_loc <- "C:\\Users\\mcherrie\\batchprocessing\\download.bat"
time <- "23:59"
date <- "12/09/2017"
system(sprintf("schtasks /create /sd %s /sc %s /tn %s /tr \"%s\" /st %s", date,
               recurrence, task_name, bat_loc, time))
  • IMPORTANT The recurrence argument depends on how often the raw data changes and how long the function takes (see the timeout.txt)

Step 5: Debug and re-tune

  • Different errors to check for:
    • To check for R code errors, run the script interactively in R until it works as desired
    • To check for batch file errors, run the batch file command in a Command Prompt window (type cmd in the Start menu); see the sketch below
    • To check for bottleneck errors once the batch process has started, look at the output in the .Rout files
    • Working directory errors might arise from using a mapped network drive
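  • As a rough illustration (paths as in the Setup slide), you can run a batch file by hand from the R console and then inspect the log that R CMD BATCH writes:
# Run the download batch file by hand (Windows only; shell() goes via cmd.exe)
shell('"C:/Users/mcherrie/batchprocessing/download.bat"')
# R CMD BATCH writes its console output to a .Rout file, named after the script,
# in the directory the command was run from; error messages appear there
file.show("download.Rout")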

Step 5: Debug and re-tune

  • To re-tune type this into the R Console:
## open tasks 
system("control schedtasks")
  • Go to Task Scheduler Library, click on the batch file name and click on the History tab
    • Compare the 'Created Task Process' with 'Task Completed'
  • Do this for each of the files and re-tune by:
    • Deleting the task in the Task Scheduler (or from R, as sketched below)
    • Then changing the recurrence argument in the runbats.R file
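  • The deletion can also be done from R with schtasks (a sketch, assuming the task was named 'download' as in runbats.R):
# Remove the scheduled task, then re-run runbats.R with the new recurrence argument
system('schtasks /delete /tn "download" /f')   # /f suppresses the confirmation prompt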

Summary

  • R can batch process data using a split-apply-combine strategy
  • This is especially important if you need to manage memory constraints
    • For this, you'll need to change the task settings (Settings > 'If the running task does not end when requested, force it to stop'); you also need to make use of sink so that you can pick up where you left off

Other potential applications?

  • Social Media Analysis
    • It's estimated that only 1% of tweets are geocoded, so it may take a while to collect a large sample for a given subject; run a script to collect tweets daily and analyse them every 3 months
  • Literature Review
    • To stay up-to-date on health geography trends, you could use RISmed to produce monthly tables of the most frequent words in the abstracts of articles that match a certain word/phrase/acronym/author, and email the tables to yourself and colleagues (see the sketch below)
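  • A rough sketch of the literature-review idea with RISmed; the query, date range and word counting are illustrative only:
# Sketch: PubMed digest of frequent abstract words (query and dates are illustrative)
library(RISmed)
res <- EUtilsSummary("health geography", type = "esearch", db = "pubmed",
                     mindate = 2017, maxdate = 2017, retmax = 200)
recs <- EUtilsGet(res)
words <- unlist(strsplit(tolower(paste(AbstractText(recs), collapse = " ")), "[^a-z]+"))
head(sort(table(words), decreasing = TRUE), 20)   # the 20 most frequent words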

How to make it reproducible?

  • Host your code and data on GitHub, following these instructions
    • Those with a university email address get private repositories, so you can keep code secure until it's ready to be published
  • Use a structured template for the analysis

Next Steps

  • Short Term
    • Stop the command window popping up on the screen
    • How to do this on a Mac (cronR)
  • Medium Term
    • Link to Amazon Web Services (Bigger data)
    • Link output to database-as-a-service (Mlab)
      • Query and share results
      • Administrative and maintenance tasks are taken care of
  • Ongoing
    • Write R code more efficiently (Gillespie and Lovelace, 2017)

Links and References

Thanks