Data, data, big not small, how can R increase efficiency for all?

Mark Cherrie
28/09/17

Background

  • I am interested in the relationship between population-level ultraviolet radiation (UVR) exposure and health. During my PhD I applied the following method:

    • Downloaded 457 daily METEOSAT satellite-derived .tiff images (June 2002 to 31st of July 2003, for Europe) held on the JRC File Transfer Protocol (FTP) server
    • Imported these files into ArcMap 10.0 and used the model builder to batch process them
    • Generated exposure estimates for postcode sectors in Geospatial Modelling Environment
    • Performed analysis with 1958 Birth cohort health outcomes in Stata 11

Ch-ch-ch-ch-changes

  • The JRC data product was not produced past 2004; a new Terra/Aqua satellite-derived product is held on a JAXA FTP server
    • The JAXA data is stored in binary format and the FTP is updated daily
    • Interested in a larger time period (6,481 days; 1st of Jan 2000 to today)
    • Interested in multiple outputs (i.e. UVA and UVB)
    • Interested in exposure at a smaller scale, i.e. postcode units

Is this familiar?

  • Has the data you use or want to use grown in:

    • Volume (e.g. global instead of regional)
    • Velocity (e.g. updated daily instead of bulk downloads)
    • Variety (e.g. photos, videos, audio recordings, email messages, documents, books, presentations, tweets instead of structured data)
  • Even if the data you are working with has not changed, there is a greater need to document the steps taken in the collection and subsequent analysis so that the experiment can be replicated

How can R help?

  • End-to-end batch processing

    • The act of programming a computer to execute a sequence of commands, scheduled to run at certain times, so that the desired output is generated with minimal human input.
  • Reproducible research

    • Data and code organised in a way that another individual/group could replicate the results entirely, including the figures and tables.

R popularity

Aim

  • This workshop will give you all the commands and the file structure to run a batch processing example, to help you develop your own
    • The aim of this example is to extract up-to-date mean UVB radiation for local authorities in Scotland
      • I assume that you are using a Windows PC
      • I recommend having a look at R basics and this introduction to data analysis before modifying this example for your own purpose

Not the Aim

Cartoon: xkcd.com

Setup

  • First things first, install R, RStudio and Git Bash.
  • My installation paths are:
    • R: “C:/PROGRA~1/R/R-33~1.2/bin/x64/R.exe”
    • GitBash:“C:/PROGRA~1/Git/bin/sh.exe”
    • Check that yours match!
  • Download the batch processing folder from my GitHub: click 'Clone or download' > 'Download ZIP', unzip it, and move it somewhere safe (and with enough space)

Setup

IMPORTANT

  • Double click the 'batchprocessing.Rproj' file; this will start up the project in RStudio
  • When it has loaded you will see all the files and folders in the Files pane (bottom right)
  • Now click on each file (.r, .sh and .bat files) and change “C:/Users/mcherrie” to the location of the unzipped folder

Accessing data in R

Large volumes of data often reside in a database or web server.

  • Databases: RPostgreSQL and mongolite (see the sketch after this list).

  • Web server:

    • The command download.file will do the job in simple cases.
    • The packages RCurl or httr are good for when you have to deal with cookies, redirects, authentication, etc.
    • An alternative is the downloader package, which is easier to set up because it has no external dependencies, but it has less functionality.
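  • For the database route, here is a minimal sketch with RPostgreSQL; the connection details and table name are placeholders, not part of this workshop:
# Minimal sketch of pulling data from PostgreSQL; all connection details and
# the table name are placeholders
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), host = "localhost", dbname = "uvr",
                 user = "me", password = "secret")
uv <- dbGetQuery(con, "SELECT * FROM daily_uvb LIMIT 10")
dbDisconnect(con)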

Accessing data in R

  • It is also possible to use the above to access any website's API, with help from this guide.
  • R packages have already been written to retrieve data from the most common data providers, for example: Facebook, Twitter, Google
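  • As a small illustration of calling a web API with httr (the GitHub endpoint here is just an example, not part of this workshop's pipeline):
# Minimal sketch of a web API call with httr; the endpoint is illustrative only
library(httr)
resp <- GET("https://api.github.com/repos/tidyverse/ggplot2")
stop_for_status(resp)                 # fail loudly if the request did not succeed
info <- content(resp, as = "parsed")  # parse the JSON body into an R list
info$stargazers_count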

Step 1: Download file from the internet

  • The JAXA data is held on an FTP server with no restrictions on access, so download.file is the best method.
  • We will use download.file to save each file to our /rawdata folder.
  • It's good practice to check the performance of functions (benchmarking); this also comes in handy when we tune the scheduling later.

Step 1: Download file from the internet

  • To time how long code runs in R we can wrap the main code in the following commands:
# Start the clock!
ptm <- proc.time()
# Insert command to be evaluated
# output to a text file
sink("timeout.txt")
# Stop the clock
proc.time() - ptm
# close connection to the text file
sink()
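  • For a quick interactive check, base R's system.time() reports the same user/system/elapsed figures without the sink() bookkeeping:
# One-off timing of a single expression; Sys.sleep(2) is just a stand-in for the real command
system.time(Sys.sleep(2))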

Step 1: Download file from the internet

# Build the string where the data resides: the URL stem, then function variables to pinpoint a specific folder

FTP <- paste0("ftp://apollo.eorc.jaxa.jp/pub/JASMES/Global_05km/",
              UVRtype, "/", temporal, "/",
              substring(date, 1, 6), "/")

# Now we need to build the string that pinpoints a specific file and downloads it to /rawdata

searchFTP <- paste0(FTP, sat, "02SSH_A", date, "Av1_v811_7200_3601_", UVRtype, "__8b.gz")
download.file(searchFTP, destfile = paste0(getwd(), "/rawdata/", sub(FTP, "", searchFTP, fixed = TRUE)))
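  • The values of UVRtype, temporal, date and sat are set earlier in download.R; the assignments below are illustrative placeholders only (check the JAXA FTP listing for the real folder and file-name components):
# Illustrative placeholders only -- confirm against the JAXA FTP listing
UVRtype  <- "uvb"                              # product folder / file-name tag
temporal <- "daily"                            # temporal resolution folder
date     <- format(Sys.Date() - 2, "%Y%m%d")   # e.g. "20170926"
sat      <- "MOD"                              # satellite/sensor prefix in the file name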

Step 2: Perform analysis on the file

  • A bash script ('convert.sh') that first unzips the file in /rawdata, then runs the conversion from binary to text format (via the Perl script 'go.pl') and outputs the text file to /convert
  • An R script ('subset.r') that takes the text file in /convert, subsets the data to the UK, creates a csv file and outputs it to /subset
  • An R script ('raster.r') that takes the csv file from /subset, creates an inverse distance weighted raster image and outputs it to /raster
  • An R script ('extract.r') that takes the raster file from /raster (keeping a copy in /rasterrepository) and takes the mean for each polygon in /auxdata (i.e. local authorities here); a rough sketch of this step follows below
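  • A rough sketch of the extract step; the file and field names here are hypothetical, so see extract.r in the repository for the real version:
# Hypothetical file and field names -- extract.r in the repository may differ
library(raster)
library(rgdal)
uv <- raster("raster/uvb_latest.tif")                    # the interpolated UVB surface
la <- readOGR("auxdata", "scotland_local_authorities")   # local authority polygons
mean_uvb <- extract(uv, la, fun = mean, na.rm = TRUE)    # mean UVB per polygon
write.csv(data.frame(la = la$NAME, mean_uvb = mean_uvb),
          "output/mean_uvb_by_la.csv", row.names = FALSE)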

Step 3: Create the batch files

  • For each processing script file we need to create a batch file (.bat)
  • The format is as follows:
    • '@echo off'
    • The installation path of the executable that will run the code
    • 'CMD BATCH'
    • The path of the processing script file
  • So the first batch file we create looks like this:
@echo off
"C:/PROGRA~1/R/R-33~1.2/bin/x64/R.exe" CMD BATCH C:/Users/mcherrie/batchprocessing/download.R

Step 4: Create the batch arguments

  • Here we want to specify the timing details for when the batch files will run; the full rundown of all the available arguments is available here
    • We are interested in running the batch files daily, as the FTP is only updated daily
    • However, if you need a finer timescale you can use the recurrence argument 'minute' with /mo (modifier) set to the number of minutes you require (see the sketch below)
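  • As a rough illustration (not part of the repository's runbats.R), a task scheduled every 30 minutes could look like:
# Sketch: schedule the download batch file every 30 minutes instead of daily
task_name <- "download"
bat_loc <- "C:\\Users\\mcherrie\\batchprocessing\\download.bat"
system(sprintf("schtasks /create /sc minute /mo 30 /tn %s /tr \"%s\"",
               task_name, bat_loc))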

Step 4: Create the batch arguments

  • We can create a file called runbats.R; here are the arguments for the download stage:
# download
recurrence <- "daily"
task_name <- "download"
bat_loc <- "C:\\Users\\mcherrie\\batchprocessing\\download.bat"
time <- "23:59"
date <- "12/09/2017"
system(sprintf("schtasks /create /sd %s /sc %s /tn %s /tr \"%s\" /st %s", date,
               recurrence, task_name, bat_loc, time))
  • IMPORTANT The recurrence argument depends on how often the raw data changes and how long the function takes (see the timeout.txt)

Step 5: Debug and re-tune

  • Different errors to check for:
    • To check for R code errors, run the script interactively in R until it works as desired
    • To check for batch file errors, run the batch file command in a Command Prompt window (type cmd in the Start menu); see the sketch below
    • To check for bottleneck errors once the batch process has started, look at the output in the .Rout files
    • Working directory errors might arise from using a mapped network drive
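  • As a rough illustration (paths as in the Setup slide), you can run a batch file by hand from the R console and then inspect the log that R CMD BATCH writes:
# Run the download batch file by hand (Windows only; shell() goes via cmd.exe)
shell('"C:/Users/mcherrie/batchprocessing/download.bat"')
# R CMD BATCH writes its console output to a .Rout file, named after the script,
# in the directory the command was run from; error messages appear there
file.show("download.Rout")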

Step 5: Debug and re-tune

  • To re-tune type this into the R Console:
## open tasks 
system("control schedtasks")
  • Go to Task Scheduler Library, click on the batch file name and click on the History tab
    • Compare the 'Created Task Process' with 'Task Completed'
  • Do this for each of the files and re-tune by:
    • Deleting the task in the Task Scheduler (or from R, as sketched below)
    • Then changing the recurrence argument in the runbats.R file
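  • The deletion can also be done from R with schtasks (a sketch, assuming the task was named 'download' as in runbats.R):
# Remove the scheduled task, then re-run runbats.R with the new recurrence argument
system('schtasks /delete /tn "download" /f')   # /f suppresses the confirmation prompt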

Summary

  • R can batch process data using a split-apply-combine strategy
  • This is especially important if you need to manage memory constraints
    • For this, you'll need to change the task settings (Settings > 'If the running task does not end when requested, force it to stop'); you also need to make use of sink so that you can pick up where you left off

Other potential applications?

  • Social Media Analysis
    • It's estimated that only 1% of tweets are geocoded, so it may take a while to collect a large sample for a given subject; run a script to collect tweets daily and analyse them every 3 months
  • Literature Review
    • To stay up-to-date on health geography trends, you could use RISmed to produce monthly tables of the most frequent words in the abstracts of articles that match a certain word/phrase/acronym/author, and email the tables to yourself and colleagues (see the sketch below)
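  • A rough sketch of the literature-review idea with RISmed; the query, date range and word counting are illustrative only:
# Sketch: PubMed digest of frequent abstract words (query and dates are illustrative)
library(RISmed)
res <- EUtilsSummary("health geography", type = "esearch", db = "pubmed",
                     mindate = 2017, maxdate = 2017, retmax = 200)
recs <- EUtilsGet(res)
words <- unlist(strsplit(tolower(paste(AbstractText(recs), collapse = " ")), "[^a-z]+"))
head(sort(table(words), decreasing = TRUE), 20)   # the 20 most frequent words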

How to make it reproducible?

  • Host your code and data on GitHub, following these instructions
    • Those with a university email address get private repositories, so you can keep code secure until it's ready to be published
  • Use a structured template for the analysis

Next Steps

  • Short Term
    • Stop the command window popping up on the screen
    • How to do this on a Mac (cronR)
  • Medium Term
    • Link to Amazon Web Services (Bigger data)
    • Link output to database-as-a-service (Mlab)
      • Query and share results
      • Administrative and maintenance tasks are taken care of
  • Ongoing
    • Write R code more efficiently (Gillespie and Lovelace, 2017)

Links and References

Thanks