23 September 2016

Introduction: going far with minimum effort

Source: openclipart.org

Context

Motivations

  • Colin and I taught 'R for Big Data'
  • Common thread: working inefficiently with small data makes big data impossible
  • With over a decade of teaching experience between us, we have both picked up many tricks of the trade
  • There seemed to be demand, so we landed a book contract with O'Reilly Media

The Efficient R Programming book

Structure

  • Efficient set-up
  • Efficient programming
  • Efficient workflow
  • Exercises

Efficient set-up

The basics

  • Run a decent laptop (recommendation: 2nd hand Lenovo/Dell)
  • Keep your operating system lean (e.g. by running Linux)
  • And up-to-date (e.g. Ubuntu 16.04)
  • Keep your desktop tidy

Top 5 tips for an efficient set-up

  1. Use system monitoring to identify bottlenecks in your hardware/code
  2. Keep your R installation and packages up-to-date
  3. Make use of RStudio's powerful autocompletion capabilities and shortcuts
  4. Store API keys in the .Renviron file
  5. Use BLAS if your R number crunching is too slow
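
Before switching BLAS library (tip 5), it can help to check whether linear algebra is the bottleneck at all. The sketch below times a dense matrix cross-product, an operation that R typically hands to BLAS. The matrix size is an arbitrary illustrative choice, and the timing depends entirely on your hardware and the BLAS that R is linked against.

```r
# Time a dense cross-product, t(m) %*% m, which R typically delegates to BLAS
set.seed(1)
n = 500                                 # illustrative size; increase for a heavier test
m = matrix(rnorm(n * n), nrow = n)

# '<-' is needed here: inside a function call, '=' would be read as an argument name
timing = system.time(res <- crossprod(m))

timing["elapsed"]                       # wall-clock seconds on this machine
dim(res)                                # the result is an n x n matrix
```

If this step is fast compared with the rest of your script, a faster BLAS is unlikely to help much.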

R and package versions

  • Keep them up-to-date

Updating packages

  • The RStudio way
  • update.packages()
update.packages(oldPkgs = "sp")
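
A minimal sketch of checking what is installed before updating. The first two calls run offline; the calls that contact a package repository are left commented out because they need a network connection and change your library. 'stats' is used only because it ships with every R installation.

```r
# Which version of R is running?
r_version = R.version.string
r_version

# Which version of a given package is installed?
stats_version = packageVersion("stats")
stats_version

# List outdated packages without installing anything (contacts the repository):
# old.packages()

# Update all outdated packages without a prompt for each one:
# update.packages(ask = FALSE)
```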

The .Renviron and .Rprofile files
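
A minimal sketch of how the two files are typically used, assuming a hypothetical variable name MY_API_KEY and a made-up value. .Renviron holds plain KEY=value lines that R reads at start-up, while .Rprofile holds ordinary R code that is run at start-up. Here the environment variable is set by hand so that the example runs without editing any files.

```r
# A line in ~/.Renviron would look like this (plain text, not R code):
#   MY_API_KEY=abc123
# Simulate that entry for the current session only (hypothetical name and value):
Sys.setenv(MY_API_KEY = "abc123")

# Scripts then read the key instead of hard-coding it
key = Sys.getenv("MY_API_KEY")
nzchar(key)                               # TRUE when the variable is set

# Unset variables come back as an empty string, which is easy to test for
missing_key = Sys.getenv("SOME_UNSET_VARIABLE_FOR_DEMO")
nzchar(missing_key)

# A line in ~/.Rprofile is plain R code, for example:
options(digits = 4)
```

Keeping keys in .Renviron means they stay out of scripts that are shared or put under version control.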

Efficient programming

Tips

  1. Be careful never to grow vectors.
  2. Vectorise code whenever possible.
  3. Use factors when appropriate.
  4. Clever use of S3 objects can make code easier to understand.
  5. Byte compile packages for an easy performance boost.
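
Tips 1 and 2 can be illustrated with three ways of building the same vector of squares. This is a sketch with hypothetical function names (grow, prealloc, vectorised); the timings are not shown because they vary between machines.

```r
n = 1e4

# Method 1: grow the vector one element at a time (a new copy on each iteration)
grow = function(n) {
  x = NULL
  for (i in seq_len(n)) x = c(x, i^2)
  x
}

# Method 2: pre-allocate the full vector, then fill it in
prealloc = function(n) {
  x = numeric(n)
  for (i in seq_len(n)) x[i] = i^2
  x
}

# Method 3: vectorised, a single call to ^
vectorised = function(n) seq_len(n)^2

# All three give the same answer ...
same_result = isTRUE(all.equal(grow(n), prealloc(n))) &&
  isTRUE(all.equal(prealloc(n), vectorised(n)))
same_result

# ... but take different amounts of time
system.time(grow(n))
system.time(prealloc(n))
system.time(vectorised(n))
```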

Vectorised code

x = x + 1

involves a single function call to the + function, whereas the for loop

for(i in seq_len(n))
  x[i] = x[i] + 1

involves n calls to each of the +, [ and [<- functions, plus a call to for and to seq_len().
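
Operators such as + are ordinary functions in R, which is why the number of calls matters. The sketch below, using a small illustrative vector, calls + by name and confirms that the vectorised and looped versions give the same result.

```r
x = c(1, 2, 3)

# Vectorised: one call to + for the whole vector
y_vec = x + 1

# The same call written in function form
y_call = `+`(x, 1)

# Loop: one call to +, [ and [<- for every element
y_loop = x
n = length(y_loop)
for (i in seq_len(n))
  y_loop[i] = y_loop[i] + 1

identical(y_vec, y_call)
identical(y_vec, y_loop)
```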

Do not optimise early!

Efficient workflow + collaboration

Version control + communication with GitHub

Code style

  • Not rocket science: pick a style and stick to it
  • For example, Hadley Wickham's style guide

Indentation

# Poorly indented/formatted code
if(!exists("x")){
x=c(3,5)
y=x[2]}

# Automatically indented code (Ctrl+I in RStudio)
if(!exists("x")){
  x=c(3,5)
  y=x[2]}

Spacing

# Automatically reformat the code (Ctrl+Shift+A in RStudio)
if(!exists("x")) {
  x = c(3, 5)
  y = x[2]
}

Manage your projects with RStudio projects

  • Always launches R in the right working directory
  • Provides everything you need in a single place
  • Allows easy version control and sharing of code

RStudio tips

  • Tab autocompletes file paths
  • Ctrl+Up shows previous commands that begin with the text typed so far

Worked example

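x = 1:100 # vector to cumulatively sum

# Deliberately inefficient: grows xc with c() and re-sums x[1:i] on every iteration.
# Using the values of x as indices only works because x is 1:n.
cs_for = function(x){
  for(i in x){
    if(i == 1){
      xc = x[i]
    } else {
      xc = c(xc, sum(x[1:i]))
    }
  }
  xc
}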

Worked example II

cs_apply = function(x){
  sapply(x, function(x) sum(1:x))
}
library(microbenchmark)
microbenchmark(cs_for(x), cs_apply(x), cumsum(x), times = 2)
## Unit: nanoseconds
##         expr    min     lq   mean median     uq    max neval
##    cs_for(x) 186050 186050 228431 228431 270812 270812     2
##  cs_apply(x) 124669 124669 138334 138334 151999 151999     2
##    cumsum(x)    482    482   1281   1281   2080   2080     2
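
Before comparing timings it is worth confirming that the approaches agree. The sketch below repeats the two functions so that it runs on its own, and adds a pre-allocated loop, cs_prealloc, which is a hypothetical variant not in the original slides, as a middle ground between cs_for() and cumsum().

```r
x = 1:100

# Loop that grows its result (as in the worked example)
cs_for = function(x){
  for(i in x){
    if(i == 1){
      xc = x[i]
    } else {
      xc = c(xc, sum(x[1:i]))
    }
  }
  xc
}

# sapply version (as in the worked example; relies on x being 1:n)
cs_apply = function(x){
  sapply(x, function(x) sum(1:x))
}

# Hypothetical variant: pre-allocate and reuse the running total
cs_prealloc = function(x){
  xc = x                              # same length and type as the input
  for(i in seq_along(x)[-1]){
    xc[i] = xc[i - 1] + x[i]
  }
  xc
}

all.equal(cs_for(x), cumsum(x))
all.equal(cs_apply(x), cumsum(x))
all.equal(cs_prealloc(x), cumsum(x))
```

The built-in cumsum() remains the sensible choice in practice; the other versions are only there to show where the time goes.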

Exercises

Exercises to work on

  • If you haven't already, create a GitHub account and create/fork a project (5 minutes).
  • Work through the 5 exercises in Chapter 2 (15 minutes).
  • Work through the example in Section 3.2.1: how long does each method take on your computer? (5 minutes)
  • Complete the exercise in Chapter 4.
  • Think about file organisation and storage: talk with someone next to you about how you organise your files and how you could improve the directory structure and format of storage (in relation to Chapter 5) (5 minutes).
  • Bonus 1: Which aspect of efficiency is most relevant to you? Read the relevant chapter.
  • Bonus 2: What efficiency tips have we missed? Convert this into a Pull Request (or just tell us).