R and RStudio

Josh Day

January 26, 2015

Sources

Outline

  • R Packages
  • Using C++ with R via Rcpp
  • Profiling and Debugging
  • rmarkdown and knitr

R Packages

Package Development

  • Why create a package?
    • You won't remember code you wrote last week, much less 5 years ago.
    • Adding documentation is easy (with roxygen2)
    • Share documented functions/data with collaborators

Package Development

Things to make your life easier:

  • devtools
    • Functions for development
  • roxygen2
    • maintains NAMESPACE file, generates documentation from source code
  • testthat
    • Framework for testing
  • knitr
    • For making vignettes
  • RStudio has great integration with GitHub

Make your package

# Get packages
install.packages(c("devtools", "roxygen2", "testthat", "knitr"))

# Check that everything is installed and functioning
devtools::has_devel() 

# Create your package
devtools::create('path/to/myPackage')

That was easy.

What Does create() Create?

  • DESCRIPTION
    • metadata about the package (version, date, author, license, imports)
  • myPackage.Rproj
    • Projects have a unique working directory, workspace, history, and source documents
  • NAMESPACE
    • Can be confusing - ensures that your package won't interfere with other packages.
    • Using roxygen2 ensures you don't have to edit this by hand
  • R/
    • Stores your R code

Other Directories You Might Want

  • src/
    • Compiled code (to be discussed later)
  • inst/
    • Installed files: Anything extra that should be installed with the package
      • inst/CITATION: how the package should be cited
    • Convention is for code from other languages to be put here
      • inst/python, inst/ruby, etc.
  • data/
    • devtools::use_data()
    • Example data
    • Document with roxygen2 (see Hadley's R Packages )
  • tests/
    • devtools::use_testthat()

How To Use roxygen2

  • First sentence is title of documentation
  • Then add tags
#' Update Bins for ASH estimator
#'
#' @param bin Object returned from \code{ash::bin1}
#' @param newdata Vector of new data
#'
#' @example inst/examples/ash_example.R
#'
#' @export
updateBin1 <- function(bin, newdata) { ...
  • Use roxygen2::roxygenize() or devtools::document() to build documentation
  • Build and Reload in RStudio

What Does the roxygen2 Block Produce?

% Generated by roxygen2 (4.1.0): do not edit by hand
% Please edit documentation in R/ash.R
\name{updateBin1}
\alias{updateBin1}
\title{Update Bins for ASH estimator}
\usage{
updateBin1(bin, newdata)
}
\arguments{
\item{bin}{Object returned from \code{ash::bin1}}

\item{newdata}{Vector of new data}
}
\description{
Update Bins for ASH estimator
}
\examples{
# Create two batches of data
y1 <- rnorm(100)
y2 <- rnorm(100)

# Create bins from the batch 1
mybin <- ash::bin1(y1, ab=c(-5, 5), nbin=50)

# Get ASH estimate and plot it
myash <- ash::ash1(mybin, m=5)
plot(myash, type="l")

# Update the bins with batch 2
mybin <- online::updateBin1(mybin, y2)

# update ASH estimate and plot it
myash <- ash::ash1(mybin, m=5)
plot(myash, type="l")
}

What Does the roxygen2 Block Produce?

A Useful Function from devtools

devtools::install_github()
- Main argument is repo: accepts form username/repo[/subdir][@ref|#pull]

  • Install package straight from github (can be based on specific reference)
    • Get updates before they are on CRAN
    • Get packages that aren't available on CRAN (runr)
# Get old release of devtools (Mac and Linux)
# For windows, see ?build_github_devtools
devtools::install_github("hadley/devtools@v1.6.1")

# Get latest release
devtools::install_github("hadley/devtools")
  • See also: install_bitbucket, install_svn, install_git, install_url, install_version, install_local

Rcpp

Why use C++?

  • Loops are slow in R
    • May be hard to vectorize
  • Recursive functions
  • Problems that require advanced data structures and algorithms that R doesn’t provide
  • Avoid this:

source: xkcd

Use R's High-Level Syntax in C++

  • Rcpp provides a lot of syntactic “sugar” to ensure that C++ functions work very similarly to their R equivalents.
  • All the basic arithmetic and logical operators are vectorized: +, *, -, /, pow, <, <=, >, >=, ==, !=, !.
  • Math functions: floor(), cos(), gamma(), exp(), etc.
  • r/d/p/q for all standard distributions
  • apply family
  • Rcpp-Sugar Vignette

What you need

Rcpp:

  • install.packages('Rcpp')

A working C++ compiler:

  • Windows: Rtools
  • Mac: Xcode from app store
  • Linux: sudo apt-get install r-base-dev

Where To Start

  • Does it work?
library(Rcpp)
evalCpp("1+1")
## [1] 2

Implement a Function in C++ from R

  • The cppFunction function compiles C++ code and constructs an R wrapper
  • C++ code needs specified types. Variables need to be initialized.
library(Rcpp)
cppFunction('
int addTwo(int x, int y) { 
  int sum = x + y;
  return sum;
}
')
addTwo(1, 2)
## [1] 3
addTwo
## function (x, y) 
## .Primitive(".Call")(<pointer: 0x107943f50>, x, y)

Be Careful About Data Types

addTwo(1.9, 2.9)
## [1] 3
  • Scalars:
    • double, int, String, bool
  • Vectors:
    • NumericVector, IntegerVector, CharacterVector, LogicalVector
  • Matrices:
    • NumericMatrix, IntegerMatrix, CharacterMatrix, LogicalMatrix
  • Classes added by Rcpp:
    • List, DataFrame, Function, Robject

Let's fix this

Change output and inputs to double

cppFunction('
double addTwo(double x, double y) { 
  double sum = x + y;
  return sum;
}
')
addTwo(1.9, 2.9)
## [1] 4.8

For Loop

  • C++ Syntax: for (init; condition; increment)
  • Indexing starts at 0!! (common source of bugs)
sumR <- function(x) {
  n <- length(x)
  total <- 0
  for(i in 1:n) {
    total <- total + x[i]
  }
  return(total)
}

cppFunction('
  double sumC(NumericVector x) {
    int n = x.size();
    double total = 0;
    for(int i = 0; i < n; ++i) {
      total += x[i];
    }
    return total;
  }
')

Timing Results for Sum Functions

library(microbenchmark)
x <- runif(100000)
microbenchmark(
  sum(x),
  sumR(x),
  sumC(x)
)
## Unit: microseconds
##     expr       min        lq       mean     median         uq       max neval
##   sum(x)    92.389    92.826   105.5395    94.5465   117.5525   198.901   100
##  sumR(x) 35038.670 36697.757 38855.6528 37385.2430 38169.9185 71635.789   100
##  sumC(x)    88.438    89.823   108.7193   100.6170   119.4515   213.851   100

Example Using NumericMatrix

  • Let's write a function to create a diagonal matrix from a vector.
  • Matrix values accessed by A(i, j)
cppFunction('
NumericMatrix diagC(NumericVector x) {
  int n = x.size(); 

  NumericMatrix mat(n, n);

  for (int i = 0; i < n; i++) {
    mat(i, i) = x[i];
  }

  return mat;
}          
')

Timing Results for diag() Function

microbenchmark(
  diag(1:1000),
  diagC(1:1000)
)
## Unit: milliseconds
##           expr      min       lq     mean   median       uq       max neval
##   diag(1:1000) 2.790997 2.821166 3.893509 2.856402 2.899913 37.579529   100
##  diagC(1:1000) 2.798069 2.825595 2.877935 2.864132 2.911366  3.084649   100

Sourcing C++ Code From R

  • Creating a new C++ file in RStudio sets you up with the necessary parts:

Your .cpp file needs to start with

#include <Rcpp.h>
using namespace Rcpp;

and include the Rcpp attribute

// [[Rcpp::export]]

above any function you wish to be called from R.

  • You can also include R code with special comments

example1.cpp

    #include <Rcpp.h>
    using namespace Rcpp;
    
    // [[Rcpp::export]]
    int timesTwo(int x) {
       return x * 2;
    }
    
    /*** R
    # test my function 
    timesTwo(5)
    */

Using C++ libraries other than Rcpp

Boost and Armadillo

  • STL and Rcpp doesn't get you far
  • Thankfully, many libraries are available
    • Linear Algebra
      • RcppArmadillo
      • RcppEigen
    • Parallelization
      • RcppParallel
      • OpenMP (use attribute // [[Rcpp::plugins(openmp)]])
    • Boost
      • Large collection of libraries

Boost

  • Large collection of peer-reviewed libraries
    • Some of which now included in C++11 Standard
  • "…one of the most highly regarded and expertly designed C++ library projects in the world."
  • Access Boost headers with R package BH
    • Use attribute [[Rcpp::depends(BH)]] in source

Create Wrappers for Functions from the boost/math Library

// We can now use the BH package
// [[Rcpp::depends(BH)]]

#include <Rcpp.h>
#include <boost/math/common_factor.hpp>  

using namespace Rcpp;
 
// [[Rcpp::export]]
int computeGCD(int a, int b) {
    return boost::math::gcd(a, b);
}

// [[Rcpp::export]]
int computeLCM(int a, int b) {
    return boost::math::lcm(a, b);
}

Use our new functions in R

sourceCpp("examples/boost.cpp")
computeGCD(100, 75)
## [1] 25
computeLCM(77, 55)
## [1] 385

RcppArmadillo

http://arma.sourceforge.net/

  • Syntax deliberately similar to Matlab
  • Aims towards a good balance between speed and ease of use
  • 100+ packages depend on RcppArmadillo
  • OSX Mavericks needs some help before you can use Armadillo:
    • Here is the explanation why
    • curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
    • sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /

RcppArmadillo

RcppArmadillo Inline Function

What if I want to use the Armadillo package for an inline function?

cppFunction("
  arma::mat outerC(arma::colvec a, arma::rowvec b) {
    return a*b;
  }
", depends="RcppArmadillo")
outerC(1:3, 1:3)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    4    6
## [3,]    3    6    9

RcppArmadillo Function from Source Code

  • What if I want to source a .cpp file using Armadillo namespace? We need to add:
    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

// [[Rcpp::export]]
arma::mat outerC2(arma::colvec a, arma::rowvec b) {
    return a*b;
  }

/*** R
outerC2(1:2, 1:2)
*/

Timing Results for outer Functions

x <- 1:100
y <- 2:101
microbenchmark(
  x %*% t(y),
  tcrossprod(x, y),
  outer(x, y),
  outerC(x, y),
  outerC2(x, y),
times=500)
## Unit: microseconds
##              expr    min      lq     mean  median      uq      max neval
##        x %*% t(y) 19.732 21.6235 35.64684 24.5695 35.6925 2204.709   500
##  tcrossprod(x, y) 14.583 15.9000 43.76020 19.3800 25.1895 3157.459   500
##       outer(x, y) 24.997 28.7700 50.63866 32.7280 50.8855 2714.753   500
##      outerC(x, y) 22.646 28.0510 57.28111 34.5680 46.1335 1805.597   500
##     outerC2(x, y) 23.055 28.0570 57.90161 34.5795 49.0270 2547.162   500

Using C++ Code in a Package

  • Run devtools::use_rcpp()
    • Among other things, this adds a src/ folder for .cpp files
  • You need to add a file with the following to R/
#' @useDynLib your-package-name
#' @importFrom Rcpp sourceCpp
NULL
  • In your .cpp files, you can use roxygen2 tags like //` @export
  • Building the package will create R wrappers and documentation for your C++ functions
    • devtools::load_all()
    • Cmd + Shift + B in RStudio

Debugging and Profiling

Debugging

  • traceback()
    • prints the call stack, bottom to top
    • this shows where the error is
  • browser()
    • examine objects as you run function line by line

Breakpoints

  • Breakpoints in RStudio
    • puts RStudio in debug mode: calls both traceback() and browser()

Profiling

  • lineprof package
    • devtools::install_github("hadley/lineprof")
    • Times are not exact. This is most useful for spotting bottlenecks
  • Profiling.R
library(lineprof)
f <- function() {
  pause(0.1)
  g()
  h()
}
g <- function() {
  pause(0.1)
  h()
}
h <- function() {
  pause(0.1)
}

Profiling

l <- lineprof(f())
l
## Reducing depth to 2 (from 5)
## Common path: <text>
##    time alloc release dups      ref     src
## 1 0.080 0.001       0    1 <text>#3 f/pause
## 2 0.161 0.002       0    0 <text>#4 f/g    
## 3 0.080 0.000       0    0 <text>#5 f/h
focus(l, "g")
## Reducing depth to 2 (from 4)
## Common path: <text>
##    time alloc release dups      ref     src
## 1 0.080 0.001       0    0 <text>#8 g/pause
## 2 0.081 0.001       0    0 <text>#9 g/h

Rmarkdown

Reproducible research with Rmarkdown

  • Markdown is a text-to-HTML conversion tool
    • Easy to read, easy to write
  • Rmarkdown is markdown with embedded code and output
  • Generate report/code/figures from one file
  • Rmarkdown files can generate
    • html (slides or report, can embed shiny apps)
    • PDF (beamer slides or report)
    • Word document
  • Add equations with
    • Inline: $y = x^2$
    • Display: $$y = x^2$$

Example

knitr Language Engines

  • In code chunks, you are not limited to R
  • Rmarkdown includes text output and syntax highlighting for:
    • python, bash, awk, ruby, Rcpp, SAS, julia with runr, and more
  • To my knowledge, no graphics from other languages are supported.
  • Note: knitr is not limited to markdown
    • LaTeX (R Sweave in RStudio), HTML
    • Code chunks have different syntax in LaTeX

Publishing on RPubs

  • Easily publish any RMarkdown report with RStudio to RPubs.com
  • Warning: Your report is immediately made public
  • This presentation was made in RStudio with RMarkdown and is hosted on RPubs