August 8, 2016

What is R?

  • R is a language and environment for statistical computing and graphics
  • Supported by the R Foundation for Statistical Computing
  • The R language is based on the S language, developed at Bell Labs by John Chambers in the 1970s.
  • The 1988 version 3 rewrite of S is the basis for R as it exists today.
  • Versioned releases of R go back to 1997, version 0.49.
  • Runs on Windows, Mac, and Linux platforms

Why R?

My short answer:

  • As a statistical dabbler, I can't afford a SAS license.
  • R is …
    • Free to use
    • Comparable in many respects to commercial software
    • Backed by a large and growing user base
    • Supported by widely available instruction, tutorials, tips & tricks
  • Practically everything touted as a reason for R could easily be said about other software packages.

Is R "better" than SAS, SPSS, Stata, Excel?

  • This is something of a religious question. My only experience with statistical software is using R and Excel. Excel is mostly useful for very simple situations.

  • If you have built a career using SAS or another package, I would not discourage you from continuing down that road.

  • If you prefer to have a wide assortment of tools at your disposal, then learn R. It will add to your personal and our organizational capacity. Plus it will build character!

Learning R

  • R is moderately difficult to learn
  • Fortunately, there are many training resources available
    • Coursera
    • Lynda.com
    • DataCamp
  • There is also a wide range of resources for finding answers and tips
    • R-bloggers
    • R-exercises
    • Quick-R

Coursera's Data Science Curriculum

My experience has been through the Coursera Data Science specialization. This is a 10-course series of month-long classes (I've completed 9 of 10) taught by professors from the Johns Hopkins Bloomberg School of Public Health.

This doesn't make me an expert, just a slightly experienced dabbler.

The R Ecosystem: Just R

  • R by itself is a command-line based statistical engine.
  • The R runtime download includes a very basic GUI, imaginatively named "RGui"
  • The native RGui is functional but lacks many features you might expect from a full IDE, or integrated development environment.
  • With RGui, you are still essentially working at the command line.
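
To give a feel for working at the R prompt, here is a minimal sketch of the kind of commands you might type. It uses `mtcars`, a small example data set that ships with R; the choice of data and model is mine, purely for illustration.

```r
# mtcars is a built-in data set: fuel economy (mpg) and weight (wt) for 32 cars
fit <- lm(mpg ~ wt, data = mtcars)   # fit a simple linear regression

coef(fit)              # intercept and slope of the fitted line
summary(mtcars$mpg)    # minimum, quartiles, mean, and maximum of mpg
```

Each line is typed and evaluated on its own, and results print straight back to the console.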

The R Ecosystem: R + RStudio

  • There are many alternative GUI tools for working with R
  • The most widely used GUI tool is the RStudio IDE
  • RStudio operates as an adjunct to native R, adding useful conveniences

The R Ecosystem: R + RStudio + CRAN

  • CRAN is the Comprehensive R Archive Network
  • Contains nearly 9,000 R packages
  • Packages are reasonably vetted, which makes it likely they will work, but there are no guarantees.
  • CRAN Task Views provide topical guidance for CRAN packages
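
Installing from CRAN usually takes one call to `install.packages()` followed by `library()`. The sketch below wraps the two in a helper; `ensure_package` is a hypothetical name of my own, and the mirror URL is just a default I assumed, so substitute whichever mirror you prefer.

```r
# Hypothetical helper: load a package, installing it from CRAN first if needed.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)   # download and install from the mirror
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

ensure_package("stats")   # ships with R, so nothing is downloaded here
```

For a CRAN package you would call it the same way, for example `ensure_package("ggplot2")`, which needs a network connection the first time.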

The R Ecosystem: R + RStudio + CRAN + Git

  • Git is a distributed version control system that can easily be integrated with RStudio
  • Developed by Linus Torvalds, principal developer of Linux
  • Designed for distributed, non-linear workflows
  • Useful for iterative versioning of data management and analysis tasks
  • Extremely useful for collaborative versioning, either publicly or privately, via GitHub or a private Git server
  • Widely used, with many books and tutorials available

Surprising Things (for me!)

Most general-purpose programming languages operate on one value at a time, so code ends up littered with loops to process arrays of information.

g <- rnorm(100000)
h <- rep(NA, 100000)

# Start the clock!
ptm <- proc.time()

# Loop through the vector, adding one
for (i in 1:100000){
  h[i] <- g[i] + 1
}
# Stop the clock
proc.time() - ptm
##    user  system elapsed 
##    0.19    0.00    0.18
h[1:10]
##  [1]  0.9159057  0.2007966  2.1333830 -0.1743988 -0.4212987  1.6411912
##  [7] -0.5610115  0.9808929  1.1687580  1.0009902

Surprises: Vectors (arrays, lists) without Loops

Vectors in R are processed easily, without loops.

ptm <- proc.time()

# operations occur over the entire vector g
h <- g + 1

proc.time() - ptm
##    user  system elapsed 
##       0       0       0
h[1:10]
##  [1]  0.9159057  0.2007966  2.1333830 -0.1743988 -0.4212987  1.6411912
##  [7] -0.5610115  0.9808929  1.1687580  1.0009902

And quickly!
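
To check this on your own machine, the two approaches can be timed side by side with `system.time()`. This is a sketch rather than a rigorous benchmark: the seed and the names `h_loop`, `h_vec`, `loop_time`, and `vec_time` are my own choices, and the timings will vary from computer to computer.

```r
set.seed(42)                 # arbitrary seed, chosen only for reproducibility
g <- rnorm(100000)

# Loop version: preallocate the result, then fill it one element at a time
loop_time <- system.time({
  h_loop <- numeric(length(g))
  for (i in seq_along(g)) {
    h_loop[i] <- g[i] + 1
  }
})

# Vectorized version: one expression operates on the whole vector
vec_time <- system.time({
  h_vec <- g + 1
})

identical(h_loop, h_vec)     # confirm both approaches give the same values
loop_time[["elapsed"]]       # seconds taken by the loop
vec_time[["elapsed"]]        # seconds taken by the vectorized version
```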

Popular Packages: ggplot2

An Example of Using ggplot2

library(ggplot2)
set.seed(314159)  # fix the seed so the random data are reproducible

# 1,000 random points, flagged by whether X is at most Y
df <- data.frame(X = rnorm(1000, mean = 50, sd = 20),
                 Y = rnorm(1000, mean = 50, sd = 10))
df$relation <- df$X <= df$Y

# Colored scatterplot with a linear fit and its confidence band
p <- ggplot(df, aes(x = X, y = Y)) +
      coord_cartesian() +
      scale_x_continuous() +
      scale_y_continuous() +
      scale_color_manual(name = "X <= Y", values = c("TRUE" = "red", "FALSE" = "blue")) +
      geom_point(aes(color = factor(relation))) +
      geom_smooth(method = "lm", se = TRUE) +
      labs(title = "Using GGPLOT2", x = "Random X", y = "Random Y")

An Example of Using ggplot2 - Results

Popular Packages: dplyr

Other Popular Packages