Intro to R

Robert Norberg
Wednesday, Sep 02, 2015

What is R?

  • A statistical programming language
  • An implementation of the S language, which was developed at Bell Labs
  • Free
  • Open Source
  • Object oriented

Pros/Cons of R

Pros

Cons

  • Difficult to learn at first
  • Contributed packages are not always double and triple checked
  • Constantly evolving
  • No help line

What is RStudio?

An Integrated Development Environment (IDE) for R

What can R/RStudio do?

Statistics

model <- lm(mpg~hp, data=mtcars)
summary(model)

Call:
lm(formula = mpg ~ hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7121 -2.1122 -0.8854  1.5819  8.2360 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
hp          -0.06823    0.01012  -6.742 1.79e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared:  0.6024,    Adjusted R-squared:  0.5892 
F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Graphics

library(ggplot2)
ggplot(mtcars, aes(x=hp, y=mpg, color=gear))+
  facet_grid(.~cyl, labeller=label_both)+
  geom_point(size=8)+
  theme_bw(base_size=30)

Dynamic Report Generation

  • Switch between plain text and code
  • Compile the report to a PDF, docx, html, or just about anything else
  • More than one person working on the report? Use Subversion for version control!

A quick look at my blog for an example.

Slide Shows

This slideshow was made in R!

Making slideshows is very similar to making dynamic reports.

Interface With the Web

library(RCurl)
library(data.table)
myfile <- getURL("http://statistics.cos.ucf.edu/mjohnson/wp-content/uploads/2013/08/CH06PR09.txt", ssl.verifyhost=FALSE, ssl.verifypeer=FALSE) # download the file contents as one string
mydat <- fread(myfile)
summary(mydat)
       V1             V2               V3              V4        
 Min.   :3998   Min.   :211944   Min.   :4.610   Min.   :0.0000  
 1st Qu.:4193   1st Qu.:268759   1st Qu.:6.805   1st Qu.:0.0000  
 Median :4316   Median :291271   Median :7.325   Median :0.0000  
 Mean   :4363   Mean   :302693   Mean   :7.371   Mean   :0.1154  
 3rd Qu.:4472   3rd Qu.:321906   3rd Qu.:7.938   3rd Qu.:0.0000  
 Max.   :5045   Max.   :472476   Max.   :9.650   Max.   :1.0000  

Apps

The Shiny package allows easy app creation.

Example

More

Browse the CRAN Task Views to see which R packages are available for your next project.

A brief intro to R

Objects

We can create an object, x, via assignment.

x <- 5 

Then, when we ask for the value of x:

x
[1] 5

Logical operators and conditional logic (if/else)

Comparison and logical operators return Boolean (TRUE/FALSE) values.

x > 3
[1] TRUE

Conditional logic - a fancy term for if/else

if(x > 3){print('Yusssss')}
[1] "Yusssss"
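An if can be paired with an else branch that runs when the condition is FALSE. A minimal sketch, reusing x from above; the messages and the name msg are made up for illustration:

```r
x <- 5

# The else branch runs only when the condition is FALSE
if (x > 10) {
  msg <- "x is big"
} else {
  msg <- "x is small"
}
msg
```

Here the condition x > 10 is FALSE, so msg ends up holding "x is small".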

Vectors

x is a scalar (in R, simply a vector of length one). Objects can be lots of things, including longer vectors.

y <- c(1:10) # `c` is for concatenate
y
 [1]  1  2  3  4  5  6  7  8  9 10

R is particularly adept at operating on vectors.

y > 5 # operates on each element of the vector
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
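Arithmetic works element by element in the same way, and logical vectors can be counted directly. A short sketch; the names doubled and n_big are chosen only for illustration:

```r
y <- c(1:10)

# Arithmetic is applied to each element, with no loop required
doubled <- y * 2
doubled

# Logical vectors can be summed: TRUE counts as 1 and FALSE as 0
n_big <- sum(y > 5)
n_big
```

Five elements of y are greater than 5, so n_big is 5.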

Indexing

Indexing allows us to retrieve individual elements from a vector.

y[1] # returns the first element of y
[1] 1
y[y > 5] # returns all elements of y where y > 5
[1]  6  7  8  9 10
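A few more indexing patterns, sketched on the same vector. Positions in R start at 1, and the object names below are hypothetical:

```r
y <- c(1:10)

first_three <- y[1:3]      # a vector of positions selects several elements
without_first <- y[-1]     # a negative position drops that element
positions <- which(y > 5)  # which() returns the positions where the condition is TRUE

first_three
without_first
positions
```

Because y holds the numbers 1 to 10, the positions returned by which() happen to equal the values themselves; on other vectors they would differ.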

Functions

Akin to macros in SAS, functions take input and return output.

sum(y) # sum() is a function
[1] 55

There is a function for just about everything in R. The sample() function randomly samples from a vector.

sample(y, 2) # draw two elements from `y` at random (without replacement)
[1] 5 1
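You can also write your own functions with function(). A minimal sketch; the name rangeWidth is hypothetical, and the seed value 42 is arbitrary:

```r
# A user-defined function: takes a numeric vector, returns max minus min
rangeWidth <- function(v) {
  max(v) - min(v)
}

y <- c(1:10)
rangeWidth(y)

# sample() gives different results on every run; set.seed() makes a run repeatable
set.seed(42)
draw1 <- sample(y, 2)
set.seed(42)
draw2 <- sample(y, 2)
identical(draw1, draw2)
```

Resetting the seed before each call makes both draws identical, which is handy when you want someone else to reproduce a simulation.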

Loops

for(i in 1:5){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
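Loops are often used to build up a result one step at a time. The sketch below accumulates a sum of squares in a loop, then does the same with sapply(), one common alternative to an explicit loop; the names total and squares are illustrative only:

```r
# Accumulate a running total inside a loop
total <- 0
for (i in 1:5) {
  total <- total + i^2
}
total

# The same computation without an explicit loop
squares <- sapply(1:5, function(i) i^2)
sum(squares)
```

Both approaches give 1 + 4 + 9 + 16 + 25 = 55.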

An example of problem solving in R

The Monty Hall Problem

Allow Kevin Spacey to explain.

(Famously solved by Marilyn vos Savant in her Parade magazine column)

Verify by simulation

"There are only two hard things in Computer Science: cache invalidation and naming things."
- Phil Karlton

doors <- c(1:3)
correctDoor <- sample(doors, 1)
guess1 <- sample(doors, 1)

Let's peek:

correctDoor; guess1
[1] 1
[1] 2

Two strategies

“Stubborn”

You make a first guess at random and no matter what the host does, you stick to your guns. You're sure he's only trying to bait you!

“Switch”

You pick a door at random, then the host reveals a door, eliminating a wrong answer. Then you switch from your original guess to the remaining door and hope for the best!

"Stubborn" strategy

(Recall: correct door = 1, first guess = 2)

if(guess1==correctDoor){
  result <- "Winner!"
}else{
  result <- "Sorry :("
}
result
[1] "Sorry :("
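A single round says little, so the "stubborn" strategy can be repeated many times to estimate how often it wins. A minimal sketch, assuming the same rules as above; the function name playStubborn, the seed, and the 10,000 repetitions are all arbitrary choices for illustration:

```r
# One round of the "stubborn" strategy: TRUE if the first guess is correct
playStubborn <- function(doors = 1:3) {
  correctDoor <- sample(doors, 1)
  guess1 <- sample(doors, 1)
  guess1 == correctDoor
}

set.seed(1)  # arbitrary seed so the run is repeatable
stubbornWins <- replicate(10000, playStubborn())
mean(stubbornWins)  # proportion of rounds won
```

The proportion should land near 1/3, since a random first guess is right one time in three and the host's reveal never changes it.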

"Switch" strategy

  • The host must reveal what is behind one door after your first guess.
  • He cannot reveal the prize.
  • If your first guess is wrong, he must reveal the only remaining incorrect door.

if(guess1!=correctDoor){
  doorToReveal <- doors[(doors!=correctDoor & doors!=guess1)]
}

"Switch" strategy

(Recall: correct door = 1, first guess = 2)

If your first guess is correct, the host may reveal either of the two remaining doors.

if(guess1==correctDoor){
  canReveal <- doors[doors!=guess1]
  doorToReveal <- sample(canReveal, 1)
}

And the host reveals…

doorToReveal
[1] 3

"Switch" strategy

(Recall: correct door = 1, first guess = 2)

Finally, you switch from your first guess to whichever remaining door has not been revealed.

remaining <- doors[(doors!=guess1 & doors!=doorToReveal)]
guess2 <- remaining # `remaining` already holds the door number, so no further indexing is needed
if(guess2==correctDoor){
  result <- "Winner!"
}else{
  result <- "Sorry :("
}
result
[1] "Winner!"

One simulation is good, but many is better

wins <- 0
for(i in 1:1e6){

  # ... insert "switch" strategy here

  if(result=="Winner!"){ # must match the string assigned above, including the "!"
    wins <- wins + 1
  }
}

And the results are in

wins/1e6
[1] 0.666485
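The loop above leaves the strategy as a placeholder. One way to fill it in is to wrap a single round in a function and call it inside the loop. This is a sketch under the rules stated earlier, not necessarily the code used to produce the number above; the names playSwitch and nSims, the seed, and the smaller number of rounds are assumptions made for illustration. The two host cases (first guess right or wrong) are merged into one step here:

```r
# One round of the "switch" strategy: TRUE if the final guess wins
playSwitch <- function(doors = 1:3) {
  correctDoor <- sample(doors, 1)
  guess1 <- sample(doors, 1)

  # Doors the host may open: neither the prize nor the first guess
  canReveal <- doors[doors != correctDoor & doors != guess1]
  # Pick by position, because sample(n, 1) on a single number n draws from 1:n
  doorToReveal <- canReveal[sample(length(canReveal), 1)]

  # Switch to the one door that is neither the first guess nor the revealed door
  guess2 <- doors[doors != guess1 & doors != doorToReveal]
  guess2 == correctDoor
}

set.seed(1)     # arbitrary seed so the run is repeatable
wins <- 0
nSims <- 10000  # fewer rounds than the 1e6 used above, to keep the run quick
for (i in 1:nSims) {
  if (playSwitch()) {
    wins <- wins + 1
  }
}
wins / nSims
```

With three doors the estimate should come out close to 2/3, in line with the result above.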

Does this prove our hypothesis? Are you convinced?

A challenge for you

What if there were 5 doors? (All the other rules remain the same)

  • How often would you win with a “stubborn” strategy?
  • How often would you win with a “switch” strategy?

Prove your answers via simulation and send me your code! First one to get it wins… something.

How do I learn to use R?

rpubs.com/rnorberg/intro-to-R