R for Statistics


Chester Ismay (cismay@reed.edu)

Lewis & Clark FTI 2016
Tuesday, May 17, 2016

Slides available at http://rpubs.com/cismay/r_FTI_2016

Schedule for session

  • Why R?

  • What is R Markdown?

  • How should we conduct statistical analyses?

    • Data visualization
    • Data manipulation/summarizing
    • Data analysis/modeling
    • Interpretation of results
    • Reproducible research

Why R?

The traditional approach to reports / analysis in the sciences

  • Import data set into statistical software package
  • Run the procedure to get results
  • Copy and paste appropriate pieces from the analysis into document editor
  • Add descriptions
  • Finish/submit report

Disadvantages of this process

  • Lots of manual work (prone to make errors)
  • Tedious (who likes to carefully copy-and-paste?)
  • Likely not recordable (did you write down all the steps you followed to get your analysis?)
  • What if you made an error at the beginning of your analysis? If your data had an error?

What is R?

  • R is a completely free software package and language for statistical analysis and graphics.
  • It excels in helping you with
    • data manipulation
    • automation
    • reproducibility
    • improved accuracy
    • error finding
    • customizability
    • beautiful visualizations
  • Any downsides?

Learning how to use R

RStudio and R Markdown

  • RStudio is a powerful user interface that helps you get better control of your analysis.
  • It is also completely free.
  • It comes in both a desktop version and a server version (on the cloud).
  • You can write your entire paper/report (text, code, analysis, graphics, etc.) all in R Markdown.
  • If you need to update any of your code, R Markdown will automatically update your plots and output of your analysis and will create an updated PDF/HTML file.
  • No more copy-and-paste!

What is Markdown?

  • A “plaintext formatting syntax”
  • Type in plain text, render to more complex formats
  • One step beyond writing a txt file
  • Render to HTML, PDF, DOCX, etc.

What does it look like?

  # Header 1
  
  ## Header 2
  
  Normal paragraphs of text go here.
  
  **I'm bold**
  
  [links!](http://rstudio.com)
  
   * Unordered
   * Lists   
   
  And  Tables
  ---- -------
  Like This
  
markdown

What is R Markdown?

  • “Literate programming”
  • Embed R code in a Markdown document
  • Renders textual output along with graphics

```{r chunk_name}
x <- rnorm(1000)
length(x)
qplot(x, bins = 10, 
      fill = I("orange"), 
      color = I("black"))
```
## [1] 1000

But then I have to learn R…

Like learning a foreign language!

Similarities (In the simplest case)

R Foreign Language R examples
functions verb - sqrt()
- arrange()
- lm()
command sentence - exp(3)
- tail(babynames)

KEY POINT - Exposure makes you fluent!

Data summarization/visualization

Flights from PDX in 2014

library(dplyr)
library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  select(-year, -origin)
str(object = pdx_flights)

Flights from PDX in 2014

library(dplyr)
library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  select(-year, -origin)
str(object = pdx_flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53335 obs. of  14 variables:
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  1 8 28 526 541 549 559 602 606 618 ...
##  $ dep_delay: num  96 13 -2 -4 1 24 -1 -3 6 -2 ...
##  $ arr_time : int  235 548 800 1148 911 907 916 1204 746 1135 ...
##  $ arr_delay: num  70 -4 -23 15 4 12 -9 7 3 -30 ...
##  $ carrier  : chr  "AS" "UA" "US" "UA" ...
##  $ tailnum  : chr  "N508AS" "N37422" "N547UW" "N813UA" ...
##  $ flight   : int  145 1609 466 229 1569 649 796 1573 406 1650 ...
##  $ dest     : chr  "ANC" "IAH" "CLT" "IAH" ...
##  $ air_time : num  194 201 251 217 130 122 125 203 87 184 ...
##  $ distance : num  1542 1825 2282 1825 991 ...
##  $ hour     : num  0 0 0 5 5 5 5 6 6 6 ...
##  $ minute   : num  1 8 28 26 41 49 59 2 6 18 ...

Random sample

We randomly select 6000 flights with no missing values from this set of 53,335 flights.

set.seed(2016)
pdx_rs <- na.omit(pdx_flights) %>% sample_n(6000)

Question of interest

Does a statistically significant difference exist in the mean departure delays for airlines departing PDX in 2014?

Using visualizations to get a hint

What type of plot will help us visualize this?

  • Explanatory variable: categorical

  • Response variable: continuous

    • Side-by-side boxplot

Mindmap to assist

Coggle Diagram

Multiple Means (ANOVA)

Coggle Diagram for ANOVA

Plotting our data

library(ggplot2)
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot")

Plotting our data

library(ggplot2)
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot")

Making tweaks (Part I)

# library(ggplot2)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA)

Making tweaks (Part I)

# library(ggplot2)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA)

Making tweaks (Part II)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45))

Making tweaks (Part II)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45))

Making tweaks (Part III)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")

Making tweaks (Part III)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")

Data summarizing

pdx_summary <- pdx_rs %>% group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
kable(pdx_summary)
carrier Mean Delay Median Delay
AA 13.2077922 -2
AS 0.3305204 -5
B6 4.2057143 -3
DL 2.1482759 -3
F9 9.5774648 -3
HA -1.8108108 -6
OO 4.5040504 -4
UA 9.0545455 -2
US 2.2024291 -3
VX 2.3333333 -4
WN 12.0638468 2

Get airline names

data("airlines", package = "pnwflights14")
pdx_join <- inner_join(x = pdx_summary, y = airlines, by = "carrier")
kable(pdx_join)
carrier Mean Delay Median Delay name
AA 13.2077922 -2 American Airlines Inc.
AS 0.3305204 -5 Alaska Airlines Inc.
B6 4.2057143 -3 JetBlue Airways
DL 2.1482759 -3 Delta Air Lines Inc.
F9 9.5774648 -3 Frontier Airlines Inc.
HA -1.8108108 -6 Hawaiian Airlines Inc.
OO 4.5040504 -4 SkyWest Airlines Inc.
UA 9.0545455 -2 United Air Lines Inc.
US 2.2024291 -3 US Airways Inc.
VX 2.3333333 -4 Virgin America
WN 12.0638468 2 Southwest Airlines Co.

Data analysis

Hypothesis test

Assuming conditions are met…

  • they might not be…
pdx_anova <- aov(formula = dep_delay ~ carrier, data = pdx_rs)
summary(pdx_anova)
##               Df  Sum Sq Mean Sq F value              Pr(>F)    
## carrier       10  130335   13033   13.49 <0.0000000000000002 ***
## Residuals   5989 5785272     966                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of results

Our friend the p-value

  • The \(p\)-value resulting from our analysis is essentially 0.

  • This corresponds to the probability of obtaining an observed \(F\) statistic of 13.4924731 or greater on an \(F\) distribution with \({df}_1 = 10\) and \(df_2 = 5989\), which assumes that the mean departure delays for all carriers is the same (the null hypothesis is true).

  • This small \(p\)-value leads us to reject the null hypothesis in favor of the alternative: at least one of the carriers has a mean departure delay that is different than the others (in the population of all 2014 flights from PDX).

Visualizing the p-value

  • To better assist in understanding \(p\)-values and the shape of distributions, I created a Shiny app with an undergraduate student available at http://ismay.shinyapps.io/ProbApp

Reproducible research

Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.”

  • Roger Peng, Johns Hopkins

Full worked out example

  • A full worked out example of this analysis is available here as an HTML file to view.
  • The corresponding R Markdown file is available here.

How did we do?

  • We collected a random sample of the actual data on all 2014 flights departing PDX. Does a difference actually exist in the average departure delays for carriers in our population (all 2014 flights departing PDX)?

  • We can change the code below slightly to get our answer:

    pdx_summary <- pdx_rs %>% group_by(carrier) %>%
      summarize(`Mean Delay` = mean(dep_delay), 
            `Median Delay` = median(dep_delay))
    kable(pdx_summary)

Answer

pdx_full_summary <- na.omit(pdx_flights) %>% group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
kable(pdx_full_summary)
carrier Mean Delay Median Delay
AA 13.0708625 -2
AS 0.9418523 -5
B6 5.9677926 -3
DL 2.5678412 -3
F9 8.4546125 -3
HA -0.8027397 -5
OO 4.2595904 -4
UA 7.3794427 -2
US 1.5259545 -3
VX 6.2477477 -4
WN 12.1458352 1

Workshop problems

  • Ratings from all beers I’ve rated using the Untappd app since February 2015

  • Use the dplyr package (vignette here) along with appropriate plots using the ggplot2 package to understand which styles of beers I like best.

  • You can also look into which cities and states produce beers I like and have tried most. What stands out?

  • We are just doing data visualization and summary here (not inference)

Getting the problems

  • To access the template file to begin your analysis on my beer adorations, go to

    • Your RStudio Server here and sign in
    • File -> New File -> R Markdown -> From Template -> Chester’s Beer Analysis
    • Give the file a name and proceed!
  • I’ve added these beer ratings to an R data package available here

Thanks!


cismay@reed.edu


sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rmarkdown_0.9.6         knitr_1.13              ggplot2_2.1.0           dplyr_0.4.3.9001       
## [5] pnwflights14_0.1.0.9000 revealjs_0.6.1         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.4      magrittr_1.5     munsell_0.4.3    colorspace_1.2-6 R6_2.1.2        
##  [6] highr_0.6        stringr_1.0.0    plyr_1.8.3       tools_3.3.0      grid_3.3.0      
## [11] gtable_0.2.0     DBI_0.4-1        htmltools_0.3.5  lazyeval_0.1.10  yaml_2.1.13     
## [16] assertthat_0.1   digest_0.6.9     tibble_1.0-1     formatR_1.4      evaluate_0.9    
## [21] labeling_0.3     stringi_1.0-1    scales_0.4.0