R for Statistics

Chester Ismay (cismay@reed.edu)

Lewis & Clark FTI 2016
Tuesday, May 17, 2016

Slides available at http://rpubs.com/cismay/r_FTI_2016

Schedule for session

Why R?
What is R Markdown?
How should we conduct statistical analyses?
- Data visualization
- Data manipulation/summarizing
- Data analysis/modeling
- Interpretation of results
- Reproducible research

Why R?

The traditional approach to reports / analysis in the sciences

Import data set into statistical software package
Run the procedure to get results
Copy and paste appropriate pieces from the analysis into document editor
Add descriptions
Finish/submit report

Disadvantages of this process

Lots of manual work (prone to make errors)
Tedious (who likes to carefully copy-and-paste?)
Likely not recordable (did you write down all the steps you followed to get your analysis?)
What if you made an error at the beginning of your analysis? If your data had an error?

What is R?

R is a completely free software package and language for statistical analysis and graphics.
It excels in helping you with
- data manipulation
- automation
- reproducibility
- improved accuracy
- error finding
- customizability
- beautiful visualizations
Any downsides?

Learning how to use R

RStudio and R Markdown

RStudio is a powerful user interface that helps you get better control of your analysis.
It is also completely free.
It comes in both a desktop version and a server version (on the cloud).
You can write your entire paper/report (text, code, analysis, graphics, etc.) all in R Markdown.
If you need to update any of your code, R Markdown will automatically update your plots and output of your analysis and will create an updated PDF/HTML file.
No more copy-and-paste!

What is Markdown?

A “plaintext formatting syntax”
Type in plain text, render to more complex formats
One step beyond writing a txt file
Render to HTML, PDF, DOCX, etc.

What does it look like?

  # Header 1
  
  ## Header 2
  
  Normal paragraphs of text go here.
  
  **I'm bold**
  
  [links!](http://rstudio.com)
  
   * Unordered
   * Lists   
   
  And  Tables
  ---- -------
  Like This

What is R Markdown?

“Literate programming”
Embed R code in a Markdown document
Renders textual output along with graphics

```{r chunk_name}
x <- rnorm(1000)
length(x)
qplot(x, bins = 10, 
      fill = I("orange"), 
      color = I("black"))
```

## [1] 1000

But then I have to learn R…

Like learning a foreign language!

Similarities (In the simplest case)

R	Foreign Language	R examples
functions	verb	- `sqrt()`
		- `arrange()`
		- `lm()`
command	sentence	- `exp(3)`
		- `tail(babynames)`

KEY POINT - Exposure makes you fluent!

Data summarization/visualization

Flights from PDX in 2014

library(dplyr)
library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  select(-year, -origin)
str(object = pdx_flights)

Flights from PDX in 2014

library(dplyr)
library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  select(-year, -origin)
str(object = pdx_flights)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53335 obs. of  14 variables:
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  1 8 28 526 541 549 559 602 606 618 ...
##  $ dep_delay: num  96 13 -2 -4 1 24 -1 -3 6 -2 ...
##  $ arr_time : int  235 548 800 1148 911 907 916 1204 746 1135 ...
##  $ arr_delay: num  70 -4 -23 15 4 12 -9 7 3 -30 ...
##  $ carrier  : chr  "AS" "UA" "US" "UA" ...
##  $ tailnum  : chr  "N508AS" "N37422" "N547UW" "N813UA" ...
##  $ flight   : int  145 1609 466 229 1569 649 796 1573 406 1650 ...
##  $ dest     : chr  "ANC" "IAH" "CLT" "IAH" ...
##  $ air_time : num  194 201 251 217 130 122 125 203 87 184 ...
##  $ distance : num  1542 1825 2282 1825 991 ...
##  $ hour     : num  0 0 0 5 5 5 5 6 6 6 ...
##  $ minute   : num  1 8 28 26 41 49 59 2 6 18 ...

Random sample

We randomly select 6000 flights with no missing values from this set of 53,335 flights.

set.seed(2016)
pdx_rs <- na.omit(pdx_flights) %>% sample_n(6000)

Question of interest

Does a statistically significant difference exist in the mean departure delays for airlines departing PDX in 2014?

Using visualizations to get a hint

What type of plot will help us visualize this?

Explanatory variable: categorical
Response variable: continuous
- Side-by-side boxplot

Mindmap to assist

Coggle Diagram

Multiple Means (ANOVA)

Coggle Diagram for ANOVA

Plotting our data

library(ggplot2)
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot")

Plotting our data

library(ggplot2)
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot")

Making tweaks (Part I)

# library(ggplot2)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA)

Making tweaks (Part I)

# library(ggplot2)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA)

Making tweaks (Part II)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45))

Making tweaks (Part II)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45))

Making tweaks (Part III)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")

Making tweaks (Part III)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")

Data summarizing

pdx_summary <- pdx_rs %>% group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
kable(pdx_summary)

carrier	Mean Delay	Median Delay
AA	13.2077922	-2
AS	0.3305204	-5
B6	4.2057143	-3
DL	2.1482759	-3
F9	9.5774648	-3
HA	-1.8108108	-6
OO	4.5040504	-4
UA	9.0545455	-2
US	2.2024291	-3
VX	2.3333333	-4
WN	12.0638468	2

Get airline names

data("airlines", package = "pnwflights14")
pdx_join <- inner_join(x = pdx_summary, y = airlines, by = "carrier")
kable(pdx_join)

carrier	Mean Delay	Median Delay	name
AA	13.2077922	-2	American Airlines Inc.
AS	0.3305204	-5	Alaska Airlines Inc.
B6	4.2057143	-3	JetBlue Airways
DL	2.1482759	-3	Delta Air Lines Inc.
F9	9.5774648	-3	Frontier Airlines Inc.
HA	-1.8108108	-6	Hawaiian Airlines Inc.
OO	4.5040504	-4	SkyWest Airlines Inc.
UA	9.0545455	-2	United Air Lines Inc.
US	2.2024291	-3	US Airways Inc.
VX	2.3333333	-4	Virgin America
WN	12.0638468	2	Southwest Airlines Co.

Data analysis

Hypothesis test

Assuming conditions are met…

they might not be…

pdx_anova <- aov(formula = dep_delay ~ carrier, data = pdx_rs)
summary(pdx_anova)

##               Df  Sum Sq Mean Sq F value              Pr(>F)    
## carrier       10  130335   13033   13.49 <0.0000000000000002 ***
## Residuals   5989 5785272     966                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of results

Our friend the p-value

The \(p\)-value resulting from our analysis is essentially 0.
This corresponds to the probability of obtaining an observed \(F\) statistic of 13.4924731 or greater on an \(F\) distribution with \({df}_1 = 10\) and \(df_2 = 5989\), which assumes that the mean departure delays for all carriers is the same (the null hypothesis is true).
This small \(p\)-value leads us to reject the null hypothesis in favor of the alternative: at least one of the carriers has a mean departure delay that is different than the others (in the population of all 2014 flights from PDX).

Visualizing the p-value

To better assist in understanding \(p\)-values and the shape of distributions, I created a Shiny app with an undergraduate student available at http://ismay.shinyapps.io/ProbApp

Reproducible research

“Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.”

Roger Peng, Johns Hopkins

Full worked out example

A full worked out example of this analysis is available here as an HTML file to view.
The corresponding R Markdown file is available here.

How did we do?

We collected a random sample of the actual data on all 2014 flights departing PDX. Does a difference actually exist in the average departure delays for carriers in our population (all 2014 flights departing PDX)?

We can change the code below slightly to get our answer:

pdx_summary <- pdx_rs %>% group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), 
        `Median Delay` = median(dep_delay))
kable(pdx_summary)

Answer

pdx_full_summary <- na.omit(pdx_flights) %>% group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
kable(pdx_full_summary)

carrier	Mean Delay	Median Delay
AA	13.0708625	-2
AS	0.9418523	-5
B6	5.9677926	-3
DL	2.5678412	-3
F9	8.4546125	-3
HA	-0.8027397	-5
OO	4.2595904	-4
UA	7.3794427	-2
US	1.5259545	-3
VX	6.2477477	-4
WN	12.1458352	1

Useful links

Data Processing with dplyr & tidyr

Introduction to dplyr

ggplot2 documentation

R Graphics Cookbook

ggplot2 Cheat Sheet

OR just Google it!

Workshop problems

Ratings from all beers I’ve rated using the Untappd app since February 2015
Use the dplyr package (vignette here) along with appropriate plots using the ggplot2 package to understand which styles of beers I like best.
You can also look into which cities and states produce beers I like and have tried most. What stands out?
We are just doing data visualization and summary here (not inference)

Getting the problems

To access the template file to begin your analysis on my beer adorations, go to
- Your RStudio Server here and sign in
- File -> New File -> R Markdown -> From Template -> Chester’s Beer Analysis
- Give the file a name and proceed!
I’ve added these beer ratings to an R data package available here

Thanks!

cismay@reed.edu

Code for slide creation on my GitHub page
Slides available here

sessionInfo()

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rmarkdown_0.9.6         knitr_1.13              ggplot2_2.1.0           dplyr_0.4.3.9001       
## [5] pnwflights14_0.1.0.9000 revealjs_0.6.1         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.4      magrittr_1.5     munsell_0.4.3    colorspace_1.2-6 R6_2.1.2        
##  [6] highr_0.6        stringr_1.0.0    plyr_1.8.3       tools_3.3.0      grid_3.3.0      
## [11] gtable_0.2.0     DBI_0.4-1        htmltools_0.3.5  lazyeval_0.1.10  yaml_2.1.13     
## [16] assertthat_0.1   digest_0.6.9     tibble_1.0-1     formatR_1.4      evaluate_0.9    
## [21] labeling_0.3     stringi_1.0-1    scales_0.4.0

R for Statistics

Chester Ismay (cismay@reed.edu)

Lewis & Clark FTI 2016 Tuesday, May 17, 2016 Slides available at http://rpubs.com/cismay/r_FTI_2016

Schedule for session

Why R?

The traditional approach to reports / analysis in the sciences

Disadvantages of this process

What is R?

Learning how to use R

RStudio and R Markdown

What is Markdown?

What does it look like?

What is R Markdown?

But then I have to learn R…

Like learning a foreign language!

Similarities (In the simplest case)

Data summarization/visualization

Flights from PDX in 2014

Flights from PDX in 2014

Random sample

Question of interest

Does a statistically significant difference exist in the mean departure delays for airlines departing PDX in 2014?

Using visualizations to get a hint

What type of plot will help us visualize this?

Mindmap to assist

Multiple Means (ANOVA)

Plotting our data

Plotting our data

Making tweaks (Part I)

Making tweaks (Part I)

Making tweaks (Part II)

Making tweaks (Part II)

Making tweaks (Part III)

Making tweaks (Part III)

Data summarizing

Sidebar: Which airline has the worst average departure delay?

Get airline names

Data analysis

Hypothesis test

Interpretation of results

Our friend the p-value

Visualizing the p-value

Reproducible research

Full worked out example

How did we do?

Answer

Useful links

Workshop problems

Getting the problems

Thanks!

Lewis & Clark FTI 2016
Tuesday, May 17, 2016

Slides available at http://rpubs.com/cismay/r_FTI_2016