What is R?


Learning how to use R

RStudio and R Markdown

  • RStudio is a powerful user interface that helps you get better control of your analysis.
  • It is also completely free.
  • It comes in both a desktop version and a server version (on the cloud).
  • You can write your entire paper/report (text, code, analysis, graphics, etc.) all in R Markdown.
  • If you need to update any of your code, R Markdown will automatically update your plots and output of your analysis and will create an updated PDF/HTML file.
  • No more copy-and-paste!

What is Markdown?


What does it look like?

  # Header 1
  
  ## Header 2
  
  Normal paragraphs of text go here.
  
  **I'm bold**
  
  [links!](http://rstudio.com)
  
   * Unordered
   * Lists   
   
  And  Tables
  ---- -------
  Like This
  
markdown

What is R Markdown?

  • “Literate programming”
  • Embed R code in a Markdown document
  • Renders textual output along with graphics

```{r chunk_name}
x <- rnorm(1000)
length(x)
qplot(x, bins = 10, 
      fill = I("orange"), 
      color = I("black"))
```
## [1] 1000


But then I have to learn R…


Like learning a foreign language!

Similarities (In the simplest case)

R Foreign Language R examples
functions verb - sqrt()
- arrange()
- lm()
command sentence - exp(3)
- tail(babynames)

KEY POINT - Exposure makes you fluent!

Data summarization/visualization



Flights from PDX in 2014

library(dplyr)
library(pnwflights14)
data(flights, package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  select(-year, -origin)
str(object = pdx_flights)

Flights from PDX in 2014

library(dplyr)
library(pnwflights14)
data(flights, package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  select(-year, -origin)
str(object = pdx_flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53335 obs. of  14 variables:
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  1 8 28 526 541 549 559 602 606 618 ...
##  $ dep_delay: num  96 13 -2 -4 1 24 -1 -3 6 -2 ...
##  $ arr_time : int  235 548 800 1148 911 907 916 1204 746 1135 ...
##  $ arr_delay: num  70 -4 -23 15 4 12 -9 7 3 -30 ...
##  $ carrier  : chr  "AS" "UA" "US" "UA" ...
##  $ tailnum  : chr  "N508AS" "N37422" "N547UW" "N813UA" ...
##  $ flight   : int  145 1609 466 229 1569 649 796 1573 406 1650 ...
##  $ dest     : chr  "ANC" "IAH" "CLT" "IAH" ...
##  $ air_time : num  194 201 251 217 130 122 125 203 87 184 ...
##  $ distance : num  1542 1825 2282 1825 991 ...
##  $ hour     : num  0 0 0 5 5 5 5 6 6 6 ...
##  $ minute   : num  1 8 28 26 41 49 59 2 6 18 ...

Random sample

We randomly select 6000 flights with no missing values from this set of 53,335 flights.

set.seed(2016)
pdx_rs <- na.omit(pdx_flights) %>% sample_n(6000)

Question of interest

Does a statistically significant difference exist in the mean departure delays for airlines departing PDX in 2014?


Using visualizations to get a hint

What type of plot will help us visualize this?

  • Explanatory variable: categorical

  • Response variable: continuous

    • Side-by-side boxplot

Mindmap to assist

Coggle Diagram


Multiple Means (ANOVA)

Coggle Diagram for ANOVA


Plotting our data

library(ggplot2)
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot")

Plotting our data

library(ggplot2)
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot")


Making tweaks (Part I)

# library(ggplot2)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA)

Making tweaks (Part I)

# library(ggplot2)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA)


Making tweaks (Part II)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45))

Making tweaks (Part II)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45))


Making tweaks (Part III)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")

Making tweaks (Part III)

ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) + 
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  stat_summary(fun.y = "mean", geom = "point", color = "red")


Data summarizing


Get airline names

data(airlines, package = "pnwflights14")
pdx_join <- inner_join(x = pdx_summary, y = airlines, by = "carrier")
kable(pdx_join)
carrier Mean Delay Median Delay name
AA 13.2077922 -2 American Airlines Inc.
AS 0.3305204 -5 Alaska Airlines Inc.
B6 4.2057143 -3 JetBlue Airways
DL 2.1482759 -3 Delta Air Lines Inc.
F9 9.5774648 -3 Frontier Airlines Inc.
HA -1.8108108 -6 Hawaiian Airlines Inc.
OO 4.5040504 -4 SkyWest Airlines Inc.
UA 9.0545455 -2 United Air Lines Inc.
US 2.2024291 -3 US Airways Inc.
VX 2.3333333 -4 Virgin America
WN 12.0638468 2 Southwest Airlines Co.

Data analysis


Hypothesis test

Assuming conditions are met…

pdx_anova <- aov(formula = dep_delay ~ carrier, data = pdx_rs)
summary(pdx_anova)
##               Df  Sum Sq Mean Sq F value              Pr(>F)    
## carrier       10  130335   13033   13.49 <0.0000000000000002 ***
## Residuals   5989 5785272     966                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of results


Our friend the p-value

  • The \(p\)-value resulting from our analysis is essentially 0.

  • This corresponds to the probability of obtaining an observed \(F\) statistic of 13.4924731 or greater on an \(F\) distribution with \({df}_1 = 10\) and \(df_2 = 5989\), which assumes that the mean departure delays for all carriers is the same (the null hypothesis is true).

  • This small \(p\)-value leads us to reject the null hypothesis in favor of the alternative: at least one of the carriers has a mean departure delay that is different than the others (in the population of all 2014 flights from PDX).


Visualizing the p-value



Reproducible research


Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.”

  • Roger Peng, Johns Hopkins

Full worked out example


How did we do?


Answer

pdx_full_summary <- na.omit(pdx_flights) %>% group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay), `Median Delay` = median(dep_delay))
kable(pdx_full_summary)
carrier Mean Delay Median Delay
AA 13.0708625 -2
AS 0.9418523 -5
B6 5.9677926 -3
DL 2.5678412 -3
F9 8.4546125 -3
HA -0.8027397 -5
OO 4.2595904 -4
UA 7.3794427 -2
US 1.5259545 -3
VX 6.2477477 -4
WN 12.1458352 1

Workshop problems


Getting the problems


Thanks!


sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rmarkdown_0.9.6.10      knitr_1.13              ggplot2_2.1.0           dplyr_0.4.3.9001       
## [5] pnwflights14_0.1.0.9000 revealjs_0.6.1         
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5      magrittr_1.5     munsell_0.4.3    colorspace_1.2-6 R6_2.1.2        
##  [6] highr_0.6        stringr_1.0.0    plyr_1.8.3       tools_3.3.0      grid_3.3.0      
## [11] gtable_0.2.0     DBI_0.4-1        htmltools_0.3.5  lazyeval_0.1.10  yaml_2.1.13     
## [16] assertthat_0.1   digest_0.6.9     tibble_1.0-1     formatR_1.4      evaluate_0.9    
## [21] labeling_0.3     stringi_1.0-1    scales_0.4.0