R is a completely free software package and language for statistical analysis and graphics.
.txt file
# Header 1
## Header 2
Normal paragraphs of text go here.
**I'm bold**
[links!](http://rstudio.com)
* Unordered
* Lists
And   Tables
----  -------
Like  This
- “Literate programming”
- Embed R code in a Markdown document
- Renders textual output along with graphics
```{r chunk_name}
library(ggplot2)  # provides qplot()
x <- rnorm(1000)
length(x)
qplot(x, bins = 10,
      fill = I("orange"),
      color = I("black"))
```
## [1] 1000
R | Foreign Language | R examples |
---|---|---|
functions | verb | sqrt(), arrange(), lm() |
command | sentence | exp(3), tail(babynames) |
KEY POINT - Exposure makes you fluent!
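To make the analogy concrete, here is a minimal sketch using only base R and the built-in `mtcars` data (an assumption chosen for illustration; the table above uses `babynames` for its `tail()` example). Each function plays the role of a verb, and each complete command reads like a sentence.

```r
# Functions are the verbs; a complete command is a sentence.
root   <- sqrt(16)   # "take the square root of 16"
growth <- exp(3)     # "raise e to the power 3"

# mtcars ships with base R and stands in for the slides' data here.
last_rows <- tail(mtcars, n = 3)       # "show the last 3 rows of mtcars"
fit <- lm(mpg ~ wt, data = mtcars)     # "model mpg as a linear function of wt"

print(root)
print(growth)
print(last_rows)
print(coef(fit))
```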
library(dplyr)
library(pnwflights14)
data(flights, package = "pnwflights14")
pdx_flights <- flights %>%
  filter(origin == "PDX") %>%
  select(-year, -origin)
str(object = pdx_flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53335 obs. of 14 variables:
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 1 8 28 526 541 549 559 602 606 618 ...
## $ dep_delay: num 96 13 -2 -4 1 24 -1 -3 6 -2 ...
## $ arr_time : int 235 548 800 1148 911 907 916 1204 746 1135 ...
## $ arr_delay: num 70 -4 -23 15 4 12 -9 7 3 -30 ...
## $ carrier : chr "AS" "UA" "US" "UA" ...
## $ tailnum : chr "N508AS" "N37422" "N547UW" "N813UA" ...
## $ flight : int 145 1609 466 229 1569 649 796 1573 406 1650 ...
## $ dest : chr "ANC" "IAH" "CLT" "IAH" ...
## $ air_time : num 194 201 251 217 130 122 125 203 87 184 ...
## $ distance : num 1542 1825 2282 1825 991 ...
## $ hour : num 0 0 0 5 5 5 5 6 6 6 ...
## $ minute : num 1 8 28 26 41 49 59 2 6 18 ...
We randomly select 6000 flights with no missing values from this set of 53,335 flights.
set.seed(2016)
pdx_rs <- na.omit(pdx_flights) %>% sample_n(6000)
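The call to `set.seed()` is what makes this random sample reproducible: rerunning the code with the same seed draws the same rows. A minimal sketch of the idea in base R, using a small made-up data frame (`toy` and its values are hypothetical, and `sample()` on row indices stands in for `sample_n()`):

```r
# Hypothetical toy data with a missing value, standing in for pdx_flights.
toy <- data.frame(
  carrier   = c("AS", "UA", "WN", "DL", "AS", "OO"),
  dep_delay = c(-2, 13, NA, 4, 0, 7)
)

# Drop incomplete rows first, as na.omit(pdx_flights) does.
complete <- na.omit(toy)

# Draw 3 rows at random; resetting the seed reproduces the same draw.
set.seed(2016)
first_draw <- complete[sample(nrow(complete), 3), ]

set.seed(2016)
second_draw <- complete[sample(nrow(complete), 3), ]

print(first_draw)
print(identical(first_draw, second_draw))
```

Without the second `set.seed(2016)`, the two draws would generally differ, which is why the seed is set immediately before sampling.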
Explanatory variable: categorical
Response variable: continuous
library(ggplot2)
qplot(x = carrier, y = dep_delay, data = pdx_rs, geom = "boxplot")
# library(ggplot2)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) +
  geom_boxplot(outlier.shape = NA)
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45))
ggplot(aes(x = carrier, y = dep_delay), data = pdx_rs) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(-20, 45)) +
  # fun.y matches ggplot2 2.1.0; ggplot2 3.3.0 and later call this argument fun
  stat_summary(fun.y = "mean", geom = "point", color = "red")
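library(knitr)  # provides kable()
# pdx_summary holds the per-carrier mean and median delays of pdx_rs
# (its definition appears later in these slides)
data(airlines, package = "pnwflights14")
pdx_join <- inner_join(x = pdx_summary, y = airlines, by = "carrier")
kable(pdx_join)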
carrier | Mean Delay | Median Delay | name |
---|---|---|---|
AA | 13.2077922 | -2 | American Airlines Inc. |
AS | 0.3305204 | -5 | Alaska Airlines Inc. |
B6 | 4.2057143 | -3 | JetBlue Airways |
DL | 2.1482759 | -3 | Delta Air Lines Inc. |
F9 | 9.5774648 | -3 | Frontier Airlines Inc. |
HA | -1.8108108 | -6 | Hawaiian Airlines Inc. |
OO | 4.5040504 | -4 | SkyWest Airlines Inc. |
UA | 9.0545455 | -2 | United Air Lines Inc. |
US | 2.2024291 | -3 | US Airways Inc. |
VX | 2.3333333 | -4 | Virgin America |
WN | 12.0638468 | 2 | Southwest Airlines Co. |
Assuming conditions are met…
pdx_anova <- aov(formula = dep_delay ~ carrier, data = pdx_rs)
summary(pdx_anova)
##               Df  Sum Sq Mean Sq F value              Pr(>F)
## carrier       10  130335   13033   13.49 <0.0000000000000002 ***
## Residuals   5989 5785272     966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
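The numbers in this table can also be pulled out of the fitted object rather than read off the printout. A minimal sketch, using the `PlantGrowth` data that ships with R instead of `pdx_rs` so that it runs without any extra packages (the variable names here are illustrative):

```r
# PlantGrowth: 30 plant weights across 3 treatment groups (built into R).
toy_anova <- aov(formula = weight ~ group, data = PlantGrowth)

# summary() returns a list whose first element is the ANOVA table.
anova_table <- summary(toy_anova)[[1]]

# Row 1 is the grouping factor; row 2 is the residuals.
f_stat  <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]
dfs     <- anova_table[["Df"]]

print(dfs)
print(f_stat)
print(p_value)
```

Applying the same indexing to `summary(pdx_anova)` should return the unrounded \(F\) statistic and \(p\)-value for the flight data.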
The \(p\)-value resulting from our analysis is essentially 0.
This is the probability of observing an \(F\) statistic of 13.4924731 or greater from an \(F\) distribution with \(df_1 = 10\) and \(df_2 = 5989\) degrees of freedom, assuming that the mean departure delay is the same for all carriers (that is, assuming the null hypothesis is true).
This small \(p\)-value leads us to reject the null hypothesis in favor of the alternative: at least one of the carriers has a mean departure delay that differs from the others (in the population of all 2014 flights from PDX).
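As a check on these numbers, the \(F\) statistic and its \(p\)-value can be recomputed from the printed ANOVA table alone. This sketch assumes only the rounded sums of squares and degrees of freedom shown in the output, so the result agrees with 13.4924731 only up to rounding:

```r
# Values copied from the printed ANOVA table.
ss_carrier   <- 130335
ss_residuals <- 5785272
df1 <- 10
df2 <- 5989

# F is the ratio of the two mean squares.
f_stat <- (ss_carrier / df1) / (ss_residuals / df2)

# Upper-tail probability under the F distribution implied by the null hypothesis.
p_value <- pf(f_stat, df1 = df1, df2 = df2, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

Setting `lower.tail = FALSE` asks for the area at or beyond the observed statistic, which is the "or greater" part of the definition.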
“Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.”
- Roger Peng, Johns Hopkins
We collected a random sample of the actual data on all 2014 flights departing PDX. Does a difference actually exist in the average departure delays for carriers in our population (all 2014 flights departing PDX)?
We can change the code below slightly to get our answer:
pdx_summary <- pdx_rs %>%
  group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay),
            `Median Delay` = median(dep_delay))
kable(pdx_summary)
pdx_full_summary <- na.omit(pdx_flights) %>%
  group_by(carrier) %>%
  summarize(`Mean Delay` = mean(dep_delay),
            `Median Delay` = median(dep_delay))
kable(pdx_full_summary)
carrier | Mean Delay | Median Delay |
---|---|---|
AA | 13.0708625 | -2 |
AS | 0.9418523 | -5 |
B6 | 5.9677926 | -3 |
DL | 2.5678412 | -3 |
F9 | 8.4546125 | -3 |
HA | -0.8027397 | -5 |
OO | 4.2595904 | -4 |
UA | 7.3794427 | -2 |
US | 1.5259545 | -3 |
VX | 6.2477477 | -4 |
WN | 12.1458352 | 1 |
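To see how closely the sample tracked the population, the two sets of means can be placed side by side. This sketch types in the means from the sample table and the full-data table by hand (rounded to two decimal places) rather than reusing `pdx_summary` and `pdx_full_summary`, so it runs on its own; the `compare` data frame and its column names are illustrative.

```r
# Mean departure delays copied from the sample and full-data tables.
compare <- data.frame(
  carrier     = c("AA", "AS", "B6", "DL", "F9", "HA", "OO", "UA", "US", "VX", "WN"),
  sample_mean = c(13.21, 0.33, 4.21, 2.15, 9.58, -1.81, 4.50, 9.05, 2.20, 2.33, 12.06),
  full_mean   = c(13.07, 0.94, 5.97, 2.57, 8.45, -0.80, 4.26, 7.38, 1.53, 6.25, 12.15)
)

# Positive values mean the sample understated the population mean delay.
compare$difference <- compare$full_mean - compare$sample_mean

# Sort so the carriers with the largest discrepancies come first.
compare <- compare[order(-abs(compare$difference)), ]
print(compare, row.names = FALSE)
```

In these tables the largest gap is for VX, while carriers such as WN and AA differ by well under a minute.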
Ratings from all beers I’ve rated using the Untappd app since February 2015
Use the dplyr package (vignette here) along with appropriate plots using the ggplot2 package to understand which styles of beers I like best.
You can also look into which cities and states produce beers I like and have tried most. What stands out?
We are only doing data visualization and summary here (not inference).
To access the template file to begin your analysis on my beer adorations, go to
I’ve added these beer ratings to an R data package available here
- Code for slide creation on my GitHub page
- Slides available here
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_0.9.6.10 knitr_1.13 ggplot2_2.1.0 dplyr_0.4.3.9001
## [5] pnwflights14_0.1.0.9000 revealjs_0.6.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.5 magrittr_1.5 munsell_0.4.3 colorspace_1.2-6 R6_2.1.2
## [6] highr_0.6 stringr_1.0.0 plyr_1.8.3 tools_3.3.0 grid_3.3.0
## [11] gtable_0.2.0 DBI_0.4-1 htmltools_0.3.5 lazyeval_0.1.10 yaml_2.1.13
## [16] assertthat_0.1 digest_0.6.9 tibble_1.0-1 formatR_1.4 evaluate_0.9
## [21] labeling_0.3 stringi_1.0-1 scales_0.4.0