Data Analysis With R

Fall 2015 - Data @ Reed Research Skills Workshop Series

Chester Ismay

ETC 223

cismay@reed.edu

http://blogs.reed.edu/datablog

Who am I?

Grew up in South Dakota (town of 112 people)
BS in Math, minor in Computer Science from SDSM&T
MS in Statistics from Northern Arizona University
Worked as an actuary before obtaining PhD in Statistics from Arizona State University
Was Assistant Professor of Statistics and Data Science at Ripon College the last two years
Moved to Portland area this summer
Started working at Reed on August 11th

What can I help you with?

Data analysis
Data wrangling/cleaning
Data visualization
Data tidying/manipulating
Reproducible research

When am I available?

Email me at cismay@reed.edu or chester.ismay@reed.edu
Office (ETC 223) hours
- Mondays (11 AM to noon)
- Tuesdays and Thursdays (2 PM to 3 PM)

Basic research process

Research Process

Further support

data@reed.edu
http://www.reed.edu/data-at-reed

Research Process

What is data analysis?

Data Cycle

Introduction

Installing/Loading packages and data

library(babynames)
suppressPackageStartupMessages(library(dplyr))
library(ggplot2)
library(devtools)
library(broom)
data("names", package = "babynames")

Structure of the dataset

str(babynames)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1792091 obs. of  5 variables:
##  $ year: num  1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
##  $ sex : chr  "F" "F" "F" "F" ...
##  $ name: chr  "Mary" "Anna" "Emma" "Elizabeth" ...
##  $ n   : int  7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
##  $ prop: num  0.0724 0.0267 0.0205 0.0199 0.0179 ...

What does the data look like?

head(babynames)

## Source: local data frame [6 x 5]
## 
##    year   sex      name     n       prop
##   (dbl) (chr)     (chr) (int)      (dbl)
## 1  1880     F      Mary  7065 0.07238359
## 2  1880     F      Anna  2604 0.02667896
## 3  1880     F      Emma  2003 0.02052149
## 4  1880     F Elizabeth  1939 0.01986579
## 5  1880     F    Minnie  1746 0.01788843
## 6  1880     F  Margaret  1578 0.01616720

Is anyone else named “Chester”?

babynames %>% filter(name == "Chester") %>%
  qplot(year, prop, data = .)

I win!

babynames %>% filter(name == "Chester" | name == "Karolyn") %>%
  qplot(year, prop, data = ., colour = name)

What’s going on here?

babynames %>% filter(name == "Chester" | name == "Karolyn") %>%
  group_by(name) %>%
  summarize(mean_prop = mean(prop), 
            mean_n = mean(n),
            sd_prop = sd(prop),
            sd_n = sd(n)
  )

What’s going on here?

babynames %>% filter(name == "Chester" | name == "Karolyn") %>%
  group_by(name) %>%
  summarize(mean_prop = mean(prop), 
            mean_n = mean(n),
            sd_prop = sd(prop),
            sd_n = sd(n)
  )

## Source: local data frame [2 x 5]
## 
##      name    mean_prop    mean_n      sd_prop      sd_n
##     (chr)        (dbl)     (dbl)        (dbl)     (dbl)
## 1 Chester 8.555174e-04 608.66332 1.108811e-03 868.48528
## 2 Karolyn 3.439051e-05  56.38614 2.539501e-05  37.77591

What about here?

babynames %>% filter(name == "Chester" | name == "Karolyn") %>%
  group_by(name) %>%
  top_n(3)

What about here?

babynames %>% filter(name == "Chester" | name == "Karolyn") %>%
  group_by(name) %>%
  top_n(3)

## Selecting by prop

## Source: local data frame [6 x 5]
## Groups: name [2]
## 
##    year   sex    name     n         prop
##   (dbl) (chr)   (chr) (int)        (dbl)
## 1  1916     M Chester  3212 0.0034789541
## 2  1917     M Chester  3338 0.0034794827
## 3  1918     M Chester  3691 0.0035194849
## 4  1941     F Karolyn   149 0.0001196058
## 5  1943     F Karolyn   154 0.0001073054
## 6  1946     F Karolyn   167 0.0001035495

That wasn’t fair!

mod_names <- babynames %>% 
  filter(name == "Chester" 
         | name %in% c("Karolyn", "Carolyn", "Caroline")) %>%
  mutate(kname = ifelse(name == "Chester", "Chester", "KarolynVar"))

What the…

mod_names %>% qplot(year, prop, data = ., colour = kname)

Aw! Boo!

mod_names <- mod_names %>% group_by(year, kname) %>%
  summarize(sumprop = sum(prop))
mod_names %>% qplot(year, sumprop, data = ., colour = kname)

What is R and why should we use it?

R is a completely free software package and language for statistical analysis and graphics.
It excels in helping you with
- data manipulation
- automation
- reproducibility
- error finding
- customizability
Any downsides?

Like learning a foreign language!

Similarities (In the simplest case)

R	Foreign Language	R examples
functions	verb	- sqrt()
		- arrange()
		- lm()
command	sentence	- exp(3)
		- tail(babynames)

KEY POINT - Exposure makes you fluent!

Command patterns

To chain or not to chain…

These two commands accomplish the same thing in R.

Data_Table %>% function_name (argument1, argument2)
function_name(Data_Table, argument1, argument2)

BUT!

Chaining is much easier to read when you want to do a series of steps.

Unchained

filter(
  summarise(
    select(
      group_by(flights, year, month, day),
      arr_delay, dep_delay
    ),
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)

Versus

Chained

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

More analysis!

Pacific Northwest 2014 Departing Flights

library(pnwflights14)
data("flights", package = "pnwflights14")

Coming up with questions

flights %>% str()

## Classes 'tbl_df', 'tbl' and 'data.frame':    162049 obs. of  16 variables:
##  $ year     : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  1 4 8 28 34 37 346 526 527 536 ...
##  $ dep_delay: num  96 -6 13 -2 44 82 227 -4 7 1 ...
##  $ arr_time : int  235 738 548 800 325 747 936 1148 917 1334 ...
##  $ arr_delay: num  70 -23 -4 -23 43 88 219 15 24 -6 ...
##  $ carrier  : chr  "AS" "US" "UA" "US" ...
##  $ tailnum  : chr  "N508AS" "N195UW" "N37422" "N547UW" ...
##  $ flight   : int  145 1830 1609 466 121 1823 1481 229 1576 478 ...
##  $ origin   : chr  "PDX" "SEA" "PDX" "PDX" ...
##  $ dest     : chr  "ANC" "CLT" "IAH" "CLT" ...
##  $ air_time : num  194 252 201 251 201 224 202 217 136 268 ...
##  $ distance : num  1542 2279 1825 2282 1448 ...
##  $ hour     : num  0 0 0 0 0 0 3 5 5 5 ...
##  $ minute   : num  1 4 8 28 34 37 46 26 27 36 ...

Correlation (by airport)

flights %>% group_by(origin) %>%
  summarize(correl = cor(distance, arr_delay, use = "complete.obs"))

## Source: local data frame [2 x 2]
## 
##   origin      correl
##    (chr)       (dbl)
## 1    PDX -0.07893240
## 2    SEA -0.04535174

Correlation (over everything)

flights %>% summarize(correl = cor(dep_delay, month, use = "complete.obs"))

## Source: local data frame [1 x 1]
## 
##        correl
##         (dbl)
## 1 0.005312167

Simple Linear Regression & Correlation

flights %>% filter(origin == "PDX") %>%
  summarize(correl = cor(dep_delay, arr_delay, use = "complete.obs"))

## Source: local data frame [1 x 1]
## 
##      correl
##       (dbl)
## 1 0.9388435

lin_reg <- flights %>% filter(origin == "PDX") %>%
  lm(arr_delay ~ dep_delay, data = .)
broom::tidy(lin_reg)

##          term   estimate   std.error statistic p.value
## 1 (Intercept) -3.9571898 0.049581177 -79.81234       0
## 2   dep_delay  0.9963924 0.001590334 626.53032       0

t test

t_test <- flights %>% t.test(arr_delay ~ origin, data = .)
tidy(t_test) %>% rename("PDX" = estimate1, "SEA" = estimate2)

##     estimate      PDX      SEA statistic      p.value parameter  conf.low  conf.high
## 1 -0.7970638 1.705651 2.502714 -4.707663 2.509114e-06   99052.6 -1.128913 -0.4652143

ANOVA

anova_model <- flights %>% aov(arr_delay ~ carrier, data = .)
tidy(anova_model)

##        term     df     sumsq      meansq statistic       p.value
## 1   carrier     10   1448964 144896.3545  150.3188 1.635115e-315
## 2 Residuals 160737 154938701    963.9268        NA            NA

R Studio
+
R Markdown

Tools to make working with R friendly

RStudio is a powerful user interface that helps you get better control of your analysis.
It is also completely free.
You can write your entire paper/report (text, code, analysis, graphics, etc.) all in R Markdown.
If you need to update any of your code, R Markdown will automatically update your plots and output of your analysis and will create an updated PDF file.
No more copy-and-paste!
It’s my job!

Time to try it for yourself!

Useful links

Data Processing with dplyr & tidyr : https://rpubs.com/bradleyboehmke/data_wrangling

Introduction to dplyr : https://goo.gl/SY6qBy

Your assignment : Right click on me and Save

Solutions (Rmd) : Right click and Save

Solutions (HTML) : Click away…after you’ve tried!

Data @ Reed Research Skills Workshops for Fall 2015

All workshops in ETC 211 from 4 - 5 PM

September 16 - Data analysis with Stata
September 23 - Data analysis with R

September 30 - Data visualization using R
October 7 - Maps and more: spatial data
October 14 - Reproducible research

Thanks!

cismay@reed.edu

Slides available at http://rpubs.com/cismay/dawr_workshop_2015

Data Analysis With R

Fall 2015 - Data @ Reed Research Skills Workshop Series

Who am I?

What can I help you with?

When am I available?

Basic research process

Further support

data@reed.edu http://www.reed.edu/data-at-reed

What is data analysis?

Introduction

Installing/Loading packages and data

Structure of the dataset

What does the data look like?

Is anyone else named “Chester”?

I win!

What’s going on here?

What’s going on here?

What about here?

What about here?

That wasn’t fair!

What the…

Aw! Boo!

What is R and why should we use it?

Like learning a foreign language!

Similarities (In the simplest case)

Command patterns

To chain or not to chain…

BUT!

Unchained

Versus

Chained

More analysis!

Pacific Northwest 2014 Departing Flights

Coming up with questions

Correlation (by airport)

Correlation (over everything)

Simple Linear Regression & Correlation

t test

ANOVA

R Studio + R Markdown

Tools to make working with R friendly

Time to try it for yourself!

Useful links

Data @ Reed Research Skills Workshops for Fall 2015

Thanks!

cismay@reed.edu

data@reed.edu
http://www.reed.edu/data-at-reed

R Studio
+
R Markdown