Data Transformation with dplyr

Disclaimer: The contents of this document come from Chapter 3, Data Transformation with `dplyr`, of R for Data Science (Wickham & Grolemund, 2017). This document was prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech, in Spring 2019. For any questions, contact the instructor, Yongsung Lee, Ph.D., at yongsung.lee(at)gatech.edu.

This document is also published on RPubs.

install.packages("tidyverse", repos = "http://cran.us.r-project.org", dependencies = TRUE)
install.packages(c("nycflights13", "gapminder", "Lahman"), repos = "http://cran.us.r-project.org", dependencies = TRUE)
library(tidyverse)
library(nycflights13)

1. Intro

What we do:

Functions	Tasks
`filter()`	Choose observations (rows) by their values
`arrange()`	Reorder the rows
`select()`	Choose variables (columns) by their names
`mutate()`	Create new variables (columns) as functions of existing variables
`summarize()`	Collapse many values down to a single summary

Why we do it:

Make data ready for further analysis
Better understand data (e.g., compute key statistics such as mean/median by group)
Identify critical issues in data that are not obvious (e.g., missing values, data entry errors, outliers)

Sample dataset nycflights13:

336,776 flights that departed from New York City in 2013.

nycflights13::flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

tibble a new version of the data frame (similar to an MS Excel spreadsheet; to switch between the two, use as_tibble() / as.data.frame())
int integer
dbl double
chr character
dttm date-times

—(not used here)—

lgl logical: TRUE or FALSE
fctr factor (categorical variable)
date dates
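
A minimal sketch of how these abbreviations show up in practice, using a small made-up tibble (the column names and values are illustrative and are not part of nycflights13). Printing it shows an abbreviation under each column name; the exact spelling of an abbreviation may differ slightly between tibble versions.

```r
library(tibble)

# Hypothetical toy data: one column per type in the list above
types_demo <- tibble(
  n_int  = c(1L, 2L),                               # integer
  n_dbl  = c(1.5, 2.5),                             # double
  s_chr  = c("a", "b"),                             # character
  t_dttm = as.POSIXct(c("2013-01-01 05:17:00",
                        "2013-01-01 05:33:00"),
                      tz = "UTC"),                  # date-time
  b_lgl  = c(TRUE, FALSE),                          # logical
  f_fct  = factor(c("x", "y")),                     # factor
  d_date = as.Date(c("2013-01-01", "2013-01-02"))   # date
)

types_demo   # the header row lists the type of every column

# Switching between the two representations
plain_df   <- as.data.frame(types_demo)  # tibble -> plain data frame
back_again <- as_tibble(plain_df)        # plain data frame -> tibble
```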

2. Filter Rows with `filter()`

Create a new tibble with those observations that meet a certain condition(s).

filter(flights, month == 1, day ==1)

## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

jan1 <- filter(flights, month == 1, day == 1)
(dec25 <- filter(flights, month == 12, day ==25)) # assign & print together

## # A tibble: 719 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    12    25      456            500        -4      649
##  2  2013    12    25      524            515         9      805
##  3  2013    12    25      542            540         2      832
##  4  2013    12    25      546            550        -4     1022
##  5  2013    12    25      556            600        -4      730
##  6  2013    12    25      557            600        -3      743
##  7  2013    12    25      557            600        -3      818
##  8  2013    12    25      559            600        -1      855
##  9  2013    12    25      559            600        -1      849
## 10  2013    12    25      600            600         0      850
## # ... with 709 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Logical Operators

equality ==
and , &
or |
not !

filter(flights, month == 11 | month == 12)

## # A tibble: 55,403 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    11     1        5           2359         6      352
##  2  2013    11     1       35           2250       105      123
##  3  2013    11     1      455            500        -5      641
##  4  2013    11     1      539            545        -6      856
##  5  2013    11     1      542            545        -3      831
##  6  2013    11     1      549            600       -11      912
##  7  2013    11     1      550            600       -10      705
##  8  2013    11     1      554            600        -6      659
##  9  2013    11     1      554            600        -6      826
## 10  2013    11     1      554            600        -6      749
## # ... with 55,393 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

nov_dec <- filter(flights, month %in% c(11, 12))
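
A small sketch on a made-up vector (not the flights data) of why `%in%` is a convenient shorthand, together with a common pitfall when writing the `|` version by hand. The vector `month` below is hypothetical.

```r
# Hypothetical month values, for illustration only
month <- c(1, 11, 12, 6)

long_form  <- month == 11 | month == 12  # FALSE  TRUE  TRUE FALSE
short_form <- month %in% c(11, 12)       # same result, easier to extend

# Pitfall: this does NOT mean "11 or 12".
# == is evaluated before |, so R reads it as (month == 11) | 12,
# and any non-zero number counts as TRUE.
pitfall <- month == 11 | 12              # TRUE TRUE TRUE TRUE
```

Inside `filter()`, the pitfall version would therefore keep every row instead of only November and December.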

# De Morgan's law 
# !(x | y) == !x & !y 
filter(flights, !(arr_delay > 120 | dep_delay > 120))

## # A tibble: 316,050 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 316,040 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

filter(flights, arr_delay <= 120, dep_delay <=120)

## # A tibble: 316,050 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 316,040 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
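
The two filters above return the same rows. A minimal sketch, assuming a made-up tibble `toy` that also contains missing delays, to check that the negated and the rewritten conditions agree even when `NA` is involved:

```r
library(dplyr)

# Hypothetical toy data, including missing arrival delays
toy <- tibble(
  arr_delay = c(10, 130, NA, 50, NA),
  dep_delay = c(5, 20, 200, 125, 10)
)

# !(x | y)
negated <- filter(toy, !(arr_delay > 120 | dep_delay > 120))

# !x & !y, written with <= and a comma (comma means "and")
rewritten <- filter(toy, arr_delay <= 120, dep_delay <= 120)

nrow(negated)                        # 1: only the first row passes
isTRUE(all.equal(negated, rewritten))
```

Rows whose condition evaluates to `NA` are dropped by both versions, which is why the row counts match.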

Missing Values

If a value is missing (NA), almost any arithmetic operation or comparison involving it also returns NA, because the result is unknown.

NA > 5
10 == NA 
NA + 10 
NA/2 
NA == NA # we don't know  

x <- NA 
y <- NA 
x == y # we don't know 

is.na(x)

df <- tibble(x = c(1, NA, 3)) # convert a vector (1-dimensional) to a tibble (2-dimensional) 
filter(df, x > 1) # excluding cases with NA 
filter(df, is.na(x) | x > 1) # including cases with NA
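
A short base R sketch of the results these expressions produce, plus the few cases where the answer does not depend on the missing value. The vector `x` is redefined here for illustration only.

```r
# Operations with an unknown value are usually unknown too ...
NA > 5       # NA
NA + 10      # NA
NA == NA     # NA

# ... except when the answer is the same whatever the missing value is
NA | TRUE    # TRUE
NA & FALSE   # FALSE
NA^0         # 1

# Use is.na() rather than == to test for missingness
x <- c(1, NA, 3)
missing_flags <- is.na(x)        # FALSE  TRUE FALSE
n_missing     <- sum(is.na(x))   # 1
```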

3. Arrange Rows with `arrange()`

Sort the data by selected columns (in either ascending or descending order).

arrange(flights, year, month, day) # by default in ascending order (the smallest value on the top)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

arrange(flights, desc(arr_delay)) # in descending order (the largest value on the top)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     7    22     2257            759       898      121
##  9  2013    12     5      756           1700       896     1058
## 10  2013     5     3     1133           2055       878     1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

df <- tibble(x = c(5, 2, NA)) 
arrange(df, x) # missing values are always sorted at the end (at the bottom of the tibble)

## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3    NA

arrange(df, desc(x)) # missing values stay at the end even in descending order

## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA
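
When several columns are given, each additional column breaks ties in the preceding ones. A minimal sketch on a made-up tibble (`toy`, `carrier`, and `delay` are hypothetical names) that sorts by carrier first and then by delay in descending order:

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy <- tibble(
  carrier = c("UA", "AA", "UA", "AA", "DL"),
  delay   = c(30, NA, 5, 12, 30)
)

# First key: carrier (ascending); second key: delay (descending)
by_carrier_then_delay <- arrange(toy, carrier, desc(delay))

by_carrier_then_delay$carrier  # "AA" "AA" "DL" "UA" "UA"
by_carrier_then_delay$delay    # 12 NA 30 30 5
```

Within the "AA" rows, the missing delay still comes after the observed one.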

4. Select Columns with `select()`

Create a tibble with selected variables.

select(flights, year, month, day) # select columns by variable name

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

select(flights, year:day) # select columns between year and day (inclusive)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

select(flights, -(year:day)) # select all columns except those from year to day

## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515         2      830            819        11
##  2      533            529         4      850            830        20
##  3      542            540         2      923            850        33
##  4      544            545        -1     1004           1022       -18
##  5      554            600        -6      812            837       -25
##  6      554            558        -4      740            728        12
##  7      555            600        -5      913            854        19
##  8      557            600        -3      709            723       -14
##  9      557            600        -3      838            846        -8
## 10      558            600        -2      753            745         8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Several helper functions:

starts_with("abc")
ends_with("xyz")
contains("ijk")
matches("(.)\\1")
num_range("x", 1:3) when numbers are part of variable names

rename(flights, tail_num = tailnum) # when renaming, also keep all the other variables that are not mentioned 
select(flights, time_hour, air_time, everything()) # when reordering columns
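
A sketch of what each helper picks, using a made-up tibble whose column names imitate those in flights (the columns `id` and `x1` to `x3` are hypothetical and exist only to exercise the helpers):

```r
library(dplyr)

# Hypothetical toy data; only the column names matter here
toy <- tibble(
  id = 1:2,
  dep_time = c(517, 533), arr_time = c(830, 850),
  dep_delay = c(2, 4),    arr_delay = c(11, 20),
  x1 = 0, x2 = 0, x3 = 0
)

names(select(toy, starts_with("dep")))       # "dep_time" "dep_delay"
names(select(toy, ends_with("delay")))       # "dep_delay" "arr_delay"
names(select(toy, contains("_")))            # the four names with an underscore
names(select(toy, matches("(.)\\1")))        # names with a repeated character: "arr_time" "arr_delay"
names(select(toy, num_range("x", 1:3)))      # "x1" "x2" "x3"
names(select(toy, arr_delay, everything()))  # arr_delay first, then all the rest
```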

5. Add New Variables with `mutate()`

Always adds new **variables** at the end of the dataset.

(flights_sml <- select(flights, 
                      year:day, 
                      ends_with("delay"), 
                      distance, 
                      air_time))

## # A tibble: 336,776 x 7
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1         2        11     1400      227
##  2  2013     1     1         4        20     1416      227
##  3  2013     1     1         2        33     1089      160
##  4  2013     1     1        -1       -18     1576      183
##  5  2013     1     1        -6       -25      762      116
##  6  2013     1     1        -4        12      719      150
##  7  2013     1     1        -5        19     1065      158
##  8  2013     1     1        -3       -14      229       53
##  9  2013     1     1        -3        -8      944      140
## 10  2013     1     1        -2         8      733      138
## # ... with 336,766 more rows

mutate(flights_sml, 
       gain = arr_delay - dep_delay, 
       speed = distance / air_time * 60)

## # A tibble: 336,776 x 9
##     year month   day dep_delay arr_delay distance air_time  gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227     9  370.
##  2  2013     1     1         4        20     1416      227    16  374.
##  3  2013     1     1         2        33     1089      160    31  408.
##  4  2013     1     1        -1       -18     1576      183   -17  517.
##  5  2013     1     1        -6       -25      762      116   -19  394.
##  6  2013     1     1        -4        12      719      150    16  288.
##  7  2013     1     1        -5        19     1065      158    24  404.
##  8  2013     1     1        -3       -14      229       53   -11  259.
##  9  2013     1     1        -3        -8      944      140    -5  405.
## 10  2013     1     1        -2         8      733      138    10  319.
## # ... with 336,766 more rows

mutate(flights_sml, # refer to a just-created variable 
       gain = arr_delay - dep_delay, 
       hours = air_time / 60, 
       gain_per_hour = gain/hours)

## # A tibble: 336,776 x 10
##     year month   day dep_delay arr_delay distance air_time  gain hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227     9 3.78 
##  2  2013     1     1         4        20     1416      227    16 3.78 
##  3  2013     1     1         2        33     1089      160    31 2.67 
##  4  2013     1     1        -1       -18     1576      183   -17 3.05 
##  5  2013     1     1        -6       -25      762      116   -19 1.93 
##  6  2013     1     1        -4        12      719      150    16 2.5  
##  7  2013     1     1        -5        19     1065      158    24 2.63 
##  8  2013     1     1        -3       -14      229       53   -11 0.883
##  9  2013     1     1        -3        -8      944      140    -5 2.33 
## 10  2013     1     1        -2         8      733      138    10 2.3  
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

transmute(flights, # keep only new variables 
          gain = arr_delay - dep_delay, 
          hours = air_time / 60, 
          gain_per_hour = gain / hours)

## # A tibble: 336,776 x 3
##     gain hours gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1     9 3.78           2.38
##  2    16 3.78           4.23
##  3    31 2.67          11.6 
##  4   -17 3.05          -5.57
##  5   -19 1.93          -9.83
##  6    16 2.5            6.4 
##  7    24 2.63           9.11
##  8   -11 0.883        -12.5 
##  9    -5 2.33          -2.14
## 10    10 2.3            4.35
## # ... with 336,766 more rows
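
The difference between the two verbs is easiest to see on a tiny example. A minimal sketch, assuming a made-up two-row tibble `toy` with values borrowed from the output above:

```r
library(dplyr)

# Hypothetical toy data (two flights)
toy <- tibble(distance = c(1400, 229), air_time = c(227, 53))

# mutate() keeps the existing columns and appends the new ones
added <- mutate(toy,
                hours = air_time / 60,
                speed = distance / hours)   # uses `hours`, created one line earlier

# transmute() keeps only the columns it creates
only_new <- transmute(toy,
                      hours = air_time / 60,
                      speed = distance / hours)

names(added)     # "distance" "air_time" "hours" "speed"
names(only_new)  # "hours" "speed"
```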

Useful Creation Functions

# modular arithmetic %/% (integer division) & %% (remainder)
transmute(flights, 
          dep_time, 
          hour = dep_time %/% 100,  # integer division
          minute = dep_time %% 100) # remainder

## # A tibble: 336,776 x 3
##    dep_time  hour minute
##       <int> <dbl>  <dbl>
##  1      517     5     17
##  2      533     5     33
##  3      542     5     42
##  4      544     5     44
##  5      554     5     54
##  6      554     5     54
##  7      555     5     55
##  8      557     5     57
##  9      557     5     57
## 10      558     5     58
## # ... with 336,766 more rows

# simple ranking, which ignores NA; more variants are available  
y <- c(1, 2, 2, NA, 3, 4) 
min_rank(y)

## [1]  1  2  2 NA  4  5

min_rank(desc(y))

## [1]  5  3  3 NA  2  1
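
The ranking variants mentioned above differ in how they treat ties. A short sketch comparing three of them on the same vector `y`:

```r
library(dplyr)

y <- c(1, 2, 2, NA, 3, 4)

r_min   <- min_rank(y)    # ties share the smallest rank, then a gap:  1 2 2 NA 4 5
r_dense <- dense_rank(y)  # ties share a rank, no gap afterwards:      1 2 2 NA 3 4
r_row   <- row_number(y)  # ties are broken by position:               1 2 3 NA 4 5
```

All three leave the missing value unranked.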

6. Grouped summaries with `summarize()` (powerful with `group_by()`)

Collapses a dataframe into a single row.

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

Powerful when combined with group_by: dplyr verbs are applied to individual groups

by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows
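
To see exactly what a grouped summary computes, here is a sketch on a made-up five-row tibble (`toy` is hypothetical) small enough to check by hand. Recent dplyr versions may also print a message about how the output is grouped; that message is informational.

```r
library(dplyr)

# Hypothetical toy data: five flights over three (month, day) combinations
toy <- tibble(
  month     = c(1, 1, 1, 2, 2),
  day       = c(1, 1, 2, 1, 1),
  dep_delay = c(10, NA, 4, 0, 6)
)

by_day <- group_by(toy, month, day)
daily  <- summarize(by_day,
                    delay   = mean(dep_delay, na.rm = TRUE),  # mean of the non-missing delays
                    flights = n())                            # rows per group, NA included

daily$delay      # 10 4 3
daily$flights    # 2 1 2
group_vars(daily)  # "month": summarize() peels off the last grouping level
```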

Combining Multiple Operations with the Pipe (or then)

x %>% f(y) == f(x, y)
x %>% f(y) %>% g(z) == g(f(x, y), z)
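
These equivalences can be checked directly. A minimal sketch with two hypothetical helper functions, `f` and `g`, defined only for this illustration:

```r
library(dplyr)   # makes the %>% pipe available

# Hypothetical helpers, not part of any package
f <- function(x, y) x + y
g <- function(x, z) x * z

piped  <- 1 %>% f(2) %>% g(10)   # read left to right: add 2, then multiply by 10
nested <- g(f(1, 2), 10)         # read inside out: the same computation

identical(piped, nested)         # TRUE; both are 30
```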

First, do multiple operations without pipes.

by_dest <- group_by(flights, dest)  
delay <- summarize(by_dest, 
                   count = n(), # number of observations, or rows
                   dist = mean(distance, na.rm = TRUE), 
                   delay = mean(arr_delay, na.rm = TRUE)
                   )
delay <- filter(delay, count > 20, dest != "HNL") # drop destinations with 20 or fewer flights, and Honolulu (an outlier)

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha =1/3) + 
  geom_smooth(se = FALSE) # hide the confidence band around the smoothed line

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Next, do the same operations with pipes.

#with pipes 
delays <- flights %>% 
  group_by(dest) %>%
  summarize(
    count = n(), 
    dist  = mean(distance, na.rm = TRUE), 
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL") # drop destinations with 20 or fewer flights, and Honolulu (an outlier)

ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha =1/3) + 
  geom_smooth(se = FALSE) # hide the confidence band around the smoothed line

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Why use the pipe? (more details in Chapter 14):

No need to save intermediate, temporary-use outputs, so there is no need to come up with good names for them, which is painful.

The order of individual operations is more intuitive: nested calls read from inside to outside, whereas piped calls read from one line to the next.

Missing Values

# without na.rm = TRUE  
flights %>% 
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay))

# with na.rm = TRUE
flights %>% 
  group_by(year, month, day) %>%
  summarize(mean = mean(dep_delay, na.rm = TRUE))
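
The effect of `na.rm` is clearest on a small example. A sketch on a made-up tibble (`toy` is hypothetical) in which one of the two days has a missing delay:

```r
library(dplyr)

# Hypothetical toy data: day 1 has one missing delay, day 2 has none
toy <- tibble(
  day       = c(1, 1, 2, 2),
  dep_delay = c(10, NA, 4, 6)
)

daily <- toy %>%
  group_by(day) %>%
  summarize(
    raw       = mean(dep_delay),                # NA as soon as one input is NA
    cleaned   = mean(dep_delay, na.rm = TRUE),  # NAs dropped before averaging
    n_missing = sum(is.na(dep_delay))           # how many values were dropped
  )

daily$raw        # NA 5
daily$cleaned    # 10 5
daily$n_missing  # 1 0
```

Reporting `n_missing` next to the cleaned mean makes it visible how much data each average is based on.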

A clean data set from flights

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

Counts

delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarize(
    delay = mean(arr_delay), 
    n = n()
  )

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)

Next, combine dplyr verbs and ggplot2 commands.

delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarize(
    delay = mean(arr_delay), 
    n = n()
  )

ggplot(data=delays, mapping = aes(x = n, y = delay)) +
  geom_point(alpha=1/10)

# remove airplanes with few flights (n <= 25)
# combine a dplyr verb with ggplot
delays %>%
  filter(n > 25) %>%
  ggplot(mapping = aes(x = n, y = delay)) + # ggplot2 layers are added with +, not %>%
  geom_point(alpha=1/10)

Useful Summary Functions

Count the number of non-missing values: sum(!is.na(x))
Count the number of distinct (unique) values: n_distinct(x)

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
  group_by(dest) %>%
  summarize(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))

## # A tibble: 104 x 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ... with 94 more rows

not_cancelled %>%
  count(dest) %>% 
  arrange(desc(n))

## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ATL   16837
##  2 ORD   16566
##  3 LAX   16026
##  4 BOS   15022
##  5 MCO   13967
##  6 CLT   13674
##  7 SFO   13173
##  8 FLL   11897
##  9 MIA   11593
## 10 DCA    9111
## # ... with 94 more rows

not_cancelled %>% 
  count(tailnum, wt = distance) %>% 
  arrange(desc(n))

## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 N328AA  929090
##  2 N338AA  921172
##  3 N335AA  902271
##  4 N327AA  900482
##  5 N323AA  839468
##  6 N319AA  837924
##  7 N336AA  833136
##  8 N329AA  825826
##  9 N324AA  786159
## 10 N339AA  783648
## # ... with 4,027 more rows
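
With `wt`, `count()` sums the weight column within each group instead of counting rows. A minimal sketch on a made-up tibble (`toy` is hypothetical), compared with the equivalent `group_by()` plus `summarize()`:

```r
library(dplyr)

# Hypothetical toy data: two planes, three flights
toy <- tibble(
  tailnum  = c("N1", "N1", "N2"),
  distance = c(100, 250, 400)
)

rows_per_plane  <- toy %>% count(tailnum)                  # n = 2, 1 (number of flights)
miles_per_plane <- toy %>% count(tailnum, wt = distance)   # n = 350, 400 (total distance)

# The weighted count written out by hand
by_hand <- toy %>%
  group_by(tailnum) %>%
  summarize(n = sum(distance))
```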

When x is a logical vector (the result of a condition), sum(x) gives the number of TRUEs in x
and mean(x) gives the proportion of TRUEs

not_cancelled %>% 
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500)) %>% 
  arrange(desc(n_early))

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     6    28      32
##  2  2013     4    10      30
##  3  2013     7    28      30
##  4  2013     3    18      29
##  5  2013     7     7      29
##  6  2013     7    10      27
##  7  2013     6    27      25
##  8  2013     6    13      24
##  9  2013     3     8      22
## 10  2013     7    22      22
## # ... with 355 more rows

not_cancelled %>% 
  group_by(year, month, day) %>%
  summarize(hour_perc = mean(arr_delay >60)) %>% 
  arrange(desc(hour_perc))

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day hour_perc
##    <int> <int> <int>     <dbl>
##  1  2013     3     8     0.525
##  2  2013     6    13     0.402
##  3  2013     7     1     0.401
##  4  2013     7    22     0.366
##  5  2013     5    23     0.361
##  6  2013     8     8     0.361
##  7  2013     9    12     0.350
##  8  2013    12    17     0.339
##  9  2013     7    10     0.332
## 10  2013     6    24     0.323
## # ... with 355 more rows
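
Both summaries work because TRUE counts as 1 and FALSE as 0. A base R sketch with four made-up arrival delays (the values are hypothetical):

```r
# Hypothetical arrival delays in minutes
arr_delay <- c(5, 75, 130, -3)

late <- arr_delay > 60     # FALSE  TRUE  TRUE FALSE

n_late    <- sum(late)     # 2: number of flights more than an hour late
prop_late <- mean(late)    # 0.5: share of flights more than an hour late
```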

Weekly HW #2

This homework has FOUR questions from R for Data Science.

Submit your R script file that includes your answers to individual questions (use #comment for discussion) on Canvas.

Note that online chapters differ from those of the print copy. Check the page numbers below.

Due at 11:59:00 PM (EST) on Sunday, January 27th.

3.6.1. Geometric Objects (in the print copy, p.21)

Recreate the R code necessary to generate the following graphs.

5.2.4 Filter rows with filter() (in the print copy, pp.49-50)

Find all flights that

Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Were operated by United, American, or Delta
Departed in summer (July, August, and September)
Arrived more than two hours late, but didn’t leave late
Were delayed by at least an hour, but made up over 30 minutes in flight
Departed between midnight and 6am (inclusive)

5.3.1 Arrange rows with arrange() (in the print copy, p.51)

How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

5.5.2 Add new variables with mutate() (in the print copy, p.58)

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

January 23 Wednesday, 2019
