dplyrdplyr of R for Data Science (Wickham & Grolemund, 2017). This document is prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any question, contact the instructor, Yongsung Lee, Ph.D. via yongsung.lee(at)gatech.edu.install.packages("tidyverse", repos = "http://cran.us.r-project.org", dependencies = TRUE)
install.packages(c("nycflights13", "gapminder", "Lahman"), repos = "http://cran.us.r-project.org", dependencies = TRUE)
library(tidyverse)
library(nycflights13)
What we do:
| Functions | Tasks |
|---|---|
filter() |
Choose observations (rows) by their values |
arrange() |
Reorder the rows |
select() |
Choose variables (columns) by their names |
mutate() |
Creat new variables (columns) as functions of existing variables |
summarize() |
Collapse many values down to a single summary |
Why we do:
Sample dataset nycflights13:
336,776 flights that departed from New York City 2013.
nycflights13::flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
tibble a new version of dataframe (similar to the MS Excel spreadsheet; for swtiching between two, use as_tibble / as.data.frame)int integerdbl doublechr characterdttm date-times—(not used here)—
lgl logical: TRUE or FALSEfctr factor (categorical variable)date datesfilter()Create a new tibble with those observations that meet a certain condition(s).
filter(flights, month == 1, day ==1)
## Warning: package 'bindrcpp' was built under R version 3.5.2
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 832 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
jan1 <- filter(flights, month == 1, day == 1)
(dec25 <- filter(flights, month == 12, day ==25)) # assign & print together
## # A tibble: 719 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 12 25 456 500 -4 649
## 2 2013 12 25 524 515 9 805
## 3 2013 12 25 542 540 2 832
## 4 2013 12 25 546 550 -4 1022
## 5 2013 12 25 556 600 -4 730
## 6 2013 12 25 557 600 -3 743
## 7 2013 12 25 557 600 -3 818
## 8 2013 12 25 559 600 -1 855
## 9 2013 12 25 559 600 -1 849
## 10 2013 12 25 600 600 0 850
## # ... with 709 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
==, &|!filter(flights, month == 11 | month == 12)
## # A tibble: 55,403 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 11 1 5 2359 6 352
## 2 2013 11 1 35 2250 105 123
## 3 2013 11 1 455 500 -5 641
## 4 2013 11 1 539 545 -6 856
## 5 2013 11 1 542 545 -3 831
## 6 2013 11 1 549 600 -11 912
## 7 2013 11 1 550 600 -10 705
## 8 2013 11 1 554 600 -6 659
## 9 2013 11 1 554 600 -6 826
## 10 2013 11 1 554 600 -6 749
## # ... with 55,393 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
nov_dec <- filter(flights, month %in% c(11, 12))
# De Morgan's law
# !(x | y) == !x & !y
filter(flights, !(arr_delay > 120 | dep_delay > 120))
## # A tibble: 316,050 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 316,040 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
filter(flights, arr_delay <= 120, dep_delay <=120)
## # A tibble: 316,050 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 316,040 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
If a value is missing, arithmetic calculation or logical comparison is not possible.
NA > 5
10 == NA
NA + 10
NA/2
NA == NA # we don't know
x <- NA
y <- NA
x == y # we don't know
is.na(x)
df <- tibble(x = c(1, NA, 3)) # convert a vector (1-dimensional) to a tibble (2-dimensional)
filter(df, x > 1) # excluding cases with NA
filter(df, is.na(x) | x > 1) # including cases with NA
arrange()Sort the data by selected columns (either in asceding/desceding order)
arrange(flights, year, month, day) # by default in ascending order (the smallest value on the top)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange(flights, desc(arr_delay)) # in descending order (the largest value on the top)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 7 22 2257 759 898 121
## 9 2013 12 5 756 1700 896 1058
## 10 2013 5 3 1133 2055 878 1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
df <- tibble(x = c(5, 2, NA))
arrange(df, x) # missing values are always sorted at the end (at the bottom of the tibble)
## # A tibble: 3 x 1
## x
## <dbl>
## 1 2
## 2 5
## 3 NA
arrange(df, desc(x)) # sorting order doesn't matter
## # A tibble: 3 x 1
## x
## <dbl>
## 1 5
## 2 2
## 3 NA
select()Create a tibble with selected variables.
select(flights, year, month, day) # select columns by variable name
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
select(flights, year:day) # select columns between year and day (inclusive)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
select(flights, -(year:day)) # select all columns except those from year to day
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2 830 819 11
## 2 533 529 4 850 830 20
## 3 542 540 2 923 850 33
## 4 544 545 -1 1004 1022 -18
## 5 554 600 -6 812 837 -25
## 6 554 558 -4 740 728 12
## 7 555 600 -5 913 854 19
## 8 557 600 -3 709 723 -14
## 9 557 600 -3 838 846 -8
## 10 558 600 -2 753 745 8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Several helper functions:
starts_with("abc")ends_with("xyz")contains("ijk")matches("(.)\\1")num_range("x", 1:3) when numbers are part of variable namesrename(flights, tail_num = tailnum) # when renaming, also keep all the other variables that are not mentioned
select(flights, time_hour, air_time, everything()) # when reordering columns
mutate()Always add a new **variable(s)* at the end.
(flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time))
## # A tibble: 336,776 x 7
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227
## 2 2013 1 1 4 20 1416 227
## 3 2013 1 1 2 33 1089 160
## 4 2013 1 1 -1 -18 1576 183
## 5 2013 1 1 -6 -25 762 116
## 6 2013 1 1 -4 12 719 150
## 7 2013 1 1 -5 19 1065 158
## 8 2013 1 1 -3 -14 229 53
## 9 2013 1 1 -3 -8 944 140
## 10 2013 1 1 -2 8 733 138
## # ... with 336,766 more rows
mutate(flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
## # A tibble: 336,776 x 9
## year month day dep_delay arr_delay distance air_time gain speed
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 9 370.
## 2 2013 1 1 4 20 1416 227 16 374.
## 3 2013 1 1 2 33 1089 160 31 408.
## 4 2013 1 1 -1 -18 1576 183 -17 517.
## 5 2013 1 1 -6 -25 762 116 -19 394.
## 6 2013 1 1 -4 12 719 150 16 288.
## 7 2013 1 1 -5 19 1065 158 24 404.
## 8 2013 1 1 -3 -14 229 53 -11 259.
## 9 2013 1 1 -3 -8 944 140 -5 405.
## 10 2013 1 1 -2 8 733 138 10 319.
## # ... with 336,766 more rows
mutate(flights_sml, # refer to a just created varaible
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain/hours)
## # A tibble: 336,776 x 10
## year month day dep_delay arr_delay distance air_time gain hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 9 3.78
## 2 2013 1 1 4 20 1416 227 16 3.78
## 3 2013 1 1 2 33 1089 160 31 2.67
## 4 2013 1 1 -1 -18 1576 183 -17 3.05
## 5 2013 1 1 -6 -25 762 116 -19 1.93
## 6 2013 1 1 -4 12 719 150 16 2.5
## 7 2013 1 1 -5 19 1065 158 24 2.63
## 8 2013 1 1 -3 -14 229 53 -11 0.883
## 9 2013 1 1 -3 -8 944 140 -5 2.33
## 10 2013 1 1 -2 8 733 138 10 2.3
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>
transmute(flights, # keep only new variables
gain = arr_delay - dep_delay,
hours = air_time/60,
gain_per_houw = gain/hours)
## # A tibble: 336,776 x 3
## gain hours gain_per_houw
## <dbl> <dbl> <dbl>
## 1 9 3.78 2.38
## 2 16 3.78 4.23
## 3 31 2.67 11.6
## 4 -17 3.05 -5.57
## 5 -19 1.93 -9.83
## 6 16 2.5 6.4
## 7 24 2.63 9.11
## 8 -11 0.883 -12.5
## 9 -5 2.33 -2.14
## 10 10 2.3 4.35
## # ... with 336,766 more rows
# modular arithmetic %/% (integer division) & %% (remainder)
transmute(flights,
dep_time,
hour = dep_time %/% 100, # integer division
minute = dep_time %% 100) # remainder
## # A tibble: 336,776 x 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # ... with 336,766 more rows
# simple ranking, which ignores NA; more variants are available
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1] 1 2 2 NA 4 5
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
summarize() (powerful with group_by())Collapses a dataframe into a single row.
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
Powerful when combined with group_by: dplyr verbs are applied to individual groups
by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [?]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ... with 355 more rows
x %>% f(y) == f(x, y)
x %>% f(y) >%> g(z) == g(f(x,y), z)
First, do multiple operations without pipes.
by_dest <- group_by(flights, dest)
delay <- summarize(by_dest,
count = n(), # number of observations, or rows
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count >20, dest != "HNL") # exluce Honolulu (outlier), as the arrival airport for each departure
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha =1/3) +
geom_smooth(se = FALSE) # remove the range of standard deviations for brevity
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Next, do the same operations with pipes.
#with pipes
delays <- flights %>%
group_by(dest) %>%
summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count>20, dest != "HNL") # exluce Honolulu (outlier), as the arrival airport for each departure
ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha =1/3) +
geom_smooth(se = FALSE) # remove the range of standard deviations for brevity
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Why to use? (more details on Chapter 14):
# without na.rm = TRUE
flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))
# with na.rm = TRUE
flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay, na.rm = TRUE))
A clean data set from flights
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
delay = mean(arr_delay),
n = n()
)
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
Next, combine dplyr verbs and ggplot2 commands.
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
delay = mean(arr_delay),
n = n()
)
ggplot(data=delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha=1/10)
# remove airplains with fewer flights (n<=25)
# combine a dplyr verb with ggplot
delays %>%
filter(n>25) %>%
ggplot(mapping = aes(x= n, y = delay)) + # ggplot2 does not work with %>% yet
geom_point(alpha=1/10)
sum(!is.na(x))n_distinct()not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(dest) %>%
summarize(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
## # A tibble: 104 x 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ... with 94 more rows
not_cancelled %>%
count(dest) %>%
arrange(desc(n))
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ATL 16837
## 2 ORD 16566
## 3 LAX 16026
## 4 BOS 15022
## 5 MCO 13967
## 6 CLT 13674
## 7 SFO 13173
## 8 FLL 11897
## 9 MIA 11593
## 10 DCA 9111
## # ... with 94 more rows
not_cancelled %>%
count(tailnum, wt = distance) %>%
arrange(desc(n))
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 N328AA 929090
## 2 N338AA 921172
## 3 N335AA 902271
## 4 N327AA 900482
## 5 N323AA 839468
## 6 N319AA 837924
## 7 N336AA 833136
## 8 N329AA 825826
## 9 N324AA 786159
## 10 N339AA 783648
## # ... with 4,027 more rows
sum(x) gives the number of TRUEs in x (=condition)mean(x) gives the proportionnot_cancelled %>%
group_by(year, month, day) %>%
summarize(n_early = sum(dep_time < 500)) %>%
arrange(desc(n_early))
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day n_early
## <int> <int> <int> <int>
## 1 2013 6 28 32
## 2 2013 4 10 30
## 3 2013 7 28 30
## 4 2013 3 18 29
## 5 2013 7 7 29
## 6 2013 7 10 27
## 7 2013 6 27 25
## 8 2013 6 13 24
## 9 2013 3 8 22
## 10 2013 7 22 22
## # ... with 355 more rows
not_cancelled %>%
group_by(year, month, day) %>%
summarize(hour_perc = mean(arr_delay >60)) %>%
arrange(desc(hour_perc))
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day hour_perc
## <int> <int> <int> <dbl>
## 1 2013 3 8 0.525
## 2 2013 6 13 0.402
## 3 2013 7 1 0.401
## 4 2013 7 22 0.366
## 5 2013 5 23 0.361
## 6 2013 8 8 0.361
## 7 2013 9 12 0.350
## 8 2013 12 17 0.339
## 9 2013 7 10 0.332
## 10 2013 6 24 0.323
## # ... with 355 more rows
This homework has FOUR questions from R for Data Science.
Submit your R script file that includes your answers to individual questions (use #comment for discussion) on Canvas.
Note that online chapters differ from those of the print copy. Check the page numbers below.
Due at 11:59:00 PM (EST) on January 27th Sunday.
filter() (in the print copy, pp.49-50)arrange() (in the print copy, p.51)arrange() to sort all missing values to the start? (Hint: use is.na()).mutate() (in the print copy, p.58)dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.