library(dplyr); library(magrittr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
This is the companion document to the R discussion on 15 June 2017.
In these examples, we will use the New York City flights 2013 dataset (from the nycflights13 package) and the Motor Trend Cars dataset (mtcars).
flights = nycflights13::flights
data(mtcars)
By the numbers:
The magrittr pipe operator, %>%, takes the value on its left side and passes it as the first argument to the function on its right side. While this doesn’t sound like a big deal, it really is. The operator does this:
x %>% f() is equivalent to f(x)
For example, suppose I wanted to know the class of the ‘mpg’ variable in the mtcars dataset:
mtcars$mpg %>% class()
## [1] "numeric"
As you will see below, using the %>% operator will make code much more readable!
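To make the rewriting rule concrete, here is a minimal sketch on mtcars. It assumes only that magrittr is loaded; the variable names (piped, nested, fit_piped, fit_plain) are introduced here for illustration. It also shows the `.` placeholder, which magrittr provides for cases where the left side should not go into the first argument.

```r
library(magrittr)

# x %>% f(y) is rewritten as f(x, y): the left side becomes the first argument
piped  <- mtcars$mpg %>% round(0) %>% head(3)
nested <- head(round(mtcars$mpg, 0), 3)
identical(piped, nested)  # TRUE

# When the data is not the first argument, use "." to say where it goes
fit_piped <- mtcars %>% lm(mpg ~ wt, data = .)
fit_plain <- lm(mpg ~ wt, data = mtcars)
all.equal(coef(fit_piped), coef(fit_plain))
```

The piped version reads left to right in the order the steps happen, while the nested version has to be read from the inside out.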
Now, let’s take a look at the NYC Flights Dataset. I’m going to simply add some comments in the code below.
# how big is this dataset?
flights %>% dim()
## [1] 336776 19
# what are the names of the fields?
flights %>% names()
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
The four basic steps in a dplyr pipeline are: 1. Data 2. Filter 3. Group 4. Summarize
For example, I might be interested in the average departure delay by carrier, considering only the 10 carriers with the most flights:
flights %>% group_by(carrier) %>% summarize(count = n(), delay = mean(dep_delay, na.rm = TRUE), maxDelay = max(dep_delay, na.rm = TRUE)) %>% top_n(n = 10, wt = count)
## # A tibble: 10 × 4
## carrier count delay maxDelay
## <chr> <int> <dbl> <dbl>
## 1 9E 18460 16.725769 747
## 2 AA 32729 8.586016 1014
## 3 B6 54635 13.022522 502
## 4 DL 48110 9.264505 960
## 5 EV 54173 19.955390 548
## 6 MQ 26397 10.552041 1137
## 7 UA 58665 12.106073 483
## 8 US 20536 3.782418 500
## 9 VX 5162 12.869421 653
## 10 WN 12275 17.711744 471
Note that the rows come back sorted alphabetically by carrier: summarize() orders its output by the grouping variable, and top_n() does not re-sort the rows it keeps. You can specify another ordering with the arrange() function. We could also filter on specific airlines and months, like this:
flights %>% filter(month == 12) %>% filter(carrier %in% c("AA", "UA", "DL")) %>% group_by(carrier) %>% summarize(count = n(), delay = mean(dep_delay, na.rm = TRUE), maxDelay = max(dep_delay, na.rm = TRUE))
## # A tibble: 3 × 4
## carrier count delay maxDelay
## <chr> <int> <dbl> <dbl>
## 1 AA 2705 11.71143 896
## 2 DL 4093 10.79024 849
## 3 UA 4931 17.72274 392
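Returning to the point about row order: a minimal sketch of sorting the per-carrier summary with arrange(), worst average delay first. It assumes the nycflights13 package is installed, as elsewhere in this document; the name by_delay is introduced here for illustration.

```r
library(dplyr)

flights <- nycflights13::flights  # same data as above

# Summarize by carrier, then sort by mean departure delay, largest first
by_delay <- flights %>%
  group_by(carrier) %>%
  summarize(count = n(), delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(delay))

head(by_delay, 3)
```

Dropping desc() sorts in ascending order instead, and passing several columns to arrange() breaks ties by the later columns.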
It’s all well and good to look at tables, but let’s really dig into this data. First, is there a trend in arrival delay over time?
flights %>% filter(arr_delay > .5) %>% ggplot(aes(x = time_hour, y = arr_delay)) + geom_point()
This chart is not terribly insightful. Let’s see if we can break it out by departure airport, looking only at the month of December.
flights %>% filter(arr_delay > .5) %>% filter(month == 12) %>% ggplot(aes(x = time_hour, y = arr_delay)) + geom_point() + facet_wrap(~origin)
What’s the relationship between air time and distance?
flights %>% ggplot(aes(x = air_time, y = distance)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 9430 rows containing non-finite values (stat_smooth).
## Warning: Removed 9430 rows containing missing values (geom_point).
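The two warnings above indicate that ggplot2 dropped rows with missing values before drawing. A minimal sketch, assuming the missing values sit in air_time or distance, of counting and removing those rows explicitly so the drop is visible in the code (the name complete_flights is introduced here for illustration):

```r
library(dplyr)

flights <- nycflights13::flights  # same data as above

# How many values are missing in each plotted column?
flights %>%
  summarize(missing_air_time = sum(is.na(air_time)),
            missing_distance = sum(is.na(distance)))

# Keep only rows where both plotted columns are present
complete_flights <- flights %>%
  filter(!is.na(air_time), !is.na(distance))

nrow(flights) - nrow(complete_flights)  # number of rows dropped
```

Piping complete_flights into the same ggplot() call should produce the same chart without the warnings.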
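In the flights plot above, geom_smooth() chose the 'gam' method, as the message shows; for small datasets such as mtcars it defaults to a loess trendline, as in the next plot. Other trendlines are possible by setting the method argument.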
mtcars %>% ggplot(aes(x = mpg, y = hp)) + geom_point() + geom_smooth() + ggtitle("Smoothed HP and mpg, Motor Trend Cars")
## `geom_smooth()` using method = 'loess'
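As one example of a different trendline, here is a sketch that swaps the default smoother for a straight-line fit by setting method = "lm". The plot title and the object name p_lm are introduced here for illustration; se = FALSE simply hides the confidence band.

```r
library(ggplot2)

# Linear fit instead of the default smoother
p_lm <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  ggtitle("Linear fit of HP on mpg, Motor Trend Cars")

p_lm
```

Setting formula explicitly also suppresses the message that geom_smooth() otherwise prints about the formula it chose.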
mtcars %>% ggplot(aes(x = mpg, y = hp, color = wt, size = cyl)) + geom_point() + facet_wrap(~carb)