library(dplyr); library(magrittr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

This is the companion document to the R discussion on 15 June 2017.

In these examples, we will use the New York City flights 2013 dataset (from the nycflights13 package) and the Motor Trend Cars dataset (mtcars).

flights = nycflights13::flights

data(mtcars)

By the numbers:

Magrittr

The magrittr pipe operator, %>%, takes the value on its left side and passes it as the first argument to the function on its right side. While this doesn’t sound like a big deal, it really is. The pipe operator does this:

x %>% f() is equivalent to f(x)

For example, suppose I wanted to know the class of the ‘mpg’ variable in the mtcars dataset:

mtcars$mpg %>% class()
## [1] "numeric"

As you will see below, using the %>% operator will make code much more readable!
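
To make the readability point concrete, here is a minimal sketch that compares a nested call with its piped equivalent. It uses only the built-in mtcars data; the variable names nested and piped are illustrative and not part of the original discussion.

```r
library(magrittr)

data(mtcars)

# Nested calls read inside-out: mean() runs first, then round().
nested <- round(mean(mtcars$mpg), 1)

# The piped version reads left to right, in the order the steps run.
# x %>% round(1) is evaluated as round(x, 1).
piped <- mtcars$mpg %>% mean() %>% round(1)

# Both forms compute the same value.
identical(nested, piped)
```

The last line returns TRUE, since the pipe only changes how the calls are written, not what they compute.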

Dplyr and the nycflights dataset

Now, let’s take a look at the NYC Flights Dataset. I’m going to simply add some comments in the code below.

#how big is this dataset?

flights %>% dim()
## [1] 336776     19
# what are the names of the fields?

flights %>% names()
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

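A typical dplyr pipeline has four basic steps:

1. Data: start from a data frame
2. Filter: keep only the rows of interest
3. Group: split the rows into groups
4. Summarize: compute summary statistics for each group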
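
As a rough sketch of how these four steps line up in code, the example below runs them on the built-in mtcars data so that it works without any extra download. The hp > 100 cutoff and the column names count and avg_mpg are arbitrary choices made for illustration.

```r
library(dplyr)

data(mtcars)

result <- mtcars %>%                           # 1. Data: start from a data frame
  filter(hp > 100) %>%                         # 2. Filter: keep cars above 100 hp
  group_by(cyl) %>%                            # 3. Group: one group per cylinder count
  summarize(count = n(), avg_mpg = mean(mpg))  # 4. Summarize: one row per group

result
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets the steps chain together with %>%.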

For example, I might be interested in knowing the average departure delay by carrier, considering only the top 10 carriers by number of flights:

flights %>%
  group_by(carrier) %>%
  summarize(count = n(),
            delay = mean(dep_delay, na.rm = TRUE),
            maxDelay = max(dep_delay, na.rm = TRUE)) %>%
  top_n(n = 10, wt = count)
## # A tibble: 10 × 4
##    carrier count     delay maxDelay
##      <chr> <int>     <dbl>    <dbl>
## 1       9E 18460 16.725769      747
## 2       AA 32729  8.586016     1014
## 3       B6 54635 13.022522      502
## 4       DL 48110  9.264505      960
## 5       EV 54173 19.955390      548
## 6       MQ 26397 10.552041     1137
## 7       UA 58665 12.106073      483
## 8       US 20536  3.782418      500
## 9       VX  5162 12.869421      653
## 10      WN 12275 17.711744      471

Note that the summarized output is sorted by the grouping variable, here alphabetically by carrier, and top_n() does not reorder the rows. You can specify other orderings by using the arrange() function. We could also filter on specific airlines and months, like this:
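
Before the filtering example, here is a small sketch of reordering a summary with arrange(). It again uses the built-in mtcars data rather than the flights data so that it is self-contained; the column names count and avg_mpg are illustrative.

```r
library(dplyr)

data(mtcars)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(count = n(), avg_mpg = mean(mpg)) %>%
  arrange(desc(count))   # largest group first instead of ordering by cyl

by_cyl
```

Dropping desc() sorts in ascending order, and arrange() accepts several columns if ties need to be broken.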

flights %>%
  filter(month == 12) %>%
  filter(carrier %in% c("AA", "UA", "DL")) %>%
  group_by(carrier) %>%
  summarize(count = n(),
            delay = mean(dep_delay, na.rm = TRUE),
            maxDelay = max(dep_delay, na.rm = TRUE))
## # A tibble: 3 × 4
##   carrier count    delay maxDelay
##     <chr> <int>    <dbl>    <dbl>
## 1      AA  2705 11.71143      896
## 2      DL  4093 10.79024      849
## 3      UA  4931 17.72274      392

ggplot2

It’s all well and good to look at tables, but let’s really dig into this data. First, is there a trend in arrival delay over time?

flights %>% filter(arr_delay > .5) %>% ggplot(aes(x = time_hour, y = arr_delay)) + geom_point()

This chart is not terribly insightful. Let’s see if we can restrict it to the month of December and break it out by departure airport.

flights %>% filter(arr_delay > .5) %>% filter(month == 12) %>% ggplot(aes(x = time_hour, y = arr_delay)) + geom_point() + facet_wrap(~origin)

What’s the relationship between air time and distance?

flights %>% ggplot(aes(x = air_time, y = distance)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 9430 rows containing non-finite values (stat_smooth).
## Warning: Removed 9430 rows containing missing values (geom_point).

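Here we’re using geom_smooth() to add a smoothed trendline. Because this dataset has more than 1,000 rows, geom_smooth() defaults to a GAM fit rather than loess, as the message above shows. Other trendlines are possible.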
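
As one possible alternative, the sketch below requests a straight-line fit by passing method = "lm" to geom_smooth(). The choice of a linear model, the alpha value, and the object name p are assumptions made for illustration. Rows with a missing air_time are filtered out first, which should avoid the missing-value warnings shown above.

```r
library(dplyr)
library(ggplot2)

flights <- nycflights13::flights

p <- flights %>%
  filter(!is.na(air_time)) %>%                 # drop flights with no recorded air time
  ggplot(aes(x = air_time, y = distance)) +
  geom_point(alpha = 0.1) +                    # transparency helps with overplotting
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # linear trendline, no band

p
```

Setting method and formula explicitly also suppresses the message about which smoother was picked automatically.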

Another graph example - mtcars

mtcars %>% ggplot(aes(x = mpg, y = hp)) + geom_point() + geom_smooth() + ggtitle("Smoothed HP and mpg, Motor Trend Cars")
## `geom_smooth()` using method = 'loess'

mtcars %>% ggplot(aes(x = mpg, y = hp, color = wt, size = cyl)) + geom_point() + facet_wrap(~carb)