Load Libraries

library(nycflights13)
library(tidyverse)
library(openintro)


0. Introduction

In this lecture, we will continue to study data transformation using dplyr functions:


1. The mutate() function

In many situations, we hope to create new variables from the original data for our convenience when exploring or analyzing data.

For example, we want to compute the average speed of each flight, which is not included in the original data. We can do the following:

my_data <- mutate(flights, speed = distance/air_time*60)

This creates a new column named speed which is computed from the formula distance * 60 / air_time, which is the average speed in the unit of miles per hour. The new column is added to a new data set named my_data.

select(my_data, speed)
## # A tibble: 336,776 × 1
##    speed
##    <dbl>
##  1  370.
##  2  374.
##  3  408.
##  4  517.
##  5  394.
##  6  288.
##  7  404.
##  8  259.
##  9  405.
## 10  319.
## # … with 336,766 more rows

Lab Exercise: Find the flight with the highest average speed. What is its destination?


How to use mutate() function

The mutate() function creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).

The template to call mutate() function is

mutate(<data_name>, <variable_name> = <formula>)

If the <data_name> matches an existing column, then we would modify the column or delete it (if the formula is NULL). Otherwise we would create a new column following the formula.

If we only hope to keep new columns in the new data set, we can use transmute:

transmute(<data_name>, <new_variable_name> = <formula>)

In the example above, we may do the following as well:

my_data <- transmute(flights, speed = distance/air_time*60)


Useful creation functions

There are many functions for creating new variables that you can use with mutate().

  • The key property is that the function must be vectorized:
    • it must take a vector of values as input,
    • return a vector with the same number of values as output.

There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:


Arithmetic operators

  • Arithmetic operators: +, -, *, /, ^. These are all vectorized, using the so called “recycling rules”.

  • If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.

  • Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.


Example

Let’s create a new variable price/carat for the data set diamonds, which accounts for the unit price of a diamond per carat.

my_data <- mutate(diamonds, unit_price = price/carat)

ggplot(my_data) + geom_histogram(aes(unit_price)) +
  labs(title = "Diamond unit price per carat", x = "dollar per carat", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

We may expect that this new variable has already incorporated the effect of carat, and then we can study the effect of color simply by creating a bar plot on this variable.

ggplot(filter(my_data, between(carat, 1, 2))) + 
  stat_summary(mapping = aes(x = color, y = unit_price, fill = color), geom = "bar", fun = "mean") +
  labs(title = "Diamond unit price by color", x = "Color Grade", y = "Mean Unit Price (dollar per carat)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

Here we pick up diamonds between 1 carat and 2 carat since the unit price in this group has less variance (when the carat is too low the value price/carat can suffer much greater variable since carat is in the denominator.) By plotting the mean unit price we see a reasonable trend to reflect the effect of color.


Modular arithmetic

  • Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:
    transmute(flights,
      dep_time,
      hour = dep_time %/% 100,
      minute = dep_time %% 100 )
## # A tibble: 336,776 × 3
##    dep_time  hour minute
##       <int> <dbl>  <dbl>
##  1      517     5     17
##  2      533     5     33
##  3      542     5     42
##  4      544     5     44
##  5      554     5     54
##  6      554     5     54
##  7      555     5     55
##  8      557     5     57
##  9      557     5     57
## 10      558     5     58
## # … with 336,766 more rows


Logs: log(), log2(), log10()

  • Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive, a feature we’ll come back to in modelling.

  • All else being equal, we recommend using log2() because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.

Example

In the loans_full_schema data set, the annual income variable spans over a few orders of magnitude. It can be useful to take its logarithm as a new variable.

library(openintro)

my_data <- mutate(loans_full_schema, log_income = log10(annual_income))

ggplot(filter(my_data, between(annual_income, 10000, 1000000))) + 
  geom_point(aes(loan_amount, annual_income)) +
  geom_smooth(aes(loan_amount, annual_income)) + 
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Annual Income vs Loan Amount", x = "Loan Amount ($)", y = "Annual Income ($)")  +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))
ggplot(filter(my_data, between(annual_income, 1000, 1000000))) + 
  geom_point(aes(loan_amount, log_income)) +
  geom_smooth(aes(loan_amount, log_income)) +
  ylim(4,6) +
  labs(title = "Log Annual Income vs Loan Amount", x = "Log Loan Amount", y = "Log Annual Income")  +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))


Offsets: lead() and lag()

  • Offsets: lead() and dplyr::llag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - dplyr::lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.
    (x <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
    dplyr::lag(x)
##  [1] NA  1  2  3  4  5  6  7  8  9
    dplyr::lead(x)
##  [1]  2  3  4  5  6  7  8  9 10 NA


Cumulative and rolling aggregates: cumsum(), cumprod(), cummin(), cummax(), cummean()

  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
x <- 1:10
cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
cumprod(x)
##  [1]       1       2       6      24     120     720    5040   40320  362880
## [10] 3628800
cummin(x)
##  [1] 1 1 1 1 1 1 1 1 1 1
cummax(x)
##  [1]  1  2  3  4  5  6  7  8  9 10
cummean(x)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5


Create a new binary categorical variable by ifelse()

In some cases we hope to create a new label for each sample for analysis purpose. For example, we hope to label all flights with “non-canceled” and “canceled”. Then we may do the following:

my_data <- mutate(flights, cancel_flag = ifelse(is.na(dep_time), "canceled", "non-canceled"))
my_data <- select(my_data, dep_time, cancel_flag, everything())
my_data
## # A tibble: 336,776 × 20
##    dep_time cancel_f…¹  year month   day sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
##       <int> <chr>      <int> <int> <int>   <int>   <dbl>   <int>   <int>   <dbl>
##  1      517 non-cance…  2013     1     1     515       2     830     819      11
##  2      533 non-cance…  2013     1     1     529       4     850     830      20
##  3      542 non-cance…  2013     1     1     540       2     923     850      33
##  4      544 non-cance…  2013     1     1     545      -1    1004    1022     -18
##  5      554 non-cance…  2013     1     1     600      -6     812     837     -25
##  6      554 non-cance…  2013     1     1     558      -4     740     728      12
##  7      555 non-cance…  2013     1     1     600      -5     913     854      19
##  8      557 non-cance…  2013     1     1     600      -3     709     723     -14
##  9      557 non-cance…  2013     1     1     600      -3     838     846      -8
## 10      558 non-cance…  2013     1     1     600      -2     753     745       8
## # … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​cancel_flag, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time, ⁵​sched_arr_time,
## #   ⁶​arr_delay

Here the function ifelse() takes three arguments - ifelse(test, value_when_test_returns_true, value_when_test_returns_false)

The first argument test returns TRUE, FALSE or NA. If TRUE, the returning value is value of the second argument; if FALSE, the value of the third argument; if NA, it returns NA. The function returns vectorized result.

x <- 1:10
ifelse(x > 5, ">5", "<=5")
##  [1] "<=5" "<=5" "<=5" "<=5" "<=5" ">5"  ">5"  ">5"  ">5"  ">5"


Example

Let’s explore whether short-distance or long-distance flights are more likely to be canceled.

ggplot(my_data) +
  stat_summary(mapping = aes(x = cancel_flag, y = distance, fill = cancel_flag), geom = "bar", fun = "mean") +
  labs(title = "Mean Distance vs Cancelation", x = "", y = "Counts")  +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

It seems that more canceled flights are short-distance ones. Can you make sense of the result?


Create a new non-binary categorical variable with case_when()

What if we hope to create a categorical variable with more than two labels? We may use the case_when() function from the dplyr() package. As below is an example.

Let’s create a season label for flights data set.

my_data <- mutate(flights, season = case_when(
                  month %in% c(3,4,5) ~ "Spring",
                  month %in% c(6,7,8) ~ "Summer",
                  month %in% c(9,10,11) ~ "Autumn",
                  month %in% c(12,1,2) ~ "Winter",
                  .default = NA
                  ))

my_data <- select(my_data, season, month, everything())

The template for case_when() function is

case_when(
  condition1 ~ value1,
  condition2 ~ value2,
  condition3 ~ value3,
  ...
  .default = value_when_all_conditions_FALSE_or_NA
)

The case_when() function will check for every element in a vector whether it satisfies any of the condition following the order given. The first condition that turns out to be TRUE will be assigned to the corresponding position in the output vector of the same size. For example,

x <- -5:5
case_when(
  x == 0 ~ "zero",
  x %% 2 == 0 ~ "even",
  .default = "odd"
)
##  [1] "odd"  "even" "odd"  "even" "odd"  "zero" "odd"  "even" "odd"  "even"
## [11] "odd"

Now we can see which season is the most busiest.

ggplot(my_data) + 
  geom_bar(aes(season, fill = season)) +
  labs(title = "Seasonal Flights Distribution", x = "", y = "Counts")  +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))


Lab Exercise

For diamonds data, create a new variable named carat_group, which groups data by less than 1 carat, 1-2 carat, 2-3 carat and greater than 3 carat.


2. Grouped summaries with summarise() and group_by()

The last key verb is summarise(). It collapses a data frame to a single row:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

(na.rm = TRUE removes all NA values when computing the measure from the data set. We’ll come back to that very shortly.)

So we see that the average delay of all flights (excluding canceled ones) is 12.6 minutes. The template for summarise() (or equivalently summarize()) is

summarise(<data_name>, <new_measure1> = <formula1>, <new_measure2> = <formula2>, ...)


group_by()

summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete data set to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by month, we get the average delay per month:

by_month <- group_by(flights, month)
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 12 × 2
##    month delay
##    <int> <dbl>
##  1     1 10.0 
##  2     2 10.8 
##  3     3 13.2 
##  4     4 13.9 
##  5     5 13.0 
##  6     6 20.8 
##  7     7 21.7 
##  8     8 12.6 
##  9     9  6.72
## 10    10  6.24
## 11    11  5.44
## 12    12 16.6

So we see that the average delay is the worst in June, July and December, when summer or winter breaks start.

Note: Grouping doesn’t change how the data looks (apart from listing how it’s grouped). It changes how it acts with the other dplyr verbs.

by_month
## # A tibble: 336,776 × 19
## # Groups:   month [12]
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Other than telling us that the data set is now grouped by the variable month, there is nothing different in its appearance from the original data set.

The template to use group_by() when grouping a data set with given variable names is:

group_by(<data_name>, ..., <variable_names>, ...)

Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries.


Missing values

You may have wondered about the na.rm argument we used above. What happens if we don’t set it?

  by_date <- group_by(flights, year, month, day) 
  summarise(by_date, mean_delay = mean(dep_delay))
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day mean_delay
##    <int> <int> <int>      <dbl>
##  1  2013     1     1         NA
##  2  2013     1     2         NA
##  3  2013     1     3         NA
##  4  2013     1     4         NA
##  5  2013     1     5         NA
##  6  2013     1     6         NA
##  7  2013     1     7         NA
##  8  2013     1     8         NA
##  9  2013     1     9         NA
## 10  2013     1    10         NA
## # … with 355 more rows

We get a lot of missing values! That’s because aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation:

  by_date <- group_by(flights, year, month, day) 
  summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day mean_delay
##    <int> <int> <int>      <dbl>
##  1  2013     1     1      11.5 
##  2  2013     1     2      13.9 
##  3  2013     1     3      11.0 
##  4  2013     1     4       8.95
##  5  2013     1     5       5.73
##  6  2013     1     6       7.15
##  7  2013     1     7       5.42
##  8  2013     1     8       2.55
##  9  2013     1     9       2.28
## 10  2013     1    10       2.84
## # … with 355 more rows

In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights. We’ll save this dataset so we can reuse in the next few examples.

Example of analysis

If we hope to know which day resulted in the worst average delay, we may do the following:

by_date <- group_by(flights, year, month, day) 
data_sum <- summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))
arrange(data_sum, desc(mean_delay))
## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day mean_delay
##    <int> <int> <int>      <dbl>
##  1  2013     3     8       83.5
##  2  2013     7     1       56.2
##  3  2013     9     2       53.0
##  4  2013     7    10       52.9
##  5  2013    12     5       52.3
##  6  2013     5    23       51.1
##  7  2013     9    12       50.0
##  8  2013     6    28       48.8
##  9  2013     6    24       47.2
## 10  2013     7    22       46.7
## # … with 355 more rows

We see that March 8th, 2013 had the worst delay of the year. What happened on that day? Can you answer the question?

Lab Exericse:

  1. Find which airport around NYC had the worst delay on average.
  2. Find which airport around NYC had the worst average delay on December 24, 2013.


Useful summary functions

Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:

mean(), median()

  • Measures of location: we’ve used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.

It’s sometimes useful to combine aggregation with logical subsetting. For example,

not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)) 
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
  avg_delay1 = mean(arr_delay),
  avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # … with 355 more rows


Measures of spread: sd(x), IQR(x), mad(x).

The mean squared deviation, or standard deviation or sd for short, is the standard measure of spread. The interquartile range IQR() and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.


Measures of rank: min(x), quantile(x, 0.25), max(x).

Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.

When do the first and last flights leave each day?

summarise(my_data, 
  first = min(dep_time),
  last = max(dep_time)
)
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # … with 355 more rows


first(), nth(), last()

  • Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
summarise(my_data,
  first_dep = first(dep_time), 
  last_dep = last(dep_time)
)
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # … with 355 more rows


n(), sum(!is.na()), n_distinct(), count()

  • Counts: n(), which takes no arguments, returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).


Which destinations have the most carriers?

my_data <- group_by(not_canceled, dest) 
my_sum <- summarise(my_data, carriers = n_distinct(carrier)) 
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # … with 94 more rows


Lab Exercise:

Find which origin has the most destinations?