Lecture 7 - Data Transformation Part Two

Load Libraries

library(nycflights13)
library(tidyverse)
library(openintro)

0. Introduction

In this lecture, we will continue to study data transformation using dplyr functions:

mutate() to add or modify columns
summarise() or summarize() to summarize data
group_by() to group a data frame by given variables (often used together with summarise())

1. The `mutate()` function

In many situations, we hope to create new variables from the original data for our convenience when exploring or analyzing data.

For example, we want to compute the average speed of each flight, which is not included in the original data. We can do the following:

my_data <- mutate(flights, speed = distance/air_time*60)

This creates a new column named speed which is computed from the formula distance * 60 / air_time, which is the average speed in the unit of miles per hour. The new column is added to a new data set named my_data.

select(my_data, speed)

## # A tibble: 336,776 × 1
##    speed
##    <dbl>
##  1  370.
##  2  374.
##  3  408.
##  4  517.
##  5  394.
##  6  288.
##  7  404.
##  8  259.
##  9  405.
## 10  319.
## # … with 336,766 more rows

Lab Exercise: Find the flight with the highest average speed. What is its destination?

How to use `mutate()` function

The mutate() function creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).

The template to call mutate() function is

mutate(<data_name>, <variable_name> = <formula>)

If the <data_name> matches an existing column, then we would modify the column or delete it (if the formula is NULL). Otherwise we would create a new column following the formula.

If we only hope to keep new columns in the new data set, we can use transmute:

transmute(<data_name>, <new_variable_name> = <formula>)

In the example above, we may do the following as well:

my_data <- transmute(flights, speed = distance/air_time*60)

Useful creation functions

There are many functions for creating new variables that you can use with mutate().

The key property is that the function must be vectorized:
- it must take a vector of values as input,
- return a vector with the same number of values as output.

There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:

Arithmetic operators

Arithmetic operators: +, -, *, /, ^. These are all vectorized, using the so called “recycling rules”.
If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.
Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.

Example

Let’s create a new variable price/carat for the data set diamonds, which accounts for the unit price of a diamond per carat.

my_data <- mutate(diamonds, unit_price = price/carat)

ggplot(my_data) + geom_histogram(aes(unit_price)) +
  labs(title = "Diamond unit price per carat", x = "dollar per carat", y = "Count") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

We may expect that this new variable has already incorporated the effect of carat, and then we can study the effect of color simply by creating a bar plot on this variable.

ggplot(filter(my_data, between(carat, 1, 2))) + 
  stat_summary(mapping = aes(x = color, y = unit_price, fill = color), geom = "bar", fun = "mean") +
  labs(title = "Diamond unit price by color", x = "Color Grade", y = "Mean Unit Price (dollar per carat)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

Here we pick up diamonds between 1 carat and 2 carat since the unit price in this group has less variance (when the carat is too low the value price/carat can suffer much greater variable since carat is in the denominator.) By plotting the mean unit price we see a reasonable trend to reflect the effect of color.

Modular arithmetic

Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:

    transmute(flights,
      dep_time,
      hour = dep_time %/% 100,
      minute = dep_time %% 100 )

## # A tibble: 336,776 × 3
##    dep_time  hour minute
##       <int> <dbl>  <dbl>
##  1      517     5     17
##  2      533     5     33
##  3      542     5     42
##  4      544     5     44
##  5      554     5     54
##  6      554     5     54
##  7      555     5     55
##  8      557     5     57
##  9      557     5     57
## 10      558     5     58
## # … with 336,766 more rows

Logs: `log()`, `log2()`, `log10()`

Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive, a feature we’ll come back to in modelling.
All else being equal, we recommend using log2() because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.

Example

In the loans_full_schema data set, the annual income variable spans over a few orders of magnitude. It can be useful to take its logarithm as a new variable.

library(openintro)

my_data <- mutate(loans_full_schema, log_income = log10(annual_income))

ggplot(filter(my_data, between(annual_income, 10000, 1000000))) + 
  geom_point(aes(loan_amount, annual_income)) +
  geom_smooth(aes(loan_amount, annual_income)) + 
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Annual Income vs Loan Amount", x = "Loan Amount ($)", y = "Annual Income ($)")  +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))
ggplot(filter(my_data, between(annual_income, 1000, 1000000))) + 
  geom_point(aes(loan_amount, log_income)) +
  geom_smooth(aes(loan_amount, log_income)) +
  ylim(4,6) +
  labs(title = "Log Annual Income vs Loan Amount", x = "Loan Amount ($)", y = "Log Annual Income")  +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

Offsets: `lead()` and `lag()`

Offsets: lead() and dplyr::llag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - dplyr::lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.

    (x <- 1:10)

##  [1]  1  2  3  4  5  6  7  8  9 10

    dplyr::lag(x)

##  [1] NA  1  2  3  4  5  6  7  8  9

    dplyr::lead(x)

##  [1]  2  3  4  5  6  7  8  9 10 NA

Exercise: On January 1st, what is the average time between the departure of two domestic flights at EWR?

Cumulative and rolling aggregates: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, `cummean()`

Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.

x <- 1:10
cumsum(x)

##  [1]  1  3  6 10 15 21 28 36 45 55

cumprod(x)

##  [1]       1       2       6      24     120     720    5040   40320  362880
## [10] 3628800

cummin(x)

##  [1] 1 1 1 1 1 1 1 1 1 1

cummax(x)

##  [1]  1  2  3  4  5  6  7  8  9 10

cummean(x)

##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

Create a new binary categorical variable by `ifelse()`

In some cases we hope to create a new label for each sample for analysis purpose. For example, we hope to label all flights with “non-canceled” and “canceled”. Then we may do the following:

my_data <- mutate(flights, cancel_flag = ifelse(is.na(dep_time), "canceled", "non-canceled"))
my_data <- select(my_data, dep_time, cancel_flag, everything())
my_data

## # A tibble: 336,776 × 20
##    dep_time cancel_f…¹  year month   day sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
##       <int> <chr>      <int> <int> <int>   <int>   <dbl>   <int>   <int>   <dbl>
##  1      517 non-cance…  2013     1     1     515       2     830     819      11
##  2      533 non-cance…  2013     1     1     529       4     850     830      20
##  3      542 non-cance…  2013     1     1     540       2     923     850      33
##  4      544 non-cance…  2013     1     1     545      -1    1004    1022     -18
##  5      554 non-cance…  2013     1     1     600      -6     812     837     -25
##  6      554 non-cance…  2013     1     1     558      -4     740     728      12
##  7      555 non-cance…  2013     1     1     600      -5     913     854      19
##  8      557 non-cance…  2013     1     1     600      -3     709     723     -14
##  9      557 non-cance…  2013     1     1     600      -3     838     846      -8
## 10      558 non-cance…  2013     1     1     600      -2     753     745       8
## # … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹cancel_flag, ²sched_dep_time, ³dep_delay, ⁴arr_time, ⁵sched_arr_time,
## #   ⁶arr_delay

Here the function ifelse() takes three arguments - ifelse(test, value_when_test_returns_true, value_when_test_returns_false)

The first argument test returns TRUE, FALSE or NA. If TRUE, the returning value is value of the second argument; if FALSE, the value of the third argument; if NA, it returns NA. The function returns vectorized result.

x <- 1:10
ifelse(x > 5, ">5", "<=5")

##  [1] "<=5" "<=5" "<=5" "<=5" "<=5" ">5"  ">5"  ">5"  ">5"  ">5"

Example

Let’s explore whether short-distance or long-distance flights are more likely to be canceled.

ggplot(my_data) +
  stat_summary(mapping = aes(x = cancel_flag, y = distance, fill = cancel_flag), geom = "bar", fun = "mean") +
  labs(title = "Mean Distance vs Cancelation", x = "", y = "Mean Distance (mile)")  +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

It seems that more canceled flights are short-distance ones. Can you make sense of the result?

Create a new non-binary categorical variable with `case_when()`

What if we hope to create a categorical variable with more than two labels? We may use the case_when() function from the dplyr() package. As below is an example.

Let’s create a season label for flights data set.

my_data <- mutate(flights, season = case_when(
                  month %in% c(3,4,5) ~ "Spring",
                  month %in% c(6,7,8) ~ "Summer",
                  month %in% c(9,10,11) ~ "Autumn",
                  month %in% c(12,1,2) ~ "Winter",
                  .default = NA
                  ))

my_data <- select(my_data, season, month, everything())

The template for case_when() function is

case_when(
  condition1 ~ value1,
  condition2 ~ value2,
  condition3 ~ value3,
  ...
  .default = value_when_all_conditions_FALSE_or_NA
)

The case_when() function will check for every element in a vector whether it satisfies any of the condition following the order given. The first condition that turns out to be TRUE will be assigned to the corresponding position in the output vector of the same size. For example,

x <- -5:5
case_when(
  x == 0 ~ "zero",
  x %% 2 == 0 ~ "even",
  .default = "odd"
)

##  [1] "odd"  "even" "odd"  "even" "odd"  "zero" "odd"  "even" "odd"  "even"
## [11] "odd"

Now we can see which season is the most busiest.

ggplot(my_data) + 
  geom_bar(aes(season, fill = season)) +
  labs(title = "Seasonal Flights Distribution", x = "", y = "Counts")  +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

Lab Exercise

For diamonds data, create a new variable named carat_group, which groups data by less than 1 carat, 1-2 carat, 2-3 carat and greater than 3 carat.

Grouped summaries with `summarise()` and `group_by()`

The last key verb is summarise(). It collapses a data frame to a single row:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

(na.rm = TRUE removes all NA values when computing the measure from the data set. We’ll come back to that very shortly.)

So we see that the average delay of all flights (excluding canceled ones) is 12.6 minutes. The template for summarise() (or equivalently summarize()) is

summarise(<data_name>, <new_measure1> = <formula1>, <new_measure2> = <formula2>, ...)

group_by()

summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete data set to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by month, we get the average delay per month:

by_month <- group_by(flights, month)
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 12 × 2
##    month delay
##    <int> <dbl>
##  1     1 10.0 
##  2     2 10.8 
##  3     3 13.2 
##  4     4 13.9 
##  5     5 13.0 
##  6     6 20.8 
##  7     7 21.7 
##  8     8 12.6 
##  9     9  6.72
## 10    10  6.24
## 11    11  5.44
## 12    12 16.6

So we see that the average delay is the worst in June, July and December, when summer or winter breaks start.

Note: Grouping doesn’t change how the data looks (apart from listing how it’s grouped). It changes how it acts with the other dplyr verbs.

by_month

## # A tibble: 336,776 × 19
## # Groups:   month [12]
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Other than telling us that the data set is now grouped by the variable month, there is nothing different in its appearance from the original data set.

The template to use group_by() when grouping a data set with given variable names is:

group_by(<data_name>, ..., <variable_names>, ...)

Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries.

Missing values

You may have wondered about the na.rm argument we used above. What happens if we don’t set it?

  by_date <- group_by(flights, year, month, day) 
  summarise(by_date, mean_delay = mean(dep_delay))

## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day mean_delay
##    <int> <int> <int>      <dbl>
##  1  2013     1     1         NA
##  2  2013     1     2         NA
##  3  2013     1     3         NA
##  4  2013     1     4         NA
##  5  2013     1     5         NA
##  6  2013     1     6         NA
##  7  2013     1     7         NA
##  8  2013     1     8         NA
##  9  2013     1     9         NA
## 10  2013     1    10         NA
## # … with 355 more rows

We get a lot of missing values! That’s because aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation:

  by_date <- group_by(flights, year, month, day) 
  summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))

## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day mean_delay
##    <int> <int> <int>      <dbl>
##  1  2013     1     1      11.5 
##  2  2013     1     2      13.9 
##  3  2013     1     3      11.0 
##  4  2013     1     4       8.95
##  5  2013     1     5       5.73
##  6  2013     1     6       7.15
##  7  2013     1     7       5.42
##  8  2013     1     8       2.55
##  9  2013     1     9       2.28
## 10  2013     1    10       2.84
## # … with 355 more rows

In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights. We’ll save this dataset so we can reuse in the next few examples.

Example of analysis

If we hope to know which day resulted in the worst average delay, we may do the following:

by_date <- group_by(flights, year, month, day) 
data_sum <- summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))
arrange(data_sum, desc(mean_delay))

## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day mean_delay
##    <int> <int> <int>      <dbl>
##  1  2013     3     8       83.5
##  2  2013     7     1       56.2
##  3  2013     9     2       53.0
##  4  2013     7    10       52.9
##  5  2013    12     5       52.3
##  6  2013     5    23       51.1
##  7  2013     9    12       50.0
##  8  2013     6    28       48.8
##  9  2013     6    24       47.2
## 10  2013     7    22       46.7
## # … with 355 more rows

We see that March 8th, 2013 had the worst delay of the year. What happened on that day? Can you answer the question?

Lab Exericse:

Find which airport around NYC had the worst delay on average.
Find which airport around NYC had the worst average delay on December 24, 2013.

Useful summary functions

R provides many other summary functions when using summarise() to summarize data. As below are some of the most helpful ones.

`mean()`, `median()`

Measures of location: we’ve used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.

It’s sometimes useful to combine aggregation with logical subsetting. For example,

not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)) 
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
  avg_delay1 = mean(arr_delay),
  avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)

## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # … with 355 more rows

`sum()` and `n()`

The sum() function is commonly used to compute the relative frequency, combining with the counting function n().

my_summary <- summarise(group_by(flights, origin), n = n())  # n() counts how many rows there are for each group
my_summary

## # A tibble: 3 × 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

my_summary <- mutate(my_summary, prop = n/sum(n))  # Use `mutate()` to add a column of "proportion"
my_summary

## # A tibble: 3 × 3
##   origin      n  prop
##   <chr>   <int> <dbl>
## 1 EWR    120835 0.359
## 2 JFK    111279 0.330
## 3 LGA    104662 0.311

ggplot(my_summary) + stat_summary(mapping = aes(x = origin, y = prop), geom = "bar") +
labs(title = "Total Number of Flights from NYC in 2013", x = "Airport Code", y = "Relative Frequency") +
scale_y_continuous(labels = scales::percent)  +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.

The mean squared deviation, or standard deviation or sd for short, is the standard measure of spread. The interquartile range IQR() and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.

For example, let’s find the destination with the highest variation in arrival delay

not_canceled <- filter(flights, !is.na(arr_delay))
my_data <- group_by(not_canceled, dest)
my_summary <- summarise(my_data, sd = sd(arr_delay))
arrange(my_summary, desc(sd))

## # A tibble: 104 × 2
##    dest     sd
##    <chr> <dbl>
##  1 HNL    60.8
##  2 TUL    60.3
##  3 TVC    59.3
##  4 CAK    58.3
##  5 TYS    58.0
##  6 SAT    57.8
##  7 DSM    56.9
##  8 MSN    56.8
##  9 BHM    56.2
## 10 CVG    55.1
## # … with 94 more rows

Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.

Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.

For example, if we hope to know the first and last flights leave on each day, we may do the following:

not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)) 
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data, 
  first = min(dep_time),
  last = max(dep_time)
)

## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # … with 355 more rows

first(), nth(), last()

Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:

summarise(my_data,
  first_dep = first(dep_time), 
  last_dep = last(dep_time)
)

## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # … with 355 more rows

n(), sum(!is.na()), n_distinct(), count()

Counts: n(), which takes no arguments, returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).

Counting is so frequently used for data analysis that there is a separate function count() to do the job. count(<data_name>, <variable_names>) is equivalent to first group by those variables then count the number of samples in each group.

count(flights, origin)

## # A tibble: 3 × 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

Which destinations have the most carriers?

my_data <- group_by(not_canceled, dest) 
my_sum <- summarise(my_data, carriers = n_distinct(carrier)) 
arrange(my_sum, desc(carriers))

## # A tibble: 104 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # … with 94 more rows

Lab Exercise:

Find which origin has the most destinations?

The Pipe Operator `%>%`

In most examples so far, we need to do a sequence of operations on the data set to fulfill our task. For example:

not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay)) 
my_data <- group_by(not_canceled, dest) 
my_sum <- summarise(my_data, carriers = n_distinct(carrier)) 
arrange(my_sum, desc(carriers))

## # A tibble: 104 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # … with 94 more rows

As we see, we have to assign the transformed data to a new or existing data set in every step to save the intermediate results. when there are many steps to do, this is quite cumbersome and inconvenient. The code also becomes less clear for readers.

Instead, we can use the pipe operator %>% introduced in the package magrittr, which is automatically loaded when we load tidyverse so we don’t have to explicitly load it.

The code after using the pipe for the previous example looks like this:

my_data <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  group_by(dest) %>%
  summarise(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers)) %>%
  print()

## # A tibble: 104 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # … with 94 more rows

So the template for the pipe is:

<new_data_name> <- <original_data_name> %>%
  operation1() %>%
  operation2() %>%
  ...

Most importantly, in every operation function (basically all functions we have introduced so far from delyr) we can now ignore the data set name as the first argument and directly input the operations.

Lab Exercise

Rewrite the following analysis code in pipe operator:

my_data <- filter(diamonds, carat < 2)
my_data <- group_by(my_data, color)
my_summary <- summarise(my_data, mean_carat = mean(carat), mean_price = mean(price))
my_summary <- mutate(my_summary, mean_unit_price = mean_price/mean_carat)
print(my_summary)

## # A tibble: 7 × 4
##   color mean_carat mean_price mean_unit_price
##   <ord>      <dbl>      <dbl>           <dbl>
## 1 D          0.646      3063.           4743.
## 2 E          0.643      2944.           4577.
## 3 F          0.716      3547.           4952.
## 4 G          0.737      3720.           5047.
## 5 H          0.828      3765.           4548.
## 6 I          0.884      3814.           4314.
## 7 J          0.980      3838.           3917.

Example

Now let’s exercise using %>% to do analysis.

We will analyze the data set babies in the package openintro. Use the help documents to understand the meaning of variables.

Question 1: Histogram of mother’s age

This can be done by data visualization alone without any transformation.

ggplot(babies) + geom_histogram(aes(age), binwidth = 1, boundary = 20.5) + 
  labs(title = "Analysis of 'babies' Data Set", x = "Mother's age when giving birth") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

Question 2: Is there any correlation between whether the mother smokes and the gestation length?

We can create a box plot to study this question. Again this only involves data visualization.

ggplot(babies) + 
  geom_boxplot(mapping = aes(x = as.factor(smoke), y = gestation))

This figure does not look very nice, so we may consider summarizing data instead of creating a graph. Note that we will create a new categorical variable with labels “non-smoker”, “smoker” and “NA” to replace “0”, “1” and “NA” in smoke variable.

smoke_gest <- babies %>%
  mutate(smoke_label = fct_recode(as.factor(smoke), "non-smoker" = "0", "smoker" = "1")) %>%
  group_by(smoke_label) %>%
  summarise(n = n(), mean_gest = mean(gestation, na.rm = TRUE), median_gest = median(gestation, na.rm = TRUE)) %>%
  print()

## # A tibble: 3 × 4
##   smoke_label     n mean_gest median_gest
##   <fct>       <int>     <dbl>       <dbl>
## 1 non-smoker    742      280.        281 
## 2 smoker        484      278.        279 
## 3 <NA>           10      282.        280.

As we see, there is little difference in the measures and it seems that smoking or not does not matter to the gestation length.

Lecture 7 - Data Transformation Part Two

Miao Yu

2023-09-26

Load Libraries

0. Introduction

1. The mutate() function

How to use mutate() function

Useful creation functions

Arithmetic operators

Example

Modular arithmetic

Logs: log(), log2(), log10()

Example

Offsets: lead() and lag()

Cumulative and rolling aggregates: cumsum(), cumprod(), cummin(), cummax(), cummean()

Create a new binary categorical variable by ifelse()

Example

Create a new non-binary categorical variable with case_when()

Lab Exercise

Grouped summaries with summarise() and group_by()

group_by()

Missing values

Example of analysis

Lab Exericse:

Useful summary functions

mean(), median()

sum() and n()

Measures of spread: sd(x), IQR(x), mad(x).

Measures of rank: min(x), quantile(x, 0.25), max(x).

first(), nth(), last()

n(), sum(!is.na()), n_distinct(), count()

Which destinations have the most carriers?

Lab Exercise:

Find which origin has the most destinations?

The Pipe Operator %>%

Lab Exercise

Rewrite the following analysis code in pipe operator:

Example

Question 1: Histogram of mother’s age

Question 2: Is there any correlation between whether the mother smokes and the gestation length?

Lab Exercise

1. With the babies data set, investigate whether there is a correlation between mother’s weight and the gestation length.

2. Investigate all factors of mother in the data set to see which factor may have an non-negligible correlation with the gestation length.

3. Investigate all factors in the data set to see which factor may have an non-negligible correlation with the birthweight of babies.

1. The `mutate()` function

How to use `mutate()` function

Logs: `log()`, `log2()`, `log10()`

Offsets: `lead()` and `lag()`

Cumulative and rolling aggregates: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, `cummean()`

Create a new binary categorical variable by `ifelse()`

Create a new non-binary categorical variable with `case_when()`

Grouped summaries with `summarise()` and `group_by()`

`mean()`, `median()`

`sum()` and `n()`

Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.

Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.

The Pipe Operator `%>%`

1. With the `babies` data set, investigate whether there is a correlation between mother’s weight and the gestation length.