library(nycflights13)
library(tidyverse)
library(openintro)
In this lecture, we will continue to study data transformation using dplyr functions:

- mutate() to add or modify columns
- summarise() (or summarize()) to summarize data
- group_by() to group a data frame by given variables (often used together with summarise())

mutate() function

In many situations, we hope to create new variables from the original data for our convenience when exploring or analyzing data.
For example, we want to compute the average speed of each flight, which is not included in the original data. We can do the following:
my_data <- mutate(flights, speed = distance/air_time*60)
This creates a new column named speed, computed from the formula distance / air_time * 60, which is the average speed in miles per hour. The new column is added to a new data set named my_data.
select(my_data, speed)
## # A tibble: 336,776 × 1
## speed
## <dbl>
## 1 370.
## 2 374.
## 3 408.
## 4 517.
## 5 394.
## 6 288.
## 7 404.
## 8 259.
## 9 405.
## 10 319.
## # … with 336,766 more rows
Lab Exercise: Find the flight with the highest average speed. What is its destination?
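One possible approach (a sketch, reusing the my_data with the speed column created above):

# Sort flights by average speed in descending order; the fastest flight is the top row
fastest <- arrange(my_data, desc(speed))
select(fastest, dest, tailnum, speed)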
mutate() function

The mutate() function creates new columns that are functions of existing variables. It can also modify a column (if the name matches an existing column) or delete a column (by setting its value to NULL).
The template for calling the mutate() function is

mutate(<data_name>, <variable_name> = <formula>)

If <variable_name> matches an existing column, then we modify that column or delete it (if the formula is NULL). Otherwise, we create a new column following the formula.
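For instance, a minimal sketch of modifying and then deleting an existing column (air_time is already a column of flights):

my_data <- mutate(flights, air_time = air_time / 60)  # name matches an existing column: modify (minutes to hours)
my_data <- mutate(my_data, air_time = NULL)           # formula is NULL: delete the column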
If we hope to keep only the new columns in the new data set, we can use transmute():
transmute(<data_name>, <new_variable_name> = <formula>)
In the example above, we may do the following as well:
my_data <- transmute(flights, speed = distance/air_time*60)
There are many functions for creating new variables that you can use with mutate(). There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
Arithmetic operators: +, -, *, /, ^. These are all vectorized, using the so-called "recycling rules": if one operand is shorter than the other, it is automatically extended (recycled) to the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.

Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.
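A quick sketch of these rules on a small vector:

x <- c(2, 4, 6, 8)
x / 60        # the single number 60 is recycled to the length of x
x / sum(x)    # proportion of the total: 0.1 0.2 0.3 0.4
x - mean(x)   # difference from the mean: -3 -1 1 3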
Let's create a new variable price/carat for the diamonds data set, which gives the unit price of a diamond in dollars per carat.
my_data <- mutate(diamonds, unit_price = price/carat)
ggplot(my_data) + geom_histogram(aes(unit_price)) +
labs(title = "Diamond unit price per carat", x = "dollar per carat", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
We may expect that this new variable already incorporates the effect of carat, so we can study the effect of color simply by creating a bar plot of this variable.
ggplot(filter(my_data, between(carat, 1, 2))) +
stat_summary(mapping = aes(x = color, y = unit_price, fill = color), geom = "bar", fun = "mean") +
labs(title = "Diamond unit price by color", x = "Color Grade", y = "Mean Unit Price (dollar per carat)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
Here we select diamonds between 1 and 2 carats since the unit price in this group has less variance (when the carat is very small, the value price/carat can suffer much greater variance since carat is in the denominator). By plotting the mean unit price we see a reasonable trend reflecting the effect of color.
%/% (integer division) and %% (remainder) satisfy x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights data set, you can compute hour and minute from dep_time with:

transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
## # A tibble: 336,776 × 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # … with 336,766 more rows
log(), log2(), log10()

Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive ones, a feature we'll come back to in modelling.

All else being equal, we recommend using log2() because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale, and a difference of -1 corresponds to halving.
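A quick sketch of this interpretation:

x <- c(1, 2, 4, 8, 16)
log2(x)   # 0 1 2 3 4 -- each doubling on the original scale adds 1 on the log scale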
In the loans_full_schema data set, the annual_income variable spans a few orders of magnitude. It can be useful to take its logarithm as a new variable.
library(openintro)
my_data <- mutate(loans_full_schema, log_income = log10(annual_income))
ggplot(filter(my_data, between(annual_income, 10000, 1000000))) +
geom_point(aes(loan_amount, annual_income)) +
geom_smooth(aes(loan_amount, annual_income)) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Annual Income vs Loan Amount", x = "Loan Amount ($)", y = "Annual Income ($)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
ggplot(filter(my_data, between(annual_income, 1000, 1000000))) +
geom_point(aes(loan_amount, log_income)) +
geom_smooth(aes(loan_amount, log_income)) +
ylim(4,6) +
labs(title = "Log Annual Income vs Loan Amount", x = "Log Loan Amount", y = "Log Annual Income") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
lead() and lag()

lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - dplyr::lag(x)) or find when values change (x != dplyr::lag(x)). They are most useful in conjunction with group_by(), which you'll learn about shortly.

(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
dplyr::lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
dplyr::lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
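For instance, a small sketch of running differences and change detection:

x <- c(1, 1, 2, 2, 5)
x - dplyr::lag(x)    # NA 0 1 0 3 -- running differences
x != dplyr::lag(x)   # NA FALSE TRUE FALSE TRUE -- TRUE where the value changes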
cumsum(), cumprod(), cummin(), cummax(), cummean()

These functions compute cumulative (running) sums, products, minima, and maxima; dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.

x <- 1:10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cumprod(x)
## [1] 1 2 6 24 120 720 5040 40320 362880
## [10] 3628800
cummin(x)
## [1] 1 1 1 1 1 1 1 1 1 1
cummax(x)
## [1] 1 2 3 4 5 6 7 8 9 10
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
ifelse()

In some cases we hope to create a new label for each observation for analysis purposes. For example, suppose we hope to label each flight as "canceled" or "non-canceled". Then we may do the following:
my_data <- mutate(flights, cancel_flag = ifelse(is.na(dep_time), "canceled", "non-canceled"))
my_data <- select(my_data, dep_time, cancel_flag, everything())
my_data
## # A tibble: 336,776 × 20
## dep_time cancel_f…¹ year month day sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
## <int> <chr> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
## 1 517 non-cance… 2013 1 1 515 2 830 819 11
## 2 533 non-cance… 2013 1 1 529 4 850 830 20
## 3 542 non-cance… 2013 1 1 540 2 923 850 33
## 4 544 non-cance… 2013 1 1 545 -1 1004 1022 -18
## 5 554 non-cance… 2013 1 1 600 -6 812 837 -25
## 6 554 non-cance… 2013 1 1 558 -4 740 728 12
## 7 555 non-cance… 2013 1 1 600 -5 913 854 19
## 8 557 non-cance… 2013 1 1 600 -3 709 723 -14
## 9 557 non-cance… 2013 1 1 600 -3 838 846 -8
## 10 558 non-cance… 2013 1 1 600 -2 753 745 8
## # … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹cancel_flag, ²sched_dep_time, ³dep_delay, ⁴arr_time, ⁵sched_arr_time,
## # ⁶arr_delay
Here the function ifelse() takes three arguments:

ifelse(test, value_when_test_returns_true, value_when_test_returns_false)

The first argument test returns TRUE, FALSE or NA. If TRUE, the returned value is the value of the second argument; if FALSE, the value of the third argument; if NA, it returns NA. The function is vectorized: it applies the test elementwise and returns a vector of the same length.
x <- 1:10
ifelse(x > 5, ">5", "<=5")
## [1] "<=5" "<=5" "<=5" "<=5" "<=5" ">5" ">5" ">5" ">5" ">5"
Let’s explore whether short-distance or long-distance flights are more likely to be canceled.
ggplot(my_data) +
stat_summary(mapping = aes(x = cancel_flag, y = distance, fill = cancel_flag), geom = "bar", fun = "mean") +
labs(title = "Mean Distance vs Cancelation", x = "", y = "Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
It seems that canceled flights tend to be short-distance ones. Can you make sense of the result?
case_when()

What if we hope to create a categorical variable with more than two labels? We may use the case_when() function from the dplyr package. Below is an example.
Let's create a season label for the flights data set.
my_data <- mutate(flights, season = case_when(
month %in% c(3,4,5) ~ "Spring",
month %in% c(6,7,8) ~ "Summer",
month %in% c(9,10,11) ~ "Autumn",
month %in% c(12,1,2) ~ "Winter",
.default = NA
))
my_data <- select(my_data, season, month, everything())
The template for the case_when() function is

case_when(
  condition1 ~ value1,
  condition2 ~ value2,
  condition3 ~ value3,
  ...,
  .default = value_when_all_conditions_FALSE_or_NA
)
The case_when() function checks, for each element of a vector, whether it satisfies any of the conditions, in the order given. The value attached to the first condition that evaluates to TRUE is assigned to the corresponding position in the output vector, which has the same size as the input. For example,
x <- -5:5
case_when(
x == 0 ~ "zero",
x %% 2 == 0 ~ "even",
.default = "odd"
)
## [1] "odd" "even" "odd" "even" "odd" "zero" "odd" "even" "odd" "even"
## [11] "odd"
Now we can see which season is the busiest.
ggplot(my_data) +
geom_bar(aes(season, fill = season)) +
labs(title = "Seasonal Flights Distribution", x = "", y = "Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
Lab Exercise: For the diamonds data set, create a new variable named carat_group, which labels each diamond as less than 1 carat, 1-2 carat, 2-3 carat, or greater than 3 carat.
summarise() and group_by()

The last key verb is summarise(). It collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
(na.rm = TRUE
removes all NA
values when
computing the measure from the data set. We’ll come back to that very
shortly.)
So we see that the average delay of all flights (excluding canceled
ones) is 12.6 minutes. The template for summarise()
(or
equivalently summarize()
) is
summarise(<data_name>, <new_measure1> = <formula1>, <new_measure2> = <formula2>, ...)
summarise()
is not terribly useful unless we pair it
with group_by()
. This changes the unit of analysis from the
complete data set to individual groups. Then, when you use the
dplyr
verbs on a grouped data frame they’ll be
automatically applied “by group”. For example, if we applied exactly the
same code to a data frame grouped by month, we get the average delay per
month:
by_month <- group_by(flights, month)
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 12 × 2
## month delay
## <int> <dbl>
## 1 1 10.0
## 2 2 10.8
## 3 3 13.2
## 4 4 13.9
## 5 5 13.0
## 6 6 20.8
## 7 7 21.7
## 8 8 12.6
## 9 9 6.72
## 10 10 6.24
## 11 11 5.44
## 12 12 16.6
So we see that the average delay is the worst in June, July and December, when summer or winter breaks start.
Note: Grouping doesn’t change how the data looks (apart from listing how it’s grouped). It changes how it acts with the other dplyr verbs.
by_month
## # A tibble: 336,776 × 19
## # Groups: month [12]
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Other than telling us that the data set is now grouped by the
variable month
, there is nothing different in its
appearance from the original data set.
The template for group_by() when grouping a data set by given variables is:

group_by(<data_name>, <variable_name1>, <variable_name2>, ...)
Together group_by()
and summarise()
provide
one of the tools that you’ll use most commonly when working with dplyr:
grouped summaries.
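For example, grouping by carrier instead gives the average delay per carrier (a sketch along the same lines as the monthly summary above):

by_carrier <- group_by(flights, carrier)
summarise(by_carrier, delay = mean(dep_delay, na.rm = TRUE))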
You may have wondered about the na.rm
argument we used
above. What happens if we don’t set it?
by_date <- group_by(flights, year, month, day)
summarise(by_date, mean_delay = mean(dep_delay))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # … with 355 more rows
We get a lot of missing values! That’s because aggregation functions
obey the usual rule of missing values: if there’s any missing value in
the input, the output will be a missing value. Fortunately, all
aggregation functions have an na.rm
argument which
removes the missing values prior to computation:
by_date <- group_by(flights, year, month, day)
summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # … with 355 more rows
In this case, where missing values represent canceled flights, we could also tackle the problem by first removing the canceled flights. We'll save this data set so we can reuse it in the next few examples.
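A sketch of that removal step (the same not_canceled data set is created again in a later example):

not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))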
If we hope to know which day resulted in the worst average delay, we may do the following:
by_date <- group_by(flights, year, month, day)
data_sum <- summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))
arrange(data_sum, desc(mean_delay))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 3 8 83.5
## 2 2013 7 1 56.2
## 3 2013 9 2 53.0
## 4 2013 7 10 52.9
## 5 2013 12 5 52.3
## 6 2013 5 23 51.1
## 7 2013 9 12 50.0
## 8 2013 6 28 48.8
## 9 2013 6 24 47.2
## 10 2013 7 22 46.7
## # … with 355 more rows
We see that March 8th, 2013 had the worst delay of the year. What happened on that day? Can you answer the question?
Just using means, counts, and sums can get you a long way, but R provides many other useful summary functions:
Measures of location: mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.

It's sometimes useful to combine aggregation with logical subsetting. For example,
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.7 32.5
## 2 2013 1 2 12.7 32.0
## 3 2013 1 3 5.73 27.7
## 4 2013 1 4 -1.93 28.3
## 5 2013 1 5 -1.53 22.6
## 6 2013 1 6 4.24 24.4
## 7 2013 1 7 -4.95 27.8
## 8 2013 1 8 -3.23 20.8
## 9 2013 1 9 -0.264 25.6
## 10 2013 1 10 -5.90 27.3
## # … with 355 more rows
Measures of spread: sd(x), IQR(x), mad(x). The root mean squared deviation, or standard deviation sd(x) for short, is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
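For example, a sketch asking which destinations have the most variable flight distances:

by_dest <- group_by(not_canceled, dest)
arrange(summarise(by_dest, distance_sd = sd(distance)), desc(distance_sd))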
Measures of rank: min(x), quantile(x, 0.25), max(x). Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
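A quick sketch on a simple vector:

x <- 1:100
quantile(x, 0.25)  # 25.75 -- a quarter of the values lie below this point
min(x)             # 1
max(x)             # 100

For example, when do the first and last flights leave each day?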
summarise(my_data,
first = min(dep_time),
last = max(dep_time)
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # … with 355 more rows
Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:

summarise(my_data,
first_dep = first(dep_time),
last_dep = last(dep_time)
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # … with 355 more rows
Counts: n(), which takes no arguments, returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x). For example, we can count the number of carriers serving each destination:

my_data <- group_by(not_canceled, dest)
my_sum <- summarise(my_data, carriers = n_distinct(carrier))
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 94 more rows
Lab Exercise: Find which origin airport has the most destinations.
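One possible approach (a sketch combining group_by() and n_distinct()):

by_origin <- group_by(flights, origin)
arrange(summarise(by_origin, n_dest = n_distinct(dest)), desc(n_dest))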