library(nycflights13)
library(tidyverse)
library(openintro)
In this lecture, we will continue to study data transformation using dplyr functions:
- mutate() to add or modify columns
- summarise() or summarize() to summarize data
- group_by() to group a data frame by given variables (often used together with summarise())

mutate() function

In many situations, we hope to create new variables from the original data for our convenience when exploring or analyzing data.
For example, we want to compute the average speed of each flight, which is not included in the original data. We can do the following:
my_data <- mutate(flights, speed = distance/air_time*60)
This creates a new column named speed
which is computed
from the formula distance * 60 / air_time
, which is the
average speed in the unit of miles per hour. The new column is added to
a new data set named my_data
.
select(my_data, speed)
## # A tibble: 336,776 × 1
## speed
## <dbl>
## 1 370.
## 2 374.
## 3 408.
## 4 517.
## 5 394.
## 6 288.
## 7 404.
## 8 259.
## 9 405.
## 10 319.
## # … with 336,766 more rows
Lab Exercise: Find the flight with the highest average speed. What is its destination?
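One possible approach (a sketch, not necessarily the intended solution): sort my_data by speed in descending order and inspect the top row.
fastest <- arrange(my_data, desc(speed))  # fastest flights first
select(fastest, carrier, flight, origin, dest, speed)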
mutate() function

The mutate() function creates new columns that are
function creates new columns that are
functions of existing variables. It can also modify (if the name is the
same as an existing column) and delete columns (by setting their value
to NULL
).
The template to call mutate()
function is
mutate(<data_name>, <variable_name> = <formula>)
If the <variable_name> matches an existing column,
then mutate() modifies that column, or deletes it if the formula is
NULL. Otherwise it creates a new column following the
formula.
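For instance, a minimal sketch of modifying and then deleting a column (assuming we first want air_time in hours):
my_data <- mutate(flights, air_time = air_time / 60)  # modify: air_time is now in hours
my_data <- mutate(my_data, air_time = NULL)           # delete: the column is removed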
If we only hope to keep new columns in the new data set, we can use
transmute
:
transmute(<data_name>, <new_variable_name> = <formula>)
In the example above, we may do the following as well:
my_data <- transmute(flights, speed = distance/air_time*60)
There are many functions for creating new variables that you can use
with mutate()
.
There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:
Arithmetic operators: +
, -
,
*
, /
, ^
. These are all
vectorized, using the so-called “recycling
rules”.
If one parameter is shorter than the other, it will be
automatically extended to be the same length. This is
most useful when one of the arguments is a single number:
air_time / 60
, hours * 60 + minute
,
etc.
Arithmetic operators are also useful in conjunction with the
aggregate functions you’ll learn about later. For example,
x / sum(x)
calculates the proportion of a total, and
y - mean(y)
computes the difference from the mean.
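As a quick sketch combining arithmetic with aggregate functions (the columns are from flights):
transmute(flights,
  dist_prop = distance / sum(distance),                   # proportion of total distance
  delay_diff = dep_delay - mean(dep_delay, na.rm = TRUE)) # difference from mean delay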
Let’s create a new variable unit_price = price/carat for the
diamonds data set, which gives the price of a diamond
per carat.
my_data <- mutate(diamonds, unit_price = price/carat)
ggplot(my_data) + geom_histogram(aes(unit_price)) +
labs(title = "Diamond unit price per carat", x = "dollar per carat", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
We may expect that this new variable has already incorporated the
effect of carat
, and then we can study the effect of color
simply by creating a bar plot on this variable.
ggplot(filter(my_data, between(carat, 1, 2))) +
stat_summary(mapping = aes(x = color, y = unit_price, fill = color), geom = "bar", fun = "mean") +
labs(title = "Diamond unit price by color", x = "Color Grade", y = "Mean Unit Price (dollar per carat)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
Here we keep diamonds between 1 carat and 2 carat since the unit price in this group has less variance (when the carat is too low, the value price/carat can suffer much greater variance since carat is in the denominator). By plotting the mean unit price we see a reasonable trend reflecting the effect of color.
Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
## # A tibble: 336,776 × 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # … with 336,766 more rows
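We can verify the identity x == y * (x %/% y) + (x %% y) with a small example:
x <- 17
y <- 5
x %/% y                        # 3
x %% y                         # 2
y * (x %/% y) + (x %% y) == x  # TRUE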
log(), log2(), log10()

Logarithms are an incredibly useful transformation
for dealing with data that ranges across multiple orders of magnitude.
They also convert multiplicative relationships to additive, a feature
we’ll come back to in modelling.
All else being equal, we recommend using log2()
because it’s easy to interpret: a difference of 1 on the log scale
corresponds to doubling on the original scale and a difference of -1
corresponds to halving.
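A small numeric sketch of this interpretation:
log2(c(50, 100, 200, 400))  # about 5.64 6.64 7.64 8.64: each doubling adds 1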
In the loans_full_schema data set, the annual_income
variable spans a few orders of magnitude. It can be useful to take
its logarithm as a new variable.
library(openintro)
my_data <- mutate(loans_full_schema, log_income = log10(annual_income))
ggplot(filter(my_data, between(annual_income, 10000, 1000000))) +
geom_point(aes(loan_amount, annual_income)) +
geom_smooth(aes(loan_amount, annual_income)) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Annual Income vs Loan Amount", x = "Loan Amount ($)", y = "Annual Income ($)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
ggplot(filter(my_data, between(annual_income, 1000, 1000000))) +
geom_point(aes(loan_amount, log_income)) +
geom_smooth(aes(loan_amount, log_income)) +
ylim(4,6) +
labs(title = "Log Annual Income vs Loan Amount", x = "Loan Amount ($)", y = "Log Annual Income") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
lead() and lag()

lead() and dplyr::lag() allow
you to refer to leading or lagging values. This allows you to compute
running differences (e.g. x - dplyr::lag(x)) or find when
values change (x != lag(x)). They are most useful in
conjunction with group_by(), which you’ll learn about
shortly.
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
dplyr::lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
dplyr::lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
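For example, running differences and change indicators (a small sketch):
x <- c(2, 5, 9, 9, 14)
x - dplyr::lag(x)   # NA 3 4 0 5: running differences
x != dplyr::lag(x)  # NA TRUE TRUE FALSE TRUE: where values change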
Exercise: On January 1st, what is the average time between the departure of two domestic flights at EWR?
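One possible approach to this exercise (a sketch; dep_min and gap are helper names introduced here):
ewr_jan1 <- filter(flights, origin == "EWR", month == 1, day == 1, !is.na(dep_time))
ewr_jan1 <- arrange(ewr_jan1, dep_time)
ewr_jan1 <- mutate(ewr_jan1,
  dep_min = dep_time %/% 100 * 60 + dep_time %% 100,  # minutes since midnight
  gap = dep_min - dplyr::lag(dep_min))                # minutes since the previous departure
summarise(ewr_jan1, avg_gap = mean(gap, na.rm = TRUE))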
cumsum(), cumprod(), cummin(), cummax(), cummean()

R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
x <- 1:10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cumprod(x)
## [1] 1 2 6 24 120 720 5040 40320 362880
## [10] 3628800
cummin(x)
## [1] 1 1 1 1 1 1 1 1 1 1
cummax(x)
## [1] 1 2 3 4 5 6 7 8 9 10
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
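If the RcppRoll package is installed, a rolling sum looks like this (a sketch; roll_sum() comes from RcppRoll):
RcppRoll::roll_sum(x, n = 3)  # sums over a sliding window of size 3: 6 9 12 ... 27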
ifelse()
In some cases we hope to create a new label for each sample for analysis purposes. For example, we may want to label each flight as “canceled” or “non-canceled”. Then we may do the following:
my_data <- mutate(flights, cancel_flag = ifelse(is.na(dep_time), "canceled", "non-canceled"))
my_data <- select(my_data, dep_time, cancel_flag, everything())
my_data
## # A tibble: 336,776 × 20
## dep_time cancel_f…¹ year month day sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
## <int> <chr> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
## 1 517 non-cance… 2013 1 1 515 2 830 819 11
## 2 533 non-cance… 2013 1 1 529 4 850 830 20
## 3 542 non-cance… 2013 1 1 540 2 923 850 33
## 4 544 non-cance… 2013 1 1 545 -1 1004 1022 -18
## 5 554 non-cance… 2013 1 1 600 -6 812 837 -25
## 6 554 non-cance… 2013 1 1 558 -4 740 728 12
## 7 555 non-cance… 2013 1 1 600 -5 913 854 19
## 8 557 non-cance… 2013 1 1 600 -3 709 723 -14
## 9 557 non-cance… 2013 1 1 600 -3 838 846 -8
## 10 558 non-cance… 2013 1 1 600 -2 753 745 8
## # … with 336,766 more rows, 10 more variables: carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹cancel_flag, ²sched_dep_time, ³dep_delay, ⁴arr_time, ⁵sched_arr_time,
## # ⁶arr_delay
Here the function ifelse()
takes three arguments -
ifelse(test, value_when_test_returns_true, value_when_test_returns_false)
The first argument test returns TRUE,
FALSE or NA. If TRUE, the
returned value is the value of the second argument; if FALSE,
the value of the third argument; if NA, it returns
NA. The function is vectorized.
x <- 1:10
ifelse(x > 5, ">5", "<=5")
## [1] "<=5" "<=5" "<=5" "<=5" "<=5" ">5" ">5" ">5" ">5" ">5"
Let’s explore whether short-distance or long-distance flights are more likely to be canceled.
ggplot(my_data) +
stat_summary(mapping = aes(x = cancel_flag, y = distance, fill = cancel_flag), geom = "bar", fun = "mean") +
labs(title = "Mean Distance vs Cancelation", x = "", y = "Mean Distance (mile)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
It seems that canceled flights tend to be short-distance ones. Can you make sense of the result?
case_when()
What if we hope to create a categorical variable with more than two
labels? We may use the case_when() function from the
dplyr package. Below is an example.
Let’s create a season label for flights
data set.
my_data <- mutate(flights, season = case_when(
month %in% c(3,4,5) ~ "Spring",
month %in% c(6,7,8) ~ "Summer",
month %in% c(9,10,11) ~ "Autumn",
month %in% c(12,1,2) ~ "Winter",
.default = NA
))
my_data <- select(my_data, season, month, everything())
The template for case_when()
function is
case_when(
condition1 ~ value1,
condition2 ~ value2,
condition3 ~ value3,
...,
.default = value_when_all_conditions_FALSE_or_NA
)
For every element of the input vector, case_when() checks
the conditions in the order given; the value paired with the first
condition that turns out to be TRUE is assigned to the corresponding
position in the output vector, which has the same length as the input. For
example,
x <- -5:5
case_when(
x == 0 ~ "zero",
x %% 2 == 0 ~ "even",
.default = "odd"
)
## [1] "odd" "even" "odd" "even" "odd" "zero" "odd" "even" "odd" "even"
## [11] "odd"
Now we can see which season is the busiest.
ggplot(my_data) +
geom_bar(aes(season, fill = season)) +
labs(title = "Seasonal Flights Distribution", x = "", y = "Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
Lab Exercise: For the diamonds data, create a new variable named
carat_group, which groups the data into
less than 1 carat, 1-2 carat,
2-3 carat and greater than 3 carat.
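One possible solution (a sketch using case_when(); the label strings are a free choice):
my_data <- mutate(diamonds, carat_group = case_when(
  carat < 1 ~ "less than 1 carat",
  carat < 2 ~ "1-2 carat",
  carat < 3 ~ "2-3 carat",
  .default = "greater than 3 carat"
))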
summarise() and group_by()
The last key verb is summarise()
. It collapses a data
frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
(na.rm = TRUE
removes all NA
values when
computing the measure from the data set. We’ll come back to that very
shortly.)
So we see that the average delay of all flights (excluding canceled
ones) is 12.6 minutes. The template for summarise()
(or
equivalently summarize()
) is
summarise(<data_name>, <new_measure1> = <formula1>, <new_measure2> = <formula2>, ...)
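For example, we can compute several measures at once (a small sketch on flights):
summarise(flights,
  mean_delay = mean(dep_delay, na.rm = TRUE),
  max_delay = max(dep_delay, na.rm = TRUE),
  n = n())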
summarise()
is not terribly useful unless we pair it
with group_by()
. This changes the unit of analysis from the
complete data set to individual groups. Then, when you use the
dplyr
verbs on a grouped data frame they’ll be
automatically applied “by group”. For example, if we applied exactly the
same code to a data frame grouped by month, we get the average delay per
month:
by_month <- group_by(flights, month)
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 12 × 2
## month delay
## <int> <dbl>
## 1 1 10.0
## 2 2 10.8
## 3 3 13.2
## 4 4 13.9
## 5 5 13.0
## 6 6 20.8
## 7 7 21.7
## 8 8 12.6
## 9 9 6.72
## 10 10 6.24
## 11 11 5.44
## 12 12 16.6
So we see that the average delay is the worst in June, July and December, when summer or winter breaks start.
Note: Grouping doesn’t change how the data looks (apart from listing how it’s grouped). It changes how it acts with the other dplyr verbs.
by_month
## # A tibble: 336,776 × 19
## # Groups: month [12]
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Other than telling us that the data set is now grouped by the
variable month
, there is nothing different in its
appearance from the original data set.
The template to use group_by()
when grouping a data set
with given variable names is:
group_by(<data_name>, ..., <variable_names>, ...)
Together group_by()
and summarise()
provide
one of the tools that you’ll use most commonly when working with dplyr:
grouped summaries.
You may have wondered about the na.rm
argument we used
above. What happens if we don’t set it?
by_date <- group_by(flights, year, month, day)
summarise(by_date, mean_delay = mean(dep_delay))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # … with 355 more rows
We get a lot of missing values! That’s because aggregation functions
obey the usual rule of missing values: if there’s any missing value in
the input, the output will be a missing value. Fortunately, all
aggregation functions have an na.rm
argument which
removes the missing values prior to computation:
by_date <- group_by(flights, year, month, day)
summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # … with 355 more rows
In this case, where missing values represent canceled flights, we could also tackle the problem by first removing the canceled flights. We’ll save this data set so we can reuse it in the next few examples.
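Saving it now (this is the same filter that reappears in later examples):
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))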
If we hope to know which day resulted in the worst average delay, we may do the following:
by_date <- group_by(flights, year, month, day)
data_sum <- summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))
arrange(data_sum, desc(mean_delay))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 3 8 83.5
## 2 2013 7 1 56.2
## 3 2013 9 2 53.0
## 4 2013 7 10 52.9
## 5 2013 12 5 52.3
## 6 2013 5 23 51.1
## 7 2013 9 12 50.0
## 8 2013 6 28 48.8
## 9 2013 6 24 47.2
## 10 2013 7 22 46.7
## # … with 355 more rows
We see that March 8th, 2013 had the worst delay of the year. What happened on that day? Can you answer the question?
R provides many other summary functions for use with
summarise(). Below are some of the
most helpful ones.
mean(), median()

Measures of location: we have used mean(x), but
median(x) is also useful. The mean is the sum divided by
the length; the median is a value where 50% of x is above
it, and 50% is below it.
It’s sometimes useful to combine aggregation with logical subsetting. For example,
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.7 32.5
## 2 2013 1 2 12.7 32.0
## 3 2013 1 3 5.73 27.7
## 4 2013 1 4 -1.93 28.3
## 5 2013 1 5 -1.53 22.6
## 6 2013 1 6 4.24 24.4
## 7 2013 1 7 -4.95 27.8
## 8 2013 1 8 -3.23 20.8
## 9 2013 1 9 -0.264 25.6
## 10 2013 1 10 -5.90 27.3
## # … with 355 more rows
sum() and n()

The sum() function is commonly used to compute
relative frequencies, in combination with the counting function
n().
my_summary <- summarise(group_by(flights, origin), n = n()) # n() counts how many rows there are for each group
my_summary
## # A tibble: 3 × 2
## origin n
## <chr> <int>
## 1 EWR 120835
## 2 JFK 111279
## 3 LGA 104662
my_summary <- mutate(my_summary, prop = n/sum(n)) # Use `mutate()` to add a column of "proportion"
my_summary
## # A tibble: 3 × 3
## origin n prop
## <chr> <int> <dbl>
## 1 EWR 120835 0.359
## 2 JFK 111279 0.330
## 3 LGA 104662 0.311
ggplot(my_summary) + stat_summary(mapping = aes(x = origin, y = prop), geom = "bar") +
labs(title = "Total Number of Flights from NYC in 2013", x = "Airport Code", y = "Relative Frequency") +
scale_y_continuous(labels = scales::percent) +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
sd(x), IQR(x), mad(x)

Measures of spread: the root mean squared deviation, or standard deviation sd(x),
is the standard measure of spread. The interquartile range
IQR(x) and median absolute deviation mad(x) are
robust equivalents that may be more useful if you have outliers.
For example, let’s find the destination with the highest variation in arrival delay:
not_canceled <- filter(flights, !is.na(arr_delay))
my_data <- group_by(not_canceled, dest)
my_summary <- summarise(my_data, sd = sd(arr_delay))
arrange(my_summary, desc(sd))
## # A tibble: 104 × 2
## dest sd
## <chr> <dbl>
## 1 HNL 60.8
## 2 TUL 60.3
## 3 TVC 59.3
## 4 CAK 58.3
## 5 TYS 58.0
## 6 SAT 57.8
## 7 DSM 56.9
## 8 MSN 56.8
## 9 BHM 56.2
## 10 CVG 55.1
## # … with 94 more rows
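The robust alternatives follow the same pattern (a sketch reusing the grouped data above):
my_summary2 <- summarise(my_data, iqr = IQR(arr_delay), mad = mad(arr_delay))
arrange(my_summary2, desc(iqr))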
min(x), quantile(x, 0.25), max(x)

Measures of rank: quantiles are a generalisation of the median. For example,
quantile(x, 0.25)
will find a value of x
that
is greater than 25% of the values, and less than the remaining 75%.
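A quick numeric illustration:
x <- 1:100
quantile(x, 0.25)  # 25.75: greater than 25% of the values and less than the rest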
For example, if we hope to know when the first and last flights leave each day, we may do the following:
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
first = min(dep_time),
last = max(dep_time)
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # … with 355 more rows
first(x), nth(x, 2), last(x)

Measures of position: first(x), nth(x, 2),
last(x). These work similarly to x[1],
x[2], and x[length(x)], but let you set a
default value if that position does not exist (e.g. when you try to get
the 3rd element from a group that only has two elements). For example,
we can find the first and last departure for each day:
summarise(my_data,
  first_dep = first(dep_time),
  last_dep = last(dep_time)
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # … with 355 more rows
n(), which takes no arguments, returns the size
of the current group. To count the number of non-missing values, use
sum(!is.na(x)). To count the number of distinct (unique)
values, use n_distinct(x).
Counting is so frequently used in data analysis that there is a
separate function count() to do the job:
count(<data_name>, <variable_names>) is
equivalent to first grouping by those variables and then counting the number of
samples in each group.
count(flights, origin)
## # A tibble: 3 × 2
## origin n
## <chr> <int>
## 1 EWR 120835
## 2 JFK 111279
## 3 LGA 104662
For example, we can count the number of distinct carriers serving each destination:
my_data <- group_by(not_canceled, dest)
my_sum <- summarise(my_data, carriers = n_distinct(carrier))
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 94 more rows
%>%
In most examples so far, we need to do a sequence of operations on the data set to fulfill our task. For example:
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, dest)
my_sum <- summarise(my_data, carriers = n_distinct(carrier))
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 94 more rows
As we see, we have to assign the transformed data to a new or existing data set at every step to save the intermediate results. When there are many steps, this is quite cumbersome and inconvenient, and the code becomes less clear for readers.
Instead, we can use the pipe operator
%>%
introduced in the package magrittr
,
which is automatically loaded when we load tidyverse
so we
don’t have to explicitly load it.
The code after using the pipe for the previous example looks like this:
my_data <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers)) %>%
print()
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 94 more rows
So the template for the pipe is:
<new_data_name> <- <original_data_name> %>%
operation1() %>%
operation2() %>%
...
Most importantly, in every operation function (basically all
functions we have introduced so far from dplyr) we can now
omit the data set name as the first argument and
directly chain the operations.
my_data <- filter(diamonds, carat < 2)
my_data <- group_by(my_data, color)
my_summary <- summarise(my_data, mean_carat = mean(carat), mean_price = mean(price))
my_summary <- mutate(my_summary, mean_unit_price = mean_price/mean_carat)
print(my_summary)
## # A tibble: 7 × 4
## color mean_carat mean_price mean_unit_price
## <ord> <dbl> <dbl> <dbl>
## 1 D 0.646 3063. 4743.
## 2 E 0.643 2944. 4577.
## 3 F 0.716 3547. 4952.
## 4 G 0.737 3720. 5047.
## 5 H 0.828 3765. 4548.
## 6 I 0.884 3814. 4314.
## 7 J 0.980 3838. 3917.
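The same sequence reads more cleanly with the pipe:
my_summary <- diamonds %>%
  filter(carat < 2) %>%
  group_by(color) %>%
  summarise(mean_carat = mean(carat), mean_price = mean(price)) %>%
  mutate(mean_unit_price = mean_price/mean_carat) %>%
  print()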
Now let’s practice using %>% to do some analysis.
We will analyze the data set babies
in the package
openintro
. Use the help documents to understand the meaning
of variables.
First, let’s look at the distribution of the mother’s age when giving birth. This can be done by data visualization alone, without any transformation.
ggplot(babies) + geom_histogram(aes(age), binwidth = 1, boundary = 20.5) +
labs(title = "Analysis of 'babies' Data Set", x = "Mother's age when giving birth") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
Next, does the mother’s smoking affect the length of gestation? We can create a box plot to study this question. Again, this only involves data visualization.
ggplot(babies) +
geom_boxplot(mapping = aes(x = as.factor(smoke), y = gestation))
This figure does not look very nice, so we may consider summarizing
the data instead of creating a graph. Note that we will create a new
categorical variable with labels “non-smoker”, “smoker” and NA to
replace 0, 1 and NA in the smoke variable.
smoke_gest <- babies %>%
mutate(smoke_label = fct_recode(as.factor(smoke), "non-smoker" = "0", "smoker" = "1")) %>%
group_by(smoke_label) %>%
summarise(n = n(), mean_gest = mean(gestation, na.rm = TRUE), median_gest = median(gestation, na.rm = TRUE)) %>%
print()
## # A tibble: 3 × 4
## smoke_label n mean_gest median_gest
## <fct> <int> <dbl> <dbl>
## 1 non-smoker 742 280. 281
## 2 smoker 484 278. 279
## 3 <NA> 10 282. 280.
As we see, there is little difference in these measures, and it seems that smoking does not have much effect on the gestation length.
Lab Exercise: Using the babies data set, investigate whether there
is a correlation between the mother’s weight and the gestation length.
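One possible approach (a sketch; cor() computes the Pearson correlation):
babies %>%
  filter(!is.na(weight), !is.na(gestation)) %>%
  summarise(correlation = cor(weight, gestation))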