Let’s load the following libraries for this lecture.
library(nycflights13)
library(tidyverse)
library(openintro)
In this lecture, we will continue to study data transformation using dplyr functions:

mutate() to add or modify columns
summarise() (or summarize()) to summarize data
group_by() to group a data frame by given variables (often used together with summarise())
mutate() function

In many situations, we hope to create new variables from the original data for our convenience when exploring or analyzing it. For example, suppose we want to compute the average speed of each flight, which is not included in the original data. We can do the following:

my_data <- mutate(flights, speed = distance/air_time*60)

This creates a new column named speed, computed from the formula distance / air_time * 60, which is the average speed in miles per hour (air_time is recorded in minutes). The new column is added to a new data set named my_data.
select(my_data, speed)
## # A tibble: 336,776 × 1
## speed
## <dbl>
## 1 370.
## 2 374.
## 3 408.
## 4 517.
## 5 394.
## 6 288.
## 7 404.
## 8 259.
## 9 405.
## 10 319.
## # ℹ 336,766 more rows
Lab Exercise: Find the flight with the highest average speed. What is its destination?
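One possible approach (a sketch, not the only solution): sort by the speed column we just created and inspect the dest column.

my_data <- mutate(flights, speed = distance / air_time * 60)  # miles per hour
fastest <- arrange(my_data, desc(speed))                      # fastest flight first
select(fastest, speed, dest, everything())                    # peek at speed and destination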
mutate() function

The mutate() function creates new columns that are functions of existing variables. It can also modify columns (if the name is the same as an existing column) and delete columns (by setting their value to NULL).

The template for calling the mutate() function is

mutate(<data_name>, <variable_name> = <formula>)

If the <variable_name> matches an existing column, then we modify that column, or delete it if the formula is NULL. Otherwise we create a new column following the formula.
If we only hope to keep the new columns in the new data set, we can use transmute():

transmute(<data_name>, <new_variable_name> = <formula>)
In the example above, we may do the following as well:
my_data <- transmute(flights, speed = distance/air_time*60)
There are many functions for creating new variables that you can use with mutate(). There’s no way to list every possible function that you might use, but here’s a selection of functions that are frequently useful:

Arithmetic operators: +, -, *, /, ^. These are all vectorized, using the so-called “recycling rules”: if one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.

Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.
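A quick illustration of these vectorized operations:

x <- c(2, 4, 6, 8)
x * 10        # the single number 10 is recycled across all of x
x / sum(x)    # proportion of the total: 0.1 0.2 0.3 0.4
x - mean(x)   # difference from the mean: -3 -1 1 3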
Let’s create a new variable unit_price for the data set diamonds, which gives the price of a diamond per carat.
my_data <- mutate(diamonds, unit_price = price/carat)
ggplot(my_data) + geom_histogram(aes(unit_price)) +
labs(title = "Diamond unit price per carat", x = "dollar per carat", y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights data set, you can compute hour and minute from dep_time with:

transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
## # A tibble: 336,776 × 3
## dep_time hour minute
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # ℹ 336,766 more rows
log(), log2(), log10()

Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive, a feature we’ll come back to in modelling.
All else being equal, we recommend using log2()
because it’s easy to interpret: a difference of 1 on the log scale
corresponds to doubling on the original scale and a difference of -1
corresponds to halving.
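A small numeric check of this interpretation:

log2(c(100, 200, 400))   # about 6.64, 7.64, 8.64: each doubling adds 1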
In the loans_full_schema data set, the annual income variable spans several orders of magnitude, so it can be useful to take its logarithm as a new variable.
library(openintro)
my_data <- mutate(loans_full_schema, log_income = log10(annual_income))
ggplot(filter(my_data, between(annual_income, 10000, 1000000))) +
geom_point(aes(loan_amount, annual_income)) +
geom_smooth(aes(loan_amount, annual_income)) +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Annual Income vs Loan Amount", x = "Loan Amount ($)", y = "Annual Income ($)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
ggplot(filter(my_data, between(annual_income, 1000, 1000000))) +
geom_point(aes(loan_amount, log_income)) +
geom_smooth(aes(loan_amount, log_income)) +
ylim(4,6) +
labs(title = "Log Annual Income vs Loan Amount", x = "Loan Amount ($)", y = "Log Annual Income") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
lead() and lag()

dplyr::lead() and dplyr::lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - dplyr::lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.

(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
dplyr::lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
dplyr::lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
Exercise: On January 1st, what is the average time between the departure of two domestic flights at EWR?
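One way to approach this exercise (a sketch; it combines lag() with the modular arithmetic from earlier):

jan1_ewr <- filter(flights, month == 1, day == 1, origin == "EWR", !is.na(dep_time))
jan1_ewr <- arrange(jan1_ewr, dep_time)   # order flights by departure time
jan1_ewr <- mutate(jan1_ewr,
                   dep_min = dep_time %/% 100 * 60 + dep_time %% 100,  # minutes after midnight
                   gap = dep_min - lag(dep_min))   # minutes since the previous departure
summarise(jan1_ewr, avg_gap = mean(gap, na.rm = TRUE))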
cumsum(), cumprod(), cummin(), cummax(), cummean()

Cumulative and rolling aggregates: R provides cumsum(), cumprod(), cummin() and cummax() for running sums, products, minimums and maximums; and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.

x <- 1:10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cumprod(x)
## [1] 1 2 6 24 120 720 5040 40320 362880
## [10] 3628800
cummin(x)
## [1] 1 1 1 1 1 1 1 1 1 1
cummax(x)
## [1] 1 2 3 4 5 6 7 8 9 10
cummean(x)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
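For comparison, here is a rolling (rather than cumulative) mean, a sketch assuming the RcppRoll package is installed:

# install.packages("RcppRoll")   # if not yet installed
RcppRoll::roll_mean(x, n = 3)    # mean over a sliding window of width 3
## [1] 2 3 4 5 6 7 8 9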
ifelse()

In some cases we hope to create a new label for each sample for analysis purposes. For example, suppose we hope to label every flight as “non-canceled” or “canceled”. Then we may do the following:
my_data <- mutate(flights, cancel_flag = ifelse(is.na(dep_time), "canceled", "non-canceled"))
my_data <- select(my_data, dep_time, cancel_flag, everything())
my_data
## # A tibble: 336,776 × 20
## dep_time cancel_flag year month day sched_dep_time dep_delay arr_time
## <int> <chr> <int> <int> <int> <int> <dbl> <int>
## 1 517 non-canceled 2013 1 1 515 2 830
## 2 533 non-canceled 2013 1 1 529 4 850
## 3 542 non-canceled 2013 1 1 540 2 923
## 4 544 non-canceled 2013 1 1 545 -1 1004
## 5 554 non-canceled 2013 1 1 600 -6 812
## 6 554 non-canceled 2013 1 1 558 -4 740
## 7 555 non-canceled 2013 1 1 600 -5 913
## 8 557 non-canceled 2013 1 1 600 -3 709
## 9 557 non-canceled 2013 1 1 600 -3 838
## 10 558 non-canceled 2013 1 1 600 -2 753
## # ℹ 336,766 more rows
## # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Here the function ifelse() takes three arguments:

ifelse(test, value_when_test_returns_true, value_when_test_returns_false)

The first argument test returns TRUE, FALSE or NA. If TRUE, the returned value is the value of the second argument; if FALSE, the value of the third argument; if NA, the function returns NA. The function returns a vectorized result, with one value per element of test.
x <- 1:10
ifelse(x > 5, ">5", "<=5")
## [1] "<=5" "<=5" "<=5" "<=5" "<=5" ">5" ">5" ">5" ">5" ">5"
Let’s explore whether short-distance or long-distance flights are more likely to be canceled.
ggplot(my_data) +
stat_summary(mapping = aes(x = cancel_flag, y = distance, fill = cancel_flag), geom = "bar", fun = "mean") +
labs(title = "Mean Distance vs Cancelation", x = "", y = "Mean Distance (mile)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
It seems that canceled flights tend to be short-distance ones. Can you make sense of this result?
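We can also check the plot numerically with a grouped summary (a sketch; grouped summaries are covered in detail later in this lecture):

summarise(group_by(my_data, cancel_flag), mean_distance = mean(distance))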
Comments: the function stat_summary can be used to create bar plots where the y axis is not a count. We will study it in more detail in the next lecture.
case_when()

What if we hope to create a categorical variable with more than two labels? We may use the case_when() function from the dplyr package. As an example, let’s create a season label for the flights data set.
my_data <- mutate(flights, season = case_when(
month %in% c(3,4,5) ~ "Spring",
month %in% c(6,7,8) ~ "Summer",
month %in% c(9,10,11) ~ "Autumn",
month %in% c(12,1,2) ~ "Winter",
.default = NA
))
my_data <- select(my_data, season, month, everything())
The template for the case_when() function is

case_when(
  condition1 ~ value1,
  condition2 ~ value2,
  condition3 ~ value3,
  ...,
  .default = value_when_all_conditions_FALSE_or_NA
)
For every element in a vector, the case_when() function checks the conditions in the order given; the value paired with the first condition that evaluates to TRUE is assigned to the corresponding position in the output vector, which has the same size as the input. For example,
x <- -5:5
case_when(
x == 0 ~ "zero",
x %% 2 == 0 ~ "even",
.default = "odd"
)
## [1] "odd" "even" "odd" "even" "odd" "zero" "odd" "even" "odd" "even"
## [11] "odd"
Now we can see which season is the busiest one.
ggplot(my_data) +
geom_bar(aes(season, fill = season)) +
labs(title = "Seasonal Flights Distribution", x = "", y = "Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
For the diamonds data, create a new variable named carat_group, which groups the data into less than 1 carat, 1-2 carat, 2-3 carat and greater than 3 carat.
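A sketch of one possible solution, relying on the fact that case_when() checks the conditions in order:

my_data <- mutate(diamonds, carat_group = case_when(
  carat < 1 ~ "less than 1 carat",
  carat < 2 ~ "1-2 carat",              # reached only when carat >= 1
  carat < 3 ~ "2-3 carat",              # reached only when carat >= 2
  .default = "greater than 3 carat"     # everything with carat >= 3
))
count(my_data, carat_group)             # check the group sizes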
summarise() and group_by()

The last key verb is summarise(). It collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
(na.rm = TRUE removes all NA values when computing the measure from the data set. We’ll come back to that very shortly.)

So we see that the average delay of all flights (excluding canceled ones) is 12.6 minutes. The template for summarise() (or equivalently summarize()) is

summarise(<data_name>, <new_measure1> = <formula1>, <new_measure2> = <formula2>, ...)
summarise() is not terribly useful unless we pair it with group_by(), which changes the unit of analysis from the complete data set to individual groups. When you use the dplyr verbs on a grouped data frame, they’ll be automatically applied “by group”. For example, if we apply exactly the same code to a data frame grouped by month, we get the average delay per month:
by_month <- group_by(flights, month)
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 12 × 2
## month delay
## <int> <dbl>
## 1 1 10.0
## 2 2 10.8
## 3 3 13.2
## 4 4 13.9
## 5 5 13.0
## 6 6 20.8
## 7 7 21.7
## 8 8 12.6
## 9 9 6.72
## 10 10 6.24
## 11 11 5.44
## 12 12 16.6
So we see that the average delay is the worst in June, July and December, when summer or winter breaks start.
Note: Grouping doesn’t change how the data looks (apart from listing how it’s grouped). It changes how it acts with the other dplyr verbs.
by_month
## # A tibble: 336,776 × 19
## # Groups: month [12]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Other than telling us that the data set is now grouped by the
variable month
, there is nothing different in its
appearance from the original data set.
The template to use group_by() when grouping a data set by given variables is:

group_by(<data_name>, <variable_name1>, <variable_name2>, ...)
Together group_by()
and summarise()
provide
one of the tools that you’ll use most commonly when working with dplyr:
grouped summaries.
You may have wondered about the na.rm
argument we used
above. What happens if we don’t set it?
by_date <- group_by(flights, year, month, day)
summarise(by_date, mean_delay = mean(dep_delay))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 NA
## 2 2013 1 2 NA
## 3 2013 1 3 NA
## 4 2013 1 4 NA
## 5 2013 1 5 NA
## 6 2013 1 6 NA
## 7 2013 1 7 NA
## 8 2013 1 8 NA
## 9 2013 1 9 NA
## 10 2013 1 10 NA
## # ℹ 355 more rows
We get a lot of missing values! That’s because aggregation functions
obey the usual rule of missing values: if there’s any missing value in
the input, the output will be a missing value. Fortunately, all
aggregation functions have an na.rm
argument which
removes the missing values prior to computation:
by_date <- group_by(flights, year, month, day)
summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # ℹ 355 more rows
In this case, where missing values represent canceled flights, we could also tackle the problem by first removing the canceled flights. We’ll save this data set so we can reuse it in the next few examples.
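A sketch of that step (the name not_canceled matches the data set used in later examples):

not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))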
If we hope to know which day resulted in the worst average delay, we may do the following:
by_date <- group_by(flights, year, month, day)
data_sum <- summarise(by_date, mean_delay = mean(dep_delay, na.rm = TRUE))
arrange(data_sum, desc(mean_delay))
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day mean_delay
## <int> <int> <int> <dbl>
## 1 2013 3 8 83.5
## 2 2013 7 1 56.2
## 3 2013 9 2 53.0
## 4 2013 7 10 52.9
## 5 2013 12 5 52.3
## 6 2013 5 23 51.1
## 7 2013 9 12 50.0
## 8 2013 6 28 48.8
## 9 2013 6 24 47.2
## 10 2013 7 22 46.7
## # ℹ 355 more rows
We see that March 8th, 2013 had the worst average delay of the year. What happened on that day? Can you answer the question?
R provides many other summary functions to use with summarise() when summarizing data. Below are some of the most helpful ones.

mean(), median()

We have used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.

It’s sometimes useful to combine aggregation with logical subsetting. For example,
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.7 32.5
## 2 2013 1 2 12.7 32.0
## 3 2013 1 3 5.73 27.7
## 4 2013 1 4 -1.93 28.3
## 5 2013 1 5 -1.53 22.6
## 6 2013 1 6 4.24 24.4
## 7 2013 1 7 -4.95 27.8
## 8 2013 1 8 -3.23 20.8
## 9 2013 1 9 -0.264 25.6
## 10 2013 1 10 -5.90 27.3
## # ℹ 355 more rows
sum() and n()

The sum() function is commonly used to compute relative frequencies, in combination with the counting function n().
my_summary <- summarise(group_by(flights, origin), n = n()) # n() counts how many rows there are for each group
my_summary
## # A tibble: 3 × 2
## origin n
## <chr> <int>
## 1 EWR 120835
## 2 JFK 111279
## 3 LGA 104662
my_summary <- mutate(my_summary, prop = n/sum(n)) # Use `mutate()` to add a column of "proportion"
my_summary
## # A tibble: 3 × 3
## origin n prop
## <chr> <int> <dbl>
## 1 EWR 120835 0.359
## 2 JFK 111279 0.330
## 3 LGA 104662 0.311
ggplot(my_summary) + stat_summary(mapping = aes(x = origin, y = prop), geom = "bar") +
labs(title = "Total Number of Flights from NYC in 2013", x = "Airport Code", y = "Relative Frequency") +
scale_y_continuous(labels = scales::percent) +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
sd(x), IQR(x), mad(x)

The root mean squared deviation, or standard deviation sd(x) for short, is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.
For example, let’s find the destination with the highest variation in arrival delay:
not_canceled <- filter(flights, !is.na(arr_delay))
my_data <- group_by(not_canceled, dest)
my_summary <- summarise(my_data, sd = sd(arr_delay))
arrange(my_summary, desc(sd))
## # A tibble: 104 × 2
## dest sd
## <chr> <dbl>
## 1 HNL 60.8
## 2 TUL 60.3
## 3 TVC 59.3
## 4 CAK 58.3
## 5 TYS 58.0
## 6 SAT 57.8
## 7 DSM 56.9
## 8 MSN 56.8
## 9 BHM 56.2
## 10 CVG 55.1
## # ℹ 94 more rows
min(x), quantile(x, 0.25), max(x)

Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.
For example, if we hope to know when the first and last flights leave each day, we may do the following:
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, year, month, day)
summarise(my_data,
first = min(dep_time),
last = max(dep_time)
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ℹ 355 more rows
first(x), nth(x, 2), last(x)

These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. when you try to get the 3rd element from a group that only has two elements; see the small demonstration after the output below). For example, we can find the first and last departure for each day:

summarise(my_data,
  first_dep = first(dep_time),
  last_dep = last(dep_time)
)
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ℹ 355 more rows
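A small demonstration of the default value (a sketch):

y <- c(10, 20)
nth(y, 3)                 # NA: there is no 3rd element
nth(y, 3, default = 0)    # returns the supplied default instead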
The function n(), which takes no arguments, returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).

Counting is so frequently used in data analysis that there is a separate function count() to do the job: count(<data_name>, <variable_names>) is equivalent to first grouping by those variables and then counting the number of samples in each group.
count(flights, origin)
## # A tibble: 3 × 2
## origin n
## <chr> <int>
## 1 EWR 120835
## 2 JFK 111279
## 3 LGA 104662
my_data <- group_by(not_canceled, dest)
my_sum <- summarise(my_data, carriers = n_distinct(carrier))
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ℹ 94 more rows
Exercise: Find which origin has the most destinations. A sketch of one approach appears below.
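A sketch: count the distinct destinations per origin, then sort.

my_sum <- summarise(group_by(flights, origin), n_dest = n_distinct(dest))
arrange(my_sum, desc(n_dest))    # origin with the most destinations first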
%>%
In most examples so far, we need to do a sequence of operations on the data set to fulfill our task. For example:
not_canceled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
my_data <- group_by(not_canceled, dest)
my_sum <- summarise(my_data, carriers = n_distinct(carrier))
arrange(my_sum, desc(carriers))
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ℹ 94 more rows
As we see, we have to assign the transformed data to a new or existing data set in every step to save the intermediate results. When there are many steps, this is quite cumbersome and inconvenient, and the code becomes less clear for readers.
Instead, we can use the pipe operator %>% introduced in the package magrittr, which is automatically loaded when we load tidyverse, so we don’t have to load it explicitly.
The code after using the pipe for the previous example looks like this:
my_data <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers)) %>%
print()
## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ℹ 94 more rows
So the template for the pipe is:
<new_data_name> <- <original_data_name> %>%
operation1() %>%
operation2() %>%
...
Most importantly, in every operation function (basically all functions we have introduced so far from dplyr) we can now omit the data set name as the first argument and directly write the operations.
Rewrite the following analysis code using the pipe operator. (One possible rewrite appears after the output below.)
my_data <- filter(diamonds, carat < 2)
my_data <- group_by(my_data, color)
my_summary <- summarise(my_data, mean_carat = mean(carat), mean_price = mean(price))
my_summary <- mutate(my_summary, mean_unit_price = mean_price/mean_carat)
print(my_summary)
## # A tibble: 7 × 4
## color mean_carat mean_price mean_unit_price
## <ord> <dbl> <dbl> <dbl>
## 1 D 0.646 3063. 4743.
## 2 E 0.643 2944. 4577.
## 3 F 0.716 3547. 4952.
## 4 G 0.737 3720. 5047.
## 5 H 0.828 3765. 4548.
## 6 I 0.884 3814. 4314.
## 7 J 0.980 3838. 3917.
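One possible rewrite (a sketch):

my_summary <- diamonds %>%
  filter(carat < 2) %>%
  group_by(color) %>%
  summarise(mean_carat = mean(carat), mean_price = mean(price)) %>%
  mutate(mean_unit_price = mean_price/mean_carat) %>%
  print()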
Now let’s practice using %>% to do analysis.
We will analyze the data set babies
in the package
openintro
. Use the help documents to understand the meaning
of variables.
This can be done by data visualization alone without any transformation.
ggplot(babies) + geom_histogram(aes(age), binwidth = 1, boundary = 20.5) +
labs(title = "Analysis of 'babies' Data Set", x = "Mother's age when giving birth") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
We can create a box plot to study this question. Again this only involves data visualization.
ggplot(babies) +
geom_boxplot(mapping = aes(x = as.factor(smoke), y = gestation))
This figure does not look very nice, so we may consider summarizing the data instead of creating a graph. Note that we will create a new categorical variable with labels “non-smoker”, “smoker” and NA to replace 0, 1 and NA in the smoke variable.
smoke_gest <- babies %>%
mutate(smoke_label = fct_recode(as.factor(smoke), "non-smoker" = "0", "smoker" = "1")) %>%
group_by(smoke_label) %>%
summarise(n = n(), mean_gest = mean(gestation, na.rm = TRUE), median_gest = median(gestation, na.rm = TRUE)) %>%
print()
## # A tibble: 3 × 4
## smoke_label n mean_gest median_gest
## <fct> <int> <dbl> <dbl>
## 1 non-smoker 742 280. 281
## 2 smoker 484 278. 279
## 3 <NA> 10 282. 280.
As we see, there is little difference in these measures, and it seems that smoking status does not matter much to gestation length.
Exercise: For the babies data set in openintro, use a summary table to investigate whether first pregnancy status correlates with gestation length. Use the pipe operator in your code. Submit your code and the result. (A sketch appears below.)
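A sketch of one approach; it assumes the parity variable encodes first pregnancy status (check the help page with ?babies):

babies %>%
  group_by(parity) %>%     # parity: first pregnancy indicator (see ?babies)
  summarise(n = n(),
            mean_gest = mean(gestation, na.rm = TRUE),
            median_gest = median(gestation, na.rm = TRUE))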