summarise(), the last of the five verbs, follows the same syntax as mutate(), but where mutate() adds an entire new column, summarise() produces a dataset with a single row. In contrast to the four other data manipulation functions, summarise() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(hflights)
## Warning: package 'hflights' was built under R version 3.2.2
# Print out a summary with variables min_dist and max_dist
summarise(hflights, min_dist=min(Distance), max_dist=max(Distance))
## min_dist max_dist
## 1 79 3904
# Print out a summary with variable max_div
summarise(filter(hflights, Diverted==1), max_div=max(Distance))
## max_div
## 1 3904
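To make the contrast with mutate() concrete, here is a minimal sketch on a small made-up data frame (the `flights` object and its values are hypothetical stand-ins, not part of hflights). It only compares the shape of the two results.

```r
library(dplyr)

# Hypothetical toy data standing in for hflights
flights <- data.frame(Distance = c(100, 250, 400))

# mutate() keeps every row and adds one column
with_km <- mutate(flights, dist_km = Distance * 1.609)
dim(with_km)

# summarise() collapses the data to a single row of statistics
stats <- summarise(flights, min_dist = min(Distance), max_dist = max(Distance))
dim(stats)
```

The mutate() result still has three rows, while the summarise() result has exactly one.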
You can use any function you like in summarise(), so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them. Here are some of the most useful:
- min(x) - minimum value of vector x.
- max(x) - maximum value of vector x.
- mean(x) - mean value of vector x.
- median(x) - median value of vector x.
- quantile(x, p) - pth quantile of vector x.
- sd(x) - standard deviation of vector x.
- var(x) - variance of vector x.
- IQR(x) - Inter Quartile Range (IQR) of vector x.
- diff(range(x)) - total range of vector x.
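As a quick illustration of a few of these functions, the following sketch applies them to a small made-up vector (the data frame `df` and its values are assumptions chosen so the results are easy to check by hand).

```r
library(dplyr)

# Hypothetical toy data; mean is 5, squared deviations sum to 36
df <- data.frame(x = c(2, 4, 4, 5, 10))

res <- summarise(df,
                 med         = median(x),
                 q75         = quantile(x, 0.75, names = FALSE),
                 spread      = sd(x),
                 iqr         = IQR(x),
                 total_range = diff(range(x)))
res
```

Each expression reduces the whole column to one number, which is the property summarise() relies on.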
# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))
# Generate summary about ArrDelay column of temp1
summarise(temp1, earliest=min(ArrDelay), average=mean(ArrDelay), latest=max(ArrDelay), sd=sd(ArrDelay))
## earliest average latest sd
## 1 -70 7.094334 978 30.70852
# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))
# Print the maximum taxiing difference of temp2 with summarise()
summarise(temp2, max_taxi_diff=max(TaxiOut-TaxiIn))
## max_taxi_diff
## 1 160
dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:
- first(x) - The first element of vector x.
- last(x) - The last element of vector x.
- nth(x, n) - The nth element of vector x.
- n() - The number of rows in the data.frame or group of observations that summarise() describes.
- n_distinct(x) - The number of unique values in vector x.
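The dplyr helpers listed above can be tried on a small made-up data frame. This is only a sketch; `trips` and its destination codes are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with four rows and three distinct destinations
trips <- data.frame(Dest = c("DFW", "MIA", "DFW", "SEA"),
                    stringsAsFactors = FALSE)

res <- summarise(trips,
                 n_obs      = n(),              # number of rows
                 n_dest     = n_distinct(Dest), # number of unique values
                 first_dest = first(Dest),
                 third_dest = nth(Dest, 3),
                 last_dest  = last(Dest))
res
```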
In addition to these dplyr-specific functions, you can turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUEs and FALSEs. When you apply sum() or mean() to such a vector, R coerces each TRUE to 1 and each FALSE to 0, so sum() gives the total number of observations that passed the test and mean() gives the proportion.
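A short sketch of this coercion, using a made-up vector of delays (the name `delays` and its values are hypothetical):

```r
# Hypothetical arrival delays in minutes
delays <- c(-5, 12, 0, 30, -2)

# The logical test yields FALSE TRUE FALSE TRUE FALSE
late <- delays > 0

n_late <- sum(late)   # count of TRUE values
p_late <- mean(late)  # share of TRUE values
n_late
p_late
```

Two of the five values pass the test, so the count is 2 and the proportion is 0.4.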
Print out a summary of hflights with the following variables:
- n_obs: the total number of observations,
- n_carrier: the total number of carriers,
- n_dest: the total number of destinations,
- dest100: the destination of the flight in the 100th row of hflights.
# Generate summarizing statistics for hflights
s1 <- summarise(hflights,
                n_obs = n(),
                n_carrier = n_distinct(UniqueCarrier),
                n_dest = n_distinct(Dest),
                dest100 = nth(Dest, 100))
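Filter hflights to keep all flights that are flown by American Airlines (“American”) and save the data frame as aa. This comparison only matches rows if UniqueCarrier has been recoded to full carrier names; the raw hflights data stores carrier codes such as "AA".
# Filter hflights to keep all American Airlines flights: aa
# (assumes UniqueCarrier holds full names; with raw codes, compare against "AA")
aa <- filter(hflights, UniqueCarrier == "American")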
Print out a summary of aa with the following variables:
- n_flights: the total number of flights,
- n_canc: the total number of cancelled flights,
- p_canc: the percentage of cancelled flights,
- avg_delay: the average arrival delay of flights whose delay is not NA.
# Generate summarizing statistics for aa
summarise(aa,
          n_flights = n(),
          n_canc = sum(Cancelled == 1),
          p_canc = mean(Cancelled == 1) * 100,
          avg_delay = mean(ArrDelay, na.rm = TRUE))
## n_flights n_canc p_canc avg_delay
## 1 0 0 NaN NaN
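The zero count and NaN values above indicate that the filter matched no rows. One plausible cause is that UniqueCarrier still contains carrier codes rather than full names. The sketch below shows one way to recode with a named lookup vector; the toy `flights` data and the `lut` vector are hypothetical, and the real data has more carriers than the two shown here.

```r
library(dplyr)

# Hypothetical toy data using carrier codes, as in the raw hflights data
flights <- data.frame(UniqueCarrier = c("AA", "WN", "AA"),
                      Cancelled = c(0, 1, 1),
                      stringsAsFactors = FALSE)

# Filtering on the full name matches nothing while codes are in place
n_raw <- nrow(filter(flights, UniqueCarrier == "American"))

# Assumed lookup vector mapping codes to full names
lut <- c("AA" = "American", "WN" = "Southwest")
flights$UniqueCarrier <- unname(lut[flights$UniqueCarrier])

# After recoding, the same filter finds the American Airlines rows
aa_toy <- filter(flights, UniqueCarrier == "American")
res <- summarise(aa_toy, n_flights = n(), n_canc = sum(Cancelled == 1))
res
```

On an empty data frame, n() and sum() return 0 and mean() returns NaN, which matches the pattern in the output above.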
The %>% operator lets you take the first argument of a function out of its argument list and place it in front of the call, which solves the Dagwood sandwich problem of deeply nested function calls.
The following two statements are equivalent:
mean(c(1, 2, 3, NA), na.rm = TRUE)
## [1] 2
c(1, 2, 3, NA) %>% mean(na.rm = TRUE)
## [1] 2
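The same equivalence holds for dplyr verbs. The sketch below, on a made-up `flights` data frame (hypothetical values), writes one computation first as nested calls and then as a pipeline.

```r
library(dplyr)

# Hypothetical toy data
flights <- data.frame(Distance = c(100, 300, 500),
                      Cancelled = c(0, 0, 1))

# Nested form: read from the inside out
nested <- summarise(filter(flights, Cancelled == 0), avg = mean(Distance))

# Piped form: read from left to right, top to bottom
piped <- flights %>%
  filter(Cancelled == 0) %>%
  summarise(avg = mean(Distance))

nested
piped
```

Both forms return the same one-row result; the piped version simply lists the steps in the order they run.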
Starting with hflights, create a data frame d with the following variables:
- Dest, UniqueCarrier, Distance, and ActualElapsedTime,
- RealTime: the actual elapsed time plus 100 minutes. This will be an estimate of how much time a person spends getting from point A to point B while flying, including getting to the airport, security checks, etc.
- mph: the speed with which you travel if you do the calculations with RealTime.
Filter d to only keep observations for which mph is not NA and for which mph is below 70. Pipe the result to a summarise() call with the following variables:
- n_less, the number of flights with a non-NA mph under 70;
- n_dest, the number of destinations that were traveled to under these conditions;
- min_dist, the minimum distance of these flights;
- max_dist, the maximum distance of these flights.
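The RealTime and mph definitions in this exercise can be checked by hand on a couple of made-up trips. In this sketch the `trip` data frame and its values are hypothetical; Distance is taken to be in miles and ActualElapsedTime in minutes, which is why the ratio is multiplied by 60.

```r
library(dplyr)

# Hypothetical trips: distance in miles, elapsed time in minutes
trip <- data.frame(Distance = c(224, 140),
                   ActualElapsedTime = c(40, 60))

# RealTime adds 100 minutes of overhead; mph converts miles per minute to miles per hour
speeds <- mutate(trip,
                 RealTime = ActualElapsedTime + 100,
                 mph = Distance / RealTime * 60)
speeds
```

The first trip works out to 224 / 140 * 60 = 96 mph and the second to 140 / 160 * 60 = 52.5 mph, so only the second would pass the mph < 70 filter.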
# Part 1, concerning the selection and creation of columns
d <- hflights %>%
  select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60)
# Part 2, concerning flights that had an actual average speed of < 70 mph.
d %>%
  filter(!is.na(mph), mph < 70) %>%
  summarise(n_less = n(),
            n_dest = n_distinct(Dest),
            min_dist = min(Distance),
            max_dist = max(Distance))
## n_less n_dest min_dist max_dist
## 1 6726 13 79 305
Let’s define preferable flights as flights that are at least 1.5 times as fast as driving at 70 mph, i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or diverted flights are less preferable than driving.
Use one single piped call to print a summary with the following variables:
- n_non - the number of non-preferable flights in hflights,
- p_non - the percentage of non-preferable flights in hflights,
- n_dest - the number of destinations that non-preferable flights traveled to,
- min_dist - the minimum distance that non-preferable flights traveled,
- max_dist - the maximum distance that non-preferable flights traveled.
# Solve the exercise using a combination of dplyr verbs and %>%
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarise(n_non = n(),
            p_non = n_non / nrow(hflights) * 100,
            n_dest = n_distinct(Dest),
            min_dist = min(Distance),
            max_dist = max(Distance))
## n_non p_non n_dest min_dist max_dist
## 1 42400 18.63769 113 79 3904
Use summarise() to create a summary of hflights with a single variable, n, that counts the number of overnight flights. These flights have an arrival time that is earlier than their departure time. Only include flights that have no NA values for both DepTime and ArrTime in your count.
# Count the number of overnight flights
hflights %>%
  filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
  summarise(n = n())
## n
## 1 2718
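The overnight test can be traced on a few made-up rows. This sketch assumes times are stored as hhmm integers (so 2330 is 11:30 pm and 45 is 12:45 am); the `toy` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical flights; the third row has a missing departure time
toy <- data.frame(DepTime = c(2330, 900, NA, 2215),
                  ArrTime = c(45, 1030, 1200, 2359))

overnight <- toy %>%
  filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
  summarise(n = n())
overnight
```

Only the first row departs later than it arrives, so the count is 1. Because filter() keeps only rows where the condition is TRUE, rows with an NA comparison would be dropped even without the is.na() checks; spelling them out makes the intent explicit.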