summarise(), the last of the five verbs, follows the same syntax as mutate(), but the result is a dataset with a single row rather than a dataset with a new column. Unlike the other four data manipulation functions, summarise() does not return an altered copy of the dataset it summarises; instead, it builds a new dataset that contains only the summary statistics.

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(hflights)
## Warning: package 'hflights' was built under R version 3.2.2
# Print out a summary with variables min_dist and max_dist
summarise(hflights, min_dist=min(Distance), max_dist=max(Distance))
##   min_dist max_dist
## 1       79     3904
# Print out a summary with variable max_div
summarise(filter(hflights, Diverted==1), max_div=max(Distance))
##   max_div
## 1    3904
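
The contrast with mutate() can be sketched on a tiny data frame. The data frame toy and its values are hypothetical and are not part of hflights; this is only an illustration of the row-count behaviour described above.

```r
library(dplyr)

# Hypothetical toy data, not taken from hflights
toy <- data.frame(Distance = c(100, 250, 400))

# mutate() keeps every row and adds a new column
with_km <- mutate(toy, km = Distance * 1.609344)
dim(with_km)

# summarise() builds a new one-row dataset holding only the statistics
stats <- summarise(toy, min_dist = min(Distance), max_dist = max(Distance))
dim(stats)
```

Here with_km still has three rows, while stats has exactly one.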

You can use any function you like in summarise(), so long as the function takes a vector of data and returns a single number. dplyr calls these aggregating functions, and R contains many of them. Some of the most useful are min(), max(), mean(), median(), and sd(); the examples below use several of them:

# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))

# Generate summary about ArrDelay column of temp1
summarise(temp1, earliest=min(ArrDelay), average=mean(ArrDelay), latest=max(ArrDelay), sd=sd(ArrDelay))
##   earliest  average latest       sd
## 1      -70 7.094334    978 30.70852
# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))

# Print the maximum taxiing difference of temp2 with summarise()
summarise(temp2, max_taxi_diff=max(TaxiOut-TaxiIn))
##   max_taxi_diff
## 1           160
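
Because summarise() accepts any function that reduces a vector to a single number, a user-defined function works too. The sketch below assumes a hypothetical helper value_range() and a hypothetical toy data frame; neither comes from the original material.

```r
library(dplyr)

# Hypothetical helper: spread between the largest and smallest value
value_range <- function(x) max(x) - min(x)

# Hypothetical toy data, not taken from hflights
toy <- data.frame(delay = c(-5, 0, 12, 30))

# Mix a custom aggregating function with a built-in one
res <- summarise(toy, spread = value_range(delay), med = median(delay))
res
```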

In addition to the aggregating functions already defined in R, dplyr provides several helpful ones of its own. These include first(), last(), nth(), n(), and n_distinct().
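
As a minimal sketch, the dplyr helpers first(), last(), nth(), n(), and n_distinct() can be combined in one summarise() call. The toy data frame and its values are hypothetical and only serve to make the results easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data, not taken from hflights
toy <- data.frame(Dest = c("AUS", "DFW", "AUS", "LAX"),
                  Distance = c(140, 224, 140, 1379))

res <- summarise(toy,
                 n_rows = n(),                    # number of rows
                 n_dest = n_distinct(Dest),       # number of unique destinations
                 first_dist = first(Distance),    # first value
                 second_dist = nth(Distance, 2),  # second value
                 last_dist = last(Distance))      # last value
res
```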

Besides these dplyr-specific functions, you can also turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUE and FALSE values. When you apply sum() or mean() to such a vector, R coerces each TRUE to 1 and each FALSE to 0. This lets you find the total number or the proportion of observations that passed the test, respectively.
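
A short sketch of this coercion, using a hypothetical toy data frame rather than hflights: two of the five delays below are positive, so sum() should count 2 and mean() should give 0.4.

```r
library(dplyr)

# Hypothetical toy data, not taken from hflights
toy <- data.frame(ArrDelay = c(-3, 10, 0, 45, -8))

res <- summarise(toy,
                 n_late = sum(ArrDelay > 0),   # count of rows passing the test
                 p_late = mean(ArrDelay > 0))  # proportion of rows passing the test
res
```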

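Filter hflights to keep all flights flown by American Airlines (“American”) and save the resulting data frame as aa. This assumes UniqueCarrier has already been recoded to full carrier names; in the raw hflights data, American Airlines appears under the code "AA".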

# Filter hflights to keep all American Airline flights: aa
aa <- filter(hflights, UniqueCarrier == "American")

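Nested function calls have to be read from the inside out, which is sometimes called the Dagwood sandwich problem. The %>% operator solves it by letting you take the first argument of a function out of the argument list and put it in front of the call: x %>% f(y) is equivalent to f(x, y).

The following two statements are equivalent: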

mean(c(1, 2, 3, NA), na.rm = TRUE)
## [1] 2
c(1, 2, 3, NA) %>% mean(na.rm = TRUE)
## [1] 2
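
The same equivalence holds for chains of dplyr verbs. The sketch below writes one computation twice, nested and piped, on a hypothetical toy data frame; the column speed is an illustrative name, not one from the original material.

```r
library(dplyr)

# Hypothetical toy data, not taken from hflights
toy <- data.frame(Distance = c(120, 300, 500),
                  AirTime = c(60, 90, NA))

# Nested version: has to be read from the inside out
nested <- summarise(
  filter(mutate(toy, speed = Distance / AirTime * 60), !is.na(speed)),
  avg_speed = mean(speed)
)

# Piped version: reads top to bottom in the order the steps happen
piped <- toy %>%
  mutate(speed = Distance / AirTime * 60) %>%
  filter(!is.na(speed)) %>%
  summarise(avg_speed = mean(speed))

all.equal(nested, piped)
```

Both versions give an avg_speed of 160 for this toy data; only the readability differs.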

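Use dplyr functions and the pipe operator to transform the following English sentences into R code: take the hflights dataset, then add a variable named diff that is the difference between TaxiIn and TaxiOut, then keep the rows whose diff value is not NA, then summarise the data with a value named avg, the mean of diff.

hflights %>%
  mutate(diff = TaxiIn - TaxiOut) %>%
  filter(!is.na(diff)) %>%   # keep only rows where diff is not NA
  summarise(avg = mean(diff))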

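Starting with hflights, create a data frame d with the following variables: Dest, UniqueCarrier, Distance, and ActualElapsedTime, plus RealTime, the actual elapsed time plus 100 minutes, and mph, the average speed calculated as Distance / RealTime * 60.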

# Part 1, concerning the selection and creation of columns
d <- hflights %>%
  select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%  
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60)    

# Part 2, concerning flights that had an actual average speed of < 70 mph.
d %>%
  filter(!is.na(mph), mph < 70) %>%
  summarise(n_less = n(),
            n_dest = n_distinct(Dest),
            min_dist = min(Distance),
            max_dist = max(Distance))
##   n_less n_dest min_dist max_dist
## 1   6726     13       79      305
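
To make the mph arithmetic concrete, here is a small sketch on two hypothetical flights (the values are invented for illustration). A 140-mile flight with 40 minutes of elapsed time has a RealTime of 140 minutes, so its real-time speed is 140 / 140 * 60 = 60 mph.

```r
library(dplyr)

# Hypothetical toy data, not taken from hflights
toy <- data.frame(Distance = c(140, 700),
                  ActualElapsedTime = c(40, 110))

# RealTime is in minutes, so multiplying by 60 converts to miles per hour
speeds <- toy %>%
  mutate(RealTime = ActualElapsedTime + 100,
         mph = Distance / RealTime * 60)

# Count the flights slower than 70 mph in real time
slow <- speeds %>%
  filter(!is.na(mph), mph < 70) %>%
  summarise(n_less = n())
slow
```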

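Let’s define preferable flights as flights that are at least 1.5 times as fast as driving at 70 mph, i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or diverted flights are less preferable than driving.

Use one single piped call to print a summary with the following variables: n_non, the number of non-preferable flights; p_non, the percentage of all flights in hflights that are non-preferable; n_dest, the number of distinct destinations of these flights; and min_dist and max_dist, their minimum and maximum distance.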

# Solve the exercise using a combination of dplyr verbs and %>%
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarise(n_non = n(), 
            p_non = n_non / nrow(hflights) * 100, 
            n_dest = n_distinct(Dest), 
            min_dist = min(Distance),
            max_dist = max(Distance))
##   n_non    p_non n_dest min_dist max_dist
## 1 42400 18.63769    113       79     3904

Use summarise() to create a summary of hflights with a single variable, n, that counts the number of overnight flights. These are flights whose arrival time is earlier than their departure time. Only count flights that have non-NA values for both DepTime and ArrTime.

# Count the number of overnight flights
hflights %>%
  filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
  summarise(n = n())
##      n
## 1 2718
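
The overnight test can be checked by hand on a hypothetical toy data frame, assuming times are stored as hhmm integers as in hflights. Only the first row below departs later in the day than it arrives and has both times present.

```r
library(dplyr)

# Hypothetical toy data, times as hhmm integers
toy <- data.frame(DepTime = c(2350, 900, NA),
                  ArrTime = c(30, 1100, 15))

overnight <- toy %>%
  filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
  summarise(n = n())
overnight
```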