summarise(), the last of the five verbs, follows the same syntax as mutate(). The difference is in the result: mutate() adds a new column with one value per row, whereas summarise() collapses the data into a single row. In contrast to the four other data manipulation functions, summarise() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.
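The contrast between the two verbs is easiest to see on a tiny data frame. The sketch below uses a hypothetical four-row data frame named toy (not part of hflights) and assumes only that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(id = 1:4, dist = c(100, 250, 400, 50))

# mutate() keeps all four rows and adds one new column
with_double <- mutate(toy, dist2 = dist * 2)

# summarise() collapses the four rows into one row of statistics
stats <- summarise(toy, min_dist = min(dist), max_dist = max(dist))

nrow(with_double)  # 4 rows, as in the input
nrow(stats)        # 1 row
```

Note that stats no longer contains the id or dist columns: only the named summaries survive.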

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.2

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(hflights)

## Warning: package 'hflights' was built under R version 3.2.2

# Print out a summary with variables min_dist and max_dist
summarise(hflights, min_dist=min(Distance), max_dist=max(Distance))

##   min_dist max_dist
## 1       79     3904

# Print out a summary with variable max_div
summarise(filter(hflights, Diverted==1), max_div=max(Distance))

##   max_div
## 1    3904

You can use any function you like in summarise(), so long as the function takes a vector of data and returns a single value. dplyr calls such functions aggregating functions, and R contains many of them. Here are some of the most useful:

min(x) - minimum value of vector x.
max(x) - maximum value of vector x.
mean(x) - mean value of vector x.
median(x) - median value of vector x.
quantile(x, p) - pth quantile of vector x.
sd(x) - standard deviation of vector x.
var(x) - variance of vector x.
IQR(x) - interquartile range (IQR) of vector x.
diff(range(x)) - total range of vector x.
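As a rough illustration of the less common entries in this list, the sketch below applies several of them inside one summarise() call. The data frame toy and its column x are hypothetical; the values were chosen so the results are easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(x = c(1, 2, 3, 4, 5))

spread <- summarise(toy,
                    med = median(x),               # middle value: 3
                    q25 = quantile(x, 0.25),       # first quartile: 2
                    iqr = IQR(x),                  # 4 - 2 = 2
                    total_range = diff(range(x)),  # 5 - 1 = 4
                    variance = var(x))             # 2.5

spread
```

range(x) on its own returns two values (the minimum and the maximum), which is why the list wraps it in diff() to obtain a single number.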

# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))

# Generate summary about ArrDelay column of temp1
summarise(temp1, earliest=min(ArrDelay), average=mean(ArrDelay), latest=max(ArrDelay), sd=sd(ArrDelay))

##   earliest  average latest       sd
## 1      -70 7.094334    978 30.70852

# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))

# Print the maximum taxiing difference of temp2 with summarise()
summarise(temp2, max_taxi_diff=max(TaxiOut-TaxiIn))

##   max_taxi_diff
## 1           160

dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

first(x) - The first element of vector x.
last(x) - The last element of vector x.
nth(x, n) - The nth element of vector x.
n() - The number of rows in the data.frame or group of observations that summarise() describes.
n_distinct(x) - The number of unique values in vector x.

Next to these dplyr-specific functions, you can also turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUEs and FALSEs. When you apply sum() or mean() to such a vector, R coerces each TRUE to 1 and each FALSE to 0. This allows you to find the total number or the proportion of observations that passed the test, respectively.
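A minimal sketch of this coercion, using a hypothetical column delay in a made-up data frame toy: three of the five values exceed 10, so sum() counts 3 and mean() gives a proportion of 0.6.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(delay = c(-5, 12, 0, 30, 45))

late <- summarise(toy,
                  n_late = sum(delay > 10),          # count of TRUEs: 3
                  p_late = mean(delay > 10) * 100)   # share of TRUEs as a percentage: 60

late
```

If the tested column contains NA values, the logical vector contains NA as well, and sum() or mean() return NA unless na.rm = TRUE is supplied.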

Print out a summary of hflights with the following variables:

n_obs: the total number of observations,
n_carrier: the total number of carriers,
n_dest: the total number of destinations,
dest100: the destination of the flight in the 100th row of hflights.

# Generate summarizing statistics for hflights
s1 <- summarise(hflights, n_obs = n(), 
                n_carrier = n_distinct(UniqueCarrier), 
                n_dest = n_distinct(Dest), 
                dest100 = nth(Dest, 100))

Filter hflights to keep all flights that are flown by American Airlines and save the data frame as aa. In the raw hflights data, UniqueCarrier holds carrier codes, so American Airlines is "AA". Filtering on the full name "American" matches no rows and produces a summary with n_flights equal to 0 and NaN for the averages.

# Filter hflights to keep all American Airlines flights (carrier code "AA"): aa
aa <- filter(hflights, UniqueCarrier == "AA")

Print out a summary of aa with the following variables:

n_flights: the total number of flights,
n_canc: the total number of cancelled flights,
p_canc: the percentage of cancelled flights,
avg_delay: the average arrival delay of flights whose delay is not NA.

# Generate summarizing statistics for aa 
summarise(aa, 
          n_flights = n(), 
          n_canc = sum(Cancelled == 1), 
          p_canc = mean(Cancelled == 1) * 100, 
          avg_delay = mean(ArrDelay, na.rm = TRUE))

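The %>% operator takes the value on its left-hand side and passes it as the first argument of the function on its right-hand side. A chain of operations can then be read from left to right instead of from the inside out, which solves the Dagwood sandwich problem of deeply nested function calls.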

The following two statements are equivalent:

mean(c(1, 2, 3, NA), na.rm = TRUE)

## [1] 2

c(1, 2, 3, NA) %>% mean(na.rm = TRUE)

## [1] 2
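The same equivalence holds for dplyr verbs. The sketch below writes one computation twice, first as a nested call and then as a pipeline, on a hypothetical data frame toy with made-up columns carrier and dist. Both versions give avg = 400.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(carrier = c("A", "A", "B", "B", "B"),
                  dist = c(100, 300, 200, 400, 600))

# Nested form: read from the inside out
nested <- summarise(filter(toy, carrier == "B"), avg = mean(dist))

# Piped form: read from left to right, top to bottom
piped <- toy %>%
  filter(carrier == "B") %>%
  summarise(avg = mean(dist))

nested$avg  # 400
piped$avg   # 400
```

With only two verbs the nested form is still readable; the pipeline pays off as more steps are added.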

Use dplyr functions and the pipe operator to transform the following English sentences into R code:

Take the hflights data set, and then
add a variable named diff that is the result of subtracting TaxiIn from TaxiOut, and then
pick all of the rows whose diff value does not equal NA, and then
summarise the data set with a value named avg that is the mean diff value.

hflights %>%
  # diff is TaxiOut minus TaxiIn, as the sentence above specifies
  mutate(diff = TaxiOut - TaxiIn) %>%
  # keep only the rows where diff is not NA
  filter(!is.na(diff)) %>%
  summarise(avg = mean(diff))

Starting with hflights, create a data frame d with the following variables:

Dest, UniqueCarrier, Distance, and ActualElapsedTime,
RealTime: the actual elapsed time plus 100 minutes. This will be an estimate of how much time a person spends getting from point A to point B while flying, including getting to the airport, security checks, etc.
mph: the speed in miles per hour with which you travel if you do the calculations with RealTime. RealTime is in minutes, so the ratio of Distance to RealTime is multiplied by 60.

Filter d to only keep observations for which mph is not NA and for which mph is below 70. Pipe the result to a summarise() call with the following variables:

n_less, the number of flights with non-NA mph under 70,
n_dest, the number of destinations that were traveled to under these conditions;
min_dist, the minimum distance of these flights;
max_dist, the maximum distance of these flights.

# Part 1, concerning the selection and creation of columns
d <- hflights %>%
  select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%  
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60)    

# Part 2, concerning flights that had an actual average speed of < 70 mph.
d %>%
  filter(!is.na(mph), mph < 70) %>%
  summarise( n_less = n(), 
             n_dest = n_distinct(Dest), 
             min_dist = min(Distance), 
             max_dist = max(Distance))

##   n_less n_dest min_dist max_dist
## 1   6726     13       79      305

Let’s define preferable flights as flights that are at least 1.5 times as fast as driving at 70 mph, i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or diverted flights are less preferable than driving.

Use one single piped call to print a summary with the following variables:

n_non - the number of non-preferable flights in hflights,
p_non - the percentage of non-preferable flights in hflights,
n_dest - the number of destinations that non-preferable flights traveled to,
min_dist - the minimum distance that non-preferable flights traveled,
max_dist - the maximum distance that non-preferable flights traveled.

# Solve the exercise using a combination of dplyr verbs and %>%
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarise(n_non = n(), 
            p_non = n_non / nrow(hflights) * 100, 
            n_dest = n_distinct(Dest), 
            min_dist = min(Distance), 
            max_dist = max(Distance))

##   n_non    p_non n_dest min_dist max_dist
## 1 42400 18.63769    113       79     3904

Use summarise() to create a summary of hflights with a single variable, n, that counts the number of overnight flights. These flights have an arrival time that is earlier than their departure time. Only include flights that have non-NA values for both DepTime and ArrTime in your count.

# Count the number of overnight flights
hflights %>%
  filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
  summarise(n = n())

##      n
## 1 2718
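One detail worth knowing here is that filter() keeps only the rows for which the condition is TRUE, so rows where a comparison evaluates to NA are dropped as well. The sketch below checks this on a hypothetical data frame toy with made-up columns dep and arr; the explicit !is.na() tests mainly document the intent.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(dep = c(2300, 900, NA, 2200),
                  arr = c(100, 1100, 500, NA))

# With explicit NA checks, as in the exercise
explicit <- toy %>%
  filter(!is.na(dep), !is.na(arr), dep > arr) %>%
  summarise(n = n())

# Without them: dep > arr is NA for rows 3 and 4, and filter() drops those rows
implicit <- toy %>%
  filter(dep > arr) %>%
  summarise(n = n())

explicit$n  # 1
implicit$n  # 1
```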