group_by() lets you define groups within your data set. Its influence becomes clear when calling summarise() on a grouped dataset: summarising statistics are calculated for the different groups separately.

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(hflights)
## Warning: package 'hflights' was built under R version 3.2.2

Generate a per-carrier summary of hflights with the following variables:

hflights %>%
   group_by(UniqueCarrier) %>%
   summarise(n_flights = n(), 
             n_canc = sum(Cancelled == 1), 
             p_canc = mean(Cancelled == 1) * 100, 
             avg_delay = mean(ArrDelay, na.rm = TRUE)) %>%
   arrange(avg_delay, p_canc)
## Source: local data frame [15 x 5]
## 
##    UniqueCarrier n_flights n_canc    p_canc  avg_delay
##            (chr)     (int)  (int)     (dbl)      (dbl)
## 1             US      4082     46 1.1268986 -0.6307692
## 2             AA      3244     60 1.8495684  0.8917558
## 3             FL      2139     21 0.9817672  1.8536239
## 4             AS       365      0 0.0000000  3.1923077
## 5             YV        79      1 1.2658228  4.0128205
## 6             DL      2641     42 1.5903067  6.0841374
## 7             CO     70032    475 0.6782614  6.0986983
## 8             MQ      4648    135 2.9044750  7.1529751
## 9             EV      2204     76 3.4482759  7.2569543
## 10            WN     45343    703 1.5504047  7.5871430
## 11            F9       838      6 0.7159905  7.6682692
## 12            XE     73053   1132 1.5495599  8.1865242
## 13            OO     16061    224 1.3946828  8.6934922
## 14            B6       695     18 2.5899281  9.8588410
## 15            UA      2072     34 1.6409266 10.4628628

Next, order the carriers in the summary from low to high by their average arrival delay. Use percentage of flights cancelled to break any ties. Which airline scores best based on these statistics? Generate a per-day-of-week summary of hflights with the variable avg_taxi, the average total taxiing time. Pipe this summary into an arrange() call such that the day with the highest avg_taxi comes first.

hflights %>% 
   group_by(DayOfWeek) %>%
   summarise(avg_taxi = mean(TaxiIn + TaxiOut, na.rm=TRUE)) %>%
   arrange(desc(avg_taxi))
## Source: local data frame [7 x 2]
## 
##   DayOfWeek avg_taxi
##       (int)    (dbl)
## 1         1 21.77027
## 2         2 21.43505
## 3         4 21.26076
## 4         3 21.19055
## 5         5 21.15805
## 6         7 20.93726
## 7         6 20.43061

You can also combine group_by() with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))
## [1] 1 2 4 3
#has output
#[1] 1 2 4 3

First, discard flights whose arrival delay equals NA. Next, create a by-carrier summary with a single variable: p_delay, the proportion of flights which are delayed at arrival. Next, create a new variable rank in the summary which is a rank according to p_delay. Finally, arrange the observations by this new rank.

hflights %>%
  filter(!is.na(ArrDelay)) %>%
  group_by(UniqueCarrier) %>%
  summarise(p_delay = mean(ArrDelay > 0)) %>%
  mutate(rank = rank(p_delay)) %>%
  arrange(rank)
## Source: local data frame [15 x 3]
## 
##    UniqueCarrier   p_delay  rank
##            (chr)     (dbl) (dbl)
## 1             AA 0.3030208     1
## 2             FL 0.3112269     2
## 3             US 0.3267990     3
## 4             EV 0.3677511     4
## 5             MQ 0.3696714     5
## 6             DL 0.3871092     6
## 7             B6 0.3952452     7
## 8             AS 0.4368132     8
## 9             WN 0.4644557     9
## 10            YV 0.4743590    10
## 11            CO 0.4907385    11
## 12            XE 0.4943420    12
## 13            UA 0.4963109    13
## 14            OO 0.5350105    14
## 15            F9 0.5564904    15

In a similar fashion, keep flights that are delayed (ArrDelay > 0 and not NA). Next, create a by-carrier summary with a single variable: avg, the average delay of the delayed flights. Again add a new variable rank to the summary according to avg. Finally, arrange by this rank variable.

hflights %>%
  filter(!is.na(ArrDelay), ArrDelay > 0) %>%
  group_by(UniqueCarrier) %>%
  summarise(avg = mean(ArrDelay)) %>%
  mutate(rank = rank(avg)) %>%
  arrange(rank)
## Source: local data frame [15 x 3]
## 
##    UniqueCarrier      avg  rank
##            (chr)    (dbl) (dbl)
## 1             YV 18.67568     1
## 2             F9 18.68683     2
## 3             US 20.70235     3
## 4             CO 22.13374     4
## 5             AS 22.91195     5
## 6             OO 24.14663     6
## 7             XE 24.19337     7
## 8             WN 25.27750     8
## 9             FL 27.85693     9
## 10            AA 28.49740    10
## 11            DL 32.12463    11
## 12            UA 32.48067    12
## 13            MQ 38.75135    13
## 14            EV 40.24231    14
## 15            B6 45.47744    15

Which plane (by tail number) flew out of Houston the most times? How many times? Name the column with this frequency n. Assign the result to adv1. To answer this question precisely, you will have to filter() as a final step to end up with only a single observation in adv1.

# Which plane (by tail number) flew out of Houston the most times? How many times? adv1
adv1 <- hflights %>%
          group_by(TailNum) %>%
          summarise(n = n()) %>%
          filter(n == max(n))

How many airplanes only flew to one destination from Houston? Save the resulting dataset in adv2, that contains only a single column, named nplanes and a single row.

# How many airplanes only flew to one destination from Houston? adv2
adv2 <- hflights %>%
          group_by(TailNum) %>%
          summarise(ndest = n_distinct(Dest)) %>%
          filter(ndest == 1) %>%
          summarise(nplanes = n())

Find the most visited destination for each carrier and save your solution to adv3. Your solution should contain four columns:

# Find the most visited destination for each carrier: adv3
adv3 <- hflights %>% 
          group_by(UniqueCarrier, Dest) %>%
          summarise(n = n()) %>%
          mutate(rank = rank(desc(n))) %>%
          filter(rank == 1)

For each destination, find the carrier that travels to that destination the most. Store the result in adv4. Again, your solution should contain 4 columns: Dest, UniqueCarrier, n and rank.

# Find the carrier that travels to each destination the most: adv4
adv4 <- hflights %>% 
          group_by(Dest, UniqueCarrier) %>%
          summarise(n = n()) %>%
          mutate(rank = rank(desc(n))) %>%
          filter(rank == 1)

hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:

library(data.table)
## Warning: package 'data.table' was built under R version 3.2.2
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, last
hflights2 <- as.data.table(hflights)
# Use summarise to calculate n_carrier
s2 <- summarise(hflights2, n_carrier = n_distinct(UniqueCarrier))

nycflights is a mySQL database on the DataCamp server. It contains information about flights that departed from New York City in 2013. This data is similar to the data in hflights, but it does not contain information about cancellations or diversions (you can access the same data in the nycflights13 R package).

nycflights, an R object that stores a connection to the nycflights tbl that lives outside of R on the datacamp server, will be created for you on the right. You can use such connection objects to pull data from databases into R. This lets you work with datasets that are too large to fit in R.

You can learn a connection language to make sophisticated queries from such a database, or you can simply use dplyr. When you run a dplyr command on a database connection, dplyr will convert the command to the database’s native language and do the query for you. As such, just the data that you need from the database will be retrieved. This will usually be a fraction of the total data, which will fit in R without memory issues.

For example, we can easily retrieve a summary of how many carriers and how many flights flew in and out of New York City in 2013 with the code (note that in nycflights, the UniqueCarrier variable is named carrier):

#summarise(nycflights, 
#   n_carriers = n_distinct(carrier), 
#    n_flights = n())
# set up a src that connects to the mysql database (src_mysql is provided by dplyr)
library(dplyr)
#my_db <- src_mysql(dbname = "dplyr", 
#                  host = "dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com", 
#                  port = 3306,
#                  user = "dplyr",
#                  password = "dplyr")

# and reference a table within that src: nycflights is now available as an R object that references to the remote nycflights table
#nycflights <- tbl(my_db, "dplyr")

# glimpse at nycflights
#glimpse(nycflights)

# Calculate the grouped summaries detailed in the instructions
#nycflights %>%
#   group_by(carrier) %>%
#   summarise(n_flights = n(), avg_delay = mean(arr_delay)) %>%
#   arrange(avg_delay)