group_by() lets you define groups within your data set. Its influence becomes clear when calling summarise() on a grouped dataset: summarising statistics are calculated for the different groups separately.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(hflights)
## Warning: package 'hflights' was built under R version 3.2.2
Generate a per-carrier summary of hflights with the following variables:
- n_flights, the number of flights flown by the carrier;
- n_canc, the number of cancelled flights;
- p_canc, the percentage of cancelled flights;
- avg_delay, the average arrival delay of flights whose delay does not equal NA.
hflights %>%
group_by(UniqueCarrier) %>%
summarise(n_flights = n(),
n_canc = sum(Cancelled == 1),
p_canc = mean(Cancelled == 1) * 100,
avg_delay = mean(ArrDelay, na.rm = TRUE)) %>%
arrange(avg_delay, p_canc)
## Source: local data frame [15 x 5]
##
## UniqueCarrier n_flights n_canc p_canc avg_delay
## (chr) (int) (int) (dbl) (dbl)
## 1 US 4082 46 1.1268986 -0.6307692
## 2 AA 3244 60 1.8495684 0.8917558
## 3 FL 2139 21 0.9817672 1.8536239
## 4 AS 365 0 0.0000000 3.1923077
## 5 YV 79 1 1.2658228 4.0128205
## 6 DL 2641 42 1.5903067 6.0841374
## 7 CO 70032 475 0.6782614 6.0986983
## 8 MQ 4648 135 2.9044750 7.1529751
## 9 EV 2204 76 3.4482759 7.2569543
## 10 WN 45343 703 1.5504047 7.5871430
## 11 F9 838 6 0.7159905 7.6682692
## 12 XE 73053 1132 1.5495599 8.1865242
## 13 OO 16061 224 1.3946828 8.6934922
## 14 B6 695 18 2.5899281 9.8588410
## 15 UA 2072 34 1.6409266 10.4628628
Next, order the carriers in the summary from low to high by their average arrival delay. Use percentage of flights cancelled to break any ties. Which airline scores best based on these statistics? Generate a per-day-of-week summary of hflights with the variable avg_taxi, the average total taxiing time. Pipe this summary into an arrange() call such that the day with the highest avg_taxi comes first.
hflights %>%
group_by(DayOfWeek) %>%
summarise(avg_taxi = mean(TaxiIn + TaxiOut, na.rm=TRUE)) %>%
arrange(desc(avg_taxi))
## Source: local data frame [7 x 2]
##
## DayOfWeek avg_taxi
## (int) (dbl)
## 1 1 21.77027
## 2 2 21.43505
## 3 4 21.26076
## 4 3 21.19055
## 5 5 21.15805
## 6 7 20.93726
## 7 6 20.43061
You can also combine group_by() with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.
rank(c(21, 22, 24, 23))
## [1] 1 2 4 3
#has output
#[1] 1 2 4 3
- As with arrange(), rank() ranks values from the largest to the smallest and this behavior can be reversed with the desc() function.
First, discard flights whose arrival delay equals NA. Next, create a by-carrier summary with a single variable: p_delay, the proportion of flights which are delayed at arrival. Next, create a new variable rank in the summary which is a rank according to p_delay. Finally, arrange the observations by this new rank.
hflights %>%
filter(!is.na(ArrDelay)) %>%
group_by(UniqueCarrier) %>%
summarise(p_delay = mean(ArrDelay > 0)) %>%
mutate(rank = rank(p_delay)) %>%
arrange(rank)
## Source: local data frame [15 x 3]
##
## UniqueCarrier p_delay rank
## (chr) (dbl) (dbl)
## 1 AA 0.3030208 1
## 2 FL 0.3112269 2
## 3 US 0.3267990 3
## 4 EV 0.3677511 4
## 5 MQ 0.3696714 5
## 6 DL 0.3871092 6
## 7 B6 0.3952452 7
## 8 AS 0.4368132 8
## 9 WN 0.4644557 9
## 10 YV 0.4743590 10
## 11 CO 0.4907385 11
## 12 XE 0.4943420 12
## 13 UA 0.4963109 13
## 14 OO 0.5350105 14
## 15 F9 0.5564904 15
In a similar fashion, keep flights that are delayed (ArrDelay > 0 and not NA). Next, create a by-carrier summary with a single variable: avg, the average delay of the delayed flights. Again add a new variable rank to the summary according to avg. Finally, arrange by this rank variable.
hflights %>%
filter(!is.na(ArrDelay), ArrDelay > 0) %>%
group_by(UniqueCarrier) %>%
summarise(avg = mean(ArrDelay)) %>%
mutate(rank = rank(avg)) %>%
arrange(rank)
## Source: local data frame [15 x 3]
##
## UniqueCarrier avg rank
## (chr) (dbl) (dbl)
## 1 YV 18.67568 1
## 2 F9 18.68683 2
## 3 US 20.70235 3
## 4 CO 22.13374 4
## 5 AS 22.91195 5
## 6 OO 24.14663 6
## 7 XE 24.19337 7
## 8 WN 25.27750 8
## 9 FL 27.85693 9
## 10 AA 28.49740 10
## 11 DL 32.12463 11
## 12 UA 32.48067 12
## 13 MQ 38.75135 13
## 14 EV 40.24231 14
## 15 B6 45.47744 15
Which plane (by tail number) flew out of Houston the most times? How many times? Name the column with this frequency n. Assign the result to adv1. To answer this question precisely, you will have to filter() as a final step to end up with only a single observation in adv1.
# Which plane (by tail number) flew out of Houston the most times? How many times? adv1
adv1 <- hflights %>%
group_by(TailNum) %>%
summarise(n = n()) %>%
filter(n == max(n))
How many airplanes only flew to one destination from Houston? Save the resulting dataset in adv2, that contains only a single column, named nplanes and a single row.
# How many airplanes only flew to one destination from Houston? adv2
adv2 <- hflights %>%
group_by(TailNum) %>%
summarise(ndest = n_distinct(Dest)) %>%
filter(ndest == 1) %>%
summarise(nplanes = n())
Find the most visited destination for each carrier and save your solution to adv3. Your solution should contain four columns:
- UniqueCarrier and Dest,
- n, how often a carrier visited a particular destination,
- rank, how each destination ranks per carrier. rank should be 1 for every row, as you want to find the most visited destination for each carrier.
# Find the most visited destination for each carrier: adv3
adv3 <- hflights %>%
group_by(UniqueCarrier, Dest) %>%
summarise(n = n()) %>%
mutate(rank = rank(desc(n))) %>%
filter(rank == 1)
For each destination, find the carrier that travels to that destination the most. Store the result in adv4. Again, your solution should contain 4 columns: Dest, UniqueCarrier, n and rank.
# Find the carrier that travels to each destination the most: adv4
adv4 <- hflights %>%
group_by(Dest, UniqueCarrier) %>%
summarise(n = n()) %>%
mutate(rank = rank(desc(n))) %>%
filter(rank == 1)
hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:
library(data.table)
## Warning: package 'data.table' was built under R version 3.2.2
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:dplyr':
##
## between, last
hflights2 <- as.data.table(hflights)
- hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line. Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.
# Use summarise to calculate n_carrier
s2 <- summarise(hflights2, n_carrier = n_distinct(UniqueCarrier))
nycflights is a mySQL database on the DataCamp server. It contains information about flights that departed from New York City in 2013. This data is similar to the data in hflights, but it does not contain information about cancellations or diversions (you can access the same data in the nycflights13 R package).
nycflights, an R object that stores a connection to the nycflights tbl that lives outside of R on the datacamp server, will be created for you on the right. You can use such connection objects to pull data from databases into R. This lets you work with datasets that are too large to fit in R.
You can learn a connection language to make sophisticated queries from such a database, or you can simply use dplyr. When you run a dplyr command on a database connection, dplyr will convert the command to the database’s native language and do the query for you. As such, just the data that you need from the database will be retrieved. This will usually be a fraction of the total data, which will fit in R without memory issues.
For example, we can easily retrieve a summary of how many carriers and how many flights flew in and out of New York City in 2013 with the code (note that in nycflights, the UniqueCarrier variable is named carrier):
#summarise(nycflights,
# n_carriers = n_distinct(carrier),
# n_flights = n())
- Try to understand the already available code on the right. This code will create a reference to a tbl that resides on DataCamp’s servers.
- Glimpse at nycflights. Although nycflights is a reference to a tbl in a remote database, there is no difference in syntax nor output!
- Group the nycflights data by carrier, then create a grouped summary of the data that shows the number of flights (n_flights) flown by each carrier and the average arrival delay (avg_delay) of flights flown by each carrier. Finally, arrange the carriers by average delay from low to high.
# set up a src that connects to the mysql database (src_mysql is provided by dplyr)
library(dplyr)
#my_db <- src_mysql(dbname = "dplyr",
# host = "dplyr.csrrinzqubik.us-east-1.rds.amazonaws.com",
# port = 3306,
# user = "dplyr",
# password = "dplyr")
# and reference a table within that src: nycflights is now available as an R object that references to the remote nycflights table
#nycflights <- tbl(my_db, "dplyr")
# glimpse at nycflights
#glimpse(nycflights)
# Calculate the grouped summaries detailed in the instructions
#nycflights %>%
# group_by(carrier) %>%
# summarise(n_flights = n(), avg_delay = mean(arr_delay)) %>%
# arrange(avg_delay)