Joining Data with dplyr
In this activity we will explore more concepts of data cleaning and
data wrangling with R using the dplyr package.
Learning Objectives
Combine multiple related data sets
Understand the concept of and identify keys as columns for joining tables of data
Understand and implement different types of joins
Load packages necessary for this activity
Example data: Michigan flights
For this activity, we will be using the tidyverse package and several
related data sets that contain information on flights departing from
Michigan airports in 2021 obtained via the anyflights R
package. This package facilitates obtaining air travel and weather data
for airports across the United States.
Importing flights data
# Importing data
michiganFlights <- read_rds("fullMiFlights2021.rds")
# Use the following code to create explicit data tables in our global environment.
# Create tibbles in environment from list
list2env(michiganFlights, envir = .GlobalEnv)## <environment: R_GlobalEnv>
Use the skim() function to explore characteristics of
each of the tables of data.
The columns / variables used to link each pair of data tables are called keys. That is, a key is a column, or collection of columns, that uniquely identifies an observation.
| Name | flights |
| Number of rows | 149445 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 14 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| carrier | 0 | 1 | 2 | 2 | 0 | 15 | 0 |
| tailnum | 117 | 1 | 5 | 6 | 0 | 4136 | 0 |
| origin | 0 | 1 | 3 | 3 | 0 | 4 | 0 |
| dest | 0 | 1 | 3 | 3 | 0 | 114 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2021.00 | 0.00 | 2021 | 2021 | 2021 | 2021 | 2021 | ▁▁▇▁▁ |
| month | 0 | 1.00 | 6.77 | 3.35 | 1 | 4 | 7 | 10 | 12 | ▆▅▆▆▇ |
| day | 0 | 1.00 | 15.74 | 8.78 | 1 | 8 | 16 | 23 | 31 | ▇▇▇▇▆ |
| dep_time | 1382 | 0.99 | 1371.55 | 493.76 | 1 | 944 | 1353 | 1745 | 2400 | ▁▇▇▇▆ |
| sched_dep_time | 0 | 1.00 | 1369.27 | 486.06 | 500 | 930 | 1355 | 1737 | 2327 | ▇▇▆▇▆ |
| dep_delay | 1384 | 0.99 | 6.96 | 45.39 | -34 | -5 | -3 | 1 | 1948 | ▇▁▁▁▁ |
| arr_time | 1433 | 0.99 | 1464.41 | 517.39 | 1 | 1049 | 1455 | 1839 | 2400 | ▁▅▇▇▆ |
| sched_arr_time | 0 | 1.00 | 1481.47 | 507.65 | 1 | 1100 | 1504 | 1840 | 2359 | ▁▃▇▇▆ |
| arr_delay | 1715 | 0.99 | -0.22 | 47.48 | -79 | -17 | -9 | 1 | 1961 | ▇▁▁▁▁ |
| flight | 0 | 1.00 | 371.09 | 221.12 | 1 | 176 | 372 | 548 | 927 | ▇▇▇▆▁ |
| air_time | 1715 | 0.99 | 95.45 | 62.47 | 15 | 50 | 75 | 133 | 393 | ▇▃▁▁▁ |
| distance | 0 | 1.00 | 654.32 | 488.30 | 74 | 296 | 501 | 983 | 2986 | ▇▂▁▁▁ |
| hour | 0 | 1.00 | 13.42 | 4.83 | 5 | 9 | 13 | 17 | 23 | ▇▇▅▇▆ |
| minute | 0 | 1.00 | 27.52 | 18.71 | 0 | 10 | 29 | 45 | 59 | ▇▅▆▅▆ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time_hour | 0 | 1 | 2021-01-01 05:00:00 | 2021-12-31 22:00:00 | 2021-07-12 21:00:00 | 6452 |
| Name | airports |
| Number of rows | 1251 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| faa | 0 | 1.00 | 3 | 3 | 0 | 1251 | 0 |
| name | 0 | 1.00 | 9 | 60 | 0 | 1248 | 0 |
| dst | 48 | 0.96 | 1 | 1 | 0 | 4 | 0 |
| tzone | 119 | 0.90 | 12 | 19 | 0 | 9 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lat | 0 | 1.00 | 40.93 | 10.19 | 19.72 | 34.10 | 39.35 | 44.32 | 71.29 | ▂▇▅▁▂ |
| lon | 0 | 1.00 | -103.47 | 28.11 | -176.65 | -117.88 | -95.91 | -83.29 | 174.11 | ▅▇▁▁▁ |
| alt | 0 | 1.00 | 1121.99 | 1602.52 | -115.00 | 96.00 | 551.00 | 1226.00 | 9070.00 | ▇▁▁▁▁ |
| tz | 48 | 0.96 | -6.50 | 1.59 | -10.00 | -8.00 | -6.00 | -5.00 | 8.00 | ▆▇▁▁▁ |
| Name | airlines |
| Number of rows | 15 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| carrier | 0 | 1 | 2 | 2 | 0 | 15 | 0 |
| name | 0 | 1 | 9 | 22 | 0 | 15 | 0 |
| Name | planes |
| Number of rows | 3962 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| tailnum | 0 | 1 | 5 | 6 | 0 | 3962 | 0 |
| type | 0 | 1 | 23 | 23 | 0 | 1 | 0 |
| manufacturer | 0 | 1 | 6 | 29 | 0 | 11 | 0 |
| model | 0 | 1 | 5 | 15 | 0 | 78 | 0 |
| engine | 0 | 1 | 9 | 9 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 112 | 0.97 | 2008.85 | 7.24 | 1990 | 2003 | 2009 | 2015 | 2021 | ▁▆▇▅▇ |
| engines | 0 | 1.00 | 2.00 | 0.05 | 2 | 2 | 2 | 2 | 3 | ▇▁▁▁▁ |
| seats | 0 | 1.00 | 154.73 | 74.05 | 20 | 95 | 149 | 190 | 451 | ▆▇▂▁▁ |
| speed | 0 | 1.00 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | ▁▁▇▁▁ |
| Name | weather |
| Number of rows | 34897 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 13 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| origin | 0 | 1 | 3 | 3 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2021.00 | 0.00 | 2021.00 | 2021.00 | 2021.00 | 2021.00 | 2021.00 | ▁▁▇▁▁ |
| month | 0 | 1.00 | 6.51 | 3.44 | 1.00 | 4.00 | 7.00 | 9.00 | 12.00 | ▇▆▆▆▇ |
| day | 0 | 1.00 | 15.67 | 8.77 | 1.00 | 8.00 | 16.00 | 23.00 | 31.00 | ▇▇▇▇▆ |
| hour | 0 | 1.00 | 11.50 | 6.92 | 0.00 | 6.00 | 12.00 | 18.00 | 23.00 | ▇▇▆▇▇ |
| temp | 34395 | 0.01 | 45.75 | 18.05 | 10.90 | 30.90 | 37.90 | 64.00 | 82.90 | ▂▇▃▃▃ |
| dewp | 34396 | 0.01 | 36.89 | 18.27 | 5.00 | 21.90 | 30.00 | 51.10 | 73.90 | ▅▇▂▅▃ |
| humid | 34397 | 0.01 | 72.73 | 15.76 | 31.52 | 61.59 | 72.07 | 87.06 | 100.00 | ▂▃▇▆▇ |
| wind_dir | 952 | 0.97 | 181.80 | 107.41 | 0.00 | 90.00 | 200.00 | 270.00 | 360.00 | ▇▃▇▇▆ |
| wind_speed | 465 | 0.99 | 8.09 | 5.36 | 0.00 | 4.60 | 8.06 | 11.51 | 36.82 | ▇▆▂▁▁ |
| wind_gust | 465 | 0.99 | 9.31 | 6.17 | 0.00 | 5.30 | 9.27 | 13.24 | 42.38 | ▇▆▂▁▁ |
| precip | 33636 | 0.04 | 0.01 | 0.03 | 0.00 | 0.00 | 0.01 | 0.01 | 0.44 | ▇▁▁▁▁ |
| pressure | 34629 | 0.01 | 1011.31 | 7.07 | 1000.10 | 1004.38 | 1011.15 | 1018.50 | 1024.00 | ▇▅▅▅▆ |
| visib | 99 | 1.00 | 8.80 | 2.29 | 0.06 | 9.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time_hour | 0 | 1 | 2021-01-01 | 2021-12-30 23:00:00 | 2021-07-01 23:00:00 | 8735 |
Combine the flights and airlines data
frames with left_join() to create a new data set called
flightsCarriers.
Specifying key values manually
Calculate the average flight distance for each carrier using the full name of the carriers. Who had the longest flights on average? Who had the shortest?
Combine the weather and flights data
frames with left_join() to create a new data set called
weatherFlights. How many rows does
weatherFlights have? Hint: First fix an issue with the
time_hour variable using the code below.
Fixing time_hour variable in weather data set
# Fixing time_hour variable in weather data set
weather <- weather |>
mutate(time_hour = lubridate::make_datetime(year, month, day, hour))# Left-join of weather and flights
weatherFlights <- weather |>
left_join(flights)
weatherFlights |> nrow()## [1] 168159
Left joins
Full joins
Combine the weather and flights data
frames with full_join() to create a new data set called
weatherFlightsFull. How many rows does
weatherFlightsFull have?
## [1] 168504
Inner joins
Finally, let’s do something fun with calendaR
Create a heatmap calendar with R using the calendR
package and the code below.
library(calendR)
# Calculating number of flights for each date
nFlights <- flights |>
dplyr::count(year, month, day) |>
arrange(year, month, day)
# Creating heatmap calendar of number of flights
flights_calendar <- calendR(year = 2021,
title = "Flights out of Major Michigan Airports, 2021",
special.days = nFlights$n,
gradient = TRUE,
monthnames = month.name,
months.size = 11,
weeknames.size = 3.3,
legend.pos = "bottom",
legend.title = "Number of flights",
ncol = 3,
margin = 1) +
scale_fill_gradient(low = "white", high = "dodgerblue",
guide = guide_colorbar(frame.colour = "black",
ticks.colour = "black",
title.position = "top")) +
theme(legend.title = )
# Saving plot
size_mult <- 22
ggsave(plot = flights_calendar,
filename = "calendar_plot.png",
units = "mm",
dpi = 120,
width = 7*size_mult,
height = 8.5*size_mult)
knitr::include_graphics("calendar_plot.png")