Joining Data with dplyr
In this activity we will explore more concepts of data cleaning and
data wrangling with R using the dplyr
package.
Learning Objectives
Combine multiple related data sets
Understand the concept of and identify keys as columns for joining tables of data
Understand and implement different types of joins
Load packages necessary for this activity
Example data: Michigan flights
For this activity, we will be using the tidyverse package and several
related data sets that contain information on flights departing from
Michigan airports in 2021 obtained via the anyflights
R
package. This package facilitates obtaining air travel and weather data
for airports across the United States.
Importing flights data
# Importing data
michiganFlights <- read_rds("fullMiFlights2021.rds")
# Use the following code to create explicit data tables in our global environment.
# Create tibbles in environment from list
list2env(michiganFlights, envir = .GlobalEnv)
## <environment: R_GlobalEnv>
Use the skim()
function to explore characteristics of
each of the tables of data.
The columns / variables used to link each pair of data tables are called keys. That is, a key is a column, or collection of columns, that uniquely identifies an observation.
Name | flights |
Number of rows | 149445 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 14 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
carrier | 0 | 1 | 2 | 2 | 0 | 15 | 0 |
tailnum | 117 | 1 | 5 | 6 | 0 | 4136 | 0 |
origin | 0 | 1 | 3 | 3 | 0 | 4 | 0 |
dest | 0 | 1 | 3 | 3 | 0 | 114 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2021.00 | 0.00 | 2021 | 2021 | 2021 | 2021 | 2021 | ▁▁▇▁▁ |
month | 0 | 1.00 | 6.77 | 3.35 | 1 | 4 | 7 | 10 | 12 | ▆▅▆▆▇ |
day | 0 | 1.00 | 15.74 | 8.78 | 1 | 8 | 16 | 23 | 31 | ▇▇▇▇▆ |
dep_time | 1382 | 0.99 | 1371.55 | 493.76 | 1 | 944 | 1353 | 1745 | 2400 | ▁▇▇▇▆ |
sched_dep_time | 0 | 1.00 | 1369.27 | 486.06 | 500 | 930 | 1355 | 1737 | 2327 | ▇▇▆▇▆ |
dep_delay | 1384 | 0.99 | 6.96 | 45.39 | -34 | -5 | -3 | 1 | 1948 | ▇▁▁▁▁ |
arr_time | 1433 | 0.99 | 1464.41 | 517.39 | 1 | 1049 | 1455 | 1839 | 2400 | ▁▅▇▇▆ |
sched_arr_time | 0 | 1.00 | 1481.47 | 507.65 | 1 | 1100 | 1504 | 1840 | 2359 | ▁▃▇▇▆ |
arr_delay | 1715 | 0.99 | -0.22 | 47.48 | -79 | -17 | -9 | 1 | 1961 | ▇▁▁▁▁ |
flight | 0 | 1.00 | 371.09 | 221.12 | 1 | 176 | 372 | 548 | 927 | ▇▇▇▆▁ |
air_time | 1715 | 0.99 | 95.45 | 62.47 | 15 | 50 | 75 | 133 | 393 | ▇▃▁▁▁ |
distance | 0 | 1.00 | 654.32 | 488.30 | 74 | 296 | 501 | 983 | 2986 | ▇▂▁▁▁ |
hour | 0 | 1.00 | 13.42 | 4.83 | 5 | 9 | 13 | 17 | 23 | ▇▇▅▇▆ |
minute | 0 | 1.00 | 27.52 | 18.71 | 0 | 10 | 29 | 45 | 59 | ▇▅▆▅▆ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
time_hour | 0 | 1 | 2021-01-01 05:00:00 | 2021-12-31 22:00:00 | 2021-07-12 21:00:00 | 6452 |
Name | airports |
Number of rows | 1251 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
faa | 0 | 1.00 | 3 | 3 | 0 | 1251 | 0 |
name | 0 | 1.00 | 9 | 60 | 0 | 1248 | 0 |
dst | 48 | 0.96 | 1 | 1 | 0 | 4 | 0 |
tzone | 119 | 0.90 | 12 | 19 | 0 | 9 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
lat | 0 | 1.00 | 40.93 | 10.19 | 19.72 | 34.10 | 39.35 | 44.32 | 71.29 | ▂▇▅▁▂ |
lon | 0 | 1.00 | -103.47 | 28.11 | -176.65 | -117.88 | -95.91 | -83.29 | 174.11 | ▅▇▁▁▁ |
alt | 0 | 1.00 | 1121.99 | 1602.52 | -115.00 | 96.00 | 551.00 | 1226.00 | 9070.00 | ▇▁▁▁▁ |
tz | 48 | 0.96 | -6.50 | 1.59 | -10.00 | -8.00 | -6.00 | -5.00 | 8.00 | ▆▇▁▁▁ |
Name | airlines |
Number of rows | 15 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
character | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
carrier | 0 | 1 | 2 | 2 | 0 | 15 | 0 |
name | 0 | 1 | 9 | 22 | 0 | 15 | 0 |
Name | planes |
Number of rows | 3962 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
tailnum | 0 | 1 | 5 | 6 | 0 | 3962 | 0 |
type | 0 | 1 | 23 | 23 | 0 | 1 | 0 |
manufacturer | 0 | 1 | 6 | 29 | 0 | 11 | 0 |
model | 0 | 1 | 5 | 15 | 0 | 78 | 0 |
engine | 0 | 1 | 9 | 9 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 112 | 0.97 | 2008.85 | 7.24 | 1990 | 2003 | 2009 | 2015 | 2021 | ▁▆▇▅▇ |
engines | 0 | 1.00 | 2.00 | 0.05 | 2 | 2 | 2 | 2 | 3 | ▇▁▁▁▁ |
seats | 0 | 1.00 | 154.73 | 74.05 | 20 | 95 | 149 | 190 | 451 | ▆▇▂▁▁ |
speed | 0 | 1.00 | 0.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | ▁▁▇▁▁ |
Name | weather |
Number of rows | 34897 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 13 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
origin | 0 | 1 | 3 | 3 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2021.00 | 0.00 | 2021.00 | 2021.00 | 2021.00 | 2021.00 | 2021.00 | ▁▁▇▁▁ |
month | 0 | 1.00 | 6.51 | 3.44 | 1.00 | 4.00 | 7.00 | 9.00 | 12.00 | ▇▆▆▆▇ |
day | 0 | 1.00 | 15.67 | 8.77 | 1.00 | 8.00 | 16.00 | 23.00 | 31.00 | ▇▇▇▇▆ |
hour | 0 | 1.00 | 11.50 | 6.92 | 0.00 | 6.00 | 12.00 | 18.00 | 23.00 | ▇▇▆▇▇ |
temp | 34395 | 0.01 | 45.75 | 18.05 | 10.90 | 30.90 | 37.90 | 64.00 | 82.90 | ▂▇▃▃▃ |
dewp | 34396 | 0.01 | 36.89 | 18.27 | 5.00 | 21.90 | 30.00 | 51.10 | 73.90 | ▅▇▂▅▃ |
humid | 34397 | 0.01 | 72.73 | 15.76 | 31.52 | 61.59 | 72.07 | 87.06 | 100.00 | ▂▃▇▆▇ |
wind_dir | 952 | 0.97 | 181.80 | 107.41 | 0.00 | 90.00 | 200.00 | 270.00 | 360.00 | ▇▃▇▇▆ |
wind_speed | 465 | 0.99 | 8.09 | 5.36 | 0.00 | 4.60 | 8.06 | 11.51 | 36.82 | ▇▆▂▁▁ |
wind_gust | 465 | 0.99 | 9.31 | 6.17 | 0.00 | 5.30 | 9.27 | 13.24 | 42.38 | ▇▆▂▁▁ |
precip | 33636 | 0.04 | 0.01 | 0.03 | 0.00 | 0.00 | 0.01 | 0.01 | 0.44 | ▇▁▁▁▁ |
pressure | 34629 | 0.01 | 1011.31 | 7.07 | 1000.10 | 1004.38 | 1011.15 | 1018.50 | 1024.00 | ▇▅▅▅▆ |
visib | 99 | 1.00 | 8.80 | 2.29 | 0.06 | 9.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
time_hour | 0 | 1 | 2021-01-01 | 2021-12-30 23:00:00 | 2021-07-01 23:00:00 | 8735 |
Combine the flights
and airlines
data
frames with left_join()
to create a new data set called
flightsCarriers
.
Specifying key values manually
Calculate the average flight distance for each carrier using the full name of the carriers. Who had the longest flights on average? Who had the shortest?
Combine the weather
and flights
data
frames with left_join()
to create a new data set called
weatherFlights.
How many rows does
weatherFlights
have? Hint: First fix an issue with the
time_hour
variable using the code below.
Fixing time_hour
variable in weather data set
# Fixing time_hour variable in weather data set
weather <- weather |>
mutate(time_hour = lubridate::make_datetime(year, month, day, hour))
# Left-join of weather and flights
weatherFlights <- weather |>
left_join(flights)
weatherFlights |> nrow()
## [1] 168159
Left joins
Full joins
Combine the weather
and flights
data
frames with full_join()
to create a new data set called
weatherFlightsFull
. How many rows does
weatherFlightsFull
have?
## [1] 168504
Inner joins
Finally, let’s do something fun with calendaR
Create a heatmap calendar with R using the calendR
package and the code below.
library(calendR)
# Calculating number of flights for each date
nFlights <- flights |>
dplyr::count(year, month, day) |>
arrange(year, month, day)
# Creating heatmap calendar of number of flights
flights_calendar <- calendR(year = 2021,
title = "Flights out of Major Michigan Airports, 2021",
special.days = nFlights$n,
gradient = TRUE,
monthnames = month.name,
months.size = 11,
weeknames.size = 3.3,
legend.pos = "bottom",
legend.title = "Number of flights",
ncol = 3,
margin = 1) +
scale_fill_gradient(low = "white", high = "dodgerblue",
guide = guide_colorbar(frame.colour = "black",
ticks.colour = "black",
title.position = "top")) +
theme(legend.title = )
# Saving plot
size_mult <- 22
ggsave(plot = flights_calendar,
filename = "calendar_plot.png",
units = "mm",
dpi = 120,
width = 7*size_mult,
height = 8.5*size_mult)
knitr::include_graphics("calendar_plot.png")