Joining Data with dplyr

In this activity we will explore more concepts of data cleaning and data wrangling with R using the dplyr package.

Learning Objectives

  • Combine multiple related data sets

  • Understand the concept of and identify keys as columns for joining tables of data

  • Understand and implement different types of joins

Load packages necessary for this activity

library(tidyverse)
library(lubridate)
library(knitr)
library(skimr)
library(calendR)
library(gt)
library(flextable)

Example data: Michigan flights

For this activity, we will be using the tidyverse package and several related data sets that contain information on flights departing from Michigan airports in 2021 obtained via the anyflights R package. This package facilitates obtaining air travel and weather data for airports across the United States.

Importing flights data

# Importing data
michiganFlights <- read_rds("fullMiFlights2021.rds")

# Use the following code to create explicit data tables in our global environment.
# Create tibbles in environment from list
list2env(michiganFlights, envir = .GlobalEnv)
## <environment: R_GlobalEnv>

Use the skim() function to explore characteristics of each of the tables of data.

The columns / variables used to link each pair of data tables are called keys. That is, a key is a column, or collection of columns, that uniquely identifies an observation.

skim(flights)
Data summary
Name flights
Number of rows 149445
Number of columns 19
_______________________
Column type frequency:
character 4
numeric 14
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
carrier 0 1 2 2 0 15 0
tailnum 117 1 5 6 0 4136 0
origin 0 1 3 3 0 4 0
dest 0 1 3 3 0 114 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2021.00 0.00 2021 2021 2021 2021 2021 ▁▁▇▁▁
month 0 1.00 6.77 3.35 1 4 7 10 12 ▆▅▆▆▇
day 0 1.00 15.74 8.78 1 8 16 23 31 ▇▇▇▇▆
dep_time 1382 0.99 1371.55 493.76 1 944 1353 1745 2400 ▁▇▇▇▆
sched_dep_time 0 1.00 1369.27 486.06 500 930 1355 1737 2327 ▇▇▆▇▆
dep_delay 1384 0.99 6.96 45.39 -34 -5 -3 1 1948 ▇▁▁▁▁
arr_time 1433 0.99 1464.41 517.39 1 1049 1455 1839 2400 ▁▅▇▇▆
sched_arr_time 0 1.00 1481.47 507.65 1 1100 1504 1840 2359 ▁▃▇▇▆
arr_delay 1715 0.99 -0.22 47.48 -79 -17 -9 1 1961 ▇▁▁▁▁
flight 0 1.00 371.09 221.12 1 176 372 548 927 ▇▇▇▆▁
air_time 1715 0.99 95.45 62.47 15 50 75 133 393 ▇▃▁▁▁
distance 0 1.00 654.32 488.30 74 296 501 983 2986 ▇▂▁▁▁
hour 0 1.00 13.42 4.83 5 9 13 17 23 ▇▇▅▇▆
minute 0 1.00 27.52 18.71 0 10 29 45 59 ▇▅▆▅▆

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
time_hour 0 1 2021-01-01 05:00:00 2021-12-31 22:00:00 2021-07-12 21:00:00 6452
skim(airports)
Data summary
Name airports
Number of rows 1251
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
faa 0 1.00 3 3 0 1251 0
name 0 1.00 9 60 0 1248 0
dst 48 0.96 1 1 0 4 0
tzone 119 0.90 12 19 0 9 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
lat 0 1.00 40.93 10.19 19.72 34.10 39.35 44.32 71.29 ▂▇▅▁▂
lon 0 1.00 -103.47 28.11 -176.65 -117.88 -95.91 -83.29 174.11 ▅▇▁▁▁
alt 0 1.00 1121.99 1602.52 -115.00 96.00 551.00 1226.00 9070.00 ▇▁▁▁▁
tz 48 0.96 -6.50 1.59 -10.00 -8.00 -6.00 -5.00 8.00 ▆▇▁▁▁
skim(airlines)
Data summary
Name airlines
Number of rows 15
Number of columns 2
_______________________
Column type frequency:
character 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
carrier 0 1 2 2 0 15 0
name 0 1 9 22 0 15 0
skim(planes)
Data summary
Name planes
Number of rows 3962
Number of columns 9
_______________________
Column type frequency:
character 5
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
tailnum 0 1 5 6 0 3962 0
type 0 1 23 23 0 1 0
manufacturer 0 1 6 29 0 11 0
model 0 1 5 15 0 78 0
engine 0 1 9 9 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 112 0.97 2008.85 7.24 1990 2003 2009 2015 2021 ▁▆▇▅▇
engines 0 1.00 2.00 0.05 2 2 2 2 3 ▇▁▁▁▁
seats 0 1.00 154.73 74.05 20 95 149 190 451 ▆▇▂▁▁
speed 0 1.00 0.00 0.00 0 0 0 0 0 ▁▁▇▁▁
skim(weather)
Data summary
Name weather
Number of rows 34897
Number of columns 15
_______________________
Column type frequency:
character 1
numeric 13
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
origin 0 1 3 3 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2021.00 0.00 2021.00 2021.00 2021.00 2021.00 2021.00 ▁▁▇▁▁
month 0 1.00 6.51 3.44 1.00 4.00 7.00 9.00 12.00 ▇▆▆▆▇
day 0 1.00 15.67 8.77 1.00 8.00 16.00 23.00 31.00 ▇▇▇▇▆
hour 0 1.00 11.50 6.92 0.00 6.00 12.00 18.00 23.00 ▇▇▆▇▇
temp 34395 0.01 45.75 18.05 10.90 30.90 37.90 64.00 82.90 ▂▇▃▃▃
dewp 34396 0.01 36.89 18.27 5.00 21.90 30.00 51.10 73.90 ▅▇▂▅▃
humid 34397 0.01 72.73 15.76 31.52 61.59 72.07 87.06 100.00 ▂▃▇▆▇
wind_dir 952 0.97 181.80 107.41 0.00 90.00 200.00 270.00 360.00 ▇▃▇▇▆
wind_speed 465 0.99 8.09 5.36 0.00 4.60 8.06 11.51 36.82 ▇▆▂▁▁
wind_gust 465 0.99 9.31 6.17 0.00 5.30 9.27 13.24 42.38 ▇▆▂▁▁
precip 33636 0.04 0.01 0.03 0.00 0.00 0.01 0.01 0.44 ▇▁▁▁▁
pressure 34629 0.01 1011.31 7.07 1000.10 1004.38 1011.15 1018.50 1024.00 ▇▅▅▅▆
visib 99 1.00 8.80 2.29 0.06 9.00 10.00 10.00 10.00 ▁▁▁▁▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
time_hour 0 1 2021-01-01 2021-12-30 23:00:00 2021-07-01 23:00:00 8735

Combine the flights and airlines data frames with left_join() to create a new data set called flightsCarriers.

# Left-joining tables
flightsCarriers <- flights |>
  left_join(airlines)

Specifying key values manually

# Left-joining tables
flightsCarriers <- flights |>
  left_join(airlines, by = c("carrier" = "carrier"))

Calculate the average flight distance for each carrier using the full name of the carriers. Who had the longest flights on average? Who had the shortest?

averageflightDistance <- flightsCarriers |>
  group_by(name) |>
  summarise(averageflightDistance = mean(distance)) |>
  arrange(desc(averageflightDistance)) |>
  flextable()

Combine the weather and flights data frames with left_join() to create a new data set called weatherFlights. How many rows does weatherFlights have? Hint: First fix an issue with the time_hour variable using the code below.

Fixing time_hour variable in weather data set

# Fixing time_hour variable in weather data set
weather <-  weather |> 
  mutate(time_hour = lubridate::make_datetime(year, month, day, hour))
# Left-join of weather and flights
weatherFlights <- weather |>
  left_join(flights)

weatherFlights |> nrow()
## [1] 168159

Left joins

Combine the flights and weather data frames with left_join() to create a new data set called flightsWeather. How many rows does flightsWeather have?

flightsWeather <- flights |>
  left_join(weather)

flightsWeather |> 
  nrow()
## [1] 149445

Full joins

Combine the weather and flights data frames with full_join() to create a new data set called weatherFlightsFull. How many rows does weatherFlightsFull have?

weatherFlightsFull <- weather |>
  full_join(flights)

weatherFlightsFull |>
  nrow()
## [1] 168504

Considering all of the data we have available, how many flights have missing wind speeds?

missingwindSpeeds <- flightsWeather |>
  filter(is.na(wind_speed))

missingwindSpeeds |>
  nrow()
## [1] 1526
## [1] 1526

Inner joins

Combine the weather and flights data frames with inner_join() to create a new data set called innerWeatherFlights. How many rows does innerWeatherFlights have?

innerWeatherFlights <- flights |>
  inner_join(weather)

innerWeatherFlights |> nrow()
## [1] 149100

Finally, let’s do something fun with calendaR

Create a heatmap calendar with R using the calendR package and the code below.

library(calendR)

# Calculating number of flights for each date
nFlights <- flights |> 
  dplyr::count(year, month, day) |> 
  arrange(year, month, day)

# Creating heatmap calendar of number of flights
flights_calendar <- calendR(year = 2021,
                            title = "Flights out of Major Michigan Airports, 2021",
        special.days = nFlights$n,
        gradient = TRUE,
        monthnames = month.name,
        months.size = 11,
        weeknames.size = 3.3,
        legend.pos = "bottom",
        legend.title = "Number of flights",
        ncol = 3,
        margin = 1) +
    scale_fill_gradient(low = "white", high = "dodgerblue",
  guide = guide_colorbar(frame.colour = "black", 
                                               ticks.colour = "black",
                         title.position = "top")) +
  theme(legend.title = )

# Saving plot
size_mult <- 22
ggsave(plot = flights_calendar,
       filename = "calendar_plot.png",
       units = "mm",
       dpi = 120,
       width = 7*size_mult,
       height = 8.5*size_mult)

knitr::include_graphics("calendar_plot.png")