Installing packages
library(tidyverse)
library(lubridate)
library(knitr)
library(skimr)Importing data
FullmichiganFlights<-readRDS('fullMiFlights2021.rds')creating an object called michiganFlights.
michiganFlights<-readRDS('fullMiFlights2021.rds')Use the View() function to explore the object michiganFlights. What type of object is michiganFlights? What is its length?
#View(michiganFlights)michiganFlights is a list type with 5 variables
Unnest the list
# Create tibbles in environment from list
list2env(michiganFlights, envir = .GlobalEnv)## <environment: R_GlobalEnv>
Exploratory data analysis
glimpse(airlines)
glimpse(airports)
glimpse((flights))
glimpse(planes)
glimpse(weather)
skim(weather)
skim(flights)The flights and planes data set connect through which variable(s)?
Te flights and planes are connected by tailnum and year
The flights and airports data set connect through which variable(s)?
They are connected by origin and destination variables
The flights and weather data set connect through which variable(s)?
Connect through year, month, day, hour and origin
Suppose we wanted to draw (approximately) the route each plane flies from its origin to its destination. Which variables would we need? Which tables would we need to combine?
we’ll need the flights(origin and destination) and airports(latitude and longitudes) we will combine flights and airports table
Now suppose we wanted to explore typical weather patterns for departing flights at different airports and explore the weather’s relationship with departure delays. Considering the wind speeds and amount of precipitation, which variables would we need for this? Which tables would we need to combine?
weather data set(precipitation, wind speed, wind direction year, origin, month, day, and hour) We also need to know flights data set(dep_time, delay, origin, year, month and hour)
Outer Joins
#Combine the flights and airlines data frames with left_join() to create a new data set called flightsCarriers
flightsCarriers<-flights %>% left_join(airlines)#Calculate the average flight distance for each carrier using the full name of the carriers. Who had the longest flights on average? Who had the shortest?
flightsCarriers %>% group_by(name) %>%
summarise(AvgDistance = mean(distance, na.rm=TRUE)) %>%
arrange(AvgDistance) %>% knitr::kable()| name | AvgDistance |
|---|---|
| Endeavor Air Inc. | 327.8897 |
| Envoy Air | 362.2711 |
| Republic Airline | 384.4040 |
| SkyWest Airlines Inc. | 428.0904 |
| PSA Airlines Inc. | 510.3483 |
| JetBlue Airways | 583.6000 |
| Southwest Airlines Co. | 600.4704 |
| United Air Lines Inc. | 678.6150 |
| Mesa Airlines Inc. | 763.6516 |
| American Airlines Inc. | 912.3228 |
| Delta Air Lines Inc. | 959.6660 |
| Spirit Air Lines | 1113.1827 |
| Allegiant Air | 1117.4148 |
| Frontier Airlines Inc. | 1177.0890 |
| Alaska Airlines Inc. | 1927.0000 |
Who had the longest flights on average? Alaska Airlines Inc
Who had the shortest? Endeavor Air Inc.
# Combine the weather and flights data frames with left_join() to create a new data set called weatherFlights. How many rows does weatherFlights have?
weatherFlights <-weather %>% left_join(flights, by = c("origin", "year", "month", "day", "hour"))
glimpse(weatherFlights)## Rows: 168,159
## Columns: 29
## $ origin <chr> "DTW", "DTW", "DTW", "DTW", "DTW", "DTW", "DTW", "DTW",…
## $ year <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hour <dbl> 0, 1, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7…
## $ temp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ dewp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ humid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ wind_dir <dbl> 210, 0, 0, 0, 0, 0, 50, 50, 50, 50, 50, 50, 50, 90, 90,…
## $ wind_speed <dbl> 5.75390, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 2…
## $ wind_gust <dbl> 6.621473, 0.000000, 0.000000, 0.000000, 0.000000, 0.000…
## $ precip <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pressure <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ visib <dbl> 8.0, 7.0, 6.0, 5.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, …
## $ time_hour.x <dttm> 2021-01-01 00:00:00, 2021-01-01 01:00:00, 2021-01-01 0…
## $ dep_time <int> NA, NA, NA, NA, NA, 536, 558, 606, 611, 611, 624, NA, N…
## $ sched_dep_time <int> NA, NA, NA, NA, NA, 539, 600, 600, 615, 616, 615, 650, …
## $ dep_delay <dbl> NA, NA, NA, NA, NA, -3, -2, 6, -4, -5, 9, NA, NA, -5, -…
## $ arr_time <int> NA, NA, NA, NA, NA, 738, 700, 905, 809, 804, 806, NA, N…
## $ sched_arr_time <int> NA, NA, NA, NA, NA, 825, 730, 920, 822, 826, 800, 939, …
## $ arr_delay <dbl> NA, NA, NA, NA, NA, -47, -30, -15, -13, -22, 6, NA, NA,…
## $ carrier <chr> NA, NA, NA, NA, NA, "AA", "NK", "NK", "DL", "YX", "WN",…
## $ flight <int> NA, NA, NA, NA, NA, 90, 5, 21, 120, 491, 284, 76, 30, 3…
## $ tailnum <chr> NA, NA, NA, NA, NA, "N750UW", "N653NK", "N675NK", "N827…
## $ dest <chr> NA, NA, NA, NA, NA, "PHX", "LAS", "FLL", "ATL", "CLT", …
## $ air_time <dbl> NA, NA, NA, NA, NA, 227, 220, 162, 95, 87, 61, NA, NA, …
## $ distance <dbl> NA, NA, NA, NA, NA, 1671, 1749, 1127, 594, 500, 409, 95…
## $ minute <dbl> NA, NA, NA, NA, NA, 39, 0, 0, 15, 16, 15, 50, 43, 0, 18…
## $ time_hour.y <dttm> NA, NA, NA, NA, NA, 2021-01-01 05:00:00, 2021-01-01 06…
# How many rows does weatherFlights have?
nrow(weatherFlights)## [1] 168159
#Combine the flights and weather data frames with left_join() to create a new data set called flightsWeather. How many rows does flightsWeather have?
flightWeather <- flights%>%left_join(weather,
by = c("origin", "year", "month", "day", "hour"))
# How many rows does flightsWeather have? 149445 observations
nrow(flightWeather)## [1] 149445
#Combine the weather and flights data frames with full_join() to create a new data set called weatherFlightsFull. How many rows does weatherFlightsFull have?
weatherFlightsFull <- flights%>%full_join(weather,
by = c("origin", "year", "month", "day", "hour"))
# How many rows does flightsWeatherFull have?
nrow(weatherFlightsFull)## [1] 168504
Flights with missing wind speeds
flightWeather %>% pull(wind_speed) %>% is.na()%>%
sum()## [1] 1526
There are 1526 flights with missing wind speed
Inner Joins
#Combine the weather and flights data frames with inner_join() to create a new data set called innerWeatherFlights. How many rows does innerWeatherFlights have?
InnerWeatherFlights <- flights%>%inner_join(weather,
by = c("origin", "year", "month", "day", "hour"))
#How many rows does innerWeatherFlights have?
nrow(InnerWeatherFlights)## [1] 149100
#What does the number of rows in innerWeatherFlights and weatherFlights tell us about the data set?
nrow(InnerWeatherFlights)## [1] 149100
nrow(weatherFlights)## [1] 168159
nrow(weatherFlights)-nrow(InnerWeatherFlights)## [1] 19059
some airports had hourly data but no flights depart at that hour
#Convert the implicit missing values in the innerWeatherFlights data set to explicit missing values using the complete() function creating a data set called completeWeatherFlights
InnerWeatherFlights1 <- flights%>%inner_join(weather,
by = c("origin", "year", "month", "day", "hour", "time_hour"= "time_hour"))
completeWeatherFlights<-InnerWeatherFlights1 %>% complete(flights)
#Which data set is the same as completeWeatherFlights?
completeWeatherFlights %>% nrow()## [1] 149445
flights %>% nrow()## [1] 149445
flightWeather %>% nrow()## [1] 149445
flightsCarriers %>% nrow()## [1] 149445
Which data set is the same as completeWeatherFlights? completeWeatherFlights have same number of observations as flights, flightWeather and flightCarriers.
#Suppose we could assume that all precipitation values that are missing in the completeWeatherFlights are truly 0 (this may or may not be a reasonable assumption in real life - further investigation is certainly required!). Use the replace_na() function in tandem with the mutate() function to replace all missing values for the precipitation variable to be 0.
completeWeatherFlights %>% mutate(precip=replace_na(0))## # A tibble: 149,445 × 28
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2021 1 1 536 539 -3 738 825 -47 AA
## 2 2021 1 1 557 600 -3 758 748 10 DL
## 3 2021 1 1 558 600 -2 700 730 -30 NK
## 4 2021 1 1 600 607 -7 820 831 -11 OH
## 5 2021 1 1 606 600 6 905 920 -15 NK
## 6 2021 1 1 610 615 -5 809 832 -23 OH
## 7 2021 1 1 611 615 -4 809 822 -13 DL
## 8 2021 1 1 611 616 -5 804 826 -22 YX
## 9 2021 1 1 624 615 9 806 800 6 WN
## 10 2021 1 1 624 630 -6 711 723 -12 DL
## # … with 149,435 more rows, 18 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>, humid <dbl>,
## # wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
## # pressure <dbl>, visib <dbl>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay