Installing packages

library(tidyverse)
library(lubridate)
library(knitr)
library(skimr)

Importing data

FullmichiganFlights<-readRDS('fullMiFlights2021.rds')

creating an object called michiganFlights.

michiganFlights<-readRDS('fullMiFlights2021.rds')

Use the View() function to explore the object michiganFlights. What type of object is michiganFlights? What is its length?

#View(michiganFlights)

michiganFlights is a list type with 5 variables

Unnest the list

# Create tibbles in environment from list
list2env(michiganFlights, envir = .GlobalEnv)
## <environment: R_GlobalEnv>

Exploratory data analysis

glimpse(airlines)
glimpse(airports)
glimpse((flights))
glimpse(planes)
glimpse(weather)

skim(weather)
skim(flights)

The flights and planes data set connect through which variable(s)?

Te flights and planes are connected by tailnum and year

The flights and airports data set connect through which variable(s)?

They are connected by origin and destination variables

The flights and weather data set connect through which variable(s)?

Connect through year, month, day, hour and origin

Suppose we wanted to draw (approximately) the route each plane flies from its origin to its destination. Which variables would we need? Which tables would we need to combine?

we’ll need the flights(origin and destination) and airports(latitude and longitudes) we will combine flights and airports table

Now suppose we wanted to explore typical weather patterns for departing flights at different airports and explore the weather’s relationship with departure delays. Considering the wind speeds and amount of precipitation, which variables would we need for this? Which tables would we need to combine?

weather data set(precipitation, wind speed, wind direction year, origin, month, day, and hour) We also need to know flights data set(dep_time, delay, origin, year, month and hour)

Outer Joins

#Combine the flights and airlines  data frames with left_join() to create a new data set called flightsCarriers
flightsCarriers<-flights %>% left_join(airlines)
#Calculate the average flight distance for each carrier using the full name of the carriers. Who had the longest flights on average? Who had the shortest?
flightsCarriers %>% group_by(name) %>% 
  summarise(AvgDistance = mean(distance, na.rm=TRUE)) %>% 
  arrange(AvgDistance) %>% knitr::kable()
name AvgDistance
Endeavor Air Inc. 327.8897
Envoy Air 362.2711
Republic Airline 384.4040
SkyWest Airlines Inc. 428.0904
PSA Airlines Inc. 510.3483
JetBlue Airways 583.6000
Southwest Airlines Co. 600.4704
United Air Lines Inc. 678.6150
Mesa Airlines Inc. 763.6516
American Airlines Inc. 912.3228
Delta Air Lines Inc. 959.6660
Spirit Air Lines 1113.1827
Allegiant Air 1117.4148
Frontier Airlines Inc. 1177.0890
Alaska Airlines Inc. 1927.0000

Who had the longest flights on average? Alaska Airlines Inc

Who had the shortest? Endeavor Air Inc.

# Combine the weather and flights data frames with left_join() to create a new data set called weatherFlights. How many rows does weatherFlights have?
weatherFlights <-weather %>% left_join(flights, by = c("origin", "year", "month", "day", "hour"))
glimpse(weatherFlights)
## Rows: 168,159
## Columns: 29
## $ origin         <chr> "DTW", "DTW", "DTW", "DTW", "DTW", "DTW", "DTW", "DTW",…
## $ year           <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hour           <dbl> 0, 1, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7…
## $ temp           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ dewp           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ humid          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ wind_dir       <dbl> 210, 0, 0, 0, 0, 0, 50, 50, 50, 50, 50, 50, 50, 90, 90,…
## $ wind_speed     <dbl> 5.75390, 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 2…
## $ wind_gust      <dbl> 6.621473, 0.000000, 0.000000, 0.000000, 0.000000, 0.000…
## $ precip         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ pressure       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ visib          <dbl> 8.0, 7.0, 6.0, 5.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, …
## $ time_hour.x    <dttm> 2021-01-01 00:00:00, 2021-01-01 01:00:00, 2021-01-01 0…
## $ dep_time       <int> NA, NA, NA, NA, NA, 536, 558, 606, 611, 611, 624, NA, N…
## $ sched_dep_time <int> NA, NA, NA, NA, NA, 539, 600, 600, 615, 616, 615, 650, …
## $ dep_delay      <dbl> NA, NA, NA, NA, NA, -3, -2, 6, -4, -5, 9, NA, NA, -5, -…
## $ arr_time       <int> NA, NA, NA, NA, NA, 738, 700, 905, 809, 804, 806, NA, N…
## $ sched_arr_time <int> NA, NA, NA, NA, NA, 825, 730, 920, 822, 826, 800, 939, …
## $ arr_delay      <dbl> NA, NA, NA, NA, NA, -47, -30, -15, -13, -22, 6, NA, NA,…
## $ carrier        <chr> NA, NA, NA, NA, NA, "AA", "NK", "NK", "DL", "YX", "WN",…
## $ flight         <int> NA, NA, NA, NA, NA, 90, 5, 21, 120, 491, 284, 76, 30, 3…
## $ tailnum        <chr> NA, NA, NA, NA, NA, "N750UW", "N653NK", "N675NK", "N827…
## $ dest           <chr> NA, NA, NA, NA, NA, "PHX", "LAS", "FLL", "ATL", "CLT", …
## $ air_time       <dbl> NA, NA, NA, NA, NA, 227, 220, 162, 95, 87, 61, NA, NA, …
## $ distance       <dbl> NA, NA, NA, NA, NA, 1671, 1749, 1127, 594, 500, 409, 95…
## $ minute         <dbl> NA, NA, NA, NA, NA, 39, 0, 0, 15, 16, 15, 50, 43, 0, 18…
## $ time_hour.y    <dttm> NA, NA, NA, NA, NA, 2021-01-01 05:00:00, 2021-01-01 06…
# How many rows does weatherFlights have?
nrow(weatherFlights)
## [1] 168159
#Combine the flights and weather data frames with left_join() to create a new data set called flightsWeather. How many rows does flightsWeather have?
flightWeather <- flights%>%left_join(weather,
                                    by = c("origin", "year", "month", "day", "hour"))

# How many rows does flightsWeather have? 149445 observations
nrow(flightWeather)
## [1] 149445
#Combine the weather and flights data frames with full_join() to create a new data set called weatherFlightsFull. How many rows does weatherFlightsFull have?
weatherFlightsFull <- flights%>%full_join(weather,
                                    by = c("origin", "year", "month", "day", "hour"))

# How many rows does flightsWeatherFull have?

nrow(weatherFlightsFull)
## [1] 168504

Flights with missing wind speeds

flightWeather %>% pull(wind_speed) %>% is.na()%>%
  sum()
## [1] 1526

There are 1526 flights with missing wind speed

Inner Joins

#Combine the weather and flights data frames with inner_join() to create a new data set called innerWeatherFlights. How many rows does innerWeatherFlights have?
InnerWeatherFlights <- flights%>%inner_join(weather,
                                    by = c("origin", "year", "month", "day", "hour"))
#How many rows does innerWeatherFlights have? 
nrow(InnerWeatherFlights)
## [1] 149100
#What does the number of rows in innerWeatherFlights and weatherFlights tell us about the data set?

nrow(InnerWeatherFlights)
## [1] 149100
nrow(weatherFlights)
## [1] 168159
nrow(weatherFlights)-nrow(InnerWeatherFlights)
## [1] 19059

some airports had hourly data but no flights depart at that hour

#Convert the implicit missing values in the innerWeatherFlights data set to explicit missing values using the complete() function creating a data set called completeWeatherFlights

InnerWeatherFlights1 <- flights%>%inner_join(weather,
                                    by = c("origin", "year", "month", "day", "hour", "time_hour"= "time_hour"))
completeWeatherFlights<-InnerWeatherFlights1 %>% complete(flights)

#Which data set is the same as completeWeatherFlights?
completeWeatherFlights %>% nrow()
## [1] 149445
flights %>% nrow()
## [1] 149445
flightWeather %>% nrow()
## [1] 149445
flightsCarriers %>% nrow()
## [1] 149445

Which data set is the same as completeWeatherFlights? completeWeatherFlights have same number of observations as flights, flightWeather and flightCarriers.

#Suppose we could assume that all precipitation values that are missing in the completeWeatherFlights are truly 0 (this may or may not be a reasonable assumption in real life - further investigation is certainly required!).  Use the replace_na() function in tandem with the mutate() function to replace all missing values for the precipitation variable to be 0.

completeWeatherFlights %>% mutate(precip=replace_na(0))
## # A tibble: 149,445 × 28
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2021     1     1      536        539      -3     738     825     -47 AA     
##  2  2021     1     1      557        600      -3     758     748      10 DL     
##  3  2021     1     1      558        600      -2     700     730     -30 NK     
##  4  2021     1     1      600        607      -7     820     831     -11 OH     
##  5  2021     1     1      606        600       6     905     920     -15 NK     
##  6  2021     1     1      610        615      -5     809     832     -23 OH     
##  7  2021     1     1      611        615      -4     809     822     -13 DL     
##  8  2021     1     1      611        616      -5     804     826     -22 YX     
##  9  2021     1     1      624        615       9     806     800       6 WN     
## 10  2021     1     1      624        630      -6     711     723     -12 DL     
## # … with 149,435 more rows, 18 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>, humid <dbl>,
## #   wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay