Question 1

Create a working directory, and download all three data files from Canvas (CVG_Flights.csv, airlines.csv, and airports.csv), put them in the working directory, and read these data files into R as three data frames.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
flights = read.csv("CVG_FLights.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
airlines = read.csv("airlines.xls.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
airports = read.csv("airports.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
flights.df <- data_frame(flights)
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
airlines.df <- data.frame(airlines)
airports.df <- data.frame(airports)

This output shows that I successfully read the three data files as data frames.

Question 2

How many rows are there in each data frame?

dim(flights.df)[1]
## [1] 7763
dim(airlines.df)[1]
## [1] 14
dim(airports.df)[1]
## [1] 322

The output shows me that the rows for flights is 7763, airlines is 14 and airports is 322.

Question 3

How many columns are there in each data frame?

dim(flights.df)[2]
## [1] 27
dim(airlines.df)[2]
## [1] 2
dim(airports.df)[2]
## [1] 7

This output shows me that the columns for flights is 27, airlines is 2, and airports is 7.

Question 4

Merge all three data frames into one data frame according to the IATA code of airlines and airports. Note that, for airports.csv, please merge it according to the origina airport in CVG_Flights.csv.

first_df <- left_join(flights, airlines,by=c("AIRLINE"="IATA_CODE"))
totalflights_df <-left_join(first_df,airports,by=c("ORIGIN_AIRPORT" = "IATA_CODE"))

The output tells me that I have successfully merged the data frames into one data from according to the IATA code of the airlines and airports.

Question 5

For this merged dataset, print the first six rows

head(totalflights_df,6)
##   FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1    1/1/2015      MQ          3355            CVG                 MIA
## 2    1/1/2015      MQ          3597            CVG                 DFW
## 3    1/1/2015      OO          4811            BNA                 CVG
## 4    1/1/2015      DL          2315            CVG                 ATL
## 5    1/1/2015      F9          1308            IAD                 CVG
## 6    1/1/2015      DL          1502            CVG                 TPA
##   SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1          6:00:00 AM     5:56:00 AM              -4       15 6:11:00 AM
## 2          6:00:00 AM           <NA>              NA       NA       <NA>
## 3          6:45:00 AM     6:37:00 AM              -8       16 6:53:00 AM
## 4          7:10:00 AM     7:08:00 AM              -2       36 7:44:00 AM
## 5          7:15:00 AM     7:05:00 AM             -10       11 7:16:00 AM
## 6          7:20:00 AM     7:19:00 AM              -1        9 7:28:00 AM
##   SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE  WHEELS_ON TAXI_IN
## 1            153          150      123      948 8:14:00 AM      12
## 2            165           NA       NA      812       <NA>      NA
## 3             72           66       45      230 8:38:00 AM       5
## 4            102          101       62      373 8:46:00 AM       3
## 5             95           91       75      388 8:31:00 AM       5
## 6            141          116      105      773 9:13:00 AM       2
##   SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1        8:33:00 AM   8:26:00 AM            -7        0         0
## 2        7:45:00 AM         <NA>            NA        0         1
## 3        8:57:00 AM   8:43:00 AM           -14        0         0
## 4        8:52:00 AM   8:49:00 AM            -3        0         0
## 5        8:50:00 AM   8:36:00 AM           -14        0         0
## 6        9:41:00 AM   9:15:00 AM           -26        0         0
##   CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1                <NA>               NA             NA            NA
## 2                   B               NA             NA            NA
## 3                <NA>               NA             NA            NA
## 4                <NA>               NA             NA            NA
## 5                <NA>               NA             NA            NA
## 6                <NA>               NA             NA            NA
##   LATE_AIRCRAFT_DELAY WEATHER_DELAY                    AIRLINE.y
## 1                  NA            NA American Eagle Airlines Inc.
## 2                  NA            NA American Eagle Airlines Inc.
## 3                  NA            NA        Skywest Airlines Inc.
## 4                  NA            NA         Delta Air Lines Inc.
## 5                  NA            NA       Frontier Airlines Inc.
## 6                  NA            NA         Delta Air Lines Inc.
##                                              AIRPORT      CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 2 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 3                    Nashville International Airport Nashville    TN     USA
## 4 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 5            Washington Dulles International Airport Chantilly    VA     USA
## 6 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
##   LATITUDE LONGITUDE
## 1 39.04614 -84.66217
## 2 39.04614 -84.66217
## 3 36.12448 -86.67818
## 4 39.04614 -84.66217
## 5 38.94453 -77.45581
## 6 39.04614 -84.66217

This output shows me that I successfully printed out the first six rows of the new data frame.

Question 6

For this merged data set, are there any missing values? In what variables are these missing values?

colnames(totalflights_df)[colSums(is.na(totalflights_df))>0]
##  [1] "DEPARTURE_TIME"      "DEPARTURE_DELAY"     "TAXI_OUT"           
##  [4] "WHEELS_OFF"          "ELAPSED_TIME"        "AIR_TIME"           
##  [7] "WHEELS_ON"           "TAXI_IN"             "ARRIVAL_TIME"       
## [10] "ARRIVAL_DELAY"       "CANCELLATION_REASON" "AIR_SYSTEM_DELAY"   
## [13] "SECURITY_DELAY"      "AIRLINE_DELAY"       "LATE_AIRCRAFT_DELAY"
## [16] "WEATHER_DELAY"

This output tells me there are some missing values throught the merged data set. The variables that have missing values are “DEPARTURE_TIME” “DEPARTURE_DELAY” “TAXI_OUT”
“WHEELS_OFF” “ELAPSED_TIME” “AIR_TIME”
“WHEELS_ON” “TAXI_IN” “ARRIVAL_TIME”
“ARRIVAL_DELAY” “CANCELLATION_REASON” “AIR_SYSTEM_DELAY”
“SECURITY_DELAY” “AIRLINE_DELAY” “LATE_AIRCRAFT_DELAY” “WEATHER_DELAY”

Question 7

What is the proportion of canceled flights (to all flights)? How many different cancellation reasons are there? What are they?

sum(is.na(totalflights_df$DEPARTURE_TIME)) / nrow(totalflights_df)
## [1] 0.04766199
totalflights_df %>% count(CANCELLATION_REASON, sort = TRUE)
##   CANCELLATION_REASON    n
## 1                <NA> 7385
## 2                   B  180
## 3                   C  127
## 4                   A   71

This output tells me that the proportion of canceled flights to all flights is .04766199. It also shows that there are 3 different types of cancellation reasoning, and they are labeled as reason “a”, “b”, and reason “c”.

Question 8

For DEPARTURE_TIME, are there missing values? Do we know why these values are missing?

totalflights_df %>% count(is.na(DEPARTURE_TIME))
##   is.na(DEPARTURE_TIME)    n
## 1                 FALSE 7393
## 2                  TRUE  370

The output shows that there are 370 missing values for DEPARTURE_TIME. These values are missing because the flight they were for were canceled, making the DEPARTURE_TIME of the flights have no value.

Question 9

In the merged data frame, create a new variable (i.e., new column) as the time difference between the SCHEDULED_TIME and the ELAPSED_TIME, i.e., SCHEDULED_TIME - ELAPSED_TIME. Print the first six elements of the new variable.

totalflights_df$TIME_DIFFERENCE <- totalflights_df$SCHEDULED_TIME - totalflights_df$ELAPSED_TIME
head(totalflights_df$TIME_DIFFERENCE,6)
## [1]  3 NA  6  1  4 25

The output shows that I successfully created the new variable TIME_DIFFERENCE. It also shows the first six elements fo the variable, shown as 3, NA, 6, 1, 4, and 25.

Question 10

Extract the observations (i.e., rows) with AIRLINE of Delta and ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport, and put these observations into a new data frame. Print the first six flight numbers of the new data frame.

delta_df <- totalflights_df[which(totalflights_df$AIRLINE =="DL" & totalflights_df$ORIGIN_AIRPORT == "CVG"),c (1:35)]

The output shows that I have successfully extracted the rows containing AIRLINE of Delta into a new data fram named delta_df, and printed the first six flight numbers for delta_df.

Question 11

Extract the observations (i.e., rows) with AIRLINE of Delta, ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport, and DEPARTURE_DELAY time larger than 30 minutes, and put these observations into a new data frame. Print the first six flight numbers of the new data frame.

delta_delay_df <- filter(delta_df, DEPARTURE_DELAY >30)
head(delta_delay_df, 6)
##   FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1    1/3/2015      DL          1741            CVG                 SLC
## 2    1/3/2015      DL          1833            CVG                 LAX
## 3    1/4/2015      DL          1854            CVG                 RSW
## 4    1/9/2015      DL          1503            CVG                 MCO
## 5    1/9/2015      DL            63            CVG                 BOS
## 6   1/10/2015      DL          1934            CVG                 ATL
##   SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT  WHEELS_OFF
## 1          8:25:00 AM    11:41:00 AM             196       11 11:52:00 AM
## 2          4:46:00 PM     5:43:00 PM              57        7  5:50:00 PM
## 3          9:35:00 AM    10:20:00 AM              45       15 10:35:00 AM
## 4          1:55:00 PM     2:39:00 PM              44       15  2:54:00 PM
## 5          4:15:00 PM     5:32:00 PM              77        9  5:41:00 PM
## 6          2:19:00 PM     3:58:00 PM              99        8  4:06:00 PM
##   SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE  WHEELS_ON TAXI_IN
## 1            245          222      208     1450 1:20:00 AM       3
## 2            294          269      250     1900 1:00:00 AM      12
## 3            153          156      139      879 1:54:00 AM       2
## 4            134          131      109      757 1:43:00 AM       7
## 5            121          104       90      752 1:11:00 AM       5
## 6             99           74       60      373 1:06:00 AM       6
##   SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1       10:30:00 AM   1:23:00 PM           173        0         0
## 2        6:40:00 PM   7:12:00 PM            32        0         0
## 3       12:08:00 PM  12:56:00 PM            48        0         0
## 4        4:09:00 PM   4:50:00 PM            41        0         0
## 5        6:16:00 PM   7:16:00 PM            60        0         0
## 6        3:58:00 PM   5:12:00 PM            74        0         0
##   CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1                <NA>                0              0           173
## 2                <NA>                0              0             9
## 3                <NA>               48              0             0
## 4                <NA>                0              0            41
## 5                <NA>                0              0             0
## 6                <NA>                0              0            74
##   LATE_AIRCRAFT_DELAY WEATHER_DELAY            AIRLINE.y
## 1                   0             0 Delta Air Lines Inc.
## 2                  23             0 Delta Air Lines Inc.
## 3                   0             0 Delta Air Lines Inc.
## 4                   0             0 Delta Air Lines Inc.
## 5                  60             0 Delta Air Lines Inc.
## 6                   0             0 Delta Air Lines Inc.
##                                              AIRPORT      CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 2 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 3 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 4 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 5 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 6 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
##   LATITUDE LONGITUDE TIME_DIFFERENCE
## 1 39.04614 -84.66217              23
## 2 39.04614 -84.66217              25
## 3 39.04614 -84.66217              -3
## 4 39.04614 -84.66217               3
## 5 39.04614 -84.66217              17
## 6 39.04614 -84.66217              25

The output shows that I successfully created the new data frame with departure delay times larger than 30 minutes. I also successfully printed out the first 6 flight numbers with a delay greater than 30 minutes.

Question 12

Use group_by() and summarize() to compute the average departure delay time for different airlines. Which airline has the longest and shortest average department delay?

by_airlines <- group_by(totalflights_df,AIRLINE.y)
summarize(by_airlines, mean_delay_time =mean(DEPARTURE_DELAY, na.rm = TRUE))
## # A tibble: 5 x 2
##   AIRLINE.y                    mean_delay_time
## * <chr>                                  <dbl>
## 1 American Eagle Airlines Inc.           19.1 
## 2 Atlantic Southeast Airlines            10.4 
## 3 Delta Air Lines Inc.                    6.09
## 4 Frontier Airlines Inc.                 23.9 
## 5 Skywest Airlines Inc.                  11.1

The output shows that I successfully computed the average delay times of the airlines. The airline with the longest average delay time is Frontier Airlines Inc. with an average of 23.9 minutes. The airline with the shortest average delay is Delta Air LInes Inc. with and an average delay of only 6.09 minutes.

Question 13

Use group_by() and summarize() to compute the average departure delay time for different ORIGIN_AIRPORT. Which ORIGIN_AIRPORT has the longest and shortest average department delay?

by_airport <- group_by(totalflights_df, ORIGIN_AIRPORT)
summarize(by_airport, mean_delay_time =mean(DEPARTURE_DELAY, na.rm = TRUE))
## # A tibble: 38 x 2
##    ORIGIN_AIRPORT mean_delay_time
##  * <chr>                    <dbl>
##  1 ATL                       7.12
##  2 BDL                      -4   
##  3 BNA                      13.3 
##  4 BOS                      21.1 
##  5 CLT                      10.0 
##  6 CVG                      11.6 
##  7 DEN                      15.1 
##  8 DFW                      16.9 
##  9 DTW                      21.3 
## 10 EWR                       7.29
## # ... with 28 more rows

The output shows that I have successfully computed the average delay time for each airport. TTN has the longest delay time at an average of 74.333 minutes behind schedule. STL has the shortest average delay time at 4.414 minutes ahead of schedule.