Create a working directory, and download all three data files from Canvas (CVG_Flights.csv, airlines.csv, and airports.csv), put them in the working directory, and read these data files into R as three data frames.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
flights = read.csv("CVG_FLights.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
airlines = read.csv("airlines.xls.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
airports = read.csv("airports.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
flights.df <- data_frame(flights)
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
airlines.df <- data.frame(airlines)
airports.df <- data.frame(airports)
This output shows that I successfully read the three data files as data frames.
How many rows are there in each data frame?
dim(flights.df)[1]
## [1] 7763
dim(airlines.df)[1]
## [1] 14
dim(airports.df)[1]
## [1] 322
The output shows me that the rows for flights is 7763, airlines is 14 and airports is 322.
How many columns are there in each data frame?
dim(flights.df)[2]
## [1] 27
dim(airlines.df)[2]
## [1] 2
dim(airports.df)[2]
## [1] 7
This output shows me that the columns for flights is 27, airlines is 2, and airports is 7.
Merge all three data frames into one data frame according to the IATA code of airlines and airports. Note that, for airports.csv, please merge it according to the origina airport in CVG_Flights.csv.
first_df <- left_join(flights, airlines,by=c("AIRLINE"="IATA_CODE"))
totalflights_df <-left_join(first_df,airports,by=c("ORIGIN_AIRPORT" = "IATA_CODE"))
The output tells me that I have successfully merged the data frames into one data from according to the IATA code of the airlines and airports.
For this merged dataset, print the first six rows
head(totalflights_df,6)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1 1/1/2015 MQ 3355 CVG MIA
## 2 1/1/2015 MQ 3597 CVG DFW
## 3 1/1/2015 OO 4811 BNA CVG
## 4 1/1/2015 DL 2315 CVG ATL
## 5 1/1/2015 F9 1308 IAD CVG
## 6 1/1/2015 DL 1502 CVG TPA
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1 6:00:00 AM 5:56:00 AM -4 15 6:11:00 AM
## 2 6:00:00 AM <NA> NA NA <NA>
## 3 6:45:00 AM 6:37:00 AM -8 16 6:53:00 AM
## 4 7:10:00 AM 7:08:00 AM -2 36 7:44:00 AM
## 5 7:15:00 AM 7:05:00 AM -10 11 7:16:00 AM
## 6 7:20:00 AM 7:19:00 AM -1 9 7:28:00 AM
## SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 1 153 150 123 948 8:14:00 AM 12
## 2 165 NA NA 812 <NA> NA
## 3 72 66 45 230 8:38:00 AM 5
## 4 102 101 62 373 8:46:00 AM 3
## 5 95 91 75 388 8:31:00 AM 5
## 6 141 116 105 773 9:13:00 AM 2
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1 8:33:00 AM 8:26:00 AM -7 0 0
## 2 7:45:00 AM <NA> NA 0 1
## 3 8:57:00 AM 8:43:00 AM -14 0 0
## 4 8:52:00 AM 8:49:00 AM -3 0 0
## 5 8:50:00 AM 8:36:00 AM -14 0 0
## 6 9:41:00 AM 9:15:00 AM -26 0 0
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1 <NA> NA NA NA
## 2 B NA NA NA
## 3 <NA> NA NA NA
## 4 <NA> NA NA NA
## 5 <NA> NA NA NA
## 6 <NA> NA NA NA
## LATE_AIRCRAFT_DELAY WEATHER_DELAY AIRLINE.y
## 1 NA NA American Eagle Airlines Inc.
## 2 NA NA American Eagle Airlines Inc.
## 3 NA NA Skywest Airlines Inc.
## 4 NA NA Delta Air Lines Inc.
## 5 NA NA Frontier Airlines Inc.
## 6 NA NA Delta Air Lines Inc.
## AIRPORT CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 2 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 3 Nashville International Airport Nashville TN USA
## 4 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 5 Washington Dulles International Airport Chantilly VA USA
## 6 Cincinnati/Northern Kentucky International Airport Covington KY USA
## LATITUDE LONGITUDE
## 1 39.04614 -84.66217
## 2 39.04614 -84.66217
## 3 36.12448 -86.67818
## 4 39.04614 -84.66217
## 5 38.94453 -77.45581
## 6 39.04614 -84.66217
This output shows me that I successfully printed out the first six rows of the new data frame.
For this merged data set, are there any missing values? In what variables are these missing values?
colnames(totalflights_df)[colSums(is.na(totalflights_df))>0]
## [1] "DEPARTURE_TIME" "DEPARTURE_DELAY" "TAXI_OUT"
## [4] "WHEELS_OFF" "ELAPSED_TIME" "AIR_TIME"
## [7] "WHEELS_ON" "TAXI_IN" "ARRIVAL_TIME"
## [10] "ARRIVAL_DELAY" "CANCELLATION_REASON" "AIR_SYSTEM_DELAY"
## [13] "SECURITY_DELAY" "AIRLINE_DELAY" "LATE_AIRCRAFT_DELAY"
## [16] "WEATHER_DELAY"
This output tells me there are some missing values throught the merged data set. The variables that have missing values are “DEPARTURE_TIME” “DEPARTURE_DELAY” “TAXI_OUT”
“WHEELS_OFF” “ELAPSED_TIME” “AIR_TIME”
“WHEELS_ON” “TAXI_IN” “ARRIVAL_TIME”
“ARRIVAL_DELAY” “CANCELLATION_REASON” “AIR_SYSTEM_DELAY”
“SECURITY_DELAY” “AIRLINE_DELAY” “LATE_AIRCRAFT_DELAY” “WEATHER_DELAY”
What is the proportion of canceled flights (to all flights)? How many different cancellation reasons are there? What are they?
sum(is.na(totalflights_df$DEPARTURE_TIME)) / nrow(totalflights_df)
## [1] 0.04766199
totalflights_df %>% count(CANCELLATION_REASON, sort = TRUE)
## CANCELLATION_REASON n
## 1 <NA> 7385
## 2 B 180
## 3 C 127
## 4 A 71
This output tells me that the proportion of canceled flights to all flights is .04766199. It also shows that there are 3 different types of cancellation reasoning, and they are labeled as reason “a”, “b”, and reason “c”.
For DEPARTURE_TIME, are there missing values? Do we know why these values are missing?
totalflights_df %>% count(is.na(DEPARTURE_TIME))
## is.na(DEPARTURE_TIME) n
## 1 FALSE 7393
## 2 TRUE 370
The output shows that there are 370 missing values for DEPARTURE_TIME. These values are missing because the flight they were for were canceled, making the DEPARTURE_TIME of the flights have no value.
In the merged data frame, create a new variable (i.e., new column) as the time difference between the SCHEDULED_TIME and the ELAPSED_TIME, i.e., SCHEDULED_TIME - ELAPSED_TIME. Print the first six elements of the new variable.
totalflights_df$TIME_DIFFERENCE <- totalflights_df$SCHEDULED_TIME - totalflights_df$ELAPSED_TIME
head(totalflights_df$TIME_DIFFERENCE,6)
## [1] 3 NA 6 1 4 25
The output shows that I successfully created the new variable TIME_DIFFERENCE. It also shows the first six elements fo the variable, shown as 3, NA, 6, 1, 4, and 25.
Extract the observations (i.e., rows) with AIRLINE of Delta and ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport, and put these observations into a new data frame. Print the first six flight numbers of the new data frame.
delta_df <- totalflights_df[which(totalflights_df$AIRLINE =="DL" & totalflights_df$ORIGIN_AIRPORT == "CVG"),c (1:35)]
The output shows that I have successfully extracted the rows containing AIRLINE of Delta into a new data fram named delta_df, and printed the first six flight numbers for delta_df.
Extract the observations (i.e., rows) with AIRLINE of Delta, ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport, and DEPARTURE_DELAY time larger than 30 minutes, and put these observations into a new data frame. Print the first six flight numbers of the new data frame.
delta_delay_df <- filter(delta_df, DEPARTURE_DELAY >30)
head(delta_delay_df, 6)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1 1/3/2015 DL 1741 CVG SLC
## 2 1/3/2015 DL 1833 CVG LAX
## 3 1/4/2015 DL 1854 CVG RSW
## 4 1/9/2015 DL 1503 CVG MCO
## 5 1/9/2015 DL 63 CVG BOS
## 6 1/10/2015 DL 1934 CVG ATL
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1 8:25:00 AM 11:41:00 AM 196 11 11:52:00 AM
## 2 4:46:00 PM 5:43:00 PM 57 7 5:50:00 PM
## 3 9:35:00 AM 10:20:00 AM 45 15 10:35:00 AM
## 4 1:55:00 PM 2:39:00 PM 44 15 2:54:00 PM
## 5 4:15:00 PM 5:32:00 PM 77 9 5:41:00 PM
## 6 2:19:00 PM 3:58:00 PM 99 8 4:06:00 PM
## SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 1 245 222 208 1450 1:20:00 AM 3
## 2 294 269 250 1900 1:00:00 AM 12
## 3 153 156 139 879 1:54:00 AM 2
## 4 134 131 109 757 1:43:00 AM 7
## 5 121 104 90 752 1:11:00 AM 5
## 6 99 74 60 373 1:06:00 AM 6
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1 10:30:00 AM 1:23:00 PM 173 0 0
## 2 6:40:00 PM 7:12:00 PM 32 0 0
## 3 12:08:00 PM 12:56:00 PM 48 0 0
## 4 4:09:00 PM 4:50:00 PM 41 0 0
## 5 6:16:00 PM 7:16:00 PM 60 0 0
## 6 3:58:00 PM 5:12:00 PM 74 0 0
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1 <NA> 0 0 173
## 2 <NA> 0 0 9
## 3 <NA> 48 0 0
## 4 <NA> 0 0 41
## 5 <NA> 0 0 0
## 6 <NA> 0 0 74
## LATE_AIRCRAFT_DELAY WEATHER_DELAY AIRLINE.y
## 1 0 0 Delta Air Lines Inc.
## 2 23 0 Delta Air Lines Inc.
## 3 0 0 Delta Air Lines Inc.
## 4 0 0 Delta Air Lines Inc.
## 5 60 0 Delta Air Lines Inc.
## 6 0 0 Delta Air Lines Inc.
## AIRPORT CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 2 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 3 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 4 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 5 Cincinnati/Northern Kentucky International Airport Covington KY USA
## 6 Cincinnati/Northern Kentucky International Airport Covington KY USA
## LATITUDE LONGITUDE TIME_DIFFERENCE
## 1 39.04614 -84.66217 23
## 2 39.04614 -84.66217 25
## 3 39.04614 -84.66217 -3
## 4 39.04614 -84.66217 3
## 5 39.04614 -84.66217 17
## 6 39.04614 -84.66217 25
The output shows that I successfully created the new data frame with departure delay times larger than 30 minutes. I also successfully printed out the first 6 flight numbers with a delay greater than 30 minutes.
Use group_by() and summarize() to compute the average departure delay time for different airlines. Which airline has the longest and shortest average department delay?
by_airlines <- group_by(totalflights_df,AIRLINE.y)
summarize(by_airlines, mean_delay_time =mean(DEPARTURE_DELAY, na.rm = TRUE))
## # A tibble: 5 x 2
## AIRLINE.y mean_delay_time
## * <chr> <dbl>
## 1 American Eagle Airlines Inc. 19.1
## 2 Atlantic Southeast Airlines 10.4
## 3 Delta Air Lines Inc. 6.09
## 4 Frontier Airlines Inc. 23.9
## 5 Skywest Airlines Inc. 11.1
The output shows that I successfully computed the average delay times of the airlines. The airline with the longest average delay time is Frontier Airlines Inc. with an average of 23.9 minutes. The airline with the shortest average delay is Delta Air LInes Inc. with and an average delay of only 6.09 minutes.
Use group_by() and summarize() to compute the average departure delay time for different ORIGIN_AIRPORT. Which ORIGIN_AIRPORT has the longest and shortest average department delay?
by_airport <- group_by(totalflights_df, ORIGIN_AIRPORT)
summarize(by_airport, mean_delay_time =mean(DEPARTURE_DELAY, na.rm = TRUE))
## # A tibble: 38 x 2
## ORIGIN_AIRPORT mean_delay_time
## * <chr> <dbl>
## 1 ATL 7.12
## 2 BDL -4
## 3 BNA 13.3
## 4 BOS 21.1
## 5 CLT 10.0
## 6 CVG 11.6
## 7 DEN 15.1
## 8 DFW 16.9
## 9 DTW 21.3
## 10 EWR 7.29
## # ... with 28 more rows
The output shows that I have successfully computed the average delay time for each airport. TTN has the longest delay time at an average of 74.333 minutes behind schedule. STL has the shortest average delay time at 4.414 minutes ahead of schedule.