Question 1

Create a working directory, and download all three data files from Canvas (CVG_Flights.csv, airlines.csv, and airports.csv) to the directory. Read these files into R as three data frames. Here is the code sample/template. flights = read.csv(“CVG_Flights.csv”, header = TRUE, na.strings = ““) Note that na.strings =”” turns blank “” to NA.

1.1 Import CVG_Flights.csv

flights = read.csv("CVG_Flights.csv", header = TRUE, na.strings = "")

1.2 Import airlines.csv

airlines = read.csv("airlines.csv", header = TRUE, na.strings = "")

1.3 Import airports.csv

airports = read.csv("airports.csv", header = TRUE,  na.strings = "")

Question 2

How many rows and columns are there in each data frame? What does each row represent (a plane, an airport, a flight, or an airline company)? What does each column represent? Explain the meanings of variables to the best of you understanding.

2.1 How many rows and columns are there in the flights data frame? What does each row represent (a plane, an airport, a flight, or an airline company)? What does each column represent?

glimpse(flights)
## Rows: 7,763
## Columns: 27
## $ FLIGHT_DATE         <chr> "1/1/2015", "1/1/2015", "1/1/2015", "1/1/2015", "1…
## $ AIRLINE             <chr> "MQ", "MQ", "OO", "DL", "F9", "DL", "MQ", "F9", "D…
## $ FLIGHT_NUMBER       <int> 3355, 3597, 4811, 2315, 1308, 1502, 3491, 1290, 23…
## $ ORIGIN_AIRPORT      <chr> "CVG", "CVG", "BNA", "CVG", "IAD", "CVG", "DFW", "…
## $ DESTINATION_AIRPORT <chr> "MIA", "DFW", "CVG", "ATL", "CVG", "TPA", "CVG", "…
## $ SCHEDULED_DEPARTURE <chr> "6:00:00 AM", "6:00:00 AM", "6:45:00 AM", "7:10:00…
## $ DEPARTURE_TIME      <chr> "5:56:00 AM", NA, "6:37:00 AM", "7:08:00 AM", "7:0…
## $ DEPARTURE_DELAY     <int> -4, NA, -8, -2, -10, -1, NA, -2, -9, 11, NA, -5, -…
## $ TAXI_OUT            <int> 15, NA, 16, 36, 11, 9, NA, 10, 27, 9, NA, 20, 18, …
## $ WHEELS_OFF          <chr> "6:11:00 AM", NA, "6:53:00 AM", "7:44:00 AM", "7:1…
## $ SCHEDULED_TIME      <int> 153, 165, 72, 102, 95, 141, 128, 150, 153, 127, 17…
## $ ELAPSED_TIME        <int> 150, NA, 66, 101, 91, 116, NA, 143, 150, 119, NA, …
## $ AIR_TIME            <int> 123, NA, 45, 62, 75, 105, NA, 126, 117, 103, NA, 2…
## $ DISTANCE            <int> 948, 812, 230, 373, 388, 773, 812, 932, 879, 532, …
## $ WHEELS_ON           <chr> "8:14:00 AM", NA, "8:38:00 AM", "8:46:00 AM", "8:3…
## $ TAXI_IN             <int> 12, NA, 5, 3, 5, 2, NA, 7, 6, 7, NA, 4, 5, 9, 10, …
## $ SCHEDULED_ARRIVAL   <chr> "8:33:00 AM", "7:45:00 AM", "8:57:00 AM", "8:52:00…
## $ ARRIVAL_TIME        <chr> "8:26:00 AM", NA, "8:43:00 AM", "8:49:00 AM", "8:3…
## $ ARRIVAL_DELAY       <int> -7, NA, -14, -3, -14, -26, NA, -9, -12, 3, NA, -11…
## $ DIVERTED            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CANCELLED           <int> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ CANCELLATION_REASON <chr> NA, "B", NA, NA, NA, NA, "B", NA, NA, NA, "B", NA,…
## $ AIR_SYSTEM_DELAY    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ SECURITY_DELAY      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ AIRLINE_DELAY       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ LATE_AIRCRAFT_DELAY <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ WEATHER_DELAY       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

There are 7,763 rows and 27 columns.

Each row represents an individual flight. There are 27 columns with different meanings. FLIGHT_DATE is the date of the flight. AIRLINE is what airline the flight is apart of. FLIGHT_NUMBER is the flight number. ORIGIN_AIRPORT is the airport where the flight took off from while DESTINATION_AIRPORT is the airport the flight is headed to.SCHEDULED_DEPARTURE is what time the flight is scheduled to leave the gate and the DEPARTURE_TIME is the time the flight actually left the gate. DEPARTURE_DELAY is how many minutes delayed the flight was departing off from the gate from their scheduled departure time as seen in SCHEDULED_DEPARTURE. TAXI_OUT is how long it took to be taxi’d to the runway. WHEELS_OFF is what time the flight lifted off from the runway. SCHEDULED_TIME is the expected duration of the flight. ELASPSED_TIME is the actual duration of the flight. AIR_TIME is the duration that the flight spent in the air. DISTANCE is how far the flight traveled. WHEELS_ON is what time the flight touched down at the destination airport. TAXI_IN is how long it took to taxi the flight to its gate. SCHEDULED_ARRIVAL is what time the flight was expected to arrive. ARRIVAL_TIME is what time the flight actually arrived. ARRIVAL_DELAY is the time the flight was delayed from being able to taxi to its gate upon landing. DIVERTED is for if the flight had to be diverted to another airport. CANCELLED is for if the flight was cancelled. The reason for said cancellation can be found in the CANCELLATION_REASON column. AIR_SYSTEM_DELAY is for if there was a delay caused by the airport flight systems. SECURTIY_DELAY is for if there was a delay for security reasons. AIRLINE_DELAY is for if there was a delay caused by the airline themselves. LATE_AIRCRAFT_DELAY is for if the flight was delayed by the aircraft being late to arrive to the origin airport. WEATHER_DELAY is for if there was a delay cased by the weather.

2.2 How many rows and columns are there in the airlines data frame? What does each row represent (a plane, an airport, a flight, or an airline company)? What does each column represent?

glimpse(airlines)
## Rows: 14
## Columns: 2
## $ IATA_CODE <chr> "UA", "AA", "US", "F9", "B6", "OO", "AS", "NK", "WN", "DL", …
## $ AIRLINE   <chr> "United Air Lines Inc.", "American Airlines Inc.", "US Airwa…

There are 14 rows and 2 columns for the airlines data frame.

Each row reprsents a different airline company. The AIRLINE column gives the actual name of the airline. The IATA_CODE column gived the corresponding IATA_CODE name for said airline.

2.3 How many rows and columns are there in the airports data frame? What does each row represent (a plane, an airport, a flight, or an airline company)? What does each column represent?

glimpse(airports)
## Rows: 322
## Columns: 7
## $ IATA_CODE <chr> "ABE", "ABI", "ABQ", "ABR", "ABY", "ACK", "ACT", "ACV", "ACY…
## $ AIRPORT   <chr> "Lehigh Valley International Airport", "Abilene Regional Air…
## $ CITY      <chr> "Allentown", "Abilene", "Albuquerque", "Aberdeen", "Albany",…
## $ STATE     <chr> "PA", "TX", "NM", "SD", "GA", "MA", "TX", "CA", "NJ", "AK", …
## $ COUNTRY   <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA…
## $ LATITUDE  <dbl> 40.65236, 32.41132, 35.04022, 45.44906, 31.53552, 41.25305, …
## $ LONGITUDE <dbl> -75.44040, -99.68190, -106.60919, -98.42183, -84.19447, -70.…

There are 322 rows and 7 columns for the airports data frame.

Each row represents a different airport. The first column, IATA_CODE is the IATA code name for the airport. The second column, AIRPORT is the full name of the airport. The CITY, STATE and COUNTRY columns are the city, state and country that the airport is located in. The LATITUDE and LONGITUDE columns together give you the coordinates for the airport.

Question 3

Merge all three data frames into one data frame according to the IATA code of airlines and airports. For airports.csv, please merge it according to both the origin and destination airports in CVG_Flights.csv, which means you need to merge twice. Here is the code sample/template. merged_data <- left_join(flights_data, airlines_data, by=c(“AIRLINE”=“IATA_CODE”))

Merge flights and airlines

Flight_Airline_data <- left_join(flights, airlines, by=c("AIRLINE"="IATA_CODE"))

Merge airports with the other data according to the origin and destination airports

merged_data <- left_join(Flight_Airline_data, airports, by=c("ORIGIN_AIRPORT"="IATA_CODE"))

Question 4

For this merged data set, print the first six rows.

print(head(merged_data, 6))
##   FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1    1/1/2015      MQ          3355            CVG                 MIA
## 2    1/1/2015      MQ          3597            CVG                 DFW
## 3    1/1/2015      OO          4811            BNA                 CVG
## 4    1/1/2015      DL          2315            CVG                 ATL
## 5    1/1/2015      F9          1308            IAD                 CVG
## 6    1/1/2015      DL          1502            CVG                 TPA
##   SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1          6:00:00 AM     5:56:00 AM              -4       15 6:11:00 AM
## 2          6:00:00 AM           <NA>              NA       NA       <NA>
## 3          6:45:00 AM     6:37:00 AM              -8       16 6:53:00 AM
## 4          7:10:00 AM     7:08:00 AM              -2       36 7:44:00 AM
## 5          7:15:00 AM     7:05:00 AM             -10       11 7:16:00 AM
## 6          7:20:00 AM     7:19:00 AM              -1        9 7:28:00 AM
##   SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE  WHEELS_ON TAXI_IN
## 1            153          150      123      948 8:14:00 AM      12
## 2            165           NA       NA      812       <NA>      NA
## 3             72           66       45      230 8:38:00 AM       5
## 4            102          101       62      373 8:46:00 AM       3
## 5             95           91       75      388 8:31:00 AM       5
## 6            141          116      105      773 9:13:00 AM       2
##   SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1        8:33:00 AM   8:26:00 AM            -7        0         0
## 2        7:45:00 AM         <NA>            NA        0         1
## 3        8:57:00 AM   8:43:00 AM           -14        0         0
## 4        8:52:00 AM   8:49:00 AM            -3        0         0
## 5        8:50:00 AM   8:36:00 AM           -14        0         0
## 6        9:41:00 AM   9:15:00 AM           -26        0         0
##   CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1                <NA>               NA             NA            NA
## 2                   B               NA             NA            NA
## 3                <NA>               NA             NA            NA
## 4                <NA>               NA             NA            NA
## 5                <NA>               NA             NA            NA
## 6                <NA>               NA             NA            NA
##   LATE_AIRCRAFT_DELAY WEATHER_DELAY                    AIRLINE.y
## 1                  NA            NA American Eagle Airlines Inc.
## 2                  NA            NA American Eagle Airlines Inc.
## 3                  NA            NA        Skywest Airlines Inc.
## 4                  NA            NA         Delta Air Lines Inc.
## 5                  NA            NA       Frontier Airlines Inc.
## 6                  NA            NA         Delta Air Lines Inc.
##                                              AIRPORT      CITY STATE COUNTRY
## 1 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 2 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 3                    Nashville International Airport Nashville    TN     USA
## 4 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
## 5            Washington Dulles International Airport Chantilly    VA     USA
## 6 Cincinnati/Northern Kentucky International Airport Covington    KY     USA
##   LATITUDE LONGITUDE
## 1 39.04614 -84.66217
## 2 39.04614 -84.66217
## 3 36.12448 -86.67818
## 4 39.04614 -84.66217
## 5 38.94453 -77.45581
## 6 39.04614 -84.66217

Question 5

For this merged data set, are there any missing values? In what variables are these missing values? What is the percentages of missing values for each variable (i.e., the number of missing values divided by the total number of observations)?

5.1 Are there any missing values?

sum(is.na.data.frame(merged_data))
## [1] 40270

There are 40270 missing values in the merged_data data frame.

5.2 What variables are the missing values in?

map(merged_data, ~sum(is.na(.)))
## $FLIGHT_DATE
## [1] 0
## 
## $AIRLINE
## [1] 0
## 
## $FLIGHT_NUMBER
## [1] 0
## 
## $ORIGIN_AIRPORT
## [1] 0
## 
## $DESTINATION_AIRPORT
## [1] 0
## 
## $SCHEDULED_DEPARTURE
## [1] 0
## 
## $DEPARTURE_TIME
## [1] 370
## 
## $DEPARTURE_DELAY
## [1] 370
## 
## $TAXI_OUT
## [1] 378
## 
## $WHEELS_OFF
## [1] 378
## 
## $SCHEDULED_TIME
## [1] 0
## 
## $ELAPSED_TIME
## [1] 392
## 
## $AIR_TIME
## [1] 392
## 
## $DISTANCE
## [1] 0
## 
## $WHEELS_ON
## [1] 381
## 
## $TAXI_IN
## [1] 381
## 
## $SCHEDULED_ARRIVAL
## [1] 0
## 
## $ARRIVAL_TIME
## [1] 381
## 
## $ARRIVAL_DELAY
## [1] 392
## 
## $DIVERTED
## [1] 0
## 
## $CANCELLED
## [1] 0
## 
## $CANCELLATION_REASON
## [1] 7385
## 
## $AIR_SYSTEM_DELAY
## [1] 5814
## 
## $SECURITY_DELAY
## [1] 5814
## 
## $AIRLINE_DELAY
## [1] 5814
## 
## $LATE_AIRCRAFT_DELAY
## [1] 5814
## 
## $WEATHER_DELAY
## [1] 5814
## 
## $AIRLINE.y
## [1] 0
## 
## $AIRPORT
## [1] 0
## 
## $CITY
## [1] 0
## 
## $STATE
## [1] 0
## 
## $COUNTRY
## [1] 0
## 
## $LATITUDE
## [1] 0
## 
## $LONGITUDE
## [1] 0

The missing values are in the DEPARTURE_TIME, DEPARTURE_DELAY, TAXI_OUT, WHEELS_OFF,ELAPSED_TIME, AIR_TIME, WHEELS_ON, TAXI_IN, ARRIVAL_TIME, ARRIVAL_DELAY, CANCELLATION_REASON, AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY and WEATHER_DELAY variables.

5.3 What is the percentage of missing values for each variable?

print((colMeans(is.na(merged_data)))*100)
##         FLIGHT_DATE             AIRLINE       FLIGHT_NUMBER      ORIGIN_AIRPORT 
##            0.000000            0.000000            0.000000            0.000000 
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE      DEPARTURE_TIME     DEPARTURE_DELAY 
##            0.000000            0.000000            4.766199            4.766199 
##            TAXI_OUT          WHEELS_OFF      SCHEDULED_TIME        ELAPSED_TIME 
##            4.869252            4.869252            0.000000            5.049594 
##            AIR_TIME            DISTANCE           WHEELS_ON             TAXI_IN 
##            5.049594            0.000000            4.907896            4.907896 
##   SCHEDULED_ARRIVAL        ARRIVAL_TIME       ARRIVAL_DELAY            DIVERTED 
##            0.000000            4.907896            5.049594            0.000000 
##           CANCELLED CANCELLATION_REASON    AIR_SYSTEM_DELAY      SECURITY_DELAY 
##            0.000000           95.130748           74.893727           74.893727 
##       AIRLINE_DELAY LATE_AIRCRAFT_DELAY       WEATHER_DELAY           AIRLINE.y 
##           74.893727           74.893727           74.893727            0.000000 
##             AIRPORT                CITY               STATE             COUNTRY 
##            0.000000            0.000000            0.000000            0.000000 
##            LATITUDE           LONGITUDE 
##            0.000000            0.000000

DEPARTURE_TIME and DEPARTURE_DELAY are both missing about 4.77% . TAXI_OUT and WHEELS_OFF are both missing about 4.87%. ELASPSED_TIME and AIR_TIME are both missing about 5.05%. WHEELS_ON, TAXI_IN and ARRIVAL_TIME are all missing about 4.91%. ARRIVAL_DELAY is missing around 5.05% and CANCELLATION_REASON is missing 95.13% of its values. AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY and WEATHER_DELAY are all missing about 74.89% of their values.

Question 6

What is the proportion of canceled flights (to all flights)? How many different cancellation reasons are there?

6.1 What is the proportion of canceled flights (to all flights)?

#Total number of flights in merged data
total_flights <- nrow(merged_data)
print(total_flights)
## [1] 7763
#Filter for Cancelled flights
all_cancelled_flights <- dplyr::filter( merged_data, CANCELLED %in% "1")

#Number of cancelled flights
num_cancelled_flights <- nrow(all_cancelled_flights)
print(num_cancelled_flights)
## [1] 378
#Number of non-cancelled flights
num_non_cancelled_flights <- total_flights - num_cancelled_flights
print(num_non_cancelled_flights)
## [1] 7385
#Proportion
print(num_cancelled_flights/total_flights)
## [1] 0.04869252

Approximately 49 flights are cancelled for every 1000 flights. About 4.87% of flights have been cancelled out out 7763 flights in the data frame.

6.2 How many different cancellation reasons are there?

table(merged_data$CANCELLATION_REASON)
## 
##   A   B   C 
##  71 180 127

There are three different cancellation reasons.

Question 7

For DEPARTURE_TIME, are there missing values? Do we know why these values are missing?

sum(is.na(merged_data$DEPARTURE_TIME))
## [1] 370

There are 370 values missing from DEPARTURE_TIME and they are most likely missing due to cancelled flights.

Question 8

In the merged data frame, create a new variable (i.e., new column) as the time difference between the SCHEDULED_TIME and the ELAPSED_TIME, i.e., SCHEDULED_TIME - ELAPSED_TIME. Print the first six elements of the new variable.

# create the new variable
SCHEDULED_MINUS_ELAPSED <- merged_data$SCHEDULED_TIME - merged_data$ELAPSED_TIME
# add new variable to merged_data via a join

merged_data2 <- cbind(merged_data, SCHEDULED_MINUS_ELAPSED)
print(head(merged_data$SCHEDULED_MINUS_ELAPSED, 6))
## NULL

Question 9

Extract the observations (i.e., rows) with AIRLINE of Delta and ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport, and DEPARTURE_DELAY time larger than 30 minutes, and put these observations into a new data frame. Print the first six flight numbers of the new data frame.

# Extract the observations and create a new data frame
DELTA_CVG_DEPARTURE_DELAY_30_MIN_PLUS <- subset(merged_data, AIRLINE == 'DL' & ORIGIN_AIRPORT == 'CVG' & DEPARTURE_DELAY > 30)
# Print the first 6 flight numbers in the new data frame. 
print(head(DELTA_CVG_DEPARTURE_DELAY_30_MIN_PLUS$FLIGHT_NUMBER, 6))
## [1] 1741 1833 1854 1503   63 1934

Question 10

Use group_by() and summarize() to compute the average departure delay time for different airlines. Which airline has the longest and shortest average department delay?

10.1 Use group_by() and summarize() to compute the average departure delay time for different airlines.

avg_dep_delay_by_airline <- merged_data %>%
  group_by(AIRLINE) %>%
  summarise(avg_dep_delay = mean(DEPARTURE_DELAY, is.na = TRUE))
print(avg_dep_delay_by_airline)
## # A tibble: 5 × 2
##   AIRLINE avg_dep_delay
##   <chr>           <dbl>
## 1 DL                 NA
## 2 EV                 NA
## 3 F9                 NA
## 4 MQ                 NA
## 5 OO                 NA

10.2 Which airline has the longest and shortest average department delay?

I am not sure which airlines have the longest and shortest departure delays.

Question 11

Use group_by() and summarize() to compute the average departure delay time for different ORIGIN_AIRPORT. Sort these airports descendingly according to the average departure delay time and print the top six rows, i.e., top six airports and their average delay times. Which ORIGIN_AIRPORT has the longest and shortest average department delay?

origin_airport_avg_dep_delay_time <- merged_data %>%
  group_by(ORIGIN_AIRPORT) %>%
  summarise(avg_dep_delay = mean(DEPARTURE_DELAY, is.na = TRUE))
  
##( origin_airport_avg_dep_delay_time <- arrange(origin_airport_avg_dep_delay_time, avg_dep_delay = desc) )## I cant figure out whats causing the error here. I turned this line of code into a comment because it was preventing me from knitting the document. 
print(head(origin_airport_avg_dep_delay_time, 6))
## # A tibble: 6 × 2
##   ORIGIN_AIRPORT avg_dep_delay
##   <chr>                  <dbl>
## 1 ATL                       NA
## 2 BDL                       -4
## 3 BNA                       NA
## 4 BOS                       NA
## 5 CLT                       NA
## 6 CVG                       NA

Im not sure what airports have the longest and shortest average department delays. I can’t seem to figure out what I am doing wrong on the last two questions.

#Question 12 ##For flights departing from CVG airport, count how many flights are offered by each airline. Print the entire list.

# Filter for flights departing CVG and create a new data frame
Flights_Departing_CVG <- subset(merged_data, ORIGIN_AIRPORT == 'CVG')
print(count(Flights_Departing_CVG, AIRLINE))
##   AIRLINE    n
## 1      DL 1149
## 2      EV  776
## 3      F9  258
## 4      MQ 1041
## 5      OO  662