QUESTION 1

Create a working directory, and download three data files from the homewokr page to the directory. Read these files into R as three data frames.

getwd()
## [1] "C:/Users/ridhi/OneDrive/Desktop"
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read data into R as data frames


flights <- read.csv("CVG_Flights.csv", header = TRUE, na.strings = "")
airlines <- read.csv("airlines.csv", header = TRUE, na.strings = "")
airports <- read.csv("airports.csv", header = TRUE, na.strings = "")

QUESTION 2

How many rows and columns are there in each data frame? What does each row represent (a plane, an airport, a flight, or an airline company)? What does each column represent? Explain the meanings of variables to the best of you understanding.

# flights
nrow(flights)
## [1] 7763
ncol(flights)
## [1] 27
# Each row represents a flight, with details such as flight number, airline, departure/arrival times, etc.
# Each column represents a specific attribute related to the flights, such as flight number, departure time, arrival time, airline, etc.
# airlines
nrow(airlines)
## [1] 14
ncol(airlines)
## [1] 2
# Each row represents an airline company, with details such as the airline name, IATA code, and airline ID.
# Each column represents a specific attribute related to the airlines, such as airline name, IATA code, and airline ID.
# airports
nrow(airports)
## [1] 322
ncol(airports)
## [1] 7
# Each row represents an airport, with details such as the airport name, IATA code, and airport ID.
# Each column represents a specific attribute related to the airports, such as airport name, IATA code, and airport ID.

The flights data has 7763 rows and 27 columns. Each row represents a flight information, each column represents a category such as flight date, airline, flight number. The airlines data has 14 rows and 2 columns. Each row represents an airline information, each column represents a category such as code and airline name. The airports data has 322 rows and 7 columns. Each row represents an airport information, each column represents a category such as code, airport name, city.

QUESTION 3

Merge all three data frames into one data frame according to the IATA code of airlines and airports.

merged_data <- left_join(flights, airlines, by=c("AIRLINE"="IATA_CODE"))
merged_data <- left_join(merged_data, airports, by=c("ORIGIN_AIRPORT"="IATA_CODE"))
merged_data <- left_join(merged_data, airports, by=c("DESTINATION_AIRPORT"="IATA_CODE"))

QUESTION 4

For this merged data set, print the first six rows.

# Print the first six rows of the merged data frame
head(merged_data)
##   FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1    1/1/2015      MQ          3355            CVG                 MIA
## 2    1/1/2015      MQ          3597            CVG                 DFW
## 3    1/1/2015      OO          4811            BNA                 CVG
## 4    1/1/2015      DL          2315            CVG                 ATL
## 5    1/1/2015      F9          1308            IAD                 CVG
## 6    1/1/2015      DL          1502            CVG                 TPA
##   SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1          6:00:00 AM     5:56:00 AM              -4       15 6:11:00 AM
## 2          6:00:00 AM           <NA>              NA       NA       <NA>
## 3          6:45:00 AM     6:37:00 AM              -8       16 6:53:00 AM
## 4          7:10:00 AM     7:08:00 AM              -2       36 7:44:00 AM
## 5          7:15:00 AM     7:05:00 AM             -10       11 7:16:00 AM
## 6          7:20:00 AM     7:19:00 AM              -1        9 7:28:00 AM
##   SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE  WHEELS_ON TAXI_IN
## 1            153          150      123      948 8:14:00 AM      12
## 2            165           NA       NA      812       <NA>      NA
## 3             72           66       45      230 8:38:00 AM       5
## 4            102          101       62      373 8:46:00 AM       3
## 5             95           91       75      388 8:31:00 AM       5
## 6            141          116      105      773 9:13:00 AM       2
##   SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1        8:33:00 AM   8:26:00 AM            -7        0         0
## 2        7:45:00 AM         <NA>            NA        0         1
## 3        8:57:00 AM   8:43:00 AM           -14        0         0
## 4        8:52:00 AM   8:49:00 AM            -3        0         0
## 5        8:50:00 AM   8:36:00 AM           -14        0         0
## 6        9:41:00 AM   9:15:00 AM           -26        0         0
##   CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1                <NA>               NA             NA            NA
## 2                   B               NA             NA            NA
## 3                <NA>               NA             NA            NA
## 4                <NA>               NA             NA            NA
## 5                <NA>               NA             NA            NA
## 6                <NA>               NA             NA            NA
##   LATE_AIRCRAFT_DELAY WEATHER_DELAY                    AIRLINE.y
## 1                  NA            NA American Eagle Airlines Inc.
## 2                  NA            NA American Eagle Airlines Inc.
## 3                  NA            NA        Skywest Airlines Inc.
## 4                  NA            NA         Delta Air Lines Inc.
## 5                  NA            NA       Frontier Airlines Inc.
## 6                  NA            NA         Delta Air Lines Inc.
##                                            AIRPORT.x    CITY.x STATE.x
## 1 Cincinnati/Northern Kentucky International Airport Covington      KY
## 2 Cincinnati/Northern Kentucky International Airport Covington      KY
## 3                    Nashville International Airport Nashville      TN
## 4 Cincinnati/Northern Kentucky International Airport Covington      KY
## 5            Washington Dulles International Airport Chantilly      VA
## 6 Cincinnati/Northern Kentucky International Airport Covington      KY
##   COUNTRY.x LATITUDE.x LONGITUDE.x
## 1       USA   39.04614   -84.66217
## 2       USA   39.04614   -84.66217
## 3       USA   36.12448   -86.67818
## 4       USA   39.04614   -84.66217
## 5       USA   38.94453   -77.45581
## 6       USA   39.04614   -84.66217
##                                            AIRPORT.y            CITY.y STATE.y
## 1                        Miami International Airport             Miami      FL
## 2            Dallas/Fort Worth International Airport Dallas-Fort Worth      TX
## 3 Cincinnati/Northern Kentucky International Airport         Covington      KY
## 4   Hartsfield-Jackson Atlanta International Airport           Atlanta      GA
## 5 Cincinnati/Northern Kentucky International Airport         Covington      KY
## 6                        Tampa International Airport             Tampa      FL
##   COUNTRY.y LATITUDE.y LONGITUDE.y
## 1       USA   25.79325   -80.29056
## 2       USA   32.89595   -97.03720
## 3       USA   39.04614   -84.66217
## 4       USA   33.64044   -84.42694
## 5       USA   39.04614   -84.66217
## 6       USA   27.97547   -82.53325

QUESTION 5

For this merged data set, are there any missing values? In what variables are these missing values? What is the percentages of missing values for each variable (i.e., the number of missing values divided by the total number of observations)?

print(colSums(is.na(merged_data)))
##         FLIGHT_DATE             AIRLINE       FLIGHT_NUMBER      ORIGIN_AIRPORT 
##                   0                   0                   0                   0 
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE      DEPARTURE_TIME     DEPARTURE_DELAY 
##                   0                   0                 370                 370 
##            TAXI_OUT          WHEELS_OFF      SCHEDULED_TIME        ELAPSED_TIME 
##                 378                 378                   0                 392 
##            AIR_TIME            DISTANCE           WHEELS_ON             TAXI_IN 
##                 392                   0                 381                 381 
##   SCHEDULED_ARRIVAL        ARRIVAL_TIME       ARRIVAL_DELAY            DIVERTED 
##                   0                 381                 392                   0 
##           CANCELLED CANCELLATION_REASON    AIR_SYSTEM_DELAY      SECURITY_DELAY 
##                   0                7385                5814                5814 
##       AIRLINE_DELAY LATE_AIRCRAFT_DELAY       WEATHER_DELAY           AIRLINE.y 
##                5814                5814                5814                   0 
##           AIRPORT.x              CITY.x             STATE.x           COUNTRY.x 
##                   0                   0                   0                   0 
##          LATITUDE.x         LONGITUDE.x           AIRPORT.y              CITY.y 
##                   0                   0                   0                   0 
##             STATE.y           COUNTRY.y          LATITUDE.y         LONGITUDE.y 
##                   0                   0                   0                   0
sum(is.na(merged_data))/prod(dim(merged_data))
## [1] 0.1296857

There are 40270 missing values, from DEPARTURE_TIME,TAXI_OUT, AIR_TIME, WHEELS_ON, ARRIVAL_DELAY, AIR_SYSTEM_DELAY, AIRLINE_DELAY, WEATHER_DELAY, DEPARTURE_DELAY, WHEELS_OFF, ELAPSED_TIME, TAXI_IN, ARRIVAL_TIME, CANCELLATION_REASON, SECURITY_DELAY, LATE_AIRCRAFT_DELAY. Percentage of missing value = 40270/(7763*40) = 12.97%.

QUESTION 6

What is the proportion of canceled flights (to all flights)? How many different cancellation reasons are there?

prop_canceled <- sum(merged_data$CANCELLED == 1) / nrow(merged_data)
table(merged_data$CANCELLATION_REASON)
## 
##   A   B   C 
##  71 180 127

The proportion of cancelled flights: 0.0487. Cancellation reasons for A: 71, for B: 180, for C: 127.

QUESTION 7

For DEPARTURE_TIME, are there missing values? Do we know why these values are missing?

# Missing values in DEPARTURE_TIME and reasons

print(sum(is.na(merged_data$DEPARTURE_TIME)))
## [1] 370

QUESTION 8

In the merged data frame, create a new variable (i.e., new column) as the time difference between the SCHEDULED_TIME and the ELAPSED_TIME, i.e., SCHEDULED_TIME - ELAPSED_TIME. Print the first six elements of the new variable.``

# Create new variable for time difference between SCHEDULED_TIME and ELAPSED_TIME
merged_data$TIME_DIFFERENCE <- merged_data$SCHEDULED_TIME - merged_data$ELAPSED_TIME
head(merged_data$TIME_DIFFERENCE)
## [1]  3 NA  6  1  4 25

QUESTION 9

Extract the observations (i.e., rows) with AIRLINE of Delta and ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport, and DEPARTURE_DELAY time larger than 30 minutes, and put these observations into a new data frame. Print the first six flight numbers of the new data frame.

# Extract observations with specific conditions into a new data frame

delta_flights <- merged_data %>%
  filter(AIRLINE == "DL" , ORIGIN_AIRPORT == "CVG" , DEPARTURE_DELAY > 30)
print(head(delta_flights$FLIGHT_NUMBER))
## [1] 1741 1833 1854 1503   63 1934

QUESTION 10

Use group_by() and summarize() to compute the average departure delay time for different airlines. Which airline has the longest and shortest average department delay?

# Compute average departure delay for different airlines

average_delay <- merged_data %>% group_by(AIRLINE) %>% summarize(AVG_DELAY = mean(DEPARTURE_DELAY, na.rm = TRUE)) %>%
  arrange(desc(AVG_DELAY))
print(average_delay)
## # A tibble: 5 × 2
##   AIRLINE AVG_DELAY
##   <chr>       <dbl>
## 1 F9          23.9 
## 2 MQ          19.1 
## 3 OO          11.1 
## 4 EV          10.4 
## 5 DL           6.09

QUESTION 11

Use group_by() and summarize() to compute the average departure delay time for different ORIGIN_AIRPORT. Sort these airports descendingly according to the average departure delay time and print the top six rows, i.e., top six airports and their average delay times. Which ORIGIN_AIRPORT has the longest and shortest average department delay?

# Compute average departure delay for different ORIGIN_AIRPORTs


average_delay_by_airport <- merged_data %>% group_by(ORIGIN_AIRPORT) %>% summarize(AVG_DELAY = mean(DEPARTURE_DELAY, na.rm = TRUE)) %>% arrange(desc(AVG_DELAY))
print(head(average_delay_by_airport))
## # A tibble: 6 × 2
##   ORIGIN_AIRPORT AVG_DELAY
##   <chr>              <dbl>
## 1 TTN                 74.3
## 2 RDU                 70.8
## 3 PIT                 33.5
## 4 JFK                 27.4
## 5 LAS                 26.0
## 6 LGA                 25.6

QUESTION 12

For flights departing from CVG airport, count how many flights are offered by each airline. Print the entire list.

# Count flights departing from CVG airport by each airline

flights_from_CVG <- merged_data %>%
  filter(ORIGIN_AIRPORT == "CVG") %>%
  group_by(AIRLINE) %>%
  summarise(num_flights = n())
print(flights_from_CVG)
## # A tibble: 5 × 2
##   AIRLINE num_flights
##   <chr>         <int>
## 1 DL             1149
## 2 EV              776
## 3 F9              258
## 4 MQ             1041
## 5 OO              662