QUESTION 1
Create a working directory, and download three data files from the homewokr page to the directory. Read these files into R as three data frames.
getwd()
## [1] "C:/Users/ridhi/OneDrive/Desktop"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.3 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read data into R as data frames
flights <- read.csv("CVG_Flights.csv", header = TRUE, na.strings = "")
airlines <- read.csv("airlines.csv", header = TRUE, na.strings = "")
airports <- read.csv("airports.csv", header = TRUE, na.strings = "")
QUESTION 2
How many rows and columns are there in each data frame? What does each row represent (a plane, an airport, a flight, or an airline company)? What does each column represent? Explain the meanings of variables to the best of you understanding.
# flights
nrow(flights)
## [1] 7763
ncol(flights)
## [1] 27
# Each row represents a flight, with details such as flight number, airline, departure/arrival times, etc.
# Each column represents a specific attribute related to the flights, such as flight number, departure time, arrival time, airline, etc.
# airlines
nrow(airlines)
## [1] 14
ncol(airlines)
## [1] 2
# Each row represents an airline company, with details such as the airline name, IATA code, and airline ID.
# Each column represents a specific attribute related to the airlines, such as airline name, IATA code, and airline ID.
# airports
nrow(airports)
## [1] 322
ncol(airports)
## [1] 7
# Each row represents an airport, with details such as the airport name, IATA code, and airport ID.
# Each column represents a specific attribute related to the airports, such as airport name, IATA code, and airport ID.
The flights data has 7763 rows and 27 columns. Each row represents a flight information, each column represents a category such as flight date, airline, flight number. The airlines data has 14 rows and 2 columns. Each row represents an airline information, each column represents a category such as code and airline name. The airports data has 322 rows and 7 columns. Each row represents an airport information, each column represents a category such as code, airport name, city.
QUESTION 3
Merge all three data frames into one data frame according to the IATA code of airlines and airports.
merged_data <- left_join(flights, airlines, by=c("AIRLINE"="IATA_CODE"))
merged_data <- left_join(merged_data, airports, by=c("ORIGIN_AIRPORT"="IATA_CODE"))
merged_data <- left_join(merged_data, airports, by=c("DESTINATION_AIRPORT"="IATA_CODE"))
QUESTION 4
For this merged data set, print the first six rows.
# Print the first six rows of the merged data frame
head(merged_data)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 1 1/1/2015 MQ 3355 CVG MIA
## 2 1/1/2015 MQ 3597 CVG DFW
## 3 1/1/2015 OO 4811 BNA CVG
## 4 1/1/2015 DL 2315 CVG ATL
## 5 1/1/2015 F9 1308 IAD CVG
## 6 1/1/2015 DL 1502 CVG TPA
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT WHEELS_OFF
## 1 6:00:00 AM 5:56:00 AM -4 15 6:11:00 AM
## 2 6:00:00 AM <NA> NA NA <NA>
## 3 6:45:00 AM 6:37:00 AM -8 16 6:53:00 AM
## 4 7:10:00 AM 7:08:00 AM -2 36 7:44:00 AM
## 5 7:15:00 AM 7:05:00 AM -10 11 7:16:00 AM
## 6 7:20:00 AM 7:19:00 AM -1 9 7:28:00 AM
## SCHEDULED_TIME ELAPSED_TIME AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 1 153 150 123 948 8:14:00 AM 12
## 2 165 NA NA 812 <NA> NA
## 3 72 66 45 230 8:38:00 AM 5
## 4 102 101 62 373 8:46:00 AM 3
## 5 95 91 75 388 8:31:00 AM 5
## 6 141 116 105 773 9:13:00 AM 2
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 1 8:33:00 AM 8:26:00 AM -7 0 0
## 2 7:45:00 AM <NA> NA 0 1
## 3 8:57:00 AM 8:43:00 AM -14 0 0
## 4 8:52:00 AM 8:49:00 AM -3 0 0
## 5 8:50:00 AM 8:36:00 AM -14 0 0
## 6 9:41:00 AM 9:15:00 AM -26 0 0
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 1 <NA> NA NA NA
## 2 B NA NA NA
## 3 <NA> NA NA NA
## 4 <NA> NA NA NA
## 5 <NA> NA NA NA
## 6 <NA> NA NA NA
## LATE_AIRCRAFT_DELAY WEATHER_DELAY AIRLINE.y
## 1 NA NA American Eagle Airlines Inc.
## 2 NA NA American Eagle Airlines Inc.
## 3 NA NA Skywest Airlines Inc.
## 4 NA NA Delta Air Lines Inc.
## 5 NA NA Frontier Airlines Inc.
## 6 NA NA Delta Air Lines Inc.
## AIRPORT.x CITY.x STATE.x
## 1 Cincinnati/Northern Kentucky International Airport Covington KY
## 2 Cincinnati/Northern Kentucky International Airport Covington KY
## 3 Nashville International Airport Nashville TN
## 4 Cincinnati/Northern Kentucky International Airport Covington KY
## 5 Washington Dulles International Airport Chantilly VA
## 6 Cincinnati/Northern Kentucky International Airport Covington KY
## COUNTRY.x LATITUDE.x LONGITUDE.x
## 1 USA 39.04614 -84.66217
## 2 USA 39.04614 -84.66217
## 3 USA 36.12448 -86.67818
## 4 USA 39.04614 -84.66217
## 5 USA 38.94453 -77.45581
## 6 USA 39.04614 -84.66217
## AIRPORT.y CITY.y STATE.y
## 1 Miami International Airport Miami FL
## 2 Dallas/Fort Worth International Airport Dallas-Fort Worth TX
## 3 Cincinnati/Northern Kentucky International Airport Covington KY
## 4 Hartsfield-Jackson Atlanta International Airport Atlanta GA
## 5 Cincinnati/Northern Kentucky International Airport Covington KY
## 6 Tampa International Airport Tampa FL
## COUNTRY.y LATITUDE.y LONGITUDE.y
## 1 USA 25.79325 -80.29056
## 2 USA 32.89595 -97.03720
## 3 USA 39.04614 -84.66217
## 4 USA 33.64044 -84.42694
## 5 USA 39.04614 -84.66217
## 6 USA 27.97547 -82.53325
QUESTION 5
For this merged data set, are there any missing values? In what variables are these missing values? What is the percentages of missing values for each variable (i.e., the number of missing values divided by the total number of observations)?
print(colSums(is.na(merged_data)))
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 0 0 0 0
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## 0 0 370 370
## TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME
## 378 378 0 392
## AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 392 0 381 381
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED
## 0 381 392 0
## CANCELLED CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY
## 0 7385 5814 5814
## AIRLINE_DELAY LATE_AIRCRAFT_DELAY WEATHER_DELAY AIRLINE.y
## 5814 5814 5814 0
## AIRPORT.x CITY.x STATE.x COUNTRY.x
## 0 0 0 0
## LATITUDE.x LONGITUDE.x AIRPORT.y CITY.y
## 0 0 0 0
## STATE.y COUNTRY.y LATITUDE.y LONGITUDE.y
## 0 0 0 0
sum(is.na(merged_data))/prod(dim(merged_data))
## [1] 0.1296857
There are 40270 missing values, from DEPARTURE_TIME,TAXI_OUT, AIR_TIME, WHEELS_ON, ARRIVAL_DELAY, AIR_SYSTEM_DELAY, AIRLINE_DELAY, WEATHER_DELAY, DEPARTURE_DELAY, WHEELS_OFF, ELAPSED_TIME, TAXI_IN, ARRIVAL_TIME, CANCELLATION_REASON, SECURITY_DELAY, LATE_AIRCRAFT_DELAY. Percentage of missing value = 40270/(7763*40) = 12.97%.
QUESTION 6
What is the proportion of canceled flights (to all flights)? How many different cancellation reasons are there?
prop_canceled <- sum(merged_data$CANCELLED == 1) / nrow(merged_data)
table(merged_data$CANCELLATION_REASON)
##
## A B C
## 71 180 127
The proportion of cancelled flights: 0.0487. Cancellation reasons for A: 71, for B: 180, for C: 127.
QUESTION 7
For DEPARTURE_TIME, are there missing values? Do we know why these values are missing?
# Missing values in DEPARTURE_TIME and reasons
print(sum(is.na(merged_data$DEPARTURE_TIME)))
## [1] 370
QUESTION 8
In the merged data frame, create a new variable (i.e., new column) as the time difference between the SCHEDULED_TIME and the ELAPSED_TIME, i.e., SCHEDULED_TIME - ELAPSED_TIME. Print the first six elements of the new variable.``
# Create new variable for time difference between SCHEDULED_TIME and ELAPSED_TIME
merged_data$TIME_DIFFERENCE <- merged_data$SCHEDULED_TIME - merged_data$ELAPSED_TIME
head(merged_data$TIME_DIFFERENCE)
## [1] 3 NA 6 1 4 25
QUESTION 9
Extract the observations (i.e., rows) with AIRLINE of Delta and ORIGIN_AIRPORT of Cincinnati/Northern Kentucky International Airport, and DEPARTURE_DELAY time larger than 30 minutes, and put these observations into a new data frame. Print the first six flight numbers of the new data frame.
# Extract observations with specific conditions into a new data frame
delta_flights <- merged_data %>%
filter(AIRLINE == "DL" , ORIGIN_AIRPORT == "CVG" , DEPARTURE_DELAY > 30)
print(head(delta_flights$FLIGHT_NUMBER))
## [1] 1741 1833 1854 1503 63 1934
QUESTION 10
Use group_by() and summarize() to compute the average departure delay time for different airlines. Which airline has the longest and shortest average department delay?
# Compute average departure delay for different airlines
average_delay <- merged_data %>% group_by(AIRLINE) %>% summarize(AVG_DELAY = mean(DEPARTURE_DELAY, na.rm = TRUE)) %>%
arrange(desc(AVG_DELAY))
print(average_delay)
## # A tibble: 5 × 2
## AIRLINE AVG_DELAY
## <chr> <dbl>
## 1 F9 23.9
## 2 MQ 19.1
## 3 OO 11.1
## 4 EV 10.4
## 5 DL 6.09
QUESTION 11
Use group_by() and summarize() to compute the average departure delay time for different ORIGIN_AIRPORT. Sort these airports descendingly according to the average departure delay time and print the top six rows, i.e., top six airports and their average delay times. Which ORIGIN_AIRPORT has the longest and shortest average department delay?
# Compute average departure delay for different ORIGIN_AIRPORTs
average_delay_by_airport <- merged_data %>% group_by(ORIGIN_AIRPORT) %>% summarize(AVG_DELAY = mean(DEPARTURE_DELAY, na.rm = TRUE)) %>% arrange(desc(AVG_DELAY))
print(head(average_delay_by_airport))
## # A tibble: 6 × 2
## ORIGIN_AIRPORT AVG_DELAY
## <chr> <dbl>
## 1 TTN 74.3
## 2 RDU 70.8
## 3 PIT 33.5
## 4 JFK 27.4
## 5 LAS 26.0
## 6 LGA 25.6
QUESTION 12
For flights departing from CVG airport, count how many flights are offered by each airline. Print the entire list.
# Count flights departing from CVG airport by each airline
flights_from_CVG <- merged_data %>%
filter(ORIGIN_AIRPORT == "CVG") %>%
group_by(AIRLINE) %>%
summarise(num_flights = n())
print(flights_from_CVG)
## # A tibble: 5 × 2
## AIRLINE num_flights
## <chr> <int>
## 1 DL 1149
## 2 EV 776
## 3 F9 258
## 4 MQ 1041
## 5 OO 662