For this first question, I want to first read the data and load programs to assist in this case.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
CVG.Flights <- read.csv("CVG_Flights.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
airlines <- read.csv("airlines.xls.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
airports <- read.csv("airports.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
After reading, I can now run some code. First I will find how many variables there are in the data sets for CVG_Flights.
ncol(CVG.Flights)
## [1] 27
This output shows me that there are 27 different variables for the data set. Next I will look to see how many observations there are.
nrow(CVG.Flights)
## [1] 7763
This output tells me that CVG_Flights has 7763 observations.
To find the number of missing variables in the dataset, I will you the sum function to find the number of NA’s.
sum(is.na(CVG.Flights))
## [1] 40270
This output shows that there is 40,270 missing values in the data set. To help visualize the missing data, I will show the number of NA’s per variable.
colSums(is.na(CVG.Flights))
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 0 0 0 0
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## 0 0 370 370
## TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME
## 378 378 0 392
## AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 392 0 381 381
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED
## 0 381 392 0
## CANCELLED CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY
## 0 7385 5814 5814
## AIRLINE_DELAY LATE_AIRCRAFT_DELAY WEATHER_DELAY
## 5814 5814 5814
This output tells me that there are missing values for the following variables: * DEPARTURE_TIME * DEPARTURE_DELAY * TAXI_OUT * WHEELS_OFF * ELAPSED_TIME * AIR_TIME * WHEELS_ON * TAXI_IN * ARRIVAL_TIME * ARRIVAL_DELAY * CANCELLATION_REASON * AIR_SYSTEM_DELAY * SECURITY_DELAY * AIRLINE_DELAY * LATE_AIRCRAFT_DELAY * WEATHER_DELAY.
To begin, I will first get a summary of the data set to help with the observations to be made.
summary(CVG.Flights)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 3/2/2015 : 139 DL:2289 Min. : 62 CVG :3886 CVG :3877
## 1/2/2015 : 136 EV:1553 1st Qu.:1908 ORD : 727 ORD : 762
## 3/5/2015 : 135 F9: 515 Median :3246 DFW : 458 DFW : 460
## 2/13/2015: 134 MQ:2080 Mean :3194 ATL : 412 ATL : 413
## 2/23/2015: 134 OO:1326 3rd Qu.:4536 EWR : 299 EWR : 294
## 3/6/2015 : 134 Max. :6297 IAH : 247 IAH : 242
## (Other) :6951 (Other):1734 (Other):1715
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT
## 4:15:00 PM: 210 9:28:00 AM: 58 Min. :-24.00 Min. : 1.00
## 9:30:00 AM: 190 9:27:00 AM: 54 1st Qu.: -5.00 1st Qu.: 12.00
## 6:00:00 AM: 179 9:25:00 AM: 40 Median : -1.00 Median : 15.00
## 6:00:00 PM: 106 9:30:00 AM: 40 Mean : 12.35 Mean : 18.52
## 4:00:00 PM: 104 9:29:00 AM: 35 3rd Qu.: 12.00 3rd Qu.: 21.00
## 8:30:00 AM: 103 (Other) :7166 Max. :708.00 Max. :122.00
## (Other) :6871 NA's : 370 NA's :370 NA's :378
## WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME AIR_TIME
## 9:41:00 AM: 35 Min. : 60.0 Min. : 49.0 Min. : 36.0
## 4:24:00 PM: 33 1st Qu.: 85.0 1st Qu.: 83.0 1st Qu.: 56.0
## 9:43:00 AM: 32 Median :126.0 Median :123.0 Median : 99.0
## 9:40:00 AM: 28 Mean :128.1 Mean :127.3 Mean :100.4
## 4:20:00 PM: 27 3rd Qu.:156.0 3rd Qu.:152.0 3rd Qu.:125.0
## (Other) :7230 Max. :313.0 Max. :340.0 Max. :319.0
## NA's : 378 NA's :392 NA's :392
## DISTANCE WHEELS_ON TAXI_IN SCHEDULED_ARRIVAL
## Min. : 229.0 1:13:00 AM: 98 Min. : 1.000 7:24:00 PM: 139
## 1st Qu.: 308.0 1:05:00 AM: 97 1st Qu.: 5.000 7:25:00 PM: 120
## Median : 589.0 1:12:00 AM: 97 Median : 6.000 3:30:00 PM: 87
## Mean : 675.9 1:19:00 AM: 97 Mean : 8.448 8:39:00 PM: 79
## 3rd Qu.: 871.0 1:25:00 AM: 97 3rd Qu.: 9.000 9:30:00 PM: 78
## Max. :2036.0 (Other) :6896 Max. :128.000 7:05:00 PM: 75
## NA's : 381 NA's :381 (Other) :7185
## ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 7:23:00 PM : 23 Min. :-56.00 Min. :0.000000 Min. :0.00000
## 10:25:00 AM: 20 1st Qu.:-12.00 1st Qu.:0.000000 1st Qu.:0.00000
## 6:06:00 PM : 20 Median : -2.00 Median :0.000000 Median :0.00000
## 10:53:00 AM: 19 Mean : 10.65 Mean :0.001803 Mean :0.04869
## 10:58:00 AM: 19 3rd Qu.: 16.00 3rd Qu.:0.000000 3rd Qu.:0.00000
## (Other) :7281 Max. :716.00 Max. :1.000000 Max. :1.00000
## NA's : 381 NA's :392
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## A : 71 Min. : 0.00 Min. :0.00 Min. : 0.00
## B : 180 1st Qu.: 0.00 1st Qu.:0.00 1st Qu.: 0.00
## C : 127 Median : 10.00 Median :0.00 Median : 0.00
## NA's:7385 Mean : 16.35 Mean :0.01 Mean : 17.06
## 3rd Qu.: 23.00 3rd Qu.:0.00 3rd Qu.: 18.00
## Max. :258.00 Max. :9.00 Max. :435.00
## NA's :5814 NA's :5814 NA's :5814
## LATE_AIRCRAFT_DELAY WEATHER_DELAY
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 2.00 Median : 0.000
## Mean : 23.24 Mean : 3.873
## 3rd Qu.: 30.00 3rd Qu.: 0.000
## Max. :435.00 Max. :638.000
## NA's :5814 NA's :5814
For FLIGHT_DATE, I observed that the data was only collected in Jan, Feb and March 2015.
For AIRLINE, there are 5 total airlines in the dataset, and Delta having the most inputs.
For Origin airport, I want to see how many total airports are in the data set.
summary(CVG.Flights$ORIGIN_AIRPORT)
## ATL BDL BNA BOS CLT CVG DEN DFW DTW EWR FLL GRR IAD IAH JFK LAS
## 412 3 100 58 31 3886 142 458 31 299 100 81 6 247 69 79
## LAX LGA MCI MCO MEM MIA MKE MSN MSP ORD PHX PIT RDU RSW SAN SEA
## 87 53 1 196 16 134 54 33 85 727 41 3 4 72 4 4
## SFO SLC STL TPA TTN XNA
## 47 92 29 59 3 17
This summary tells me there are 38 total airports in the data set.
For FLIGHT_NUMBER, the highest was 6297, meaning there was 6297 different flight numbers used in the dataset.
For DESTINATION_AIRPORT, I wanted to something similar to ORIGIN_AIRPORT to find the total number of airports used as destinations.
summary(CVG.Flights$DESTINATION_AIRPORT)
## ATL BDL BNA BOS CLT CVG DEN DFW DTW EWR FLL GRR IAD IAH JFK LAS
## 413 3 99 59 31 3877 118 460 31 294 102 80 4 242 68 80
## LAX LGA MCI MCO MEM MIA MKE MSN MSP ORD PHX PIT RDU RSW SAN SEA
## 88 52 1 197 16 133 57 33 81 762 43 3 5 73 3 4
## SFO SLC STL TPA TTN XNA
## 47 92 30 61 3 18
Again, there is a total of 38 different airport destinations, mirroring the origin airport.
For SCHEDULED_DEPARTURE, we can see that 4:15 PM is the most common departure time for aircraft at CVG.
For DEPARTURE_TIME, the most common time listed is 9:28 am. This means that planes scheduled for 4:30 rarely make their departure on time.
DEPARTURE_DELAY tells us that the average delay across all the airports is 12.35 minutes.
TAXI_OUT tells us that the average time is takes to taxi across all airports is 18.52 minutes.
WHEELS_OFF, unlike DEPARTURE_TIME is the exact moment the plane leaves the ground, THis means the most common time a plane takes off in these airports is 9:41 am.
FLIGHT_TIME for these aircraft have an average 127 mins of time in the air.
SCHEDULED_TIME for these aircraft go no lower than 60 minutes. This means that these airports do not have anyflights which would take less than one hour in the air.
AIR_TIME is the time a plane is actually in the air, not including taxi time and other things that make it wait. The average time a plane is in the air is 100.4 minutes.
The average DISTANCE planes fly from these airports is 589 miles.
1:13 am is the most often time planes touchdown after a flight.
On average, it takes a little under 9 minutes for planes to taxi in after touchdown.
The most common scheduled arrival time is 7:24, which mirrors the most common takeoff time of 4:30 pm.
The most common arrival time is 7:23, which makes the SCHEDULED_ARRIVAL time a very accurate predictor.
On average, aircraft arrive two minutes ahead of schedule.
for DIVERTED, I want to take a summary.
summary(CVG.Flights$DIVERTED)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000000 0.001803 0.000000 1.000000
This tells us that DIVERTED is a dummy variable, with 0 meaning no, and 1 meaning yes if a plane was diverted.
This is also true with CANCELLED, with 1 being yes and 0 being no.
CANCELLATION_REASON has three different levels: A, B and C. These are code for the three types of cancellation reasons the airports use.
AIR_SYSTEM_DELAY on average causes 16.25 minutes of delay.
SECURITY_DELAY does not happen often, but when it does, it causes at most 9 minutes of delay.
AIRLINE_DELAY causes on average 17.06 minutes of delay.
LATE_AIRCRAFT_DELAY causes 23.24 minutes of delay on average.
Finally, weather delays cause an average of 3.87 minutes of delay, and at most cause a delay of 638 minutes.
I want to look at the variable between DISTANCE and AIR_TIME. I anticipate a very strong correlation between them, as longer distance flights take more time. To visualize his data, I will use a scatter plot.
ggplot(CVG.Flights, aes(AIR_TIME, DISTANCE)) +
geom_point(alpha = 0.5) +
ggtitle("Distance v. Air Time") +
xlab("Air Time") +
ylab("Flight Distance")
## Warning: Removed 392 rows containing missing values (geom_point).
We can see there is definitely a positive linear relationship between AIR_TIME and DISTANCE. However, it can be seen that for each distance, there is a large number of different Air times. This could be caused by jetstreams from going east to west, and other factors such as landing delays.
For this next relationship, I wanted to see if longer scheduled flights have a longer delay time to leave the airport. This would be interesting as longer flights may have more passengers and prep that go into each flight. Within this comparison, I will also be comparing these flights by each airline, to see if we can see more in this relationship.
ggplot(CVG.Flights, aes(x = SCHEDULED_TIME, y = DEPARTURE_DELAY, col = AIRLINE)) +
geom_point() +
ggtitle("Scheduled Time and Delay by AIrline") +
xlab("Scheduled Flight Time") +
ylab("Departure Delay")
## Warning: Removed 370 rows containing missing values (geom_point).
We can see with this scatter plot that flights with a shorter flight time seem to have more instances of long delay. While these flights are more common, they still show large signs of high median delays compared to the longest of flight times. We can also see by airline that DL looks to have the lowest average delay time across the highest scheduled time of flights, while it is about the same from flights below 150 scheduled flight time.
For the airlines.cvs, I will first check the rows and columns.
ncol(airlines)
## [1] 2
nrow(airlines)
## [1] 14
For this output, we can see there are 2 variables and 14 observations in the airlines dataset.
Now to check the number of NAs in the dataset, I will do the sum function.
sum(is.na(airlines))
## [1] 0
This output tells me there is 0 NAs in the airlines dataset.
For airports.csv, I will again check the number of rows and columns to see the number of variables and observations.
ncol(airports)
## [1] 7
nrow(airports)
## [1] 322
This output tells me there are 7 variables and 322 observations.
To check the NA values, I will again sum function to check NA.
sum(is.na(airports))
## [1] 6
This output tells me there are 6 NA values in the airport dataset. To visualize this missing data, I will use the colsum function to find where the NA values are.
colSums(is.na(airports))
## IATA_CODE AIRPORT CITY STATE COUNTRY LATITUDE LONGITUDE
## 0 0 0 0 0 3 3
We can see that both LATITUDE and LONGITUDE each have 3 missing values. To visualize this even further, I will make a simple barplot to show the NA values.
barplot(colSums(is.na(airports)), las = 2, main = "Number of Missing Values", xlab = "Variables", ylab = "Number")
To merege these three datasets, I will be using the left_join function to create new dfs. I will call this final dataset with all three flightsmerged.
flightswairlines <- left_join(CVG.Flights, airlines, by = c("AIRLINE" = "IATA_CODE"))
flightsworigins <- left_join(flightswairlines, airports, by = c("ORIGIN_AIRPORT" = "IATA_CODE"))
flightsmerged <- left_join(flightsworigins, airports, by = c("DESTINATION_AIRPORT" = "IATA_CODE"))
This new flightsmerged data frame tells me I have successfully merged all three data sets by IATA_CODE. Here is a quick summary visualizing the merger.
summary(flightsmerged)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 3/2/2015 : 139 DL :2289 Min. : 62 CVG :3886
## 1/2/2015 : 136 MQ :2080 1st Qu.:1908 ORD : 727
## 3/5/2015 : 135 EV :1553 Median :3246 DFW : 458
## 2/13/2015: 134 OO :1326 Mean :3194 ATL : 412
## 2/23/2015: 134 F9 : 515 3rd Qu.:4536 EWR : 299
## 3/6/2015 : 134 AA : 0 Max. :6297 IAH : 247
## (Other) :6951 (Other): 0 (Other):1734
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## CVG :3877 4:15:00 PM: 210 9:28:00 AM: 58 Min. :-24.00
## ORD : 762 9:30:00 AM: 190 9:27:00 AM: 54 1st Qu.: -5.00
## DFW : 460 6:00:00 AM: 179 9:25:00 AM: 40 Median : -1.00
## ATL : 413 6:00:00 PM: 106 9:30:00 AM: 40 Mean : 12.35
## EWR : 294 4:00:00 PM: 104 9:29:00 AM: 35 3rd Qu.: 12.00
## IAH : 242 8:30:00 AM: 103 (Other) :7166 Max. :708.00
## (Other):1715 (Other) :6871 NA's : 370 NA's :370
## TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME
## Min. : 1.00 9:41:00 AM: 35 Min. : 60.0 Min. : 49.0
## 1st Qu.: 12.00 4:24:00 PM: 33 1st Qu.: 85.0 1st Qu.: 83.0
## Median : 15.00 9:43:00 AM: 32 Median :126.0 Median :123.0
## Mean : 18.52 9:40:00 AM: 28 Mean :128.1 Mean :127.3
## 3rd Qu.: 21.00 4:20:00 PM: 27 3rd Qu.:156.0 3rd Qu.:152.0
## Max. :122.00 (Other) :7230 Max. :313.0 Max. :340.0
## NA's :378 NA's : 378 NA's :392
## AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## Min. : 36.0 Min. : 229.0 1:13:00 AM: 98 Min. : 1.000
## 1st Qu.: 56.0 1st Qu.: 308.0 1:05:00 AM: 97 1st Qu.: 5.000
## Median : 99.0 Median : 589.0 1:12:00 AM: 97 Median : 6.000
## Mean :100.4 Mean : 675.9 1:19:00 AM: 97 Mean : 8.448
## 3rd Qu.:125.0 3rd Qu.: 871.0 1:25:00 AM: 97 3rd Qu.: 9.000
## Max. :319.0 Max. :2036.0 (Other) :6896 Max. :128.000
## NA's :392 NA's : 381 NA's :381
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED
## 7:24:00 PM: 139 7:23:00 PM : 23 Min. :-56.00 Min. :0.000000
## 7:25:00 PM: 120 10:25:00 AM: 20 1st Qu.:-12.00 1st Qu.:0.000000
## 3:30:00 PM: 87 6:06:00 PM : 20 Median : -2.00 Median :0.000000
## 8:39:00 PM: 79 10:53:00 AM: 19 Mean : 10.65 Mean :0.001803
## 9:30:00 PM: 78 10:58:00 AM: 19 3rd Qu.: 16.00 3rd Qu.:0.000000
## 7:05:00 PM: 75 (Other) :7281 Max. :716.00 Max. :1.000000
## (Other) :7185 NA's : 381 NA's :392
## CANCELLED CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY
## Min. :0.00000 A : 71 Min. : 0.00 Min. :0.00
## 1st Qu.:0.00000 B : 180 1st Qu.: 0.00 1st Qu.:0.00
## Median :0.00000 C : 127 Median : 10.00 Median :0.00
## Mean :0.04869 NA's:7385 Mean : 16.35 Mean :0.01
## 3rd Qu.:0.00000 3rd Qu.: 23.00 3rd Qu.:0.00
## Max. :1.00000 Max. :258.00 Max. :9.00
## NA's :5814 NA's :5814
## AIRLINE_DELAY LATE_AIRCRAFT_DELAY WEATHER_DELAY
## Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 2.00 Median : 0.000
## Mean : 17.06 Mean : 23.24 Mean : 3.873
## 3rd Qu.: 18.00 3rd Qu.: 30.00 3rd Qu.: 0.000
## Max. :435.00 Max. :435.00 Max. :638.000
## NA's :5814 NA's :5814 NA's :5814
## AIRLINE.y
## Delta Air Lines Inc. :2289
## American Eagle Airlines Inc.:2080
## Atlantic Southeast Airlines :1553
## Skywest Airlines Inc. :1326
## Frontier Airlines Inc. : 515
## Alaska Airlines Inc. : 0
## (Other) : 0
## AIRPORT.x
## Cincinnati/Northern Kentucky International Airport:3886
## Chicago O'Hare International Airport : 727
## Dallas/Fort Worth International Airport : 458
## Hartsfield-Jackson Atlanta International Airport : 412
## Newark Liberty International Airport : 299
## George Bush Intercontinental Airport : 247
## (Other) :1734
## CITY.x STATE.x COUNTRY.x LATITUDE.x
## Covington :3886 KY :3886 USA:7763 Min. :25.79
## Chicago : 727 IL : 727 1st Qu.:36.12
## Dallas-Fort Worth: 458 TX : 705 Median :39.05
## Atlanta : 412 FL : 561 Mean :37.64
## Newark : 299 GA : 412 3rd Qu.:39.05
## Houston : 247 NJ : 302 Max. :47.45
## (Other) :1734 (Other):1170
## LONGITUDE.x AIRPORT.y
## Min. :-122.37 Cincinnati/Northern Kentucky International Airport:3877
## 1st Qu.: -87.90 Chicago O'Hare International Airport : 762
## Median : -84.66 Dallas/Fort Worth International Airport : 460
## Mean : -87.06 Hartsfield-Jackson Atlanta International Airport : 413
## 3rd Qu.: -84.66 Newark Liberty International Airport : 294
## Max. : -71.01 George Bush Intercontinental Airport : 242
## (Other) :1715
## CITY.y STATE.y COUNTRY.y LATITUDE.y
## Covington :3877 KY :3877 USA:7763 Min. :25.79
## Chicago : 762 IL : 762 1st Qu.:36.08
## Dallas-Fort Worth: 460 TX : 702 Median :39.05
## Atlanta : 413 FL : 566 Mean :37.64
## Newark : 294 GA : 413 3rd Qu.:39.05
## Houston : 242 NJ : 297 Max. :47.45
## (Other) :1715 (Other):1146
## LONGITUDE.y
## Min. :-122.37
## 1st Qu.: -87.90
## Median : -84.66
## Mean : -87.02
## 3rd Qu.: -84.66
## Max. : -71.01
##