To begin, I will read the files into the Environment using read.csv()
CVG.Flights <- read.csv("CVG_Flights.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
airlines <- read.csv("airlines.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
airports <- read.csv("airports.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
In order to see the number of variables in the CVG.Flights dataset, I will use the ncol() function. This will return the number of columns in the dataset, with each column corresponding to a variable
ncol(CVG.Flights)
## [1] 27
This output shows that this dataset has 27 columns, meaning there are 27 variables.
To see the number of observations in the CVG.Flights dataset, I will use the nrow() function. This will return the number of rows in the dataset, with each row corresponding to an observation.
nrow(CVG.Flights)
## [1] 7763
This output shows that this dataset has 7763 rows, meaning there are 7763 records or observations.
Now I will be trying to see the number of missing values in the dataset. To do this, I will use the is.na() function within a sum() function to sum up the number of missing values.
sum(is.na(CVG.Flights))
## [1] 40270
The output shows that there are 40,270 missing values in the dataset. To get more information about which variables contain empty records, and the amount for each variable, I will use a colSums() function to extract the missing values for each column.
colSums(is.na(CVG.Flights))
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 0 0 0 0
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## 0 0 370 370
## TAXI_OUT WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME
## 378 378 0 392
## AIR_TIME DISTANCE WHEELS_ON TAXI_IN
## 392 0 381 381
## SCHEDULED_ARRIVAL ARRIVAL_TIME ARRIVAL_DELAY DIVERTED
## 0 381 392 0
## CANCELLED CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY
## 0 7385 5814 5814
## AIRLINE_DELAY LATE_AIRCRAFT_DELAY WEATHER_DELAY
## 5814 5814 5814
This output tells me that there are missing values for the following variables: * DEPARTURE_TIME * DEPARTURE_DELAY * TAXI_OUT * WHEELS_OFF * ELAPSED_TIME * AIR_TIME * WHEELS_ON * TAXI_IN * ARRIVAL_TIME * ARRIVAL_DELAY * CANCELLATION_REASON * AIR_SYSTEM_DELAY * SECURITY_DELAY * AIRLINE_DELAY * LATE_AIRCRAFT_DELAY * WEATHER_DELAY
To visualize this in an easier format, I will create a barplot displaying the count of missing values for each variable.
barplot(colSums(is.na(CVG.Flights)), las = 3, main = "Missing Values for Each Variable", ylab = "Number of Missing Values")
Now, I’m going to take a general look at some of the key variables that I’ll be exploring in this project. To start, I’m going to use a summary() function to get a sense of all the variables at once, and then I will be focusing down on three variables that will be important in this data.
summary(CVG.Flights)
## FLIGHT_DATE AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 3/2/2015 : 139 DL:2289 Min. : 62 CVG :3886 CVG :3877
## 1/2/2015 : 136 EV:1553 1st Qu.:1908 ORD : 727 ORD : 762
## 3/5/2015 : 135 F9: 515 Median :3246 DFW : 458 DFW : 460
## 2/13/2015: 134 MQ:2080 Mean :3194 ATL : 412 ATL : 413
## 2/23/2015: 134 OO:1326 3rd Qu.:4536 EWR : 299 EWR : 294
## 3/6/2015 : 134 Max. :6297 IAH : 247 IAH : 242
## (Other) :6951 (Other):1734 (Other):1715
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT
## 4:15:00 PM: 210 9:28:00 AM: 58 Min. :-24.00 Min. : 1.00
## 9:30:00 AM: 190 9:27:00 AM: 54 1st Qu.: -5.00 1st Qu.: 12.00
## 6:00:00 AM: 179 9:25:00 AM: 40 Median : -1.00 Median : 15.00
## 6:00:00 PM: 106 9:30:00 AM: 40 Mean : 12.35 Mean : 18.52
## 4:00:00 PM: 104 9:29:00 AM: 35 3rd Qu.: 12.00 3rd Qu.: 21.00
## 8:30:00 AM: 103 (Other) :7166 Max. :708.00 Max. :122.00
## (Other) :6871 NA's : 370 NA's :370 NA's :378
## WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME AIR_TIME
## 9:41:00 AM: 35 Min. : 60.0 Min. : 49.0 Min. : 36.0
## 4:24:00 PM: 33 1st Qu.: 85.0 1st Qu.: 83.0 1st Qu.: 56.0
## 9:43:00 AM: 32 Median :126.0 Median :123.0 Median : 99.0
## 9:40:00 AM: 28 Mean :128.1 Mean :127.3 Mean :100.4
## 4:20:00 PM: 27 3rd Qu.:156.0 3rd Qu.:152.0 3rd Qu.:125.0
## (Other) :7230 Max. :313.0 Max. :340.0 Max. :319.0
## NA's : 378 NA's :392 NA's :392
## DISTANCE WHEELS_ON TAXI_IN SCHEDULED_ARRIVAL
## Min. : 229.0 1:13:00 AM: 98 Min. : 1.000 7:24:00 PM: 139
## 1st Qu.: 308.0 1:05:00 AM: 97 1st Qu.: 5.000 7:25:00 PM: 120
## Median : 589.0 1:12:00 AM: 97 Median : 6.000 3:30:00 PM: 87
## Mean : 675.9 1:19:00 AM: 97 Mean : 8.448 8:39:00 PM: 79
## 3rd Qu.: 871.0 1:25:00 AM: 97 3rd Qu.: 9.000 9:30:00 PM: 78
## Max. :2036.0 (Other) :6896 Max. :128.000 7:05:00 PM: 75
## NA's : 381 NA's :381 (Other) :7185
## ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## 7:23:00 PM : 23 Min. :-56.00 Min. :0.000000 Min. :0.00000
## 10:25:00 AM: 20 1st Qu.:-12.00 1st Qu.:0.000000 1st Qu.:0.00000
## 6:06:00 PM : 20 Median : -2.00 Median :0.000000 Median :0.00000
## 10:53:00 AM: 19 Mean : 10.65 Mean :0.001803 Mean :0.04869
## 10:58:00 AM: 19 3rd Qu.: 16.00 3rd Qu.:0.000000 3rd Qu.:0.00000
## (Other) :7281 Max. :716.00 Max. :1.000000 Max. :1.00000
## NA's : 381 NA's :392
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## A : 71 Min. : 0.00 Min. :0.00 Min. : 0.00
## B : 180 1st Qu.: 0.00 1st Qu.:0.00 1st Qu.: 0.00
## C : 127 Median : 10.00 Median :0.00 Median : 0.00
## NA's:7385 Mean : 16.35 Mean :0.01 Mean : 17.06
## 3rd Qu.: 23.00 3rd Qu.:0.00 3rd Qu.: 18.00
## Max. :258.00 Max. :9.00 Max. :435.00
## NA's :5814 NA's :5814 NA's :5814
## LATE_AIRCRAFT_DELAY WEATHER_DELAY
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 2.00 Median : 0.000
## Mean : 23.24 Mean : 3.873
## 3rd Qu.: 30.00 3rd Qu.: 0.000
## Max. :435.00 Max. :638.000
## NA's :5814 NA's :5814
The first variable I’ll focus a little deeper on is the AIRLINE column, using a levels() function and then a summary() function. This will provide information about the number of observations for each airline, which can help later in comparing and contrasting different aspects of the airlines.
levels(CVG.Flights$AIRLINE)
## [1] "DL" "EV" "F9" "MQ" "OO"
summary(CVG.Flights$AIRLINE)
## DL EV F9 MQ OO
## 2289 1553 515 2080 1326
From these functions, it shows that there are 5 airlines in the observations of this dataset. Among these 5 airlines, “DL” and “MQ” are the most common, with both containing over 2,000 observations. They’re followed by “EV” and “OO” which both have over 1,000 observations, and “F9” was the least common airline in this dataset, with just over 500 observations.
The next variable I’m going to explore is DEPARTURE_DELAY. This will likely be an important variable to explore in this dataset, because airlines would find it very valuable to learn about delay times and anything that might cause them. To do this, I’m going to use a simple summary() function to see some summary statistics.
summary(CVG.Flights$DEPARTURE_DELAY)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -24.00 -5.00 -1.00 12.35 12.00 708.00 370
The summary statistics provide some valuable insights into the distribution of this variable. Notably, the minimum value is -24, indicating that there are some flights that took off early, with the earliest flight being 24 minutes before the scheduled time. Also, the maximum value is 708, indicating that there was at least one flight that took off hours late. This could be interesting to explore later, to see if extremely delayed flights have a relationship with any specific causes.
Another takeaway from these statistics is that the median is -1.0. This means that at least half of scheduled flights took off earlier than the scheduled time. The mean, however, is 12.35. This indicates that the distribution of delay times is positively skewed, meaning there are some very high outliers pulling the mean higher. In fact, the mean is higher than the 3rd quartile value, further confirming that there are some very large outliers.
The final variable I will be highlighting is ARRIVAL_DELAY, which is another variable that would be valuable for airlines to understand and gain insight on. I’ll similarly use a summary() function for this, in addition to displaying a boxplot, which provides similar information as the summary() function, but in a more visual format. For this boxplot, I’ll set the limits for the y axis from -60 to 400, to better visualize it, even though this will cut off some of the extreme outliers.
summary(CVG.Flights$ARRIVAL_DELAY)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -56.00 -12.00 -2.00 10.65 16.00 716.00 392
boxplot(CVG.Flights$ARRIVAL_DELAY, ylim = c(-60, 400), ylab = "Arrival Delay", main = "Distribution of Arrival Delays")
This combination of summary statistics and the boxplot provides some useful information on the distribution of the Arrival Delays. We can see that the minimum is -56, which is an outlier on the boxplot. There are also a lot of outliers above the boxplot, which explains why the mean is 10.65, which is quite a bit higher than the median of -2.00.
Now I’m going to visualize the relationship between two variables, AIR_TIME and DISTANCE. I expect that there will be a very strong positive linear relationship, because it would make sense that longer distances require longer time in the air. If there is a different relationship that is surprising, I may need to do some further exploring to discover possible causes. The first thing I need to do is load the “tidyverse” package to use ggplots for visualizations.
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Now, to visualize this basic relationship, I’ll use a scatter plot through ggplot. I’m also going to add an alpha of 0.5 to distinguish where points are overlapping.
ggplot(CVG.Flights, aes(AIR_TIME, DISTANCE)) +
geom_point(alpha = 0.5) +
ggtitle("Distance v. Air Time") +
xlab("Air Time") +
ylab("Flight Distance")
## Warning: Removed 392 rows containing missing values (geom_point).
This scatter plot shows a general positive relationship between air time and distance, but it’s not as clear cut as I initially expected. This may be worth exploring later to see if there’s a reason why there seems to be lines for the air time varying around specific distances.
I will now be exploring the relationship between three variables. I am going to create a scatter plot displaying the relationship between DEPARTURE_DELAY and ARRIVAL_DELAY, with AIRLINE as a third factor (shown through color). I expect the relationship between DEPARTURE_DELAY and ARRIVAL_DELAY will be a very strong positive linear relationship, because delayed flights should lead to delayed arrivals, but there could be other factors that lead this to not be true. Visualizing the AIRLINES on this scatter plot may help identify if certain airlines see delays more often than others.
ggplot(CVG.Flights, aes(x = DEPARTURE_DELAY, y = ARRIVAL_DELAY, col = AIRLINE)) +
geom_point() +
ggtitle("Departure and Arrival Delays by Airlines") +
xlab("Departure Delay") +
ylab("Arrival Delay")
## Warning: Removed 392 rows containing missing values (geom_point).
This scatter plot confirms that there is a very strong positive linear relationship between Departure Delays and Arrival Delays. There doesn’t seem to be too much difference for these two variables based on the airlines, although we can see that DL tends to be grouped more toward 0, with some fairly large outliers. F9 and MQ are generally a bit more evenly spread out, and both of these airlines have the highest outliers on this scatter plot. This might be worth exploring deeper later on.
For the airlines dataset, I will use the ncol() function to find out the number of columns and the nrow() function to find out the number of rows.
ncol(airlines)
## [1] 2
nrow(airlines)
## [1] 14
These functions show that the airlines dataset has 2 columns and 14 rows. This means that there are 2 variables in this dataframe and 14 observations.
Next, I’ll use the is.na() function within a sum() function to find out the number of missing values in the dataframe.
sum(is.na(airlines))
## [1] 0
Since the sum of “NA” values in the dataframe is 0, this means that there are no missing values in this dataset.
Next, I will be performing the same steps on the airports dataframe. To start, I’ll use the ncol() and nrow() functions on this dataframe to find the number of columns and rows.
ncol(airports)
## [1] 7
nrow(airports)
## [1] 322
This output shows that there are 7 columns and 322 rows in the airports dataset. This means that there are 7 variables and 322 observations in this dataframe.
Next, once again I’ll use the is.na() function within a sum() function to find out the number of missing values in the dataset.
sum(is.na(airports))
## [1] 6
This output tells me that there are a total of 6 missing values in the dataset. To find out where these empty records occur, I’ll use a colSums() function.
colSums(is.na(airports))
## IATA_CODE AIRPORT CITY STATE COUNTRY LATITUDE LONGITUDE
## 0 0 0 0 0 3 3
This function shows me that the LATITUDE and LONGITUDE variables each contain 3 “NA’s”, or missing values, which accounts for the 6 empty records in this dataset.
In order to visualize the missing values by each variable, I will create a barplot.
barplot(colSums(is.na(airports)), las = 2, main = "Number of Missing Values", xlab = "Variables", ylab = "Number")
This barplot confirms that LATITUDE and LONGITUDE are the only variables with missing values, each with 3 empty records.
In order to combine these three datasets by their common values “IATA_CODE”, I will be using left_join() functions in the tidyverse package to create new data frames. This will input more information for each flight observation concerning the Airline and Origin/Destination Airports.
flights_w_airlines <- left_join(CVG.Flights, airlines, by = c("AIRLINE" = "IATA_CODE"))
flights_w_origins <- left_join(flights_w_airlines, airports, by = c("ORIGIN_AIRPORT" = "IATA_CODE"))
flights_merged <- left_join(flights_w_origins, airports, by = c("DESTINATION_AIRPORT" = "IATA_CODE"))
The final data frame, called “flights_merged,” has a lot more information that could be helpful to explore questions about this dataset. It has 7763 observations, and 40 variables. # Question 8 Now, I’m going to be propising five questions about this dataset to answer through summary statistics and data visualization. ### Section 8.1 #### Question One: Which airlines typically do the best job of getting flights out on time? Which airlines do the worst job? To explore the differences in departure delays for each of the airlines, I will use a boxplot with Airlines on the X-axis.
ggplot(flights_merged, aes(AIRLINE.y, DEPARTURE_DELAY)) +
geom_boxplot() +
ggtitle("Departure Delay by Airline") +
xlab("Airline") +
ylab("Departure Delay")
## Warning: Removed 370 rows containing non-finite values (stat_boxplot).
This barplot shows that overall, the Departure Delays for the different airlines seems to be fairly similar. Each airline has a lot of high outliers, meaning there are a lot of flights that were delayed for a very long time that are pulling up the average delay times for all of the airlines. We can also see that the airlines Frontier Airlines and American Eagle have the highest outliers.
It’s a bit difficult to interpret the interquartile range in the above boxplot due to the large variance of the outliers, so I will create a new boxplot with limits set on the y axis to make it easier to draw conclusions. This will remove some of the extreme outliers, and allow for a better understanding of the more common trends of departure delay times for each airline. I’ll also add a point for the mean into the boxplot.
ggplot(flights_merged, aes(AIRLINE.y, DEPARTURE_DELAY)) +
geom_boxplot() +
scale_y_continuous(limits = c(-50,200), breaks = seq(-50,200,50)) +
stat_summary(fun.y = mean) +
ggtitle("Departure Delays by Airline") +
xlab("Airline") +
ylab("Departure Delay")
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: Removed 429 rows containing non-finite values (stat_boxplot).
## Warning: Removed 429 rows containing non-finite values (stat_summary).
## Warning: Removed 5 rows containing missing values (geom_segment).
With this boxplot, it’s much easier to see the interquartile ranges for each airline, which is more of what we’re concerned about. Extreme outliers, such as the ones shown in the first boxplot, are most likely caused by factors other than poor operation speed/service of the airline. Focusing more on the Interquartile range, it is very obvious that Delta has the smallest difference between the first and third quartiles. The Interquartile range of this airline is clearly the smallest of all the airlines, meaning that their flight departures typically vary the least from the schedule time. Delta also has the lowest mean departure delay, which indicates that on average, Delta does the best job of getting flights departed on time, although Atlantic Southeast and Skywest Airlines aren’t far behind.
American Eagle and Frontier Airlines have the largest Interquartile ranges, with both third quartiles around a 25-minute departure delay.This means that there’s a bit more variance between the scheduled time and the actual time of departure for these airlines. They also have the highest mean flight times of any other airlines, indicating that on average, their flights have higher departure delays, and there are some more extreme outliers that pull the mean higher than the median. Overall, however, the airlines all have comparable median values hovering around 0, meaning that at least 50% of flights for each airline will be more-or-less on time, if not early. #### Question Two: Does Distance affect Departure Delay Time? For my next question, I’m going to explore whether the flight distance has an affect on departure delay time. I want to see if there is a relationship, and whether longer flights tend to have longer departure delays as a result of an increased chance of inclement weather or other factors. To explore this relationship, I will use a scatter plot.
ggplot(flights_merged, aes(DISTANCE, DEPARTURE_DELAY)) +
geom_point() +
ggtitle("Departure Delay v. Distance") +
xlab("Distance") +
ylab("Departure Delay")
## Warning: Removed 370 rows containing missing values (geom_point).
This chart does not clearly show any relationship between the flight distance and the departure delay times. In order to get a bit more information and see if there are any patterns not discernible by the human eye, I will create the same chart with a smooth line to see if there is any general trends.I’ll also add an alpha level of 0.5 to better visualize any points that may be overlapping.
ggplot(flights_merged, aes(DISTANCE, DEPARTURE_DELAY)) +
geom_point(alpha = 0.5) +
geom_smooth() +
ggtitle("Departure Delay v. Distance") +
xlab("Distance") +
ylab("Departure Delay")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 370 rows containing non-finite values (stat_smooth).
## Warning: Removed 370 rows containing missing values (geom_point).
The smooth line confirms that there is not much of a relationship between distance and departure delay times, and in fact suggests that departure delay time is pretty consistent among all flight distances.
I will now look at whether states have any effect on delay times. To do this, first I am going to look at origin states, and find the average departure delay times for each state. I will use an aggregate() function to find the average departure delay for each state, and then create a column chart to visualize the average delays.
Origin_Delays <- aggregate(flights_merged$DEPARTURE_DELAY ~ flights_merged$STATE.x, FUN = mean)
Origin_Delays
## flights_merged$STATE.x flights_merged$DEPARTURE_DELAY
## 1 AR 3.187500
## 2 AZ 12.121951
## 3 CA 6.413043
## 4 CO 15.140845
## 5 CT -4.000000
## 6 FL 9.710432
## 7 GA 7.120393
## 8 IL 21.583832
## 9 KY 11.610960
## 10 MA 21.090909
## 11 MI 8.620370
## 12 MN 8.529412
## 13 MO -3.666667
## 14 NC 17.176471
## 15 NJ 8.065134
## 16 NV 26.038462
## 17 NY 26.587156
## 18 PA 33.500000
## 19 TN 13.477477
## 20 TX 13.327893
## 21 UT 7.717391
## 22 VA 9.833333
## 23 WA -2.500000
## 24 WI 6.367816
ggplot(Origin_Delays, aes(Origin_Delays$`flights_merged$STATE.x`, Origin_Delays$`flights_merged$DEPARTURE_DELAY`)) +
geom_col() +
ggtitle("Average Departure Delays for Origin States") +
xlab("Origin State") +
ylab("Average Departure Delay")
## Warning: Use of `Origin_Delays$`flights_merged$STATE.x`` is discouraged. Use
## `flights_merged$STATE.x` instead.
## Warning: Use of `Origin_Delays$`flights_merged$DEPARTURE_DELAY`` is discouraged.
## Use `flights_merged$DEPARTURE_DELAY` instead.
This column chart provides some useful information. It shows that of the 26 origin states, the majority of them have a positive average delay time, with only Connecticut, Missouri, and Washington having negative average delay times. Pennsylvania, New York, and Nevada have noticeably higher average departure delays than other states, with Illinois and Massachusetts not too far behind.
Next, I’m going to look at Destination states, and find the average Arrival Delay times for each state using an aggregate() function. I’m hoping to compare the origin states bar chart with a destination states bar chart and see if any states have similar delay times when either exiting or entering the state.
Destination_Delays <- aggregate(flights_merged$ARRIVAL_DELAY ~ flights_merged$STATE.y, FUN = mean)
Destination_Delays
## flights_merged$STATE.y flights_merged$ARRIVAL_DELAY
## 1 AR 16.2352941
## 2 AZ 20.1190476
## 3 CA -8.7226277
## 4 CO 20.1025641
## 5 CT -18.0000000
## 6 FL 2.9089286
## 7 GA -4.0294118
## 8 IL 20.3788099
## 9 KY 12.1998378
## 10 MA 22.8888889
## 11 MI 1.0654206
## 12 MN 12.5641026
## 13 MO -0.6129032
## 14 NC 13.4000000
## 15 NJ 6.4798387
## 16 NV 7.6708861
## 17 NY 19.0495050
## 18 PA 33.0000000
## 19 TN 6.7857143
## 20 TX 13.4856712
## 21 UT -6.3586957
## 22 VA -11.2500000
## 23 WA -13.5000000
## 24 WI 12.8000000
ggplot(Destination_Delays, aes(Destination_Delays$`flights_merged$STATE.y`, Destination_Delays$`flights_merged$ARRIVAL_DELAY`)) +
geom_col() +
ggtitle("Average Arrival Delays for Destination States") +
xlab("Destination State") +
ylab("Average Arrival Delay")
## Warning: Use of `Destination_Delays$`flights_merged$STATE.y`` is discouraged.
## Use `flights_merged$STATE.y` instead.
## Warning: Use of `Destination_Delays$`flights_merged$ARRIVAL_DELAY`` is
## discouraged. Use `flights_merged$ARRIVAL_DELAY` instead.
From this chart, there are a few noticeable takeaways when comparing to the Origin State bar chart. First, Connecticut, Missouri, and Washington all have negative Arrival Delay times, just as they did for the Departure Delay times. This suggests that for some reason, these states tend to be more ahead of schedule when flying, whether entering or exiting the states.
Also, Pennsylvania has the highest arrival delay time, just as it did for the departure delay time. This suggests that flights in and out of Pennsylvania have higher delay times. All this information could be very useful to both consumers and airlines. Consumers could find this delay time information, and potentially plan for which state they want to fly in/out of for a trip. Airlines could use this information to dive deeper into the states that have the highest and lowest delay times, and study whether these states are doing anything wrong or right in how they are operating.
In this dataset, there are five possible reasons listed for why flights have an Arrival Delay. These reasons are recorded in the AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, and WEATHER_DELAY columns. For any flight that has an Arrival Delay of 15 minutes or more, the amount of time attributed to each of the reasons for delay is recorded in the corresponding column. In other words, if all five of these columns are summed for any flight 15 minutes late or more, they add up to equal the ARRIVAL_DELAY time.
Three of these reasons for delay are non-controllable by the airlines: AIR_SYSTEM_DELAY, SECURITY_DELAY, and WEATHER_DELAY. The other two reasons for delay, AIRLINE_DELAY and LATE_AIRCRAFT_DELAY, are due to poor operational speeds by the airline, and are therefore indicative of poor service. I will be comparing the airlines based on the reasons for having an arrival delay, and whether their delay was a result of poor service or non-controllable circumstances.
To do this, first I will combine the three non-controllable variables into one new variable by summing them, and combine the two poor-service variables into a separate new variable by summing them.
non_controllable <- flights_merged$AIR_SYSTEM_DELAY + flights_merged$SECURITY_DELAY + flights_merged$WEATHER_DELAY
poor_service <- flights_merged$AIRLINE_DELAY +flights_merged$LATE_AIRCRAFT_DELAY
flights_merged <- cbind(flights_merged, non_controllable, poor_service)
After creating these new variables, I will now create a third variable, which will be the percent of the Arrival Delay attributed to Poor Service.
percent_poor_service <- (poor_service / (non_controllable + poor_service))*100
flights_merged <- cbind(flights_merged, percent_poor_service)
Now, I will use an aggregate() function to find the mean percentage of Arrival Delay times attributed to poor service for each airline, and then compare in a barplot.
avg_poor_service <- aggregate(flights_merged$percent_poor_service ~ flights_merged$AIRLINE.y, FUN = mean)
avg_poor_service
## flights_merged$AIRLINE.y flights_merged$percent_poor_service
## 1 American Eagle Airlines Inc. 58.34409
## 2 Atlantic Southeast Airlines 53.07303
## 3 Delta Air Lines Inc. 62.36648
## 4 Frontier Airlines Inc. 56.38028
## 5 Skywest Airlines Inc. 65.68554
ggplot(avg_poor_service, aes(avg_poor_service$`flights_merged$AIRLINE.y`, avg_poor_service$`flights_merged$percent_poor_service`)) +
geom_col() +
ggtitle("Percent of Weather Delays per Airline") +
xlab("Airline") +
ylab("Average Percent")
## Warning: Use of `avg_poor_service$`flights_merged$AIRLINE.y`` is discouraged.
## Use `flights_merged$AIRLINE.y` instead.
## Warning: Use of `avg_poor_service$`flights_merged$percent_poor_service`` is
## discouraged. Use `flights_merged$percent_poor_service` instead.
This chart and barplot shows that for all flights with arrival delays of 15 minutes or greater, Skywest Airlines has the greatest percent of delays attributed to poor service factors, at roughly 65%, followed by Delta, American Eagle, and Frontier. Atlantic Southeast Airlines has the lowest percent of arrival delays attributed to poor service, with about 53%. This was a bit surprising, as Delta had the smallest interquartile range for departure delays.
My final question is in regards to which states had the most problems with weather interfering with flights. To explore this, I will create a new dataframe called “weather,” and use filter() to only include records where the weather delay is greater than 0. I will then create another new dataframe and use aggregate() to count the number of flights with weather delays for each origin state.
weather <- filter(flights_merged, WEATHER_DELAY > 0)
weather_by_origin <- aggregate(weather$WEATHER_DELAY ~ weather$STATE.x, FUN = length)
weather_by_origin
## weather$STATE.x weather$WEATHER_DELAY
## 1 CO 2
## 2 FL 1
## 3 GA 6
## 4 IL 88
## 5 KY 70
## 6 MA 5
## 7 MN 1
## 8 NJ 3
## 9 NV 1
## 10 NY 4
## 11 TN 1
## 12 TX 29
## 13 WI 4
This table shows some basic information about the count of weather delays for each origin state. Of note, Illinois and Kentucky have by far the highest counts of weather delays. Now, I will visualize this data with a bar chart.
ggplot(weather_by_origin, aes(weather_by_origin$`weather$STATE.x`, weather_by_origin$`weather$WEATHER_DELAY`)) +
geom_col() +
ggtitle("Number of Weather Delays per Origin State") +
xlab("Origin State") +
ylab("Number of Weather Delays")
## Warning: Use of `weather_by_origin$`weather$STATE.x`` is discouraged. Use
## `weather$STATE.x` instead.
## Warning: Use of `weather_by_origin$`weather$WEATHER_DELAY`` is discouraged. Use
## `weather$WEATHER_DELAY` instead.
This shows that most states don’t have a whole lot of weather delays on flights leaving them, but Illinois, Kentucky, and Texas have quite a lot comparatively and account for most of the weather delays.
Next, I will perform the same process for the Destination states to try and compare with the Origin states. First, I’ll create a new data frame and use aggregate() to count the number of weather delays for each destination state.
weather_by_dest <- aggregate(weather$WEATHER_DELAY ~ weather$STATE.y, FUN = length)
weather_by_dest
## weather$STATE.y weather$WEATHER_DELAY
## 1 AZ 1
## 2 CA 1
## 3 FL 9
## 4 GA 4
## 5 IL 18
## 6 KY 145
## 7 MI 1
## 8 MN 1
## 9 NJ 3
## 10 NY 3
## 11 TX 26
## 12 UT 2
## 13 WI 1
This table shows that flights entering Kentucky have by far the highest number of weather delays, followed by Texas. Now, I’ll similarly create a bar chart to visualize this table.
ggplot(weather_by_dest, aes(weather_by_dest$`weather$STATE.y`, weather_by_dest$`weather$WEATHER_DELAY`)) +
geom_col() +
ggtitle("Number of Weather Delays per Destination State") +
xlab("Destination State") +
ylab("Number of Weather Delays")
## Warning: Use of `weather_by_dest$`weather$STATE.y`` is discouraged. Use
## `weather$STATE.y` instead.
## Warning: Use of `weather_by_dest$`weather$WEATHER_DELAY`` is discouraged. Use
## `weather$WEATHER_DELAY` instead.
This bar chart shows a similar pattern as the origin states bar chart, with the majority of flights that are delayed due to weather coming from just a few states. An important takeaway in comparing these charts is that Kentucky, Texas, and Illinois have the highest number of delays due to weather compared to all other states, both for flights entering and leaving the states.