To begin, I will read the files into the Environment using read.csv()

CVG.Flights <- read.csv("CVG_Flights.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
airlines <- read.csv("airlines.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")
airports <- read.csv("airports.csv", header = TRUE, stringsAsFactors = TRUE, na.strings = "")

Question 1

Section 1.1

In order to see the number of variables in the CVG.Flights dataset, I will use the ncol() function. This will return the number of columns in the dataset, with each column corresponding to a variable

ncol(CVG.Flights)

## [1] 27

This output shows that this dataset has 27 columns, meaning there are 27 variables.

Section 1.2

To see the number of observations in the CVG.Flights dataset, I will use the nrow() function. This will return the number of rows in the dataset, with each row corresponding to an observation.

nrow(CVG.Flights)

## [1] 7763

This output shows that this dataset has 7763 rows, meaning there are 7763 records or observations.

Question 2

Section 2.1

Now I will be trying to see the number of missing values in the dataset. To do this, I will use the is.na() function within a sum() function to sum up the number of missing values.

sum(is.na(CVG.Flights))

## [1] 40270

The output shows that there are 40,270 missing values in the dataset. To get more information about which variables contain empty records, and the amount for each variable, I will use a colSums() function to extract the missing values for each column.

colSums(is.na(CVG.Flights))

##         FLIGHT_DATE             AIRLINE       FLIGHT_NUMBER      ORIGIN_AIRPORT 
##                   0                   0                   0                   0 
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE      DEPARTURE_TIME     DEPARTURE_DELAY 
##                   0                   0                 370                 370 
##            TAXI_OUT          WHEELS_OFF      SCHEDULED_TIME        ELAPSED_TIME 
##                 378                 378                   0                 392 
##            AIR_TIME            DISTANCE           WHEELS_ON             TAXI_IN 
##                 392                   0                 381                 381 
##   SCHEDULED_ARRIVAL        ARRIVAL_TIME       ARRIVAL_DELAY            DIVERTED 
##                   0                 381                 392                   0 
##           CANCELLED CANCELLATION_REASON    AIR_SYSTEM_DELAY      SECURITY_DELAY 
##                   0                7385                5814                5814 
##       AIRLINE_DELAY LATE_AIRCRAFT_DELAY       WEATHER_DELAY 
##                5814                5814                5814

This output tells me that there are missing values for the following variables: * DEPARTURE_TIME * DEPARTURE_DELAY * TAXI_OUT * WHEELS_OFF * ELAPSED_TIME * AIR_TIME * WHEELS_ON * TAXI_IN * ARRIVAL_TIME * ARRIVAL_DELAY * CANCELLATION_REASON * AIR_SYSTEM_DELAY * SECURITY_DELAY * AIRLINE_DELAY * LATE_AIRCRAFT_DELAY * WEATHER_DELAY

To visualize this in an easier format, I will create a barplot displaying the count of missing values for each variable.

barplot(colSums(is.na(CVG.Flights)), las = 3, main = "Missing Values for Each Variable", ylab = "Number of Missing Values")

Question 3

Now, I’m going to take a general look at some of the key variables that I’ll be exploring in this project. To start, I’m going to use a summary() function to get a sense of all the variables at once, and then I will be focusing down on three variables that will be important in this data.

summary(CVG.Flights)

##     FLIGHT_DATE   AIRLINE   FLIGHT_NUMBER  ORIGIN_AIRPORT DESTINATION_AIRPORT
##  3/2/2015 : 139   DL:2289   Min.   :  62   CVG    :3886   CVG    :3877       
##  1/2/2015 : 136   EV:1553   1st Qu.:1908   ORD    : 727   ORD    : 762       
##  3/5/2015 : 135   F9: 515   Median :3246   DFW    : 458   DFW    : 460       
##  2/13/2015: 134   MQ:2080   Mean   :3194   ATL    : 412   ATL    : 413       
##  2/23/2015: 134   OO:1326   3rd Qu.:4536   EWR    : 299   EWR    : 294       
##  3/6/2015 : 134             Max.   :6297   IAH    : 247   IAH    : 242       
##  (Other)  :6951                            (Other):1734   (Other):1715       
##  SCHEDULED_DEPARTURE    DEPARTURE_TIME DEPARTURE_DELAY     TAXI_OUT     
##  4:15:00 PM: 210     9:28:00 AM:  58   Min.   :-24.00   Min.   :  1.00  
##  9:30:00 AM: 190     9:27:00 AM:  54   1st Qu.: -5.00   1st Qu.: 12.00  
##  6:00:00 AM: 179     9:25:00 AM:  40   Median : -1.00   Median : 15.00  
##  6:00:00 PM: 106     9:30:00 AM:  40   Mean   : 12.35   Mean   : 18.52  
##  4:00:00 PM: 104     9:29:00 AM:  35   3rd Qu.: 12.00   3rd Qu.: 21.00  
##  8:30:00 AM: 103     (Other)   :7166   Max.   :708.00   Max.   :122.00  
##  (Other)   :6871     NA's      : 370   NA's   :370      NA's   :378     
##       WHEELS_OFF   SCHEDULED_TIME   ELAPSED_TIME      AIR_TIME    
##  9:41:00 AM:  35   Min.   : 60.0   Min.   : 49.0   Min.   : 36.0  
##  4:24:00 PM:  33   1st Qu.: 85.0   1st Qu.: 83.0   1st Qu.: 56.0  
##  9:43:00 AM:  32   Median :126.0   Median :123.0   Median : 99.0  
##  9:40:00 AM:  28   Mean   :128.1   Mean   :127.3   Mean   :100.4  
##  4:20:00 PM:  27   3rd Qu.:156.0   3rd Qu.:152.0   3rd Qu.:125.0  
##  (Other)   :7230   Max.   :313.0   Max.   :340.0   Max.   :319.0  
##  NA's      : 378                   NA's   :392     NA's   :392    
##     DISTANCE           WHEELS_ON       TAXI_IN         SCHEDULED_ARRIVAL
##  Min.   : 229.0   1:13:00 AM:  98   Min.   :  1.000   7:24:00 PM: 139   
##  1st Qu.: 308.0   1:05:00 AM:  97   1st Qu.:  5.000   7:25:00 PM: 120   
##  Median : 589.0   1:12:00 AM:  97   Median :  6.000   3:30:00 PM:  87   
##  Mean   : 675.9   1:19:00 AM:  97   Mean   :  8.448   8:39:00 PM:  79   
##  3rd Qu.: 871.0   1:25:00 AM:  97   3rd Qu.:  9.000   9:30:00 PM:  78   
##  Max.   :2036.0   (Other)   :6896   Max.   :128.000   7:05:00 PM:  75   
##                   NA's      : 381   NA's   :381       (Other)   :7185   
##       ARRIVAL_TIME  ARRIVAL_DELAY       DIVERTED          CANCELLED      
##  7:23:00 PM :  23   Min.   :-56.00   Min.   :0.000000   Min.   :0.00000  
##  10:25:00 AM:  20   1st Qu.:-12.00   1st Qu.:0.000000   1st Qu.:0.00000  
##  6:06:00 PM :  20   Median : -2.00   Median :0.000000   Median :0.00000  
##  10:53:00 AM:  19   Mean   : 10.65   Mean   :0.001803   Mean   :0.04869  
##  10:58:00 AM:  19   3rd Qu.: 16.00   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  (Other)    :7281   Max.   :716.00   Max.   :1.000000   Max.   :1.00000  
##  NA's       : 381   NA's   :392                                          
##  CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY   
##  A   :  71           Min.   :  0.00   Min.   :0.00   Min.   :  0.00  
##  B   : 180           1st Qu.:  0.00   1st Qu.:0.00   1st Qu.:  0.00  
##  C   : 127           Median : 10.00   Median :0.00   Median :  0.00  
##  NA's:7385           Mean   : 16.35   Mean   :0.01   Mean   : 17.06  
##                      3rd Qu.: 23.00   3rd Qu.:0.00   3rd Qu.: 18.00  
##                      Max.   :258.00   Max.   :9.00   Max.   :435.00  
##                      NA's   :5814     NA's   :5814   NA's   :5814    
##  LATE_AIRCRAFT_DELAY WEATHER_DELAY    
##  Min.   :  0.00      Min.   :  0.000  
##  1st Qu.:  0.00      1st Qu.:  0.000  
##  Median :  2.00      Median :  0.000  
##  Mean   : 23.24      Mean   :  3.873  
##  3rd Qu.: 30.00      3rd Qu.:  0.000  
##  Max.   :435.00      Max.   :638.000  
##  NA's   :5814        NA's   :5814

Section 3.1

The first variable I’ll focus a little deeper on is the AIRLINE column, using a levels() function and then a summary() function. This will provide information about the number of observations for each airline, which can help later in comparing and contrasting different aspects of the airlines.

levels(CVG.Flights$AIRLINE)

## [1] "DL" "EV" "F9" "MQ" "OO"

summary(CVG.Flights$AIRLINE)

##   DL   EV   F9   MQ   OO 
## 2289 1553  515 2080 1326

From these functions, it shows that there are 5 airlines in the observations of this dataset. Among these 5 airlines, “DL” and “MQ” are the most common, with both containing over 2,000 observations. They’re followed by “EV” and “OO” which both have over 1,000 observations, and “F9” was the least common airline in this dataset, with just over 500 observations.

Section 3.2

The next variable I’m going to explore is DEPARTURE_DELAY. This will likely be an important variable to explore in this dataset, because airlines would find it very valuable to learn about delay times and anything that might cause them. To do this, I’m going to use a simple summary() function to see some summary statistics.

summary(CVG.Flights$DEPARTURE_DELAY)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -24.00   -5.00   -1.00   12.35   12.00  708.00     370

The summary statistics provide some valuable insights into the distribution of this variable. Notably, the minimum value is -24, indicating that there are some flights that took off early, with the earliest flight being 24 minutes before the scheduled time. Also, the maximum value is 708, indicating that there was at least one flight that took off hours late. This could be interesting to explore later, to see if extremely delayed flights have a relationship with any specific causes.

Another takeaway from these statistics is that the median is -1.0. This means that at least half of scheduled flights took off earlier than the scheduled time. The mean, however, is 12.35. This indicates that the distribution of delay times is positively skewed, meaning there are some very high outliers pulling the mean higher. In fact, the mean is higher than the 3rd quartile value, further confirming that there are some very large outliers.

Section 3.3

The final variable I will be highlighting is ARRIVAL_DELAY, which is another variable that would be valuable for airlines to understand and gain insight on. I’ll similarly use a summary() function for this, in addition to displaying a boxplot, which provides similar information as the summary() function, but in a more visual format. For this boxplot, I’ll set the limits for the y axis from -60 to 400, to better visualize it, even though this will cut off some of the extreme outliers.

summary(CVG.Flights$ARRIVAL_DELAY)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -56.00  -12.00   -2.00   10.65   16.00  716.00     392

boxplot(CVG.Flights$ARRIVAL_DELAY, ylim = c(-60, 400), ylab = "Arrival Delay", main = "Distribution of Arrival Delays")

This combination of summary statistics and the boxplot provides some useful information on the distribution of the Arrival Delays. We can see that the minimum is -56, which is an outlier on the boxplot. There are also a lot of outliers above the boxplot, which explains why the mean is 10.65, which is quite a bit higher than the median of -2.00.

Question 4

Section 4.1

Now I’m going to visualize the relationship between two variables, AIR_TIME and DISTANCE. I expect that there will be a very strong positive linear relationship, because it would make sense that longer distances require longer time in the air. If there is a different relationship that is surprising, I may need to do some further exploring to discover possible causes. The first thing I need to do is load the “tidyverse” package to use ggplots for visualizations.

library(tidyverse)

## -- Attaching packages ----------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts -------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Now, to visualize this basic relationship, I’ll use a scatter plot through ggplot. I’m also going to add an alpha of 0.5 to distinguish where points are overlapping.

ggplot(CVG.Flights, aes(AIR_TIME, DISTANCE)) +
  geom_point(alpha = 0.5) +
  ggtitle("Distance v. Air Time") +
  xlab("Air Time") +
  ylab("Flight Distance")

## Warning: Removed 392 rows containing missing values (geom_point).

This scatter plot shows a general positive relationship between air time and distance, but it’s not as clear cut as I initially expected. This may be worth exploring later to see if there’s a reason why there seems to be lines for the air time varying around specific distances.

Question 5

Section 5.1

I will now be exploring the relationship between three variables. I am going to create a scatter plot displaying the relationship between DEPARTURE_DELAY and ARRIVAL_DELAY, with AIRLINE as a third factor (shown through color). I expect the relationship between DEPARTURE_DELAY and ARRIVAL_DELAY will be a very strong positive linear relationship, because delayed flights should lead to delayed arrivals, but there could be other factors that lead this to not be true. Visualizing the AIRLINES on this scatter plot may help identify if certain airlines see delays more often than others.

ggplot(CVG.Flights, aes(x = DEPARTURE_DELAY, y = ARRIVAL_DELAY, col = AIRLINE)) +
  geom_point() +
  ggtitle("Departure and Arrival Delays by Airlines") +
  xlab("Departure Delay") +
  ylab("Arrival Delay")

## Warning: Removed 392 rows containing missing values (geom_point).

This scatter plot confirms that there is a very strong positive linear relationship between Departure Delays and Arrival Delays. There doesn’t seem to be too much difference for these two variables based on the airlines, although we can see that DL tends to be grouped more toward 0, with some fairly large outliers. F9 and MQ are generally a bit more evenly spread out, and both of these airlines have the highest outliers on this scatter plot. This might be worth exploring deeper later on.

Question 6

Section 6.1

For the airlines dataset, I will use the ncol() function to find out the number of columns and the nrow() function to find out the number of rows.

ncol(airlines)

## [1] 2

nrow(airlines)

## [1] 14

These functions show that the airlines dataset has 2 columns and 14 rows. This means that there are 2 variables in this dataframe and 14 observations.

Next, I’ll use the is.na() function within a sum() function to find out the number of missing values in the dataframe.

sum(is.na(airlines))

## [1] 0

Since the sum of “NA” values in the dataframe is 0, this means that there are no missing values in this dataset.

Section 6.2

Next, I will be performing the same steps on the airports dataframe. To start, I’ll use the ncol() and nrow() functions on this dataframe to find the number of columns and rows.

ncol(airports)

## [1] 7

nrow(airports)

## [1] 322

This output shows that there are 7 columns and 322 rows in the airports dataset. This means that there are 7 variables and 322 observations in this dataframe.

Next, once again I’ll use the is.na() function within a sum() function to find out the number of missing values in the dataset.

sum(is.na(airports))

## [1] 6

This output tells me that there are a total of 6 missing values in the dataset. To find out where these empty records occur, I’ll use a colSums() function.

colSums(is.na(airports))

## IATA_CODE   AIRPORT      CITY     STATE   COUNTRY  LATITUDE LONGITUDE 
##         0         0         0         0         0         3         3

This function shows me that the LATITUDE and LONGITUDE variables each contain 3 “NA’s”, or missing values, which accounts for the 6 empty records in this dataset.

In order to visualize the missing values by each variable, I will create a barplot.

barplot(colSums(is.na(airports)), las = 2, main = "Number of Missing Values", xlab = "Variables", ylab = "Number")

This barplot confirms that LATITUDE and LONGITUDE are the only variables with missing values, each with 3 empty records.

Question 7

Section 7.1

In order to combine these three datasets by their common values “IATA_CODE”, I will be using left_join() functions in the tidyverse package to create new data frames. This will input more information for each flight observation concerning the Airline and Origin/Destination Airports.

flights_w_airlines <- left_join(CVG.Flights, airlines, by = c("AIRLINE" = "IATA_CODE"))
flights_w_origins <- left_join(flights_w_airlines, airports, by = c("ORIGIN_AIRPORT" = "IATA_CODE"))
flights_merged <- left_join(flights_w_origins, airports, by = c("DESTINATION_AIRPORT" = "IATA_CODE"))

The final data frame, called “flights_merged,” has a lot more information that could be helpful to explore questions about this dataset. It has 7763 observations, and 40 variables. # Question 8 Now, I’m going to be propising five questions about this dataset to answer through summary statistics and data visualization. ### Section 8.1 #### Question One: Which airlines typically do the best job of getting flights out on time? Which airlines do the worst job? To explore the differences in departure delays for each of the airlines, I will use a boxplot with Airlines on the X-axis.

ggplot(flights_merged, aes(AIRLINE.y, DEPARTURE_DELAY)) +
  geom_boxplot() +
ggtitle("Departure Delay by Airline") +
  xlab("Airline") +
  ylab("Departure Delay")

## Warning: Removed 370 rows containing non-finite values (stat_boxplot).

This barplot shows that overall, the Departure Delays for the different airlines seems to be fairly similar. Each airline has a lot of high outliers, meaning there are a lot of flights that were delayed for a very long time that are pulling up the average delay times for all of the airlines. We can also see that the airlines Frontier Airlines and American Eagle have the highest outliers.

It’s a bit difficult to interpret the interquartile range in the above boxplot due to the large variance of the outliers, so I will create a new boxplot with limits set on the y axis to make it easier to draw conclusions. This will remove some of the extreme outliers, and allow for a better understanding of the more common trends of departure delay times for each airline. I’ll also add a point for the mean into the boxplot.

ggplot(flights_merged, aes(AIRLINE.y, DEPARTURE_DELAY)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(-50,200), breaks = seq(-50,200,50)) +
  stat_summary(fun.y = mean) +
ggtitle("Departure Delays by Airline") +
  xlab("Airline") +
  ylab("Departure Delay")

## Warning: `fun.y` is deprecated. Use `fun` instead.

## Warning: Removed 429 rows containing non-finite values (stat_boxplot).

## Warning: Removed 429 rows containing non-finite values (stat_summary).

## Warning: Removed 5 rows containing missing values (geom_segment).

With this boxplot, it’s much easier to see the interquartile ranges for each airline, which is more of what we’re concerned about. Extreme outliers, such as the ones shown in the first boxplot, are most likely caused by factors other than poor operation speed/service of the airline. Focusing more on the Interquartile range, it is very obvious that Delta has the smallest difference between the first and third quartiles. The Interquartile range of this airline is clearly the smallest of all the airlines, meaning that their flight departures typically vary the least from the schedule time. Delta also has the lowest mean departure delay, which indicates that on average, Delta does the best job of getting flights departed on time, although Atlantic Southeast and Skywest Airlines aren’t far behind.

American Eagle and Frontier Airlines have the largest Interquartile ranges, with both third quartiles around a 25-minute departure delay.This means that there’s a bit more variance between the scheduled time and the actual time of departure for these airlines. They also have the highest mean flight times of any other airlines, indicating that on average, their flights have higher departure delays, and there are some more extreme outliers that pull the mean higher than the median. Overall, however, the airlines all have comparable median values hovering around 0, meaning that at least 50% of flights for each airline will be more-or-less on time, if not early. #### Question Two: Does Distance affect Departure Delay Time? For my next question, I’m going to explore whether the flight distance has an affect on departure delay time. I want to see if there is a relationship, and whether longer flights tend to have longer departure delays as a result of an increased chance of inclement weather or other factors. To explore this relationship, I will use a scatter plot.

ggplot(flights_merged, aes(DISTANCE, DEPARTURE_DELAY)) +
  geom_point() +
ggtitle("Departure Delay v. Distance") +
  xlab("Distance") +
  ylab("Departure Delay")

## Warning: Removed 370 rows containing missing values (geom_point).

This chart does not clearly show any relationship between the flight distance and the departure delay times. In order to get a bit more information and see if there are any patterns not discernible by the human eye, I will create the same chart with a smooth line to see if there is any general trends.I’ll also add an alpha level of 0.5 to better visualize any points that may be overlapping.

ggplot(flights_merged, aes(DISTANCE, DEPARTURE_DELAY)) +
  geom_point(alpha = 0.5) +
  geom_smooth() +
ggtitle("Departure Delay v. Distance") +
  xlab("Distance") +
  ylab("Departure Delay")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 370 rows containing non-finite values (stat_smooth).

## Warning: Removed 370 rows containing missing values (geom_point).

The smooth line confirms that there is not much of a relationship between distance and departure delay times, and in fact suggests that departure delay time is pretty consistent among all flight distances.

Question Three: Do states have any effect on delay times?

I will now look at whether states have any effect on delay times. To do this, first I am going to look at origin states, and find the average departure delay times for each state. I will use an aggregate() function to find the average departure delay for each state, and then create a column chart to visualize the average delays.

Origin_Delays <- aggregate(flights_merged$DEPARTURE_DELAY ~ flights_merged$STATE.x, FUN = mean)
Origin_Delays

##    flights_merged$STATE.x flights_merged$DEPARTURE_DELAY
## 1                      AR                       3.187500
## 2                      AZ                      12.121951
## 3                      CA                       6.413043
## 4                      CO                      15.140845
## 5                      CT                      -4.000000
## 6                      FL                       9.710432
## 7                      GA                       7.120393
## 8                      IL                      21.583832
## 9                      KY                      11.610960
## 10                     MA                      21.090909
## 11                     MI                       8.620370
## 12                     MN                       8.529412
## 13                     MO                      -3.666667
## 14                     NC                      17.176471
## 15                     NJ                       8.065134
## 16                     NV                      26.038462
## 17                     NY                      26.587156
## 18                     PA                      33.500000
## 19                     TN                      13.477477
## 20                     TX                      13.327893
## 21                     UT                       7.717391
## 22                     VA                       9.833333
## 23                     WA                      -2.500000
## 24                     WI                       6.367816

ggplot(Origin_Delays, aes(Origin_Delays$`flights_merged$STATE.x`, Origin_Delays$`flights_merged$DEPARTURE_DELAY`)) +
  geom_col() +
ggtitle("Average Departure Delays for Origin States") +
  xlab("Origin State") +
  ylab("Average Departure Delay")

## Warning: Use of `Origin_Delays$`flights_merged$STATE.x`` is discouraged. Use
## `flights_merged$STATE.x` instead.

## Warning: Use of `Origin_Delays$`flights_merged$DEPARTURE_DELAY`` is discouraged.
## Use `flights_merged$DEPARTURE_DELAY` instead.

This column chart provides some useful information. It shows that of the 26 origin states, the majority of them have a positive average delay time, with only Connecticut, Missouri, and Washington having negative average delay times. Pennsylvania, New York, and Nevada have noticeably higher average departure delays than other states, with Illinois and Massachusetts not too far behind.

Next, I’m going to look at Destination states, and find the average Arrival Delay times for each state using an aggregate() function. I’m hoping to compare the origin states bar chart with a destination states bar chart and see if any states have similar delay times when either exiting or entering the state.

Destination_Delays <- aggregate(flights_merged$ARRIVAL_DELAY ~ flights_merged$STATE.y, FUN = mean)
Destination_Delays

##    flights_merged$STATE.y flights_merged$ARRIVAL_DELAY
## 1                      AR                   16.2352941
## 2                      AZ                   20.1190476
## 3                      CA                   -8.7226277
## 4                      CO                   20.1025641
## 5                      CT                  -18.0000000
## 6                      FL                    2.9089286
## 7                      GA                   -4.0294118
## 8                      IL                   20.3788099
## 9                      KY                   12.1998378
## 10                     MA                   22.8888889
## 11                     MI                    1.0654206
## 12                     MN                   12.5641026
## 13                     MO                   -0.6129032
## 14                     NC                   13.4000000
## 15                     NJ                    6.4798387
## 16                     NV                    7.6708861
## 17                     NY                   19.0495050
## 18                     PA                   33.0000000
## 19                     TN                    6.7857143
## 20                     TX                   13.4856712
## 21                     UT                   -6.3586957
## 22                     VA                  -11.2500000
## 23                     WA                  -13.5000000
## 24                     WI                   12.8000000

ggplot(Destination_Delays, aes(Destination_Delays$`flights_merged$STATE.y`, Destination_Delays$`flights_merged$ARRIVAL_DELAY`)) +
  geom_col() +
ggtitle("Average Arrival Delays for Destination States") +
  xlab("Destination State") +
  ylab("Average Arrival Delay")

## Warning: Use of `Destination_Delays$`flights_merged$STATE.y`` is discouraged.
## Use `flights_merged$STATE.y` instead.

## Warning: Use of `Destination_Delays$`flights_merged$ARRIVAL_DELAY`` is
## discouraged. Use `flights_merged$ARRIVAL_DELAY` instead.

From this chart, there are a few noticeable takeaways when comparing to the Origin State bar chart. First, Connecticut, Missouri, and Washington all have negative Arrival Delay times, just as they did for the Departure Delay times. This suggests that for some reason, these states tend to be more ahead of schedule when flying, whether entering or exiting the states.

Also, Pennsylvania has the highest arrival delay time, just as it did for the departure delay time. This suggests that flights in and out of Pennsylvania have higher delay times. All this information could be very useful to both consumers and airlines. Consumers could find this delay time information, and potentially plan for which state they want to fly in/out of for a trip. Airlines could use this information to dive deeper into the states that have the highest and lowest delay times, and study whether these states are doing anything wrong or right in how they are operating.

Question 4: For each airline, how much of their arrival delays can be attributed to operational errors, as opposed to non-controllable circumstances?

In this dataset, there are five possible reasons listed for why flights have an Arrival Delay. These reasons are recorded in the AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, and WEATHER_DELAY columns. For any flight that has an Arrival Delay of 15 minutes or more, the amount of time attributed to each of the reasons for delay is recorded in the corresponding column. In other words, if all five of these columns are summed for any flight 15 minutes late or more, they add up to equal the ARRIVAL_DELAY time.

Three of these reasons for delay are non-controllable by the airlines: AIR_SYSTEM_DELAY, SECURITY_DELAY, and WEATHER_DELAY. The other two reasons for delay, AIRLINE_DELAY and LATE_AIRCRAFT_DELAY, are due to poor operational speeds by the airline, and are therefore indicative of poor service. I will be comparing the airlines based on the reasons for having an arrival delay, and whether their delay was a result of poor service or non-controllable circumstances.

To do this, first I will combine the three non-controllable variables into one new variable by summing them, and combine the two poor-service variables into a separate new variable by summing them.

non_controllable <- flights_merged$AIR_SYSTEM_DELAY + flights_merged$SECURITY_DELAY + flights_merged$WEATHER_DELAY
poor_service <- flights_merged$AIRLINE_DELAY +flights_merged$LATE_AIRCRAFT_DELAY
flights_merged <- cbind(flights_merged, non_controllable, poor_service)

After creating these new variables, I will now create a third variable, which will be the percent of the Arrival Delay attributed to Poor Service.

percent_poor_service <- (poor_service / (non_controllable + poor_service))*100
flights_merged <- cbind(flights_merged, percent_poor_service)

Now, I will use an aggregate() function to find the mean percentage of Arrival Delay times attributed to poor service for each airline, and then compare in a barplot.

avg_poor_service <- aggregate(flights_merged$percent_poor_service ~ flights_merged$AIRLINE.y, FUN = mean)
avg_poor_service

##       flights_merged$AIRLINE.y flights_merged$percent_poor_service
## 1 American Eagle Airlines Inc.                            58.34409
## 2  Atlantic Southeast Airlines                            53.07303
## 3         Delta Air Lines Inc.                            62.36648
## 4       Frontier Airlines Inc.                            56.38028
## 5        Skywest Airlines Inc.                            65.68554

ggplot(avg_poor_service, aes(avg_poor_service$`flights_merged$AIRLINE.y`, avg_poor_service$`flights_merged$percent_poor_service`)) +
  geom_col() +
ggtitle("Percent of Weather Delays per Airline") +
  xlab("Airline") +
  ylab("Average Percent")

## Warning: Use of `avg_poor_service$`flights_merged$AIRLINE.y`` is discouraged.
## Use `flights_merged$AIRLINE.y` instead.

## Warning: Use of `avg_poor_service$`flights_merged$percent_poor_service`` is
## discouraged. Use `flights_merged$percent_poor_service` instead.

This chart and barplot shows that for all flights with arrival delays of 15 minutes or greater, Skywest Airlines has the greatest percent of delays attributed to poor service factors, at roughly 65%, followed by Delta, American Eagle, and Frontier. Atlantic Southeast Airlines has the lowest percent of arrival delays attributed to poor service, with about 53%. This was a bit surprising, as Delta had the smallest interquartile range for departure delays.

Question 5: For flights delayed at least in part due to weather, what were the most common origin and destination states?

My final question is in regards to which states had the most problems with weather interfering with flights. To explore this, I will create a new dataframe called “weather,” and use filter() to only include records where the weather delay is greater than 0. I will then create another new dataframe and use aggregate() to count the number of flights with weather delays for each origin state.

weather <- filter(flights_merged, WEATHER_DELAY > 0)
weather_by_origin <- aggregate(weather$WEATHER_DELAY ~ weather$STATE.x, FUN = length)
weather_by_origin

##    weather$STATE.x weather$WEATHER_DELAY
## 1               CO                     2
## 2               FL                     1
## 3               GA                     6
## 4               IL                    88
## 5               KY                    70
## 6               MA                     5
## 7               MN                     1
## 8               NJ                     3
## 9               NV                     1
## 10              NY                     4
## 11              TN                     1
## 12              TX                    29
## 13              WI                     4

This table shows some basic information about the count of weather delays for each origin state. Of note, Illinois and Kentucky have by far the highest counts of weather delays. Now, I will visualize this data with a bar chart.

ggplot(weather_by_origin, aes(weather_by_origin$`weather$STATE.x`, weather_by_origin$`weather$WEATHER_DELAY`)) +
  geom_col() +
ggtitle("Number of Weather Delays per Origin State") +
  xlab("Origin State") +
  ylab("Number of Weather Delays")

## Warning: Use of `weather_by_origin$`weather$STATE.x`` is discouraged. Use
## `weather$STATE.x` instead.

## Warning: Use of `weather_by_origin$`weather$WEATHER_DELAY`` is discouraged. Use
## `weather$WEATHER_DELAY` instead.

This shows that most states don’t have a whole lot of weather delays on flights leaving them, but Illinois, Kentucky, and Texas have quite a lot comparatively and account for most of the weather delays.

Next, I will perform the same process for the Destination states to try and compare with the Origin states. First, I’ll create a new data frame and use aggregate() to count the number of weather delays for each destination state.

weather_by_dest <- aggregate(weather$WEATHER_DELAY ~ weather$STATE.y, FUN = length)
weather_by_dest

##    weather$STATE.y weather$WEATHER_DELAY
## 1               AZ                     1
## 2               CA                     1
## 3               FL                     9
## 4               GA                     4
## 5               IL                    18
## 6               KY                   145
## 7               MI                     1
## 8               MN                     1
## 9               NJ                     3
## 10              NY                     3
## 11              TX                    26
## 12              UT                     2
## 13              WI                     1

This table shows that flights entering Kentucky have by far the highest number of weather delays, followed by Texas. Now, I’ll similarly create a bar chart to visualize this table.

ggplot(weather_by_dest, aes(weather_by_dest$`weather$STATE.y`, weather_by_dest$`weather$WEATHER_DELAY`)) +
  geom_col() +
ggtitle("Number of Weather Delays per Destination State") +
  xlab("Destination State") +
  ylab("Number of Weather Delays")

## Warning: Use of `weather_by_dest$`weather$STATE.y`` is discouraged. Use
## `weather$STATE.y` instead.

## Warning: Use of `weather_by_dest$`weather$WEATHER_DELAY`` is discouraged. Use
## `weather$WEATHER_DELAY` instead.

This bar chart shows a similar pattern as the origin states bar chart, with the majority of flights that are delayed due to weather coming from just a few states. An important takeaway in comparing these charts is that Kentucky, Texas, and Illinois have the highest number of delays due to weather compared to all other states, both for flights entering and leaving the states.

Project 1

Jacob Hasselbeck

10/16/2020