Methodology


The dataset contains a record of every U.S. domestic flight in January 2016. Each record includes the scheduled and actual arrival and departure times, aircraft and airline information, and the origin and destination locations. To make the data easier to work with, many of the fields were recalculated as true date/time values, from which other fields could then be calculated.
For this assignment I decided to focus on the regions of the United States as defined by the U.S. Census Bureau, and specifically on the average delays experienced when flying from one region to another. The analysis begins at the regional level, then identifies the specific states and airline carriers with the highest average delays in January 2016.
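
As an illustration of the date/time recalculation described above, here is a minimal Python sketch. The input formats are assumptions, not taken from the dataset: a flight date written as 'YYYY-MM-DD' and local clock times written as 'HHMM' strings. The function names are hypothetical, and the 12-hour rule for flights that roll past midnight is a simplification.

```python
from datetime import datetime, timedelta


def to_datetime(flight_date: str, hhmm: str) -> datetime:
    """Combine a 'YYYY-MM-DD' date and an 'HHMM' clock time into one datetime."""
    hhmm = hhmm.zfill(4)  # restore leading zeros, e.g. '900' -> '0900'
    day = datetime.strptime(flight_date, "%Y-%m-%d")
    # Assumption: midnight may be written as 2400; treat it as 00:00 next day.
    if hhmm == "2400":
        return day + timedelta(days=1)
    return day + timedelta(hours=int(hhmm[:2]), minutes=int(hhmm[2:]))


def delay_minutes(flight_date: str, scheduled: str, actual: str) -> int:
    """Minutes between scheduled and actual time (negative means early)."""
    sched = to_datetime(flight_date, scheduled)
    act = to_datetime(flight_date, actual)
    # Assumption: an actual time more than 12 hours "before" the schedule
    # belongs to the following day (scheduled 2350, actual 0020).
    if act - sched < timedelta(hours=-12):
        act += timedelta(days=1)
    return int((act - sched).total_seconds() // 60)
```

Once both times are real datetime values, a delay is a plain subtraction, and further fields (day of week, hour of departure) can be read off the same objects.
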

Performing the analysis in this order lets us explore the data without committing to a particular hypothesis in advance, which reduces the risk of bias. Once completed, we can use the results to investigate potential causes of the delays, such as weather, specific carriers, and air traffic. I wanted to exclude extreme delays, as they would easily skew the averages in a direction that’s not representative of the majority of flights. The sample therefore consists of flights that were delayed between 1 and 120 minutes (2 hours). Further on, we check that this sample fairly represents the majority of the delays in January 2016.
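
The sampling rule above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and delays are assumed to be whole minutes with negative or zero values meaning "not delayed".

```python
def select_sample(delays, low=1, high=120):
    """Keep only delays between low and high minutes, inclusive."""
    return [d for d in delays if low <= d <= high]


def retained_share(delays, low=1, high=120):
    """Fraction of delayed flights (delay >= low) that fall at or under high."""
    delayed = [d for d in delays if d >= low]
    if not delayed:
        return 0.0
    kept = sum(1 for d in delayed if d <= high)
    return kept / len(delayed)
```

Reporting the retained share alongside the sample makes it easy to see how much of the delayed traffic the 120-minute cutoff leaves out.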

Region To Region Results

Total Delay (Arrival/Departure) Averages Between Regions
Departure Region   Arrival Region   Average Delay
Northeast          West             26.86
Northeast          Midwest          23.71
Midwest            West             22.3
Northeast          Northeast        20.04
Northeast          South            19.87
South              West             19.52
Midwest            South            18.06
Midwest            Midwest          18.02
South              Midwest          17.81
South              Northeast        17.71
Midwest            Northeast        17.34
West               Midwest          16.57
West               West             16.24
West               Northeast        16.07
West               South            15.91
South              South            15.3
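
A table like the one above could be produced along the following lines. This is a hedged Python sketch rather than the code behind the report: the function name and the (origin_state, dest_state, delay) input shape are assumptions, and the state lists follow the four U.S. Census Bureau regions.

```python
from collections import defaultdict

# The four U.S. Census Bureau regions (50 states plus DC).
CENSUS_REGIONS = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}
STATE_TO_REGION = {
    state: region
    for region, states in CENSUS_REGIONS.items()
    for state in states.split()
}


def average_delay_by_region_pair(flights):
    """flights: iterable of (origin_state, dest_state, delay_minutes) tuples.

    Returns [((origin_region, dest_region), average), ...], highest first.
    """
    totals = defaultdict(lambda: [0.0, 0])  # pair -> [sum of delays, count]
    for origin, dest, delay in flights:
        origin_region = STATE_TO_REGION.get(origin)
        dest_region = STATE_TO_REGION.get(dest)
        # Assumption: territories outside the four regions are skipped.
        if origin_region is None or dest_region is None:
            continue
        entry = totals[(origin_region, dest_region)]
        entry[0] += delay
        entry[1] += 1
    averages = {pair: round(s / n, 2) for pair, (s, n) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```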

With the table constructed, we can see that two region pairs stand out from the others, a pattern the simple scatterplot also supports. Flights leaving the Northeast and heading to either the West or the Midwest experienced the highest average delays. It is tempting to attribute this simply to weather; it is January, after all. The difficulty with that explanation is that flights destined for the Northeast are not near the top of the list, even though the average delay figure takes into account both arrival and departure delays. One other interesting feature, which we will not pursue here, is the tight distribution of flights departing from Western states. Their averages are very close together (15.91 to 16.57 minutes) and account for most of the lowest points on the scale. This low variation may indicate that airports in the Western United States operate more efficiently than those elsewhere.

Delayed Northeast to Western US Flights

For the next stage of the analysis, we look at delayed flights that originated in the Northeast and landed in the West.


As mentioned earlier, we want to check that selecting only flights delayed by 2 hours or less doesn’t distort the underlying pattern in the data. The histogram above lets us say with reasonable confidence that the sample captures the majority of delayed flights. The remaining risk is that delays follow a bimodal distribution whose second mode lies beyond the 2-hour cutoff, where we can no longer see it. Since most of the excluded flights were extreme values, it was deemed appropriate to limit the sample and work within the 2-hour range for all Northeast-to-West flights.
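
One way to inspect the shape of the distribution without a plot is to count delays per bin. The sketch below is illustrative only: the function names are hypothetical and the 5-minute bin width is an assumption. Running it on all delayed flights, not just the sample, would show whether a second cluster sits beyond the cutoff.

```python
from collections import Counter


def bin_counts(delays, width=5):
    """Count delays per bin; the bin labelled b covers [b, b + width)."""
    counts = Counter((d // width) * width for d in delays)
    return dict(sorted(counts.items()))


def bins_beyond(counts, cutoff=120):
    """Bins that start above the cutoff, for spotting a hidden second mode."""
    return {start: n for start, n in counts.items() if start > cutoff}
```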

Carrier Average Delays

Carrier   Average Delay
AS        38.67
VX        29.65
DL        27.58
AA        26.82
UA        26.15
WN        26.13
NK        24.67
B6        23.46
HA        16.46

For the flights in our sample, we computed the average delay by operating carrier. We focus on the top three for the rest of the analysis. The dataset identifies each carrier by a two-character code: AS, VX, and DL stand for Alaska Airlines, Virgin America, and Delta Air Lines.
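
Ranking carriers by mean delay and keeping the top few can be sketched as follows. The function name and the (carrier_code, delay) input shape are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def top_carriers(flights, n=3):
    """flights: iterable of (carrier_code, delay_minutes) tuples.

    Returns the n carriers with the highest mean delay, highest first.
    """
    by_carrier = defaultdict(list)
    for carrier, delay in flights:
        by_carrier[carrier].append(delay)
    ranked = sorted(
        ((carrier, round(mean(delays), 2)) for carrier, delays in by_carrier.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:n]
```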

While Alaska (AS) has the highest mean according to the table, the boxplot above helps tell the whole story. Hawaiian Airlines (HA) appears to be the best at keeping delays short and within a narrow range. We can also see that Virgin America (VX) had the most extreme individual delays in January, shown by the dots above its topmost whisker.
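
The elements of a boxplot referred to above (box, whiskers, and the dots beyond them) can be computed directly. This sketch assumes the common convention of whiskers reaching to the most extreme points within 1.5 times the interquartile range of the box; the plotting tool used for the report may define quartiles slightly differently, and the function name is hypothetical.

```python
from statistics import quantiles


def boxplot_summary(delays):
    """Quartiles, whisker ends, and outliers for one carrier's delays.

    Needs at least two values. Outliers are points more than 1.5 * IQR
    outside the box, which a boxplot draws as individual dots.
    """
    data = sorted(delays)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [d for d in data if low_fence <= d <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [d for d in data if d < low_fence or d > high_fence],
    }
```

A carrier with a modest mean can still show many outliers, which is why the mean and the boxplot can rank carriers differently.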

Average Delays From Northeast To Western US By State and Carrier

This sort of analysis suits datasets that are too large to glean the most important insights from at first glance. With the information we have, we can create models that reveal the most delayed flight paths for each state and airline into a given region. A traveler could use this data to plan a trip, or an airline could use it to decide whether to expand into a new market; the implications are far-reaching for those who can read between the lines.

The tabs below contain tables comparing departure state and arrival state pairs by average delay. The histograms examine the distribution of total delays for each of the three carriers: AS, VX, and DL.
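
The per-state tables can be thought of as a grouped average over origin state, destination state, and carrier. The Python sketch below shows one way to build them; the function name and input shape are assumptions, and the input is assumed to be already limited to Northeast-to-West flights in the sample.

```python
from collections import defaultdict


def state_carrier_averages(flights, carriers=("AS", "VX", "DL")):
    """flights: (origin_state, dest_state, carrier, delay_minutes) tuples.

    Returns {origin_state: [(dest_state, carrier, average), ...]} with each
    state's rows sorted by average delay, highest first.
    """
    groups = defaultdict(list)
    for origin, dest, carrier, delay in flights:
        if carrier in carriers:  # keep only the carriers under study
            groups[(origin, dest, carrier)].append(delay)

    tables = defaultdict(list)
    for (origin, dest, carrier), delays in groups.items():
        tables[origin].append((dest, carrier, round(sum(delays) / len(delays), 2)))
    for rows in tables.values():
        rows.sort(key=lambda row: row[2], reverse=True)
    return dict(tables)
```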

New Jersey

OriginState   DestState   Carrier   Average_Delay
NJ            WA          AS        37.84
NJ            CA          VX        32.98
NJ            UT          DL        17.79

Massachusetts

OriginState   DestState   Carrier   Average_Delay
MA            CA          AS        53.71
MA            OR          AS        41.04
MA            WA          AS        34.67
MA            UT          DL        29.24
MA            CA          VX        28.24
MA            CA          DL        24.3

New York

OriginState   DestState   Carrier   Average_Delay
NY            HI          DL        38.62
NY            NV          DL        32.88
NY            CA          VX        29.07
NY            CO          DL        28.47
NY            CA          DL        28.07
NY            NV          VX        27.79
NY            OR          DL        27.38
NY            UT          DL        25.87
NY            WA          DL        25.29
NY            AZ          DL        23.79
NY            WA          AS        19.71
NY            WY          DL        14

Pennsylvania

OriginState   DestState   Carrier   Average_Delay
PA            WA          AS        45.6
PA            UT          DL        22.83

State To State Comparison

This final plot covers all flights flown by the three identified carriers from the Northeast to the West. It shows that New York flights are the most frequently delayed, and that the single most delayed flight was from New Jersey to California. We can also see that Massachusetts flights are more likely to be delayed on arrival in Oregon than New York flights. A further study could investigate why planes flying out of the Western US are so consistently on time, while delays for incoming planes are so widely spread. Could it be strictly an issue with departure delays? Or are the Western airports prioritizing their outbound flights? The sky is the limit!