The dataset we are working with covers every U.S. domestic flight in January 2016. It includes the scheduled and actual arrival and departure times, as well as aircraft and airline data and the origin and destination locations. To make the data easier to work with, many of the time fields were recalculated as actual date/time values, from which other fields could then be derived.
For my assignment I decided to focus on the regions of the United States as defined by the U.S. Census Bureau. I ultimately decided to look into the average delays experienced when flying from one region to another. The analysis begins at the regional level, then identifies the specific states and airline carriers with the highest average delays in January 2016.
Working in this order lets us explore the data without committing to a hypothesis in advance, which reduces the risk of bias. Once completed, we can use the results to look into potential causes of the delays, such as weather, specific carriers, and air traffic. I wanted to avoid extreme delays, as they would easily skew the averages in a direction that is not representative of the majority of flights. The sample therefore consists of flights that were delayed between 1 and 120 minutes (2 hours). We will perform tests further on to check that this sample fairly represents the majority of the delays in January 2016.
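The sample selection described above can be sketched with dplyr. This is only an illustration under stated assumptions: the data frame `flights`, the column name `delay_min`, and the toy values are hypothetical and are not taken from the actual dataset.

```r
library(dplyr)

# Toy data for illustration only; `delay_min` is a hypothetical column name.
flights <- data.frame(
  flight_id = 1:8,
  delay_min = c(-5, 0, 1, 15, 45, 120, 121, 600)
)

# Keep flights delayed between 1 and 120 minutes (both ends inclusive).
delayed_sample <- flights %>%
  filter(between(delay_min, 1, 120))

# Share of all delayed flights (delay > 0) that the sample retains.
# A high share suggests the cutoff drops only a small tail of extreme delays.
share_kept <- nrow(delayed_sample) / sum(flights$delay_min > 0)
```

Checking `share_kept` on the real data is one simple way to judge whether the 120-minute cutoff still covers most delayed flights.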
| Departure Region | Arrival Region | Average Delay |
|---|---|---|
| Northeast | West | 26.86 |
| Northeast | Midwest | 23.71 |
| Midwest | West | 22.3 |
| Northeast | Northeast | 20.04 |
| Northeast | South | 19.87 |
| South | West | 19.52 |
| Midwest | South | 18.06 |
| Midwest | Midwest | 18.02 |
| South | Midwest | 17.81 |
| South | Northeast | 17.71 |
| Midwest | Northeast | 17.34 |
| West | Midwest | 16.57 |
| West | West | 16.24 |
| West | Northeast | 16.07 |
| West | South | 15.91 |
| South | South | 15.3 |
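A table like the one above could be produced roughly as follows. This is a sketch, not the code used for the report: the column names and toy flights are hypothetical, and the region lookup is built from R's bundled `state.abb` and `state.region` vectors, whose "North Central" label is renamed to "Midwest" here to match the table.

```r
library(dplyr)

# Lookup from state abbreviation to region, built from R's bundled state data.
# Assumption: "North Central" is relabeled "Midwest" to match the table above.
# DC and territories are not in `state.abb`, so the joins below would drop them.
regions <- data.frame(
  state  = state.abb,
  region = as.character(state.region),
  stringsAsFactors = FALSE
) %>%
  mutate(region = if_else(region == "North Central", "Midwest", region))

# Toy flights for illustration only; column names are hypothetical.
flights <- data.frame(
  origin_state = c("NY", "NY", "MA", "IL", "CA", "TX"),
  dest_state   = c("CA", "WA", "CA", "CA", "NY", "FL"),
  delay_min    = c(30, 20, 40, 10, 15, 5),
  stringsAsFactors = FALSE
)

# Attach a region to each end of the flight, then average by region pair.
region_delays <- flights %>%
  inner_join(regions, by = c("origin_state" = "state")) %>%
  rename(departure_region = region) %>%
  inner_join(regions, by = c("dest_state" = "state")) %>%
  rename(arrival_region = region) %>%
  group_by(departure_region, arrival_region) %>%
  summarise(average_delay = round(mean(delay_min), 2), .groups = "drop") %>%
  arrange(desc(average_delay))
```

Sorting in descending order puts the most delayed region pair in the first row, which is how the Northeast-to-West route stands out in the table.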
For the next stage of the analysis we’re going to be looking at delayed flights that originated in the Northeast and landed in the West.
| Carrier | Average Delay |
|---|---|
| AS | 38.67 |
| VX | 29.65 |
| DL | 27.58 |
| AA | 26.82 |
| UA | 26.15 |
| WN | 26.13 |
| NK | 24.67 |
| B6 | 23.46 |
| HA | 16.46 |
For the flights within our sample, we've identified the average delay by operating carrier. We will focus on the top 3 for the rest of the analysis. The dataset identifies each airline carrier by a two-letter code: AS, VX, and DL stand for Alaska Airlines, Virgin America, and Delta Air Lines.
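The carrier ranking and the choice of the top 3 can be sketched as below. The data frame `ne_to_west`, its column names, and its values are hypothetical stand-ins for the Northeast-to-West sample.

```r
library(dplyr)

# Toy Northeast-to-West flights; values and column names are illustrative only.
ne_to_west <- data.frame(
  carrier   = c("AS", "AS", "VX", "DL", "DL", "UA", "B6"),
  delay_min = c(40, 36, 30, 28, 26, 25, 20),
  stringsAsFactors = FALSE
)

# Average delay and flight count per carrier, most delayed first.
carrier_delays <- ne_to_west %>%
  group_by(carrier) %>%
  summarise(
    average_delay = mean(delay_min),
    n_flights     = n(),
    .groups       = "drop"
  ) %>%
  arrange(desc(average_delay))

# Carrier codes of the three highest averages.
top_carriers <- carrier_delays %>%
  slice_head(n = 3) %>%
  pull(carrier)
```

Keeping `n_flights` next to the average is a deliberate choice in this sketch: an average based on only a handful of flights is much noisier than one based on hundreds, so the count helps when comparing carriers.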
This sort of analysis is useful for datasets that are too large to glean the most important insights from at first glance. With the information we have, we can create models that reveal the most delayed flight paths for each state and airline to a given region. A traveler could use this data to plan a trip, or an airline could use it to decide whether or not to expand into a new market; the implications are far-reaching for those who can read between the lines.
The tabs below contain tables comparing departure state and arrival state pairs by average delay. The histograms show the distribution of total delays for each of the 3 airline carriers: AS, VX, and DL.
| OriginState | DestState | Carrier | Average_Delay |
|---|---|---|---|
| NJ | WA | AS | 37.84 |
| NJ | CA | VX | 32.98 |
| NJ | UT | DL | 17.79 |
| OriginState | DestState | Carrier | Average_Delay |
|---|---|---|---|
| MA | CA | AS | 53.71 |
| MA | OR | AS | 41.04 |
| MA | WA | AS | 34.67 |
| MA | UT | DL | 29.24 |
| MA | CA | VX | 28.24 |
| MA | CA | DL | 24.3 |
| OriginState | DestState | Carrier | Average_Delay |
|---|---|---|---|
| NY | HI | DL | 38.62 |
| NY | NV | DL | 32.88 |
| NY | CA | VX | 29.07 |
| NY | CO | DL | 28.47 |
| NY | CA | DL | 28.07 |
| NY | NV | VX | 27.79 |
| NY | OR | DL | 27.38 |
| NY | UT | DL | 25.87 |
| NY | WA | DL | 25.29 |
| NY | AZ | DL | 23.79 |
| NY | WA | AS | 19.71 |
| NY | WY | DL | 14 |
| OriginState | DestState | Carrier | Average_Delay |
|---|---|---|---|
| PA | WA | AS | 45.6 |
| PA | UT | DL | 22.83 |
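The per-state tables above, and histogram-style counts for the three carriers, could be built along these lines. This is a sketch under assumptions: the toy values and the `delay_min` column are hypothetical, and the 5-minute bin width is an assumed choice rather than a documented one.

```r
library(dplyr)

# Toy Northeast-to-West flights; values and `delay_min` are illustrative only.
ne_to_west <- data.frame(
  OriginState = c("NJ", "NJ", "NJ", "MA", "MA", "NY"),
  DestState   = c("WA", "WA", "CA", "CA", "CA", "CA"),
  Carrier     = c("AS", "AS", "VX", "AS", "UA", "DL"),
  delay_min   = c(30, 50, 33, 54, 12, 28),
  stringsAsFactors = FALSE
)

top_carriers <- c("AS", "VX", "DL")

# Average delay per origin state, destination state, and carrier.
state_delays <- ne_to_west %>%
  filter(Carrier %in% top_carriers) %>%
  group_by(OriginState, DestState, Carrier) %>%
  summarise(Average_Delay = round(mean(delay_min), 2), .groups = "drop") %>%
  arrange(OriginState, desc(Average_Delay))

# One table per origin state, mirroring the tabbed layout.
tables_by_state <- split(state_delays, state_delays$OriginState)

# Histogram-style counts per carrier, assuming 5-minute bins.
delay_bins <- ne_to_west %>%
  filter(Carrier %in% top_carriers) %>%
  mutate(bin_start = floor(delay_min / 5) * 5) %>%
  count(Carrier, bin_start, name = "n_flights")
```

Each element of `tables_by_state` corresponds to one tab, and `delay_bins` holds the bar heights a histogram of the same data would draw.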
This final plot represents all flights flown by the 3 identified airline carriers from the Northeast to the West. It reveals that New York flights are the most frequently delayed, and that the most delayed flight was from New Jersey to California. We can see that Massachusetts flights are more likely to be delayed on arrival in Oregon than New York flights. It would be interesting in a further study to identify why planes flying out of the western U.S. are so good at being on time, while there is such a large spread of delays for incoming planes. Could it be strictly an issue with departure delays? Or are the western airports prioritizing their outbound flights? The sky is the limit!