This report uses flight data for all domestic flights in the United States that were reported for on-time performance. The file includes 21 variables, and another 11 will be created to enable a deeper analysis of the data. We will first check for missing-data patterns, as was done with the Maine flight files in the lectures.

Again, the NA values here come from cancelled flights, so these records will be filtered out of the calculations.

##   CRSDepTime DepTime DepDelay
## 1       1100    1057       -3
## 2       1100    1056       -4
## 3       1100    1055       -5
## 4       1100    1102        2
## 5       1100    1240      100
## 6       1100    1107        7
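The relationship between the three columns above can be checked with a minimal sketch. The first five rows are copied from the table; the last row is a made-up cancelled flight, and `Cancelled` is an assumed column name. The helper `hhmm_to_minutes` is hypothetical and does not handle flights that depart after midnight.

```r
# Toy rows: the first five come from the table above, the sixth is invented.
flights <- data.frame(
  CRSDepTime = c(1100, 1100, 1100, 1100, 1100, 1100),
  DepTime    = c(1057, 1056, 1055, 1102, 1240, NA),
  Cancelled  = c(0, 0, 0, 0, 0, 1)  # assumed column name; 1 = cancelled
)

# Convert a clock value in HHMM form to minutes after midnight.
hhmm_to_minutes <- function(x) (x %/% 100) * 60 + (x %% 100)

# Drop cancelled flights, then recompute the departure delay in minutes.
flown <- flights[flights$Cancelled == 0, ]
flown$delay_check <- hhmm_to_minutes(flown$DepTime) -
  hhmm_to_minutes(flown$CRSDepTime)

print(flown)
```

Working in minutes matters because the times are clock values: 1240 minus 1100 is 140 as plain numbers, but the actual delay is 100 minutes.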

I will approach the analysis from a top-10 perspective to see which cities and which airlines have the greatest percentage of delayed flights. Each record in the data has an originating city and a destination city, so we will rank cities both by late outgoing flights and by late incoming flights. Joining these two tables may then show whether the lateness of incoming and outgoing flights is correlated for each city.

## # A tibble: 10 x 2
##                          CityName perc_delay_o
##                             <chr>        <dbl>
##  1                    Roswell, NM    0.7500000
##  2                      Miami, FL    0.5025398
##  3 West Palm Beach/Palm Beach, FL    0.4752688
##  4        North Bend/Coos Bay, OR    0.4705882
##  5                  Worcester, MA    0.4655172
##  6            Fort Lauderdale, FL    0.4611253
##  7              Arcata/Eureka, CA    0.4537815
##  8                    Oakland, CA    0.4475292
##  9                Adak Island, AK    0.4444444
## 10                   San Juan, PR    0.4431189

So we see that the city with the highest share of delayed originating flights is Roswell, NM, at 75%. Now we will look at the 10 worst performers from the point of view of the destination of the flight.

## # A tibble: 10 x 2
##                     CityName perc_delay_d
##                        <chr>        <dbl>
##  1                  Guam, TT    0.7419355
##  2           Plattsburgh, NY    0.6774194
##  3 Newburgh/Poughkeepsie, NY    0.6166667
##  4             Worcester, MA    0.5081967
##  5                Bangor, ME    0.5000000
##  6         San Francisco, CA    0.4918084
##  7         Atlantic City, NJ    0.4914676
##  8              San Juan, PR    0.4784096
##  9   North Bend/Coos Bay, OR    0.4666667
## 10          White Plains, NY    0.4599237
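Rankings like the two above can be produced with a grouped summary. The sketch below uses made-up data; `OriginCityName` is an assumed column name, and counting a flight as delayed when `DepDelay > 0` is my assumption, since the exact threshold is not stated here. The destination ranking would follow the same pattern with the destination city column.

```r
library(dplyr)

# Made-up toy data; the real file has one row per flight.
flights <- data.frame(
  OriginCityName = c("A", "A", "A", "A", "B", "B", "C", "C"),
  DepDelay       = c(10, -2, 5, NA, -1, 0, 3, -4),
  stringsAsFactors = FALSE
)

top_delayed_origins <- flights %>%
  filter(!is.na(DepDelay)) %>%                      # cancelled flights have NA delays
  group_by(CityName = OriginCityName) %>%
  summarise(perc_delay_o = mean(DepDelay > 0)) %>%  # assumed definition of "delayed"
  arrange(desc(perc_delay_o)) %>%
  head(10)

print(top_delayed_origins)
```

Adding a flight count per city, for example `n()` inside `summarise()`, would show whether a very high percentage rests on only a handful of flights.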

Interestingly, some cities show up among the worst performers for both destination and originating flights, namely Worcester, MA; San Juan, PR; and North Bend/Coos Bay, OR. Now we will join these two data sets to see how the delays compare for the same city, whether incoming or outgoing.

##          CityName perc_delay_o perc_delay_d
## 1    Aberdeen, SD    0.2741935    0.2096774
## 2 Adak Island, AK    0.4444444    0.2222222
## 3   Aguadilla, PR    0.4233129    0.4024390
## 4       Akron, OH    0.2219917    0.3146998
## 5      Albany, GA    0.1898734    0.2500000
## 6      Albany, NY    0.2983651    0.3682065

So let’s see how these percent delays for originating and destination flights look when plotted against each other.

## [1] 0.6044045
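The join and the correlation can be sketched as follows. The values are four rows copied from the joined table shown earlier, split into two hypothetical per-city tables named `origin_delays` and `dest_delays`. With only four cities the coefficient will not match the full-data value printed above.

```r
library(dplyr)

# Per-city delay shares, split by direction (values from the joined table).
origin_delays <- data.frame(
  CityName     = c("Aberdeen, SD", "Adak Island, AK", "Aguadilla, PR", "Akron, OH"),
  perc_delay_o = c(0.2741935, 0.4444444, 0.4233129, 0.2219917),
  stringsAsFactors = FALSE
)
dest_delays <- data.frame(
  CityName     = c("Akron, OH", "Aguadilla, PR", "Adak Island, AK", "Aberdeen, SD"),
  perc_delay_d = c(0.3146998, 0.4024390, 0.2222222, 0.2096774),
  stringsAsFactors = FALSE
)

# Keep only cities that appear in both tables, matched by name.
city_delays <- inner_join(origin_delays, dest_delays, by = "CityName")

# Pearson correlation between outgoing and incoming delay shares.
r <- cor(city_delays$perc_delay_o, city_delays$perc_delay_d)
print(city_delays)
print(r)
```

An inner join silently drops cities that only ever appear as an origin or only as a destination, which seems appropriate here because the correlation needs both values.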

The correlation coefficient of about 0.60 is moderately strong, and the plot suggests a roughly linear relationship between originating and destination flight delays. Now let’s move on to see how the airlines perform in terms of delayed flights.

## # A tibble: 12 x 2
##    Carrier perc_delay_d
##      <chr>        <dbl>
##  1      B6    0.4533249
##  2      NK    0.4417544
##  3      VX    0.4331100
##  4      UA    0.4026285
##  5      WN    0.3931576
##  6      DL    0.3178783
##  7      AA    0.3160170
##  8      OO    0.2967299
##  9      EV    0.2548652
## 10      F9    0.2543435
## 11      AS    0.2326888
## 12      HA    0.2098805

Here B6 clearly has a much higher percentage of delays than F9, but since we would like to know the actual carrier names, we will use a lookup table to add a new column to the dataset called CarrierName.

## # A tibble: 12 x 2
##           CarrierName perc_delay_d
##                 <chr>        <dbl>
##  1            JetBlue    0.4533249
##  2             Spirit    0.4417544
##  3      VirginAmerica    0.4331100
##  4             United    0.4026285
##  5          Southwest    0.3931576
##  6              Delta    0.3178783
##  7           American    0.3160170
##  8            SkyWest    0.2967299
##  9 Atlantic_Southeast    0.2548652
## 10           Frontier    0.2543435
## 11             Alaska    0.2326888
## 12           Hawaiian    0.2098805
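A lookup of this kind can be sketched as a small table joined on the carrier code. The code-to-name pairs below are read off the two tables above by matching the delay percentages; the table name `carrier_lookup` is hypothetical.

```r
library(dplyr)

# Carrier code to name, taken from the two tables above.
carrier_lookup <- data.frame(
  Carrier     = c("B6", "NK", "VX", "UA", "WN", "DL",
                  "AA", "OO", "EV", "F9", "AS", "HA"),
  CarrierName = c("JetBlue", "Spirit", "VirginAmerica", "United",
                  "Southwest", "Delta", "American", "SkyWest",
                  "Atlantic_Southeast", "Frontier", "Alaska", "Hawaiian"),
  stringsAsFactors = FALSE
)

# Three rows of the per-carrier summary, for illustration.
carrier_delays <- data.frame(
  Carrier      = c("B6", "F9", "HA"),
  perc_delay_d = c(0.4533249, 0.2543435, 0.2098805),
  stringsAsFactors = FALSE
)

# left_join keeps every summary row even if a code is missing from the lookup.
named_delays <- carrier_delays %>%
  left_join(carrier_lookup, by = "Carrier") %>%
  select(CarrierName, perc_delay_d)

print(named_delays)
```

A left join is used so that an unmapped code would show up as an NA name instead of disappearing from the results.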

So we see that JetBlue ranks at the top for the most delayed flights. One behaviour I have noticed from captains is telling passengers that they will try to make up for the delay in flight; if so, an airline such as JetBlue would be expected to fly much faster than, say, Frontier.

## # A tibble: 12 x 3
##           CarrierName    speed  Distance
##                 <chr>    <dbl>     <dbl>
##  1             United 441.0604 1233.1693
##  2           Frontier 438.0890 1038.5957
##  3      VirginAmerica 433.0323 1406.5242
##  4             Spirit 429.4281 1000.0764
##  5             Alaska 427.3375 1228.7804
##  6           American 418.2067 1001.1286
##  7            JetBlue 413.3508 1070.5318
##  8              Delta 413.1661  852.8951
##  9          Southwest 412.8481  742.4581
## 10            SkyWest 367.9893  510.3771
## 11 Atlantic_Southeast 358.8201  441.6146
## 12           Hawaiian 343.7229  643.7852

Here the airlines’ average speeds don’t really track how late they tend to be; they look more like a function of the average distance of their flights. Let’s look at a tight range of distances, say 800 to 1000 miles, and see which airline is the most lead-footed.

## # A tibble: 11 x 4
##           CarrierName    speed Distance numflights
##                 <chr>    <dbl>    <dbl>      <int>
##  1             United 444.0059 901.4196       6835
##  2          Southwest 441.3598 900.0948      15425
##  3           Frontier 440.8243 900.9272       2293
##  4              Delta 440.3912 915.1052       6624
##  5      VirginAmerica 438.6490 954.0000        149
##  6             Spirit 438.0432 919.3939       2536
##  7            SkyWest 434.0981 891.1043       4392
##  8           American 433.5528 902.8221      10331
##  9            JetBlue 430.7922 920.0007       2959
## 10             Alaska 426.4751 919.1077       3084
## 11 Atlantic_Southeast 423.5503 882.3254       2698
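A speed comparison within a distance band can be sketched as below. The data are made up, `AirTime` (minutes in the air) is an assumed column name, and averaging per-flight speeds rather than dividing total distance by total time is my assumption about how the figures above were computed.

```r
library(dplyr)

# Made-up toy flights; one JetBlue flight falls outside the distance band.
flights <- data.frame(
  CarrierName = c("United", "United", "JetBlue", "JetBlue", "JetBlue"),
  Distance    = c(900, 950, 850, 1000, 2400),  # miles
  AirTime     = c(120, 125, 120, 140, 300),    # minutes (assumed column)
  stringsAsFactors = FALSE
)

speed_by_carrier <- flights %>%
  filter(!is.na(AirTime), AirTime > 0,
         Distance >= 800, Distance <= 1000) %>%  # tight distance band
  mutate(speed = Distance / (AirTime / 60)) %>%  # miles per hour
  group_by(CarrierName) %>%
  summarise(speed      = mean(speed),
            Distance   = mean(Distance),
            numflights = n()) %>%
  arrange(desc(speed))

print(speed_by_carrier)
```

Keeping `numflights` in the output helps flag carriers whose average rests on very few flights, such as a carrier with only 149 flights in the band.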

It seems that although JetBlue has the worst delayed-flight percentage, it is not particularly concerned with making up time in the air: it ranks ninth of the eleven carriers on speed in this distance band.

Finally, this makes me wonder whether longer flights tend to be delayed more often than shorter ones. A good way to approach this is to first see how the average speed changes with each 100-mile increase in distance.

So we see that you need distance to reach top speed. But what about flight distance and percent delayed flights?

Well, up to about 2000 miles, distance does seem to affect the percentage of flights delayed; beyond that, the relationship no longer holds.
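The 100-mile grouping behind both comparisons can be sketched as follows. The data are made up, `AirTime` is an assumed column name, and using `DepDelay > 0` as the definition of a delayed flight is again my assumption.

```r
library(dplyr)

# Made-up toy flights covering three distance bins.
flights <- data.frame(
  Distance = c(120, 180, 450, 470, 2100, 2150),  # miles
  AirTime  = c(30, 40, 70, 75, 270, 280),        # minutes (assumed column)
  DepDelay = c(-5, -3, 12, 3, -2, 20)            # minutes
)

by_distance <- flights %>%
  mutate(dist_bin = floor(Distance / 100) * 100,  # lower edge of 100-mile bin
         speed    = Distance / (AirTime / 60)) %>%
  group_by(dist_bin) %>%
  summarise(speed      = mean(speed),
            perc_delay = mean(DepDelay > 0),       # assumed definition of "delayed"
            numflights = n())

print(by_distance)
```

Plotting `speed` or `perc_delay` against `dist_bin` then gives one point per bin; the sparsely populated long-distance bins are worth checking through `numflights` before reading much into them.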

Conclusion

Many airports may have a reputation for the most delayed flights, but several factors should be taken into consideration. Is it really the airport that is to blame, or is it the airlines? An airport may serve as the hub for an airline that is not particularly concerned about making up lost time in the air and prefers to save on fuel costs by flying more slowly. The graph above also suggests that airlines may be more interested in saving on fuel costs as the flight distance increases. Thus many factors may affect an airport’s reputation for delayed flights without having anything to do with the airport’s management itself.