Assignment 2: Airports You Want to Avoid Departing From

For Assignment 2, I decided to look at airports with poor performance indicators: high cancellation rates and frequent or long delays. You always hear people talking about an airport that is the worst to fly out of, so I thought it would be interesting to see what the data says.

I will focus on three main performance indicators to reach my conclusion:
* Origin airports with the highest share of cancelled flights.
* Airports with the most delays.
* Airports with the longest delays.

I began by formatting the data, much as we did for the Maine flight data in this week's lecture notes.

I then checked the column headers to confirm that my reformatting and renaming had worked correctly.

##  [1] "FlightDate"     "Origin"         "OriginCityName" "CRSDepTime"    
##  [5] "DepTime"        "WheelsOff"      "WheelsOn"       "CRSArrTime"    
##  [9] "ArrTime"        "Cancelled"      "Diverted"       "SchedDepTime"  
## [13] "SchedArrTime"   "New_DepTime"    "New_ArrTime"    "WheelsUp"      
## [17] "WheelsDown"     "SchedSArrTime"

Identifying Bad Airports by Cancellations

To start, I looked at flight cancellations, an obvious sign of poor performance in the minds of travelers.

The table below shows the top ten origin airports, ranked by the proportion of their flights that were cancelled (perc_cancel).

Origin perc_cancel
SUN 0.3565217
MMH 0.3125000
EKO 0.1132075
IAD 0.0985416
EWR 0.0966456
DCA 0.0944625
ADQ 0.0930233
RIC 0.0883117
BWI 0.0878708
LBE 0.0860215
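A ranking like this can be produced with a grouped summary. The following is only a minimal sketch on made-up data, not the code used for the table above: the `flights` data frame, its airport codes, and its values are hypothetical, and it assumes `Cancelled` is coded 0/1 so that its mean is the share of cancelled flights.

```r
library(dplyr)

# Hypothetical toy data: one row per flight, Cancelled coded 0/1 (assumption).
flights <- data.frame(
  Origin    = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC"),
  Cancelled = c(1, 0, 0, 0, 1, 1, 0, 0, 0, 0)
)

cancels_airport <- flights %>%
  group_by(Origin) %>%
  summarise(perc_cancel = mean(Cancelled)) %>%  # share of flights cancelled
  arrange(desc(perc_cancel)) %>%                # worst airports first
  head(10)                                      # keep at most the ten worst

print(cancels_airport)
```

Because the ranking uses a proportion rather than a count, an airport with very few flights can reach the top of the list after only a handful of cancellations.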

Identifying Bad Airports by Delays

For this performance indicator I want to look at two factors: the number of delays and the length of each delay. First, I calculated the delay, in minutes, for every flight that was not cancelled.

I wanted to create a visual of the airports with the most long delays, so I flagged flights delayed by more than one hour and ranked the airports by the proportion of their flights that were delayed that long (perc_delays).

Origin perc_delays
OTH 0.1764706
MQT 0.1698113
SWF 0.1538462
APN 0.1458333
CMX 0.1403509
SMX 0.1403509
EUG 0.1267606
GTR 0.1250000
LGB 0.1229730
COD 0.1212121
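The two steps described above (a delay in minutes for non-cancelled flights, then the share of flights delayed more than 60 minutes) could be sketched as follows. This is an illustration under assumptions, not the code behind the table: the toy `flights` data and the `dep_delay` column are hypothetical, the delay is assumed to be measured at departure, and `SchedDepTime` and `New_DepTime` are assumed to be POSIXct date-times.

```r
library(dplyr)

# Hypothetical toy data; column names follow the headers printed earlier,
# and both time columns are assumed to be POSIXct date-times.
fmt <- "%Y-%m-%d %H:%M"
flights <- data.frame(
  Origin       = c("AAA", "AAA", "BBB", "BBB", "BBB"),
  Cancelled    = c(0, 0, 0, 0, 1),
  SchedDepTime = as.POSIXct(c("2014-01-01 08:00", "2014-01-01 09:00",
                              "2014-01-01 10:00", "2014-01-01 11:00",
                              "2014-01-01 12:00"), tz = "UTC", format = fmt),
  New_DepTime  = as.POSIXct(c("2014-01-01 08:10", "2014-01-01 10:30",
                              "2014-01-01 10:00", "2014-01-01 11:30",
                              NA), tz = "UTC", format = fmt)
)

Delay_airport <- flights %>%
  filter(Cancelled == 0) %>%                          # drop cancelled flights
  mutate(dep_delay = as.numeric(
    difftime(New_DepTime, SchedDepTime, units = "mins"))) %>%
  group_by(Origin) %>%
  summarise(perc_delays = mean(dep_delay > 60)) %>%   # share delayed over an hour
  arrange(desc(perc_delays))

print(Delay_airport)
```

Passing `units = "mins"` to `difftime` fixes the unit explicitly; otherwise R picks a unit based on the size of the differences, which makes a threshold such as 60 easy to misread.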

Determining the Airports With High Cancellations and Long Delays

I wanted to compare the two sets of departure airports to see which ones appear in both the top ten for cancellations and the top ten for flights delayed more than 60 minutes.

I first did an inner_join of the two data sets, but that showed that no airport ranked in the top ten worst for both categories.

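library(dplyr)
library(ggvis)

# Keep only airports present in both tables (Origin assumed to be the only shared column).
BadAirports <- inner_join(Delay_airport, cancels_airport, by = "Origin")

# Scatterplot: share of long delays (x) against share of cancellations (y).
BadAirports %>% ggvis(~perc_delays, ~perc_cancel) %>%
  layer_points(fill = ~factor(Origin), size := 600, opacity := 1) %>%
  add_axis("x", title = "Percentage of Flights Delayed Greater than 60 Minutes", title_offset = 50) %>%
  add_axis("y", title = "Percentage of Flights Cancelled", title_offset = 50)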

Because the join returned no rows, the chart came out empty. I reassessed and decided to create a similar chart using all of the origin airports from both categories, which shows the ten worst airports for delays and the ten worst for cancellations in a single visual.
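The difference between the two approaches can be seen on a small example. This is a sketch under assumptions, with hypothetical airport codes and values, and `full_join` is shown as one possible way to combine the tables rather than as the method used for the final chart.

```r
library(dplyr)

# Hypothetical top-ten style tables with no airport in common.
cancels_airport <- data.frame(Origin = c("AAA", "BBB"),
                              perc_cancel = c(0.30, 0.10))
Delay_airport   <- data.frame(Origin = c("CCC", "DDD"),
                              perc_delays = c(0.17, 0.15))

# inner_join keeps only airports found in both tables.
both <- inner_join(Delay_airport, cancels_airport, by = "Origin")

# full_join keeps every airport found in either table.
either <- full_join(Delay_airport, cancels_airport, by = "Origin")

print(nrow(both))
print(either)
```

With no shared airports, `both` has zero rows, which is why the first chart had nothing to draw. In `either`, an airport that is missing from one table gets `NA` in that table's column, so plotting both measures for every airport requires the rates to be computed for all airports before joining.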