For Assignment 2, I decided to look at airports with poor performance indicators: high cancellation rates and frequent delays. You always hear people talking about which airport is the worst to fly out of, so I thought it would be interesting to see what the data says.
I will focus on three main performance indicators to reach my conclusion:
* Airports of origin with the highest proportion of cancelled flights.
* Airports with the most delays.
* Airports with the longest delays.
I began by formatting the data, much as we did for the Maine flight data in this week's Lecture Notes.
I then checked the column headers to confirm that my reformatting and naming worked correctly.
```
## [1] "FlightDate" "Origin" "OriginCityName" "CRSDepTime"
## [5] "DepTime" "WheelsOff" "WheelsOn" "CRSArrTime"
## [9] "ArrTime" "Cancelled" "Diverted" "SchedDepTime"
## [13] "SchedArrTime" "New_DepTime" "New_ArrTime" "WheelsUp"
## [17] "WheelsDown" "SchedSArrTime"
```
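The header list suggests that new time columns such as `SchedDepTime` were derived from the raw ones such as `CRSDepTime`. The exact formatting steps are not shown here, so the following is only a minimal sketch of one way this could be done. It assumes the raw times are stored as HHMM integers; the helper `hhmm_to_time` and the toy values are hypothetical, and the time zone is fixed to UTC purely to keep the example deterministic.

```r
# Toy rows standing in for the raw flight table (values are made up).
flights <- data.frame(
  FlightDate = c("2015-01-01", "2015-01-01"),
  CRSDepTime = c(1435L, 905L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: combine a date with an HHMM integer into a date-time.
# sprintf("%04d", ...) pads 905 to "0905" so that %H%M can parse it.
hhmm_to_time <- function(date, hhmm) {
  as.POSIXct(paste(date, sprintf("%04d", as.integer(hhmm))),
             format = "%Y-%m-%d %H%M", tz = "UTC")
}

flights$SchedDepTime <- hhmm_to_time(flights$FlightDate, flights$CRSDepTime)

# Confirm the new column exists under the expected name.
names(flights)
```

Calling `names()` after each reformatting step is a cheap way to catch a misspelled or duplicated column before it is used later.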
To start, I looked at flight cancellations, an obvious measure of poor performance in the minds of travelers.
The table below shows the top ten airports of origin, ranked by the proportion of flights cancelled.
| Origin | perc_cancel |
|---|---|
| SUN | 0.3565217 |
| MMH | 0.3125000 |
| EKO | 0.1132075 |
| IAD | 0.0985416 |
| EWR | 0.0966456 |
| DCA | 0.0944625 |
| ADQ | 0.0930233 |
| RIC | 0.0883117 |
| BWI | 0.0878708 |
| LBE | 0.0860215 |
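A ranking like the one above can be sketched with dplyr as follows. This assumes `Cancelled` is coded as 0/1, so that its mean per airport is the proportion of cancelled flights; the toy data and the extra flight count `n` are illustrative only and not taken from the assignment data.

```r
library(dplyr)

# Toy data standing in for the flight table (values are made up).
flights <- data.frame(
  Origin    = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "CCC"),
  Cancelled = c(1, 0, 0, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

cancels_airport <- flights %>%
  group_by(Origin) %>%
  summarise(
    perc_cancel = mean(Cancelled),  # proportion of flights cancelled
    n = n()                         # number of flights behind that proportion
  ) %>%
  arrange(desc(perc_cancel)) %>%
  head(10)

cancels_airport
```

Keeping `n` alongside the proportion makes it easy to see whether an airport ranks highly because of only a handful of flights.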
For the delay indicator, I want to look at two factors: the number of delays and the length of each delay. First, I calculated the delay, in minutes, for every flight that was not cancelled.
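The delay calculation could look roughly like this. It is a sketch that assumes `SchedDepTime` and `New_DepTime` are date-time (POSIXct) columns and that `Cancelled` is coded 0/1; the column name `dep_delay_min` and the toy values are hypothetical.

```r
library(dplyr)

# Toy data standing in for the formatted flight table (values are made up).
flights <- data.frame(
  Origin       = c("AAA", "AAA", "BBB"),
  Cancelled    = c(0, 1, 0),
  SchedDepTime = as.POSIXct(c("2015-01-01 09:00", "2015-01-01 10:00",
                              "2015-01-01 11:00"), tz = "UTC"),
  New_DepTime  = as.POSIXct(c("2015-01-01 10:15", NA,
                              "2015-01-01 11:05"), tz = "UTC"),
  stringsAsFactors = FALSE
)

delays <- flights %>%
  filter(Cancelled == 0) %>%  # cancelled flights have no departure time
  mutate(dep_delay_min = as.numeric(
    difftime(New_DepTime, SchedDepTime, units = "mins")
  ))

delays$dep_delay_min
```

Setting `units = "mins"` explicitly matters, because `difftime()` otherwise chooses its own units based on the size of the differences.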
I wanted to visualize the airports with the most long delays, so I flagged flights delayed by more than one hour and ranked the airports by the proportion of such flights.
| Origin | perc_delays |
|---|---|
| OTH | 0.1764706 |
| MQT | 0.1698113 |
| SWF | 0.1538462 |
| APN | 0.1458333 |
| CMX | 0.1403509 |
| SMX | 0.1403509 |
| EUG | 0.1267606 |
| GTR | 0.1250000 |
| LGB | 0.1229730 |
| COD | 0.1212121 |
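The proportion of long delays per airport can be sketched in the same way as the cancellation ranking. This assumes a numeric delay column in minutes (called `dep_delay_min` here, a hypothetical name) and takes non-cancelled flights as the denominator, which may differ from the calculation actually used for the table above.

```r
library(dplyr)

# Toy per-flight delays in minutes (values are made up).
delays <- data.frame(
  Origin        = c("AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  dep_delay_min = c(75, 5, -3, 120, 10, 61, 90),
  stringsAsFactors = FALSE
)

Delay_airport <- delays %>%
  group_by(Origin) %>%
  # mean() of a logical vector gives the share of TRUE values
  summarise(perc_delays = mean(dep_delay_min > 60)) %>%
  arrange(desc(perc_delays)) %>%
  head(10)

Delay_airport
```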
I wanted to compare the two sets of departure airports to see which ones appear in both the top ten for cancellations and the top ten for flights delayed more than 60 minutes.
I first did an inner_join of the two tables, but it showed that no airports ranked in the top ten worst for both categories.
```r
library(dplyr)
library(ggvis)

# Keep only airports that appear in both top-ten tables
BadAirports <- inner_join(Delay_airport, cancels_airport, by = "Origin")
BadAirports %>% ggvis(~perc_delays, ~perc_cancel) %>%
  layer_points(fill = ~factor(Origin), size := 600, opacity := 1) %>%
  add_axis("x", title = "Percentage of Flights Delayed Greater than 60 Minutes", title_offset = 50) %>%
  add_axis("y", title = "Percentage of Flights Cancelled", title_offset = 50)
```
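The empty join can be confirmed directly from the two top-ten tables. The airport codes below are copied from those tables; the vector names are only illustrative.

```r
# Airport codes copied from the two top-ten tables above.
cancel_top10 <- c("SUN", "MMH", "EKO", "IAD", "EWR",
                  "DCA", "ADQ", "RIC", "BWI", "LBE")
delay_top10  <- c("OTH", "MQT", "SWF", "APN", "CMX",
                  "SMX", "EUG", "GTR", "LGB", "COD")

# Codes that appear in both lists; an inner join on Origin keeps only these.
overlap <- intersect(cancel_top10, delay_top10)
length(overlap)  # 0: no airport is in both top tens
```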
Because the join returned no rows, the chart came out empty. I re-assessed and decided to create a similar chart using all of the airports of origin in both categories. This helps to illustrate the top ten worst for delays and for cancellations in one visual.
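One way to build the combined data for such a chart is to join the per-airport summaries before cutting them down to the top ten, then flag which airports rank among the worst in each category. This is a sketch under assumptions: `cancels_all`, `delays_all`, `top_n_each`, and the `group` labels are hypothetical names, the values are made up, and a cutoff of 2 stands in for the top ten.

```r
library(dplyr)

# Hypothetical per-airport summaries covering ALL airports (made-up values).
cancels_all <- data.frame(Origin = c("AAA", "BBB", "CCC", "DDD"),
                          perc_cancel = c(0.30, 0.02, 0.10, 0.01),
                          stringsAsFactors = FALSE)
delays_all  <- data.frame(Origin = c("AAA", "BBB", "CCC", "DDD"),
                          perc_delays = c(0.05, 0.15, 0.12, 0.02),
                          stringsAsFactors = FALSE)

top_n_each <- 2  # stands in for the top ten on this toy data

all_airports <- inner_join(cancels_all, delays_all, by = "Origin") %>%
  mutate(
    top_cancel = min_rank(desc(perc_cancel)) <= top_n_each,
    top_delay  = min_rank(desc(perc_delays)) <= top_n_each,
    group = case_when(
      top_cancel & top_delay ~ "both",
      top_cancel             ~ "cancellations",
      top_delay              ~ "delays",
      TRUE                   ~ "neither"
    )
  )

# Optional plot, built only if ggvis is installed; print(p) to render it.
if (requireNamespace("ggvis", quietly = TRUE)) {
  p <- all_airports %>%
    ggvis::ggvis(~perc_delays, ~perc_cancel) %>%
    ggvis::layer_points(fill = ~group)
}
```

Coloring by `group` rather than by `Origin` keeps the legend short when every airport is plotted.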