I want to choose my travel time and airline wisely, so that I have the best chance of avoiding delays and arriving at my destination on time. To that end, I analyzed the hflights data, using arrival delay as the KPI:
#load data
library("hflights")
## Warning: package 'hflights' was built under R version 3.2.1
#change column data type
hflights$UniqueCarrier<-as.factor(hflights$UniqueCarrier)
hflights$DayOfWeek<-as.factor(hflights$DayOfWeek)
#create and append flight date field (Year-Month-DayofMonth)
FlightDate<-paste(hflights$Year, hflights$Month, hflights$DayofMonth,sep="-")
data<-cbind(hflights,FlightDate)
#select rows where flights are delayed
del<-data[which(data$ArrDelay >0),-(1:3)]
#subset columns of interest
sub<-subset(del, select=c(DayOfWeek,UniqueCarrier,ArrDelay,DepDelay,AirTime,FlightDate))
#summary statistics for data analysis
summary(sub)
##  DayOfWeek UniqueCarrier      ArrDelay         DepDelay
##  1:16771   XE     :35429   Min.   :  1.00   Min.   :-15.00
##  2:14243   CO     :34044   1st Qu.:  5.00   1st Qu.:  0.00
##  3:13933   WN     :20685   Median : 12.00   Median :  9.00
##  4:17256   OO     : 8443   Mean   : 24.28   Mean   : 21.04
##  5:16483   MQ     : 1665   3rd Qu.: 27.00   3rd Qu.: 26.00
##  6:13261   US     : 1317   Max.   :978.00   Max.   :981.00
##  7:14973   (Other): 5337
##     AirTime          FlightDate
##  Min.   : 22.0   2011-5-20:   565
##  1st Qu.: 60.0   2011-6-22:   564
##  Median :109.0   2011-3-14:   558
##  Mean   :111.5   2011-4-4 :   543
##  3rd Qu.:145.0   2011-6-21:   527
##  Max.   :549.0   2011-4-25:   520
##                  (Other)  :103643
1. Histogram of Arrival Delays by day of week:
#qplot(UniqueCarrier, data=sub, geom="bar", fill=DayOfWeek)
#ggplot(sub, aes(UniqueCarrier)) + geom_freqpoly(aes(group = DayOfWeek, colour = DayOfWeek))
#Stacked bars are easy, but might be overloaded with information. Faceting might be a better solution.
Congestion tends to happen on weekends and midweek, although there are variations for airlines with smaller volume.
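The faceting idea mentioned in the commented-out code could look roughly like the sketch below. It is only an illustration, not the code behind the original figure: it assumes the ggplot2 package is installed, rebuilds the delayed-flights subset from hflights under the name `delayed` (a name introduced here), and uses free y-axes so that low-volume carriers remain readable.

```r
library(hflights)
library(ggplot2)

# Same filter as above: keep only flights that arrived late
delayed <- subset(hflights, ArrDelay > 0)
delayed$DayOfWeek <- factor(delayed$DayOfWeek)

# One panel per carrier instead of stacked bars;
# free y-axes so small carriers are not flattened by XE and CO
p <- ggplot(delayed, aes(x = DayOfWeek)) +
  geom_bar() +
  facet_wrap(~ UniqueCarrier, scales = "free_y") +
  labs(x = "Day of week (as coded in hflights)", y = "Delayed arrivals")
print(p)
```

With a shared y-axis the panels would be directly comparable in volume, at the cost of hiding the day-of-week pattern for the smaller carriers.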
2. Boxplot of Delay Time by airline:
Most of the delays are short (median = 12 min, mean = 24 min, and three quarters are 27 min or less), although there are significant outliers with certain airlines.
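A boxplot along these lines can be sketched in base R as follows. This is an assumption about how such a plot might be drawn, not the original plotting code; the log scale on the y-axis is a choice made here so that the many short delays stay visible next to the long outliers, and `delayed` and `med_by_carrier` are names introduced for this sketch.

```r
library(hflights)

# Delayed arrivals only, as in the analysis above
delayed <- subset(hflights, ArrDelay > 0)

# Median delay per carrier: more robust than the mean for right-skewed data
med_by_carrier <- tapply(delayed$ArrDelay, delayed$UniqueCarrier, median)
print(sort(med_by_carrier))

# Boxplot of arrival delay by carrier; ArrDelay > 0, so a log axis is safe
boxplot(ArrDelay ~ UniqueCarrier, data = delayed, log = "y",
        xlab = "Carrier", ylab = "Arrival delay (minutes, log scale)")
```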
3. Evaluate airline performance:
Simply counting the delayed flights per airline is misleading, because the bigger airlines operate the majority of flights and therefore likely account for most of the delays. Therefore, I created calculated fields for analysis:
#calculate number of delayed flights by carrier
library("sqldf")
ucd<-sqldf("select UniqueCarrier, count(ArrDelay) as delays from hflights where ArrDelay >0 group by UniqueCarrier")
#calculate number of total flights by carrier, if arrival information exists
uc<-sqldf("select UniqueCarrier, count(ArrDelay) as flights from hflights where ArrDelay is not null group by UniqueCarrier")
#merge the results and create calculated field: %of delayed flights
library(dplyr)
stats<-merge(ucd,uc,all=TRUE)
ratio<-with(stats, 100*delays/flights)
stats<-cbind(stats,ratio)
stats
##    UniqueCarrier delays flights    ratio
## 1             AA    963    3178 30.30208
## 2             AS    159     364 43.68132
## 3             B6    266     673 39.52452
## 4             CO  34044   69373 49.07385
## 5             DL   1003    2591 38.71092
## 6             EV    780    2121 36.77511
## 7             F9    463     832 55.64904
## 8             FL    657    2111 31.12269
## 9             MQ   1665    4504 36.96714
## 10            OO   8443   15781 53.50105
## 11            UA   1009    2033 49.63109
## 12            US   1317    4030 32.67990
## 13            WN  20685   44536 46.44557
## 14            XE  35429   71669 49.43420
## 15            YV     37      78 47.43590
Bar chart of percentage of delayed flights, organized by airline size; number of flights labeled:
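A chart of this kind can be approximated with base graphics using only the counts printed in the table above. The sketch below is an assumption about the layout (carriers ordered from fewest to most flights, flight counts written above each bar) and is not the original plotting code; `by_size`, `mid`, `best`, and `worst` are names introduced here.

```r
# Counts copied from the stats table above
stats <- data.frame(
  UniqueCarrier = c("AA", "AS", "B6", "CO", "DL", "EV", "F9", "FL",
                    "MQ", "OO", "UA", "US", "WN", "XE", "YV"),
  delays  = c(963, 159, 266, 34044, 1003, 780, 463, 657,
              1665, 8443, 1009, 1317, 20685, 35429, 37),
  flights = c(3178, 364, 673, 69373, 2591, 2121, 832, 2111,
              4504, 15781, 2033, 4030, 44536, 71669, 78),
  stringsAsFactors = FALSE
)
stats$ratio <- 100 * stats$delays / stats$flights

# Order carriers from smallest to largest by number of flights
by_size <- stats[order(stats$flights), ]

# barplot() returns the bar midpoints, reused to place the labels
mid <- barplot(by_size$ratio, names.arg = by_size$UniqueCarrier,
               ylim = c(0, 65), las = 2,
               xlab = "Carrier (ordered by number of flights)",
               ylab = "Delayed arrivals (%)")
text(mid, by_size$ratio, labels = by_size$flights, pos = 3, cex = 0.7)

# Carriers with the lowest and highest delay rate
best  <- stats$UniqueCarrier[which.min(stats$ratio)]
worst <- stats$UniqueCarrier[which.max(stats$ratio)]
```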
AA has the lowest arrival delay rate and F9 the highest. Notice that, from a certain point on, the delay rate tends to increase as the volume of flights grows. This can also be seen in the following scatter plot:
4. (Is it obvious?) Scatter plot of arrival delay and departure delay colored by airline:
There is a strong positive correlation between departure delays and arrival delays, suggesting that the delays were likely caused by airport congestion before takeoff rather than by time lost in the air.
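The relationship can also be checked numerically. The following is a minimal sketch, assuming the same delayed-arrival filter as above; `delayed`, `r`, and `en_route` are names introduced here. It computes the Pearson correlation and the difference between arrival and departure delay, which shows how many minutes were gained or lost after leaving the gate.

```r
library(hflights)

# Delayed arrivals with a recorded departure delay
delayed <- subset(hflights, ArrDelay > 0 & !is.na(DepDelay))

# Pearson correlation between departure delay and arrival delay
r <- cor(delayed$DepDelay, delayed$ArrDelay)
print(r)

# Minutes lost (positive) or recovered (negative) between departure and arrival
en_route <- delayed$ArrDelay - delayed$DepDelay
print(summary(en_route))
```

If most of the arrival delay is already present at departure, `en_route` should be small compared with `ArrDelay` itself.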
My conclusion is that I want to avoid weekends; Tuesday or Thursday may be best. I would also choose to fly US, FL, or AA: they have a decent number of flights but a low delay rate.