Read Data from CSV
library(tidyr)
library(dplyr)
library(plyr)
library(ggplot2)
data = read.csv('flight_delays.txt')
data
## airline status los.angeles phoenix san.diego san.francisco seattle
## 1 alaska on time 497 221 212 503 1841
## 2 alaska delayed 62 12 20 102 305
## 3 am west on time 694 4840 383 320 201
## 4 am west delayed 117 415 65 129 61
We have two main issues - ‘city’ needs to be moved from columns to a variable, and status should be columns. Let’s start by melting the city columns
data2 <- gather(data,-airline, -status,key = 'city', value = 'frequency')
data2
## airline status city frequency
## 1 alaska on time los.angeles 497
## 2 alaska delayed los.angeles 62
## 3 am west on time los.angeles 694
## 4 am west delayed los.angeles 117
## 5 alaska on time phoenix 221
## 6 alaska delayed phoenix 12
## 7 am west on time phoenix 4840
## 8 am west delayed phoenix 415
## 9 alaska on time san.diego 212
## 10 alaska delayed san.diego 20
## 11 am west on time san.diego 383
## 12 am west delayed san.diego 65
## 13 alaska on time san.francisco 503
## 14 alaska delayed san.francisco 102
## 15 am west on time san.francisco 320
## 16 am west delayed san.francisco 129
## 17 alaska on time seattle 1841
## 18 alaska delayed seattle 305
## 19 am west on time seattle 201
## 20 am west delayed seattle 61
We now need to cast the “status” into two columns. Also let’s fix the .’s in the city names
data3 = spread(data2,key = status,value = 'frequency')
data3$city = sub('\\.',' ',data3$city)
data3
## airline city delayed on time
## 1 alaska los angeles 62 497
## 2 alaska phoenix 12 221
## 3 alaska san diego 20 212
## 4 alaska san francisco 102 503
## 5 alaska seattle 305 1841
## 6 am west los angeles 117 694
## 7 am west phoenix 415 4840
## 8 am west san diego 65 383
## 9 am west san francisco 129 320
## 10 am west seattle 61 201
Let’s look at the overall percentage delayed by airline
ddply(data3, "airline", summarize,
sum_delay = sum(delayed),
sum_ontime = sum(`on time`),
pct_delay = sum(delayed)/(sum(delayed)+sum(`on time`))
)
## airline sum_delay sum_ontime pct_delay
## 1 alaska 501 3274 0.1327152
## 2 am west 787 6438 0.1089273
So, Alaska seems to have more delayed flights overall. But how does it look within the city groups? I’ll add a percentage of delayed flights, and plot the results
data3$delay_percent = data3$delayed/(data3$delayed+data3$'on time')
ggplot(data3, aes(city, delay_percent, fill=airline)) +
geom_bar(position="dodge", stat="identity")
This is showing the opposite conclusion. AM West has more delays in each city. The reason that they have fewer delays overall is that they have a high frequency of flights in a low-delay city (Phoenix).
Let’s create a logistic regression model using ‘city’ and ‘airline’. This should give us an idea of if this difference is statistically significant.
counts = cbind(data3$delayed, data3$`on time`)
model = glm(counts ~ city + airline, data = data3, family=binomial(link="logit"))
summary(model)
##
## Call:
## glm(formula = counts ~ city + airline, family = binomial(link = "logit"),
## data = data3)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7
## 1.17724 0.28349 -0.13773 -0.67617 -0.14750 -0.82155 -0.04766
## 8 9 10
## 0.07989 0.66657 0.35280
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.24338 0.09908 -22.642 < 2e-16 ***
## cityphoenix -0.75498 0.09798 -7.706 1.30e-14 ***
## citysan diego -0.08540 0.14152 -0.603 0.546
## citysan francisco 0.72060 0.11105 6.489 8.66e-11 ***
## cityseattle 0.45474 0.10850 4.191 2.78e-05 ***
## airlineam west 0.54440 0.08426 6.461 1.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 253.4547 on 9 degrees of freedom
## Residual deviance: 3.2166 on 4 degrees of freedom
## AIC: 75.923
##
## Number of Fisher Scoring iterations: 3
As you can see, the dummy variable ‘airlineam west’ has a significant positive effect (P Value around 0.0000000001). Positive in this case means higher rate of delays. Let’s pull in the predictions from this model, then summarize by city to see the effect that airline would have on your chances of a delay. I multiplied the predictions by 100 to make them easier to look at.
data3$prediction = round(predict(model,type = "response")*100,1)
data3
## airline city delayed on time delay_percent prediction
## 1 alaska los angeles 62 497 0.11091234 9.6
## 2 alaska phoenix 12 221 0.05150215 4.8
## 3 alaska san diego 20 212 0.08620690 8.9
## 4 alaska san francisco 102 503 0.16859504 17.9
## 5 alaska seattle 305 1841 0.14212488 14.3
## 6 am west los angeles 117 694 0.14426634 15.5
## 7 am west phoenix 415 4840 0.07897241 7.9
## 8 am west san diego 65 383 0.14508929 14.4
## 9 am west san francisco 129 320 0.28730512 27.3
## 10 am west seattle 61 201 0.23282443 22.4
data4 = select(data3, city, airline, prediction)
data4 = spread(data4,key = airline,value = 'prediction')
data4$diff = data4$`am west`-data4$alaska
data4
## city alaska am west diff
## 1 los angeles 9.6 15.5 5.9
## 2 phoenix 4.8 7.9 3.1
## 3 san diego 8.9 14.4 5.5
## 4 san francisco 17.9 27.3 9.4
## 5 seattle 14.3 22.4 8.1
It depends on the city you’re in, but your expected delay rate is between 3-10% higher if you fly with AM West instead of Alaska.