Read Data from CSV

library(tidyr)
library(dplyr)
library(plyr)
library(ggplot2)

data = read.csv('flight_delays.txt')

data
##   airline  status los.angeles phoenix san.diego san.francisco seattle
## 1  alaska on time         497     221       212           503    1841
## 2  alaska delayed          62      12        20           102     305
## 3 am west on time         694    4840       383           320     201
## 4 am west delayed         117     415        65           129      61

We have two main issues - ‘city’ needs to be moved from columns to a variable, and status should be columns. Let’s start by melting the city columns

data2 <- gather(data,-airline, -status,key = 'city', value = 'frequency')
data2
##    airline  status          city frequency
## 1   alaska on time   los.angeles       497
## 2   alaska delayed   los.angeles        62
## 3  am west on time   los.angeles       694
## 4  am west delayed   los.angeles       117
## 5   alaska on time       phoenix       221
## 6   alaska delayed       phoenix        12
## 7  am west on time       phoenix      4840
## 8  am west delayed       phoenix       415
## 9   alaska on time     san.diego       212
## 10  alaska delayed     san.diego        20
## 11 am west on time     san.diego       383
## 12 am west delayed     san.diego        65
## 13  alaska on time san.francisco       503
## 14  alaska delayed san.francisco       102
## 15 am west on time san.francisco       320
## 16 am west delayed san.francisco       129
## 17  alaska on time       seattle      1841
## 18  alaska delayed       seattle       305
## 19 am west on time       seattle       201
## 20 am west delayed       seattle        61

We now need to cast the “status” into two columns. Also let’s fix the .’s in the city names

data3 = spread(data2,key = status,value = 'frequency')
data3$city = sub('\\.',' ',data3$city)
data3
##    airline          city delayed on time
## 1   alaska   los angeles      62     497
## 2   alaska       phoenix      12     221
## 3   alaska     san diego      20     212
## 4   alaska san francisco     102     503
## 5   alaska       seattle     305    1841
## 6  am west   los angeles     117     694
## 7  am west       phoenix     415    4840
## 8  am west     san diego      65     383
## 9  am west san francisco     129     320
## 10 am west       seattle      61     201

Let’s look at the overall percentage delayed by airline

ddply(data3, "airline", summarize,
 sum_delay = sum(delayed),
 sum_ontime = sum(`on time`),
 pct_delay =  sum(delayed)/(sum(delayed)+sum(`on time`))
 )
##   airline sum_delay sum_ontime pct_delay
## 1  alaska       501       3274 0.1327152
## 2 am west       787       6438 0.1089273

So, Alaska seems to have more delayed flights overall. But how does it look within the city groups? I’ll add a percentage of delayed flights, and plot the results

data3$delay_percent = data3$delayed/(data3$delayed+data3$'on time')

ggplot(data3, aes(city, delay_percent, fill=airline)) + 
       geom_bar(position="dodge", stat="identity")

This is showing the opposite conclusion. AM West has more delays in each city. The reason that they have fewer delays overall is that they have a high frequency of flights in a low-delay city (Phoenix).

Let’s create a logistic regression model using ‘city’ and ‘airline’. This should give us an idea of if this difference is statistically significant.

counts = cbind(data3$delayed, data3$`on time`)

model = glm(counts ~  city + airline, data = data3, family=binomial(link="logit"))
summary(model)
## 
## Call:
## glm(formula = counts ~ city + airline, family = binomial(link = "logit"), 
##     data = data3)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
##  1.17724   0.28349  -0.13773  -0.67617  -0.14750  -0.82155  -0.04766  
##        8         9        10  
##  0.07989   0.66657   0.35280  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.24338    0.09908 -22.642  < 2e-16 ***
## cityphoenix       -0.75498    0.09798  -7.706 1.30e-14 ***
## citysan diego     -0.08540    0.14152  -0.603    0.546    
## citysan francisco  0.72060    0.11105   6.489 8.66e-11 ***
## cityseattle        0.45474    0.10850   4.191 2.78e-05 ***
## airlineam west     0.54440    0.08426   6.461 1.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 253.4547  on 9  degrees of freedom
## Residual deviance:   3.2166  on 4  degrees of freedom
## AIC: 75.923
## 
## Number of Fisher Scoring iterations: 3

As you can see, the dummy variable ‘airlineam west’ has a significant positive effect (P Value around 0.0000000001). Positive in this case means higher rate of delays. Let’s pull in the predictions from this model, then summarize by city to see the effect that airline would have on your chances of a delay. I multiplied the predictions by 100 to make them easier to look at.

data3$prediction = round(predict(model,type = "response")*100,1)

data3
##    airline          city delayed on time delay_percent prediction
## 1   alaska   los angeles      62     497    0.11091234        9.6
## 2   alaska       phoenix      12     221    0.05150215        4.8
## 3   alaska     san diego      20     212    0.08620690        8.9
## 4   alaska san francisco     102     503    0.16859504       17.9
## 5   alaska       seattle     305    1841    0.14212488       14.3
## 6  am west   los angeles     117     694    0.14426634       15.5
## 7  am west       phoenix     415    4840    0.07897241        7.9
## 8  am west     san diego      65     383    0.14508929       14.4
## 9  am west san francisco     129     320    0.28730512       27.3
## 10 am west       seattle      61     201    0.23282443       22.4
data4 = select(data3, city, airline, prediction)

data4 = spread(data4,key = airline,value = 'prediction')
data4$diff = data4$`am west`-data4$alaska
data4
##            city alaska am west diff
## 1   los angeles    9.6    15.5  5.9
## 2       phoenix    4.8     7.9  3.1
## 3     san diego    8.9    14.4  5.5
## 4 san francisco   17.9    27.3  9.4
## 5       seattle   14.3    22.4  8.1

It depends on the city you’re in, but your expected delay rate is between 3-10% higher if you fly with AM West instead of Alaska.