Traveling by air is, contrary to popular opinion, the safest way to travel. Incidents involving the airplane itself usually occur before take-off or during landing. Nonetheless, delays still leave hundreds of flyers frustrated.
In this assignment, a data set of the number of on-time and delayed flights for ALASKA and AM WEST is given for five cities. This information is used to investigate the following questions:
The following packages were used for data analysis and visualization.
library("tidyr")
library("dplyr")
library("kableExtra")
library("ggplot2")
The data was recorded in a .csv file and imported into R from my GitHub repository (where you can also find the Rmd for this page). As shown below, the data is not in a form that is easy to analyze, so the data set needed to be tidied.
theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/HomeWork/HW4/TTD.csv"
untidyraw <- data.frame(read.csv(file = theURL, header = TRUE, sep = ","))
A couple of data manipulations were required before the data set was presentable and suitable for use.
# Remove the empty third row (all NA)
untidy <- untidyraw[-c(3),]
# Fill in the airline name on each 'delayed' row from the 'on time' row above it
untidy[c(2,4),1] <- untidy[c(1,3),1]
# Column and row renaming
colnames(untidy)[c(1,2)] <- c('Airline','Status')
rownames(untidy) <- 1:nrow(untidy)
# Gather the data by city, then spread it by flight status.
tidy <- spread(gather(untidy, "Destination", "Time", 3:7), "Status", "Time")
# Convert character data into factor
tidy <- mutate_if(tidy, is.character, as.factor)
tidy$Airline <- factor(tidy$Airline)
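gather() and spread() still work, but current tidyr releases mark them as superseded. As a minimal sketch, assuming tidyr 1.0.0 or later, the same reshaping can be written with pivot_longer() and pivot_wider(). The data frame `untidy_demo` is a hypothetical stand-in that hard-codes the cleaned counts so the snippet runs without the download.

```r
library(tidyr)

# Hypothetical stand-in for the cleaned `untidy` data frame (counts copied from the CSV)
untidy_demo <- data.frame(
  Airline       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status        = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497, 62, 694, 117),
  Phoenix       = c(221, 12, 4840, 415),
  San.Diego     = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle       = c(1841, 305, 201, 61)
)

# Long format: one row per airline, status and destination
long <- pivot_longer(untidy_demo, cols = 3:7,
                     names_to = "Destination", values_to = "Time")

# Wide by status: one row per airline and destination
tidy_demo <- pivot_wider(long, names_from = "Status", values_from = "Time")
```

Unlike spread(), which orders the new columns alphabetically, pivot_wider() keeps the status columns in order of first appearance (on time before delayed), so any code that selects them by position should be checked.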
Here is the untidy, raw data set and what it looks like after the tidying and transformations.
| X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| | delayed | 62 | 12 | 20 | 102 | 305 |
| | | NA | NA | NA | NA | NA |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| | delayed | 117 | 415 | 65 | 129 | 61 |

| Airline | Destination | delayed | on time |
|---|---|---|---|
| ALASKA | Los.Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San.Diego | 20 | 212 |
| ALASKA | San.Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los.Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4840 |
| AM WEST | San.Diego | 65 | 383 |
| AM WEST | San.Francisco | 129 | 320 |
| AM WEST | Seattle | 61 | 201 |
At this step, the new variables needed for the analysis were determined and added to the tidy flight data frame. These include:
# Total number of flights
tidy$TotalFlights <- rowSums(tidy[,c(3, 4)])
# Probability of on-time flight status
tidy$ProbOnTime <- round((tidy$`on time`/ (tidy$TotalFlights)), digits = 3)
# Probability of delayed flight status
tidy$ProbDelay <- round((tidy$delayed/ (tidy$TotalFlights)), digits = 3)
The airline with the higher proportion of on-time flights has the better flight-time record. The results suggest that the airline with the lowest probability of delayed flights was ALASKA, with an average delay probability of 11.2% (the unweighted mean of its five per-destination probabilities), compared with 17.8% for AM WEST.
# Average probability of delayed flights
aggregate(ProbDelay ~ Airline, tidy, mean)
## Airline ProbDelay
## 1 ALASKA 0.1120
## 2 AM WEST 0.1776
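As a check on the averages reported above, the sketch below recomputes them from the counts in the tidy table using base R only. The data frame `flights` and the column name `on_time` are hypothetical stand-ins for `tidy` and its `on time` column.

```r
# Counts copied from the tidy table; `flights` is a hypothetical stand-in for `tidy`
flights <- data.frame(
  Airline     = rep(c("ALASKA", "AM WEST"), each = 5),
  Destination = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                      "San.Francisco", "Seattle"), times = 2),
  delayed     = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61),
  on_time     = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201)
)

# Same derived variables as in the analysis
flights$TotalFlights <- flights$delayed + flights$on_time
flights$ProbDelay    <- round(flights$delayed / flights$TotalFlights, digits = 3)

# Unweighted mean of the five per-destination delay probabilities
avg_delay <- tapply(flights$ProbDelay, flights$Airline, mean)
```

Note that this is an unweighted mean over destinations: each city counts equally regardless of how many flights it had. Pooling the counts across destinations would weight busy routes more heavily and can give a different figure.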
Moreover, Figure 1 compares the proportion of delayed flights for ALASKA and AM WEST at each destination. Both airlines have their highest proportion of delayed flights when travelling to San Francisco and their lowest when travelling to Phoenix. Nonetheless, ALASKA had a lower proportion of delayed flights than AM WEST at every destination.
What made flights to Phoenix less likely to be delayed than flights to San Francisco? Numerous factors could account for this, such as weather, the economy, etc. Using the count data available, however, it can be investigated whether the expected count of delayed flights is influenced by the destination, based on each airline's flight records.
A Poisson regression model was fitted to the number of delayed flights, with airline, destination, and total flights as predictors. The response variable is the number of delayed flights recorded at each of the five Destination levels for each of the two Airline categories. The difference in the total number of flights is accounted for by including log(TotalFlights) as an offset (the exposure variable).
The results below show that the model fits reasonably well: the goodness-of-fit chi-squared test on the residual deviance is not statistically significant (p = 0.74 > 0.05). For the indicator variable Destination, four cities are compared against Los Angeles, the reference category; San Diego is the only destination whose coefficient is not significant at the 5% level. Specifically, the expected log count for Phoenix is 0.66 lower than the expected log count for Los Angeles, and the expected log count for San Francisco is 0.59 higher than the reference, holding airline and total flights fixed. To determine whether Destination, overall, is statistically significant, the model was compared to another without Destination.
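Written out, a Poisson model with a log link and an offset takes the form below. The symbols are introduced here for illustration only and are not the author's notation: $\mu_{ij}$ is the expected number of delayed flights for airline $i$ at destination $j$, $n_{ij}$ is the corresponding TotalFlights, and $\alpha_i$ and $\delta_j$ are the airline and destination effects, each set to zero for its reference level.

$$
\log \mu_{ij} = \log n_{ij} + \beta_0 + \alpha_i + \delta_j
\quad\Longleftrightarrow\quad
\log \frac{\mu_{ij}}{n_{ij}} = \beta_0 + \alpha_i + \delta_j
$$

Because the offset enters with a fixed coefficient of 1, the remaining coefficients describe the log of the delay rate per flight rather than the raw count.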
# Poisson Regression of delayed flights on the predictors
model <- glm(delayed ~ offset(log(TotalFlights)) + Destination + Airline, family=poisson(link=log), data = tidy)
summary(model)
##
## Call:
## glm(formula = delayed ~ offset(log(TotalFlights)) + Destination +
## Airline, family = poisson(link = log), data = tidy)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7
## 0.98591 0.08384 -0.23311 -0.44964 -0.11775 -0.67399 -0.01414
## 8 9 10
## 0.13228 0.41072 0.26693
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.32689 0.09163 -25.394 < 2e-16 ***
## DestinationPhoenix -0.66354 0.09152 -7.250 4.16e-13 ***
## DestinationSan.Diego -0.07243 0.13180 -0.550 0.582595
## DestinationSan.Francisco 0.59083 0.10029 5.891 3.83e-09 ***
## DestinationSeattle 0.38258 0.09989 3.830 0.000128 ***
## AirlineAM WEST 0.45247 0.07625 5.934 2.95e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 219.1238 on 9 degrees of freedom
## Residual deviance: 1.9613 on 4 degrees of freedom
## AIC: 76.264
##
## Number of Fisher Scoring iterations: 3
with(model, cbind(res.deviance = deviance, df = df.residual,
p = pchisq(deviance, df.residual, lower.tail=FALSE)))
## res.deviance df p
## [1,] 1.961347 4 0.7428683
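Coefficients on the log scale are easier to read once exponentiated into rate ratios. The following is a minimal sketch, assuming the counts in the tidy table; `flights` and `on_time` are hypothetical stand-ins for `tidy` and its `on time` column. It refits the same model, converts the coefficients to rate ratios, and repeats the deviance goodness-of-fit test.

```r
# Counts copied from the tidy table; `flights` is a hypothetical stand-in for `tidy`
flights <- data.frame(
  Airline     = factor(rep(c("ALASKA", "AM WEST"), each = 5)),
  Destination = factor(rep(c("Los.Angeles", "Phoenix", "San.Diego",
                             "San.Francisco", "Seattle"), times = 2)),
  delayed     = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61),
  on_time     = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201)
)
flights$TotalFlights <- flights$delayed + flights$on_time

# Same model as in the analysis: Poisson counts with log(TotalFlights) as offset
fit <- glm(delayed ~ offset(log(TotalFlights)) + Destination + Airline,
           family = poisson(link = "log"), data = flights)

# Rate ratios: exp(coefficient) is the multiplicative change in the delay rate
# relative to the reference levels (Los Angeles, ALASKA)
rate_ratios <- exp(coef(fit))

# Goodness of fit: residual deviance compared with a chi-squared distribution
gof_p <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
```

Read this way, exp(-0.66) is about 0.52, so the Phoenix delay rate is roughly half the Los Angeles rate, while exp(0.59) is about 1.8 for San Francisco.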
Here, a test of the overall effect of Destination is done by comparing the deviance of the full model with the deviance of the model excluding Destination. The four degree-of-freedom chi-square test indicates that Destination, taken together, is a statistically significant predictor of delayed.
# Update Model 1 by dropping the destination
model2 <- update(model, . ~ . - Destination)
# Test the model differences with Chi-Square test
anova(model2, model, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: delayed ~ Airline + offset(log(TotalFlights))
## Model 2: delayed ~ offset(log(TotalFlights)) + Destination + Airline
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 8 207.371
## 2 4 1.961 4 205.41 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
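The test statistic in the table is simply the drop in residual deviance between the two models, referred to a chi-squared distribution with degrees of freedom equal to the number of parameters removed. A small sketch of that arithmetic, using the values printed in the table (the variable names are illustrative):

```r
# Residual deviances and degrees of freedom copied from the analysis of deviance table
dev_reduced <- 207.371; df_reduced <- 8   # model without Destination
dev_full    <- 1.961;   df_full    <- 4   # model with Destination

lr_stat <- dev_reduced - dev_full         # drop in deviance
lr_df   <- df_reduced - df_full           # number of Destination indicators removed
lr_p    <- pchisq(lr_stat, lr_df, lower.tail = FALSE)
```

The resulting p-value is far below the `2.2e-16` threshold that R prints, which matches the table.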
Lastly, a graph comparing the observed ("Before") probability of delayed flights with the model-predicted ("After") probability indicates that AM WEST is still predicted to have the most delayed flights for every destination if both airlines fly the same number of times. Specifically, for Seattle, San Francisco and San Diego, the model predicts slightly more delays for ALASKA than its actual counts show, and slightly fewer for AM WEST. Moreover, AM WEST is expected to have similar delays to those observed when flying to Phoenix, while ALASKA has a better chance of being on time; this is similar for flights to Los Angeles.
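The code behind this comparison is not shown, so the following is only a sketch of one way such observed and predicted probabilities could be obtained, not necessarily the author's method. `flights`, `on_time`, `same_n` and the choice of 1000 flights are all assumptions made for illustration.

```r
# Counts copied from the tidy table; `flights` is a hypothetical stand-in for `tidy`
flights <- data.frame(
  Airline     = factor(rep(c("ALASKA", "AM WEST"), each = 5)),
  Destination = factor(rep(c("Los.Angeles", "Phoenix", "San.Diego",
                             "San.Francisco", "Seattle"), times = 2)),
  delayed     = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61),
  on_time     = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201)
)
flights$TotalFlights <- flights$delayed + flights$on_time

fit <- glm(delayed ~ offset(log(TotalFlights)) + Destination + Airline,
           family = poisson(link = "log"), data = flights)

# "Before": observed proportion of delayed flights
flights$ObservedProb <- flights$delayed / flights$TotalFlights
# "After": fitted count divided by the number of flights
flights$PredictedProb <- fitted(fit) / flights$TotalFlights

# Expected delays if every airline-destination pair flew the same number of times
# (1000 is an arbitrary illustrative value)
same_n <- transform(flights, TotalFlights = 1000)
flights$DelaysPer1000 <- predict(fit, newdata = same_n, type = "response")
```

Because the model has no airline-by-destination interaction, the predicted AM WEST rate is the same multiple of the ALASKA rate at every destination, namely the exponentiated AM WEST coefficient.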
In conclusion, while both airlines have their highest proportion of delayed flights when travelling to San Francisco and their lowest when travelling to Phoenix, ALASKA has a lower proportion of delayed flights than AM WEST for each destination. The Poisson regression further suggests that flight destination is a statistically significant predictor of delayed flights. When the model's predictions were examined, ALASKA is predicted to see slightly more delayed flights than observed for the Seattle, San Francisco and San Diego destinations, unlike for the other two destinations, while AM WEST is not; the predicted probability of delayed flights is nevertheless still higher for AM WEST at every destination.
Tinsley, Howard E. Handbook of Applied Multivariate Statistics and Mathematical Modeling. Academic Press, 2006.