library(openintro)
library(tinytex)
library(tidyverse)
library(stringr)
library(magrittr)
library(gridExtra)
In this project we will transform a csv file into a usable dataframe using TidyR and Dyplr. After the transformation we will conduct some analyses.
We load the data …
dfFlights_raw <- as.data.frame(read.delim("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/Flights.csv", header = TRUE, stringsAsFactors = TRUE, sep=","))
… eliminate the unnecessary rows, and change some column names:
dfFlights <- dfFlights_raw %>%
drop_na(Phoenix) %>%
rename(c(Airline = "X", Status = "X.1")) %>%
mutate_all(list(~na_if(.,"")))
head(dfFlights)
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AMWEST on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
Now we use TidyR to normalize the table - this is an extremely powerful library that does a lot in a couple lines …
dfFlights %<>%
gather("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle", key="City", value="NumOfFlights") %<>%
fill(Airline)
… and that’s all it takes!
Now we can analyse the airlines delays. We start with the mean delay percentage (flights delayed over total flights) for each airline. Amwest appears to have a better track record than Alaska.
dfSummaryStats <- dfFlights %>%
group_by(Airline, City) %>%
summarize(sum_DelayedFlights=sum(NumOfFlights[Status=="delayed"]), sum_Flights = sum(NumOfFlights), propOfDelay=sum(NumOfFlights[Status=="delayed"])/sum(NumOfFlights)*100)
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
dfSummaryStats2 <- dfFlights %>%
group_by(Airline) %>%
summarize(sum_DelayedFlights=sum(NumOfFlights[Status=="delayed"]), sum_Flights = sum(NumOfFlights), propOfDelay=sum(NumOfFlights[Status=="delayed"])/sum(NumOfFlights)*100)
## `summarise()` ungrouping output (override with `.groups` argument)
dfDelaysOnly <- dfFlights %>%
filter(Status=="delayed")
dfDelaysOnly_Amwest <- dfDelaysOnly %>%
filter(Airline=="AMWEST")
dfDelaysOnly_Alaska <- dfDelaysOnly %>%
filter(Status=="ALASKA[")
dfPercentDiff <- dfSummaryStats %>%
select(Airline, City, propOfDelay) %>%
group_by(City) %>%
summarize(propDiff=propOfDelay[Airline=="AMWEST"] - propOfDelay[Airline=="ALASKA"])
## `summarise()` ungrouping output (override with `.groups` argument)
Here are the numbers showing proportion of delayed flights per airline:
print(dfSummaryStats2)
## # A tibble: 2 x 4
## Airline sum_DelayedFlights sum_Flights propOfDelay
## <fct> <int> <int> <dbl>
## 1 ALASKA 501 3775 13.3
## 2 AMWEST 787 7225 10.9
However, when we look at the percentage of delayed flights by city, Alaska is the clear winner, beating Amwest in every city:.
g3<-ggplot(dfDelaysOnly, aes(x=City, y=NumOfFlights, group=Airline, fill=Airline)) +
geom_col(position = position_dodge()) +
ggtitle("Number of Delays By Airline By City") +
theme(axis.text.x = element_text(angle = 90))
g4<-ggplot(dfSummaryStats, aes(x=City, y=propOfDelay, group=Airline, fill=Airline)) +
geom_col(position = position_dodge()) +
ggtitle("% of Delays By Airline By City") +
theme(axis.text.x = element_text(angle = 90))
grid.arrange (g3, g4, ncol=2)
How is this possible?
If we look at the distribution of delay % by city, we can see that although Alaska beats Amwest when they are head to head in a city, in fact some of the % delays for Amwest are lower than some of those for Alaska. And Amwest’s delay percentage is at its lowest in a city that has a highly disproportionate number of Amwest flights (Phoenix). This is going to bring the overall mean % of Amwest’s delays way down:
print(dfSummaryStats)
## # A tibble: 10 x 5
## # Groups: Airline [2]
## Airline City sum_DelayedFlights sum_Flights propOfDelay
## <fct> <chr> <int> <int> <dbl>
## 1 ALASKA Los.Angeles 62 559 11.1
## 2 ALASKA Phoenix 12 233 5.15
## 3 ALASKA San.Diego 20 232 8.62
## 4 ALASKA San.Francisco 102 605 16.9
## 5 ALASKA Seattle 305 2146 14.2
## 6 AMWEST Los.Angeles 117 811 14.4
## 7 AMWEST Phoenix 415 5255 7.90
## 8 AMWEST San.Diego 65 448 14.5
## 9 AMWEST San.Francisco 129 449 28.7
## 10 AMWEST Seattle 61 262 23.3
Is there anything in the data that might explain Amwest’s poorer per city performance relative to Alaska’s (besides incompetence, priority or some other reason internal to Amwest)? There is not much data to work with, but one difference might be the level of traffic - maybe the larger (or smaller) an airline is, the higher its delay percentage is.
These graphs display the percentage of delays by total flights for each airline. The second one has two outliers removed:
g4<-ggplot(dfSummaryStats, aes(x=sum_Flights, y=propOfDelay)) +
geom_point() +
ggtitle("% of Delays By Total Flights")
dfSummaryStats_OutliersRemoved <- dfSummaryStats %>%
filter(sum_Flights<2000)
g5<-ggplot(dfSummaryStats_OutliersRemoved, aes(x=sum_Flights, y=propOfDelay)) +
geom_point() +
ggtitle("% of Delays By Total Flights _ outliers removed")
grid.arrange(g4, g5, ncol=2)
lmHeight = lm(sum_Flights~propOfDelay, data = dfSummaryStats_OutliersRemoved) #Create the linear regression
summary(lmHeight) #Review the results
##
## Call:
## lm(formula = sum_Flights ~ propOfDelay, data = dfSummaryStats_OutliersRemoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -218.91 -180.75 -25.92 131.56 364.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 389.997 183.107 2.130 0.0772 .
## propOfDelay 3.905 10.813 0.361 0.7304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.8 on 6 degrees of freedom
## Multiple R-squared: 0.02128, Adjusted R-squared: -0.1418
## F-statistic: 0.1304 on 1 and 6 DF, p-value: 0.7304
Neither of the graphs nor the regression analysis that follows suggest that % of delays is correlated with number of flights.
Perhaps it’s something more sinister. Is it possible there is some kind of favoritism going on among airport staff and crew? There is no easy way to measure that directly, but as an albeit weak proxy we observe that Phoenix appears to be some kind of hub for Amwest and Seattle a hub for Alaska:
g3<-ggplot(dfFlights, aes(x=City, y=NumOfFlights, group=Airline, fill=Airline)) +
geom_col(position = position_dodge()) +
ggtitle("Number of Flights By Airline By City") +
theme(axis.text.x = element_text(angle = 90))
g3
While Amwest has a higher percentage of delays across the board, is the gap between Amwest’s and Alaska’s delay percentages lowest in Phoenix and highest in Seattle? This might suggest that where Amwest is most prominent they get favored treatment by the airport, and where Alaska is most prominent Amwest gets the least favored treatment.
This barchart shows the difference in the % delays between the two airlines, by city.
g2<-ggplot(dfPercentDiff, aes(x=City, y=propDiff)) +
geom_col(position = position_dodge()) +
ggtitle("Diff in % of Delays By City") +
theme(axis.text.x = element_text(angle = 90))
g2
Amwest clearly does best in Phoenix relative to Alaska, and very poorly in Seattle. However, the biggest difference between the two is in San Francisco where Amwest has MORE total flights than Alaska.
Perhaps it is the weather. Amwest does worst in the Northern cities - but at this point there is too little data to determine anything more than anecdotal. Besides, blaming it on the weather, while we have all done it, is hardly a viable excuse.
Amwest filed for bankruptcy in the early ninieties and eventually was swallowed up by US Air. This is a prophetic story. Had they employed data scientists who could have shown them the above charts they might have survived to this day.