library(openintro)
library(tinytex)
library(tidyverse)
library(stringr)
library(magrittr)
library(gridExtra)

Using TidyR and Dyplr for Data Transformation

Introduction

In this project we will transform a csv file into a usable dataframe using TidyR and Dyplr. After the transformation we will conduct some analyses.

We load the data …

dfFlights_raw <- as.data.frame(read.delim("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/Flights.csv", header = TRUE, stringsAsFactors = TRUE, sep=","))

… eliminate the unnecessary rows, and change some column names:

dfFlights <- dfFlights_raw %>% 
  drop_na(Phoenix) %>%
  rename(c(Airline =  "X", Status = "X.1")) %>%
  mutate_all(list(~na_if(.,"")))

head(dfFlights)
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3  AMWEST on time         694    4840       383           320     201
## 4    <NA> delayed         117     415        65           129      61

Now we use TidyR to normalize the table - this is an extremely powerful library that does a lot in a couple lines …

dfFlights %<>%
  gather("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle", key="City", value="NumOfFlights") %<>%
  fill(Airline)

… and that’s all it takes!

Analysis of Delays

Now we can analyse the airlines delays. We start by getting a look at the median and spread of delays for each airline. We can also look at the delay percentage by city (delays over total flights) which might be more meaningful:

dfSummaryStats <- dfFlights %>% 
  group_by(Airline, City) %>% 
  summarize(ave_Flights=mean(NumOfFlights), sum_Flights = sum(NumOfFlights), propOfDelay=sum(NumOfFlights[Status=="delayed"])/sum(NumOfFlights)*100)
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
dfDelaysOnly <- dfFlights %>%
  filter(Status=="delayed")

dfDelaysOnly_Amwest <- dfDelaysOnly %>%
  filter(Airline=="AMWEST")

dfDelaysOnly_Alaska <- dfDelaysOnly %>%
  filter(Status=="ALASKA[")

dfPercentDiff <- dfSummaryStats %>%
  select(Airline, City, propOfDelay) %>%
  group_by(City) %>%
  summarize(propDiff=propOfDelay[Airline=="AMWEST"] - propOfDelay[Airline=="ALASKA"])
## `summarise()` ungrouping output (override with `.groups` argument)
g1 <- ggplot(dfDelaysOnly, aes(x=Airline, y=NumOfFlights)) +
  geom_boxplot() +
  ggtitle("Total Delays By Airline")

g2 <- ggplot(dfSummaryStats, aes(x=Airline, y=propOfDelay)) +
  geom_boxplot() +
  ggtitle("Percentage Delays By Airline By City")

grid.arrange (g1, g2, ncol=2)

Amwest has a higher median total delayed flights and a higher median percentage of delays per city. Might this be due to an outlier city which is skewing the results? Apparently not, as this bar chart shows.

g3<-ggplot(dfDelaysOnly, aes(x=City, y=NumOfFlights, group=Airline, fill=Airline)) +
  geom_col(position = position_dodge()) +
  ggtitle("Number of Delays By Airline By City") +
  theme(axis.text.x = element_text(angle = 90))


g4<-ggplot(dfSummaryStats, aes(x=City, y=propOfDelay, group=Airline, fill=Airline)) +
  geom_col(position = position_dodge()) +
  ggtitle("% of Delays By Airline By City") +
  theme(axis.text.x = element_text(angle = 90))


grid.arrange (g3, g4, ncol=2)

Amwest has a higher % of delays for every city.

So is there anything in the data that might explain this (besides incompetence, priority or some other reason internal to Amwest? One difference might be the level of traffic - maybe the larger (or smaller) an airline is, the higher its delay percentage is.

These graphs display the percentage of delays by total flights for each airline. The second one has two outliers removed:

g4<-ggplot(dfSummaryStats, aes(x=sum_Flights, y=propOfDelay)) +
  geom_point() +
  ggtitle("% of Delays By Total Flights") 


dfSummaryStats_OutliersRemoved <- dfSummaryStats %>%
  filter(sum_Flights<2000)

g5<-ggplot(dfSummaryStats_OutliersRemoved, aes(x=sum_Flights, y=propOfDelay)) +
  geom_point() +
  ggtitle("% of Delays By Total Flights _ outliers removed") 

grid.arrange(g4, g5, ncol=2)

lmHeight = lm(sum_Flights~propOfDelay, data = dfSummaryStats_OutliersRemoved) #Create the linear regression
summary(lmHeight) #Review the results
## 
## Call:
## lm(formula = sum_Flights ~ propOfDelay, data = dfSummaryStats_OutliersRemoved)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218.91 -180.75  -25.92  131.56  364.67 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  389.997    183.107   2.130   0.0772 .
## propOfDelay    3.905     10.813   0.361   0.7304  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.8 on 6 degrees of freedom
## Multiple R-squared:  0.02128,    Adjusted R-squared:  -0.1418 
## F-statistic: 0.1304 on 1 and 6 DF,  p-value: 0.7304

Neither of the graphs nor the regression analysis suggest that % of delays is correlated with number of flights.

Perhaps it’s something more sinister. Is it possible there is some kind of favoritism going on among airport staff and crew? There is no easy way to measure that directly, but as an albeit weak proxy we observe that Phoenix appears to be some kind of hub for Amwest and Seattle a hub for Alaska:

g3<-ggplot(dfFlights, aes(x=City, y=NumOfFlights, group=Airline, fill=Airline)) +
  geom_col(position = position_dodge()) +
  ggtitle("Number of Flights By Airline By City") +
  theme(axis.text.x = element_text(angle = 90))
g3

While Amwest has a higher percentage of delays across the board, is the gap between Amwest’s and Alaska’s delay percentages lowest in Phoenix and highest in Seattle? This might suggest that where Amwest is most prominent they get favored treatment by the airport, and where Alaska is most prominent Amwest gets the least favored treatment.

This barchart shows the difference in the % delays between the two airlines, by city.

g2<-ggplot(dfPercentDiff, aes(x=City, y=propDiff)) +
  geom_col(position = position_dodge()) +
  ggtitle("Diff in % of Delays By City") +
  theme(axis.text.x = element_text(angle = 90))
g2

Amwest clearly does best in Phoenix relative to Alaska, and very poorly in Seattle, but the biggest difference is in San Francisco where they have MORE total flights than Alaska. Could it be the weather? Amwest does worst in the Northern cities - but at this point there is too little data to determine. Besides, blaming it on the weather, while we have all done it, is hardly a viable excuse.

Conclusion

Amwest filed for bankruptcy in the early ninieties and eventually was swallowed up by US Air. This is a prophetic story. Had they employed data scientists who could have shown them the above charts they might have survived to this day.