Eric_Hirsch_607_Assignment

Using TidyR and Dyplr for Data Transformation

In this project we will transform a csv file into a usable dataframe using TidyR and Dyplr. After the transformation we will conduct some analyses.

First we load the data …

dfFlights_raw <- as.data.frame(read.delim("https://raw.githubusercontent.com/ericonsi/CUNY_607/main/Projects/Project%202/Flights.csv", header = TRUE, stringsAsFactors = TRUE, sep=","))

… eliminate the unnecessary rows, and change some column names:

dfFlights <- dfFlights_raw %>% 
  drop_na(Phoenix) %>%
  rename(c(Airline =  "X", Status = "X.1")) %>%
  mutate_all(list(~na_if(.,"")))

kable(head(dfFlights))

Airline	Status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
AMWEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

Now we use TidyR to normalize the table - this is an extremely powerful library that does a lot in a couple lines …

dfFlights %<>%
  gather("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle", key="City", value="NumOfFlights") %<>%
  fill(Airline)

… and that’s all it takes!

Analysis of Delays

Now we can analyse the airline’s delays. We start with the mean delay percentage (flights delayed over total flights) for each airline. Amwest appears to have a better track record than Alaska.

dfSummaryStats <- dfFlights %>% 
  group_by(Airline, City) %>% 
  summarize(sum_DelayedFlights=sum(NumOfFlights[Status=="delayed"]), sum_Flights = sum(NumOfFlights), propOfDelay=sum(NumOfFlights[Status=="delayed"])/sum(NumOfFlights)*100)

## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)

dfSummaryStats2 <- dfFlights %>% 
  group_by(Airline) %>% 
  summarize(sum_DelayedFlights=sum(NumOfFlights[Status=="delayed"]), sum_Flights = sum(NumOfFlights), propOfDelay=sum(NumOfFlights[Status=="delayed"])/sum(NumOfFlights)*100)

## `summarise()` ungrouping output (override with `.groups` argument)

dfDelaysOnly <- dfFlights %>%
  filter(Status=="delayed")

dfDelaysOnly_Amwest <- dfDelaysOnly %>%
  filter(Airline=="AMWEST")

dfDelaysOnly_Alaska <- dfDelaysOnly %>%
  filter(Status=="ALASKA[")

dfPercentDiff <- dfSummaryStats %>%
  select(Airline, City, propOfDelay) %>%
  group_by(City) %>%
  summarize(propDiff=propOfDelay[Airline=="AMWEST"] - propOfDelay[Airline=="ALASKA"])

## `summarise()` ungrouping output (override with `.groups` argument)

gg2 <-ggplot(dfSummaryStats, aes(reorder(City, propOfDelay), propOfDelay)) +
geom_boxplot() + theme(axis.text.x = element_text(angle = 90), axis.title = element_text(size=15), axis.text = element_text(size = 15), plot.title = element_text(size=18)) +
  ggtitle("Percentage of Delayed Flights by City") +
  ylab("Percentage of Flights Delayed") +
  xlab("City") + 
  coord_flip()
gg2

gg1 <- ggplot(dfSummaryStats, aes(reorder(City, propOfDelay), propOfDelay)) + theme_tufte() +
geom_tufteboxplot(outlier.colour="transparent") + theme(axis.title=element_blank(), axis.text.x = element_text(angle = 90), axis.text = element_text(size = 20), plot.title = element_text(size=20)) +
annotate("text", x = 0, y =30, adj=0,  family="serif", label = "") +
  ggtitle("Percentage of Delayed Flights by City") + 
  coord_flip()
gg1

Here are the numbers showing proportion of delayed flights per airline:

kable(dfSummaryStats2)

Airline	sum_DelayedFlights	sum_Flights	propOfDelay
ALASKA	501	3775	13.27152
AMWEST	787	7225	10.89273

Fig. 1: A comparison by airline of the % of delays by city

ggplot(dfSummaryStats, aes(x=reorder(City, propOfDelay), y=propOfDelay, group=Airline, fill=Airline)) +
  theme_tufte() +
  geom_col(position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 90), axis.text = element_text(size = 20), legend.text = element_text(size=20), legend.title=element_blank()) +
  ylab("") + 
  xlab("") +
  coord_flip()

However, when we look at the percentage of delayed flights by city, Alaska is the clear winner, beating Amwest in every city (Fig 1.).

How is this possible?

Fig. 1: A comparison by airline of the % of delays by city

ggplot(dfSummaryStats, aes(x=reorder(City, propOfDelay), y=propOfDelay, group=Airline, fill=Airline)) + theme_tufte(base_size=14, ticks=F) +
  geom_bar(position = position_dodge(), width=0.3, stat = "identity") +  theme(axis.title=element_blank()) +
  annotate("text", x = 3.5, y = 5, adj=1,  family="Gill Sans MT",
label = "") +
    theme(axis.text.x = element_text(angle = 0), axis.text = element_text(size = 22, family="Gill Sans MT"), legend.text = element_text(size=20), legend.title=element_blank()) +
  coord_flip()

⊕Did you know: The paradox we are encountering in this exercise is called “Simpson’s Paradox,” first described in a technical paper by Edward H. Simpson in 1951.

If we look at the distribution of delay % by city, we can see that although Alaska beats Amwest when they are head to head in a city, in fact some of the % delays for Amwest are lower than some of those for Alaska. And Amwest’s delay percentage is at its lowest in a city that has a highly disproportionate number of Amwest flights (Phoenix). This is going to bring the overall mean % of Amwest’s delays way down:

kable(dfSummaryStats)

Airline	City	sum_DelayedFlights	sum_Flights	propOfDelay
ALASKA	Los.Angeles	62	559	11.091234
ALASKA	Phoenix	12	233	5.150215
ALASKA	San.Diego	20	232	8.620690
ALASKA	San.Francisco	102	605	16.859504
ALASKA	Seattle	305	2146	14.212488
AMWEST	Los.Angeles	117	811	14.426634
AMWEST	Phoenix	415	5255	7.897241
AMWEST	San.Diego	65	448	14.508929
AMWEST	San.Francisco	129	449	28.730512
AMWEST	Seattle	61	262	23.282443

Is there anything in the data that might explain Amwest’s poorer per city performance relative to Alaska’s (besides incompetence, priority or some other reason internal to Amwest)? There is not much data to work with, but one difference might be the level of traffic - maybe the larger (or smaller) an airline is, the higher its delay percentage is.

These graphs display the percentage of delays by total flights for each airline. The second one has two outliers removed:

Fig. 2: Percentage of delays by total flights by airline

g4<-ggplot(dfSummaryStats, aes(x=sum_Flights, y=propOfDelay)) +
  geom_point() +
  ggtitle("% of Delays By Total Flights") 


dfSummaryStats_OutliersRemoved <- dfSummaryStats %>%
  filter(sum_Flights<2000)

g5<-ggplot(dfSummaryStats_OutliersRemoved, aes(x=sum_Flights, y=propOfDelay)) +
  geom_point() +
  ggtitle("% of Delays By Total Flights _ outliers removed") 

grid.arrange(g4, g5, ncol=2)

lmHeight = lm(sum_Flights~propOfDelay, data = dfSummaryStats_OutliersRemoved) #Create the linear regression
summary(lmHeight) #Review the results

## 
## Call:
## lm(formula = sum_Flights ~ propOfDelay, data = dfSummaryStats_OutliersRemoved)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218.91 -180.75  -25.92  131.56  364.67 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  389.997    183.107   2.130   0.0772 .
## propOfDelay    3.905     10.813   0.361   0.7304  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219.8 on 6 degrees of freedom
## Multiple R-squared:  0.02128,    Adjusted R-squared:  -0.1418 
## F-statistic: 0.1304 on 1 and 6 DF,  p-value: 0.7304

Neither of the graphs nor the regression analysis that follows suggest that % of delays is correlated with number of flights.

Perhaps it’s something more sinister. Is it possible there is some kind of favoritism going on among airport staff and crew? There is no easy way to measure that directly, but as an albeit weak proxy we observe that Phoenix appears to be some kind of hub for Amwest and Seattle a hub for Alaska:

Fig. 3: Number of flights by airline by city

g3<-ggplot(dfFlights, aes(x=City, y=NumOfFlights, group=Airline, fill=Airline)) +
  geom_col(position = position_dodge()) +
  ggtitle("Number of Flights By Airline By City") +
  theme(axis.text.x = element_text(angle = 90))
g3

While Amwest has a higher percentage of delays across the board, is the gap between Amwest’s and Alaska’s delay percentages lowest in Phoenix and highest in Seattle? This might suggest that where Amwest is most prominent they get favored treatment by the airport, and where Alaska is most prominent Amwest gets the least favored treatment.

This barchart shows the difference in the % delays between the two airlines, by city.

Fig. 4: The difference in the percentage of delays by city

g2<-ggplot(dfPercentDiff, aes(x=City, y=propDiff)) +
  geom_col(position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 90), axis.text = element_text(size = 20)) +
  ylab("") + 
  xlab("") +
  coord_flip()

g2

dfFlightYears <- read_csv("D:\\RStudio\\CUNY_607\\Assigments\\Assignment 5 - Tufte\\FileForSlopeGraph.csv")

## Warning: Missing column names filled in: 'X1' [1]

## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   Year = col_double(),
##   Alaska = col_double(),
##   Amwest = col_double(),
##   British = col_double(),
##   Continental = col_double(),
##   Delta = col_double(),
##   Eastern = col_double(),
##   Hawaiian = col_double()
## )

dfFlightYears %<>%
  gather("Alaska", "Amwest", "British", "Continental", "Delta", "Eastern", "Hawaiian", key="airline", value="flights")  %<>%
  mutate(flights=as.integer(flights))


dfFlightYears %<>%
  mutate(yearCat = as.character(as.integer(Year/10)*10))

dfFl <- dfFlightYears %>%
  group_by(airline, yearCat) %>%
  summarize(Flights=round(mean(flights)))

## `summarise()` regrouping output by 'airline' (override with `.groups` argument)

#library(devtools)
#install_github("leeper/slopegraph")#install Leeper's package from Github
library(slopegraph)
ggslopegraph2(dataframe = dfFl,
                times = yearCat,
                measurement = Flights,
                grouping = airline,
                title = NULL,
                caption = NULL,
                linecolor = "gray",
                subtitle = NULL, 
                linethickness = .5,
                ytextsize = 3
                )

Note the footnote displayed to the right of me! 11 This is a footnote.

library(reshape)

## Warning: package 'reshape' was built under R version 4.0.4

## 
## Attaching package: 'reshape'

## The following object is masked from 'package:dplyr':
## 
##     rename

## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths

## The following object is masked from 'package:openintro':
## 
##     tips

library(RCurl)

## 
## Attaching package: 'RCurl'

## The following object is masked from 'package:tidyr':
## 
##     complete

library(extrafont)
#font_import()
#loadfonts(device = "win")

d <- dfFlightYears
mins <- group_by(d, airline) %>% slice(which.min(flights))
maxs <- group_by(d, airline) %>% slice(which.max(flights))
ends <- group_by(d, airline) %>% filter(Year == max(Year))
quarts <- d %>% group_by(airline) %>%
  summarize(quart1 = quantile(flights, 0.25),
            quart2 = quantile(flights, 0.75)) %>%
  right_join(d)

## `summarise()` ungrouping output (override with `.groups` argument)

## Joining, by = "airline"

Fig. 4: Numbers of flights over time

ggplot(d, aes(x=Year, y=flights)) + 
  facet_grid(airline ~ ., scales = "free_y") + 
  geom_ribbon(data = quarts, aes(ymin = quart1, max = quart2), fill = 'grey90') +
  geom_line(size=0.7) +
  geom_point(data = mins, col = 'green', size=5) +
  geom_point(data = maxs, col = 'blue', size = 5) +
  geom_text(data = mins, aes(label = flights), vjust = -1, size=6, family="Gill Sans MT", col='red') +
  geom_text(data = maxs, aes(label = flights), vjust = 2.5, size=6, family="Gill Sans MT") +
  geom_text(data = ends, aes(label = flights), hjust = 0, nudge_x = 1, size=6, family="Gill Sans MT", col='red') +
  geom_text(data = ends, aes(label = airline), hjust = 0, nudge_x = 8, size=8, family="Gill Sans MT") +
  expand_limits(x = max(d$Year) + (0.5 * (max(d$Year) - min(d$Year)))) +
  scale_x_continuous(breaks = seq(1960, 2010, 10)) +
  scale_y_continuous(expand = c(0.2, 0)) +
  theme_tufte() +
  theme(axis.title=element_blank(), axis.text.y = element_blank(), 
        axis.ticks = element_blank(), strip.text = element_blank(), axis.text = element_text(size = 14, family="Gill Sans MT")) +
  theme(panel.background = element_rect(fill = "transparent", colour = NA),
        plot.background = element_rect(fill = "transparent", colour = NA))

Amwest clearly does best in Phoenix relative to Alaska, and very poorly in Seattle. However, the biggest difference between the two is in San Francisco where Amwest has MORE total flights than Alaska.

Perhaps it is the weather. Amwest does worst in the Northern cities - but at this point there is too little data to determine anything more than anecdotal. Besides, blaming it on the weather, while we have all done it, is hardly a viable excuse.

Conclusion

Amwest filed for bankruptcy in the early ninieties and eventually was swallowed up by US Air. This is a prophetic story. Had they employed data scientists who could have shown them the above charts they might have survived to this day.

Eric_Hirsch_607_Assignment_5

Eric Hirsch

2021-03-31

Using TidyR and Dyplr for Data Transformation

Analysis of Delays

Conclusion