In this assignment we are provided an untidy dataset of containing flight counts for two airlines and five cities by flight status (on time vs. delayed). Our task is to create a csv file for this information in wide format and then, using dplyr and tidyr, put it into tidy form. We are also asked to analyze the data and create appropriate graphics related to the following interactions:
We are also asked to explain any discrepancies between the overall and on time performances as well as any paradoxical conclusions.
I divided this work into the following steps:
I completed Step 1 outside of RPubs. I then imported the resulting untidy csv file (numbersense.csv) into Rstudio and converted it to tidy form using the tidyverse library - which includes dplyr and tidyr. I then saved the resulting dataframe as a separate csv file (hwfivecldata.csv).
In addition, I developed a SQL script to create and populate the latter into my local MYSQL database. I also attempted to upload the dataset to an AWS RMD but was not successful.
# Read in csv file
num <- read.csv("numbersense.csv", sep=",")
#Tidy the dataframe
num%<>%mutate(Phoenix=str_remove(Phoenix, ","))%>%mutate(Seattle=str_remove(Seattle, ","))
num%<>%mutate_at(c(4,7), as.numeric)
num%<>%clean_names()%>%pivot_longer(cols=-c(i:x), names_to = "City", values_to = "Flight_Count", values_drop_na=TRUE)
num%<>%rename(Airline = i, Flight_Status=x)
num[c(6:10),1] ="ALASKA"
num[c(16:20),1] ="AMWEST"
num <- num %>% mutate(Id = row_number())%>%relocate(Id, .before = Airline)
# Create table for first five rows
head(num, 5)%>%kbl%>%kable_material(c("striped"))
Id | Airline | Flight_Status | City | Flight_Count |
---|---|---|---|---|
1 | ALASKA | on time | los_angeles | 497 |
2 | ALASKA | on time | phoenix | 221 |
3 | ALASKA | on time | san_diego | 212 |
4 | ALASKA | on time | san_francisco | 503 |
5 | ALASKA | on time | seattle | 1841 |
# Save as csv file for import into MYSQL
write.csv(num,"C:\\Users\\seanc\\Documents\\Data_Science\\CUNY\\Data 607 Acquisition and Management\\Assignments\\WK5\\MyData.csv", row.names = FALSE)
First, I calculated on-time performance by airline and city. Then I calculated it for the airlines.
On-time performance was determined by dividing on-time arrival counts by the sum of on-time and delayed counts for each airline.
Finally, I calculated the total number of flights documented for each airline.
# Compute performance metric for each airline by city.
(air_city<-num%>%select(Airline, Flight_Status, City, Flight_Count)%>%pivot_wider(names_from = Flight_Status, values_from = Flight_Count)%>%clean_names()%>% mutate(performance=round((on_time/(on_time+delayed))*100)))
air_city%>%kbl%>%kable_material(c("striped"))
airline | city | on_time | delayed | performance |
---|---|---|---|---|
ALASKA | los_angeles | 497 | 62 | 89 |
ALASKA | phoenix | 221 | 12 | 95 |
ALASKA | san_diego | 212 | 20 | 91 |
ALASKA | san_francisco | 503 | 102 | 83 |
ALASKA | seattle | 1841 | 305 | 86 |
AMWEST | los_angeles | 694 | 117 | 86 |
AMWEST | phoenix | 4840 | 415 | 92 |
AMWEST | san_diego | 383 | 65 | 85 |
AMWEST | san_francisco | 320 | 129 | 71 |
AMWEST | seattle | 201 | 61 | 77 |
# Compute overall on time performance for each airline
air_overall<-num%>%select(Airline, Flight_Status, Flight_Count)%>%group_by(Airline, Flight_Status)%>%summarize(Flight_Total=sum(Flight_Count))
## `summarise()` has grouped output by 'Airline'. You can override using the `.groups` argument.
air_overall%<>%pivot_wider(names_from = Flight_Status, values_from = Flight_Total)%>%clean_names()%>%mutate(performance=round((on_time/(on_time+delayed))*100))
air_overall%>%kbl%>%kable_material(c("striped"))
airline | delayed | on_time | performance |
---|---|---|---|
ALASKA | 501 | 3274 | 87 |
AMWEST | 787 | 6438 | 89 |
#Compute the total number of flights for each airline
n_total <- num%>%group_by(Airline)%>%summarize(Frequency = sum(Flight_Count))
n_total%>%kbl%>%kable_material(c("striped"))
Airline | Frequency |
---|---|
ALASKA | 3775 |
AMWEST | 7225 |
Note: I performed addtional analyses but these were not required on the rubric
# Descriptive statistics grouped by airline and flight status
stat1sum<-num%>%group_by(Airline, Flight_Status)%>%summarize(Total=sum(Flight_Count), Median=round(median(Flight_Count, na.rm=TRUE)), Minimum=min(Flight_Count), Maximum=max(Flight_Count),Range=range(Flight_Count),Standard_Deviation=round(sd(Flight_Count)))
## `summarise()` has grouped output by 'Airline', 'Flight_Status'. You can override using the `.groups` argument.
stat1sum%<>%distinct(Airline, .keep_all = TRUE)
stat1sum%>%kbl%>%kable_material(c("striped"))
Airline | Flight_Status | Total | Median | Minimum | Maximum | Range | Standard_Deviation |
---|---|---|---|---|---|---|---|
ALASKA | delayed | 501 | 62 | 12 | 305 | 12 | 120 |
ALASKA | on time | 3274 | 497 | 212 | 1841 | 212 | 678 |
AMWEST | delayed | 787 | 117 | 61 | 415 | 61 | 147 |
AMWEST | on time | 6438 | 383 | 201 | 4840 | 201 | 1994 |
# Descriptive statistics grouped by city and flight status.
(stat2sum<-num%>%group_by(City, Flight_Status)%>%summarize(Median=round(median(Flight_Count, na.rm=TRUE)), Minimum=min(Flight_Count), Maximum=max(Flight_Count), Range=range(Flight_Count), Standard_Deviation=round(sd(Flight_Count))))
## `summarise()` has grouped output by 'City', 'Flight_Status'. You can override using the `.groups` argument.
stat2sum %<>% distinct(City, Flight_Status, .keep_all = TRUE)
stat2sum%>%kbl%>%kable_material(c("striped"))
City | Flight_Status | Median | Minimum | Maximum | Range | Standard_Deviation |
---|---|---|---|---|---|---|
los_angeles | delayed | 90 | 62 | 117 | 62 | 39 |
los_angeles | on time | 596 | 497 | 694 | 497 | 139 |
phoenix | delayed | 214 | 12 | 415 | 12 | 285 |
phoenix | on time | 2530 | 221 | 4840 | 221 | 3266 |
san_diego | delayed | 42 | 20 | 65 | 20 | 32 |
san_diego | on time | 298 | 212 | 383 | 212 | 121 |
san_francisco | delayed | 116 | 102 | 129 | 102 | 19 |
san_francisco | on time | 412 | 320 | 503 | 320 | 129 |
seattle | delayed | 183 | 61 | 305 | 61 | 173 |
seattle | on time | 1021 | 201 | 1841 | 201 | 1160 |
# Descriptive statistics grouped by airline, city, and flight status.
stat3sum<-num%>%group_by(City, Airline, Flight_Status)%>%summarize(median=round(median(Flight_Count)), minimum=min(Flight_Count), maximum=max(Flight_Count), range=range(Flight_Count))
## `summarise()` has grouped output by 'City', 'Airline', 'Flight_Status'. You can override using the `.groups` argument.
stat3sum %<>% distinct(Airline, City, Flight_Status, .keep_all = TRUE)%>%arrange(Airline)
stat3sum%>%kbl%>%kable_material(c("striped"))
City | Airline | Flight_Status | median | minimum | maximum | range |
---|---|---|---|---|---|---|
los_angeles | ALASKA | delayed | 62 | 62 | 62 | 62 |
los_angeles | ALASKA | on time | 497 | 497 | 497 | 497 |
phoenix | ALASKA | delayed | 12 | 12 | 12 | 12 |
phoenix | ALASKA | on time | 221 | 221 | 221 | 221 |
san_diego | ALASKA | delayed | 20 | 20 | 20 | 20 |
san_diego | ALASKA | on time | 212 | 212 | 212 | 212 |
san_francisco | ALASKA | delayed | 102 | 102 | 102 | 102 |
san_francisco | ALASKA | on time | 503 | 503 | 503 | 503 |
seattle | ALASKA | delayed | 305 | 305 | 305 | 305 |
seattle | ALASKA | on time | 1841 | 1841 | 1841 | 1841 |
los_angeles | AMWEST | delayed | 117 | 117 | 117 | 117 |
los_angeles | AMWEST | on time | 694 | 694 | 694 | 694 |
phoenix | AMWEST | delayed | 415 | 415 | 415 | 415 |
phoenix | AMWEST | on time | 4840 | 4840 | 4840 | 4840 |
san_diego | AMWEST | delayed | 65 | 65 | 65 | 65 |
san_diego | AMWEST | on time | 383 | 383 | 383 | 383 |
san_francisco | AMWEST | delayed | 129 | 129 | 129 | 129 |
san_francisco | AMWEST | on time | 320 | 320 | 320 | 320 |
seattle | AMWEST | delayed | 61 | 61 | 61 | 61 |
seattle | AMWEST | on time | 201 | 201 | 201 | 201 |
I created two bar plots for on time performance. The first compared performance for both airlines by city. The second compared on time performance by airline.
In order to create visually legible boxplots, I removed flight counts > 1000. This applied to airports in Seattle and Phoenix.
I also tried to facet a boxplot graphic highlight counts by flight status, city, and airline. However, the resulting form was not visually legible and, despite much effort, I was unable to improve it.
# Bar plot of on time performance by airline and city.
(air_city%>%ggplot(aes(x=city, y=performance, fill=airline, alpha=.5))+geom_bar(stat="identity", position = position_dodge(width = .6))+
theme(axis.text.x= element_text(size=8))+
theme(axis.text.y = element_text(size=8))+
coord_flip()+
theme_bw()+
ggtitle("Figure 1. Performance of Two Airlines by City")+xlab("City")+ylab("On Time Performance(%)"))
# Bar plot of on time performance by airline.
(air_overall%>%ggplot(aes(x=airline, y=performance))+geom_bar(stat="identity", width=0.5, color="blue", alpha=0.5)+
theme(axis.text.x= element_text(size=8))+
theme(axis.text.y = element_text(size=8))+
coord_flip()+
theme_bw()+
ggtitle("Figure 2. On Time Performance of Two Airlines")+xlab("Airline")+ylab("On Time Performance(%)"))
#Plot boxplots with outliers removed
(plot1 <- num%>%filter(Flight_Count<1000)%>%
ggplot(aes(x=Airline, y=Flight_Count)) +
geom_boxplot(alpha=.3, fill="blue")+
theme(axis.text.x= element_text(size=8))+
theme(axis.text.y = element_text(size=8))+
coord_flip()+
facet_grid(.~Flight_Status, space="free_x")+
theme_bw()+
ggtitle("Figure 3. Timeliness of Flights by Airline")+
xlab("Airline")+
ylab("Flight Count"))
(plot2 <- num%>%filter(Flight_Count<1000)%>%
ggplot(aes(x=City, y=Flight_Count)) +
geom_boxplot(alpha=.7, fill="blue")+
theme(axis.text.x= element_text(size=8))+
theme(axis.text.y = element_text(size=8))+
coord_flip()+
facet_grid(.~Flight_Status)+
theme_bw()+
ggtitle("Figure 4. Timeliness of Flights by City")+xlab("City")+
ylab("Flight Count"))
# Attempt to facet with facet_grid
(plot3 <- num%>%filter(Flight_Count<1000, City=="los_angeles"|City=="phoenix" )%>%
ggplot(aes(x=Airline, y=Flight_Count))+
geom_boxplot(alpha=.7, fill="blue")+
facet_grid_paginate(City~Flight_Status,
ncol=2, page=1, space="free_y")+
theme_bw())
The on-time performance of ALASKA exceeded AMWEST when compared on a city-by-city basis (Figure 1).
However, the overall on-time performance of AMWEST (89%) exceeded ALASKA (87%). See Figure 2.
I ascribe this discrepancy to the fact that the number of flights flown by AMWEST (n=7225) was almost double that of Alaska airlines (n=3775). And the majority of this difference was attributable to on-time flights by AMWEST to Phoenix (n=4840).
It is clear from this dataset that an aggregate measure of performance can lead to a different conclusion (AMWEST outperformed ALASKA) than when distributed measures of performance are analyzed individually (ALASKA outperformed AMWEST).