Packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Loading Dataset

url = "https://raw.githubusercontent.com/schoolkidrich/R/main/DATA%20607/week5/delay.csv"

delays = read.csv(url)
head(delays)
##   Airline Tardiness Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA   on time         497     221       212           503    1841
## 2  ALASKA   delayed          62      12        20           102     305
## 3 AM WEST   on time         694    4840       383           320     201
## 4 AM WEST   delayed         117     415        65           129      61
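
The dots in names such as `Los.Angeles` most likely come from `read.csv()`, which by default passes headers through `make.names()`. The sketch below assumes the raw header uses spaces; `csv_text` is a hypothetical inline stand-in for the file, built from the values printed above, so the example runs without a network connection.

```r
# Hypothetical inline copy of the data (assumes the raw header contains spaces).
csv_text <- 'Airline,Tardiness,Los Angeles,Phoenix,San Diego,San Francisco,Seattle
ALASKA,on time,497,221,212,503,1841
ALASKA,delayed,62,12,20,102,305
AM WEST,on time,694,4840,383,320,201
AM WEST,delayed,117,415,65,129,61'

# Default behaviour: headers are made syntactically valid, so spaces become dots.
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the headers exactly as written in the file.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

Keeping the original names gives nicer axis labels later, at the cost of needing backticks when a column is referenced by name.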

Pivoting Dataset on Cities

len = dim(delays)[2]
col = names(delays)
cities = pivot_longer(delays,col[3:len],names_to = 'city')
head(cities)
## # A tibble: 6 x 4
##   Airline Tardiness city          value
##   <chr>   <chr>     <chr>         <int>
## 1 ALASKA  on time   Los.Angeles     497
## 2 ALASKA  on time   Phoenix         221
## 3 ALASKA  on time   San.Diego       212
## 4 ALASKA  on time   San.Francisco   503
## 5 ALASKA  on time   Seattle        1841
## 6 ALASKA  delayed   Los.Angeles      62
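
Selecting the city columns by position (`col[3:len]`) works here but breaks if the column order changes. As a minimal sketch of an alternative, the same pivot can select "everything except the two identifier columns". The data frame is rebuilt inline from the printed output, and the `values_to` name `flights` is an illustrative choice, not something the analysis above uses.

```r
library(tidyr)

# Inline copy of the four rows printed by head(delays).
delays <- data.frame(
  Airline       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Tardiness     = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497L, 62L, 694L, 117L),
  Phoenix       = c(221L, 12L, 4840L, 415L),
  San.Diego     = c(212L, 20L, 383L, 65L),
  San.Francisco = c(503L, 102L, 320L, 129L),
  Seattle       = c(1841L, 305L, 201L, 61L)
)

# Pivot every column except the identifiers; name the value column explicitly.
cities <- pivot_longer(
  delays,
  cols      = -c(Airline, Tardiness),
  names_to  = "city",
  values_to = "flights"   # hypothetical name; the original keeps the default "value"
)

print(head(cities))
```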

Pivoting Dataset on Tardiness

clean_data = pivot_wider(cities,names_from = "Tardiness")
names(clean_data)[3] = 'on_time'
head(clean_data)
## # A tibble: 6 x 4
##   Airline city          on_time delayed
##   <chr>   <chr>           <int>   <int>
## 1 ALASKA  Los.Angeles       497      62
## 2 ALASKA  Phoenix           221      12
## 3 ALASKA  San.Diego         212      20
## 4 ALASKA  San.Francisco     503     102
## 5 ALASKA  Seattle          1841     305
## 6 AM WEST Los.Angeles       694     117
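
Renaming by position (`names(clean_data)[3]`) silently renames the wrong column if the order of the `Tardiness` levels ever changes. The sketch below shows one way to make both steps explicit: `values_from` is stated rather than inferred, and the column is renamed by name. It rebuilds the data inline from the printed output and assumes `dplyr` is available, as it is after `library(tidyverse)`.

```r
library(tidyr)
library(dplyr)

# Inline copy of the four rows printed by head(delays).
delays <- data.frame(
  Airline       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Tardiness     = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497L, 62L, 694L, 117L),
  Phoenix       = c(221L, 12L, 4840L, 415L),
  San.Diego     = c(212L, 20L, 383L, 65L),
  San.Francisco = c(503L, 102L, 320L, 129L),
  Seattle       = c(1841L, 305L, 201L, 61L)
)

clean_data <- delays %>%
  pivot_longer(-c(Airline, Tardiness), names_to = "city") %>%
  # One column per Tardiness level, filled from the "value" column.
  pivot_wider(names_from = Tardiness, values_from = value) %>%
  # Rename by name, so the result does not depend on column position.
  rename(on_time = `on time`)

print(clean_data)
```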

On Time by Cities (Percent)

Alaska Airlines has a higher on-time arrival rate than AM WEST in every city.

clean_data$percent_on_time = clean_data$on_time/(clean_data$on_time+clean_data$delayed)

ggplot(clean_data, aes(x = reorder(city, -percent_on_time), y = percent_on_time, fill = Airline)) +
  geom_bar(stat = 'identity', position = position_dodge()) +
  labs(x = "city", title = "Percentage of Timely Arrivals by City")
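
The per-city claim can also be checked numerically rather than read off the chart. This is a minimal sketch that rebuilds the data inline from the printed output and puts the two airlines' on-time rates side by side; the column name `alaska_ahead` is introduced here only for illustration.

```r
library(tidyr)
library(dplyr)

# Inline copy of the four rows printed by head(delays).
delays <- data.frame(
  Airline       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Tardiness     = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497L, 62L, 694L, 117L),
  Phoenix       = c(221L, 12L, 4840L, 415L),
  San.Diego     = c(212L, 20L, 383L, 65L),
  San.Francisco = c(503L, 102L, 320L, 129L),
  Seattle       = c(1841L, 305L, 201L, 61L)
)

by_city <- delays %>%
  pivot_longer(-c(Airline, Tardiness), names_to = "city") %>%
  pivot_wider(names_from = Tardiness, values_from = value) %>%
  mutate(percent_on_time = `on time` / (`on time` + delayed)) %>%
  select(Airline, city, percent_on_time) %>%
  # One row per city, one column per airline.
  pivot_wider(names_from = Airline, values_from = percent_on_time) %>%
  mutate(alaska_ahead = ALASKA > `AM WEST`)   # hypothetical helper column

print(by_city)
```

If `alaska_ahead` is `TRUE` in all five rows, the bar chart and the statement above agree.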

By Count

Alaska has more on-time flights in San Francisco and Seattle, while AM WEST has more in Los Angeles, Phoenix, and San Diego.

ggplot(clean_data, aes(x = reorder(city, -on_time), y = on_time, fill = Airline)) +
  geom_bar(stat = 'identity', position = position_dodge()) +
  labs(x = "city", title = "Count of Timely Arrivals by City")

On Time by Airlines (Percent)

Perhaps surprisingly, AM WEST has the higher on-time rate overall (about 89.1% versus about 86.7% for Alaska), even though Alaska leads in every individual city.

airline_on_time = group_by(clean_data,Airline) %>%
  summarize(percent = sum(on_time)/(sum(on_time)+sum(delayed)))

ggplot(airline_on_time, aes(x = Airline, y = percent))+geom_bar(stat='identity') + labs(title = "Percentage of Timely Arrivals")
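
Because the two bars in this chart are close in height, it helps to compute the overall rates directly. The base R sketch below rebuilds the data inline from the printed output and pools all cities per airline; the helper names (`total`, `is_on_time`, `overall`) are illustrative.

```r
# Inline copy of the four rows printed by head(delays).
delays <- data.frame(
  Airline       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Tardiness     = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497L, 62L, 694L, 117L),
  Phoenix       = c(221L, 12L, 4840L, 415L),
  San.Diego     = c(212L, 20L, 383L, 65L),
  San.Francisco = c(503L, 102L, 320L, 129L),
  Seattle       = c(1841L, 305L, 201L, 61L)
)

# Flights per row, summed over the five city columns.
delays$total <- rowSums(delays[, 3:7])
is_on_time   <- delays$Tardiness == "on time"

# All flights and on-time flights per airline (both ordered alphabetically).
all_flights <- tapply(delays$total, delays$Airline, sum)
on_time     <- tapply(delays$total[is_on_time], delays$Airline[is_on_time], sum)

# Pooled on-time rate per airline.
overall <- on_time / all_flights
print(round(overall, 4))
```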

By Count

AM WEST has more on-time flights in total. This is probably due to the large number of flights in Phoenix, where most flights are AM WEST.

ggplot(clean_data, aes(x = Airline,y=on_time))+geom_bar(stat="identity")+
  labs(title = "Count of Timely Arrivals")

Conclusion

Although Alaska Airlines outperforms AM WEST on tardiness rate in every city, its overall performance is worse. This is because most of AM WEST's flights are in Phoenix, while most of Alaska's flights are in Seattle. AM WEST in Phoenix outperforms Alaska in Seattle, and those are the cities that carry the most weight in each airline's total (Simpson's paradox).
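
The weighting argument can be made concrete: each airline's overall rate is the average of its city rates, weighted by the share of its flights in each city. The sketch below rebuilds the data inline from the printed output and computes those shares; the column names `flights`, `rate`, and `share` are introduced here for illustration only.

```r
library(tidyr)
library(dplyr)

# Inline copy of the four rows printed by head(delays).
delays <- data.frame(
  Airline       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Tardiness     = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497L, 62L, 694L, 117L),
  Phoenix       = c(221L, 12L, 4840L, 415L),
  San.Diego     = c(212L, 20L, 383L, 65L),
  San.Francisco = c(503L, 102L, 320L, 129L),
  Seattle       = c(1841L, 305L, 201L, 61L)
)

weights <- delays %>%
  pivot_longer(-c(Airline, Tardiness), names_to = "city") %>%
  pivot_wider(names_from = Tardiness, values_from = value) %>%
  mutate(flights = `on time` + delayed,      # flights per airline and city
         rate    = `on time` / flights) %>%  # on-time rate per airline and city
  group_by(Airline) %>%
  mutate(share = flights / sum(flights)) %>% # city's share of the airline's flights
  ungroup()

# Weighted average of city rates reproduces the pooled overall rate.
overall <- weights %>%
  group_by(Airline) %>%
  summarize(overall_rate = sum(share * rate))

print(select(weights, Airline, city, rate, share))
print(overall)
```

In the `share` column, Phoenix accounts for roughly 73% of AM WEST's flights and Seattle for roughly 57% of Alaska's, so each airline's overall rate is pulled toward its rate in that one city.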