Packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Loading Dataset
url = "https://raw.githubusercontent.com/schoolkidrich/R/main/DATA%20607/week5/delay.csv"
delays = read.csv(url)
head(delays)
##   Airline Tardiness Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA   on time         497     221       212           503    1841
## 2  ALASKA   delayed          62      12        20           102     305
## 3 AM WEST   on time         694    4840       383           320     201
## 4 AM WEST   delayed         117     415        65           129      61
Pivoting Dataset on Cities
len = dim(delays)[2]
col = names(delays)
cities = pivot_longer(delays,col[3:len],names_to = 'city')
head(cities)
## # A tibble: 6 x 4
##   Airline Tardiness city          value
##   <chr>   <chr>     <chr>         <int>
## 1 ALASKA  on time   Los.Angeles     497
## 2 ALASKA  on time   Phoenix         221
## 3 ALASKA  on time   San.Diego       212
## 4 ALASKA  on time   San.Francisco   503
## 5 ALASKA  on time   Seattle        1841
## 6 ALASKA  delayed   Los.Angeles      62
Pivoting Dataset on Tardiness
clean_data = pivot_wider(cities,names_from = "Tardiness")
names(clean_data)[3] = 'on_time'
head(clean_data)
## # A tibble: 6 x 4
##   Airline city          on_time delayed
##   <chr>   <chr>           <int>   <int>
## 1 ALASKA  Los.Angeles       497      62
## 2 ALASKA  Phoenix           221      12
## 3 ALASKA  San.Diego         212      20
## 4 ALASKA  San.Francisco     503     102
## 5 ALASKA  Seattle          1841     305
## 6 AM WEST Los.Angeles       694     117
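The two reshaping steps above can also be written as a single pipeline. The following is only a sketch, not the code used in this notebook: the names `delays_local`, `tidy_delays`, and `flights` (the value column) are made up for illustration, and the counts are typed in from the head(delays) output so the example does not depend on the remote CSV.

```r
library(tidyr)
library(dplyr)

# Counts copied by hand from the head(delays) output shown earlier
# (assumption: those four rows are the whole dataset).
delays_local <- data.frame(
  Airline       = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Tardiness     = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles   = c(497, 62, 694, 117),
  Phoenix       = c(221, 12, 4840, 415),
  San.Diego     = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle       = c(1841, 305, 201, 61)
)

tidy_delays <- delays_local %>%
  # Wide to long: one row per airline, status, and city.
  pivot_longer(cols = -c(Airline, Tardiness),
               names_to = "city", values_to = "flights") %>%
  # Long to wide: one column per arrival status.
  pivot_wider(names_from = Tardiness, values_from = flights) %>%
  # Replace the column name containing a space with a syntactic one.
  rename(on_time = `on time`)

print(tidy_delays)
```

Naming the value column explicitly with `values_to` and `values_from`, and renaming by name rather than by position, keeps the pipeline from depending on column order.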
On Time by Cities (Percent)
Alaska Airlines has a higher on-time arrival rate than AM WEST in every city.
clean_data$percent_on_time = clean_data$on_time/(clean_data$on_time+clean_data$delayed)
ggplot(clean_data,aes(x = reorder(city,-percent_on_time), y = percent_on_time, fill = Airline)) + geom_bar(stat='identity', position = position_dodge()) + labs(x = "city",title = "Percentage of Timely Arrivals by City")

By Count
Alaska has more on-time flights in some cities (San Francisco and Seattle), and AM WEST has more in the others (Los Angeles, Phoenix, and San Diego).
ggplot(clean_data,aes(x = reorder(city,-on_time), y = on_time, fill = Airline)) + geom_bar(stat='identity', position = position_dodge()) + labs(x = "city", title = "Count of Timely Arrivals by City")

On Time by Airlines (Percent)
Surprisingly, AM WEST has the higher on-time rate overall (about 89.1% versus about 86.7% for Alaska), even though Alaska has the higher rate in every individual city.
airline_on_time = group_by(clean_data,Airline) %>%
summarize(percent = sum(on_time)/(sum(on_time)+sum(delayed)))
ggplot(airline_on_time, aes(x = Airline, y = percent))+geom_bar(stat='identity') + labs(title = "Percentage of Timely Arrivals")
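The two bars in this chart are close in height, so a numeric check of the pooled rates may help. This is a minimal sketch: the data frame `flights` and the column names `total_on_time`, `total_delayed`, and `rate` are hypothetical, and the counts are retyped from the head(delays) output so the block runs on its own.

```r
library(dplyr)

# Counts retyped from the head(delays) output (assumed to be the full dataset).
flights <- data.frame(
  Airline = rep(c("ALASKA", "AM WEST"), each = 5),
  city    = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                  "San.Francisco", "Seattle"), times = 2),
  on_time = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201),
  delayed = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61)
)

# Pool the counts across cities first, then take the ratio.
overall <- flights %>%
  group_by(Airline) %>%
  summarise(
    total_on_time = sum(on_time),
    total_delayed = sum(delayed),
    rate          = total_on_time / (total_on_time + total_delayed)
  )

print(overall)
```

With these counts the pooled on-time rate works out to roughly 0.867 for ALASKA (3274 of 3775 flights) and 0.891 for AM WEST (6438 of 7225 flights).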

By Count
AM WEST has more on-time flights in total. This is probably due to the large number of flights in Phoenix, where most flights are AM WEST's.
ggplot(clean_data, aes(x = Airline,y=on_time))+geom_bar(stat="identity")+
labs(title = "Count of Timely Arrivals")
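To see how each airline's flights are spread across cities, one option is to tabulate total flights and convert each row to shares. The sketch below uses base R only. The names `flights`, `counts`, and `shares` are illustrative, and the counts are retyped from the head(delays) output rather than read from the CSV.

```r
# Counts retyped from the head(delays) output (assumed to be the full dataset).
flights <- data.frame(
  Airline = rep(c("ALASKA", "AM WEST"), each = 5),
  city    = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                  "San.Francisco", "Seattle"), times = 2),
  on_time = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201),
  delayed = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61)
)
flights$total <- flights$on_time + flights$delayed

# Rows are airlines, columns are cities, cells are total flights.
counts <- xtabs(total ~ Airline + city, data = flights)

# Divide each row by its sum: the share of an airline's flights in each city.
shares <- prop.table(counts, margin = 1)

print(round(shares, 3))
```

On these counts, about 73% of AM WEST's flights are in Phoenix and about 57% of ALASKA's flights are in Seattle.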

Conclusion
Although Alaska Airlines outperforms AM WEST on on-time rate in every city, its overall on-time rate is worse. This is because most of AM WEST's flights are in Phoenix, while most of Alaska's flights are in Seattle. AM WEST in Phoenix outperforms Alaska in Seattle, and those are the cities that carry the most weight in each airline's overall rate (Simpson's paradox).
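The reversal can be checked directly: an airline's overall rate is the average of its city rates weighted by the number of flights in each city. The following base R sketch compares the per-city and pooled results. The names `flights`, `alaska_wins_by_city`, and `pooled` are hypothetical, and the counts are retyped from the head(delays) output.

```r
# Counts retyped from the head(delays) output (assumed to be the full dataset).
flights <- data.frame(
  Airline = rep(c("ALASKA", "AM WEST"), each = 5),
  city    = rep(c("Los.Angeles", "Phoenix", "San.Diego",
                  "San.Francisco", "Seattle"), times = 2),
  on_time = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201),
  delayed = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61)
)
flights$total <- flights$on_time + flights$delayed
flights$rate  <- flights$on_time / flights$total

# Per-city comparison. Both subsets list the cities in the same order
# because of how the data frame was built above.
alaska <- flights[flights$Airline == "ALASKA", ]
amwest <- flights[flights$Airline == "AM WEST", ]
alaska_wins_by_city <- setNames(alaska$rate > amwest$rate, alaska$city)

# Pooled rate per airline: city rates weighted by flights per city.
pooled <- sapply(split(flights, flights$Airline),
                 function(d) weighted.mean(d$rate, d$total))

print(alaska_wins_by_city)
print(round(pooled, 3))
```

With these counts every entry of `alaska_wins_by_city` is TRUE, yet the pooled rate is higher for AM WEST. AM WEST's weights are concentrated in Phoenix, where its rate is high, while ALASKA's are concentrated in Seattle, where its rate is comparatively low.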