Overview: We are tasked with creating the csv pictured in the assignment, loading in the csv in to R, and cleaning/reshaping the data into a format where we can do analysis. I use tidyverse and ggplot2 to do this.

1

First I need to create a .csv file in the same format as the table shown in the instructions

r1 = c('Alaska', 'on time', 497, 221, 212, 503, 1841)
r2 = c(NA, 'delayed', 62, 12, 20, 102, 305)
r3 = rep(NA, 7)
r4 = c('AM WEST', 'on time', 694, 4840, 383, 320, 201)
r5 = c(NA, 'delayed', 117, 415, 65, 129, 61)
df = data.frame(t(cbind(r1, r2, r3, r4, r5)))
colnames(df) = c(NA, NA, 'Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle')
write.csv(df, 'flights.csv', row.names=FALSE)

2

Read in csv from github, fill in two missing airline values

df = read_csv('https://raw.githubusercontent.com/jmhsi/DATA_607/master/flights.csv')
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   `Los Angeles` = col_double(),
##   Phoenix = col_double(),
##   `San Diego` = col_double(),
##   `San Francisco` = col_double(),
##   Seattle = col_double()
## )
# reshaped = cast(ratings_table_2, names ~ movies, value='ratings')
# fill missing Airline lines
df[2,1] = 'Alaska'
df[5,1] = 'AM WEST'
kable(df, caption='Data from csv with added airline information')
Data from csv with added airline information
X1 X2 Los Angeles Phoenix San Diego San Francisco Seattle
Alaska on time 497 221 212 503 1841
Alaska delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

We use pivot_longer to turn columns 3:7 into a single column called location

df = df %>% pivot_longer(cols = 3:7, names_to = 'location', values_to ='counts', values_drop_na = TRUE)
colnames(df) = c('airline', 'status', 'location', 'counts')
kable(df, caption='Pivot longer the location column')
Pivot longer the location column
airline status location counts
Alaska on time Los Angeles 497
Alaska on time Phoenix 221
Alaska on time San Diego 212
Alaska on time San Francisco 503
Alaska on time Seattle 1841
Alaska delayed Los Angeles 62
Alaska delayed Phoenix 12
Alaska delayed San Diego 20
Alaska delayed San Francisco 102
Alaska delayed Seattle 305
AM WEST on time Los Angeles 694
AM WEST on time Phoenix 4840
AM WEST on time San Diego 383
AM WEST on time San Francisco 320
AM WEST on time Seattle 201
AM WEST delayed Los Angeles 117
AM WEST delayed Phoenix 415
AM WEST delayed San Diego 65
AM WEST delayed San Francisco 129
AM WEST delayed Seattle 61

Pivot the status column into 2 columns since we are going to be doing analysis about delayed vs on time.

df = df %>% pivot_wider(names_from=status, values_from=counts) %>% arrange(airline)
kable(df, caption='Pivot wider the status column')
Pivot wider the status column
airline location on time delayed
Alaska Los Angeles 497 62
Alaska Phoenix 221 12
Alaska San Diego 212 20
Alaska San Francisco 503 102
Alaska Seattle 1841 305
AM WEST Los Angeles 694 117
AM WEST Phoenix 4840 415
AM WEST San Diego 383 65
AM WEST San Francisco 320 129
AM WEST Seattle 201 61

3

Perform analysis to compare the arrival delays for the two airlines. Make totals column to find percents with.

df$total_flights = df$`on time` + df$delayed
df$pct_on_time = df$`on time`/df$total_flights
df$pct_delayed = df$delayed/df$total_flights
df = df %>% pivot_longer(cols = 6:7, names_to = 'pct_status', values_to = 'pct')
delays_df = df %>% filter(pct_status == 'pct_delayed')
ontime_df = df %>% filter(pct_status == 'pct_on_time')
alaska_df = df %>% pivot_wider(names_from = pct_status, values_from = pct) %>% filter(airline == 'Alaska')
amwest_df = df %>% pivot_wider(names_from = pct_status, values_from = pct) %>% filter(airline == 'AM WEST')
alaska_overall_delay_pct = sum(alaska_df$delayed)/sum(alaska_df$total_flights)
amwest_overall_delay_pct = sum(amwest_df$delayed)/sum(amwest_df$total_flights)
ggplot(delays_df, aes(x = location, y = pct, fill = airline)) + geom_bar(stat = 'identity', position = 'dodge') + ylab("pct_delays") + annotate("text", x = 2, y = .28, label = paste("Overall Alaska Delay Rate:", alaska_overall_delay_pct)) +
annotate("text", x = 2.1, y = .25, label = paste("Overall AMWEST Delay Rate:", amwest_overall_delay_pct))

Conclusions: We can see across all locations that Alaska has a smaller percentage of delayed arrivals. However, this contrasts with the overall delay rate, which shows that AMWEST actually has a lower overall percent of delayed arrivals to total flights. So depending on the context, our determination of which airline has more delays can change.