Overview: We are tasked with creating the csv pictured in the assignment, loading in the csv in to R, and cleaning/reshaping the data into a format where we can do analysis. I use tidyverse and ggplot2 to do this.

1

First I need to create a .csv file in the same format as the table shown in the instructions

r1 = c('Alaska', 'on time', 497, 221, 212, 503, 1841)
r2 = c(NA, 'delayed', 62, 12, 20, 102, 305)
r3 = rep(NA, 7)
r4 = c('AM WEST', 'on time', 694, 4840, 383, 320, 201)
r5 = c(NA, 'delayed', 117, 415, 65, 129, 61)
df = data.frame(t(cbind(r1, r2, r3, r4, r5)))
colnames(df) = c(NA, NA, 'Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle')
write.csv(df, 'flights.csv', row.names=FALSE)

2

Read in csv from github, fill in two missing airline values

df = read_csv('https://raw.githubusercontent.com/jmhsi/DATA_607/master/flights.csv')

## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]

## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   `Los Angeles` = col_double(),
##   Phoenix = col_double(),
##   `San Diego` = col_double(),
##   `San Francisco` = col_double(),
##   Seattle = col_double()
## )

# reshaped = cast(ratings_table_2, names ~ movies, value='ratings')
# fill missing Airline lines
df[2,1] = 'Alaska'
df[5,1] = 'AM WEST'
kable(df, caption='Data from csv with added airline information')

Data from csv with added airline information
X1	X2	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
Alaska	on time	497	221	212	503	1841
Alaska	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

We use pivot_longer to turn columns 3:7 into a single column called location

df = df %>% pivot_longer(cols = 3:7, names_to = 'location', values_to ='counts', values_drop_na = TRUE)
colnames(df) = c('airline', 'status', 'location', 'counts')
kable(df, caption='Pivot longer the location column')

Pivot longer the location column
airline	status	location	counts
Alaska	on time	Los Angeles	497
Alaska	on time	Phoenix	221
Alaska	on time	San Diego	212
Alaska	on time	San Francisco	503
Alaska	on time	Seattle	1841
Alaska	delayed	Los Angeles	62
Alaska	delayed	Phoenix	12
Alaska	delayed	San Diego	20
Alaska	delayed	San Francisco	102
Alaska	delayed	Seattle	305
AM WEST	on time	Los Angeles	694
AM WEST	on time	Phoenix	4840
AM WEST	on time	San Diego	383
AM WEST	on time	San Francisco	320
AM WEST	on time	Seattle	201
AM WEST	delayed	Los Angeles	117
AM WEST	delayed	Phoenix	415
AM WEST	delayed	San Diego	65
AM WEST	delayed	San Francisco	129
AM WEST	delayed	Seattle	61

Pivot the status column into 2 columns since we are going to be doing analysis about delayed vs on time.

df = df %>% pivot_wider(names_from=status, values_from=counts) %>% arrange(airline)
kable(df, caption='Pivot wider the status column')

Pivot wider the status column
airline	location	on time	delayed
Alaska	Los Angeles	497	62
Alaska	Phoenix	221	12
Alaska	San Diego	212	20
Alaska	San Francisco	503	102
Alaska	Seattle	1841	305
AM WEST	Los Angeles	694	117
AM WEST	Phoenix	4840	415
AM WEST	San Diego	383	65
AM WEST	San Francisco	320	129
AM WEST	Seattle	201	61

3

Perform analysis to compare the arrival delays for the two airlines. Make totals column to find percents with.

df$total_flights = df$`on time` + df$delayed
df$pct_on_time = df$`on time`/df$total_flights
df$pct_delayed = df$delayed/df$total_flights
df = df %>% pivot_longer(cols = 6:7, names_to = 'pct_status', values_to = 'pct')

delays_df = df %>% filter(pct_status == 'pct_delayed')
ontime_df = df %>% filter(pct_status == 'pct_on_time')
alaska_df = df %>% pivot_wider(names_from = pct_status, values_from = pct) %>% filter(airline == 'Alaska')
amwest_df = df %>% pivot_wider(names_from = pct_status, values_from = pct) %>% filter(airline == 'AM WEST')
alaska_overall_delay_pct = sum(alaska_df$delayed)/sum(alaska_df$total_flights)
amwest_overall_delay_pct = sum(amwest_df$delayed)/sum(amwest_df$total_flights)
ggplot(delays_df, aes(x = location, y = pct, fill = airline)) + geom_bar(stat = 'identity', position = 'dodge') + ylab("pct_delays") + annotate("text", x = 2, y = .28, label = paste("Overall Alaska Delay Rate:", alaska_overall_delay_pct)) +
annotate("text", x = 2.1, y = .25, label = paste("Overall AMWEST Delay Rate:", amwest_overall_delay_pct))

Conclusions: We can see across all locations that Alaska has a smaller percentage of delayed arrivals. However, this contrasts with the overall delay rate, which shows that AMWEST actually has a lower overall percent of delayed arrivals to total flights. So depending on the context, our determination of which airline has more delays can change.

Data 607 Week 5 Assignment

Justin Hsi

2/27/2020

1

2

3