Overview: We are tasked with creating the csv pictured in the assignment, loading in the csv in to R, and cleaning/reshaping the data into a format where we can do analysis. I use tidyverse and ggplot2 to do this.
First I need to create a .csv file in the same format as the table shown in the instructions
r1 = c('Alaska', 'on time', 497, 221, 212, 503, 1841)
r2 = c(NA, 'delayed', 62, 12, 20, 102, 305)
r3 = rep(NA, 7)
r4 = c('AM WEST', 'on time', 694, 4840, 383, 320, 201)
r5 = c(NA, 'delayed', 117, 415, 65, 129, 61)
df = data.frame(t(cbind(r1, r2, r3, r4, r5)))
colnames(df) = c(NA, NA, 'Los Angeles', 'Phoenix', 'San Diego', 'San Francisco', 'Seattle')
write.csv(df, 'flights.csv', row.names=FALSE)
Read in csv from github, fill in two missing airline values
df = read_csv('https://raw.githubusercontent.com/jmhsi/DATA_607/master/flights.csv')
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## `Los Angeles` = col_double(),
## Phoenix = col_double(),
## `San Diego` = col_double(),
## `San Francisco` = col_double(),
## Seattle = col_double()
## )
# reshaped = cast(ratings_table_2, names ~ movies, value='ratings')
# fill missing Airline lines
df[2,1] = 'Alaska'
df[5,1] = 'AM WEST'
kable(df, caption='Data from csv with added airline information')
| X1 | X2 | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| Alaska | on time | 497 | 221 | 212 | 503 | 1841 |
| Alaska | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
We use pivot_longer to turn columns 3:7 into a single column called location
df = df %>% pivot_longer(cols = 3:7, names_to = 'location', values_to ='counts', values_drop_na = TRUE)
colnames(df) = c('airline', 'status', 'location', 'counts')
kable(df, caption='Pivot longer the location column')
| airline | status | location | counts |
|---|---|---|---|
| Alaska | on time | Los Angeles | 497 |
| Alaska | on time | Phoenix | 221 |
| Alaska | on time | San Diego | 212 |
| Alaska | on time | San Francisco | 503 |
| Alaska | on time | Seattle | 1841 |
| Alaska | delayed | Los Angeles | 62 |
| Alaska | delayed | Phoenix | 12 |
| Alaska | delayed | San Diego | 20 |
| Alaska | delayed | San Francisco | 102 |
| Alaska | delayed | Seattle | 305 |
| AM WEST | on time | Los Angeles | 694 |
| AM WEST | on time | Phoenix | 4840 |
| AM WEST | on time | San Diego | 383 |
| AM WEST | on time | San Francisco | 320 |
| AM WEST | on time | Seattle | 201 |
| AM WEST | delayed | Los Angeles | 117 |
| AM WEST | delayed | Phoenix | 415 |
| AM WEST | delayed | San Diego | 65 |
| AM WEST | delayed | San Francisco | 129 |
| AM WEST | delayed | Seattle | 61 |
Pivot the status column into 2 columns since we are going to be doing analysis about delayed vs on time.
df = df %>% pivot_wider(names_from=status, values_from=counts) %>% arrange(airline)
kable(df, caption='Pivot wider the status column')
| airline | location | on time | delayed |
|---|---|---|---|
| Alaska | Los Angeles | 497 | 62 |
| Alaska | Phoenix | 221 | 12 |
| Alaska | San Diego | 212 | 20 |
| Alaska | San Francisco | 503 | 102 |
| Alaska | Seattle | 1841 | 305 |
| AM WEST | Los Angeles | 694 | 117 |
| AM WEST | Phoenix | 4840 | 415 |
| AM WEST | San Diego | 383 | 65 |
| AM WEST | San Francisco | 320 | 129 |
| AM WEST | Seattle | 201 | 61 |
Perform analysis to compare the arrival delays for the two airlines. Make totals column to find percents with.
df$total_flights = df$`on time` + df$delayed
df$pct_on_time = df$`on time`/df$total_flights
df$pct_delayed = df$delayed/df$total_flights
df = df %>% pivot_longer(cols = 6:7, names_to = 'pct_status', values_to = 'pct')
delays_df = df %>% filter(pct_status == 'pct_delayed')
ontime_df = df %>% filter(pct_status == 'pct_on_time')
alaska_df = df %>% pivot_wider(names_from = pct_status, values_from = pct) %>% filter(airline == 'Alaska')
amwest_df = df %>% pivot_wider(names_from = pct_status, values_from = pct) %>% filter(airline == 'AM WEST')
alaska_overall_delay_pct = sum(alaska_df$delayed)/sum(alaska_df$total_flights)
amwest_overall_delay_pct = sum(amwest_df$delayed)/sum(amwest_df$total_flights)
ggplot(delays_df, aes(x = location, y = pct, fill = airline)) + geom_bar(stat = 'identity', position = 'dodge') + ylab("pct_delays") + annotate("text", x = 2, y = .28, label = paste("Overall Alaska Delay Rate:", alaska_overall_delay_pct)) +
annotate("text", x = 2.1, y = .25, label = paste("Overall AMWEST Delay Rate:", amwest_overall_delay_pct))
Conclusions: We can see across all locations that Alaska has a smaller percentage of delayed arrivals. However, this contrasts with the overall delay rate, which shows that AMWEST actually has a lower overall percent of delayed arrivals to total flights. So depending on the context, our determination of which airline has more delays can change.