Week5_Assignment

Load packages

library(readr)
library(knitr)
library(kableExtra)# manipulate table styles
suppressMessages(library(zoo))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))

Read the data

wide_df <- read_csv("info.csv")

## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]

## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   `Los Angeles` = col_integer(),
##   Phoenix = col_number(),
##   `San Diego` = col_integer(),
##   `San Francisco` = col_integer(),
##   Seattle = col_number()
## )

colnames(wide_df)[1] <- "Airline"
colnames(wide_df)[2] <- "Status"


kable(wide_df)%>%
        kable_styling(bootstrap_options = "striped", full_width = F)

Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

Adjust data

Use filter and any_vars functions from dplyr to remove the empty row (all vars are NAs).

kable(wide_df <- wide_df %>% filter_all(any_vars(!is.na(.))))%>%
        kable_styling(bootstrap_options = "striped", full_width = F)

Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

Also use na.locf function from zoo package to replace each NA with the most recent non-NA prior to it.

kable(wide_df <- wide_df %>% na.locf())%>%
        kable_styling(bootstrap_options = "striped", full_width = F)

Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

Tidying the data

The variables containing in the data are airline, status, destination and count. This data is considered to be wide format because the variable destination is structured as if each city is a variable.

Let’s reshape the data into the long format by taking the counts for each city and combining them into one single column.

kable(long_df <- gather(wide_df, Destination, Count,3:7))%>%
        kable_styling(bootstrap_options = "striped", full_width = F)

Airline	Status	Destination	Count
ALASKA	on time	Los Angeles	497
ALASKA	delayed	Los Angeles	62
AM WEST	on time	Los Angeles	694
AM WEST	delayed	Los Angeles	117
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12
AM WEST	on time	Phoenix	4840
AM WEST	delayed	Phoenix	415
ALASKA	on time	San Diego	212
ALASKA	delayed	San Diego	20
AM WEST	on time	San Diego	383
AM WEST	delayed	San Diego	65
ALASKA	on time	San Francisco	503
ALASKA	delayed	San Francisco	102
AM WEST	on time	San Francisco	320
AM WEST	delayed	San Francisco	129
ALASKA	on time	Seattle	1841
ALASKA	delayed	Seattle	305
AM WEST	on time	Seattle	201
AM WEST	delayed	Seattle	61

Next we reshape the data in long format to have each status in its own column.

df <- spread(long_df, Status, Count)
colnames(df)[3] <- "Delayed"
colnames(df)[4] <- "On_Time"
kable(df)%>%
        kable_styling(bootstrap_options = "striped", full_width = F)

Airline	Destination	Delayed	On_Time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seattle	61	201

Analysis

Analysis 1

In average, what destination(s) had the most delayed flights and the least on-time flights?

kable(means <- df %>% group_by(Destination) %>% summarise(Mean_Delayed = mean(Delayed), Mean_On_Time = mean(On_Time)))%>%
        kable_styling(bootstrap_options = "striped", full_width = F)

Destination	Mean_Delayed	Mean_On_Time
Los Angeles	89.5	595.5
Phoenix	213.5	2530.5
San Diego	42.5	297.5
San Francisco	115.5	411.5
Seattle	183.0	1021.0

We see that Phoenix as destination for both airlines had the most number of flights delayed in average. And San Diego had the least number of flights on time in average.

Analysis 2

From the given data, what airline was most likely to have delayed flights? Alternatively, one may ask which airline was least likely to be on time.

For each airline, we compute the probability of having a flight delayed and/or the probability of having a flight on time.

(number of an airline’s delayed flights) / (number of all delayed flights)

(number of an airline’s on-time flights) / (number of all on-time flights)

new_df <- df %>% group_by(Airline) %>%
        summarise(Freq_Delayed=sum(Delayed), Freq_OnTime=sum(On_Time))

new_df <- new_df %>% group_by(Airline) %>%
        mutate(Prob_Delayed=Freq_Delayed/sum(Freq_Delayed,Freq_OnTime),    Prob_OnTime=Freq_OnTime/sum(Freq_Delayed,Freq_OnTime))

kable(new_df)%>%
        kable_styling(bootstrap_options = "striped", full_width = F)

Airline	Freq_Delayed	Freq_OnTime	Prob_Delayed	Prob_OnTime
ALASKA	501	3274	0.1327152	0.8672848
AM WEST	787	6438	0.1089273	0.8910727

Therefore, Alaska is the most likely airline to have delayed flights and also least likely to have flights on time. Based on the given information, one might consider not to buy plane tickets from Alaska Airlines.