Information is read from a csv file into R. The data has a “wide” structure. In this assigment, we practice tidying and tranforming the data, and also performing some analysis.
library(readr)
library(knitr)
library(kableExtra)# manipulate table styles
suppressMessages(library(zoo))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))
wide_df <- read_csv("info.csv")
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## `Los Angeles` = col_integer(),
## Phoenix = col_number(),
## `San Diego` = col_integer(),
## `San Francisco` = col_integer(),
## Seattle = col_number()
## )
colnames(wide_df)[1] <- "Airline"
colnames(wide_df)[2] <- "Status"
kable(wide_df)%>%
kable_styling(bootstrap_options = "striped", full_width = F)
| Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
Use filter and any_vars functions from dplyr to remove the empty row (all vars are NAs).
kable(wide_df <- wide_df %>% filter_all(any_vars(!is.na(.))))%>%
kable_styling(bootstrap_options = "striped", full_width = F)
| Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
Also use na.locf function from zoo package to replace each NA with the most recent non-NA prior to it.
kable(wide_df <- wide_df %>% na.locf())%>%
kable_styling(bootstrap_options = "striped", full_width = F)
| Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
The variables containing in the data are airline, status, destination and count. This data is considered to be wide format because the variable destination is structured as if each city is a variable.
Let’s reshape the data into the long format by taking the counts for each city and combining them into one single column.
kable(long_df <- gather(wide_df, Destination, Count,3:7))%>%
kable_styling(bootstrap_options = "striped", full_width = F)
| Airline | Status | Destination | Count |
|---|---|---|---|
| ALASKA | on time | Los Angeles | 497 |
| ALASKA | delayed | Los Angeles | 62 |
| AM WEST | on time | Los Angeles | 694 |
| AM WEST | delayed | Los Angeles | 117 |
| ALASKA | on time | Phoenix | 221 |
| ALASKA | delayed | Phoenix | 12 |
| AM WEST | on time | Phoenix | 4840 |
| AM WEST | delayed | Phoenix | 415 |
| ALASKA | on time | San Diego | 212 |
| ALASKA | delayed | San Diego | 20 |
| AM WEST | on time | San Diego | 383 |
| AM WEST | delayed | San Diego | 65 |
| ALASKA | on time | San Francisco | 503 |
| ALASKA | delayed | San Francisco | 102 |
| AM WEST | on time | San Francisco | 320 |
| AM WEST | delayed | San Francisco | 129 |
| ALASKA | on time | Seattle | 1841 |
| ALASKA | delayed | Seattle | 305 |
| AM WEST | on time | Seattle | 201 |
| AM WEST | delayed | Seattle | 61 |
Next we reshape the data in long format to have each status in its own column.
df <- spread(long_df, Status, Count)
colnames(df)[3] <- "Delayed"
colnames(df)[4] <- "On_Time"
kable(df)%>%
kable_styling(bootstrap_options = "striped", full_width = F)
| Airline | Destination | Delayed | On_Time |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San Diego | 20 | 212 |
| ALASKA | San Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4840 |
| AM WEST | San Diego | 65 | 383 |
| AM WEST | San Francisco | 129 | 320 |
| AM WEST | Seattle | 61 | 201 |
In average, what destination(s) had the most delayed flights and the least on-time flights?
kable(means <- df %>% group_by(Destination) %>% summarise(Mean_Delayed = mean(Delayed), Mean_On_Time = mean(On_Time)))%>%
kable_styling(bootstrap_options = "striped", full_width = F)
| Destination | Mean_Delayed | Mean_On_Time |
|---|---|---|
| Los Angeles | 89.5 | 595.5 |
| Phoenix | 213.5 | 2530.5 |
| San Diego | 42.5 | 297.5 |
| San Francisco | 115.5 | 411.5 |
| Seattle | 183.0 | 1021.0 |
We see that Phoenix as destination for both airlines had the most number of flights delayed in average. And San Diego had the least number of flights on time in average.
From the given data, what airline was most likely to have delayed flights? Alternatively, one may ask which airline was least likely to be on time.
For each airline, we compute the probability of having a flight delayed and/or the probability of having a flight on time.
(number of an airline’s delayed flights) / (number of all delayed flights)
(number of an airline’s on-time flights) / (number of all on-time flights)
new_df <- df %>% group_by(Airline) %>%
summarise(Freq_Delayed=sum(Delayed), Freq_OnTime=sum(On_Time))
new_df <- new_df %>% group_by(Airline) %>%
mutate(Prob_Delayed=Freq_Delayed/sum(Freq_Delayed,Freq_OnTime), Prob_OnTime=Freq_OnTime/sum(Freq_Delayed,Freq_OnTime))
kable(new_df)%>%
kable_styling(bootstrap_options = "striped", full_width = F)
| Airline | Freq_Delayed | Freq_OnTime | Prob_Delayed | Prob_OnTime |
|---|---|---|---|---|
| ALASKA | 501 | 3274 | 0.1327152 | 0.8672848 |
| AM WEST | 787 | 6438 | 0.1089273 | 0.8910727 |
Therefore, Alaska is the most likely airline to have delayed flights and also least likely to have flights on time. Based on the given information, one might consider not to buy plane tickets from Alaska Airlines.