Tidying and Transforming Data

Information is read from a csv file into R. The data has a “wide” structure. In this assigment, we practice tidying and tranforming the data, and also performing some analysis.

Load packages

library(readr)
library(knitr)
library(kableExtra)# manipulate table styles
suppressMessages(library(zoo))
suppressMessages(library(tidyr))
suppressMessages(library(dplyr))

Read the data

wide_df <- read_csv("info.csv")
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   `Los Angeles` = col_integer(),
##   Phoenix = col_number(),
##   `San Diego` = col_integer(),
##   `San Francisco` = col_integer(),
##   Seattle = col_number()
## )
colnames(wide_df)[1] <- "Airline"
colnames(wide_df)[2] <- "Status"


kable(wide_df)%>%
        kable_styling(bootstrap_options = "striped", full_width = F)
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

Adjust data

Use filter and any_vars functions from dplyr to remove the empty row (all vars are NAs).

kable(wide_df <- wide_df %>% filter_all(any_vars(!is.na(.))))%>%
        kable_styling(bootstrap_options = "striped", full_width = F)
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

Also use na.locf function from zoo package to replace each NA with the most recent non-NA prior to it.

kable(wide_df <- wide_df %>% na.locf())%>%
        kable_styling(bootstrap_options = "striped", full_width = F)
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Tidying the data

The variables containing in the data are airline, status, destination and count. This data is considered to be wide format because the variable destination is structured as if each city is a variable.

Let’s reshape the data into the long format by taking the counts for each city and combining them into one single column.

kable(long_df <- gather(wide_df, Destination, Count,3:7))%>%
        kable_styling(bootstrap_options = "striped", full_width = F)
Airline Status Destination Count
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on time San Diego 212
ALASKA delayed San Diego 20
AM WEST on time San Diego 383
AM WEST delayed San Diego 65
ALASKA on time San Francisco 503
ALASKA delayed San Francisco 102
AM WEST on time San Francisco 320
AM WEST delayed San Francisco 129
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Seattle 201
AM WEST delayed Seattle 61

Next we reshape the data in long format to have each status in its own column.

df <- spread(long_df, Status, Count)
colnames(df)[3] <- "Delayed"
colnames(df)[4] <- "On_Time"
kable(df)%>%
        kable_styling(bootstrap_options = "striped", full_width = F)
Airline Destination Delayed On_Time
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seattle 61 201

Analysis

Analysis 1

In average, what destination(s) had the most delayed flights and the least on-time flights?

kable(means <- df %>% group_by(Destination) %>% summarise(Mean_Delayed = mean(Delayed), Mean_On_Time = mean(On_Time)))%>%
        kable_styling(bootstrap_options = "striped", full_width = F)
Destination Mean_Delayed Mean_On_Time
Los Angeles 89.5 595.5
Phoenix 213.5 2530.5
San Diego 42.5 297.5
San Francisco 115.5 411.5
Seattle 183.0 1021.0

We see that Phoenix as destination for both airlines had the most number of flights delayed in average. And San Diego had the least number of flights on time in average.

Analysis 2

From the given data, what airline was most likely to have delayed flights? Alternatively, one may ask which airline was least likely to be on time.

For each airline, we compute the probability of having a flight delayed and/or the probability of having a flight on time.

(number of an airline’s delayed flights) / (number of all delayed flights)

(number of an airline’s on-time flights) / (number of all on-time flights)

new_df <- df %>% group_by(Airline) %>%
        summarise(Freq_Delayed=sum(Delayed), Freq_OnTime=sum(On_Time))

new_df <- new_df %>% group_by(Airline) %>%
        mutate(Prob_Delayed=Freq_Delayed/sum(Freq_Delayed,Freq_OnTime),    Prob_OnTime=Freq_OnTime/sum(Freq_Delayed,Freq_OnTime))

kable(new_df)%>%
        kable_styling(bootstrap_options = "striped", full_width = F)
Airline Freq_Delayed Freq_OnTime Prob_Delayed Prob_OnTime
ALASKA 501 3274 0.1327152 0.8672848
AM WEST 787 6438 0.1089273 0.8910727

Therefore, Alaska is the most likely airline to have delayed flights and also least likely to have flights on time. Based on the given information, one might consider not to buy plane tickets from Alaska Airlines.