Import libraries

library(dplyr)
library(tidyr)

Collect the data

The data is a csv format and can be found using this link

df = read.csv('https://raw.githubusercontent.com/Kossi-Akplaka/Data607-data_acquisition_and_management/main/Assignment5/Assignment5-data607.csv', sep = ',')

Remove the third row since there is no information and fill missing values

df[2, 1] <- "ALASKA"
df[5, 1] <- "AM WEST"
df_clean = df[-3,]

(df_clean)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

Now, let’s tidy and transform the data.

flight_data = df_clean %>% 
  rename(Airline = X) %>% 
  rename(Status = X.1)

flight_data
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61
flight_data_long <- flight_data %>% 
  pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
               names_to = "Destination",
               values_to = "Count")

flight_data_long
## # A tibble: 20 × 4
##    Airline Status  Destination   Count
##    <chr>   <chr>   <chr>         <int>
##  1 ALASKA  on time Los.Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San.Diego       212
##  4 ALASKA  on time San.Francisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los.Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San.Diego        20
##  9 ALASKA  delayed San.Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los.Angeles     694
## 12 AM WEST on time Phoenix        4840
## 13 AM WEST on time San.Diego       383
## 14 AM WEST on time San.Francisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los.Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San.Diego        65
## 19 AM WEST delayed San.Francisco   129
## 20 AM WEST delayed Seattle          61

Analyse the data

Now, we can perform an analysis to compare the arrival delays for the two airlines.

Let’s look at the average, min, and max of delay in these two airline

flight_data_long  %>%
   filter(Status == "delayed") %>% 
   group_by(Airline) %>%
   summarise( Min = min(Count),  Max = max(Count), Mean = round(mean(Count)))
## # A tibble: 2 × 4
##   Airline   Min   Max  Mean
##   <chr>   <int> <int> <dbl>
## 1 ALASKA     12   305   100
## 2 AM WEST    61   415   157

The summary statistics suggests that it’s better to take Alaska Airline because it has less delays on average. The minimum and maximum number ALASKA airlines’s delayed to different destinations are also lower than AM WEST.

Simpson Paradox

Professor Andy introduced in class the Simpson’s Paradox which is a statistical phenomenon where a trend appears in different groups of data but disappears or reverses when these groups are combined.

To check if Simpson’s Paradox is present in the data, we would need to examine how the summary statistics change when we consider the data at different levels of aggregation.

You can find more information about Simpson’sParadox in this article “https://towardsdatascience.com/simpsons-paradox-and-interpreting-data-6a0443516765

Calculate summary statistics for the entire dataset without grouping by airline.

flight_data_long %>%
  filter(Status == "delayed") %>%
  summarise(Min = min(Count), Max = max(Count), Mean = round(mean(Count)))
## # A tibble: 1 × 3
##     Min   Max  Mean
##   <int> <int> <dbl>
## 1    12   415   129