library(dplyr)
library(tidyr)
The data is a csv format and can be found using this link
df = read.csv('https://raw.githubusercontent.com/Kossi-Akplaka/Data607-data_acquisition_and_management/main/Assignment5/Assignment5-data607.csv', sep = ',')
Remove the third row since there is no information and fill missing values
df[2, 1] <- "ALASKA"
df[5, 1] <- "AM WEST"
df_clean = df[-3,]
(df_clean)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Now, let’s tidy and transform the data.
flight_data = df_clean %>%
rename(Airline = X) %>%
rename(Status = X.1)
flight_data
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
flight_data_long <- flight_data %>%
pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
names_to = "Destination",
values_to = "Count")
flight_data_long
## # A tibble: 20 × 4
## Airline Status Destination Count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
Now, we can perform an analysis to compare the arrival delays for the two airlines.
Let’s look at the average, min, and max of delay in these two airline
flight_data_long %>%
filter(Status == "delayed") %>%
group_by(Airline) %>%
summarise( Min = min(Count), Max = max(Count), Mean = round(mean(Count)))
## # A tibble: 2 × 4
## Airline Min Max Mean
## <chr> <int> <int> <dbl>
## 1 ALASKA 12 305 100
## 2 AM WEST 61 415 157
The summary statistics suggests that it’s better to take Alaska Airline because it has less delays on average. The minimum and maximum number ALASKA airlines’s delayed to different destinations are also lower than AM WEST.
Professor Andy introduced in class the Simpson’s Paradox which is a statistical phenomenon where a trend appears in different groups of data but disappears or reverses when these groups are combined.
To check if Simpson’s Paradox is present in the data, we would need to examine how the summary statistics change when we consider the data at different levels of aggregation.
You can find more information about Simpson’sParadox in this article “https://towardsdatascience.com/simpsons-paradox-and-interpreting-data-6a0443516765”
Calculate summary statistics for the entire dataset without grouping by airline.
flight_data_long %>%
filter(Status == "delayed") %>%
summarise(Min = min(Count), Max = max(Count), Mean = round(mean(Count)))
## # A tibble: 1 × 3
## Min Max Mean
## <int> <int> <dbl>
## 1 12 415 129