1.How will I tackle the problem?

I will input the data into a spreadsheet and export it as a CSV file. Then, I will read the data into R and perform data cleaning. Finally, I will complete the assignment requirements (comparing user spreadsheets, apply Simpson’s Paradox, etc.).

2.What data challenges do I anticipate?

During computation and aggregation, prevent incorrect values or improper handling of missing data from affecting the final interpretation. I will verify the understanding of LLM-generated code before applying it.

source: “https://raw.githubusercontent.com/XxY-coder/data607-week5a.Y/refs/heads/main/airlines.csv

3. Transformed data from wide to long format.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
airlines_df <- read.csv("https://raw.githubusercontent.com/XxY-coder/data607-week5a.Y/refs/heads/main/airlines.csv")
names(airlines_df)
## [1] "X"             "X.1"           "LOS.Angeles"   "Phoenix"      
## [5] "San.Diego"     "San.Francisco" "Seattle"
head(airlines_df)
##        X     X.1 LOS.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time         497     221       212           503    1841
## 2        delayed          62      12        20           102     305
## 3                         NA      NA        NA            NA      NA
## 4 AMWEST on time         694    4840       383           320     201
## 5        delayed         117     415        65           129      61

rename and clean the NA values.

flights_df <-
  airlines_df %>%
  rename(
    airline = X,
    flight_condition = X.1,
) %>%
  mutate(airline = na_if(airline, "")) %>%
  fill(airline, 
       .direction = "down") %>%
  pivot_longer(
    cols = c("LOS.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"), 
    names_to = "City", 
    values_to = "Count",
    values_drop_na = TRUE
  )
flights_df
## # A tibble: 20 × 4
##    airline flight_condition City          Count
##    <chr>   <chr>            <chr>         <int>
##  1 ALASKA  on time          LOS.Angeles     497
##  2 ALASKA  on time          Phoenix         221
##  3 ALASKA  on time          San.Diego       212
##  4 ALASKA  on time          San.Francisco   503
##  5 ALASKA  on time          Seattle        1841
##  6 ALASKA  delayed          LOS.Angeles      62
##  7 ALASKA  delayed          Phoenix          12
##  8 ALASKA  delayed          San.Diego        20
##  9 ALASKA  delayed          San.Francisco   102
## 10 ALASKA  delayed          Seattle         305
## 11 AMWEST  on time          LOS.Angeles     694
## 12 AMWEST  on time          Phoenix        4840
## 13 AMWEST  on time          San.Diego       383
## 14 AMWEST  on time          San.Francisco   320
## 15 AMWEST  on time          Seattle         201
## 16 AMWEST  delayed          LOS.Angeles     117
## 17 AMWEST  delayed          Phoenix         415
## 18 AMWEST  delayed          San.Diego        65
## 19 AMWEST  delayed          San.Francisco   129
## 20 AMWEST  delayed          Seattle          61

3.Compared percentage of either delays or arrival rates for two airlines overall

overall_delay_compare <- 
  flights_df %>%
  group_by(airline, flight_condition) %>%
  summarise(
    total_flight = sum(Count),
    .groups = 'drop') %>%
  group_by(airline) %>%
  mutate(percentage = round(total_flight / sum(total_flight) * 100,2)) %>%
  filter(flight_condition == "delayed")
overall_delay_compare
## # A tibble: 2 × 4
## # Groups:   airline [2]
##   airline flight_condition total_flight percentage
##   <chr>   <chr>                   <int>      <dbl>
## 1 ALASKA  delayed                   501       13.3
## 2 AMWEST  delayed                   787       10.9
overall_onTime_compare <- 
  flights_df %>%
  group_by(airline, flight_condition) %>%
  summarise(
    total_flight = sum(Count),
    .groups = 'drop') %>%
  group_by(airline) %>%
  mutate(percentage = round(total_flight / sum(total_flight) * 100, 2)) %>%
  filter(flight_condition == "on time")
overall_onTime_compare
## # A tibble: 2 × 4
## # Groups:   airline [2]
##   airline flight_condition total_flight percentage
##   <chr>   <chr>                   <int>      <dbl>
## 1 ALASKA  on time                  3274       86.7
## 2 AMWEST  on time                  6438       89.1
overall_compare <-
  bind_rows(
    overall_onTime_compare,
    overall_delay_compare
  )
overall_compare
## # A tibble: 4 × 4
## # Groups:   airline [2]
##   airline flight_condition total_flight percentage
##   <chr>   <chr>                   <int>      <dbl>
## 1 ALASKA  on time                  3274       86.7
## 2 AMWEST  on time                  6438       89.1
## 3 ALASKA  delayed                   501       13.3
## 4 AMWEST  delayed                   787       10.9
ggplot(
  overall_compare,
  aes(x=airline, y = percentage, fill = flight_condition)
) +
  geom_col(position = "dodge") +
  geom_text(aes(label = percentage),vjust = -0.8, size = 4) +
  labs(title = "Overall Flight Conditions Percentage by Airline", 
       x = "Airline", 
       y = "Percentage (%)") +
  theme_minimal() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.3)))

By overall proportion, AMWEST’s on-time rate is approximately 89.11%, with a delay rate of about 10.89%; ALASKA’s on-time rate is approximately 86.73%, with a delay rate of about 13.27%.

AMWEST’s delay rate is approximately 2.38 percentage lower than ALASKA’s, while its on-time rate is correspondingly about 2.38 percentage higher. Therefore, MWEST’s performance is superior to ALASKA’s.


4.Compare by each cites.

city_delay <-
  flights_df %>%
  group_by(airline, City, flight_condition) %>%
  summarise(city_flights = sum(Count), 
            .groups = 'drop') %>%
  group_by(airline, City) %>%
  mutate(percentage = round(city_flights / sum(city_flights) * 100,2)) %>%
  filter(flight_condition == "delayed")
city_delay
## # A tibble: 10 × 5
## # Groups:   airline, City [10]
##    airline City          flight_condition city_flights percentage
##    <chr>   <chr>         <chr>                   <int>      <dbl>
##  1 ALASKA  LOS.Angeles   delayed                    62      11.1 
##  2 ALASKA  Phoenix       delayed                    12       5.15
##  3 ALASKA  San.Diego     delayed                    20       8.62
##  4 ALASKA  San.Francisco delayed                   102      16.9 
##  5 ALASKA  Seattle       delayed                   305      14.2 
##  6 AMWEST  LOS.Angeles   delayed                   117      14.4 
##  7 AMWEST  Phoenix       delayed                   415       7.9 
##  8 AMWEST  San.Diego     delayed                    65      14.5 
##  9 AMWEST  San.Francisco delayed                   129      28.7 
## 10 AMWEST  Seattle       delayed                    61      23.3
city_onTime <-
  flights_df %>%
  group_by(airline, City, flight_condition) %>%
  summarise(city_flights = sum(Count), 
            .groups = 'drop') %>%
  group_by(airline, City) %>%
  mutate(percentage = round(city_flights / sum(city_flights) * 100,2)) %>%
  filter(flight_condition == "on time")
city_onTime
## # A tibble: 10 × 5
## # Groups:   airline, City [10]
##    airline City          flight_condition city_flights percentage
##    <chr>   <chr>         <chr>                   <int>      <dbl>
##  1 ALASKA  LOS.Angeles   on time                   497       88.9
##  2 ALASKA  Phoenix       on time                   221       94.8
##  3 ALASKA  San.Diego     on time                   212       91.4
##  4 ALASKA  San.Francisco on time                   503       83.1
##  5 ALASKA  Seattle       on time                  1841       85.8
##  6 AMWEST  LOS.Angeles   on time                   694       85.6
##  7 AMWEST  Phoenix       on time                  4840       92.1
##  8 AMWEST  San.Diego     on time                   383       85.5
##  9 AMWEST  San.Francisco on time                   320       71.3
## 10 AMWEST  Seattle       on time                   201       76.7
city_compare <-
  bind_rows(
    city_onTime,
    city_delay
  )
city_compare
## # A tibble: 20 × 5
## # Groups:   airline, City [10]
##    airline City          flight_condition city_flights percentage
##    <chr>   <chr>         <chr>                   <int>      <dbl>
##  1 ALASKA  LOS.Angeles   on time                   497      88.9 
##  2 ALASKA  Phoenix       on time                   221      94.8 
##  3 ALASKA  San.Diego     on time                   212      91.4 
##  4 ALASKA  San.Francisco on time                   503      83.1 
##  5 ALASKA  Seattle       on time                  1841      85.8 
##  6 AMWEST  LOS.Angeles   on time                   694      85.6 
##  7 AMWEST  Phoenix       on time                  4840      92.1 
##  8 AMWEST  San.Diego     on time                   383      85.5 
##  9 AMWEST  San.Francisco on time                   320      71.3 
## 10 AMWEST  Seattle       on time                   201      76.7 
## 11 ALASKA  LOS.Angeles   delayed                    62      11.1 
## 12 ALASKA  Phoenix       delayed                    12       5.15
## 13 ALASKA  San.Diego     delayed                    20       8.62
## 14 ALASKA  San.Francisco delayed                   102      16.9 
## 15 ALASKA  Seattle       delayed                   305      14.2 
## 16 AMWEST  LOS.Angeles   delayed                   117      14.4 
## 17 AMWEST  Phoenix       delayed                   415       7.9 
## 18 AMWEST  San.Diego     delayed                    65      14.5 
## 19 AMWEST  San.Francisco delayed                   129      28.7 
## 20 AMWEST  Seattle       delayed                    61      23.3
ggplot(city_compare, aes(x = City, y = percentage, fill = flight_condition)) +
  geom_col() +
  facet_wrap(~ airline) +
  geom_text(aes(label = round(percentage, 2)),
            position = position_stack(vjust = 0.5),
            size = 4) +
  labs(
    title = "On-time vs Delayed Percentage Across Cities",
    x = "City",
    y = "Percentage (%)",
    fill = "Status"
  ) +
  theme_minimal()

According to the city delay rate comparison chart, ALASKA’s delay rate is generally lower than AMWEST’s. ALASKA’s delay rates across five cities range from 5.15% to 16.86%; AMWEST exhibits higher overall delays, with San Francisco at approximately 28.73% and Seattle at 23.28%. Comparing city-specific delay rates, ALASKA maintains lower delays.

5.Conclusion

When comparing by city, ALASKA had lower delay rates than AMWEST in all five cities; however, when combining all cities, AMWEST’s overall delay rate was lower. This is a typical example of Simpson’s paradox.