Overview

This assignment covers tidying and transforming data using tidyr and dplyr.

Loading the Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# tidyr and dplyr are already attached as part of the core tidyverse above;
# they are loaded again here only to make the dependencies explicit.
library(tidyr)
library(dplyr)

Read the Data

  1. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
data <- read.csv(url("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-607/refs/heads/main/DATA_607_Assignment_4.csv"))
print(data)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

The first major problem is that the airline name is missing in two rows. Additionally, the column names are either poorly formatted or uninformative (X, for example). So let’s first fix both:

# Fill in the missing airline values
data[2, 1] <- "ALASKA"
data[5, 1] <- "AM WEST"

# Rename columns
names(data) <- c("airline", "on_time_or_delayed", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")
print(data)
##   airline on_time_or_delayed Los Angeles Phoenix San Diego San Francisco
## 1  ALASKA            on time         497     221       212           503
## 2  ALASKA            delayed          62      12        20           102
## 3                                     NA      NA        NA            NA
## 4 AM WEST            on time         694    4840       383           320
## 5 AM WEST            delayed         117     415        65           129
##   Seattle
## 1    1841
## 2     305
## 3      NA
## 4     201
## 5      61

Now we have a data frame we can work with.
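As an aside, a less manual alternative (a sketch; data_filled is a hypothetical name, and it assumes the blank airline cells were read in as empty strings) is to convert the blanks to NA and let tidyr::fill() carry the previous airline name downward instead of assigning the values by hand:

# Sketch: convert blank airline cells to NA, then fill each one with the
# value above it (assumes the blanks were read in as "" by read.csv)
data_filled <- data |>
  mutate(airline = na_if(airline, "")) |>
  fill(airline, .direction = "down")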

Tidy and Transform the Data

Let’s first pivot the individual destination columns into values of a single dest column:

data2 <- data |>
  pivot_longer("Los Angeles":"Seattle", names_to = "dest", values_to = "count", values_drop_na = TRUE) |>
  arrange(airline, dest)
print(data2)
## # A tibble: 20 × 4
##    airline on_time_or_delayed dest          count
##    <chr>   <chr>              <chr>         <int>
##  1 ALASKA  on time            Los Angeles     497
##  2 ALASKA  delayed            Los Angeles      62
##  3 ALASKA  on time            Phoenix         221
##  4 ALASKA  delayed            Phoenix          12
##  5 ALASKA  on time            San Diego       212
##  6 ALASKA  delayed            San Diego        20
##  7 ALASKA  on time            San Francisco   503
##  8 ALASKA  delayed            San Francisco   102
##  9 ALASKA  on time            Seattle        1841
## 10 ALASKA  delayed            Seattle         305
## 11 AM WEST on time            Los Angeles     694
## 12 AM WEST delayed            Los Angeles     117
## 13 AM WEST on time            Phoenix        4840
## 14 AM WEST delayed            Phoenix         415
## 15 AM WEST on time            San Diego       383
## 16 AM WEST delayed            San Diego        65
## 17 AM WEST on time            San Francisco   320
## 18 AM WEST delayed            San Francisco   129
## 19 AM WEST on time            Seattle         201
## 20 AM WEST delayed            Seattle          61

pivot_longer() has made the data frame clearer: the destination names are values, not variables, so they now live in a new dest column. I also added values_drop_na = TRUE to drop the empty row, which was present in the original dataset purely by mistake.
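An alternative (a sketch; data2_alt is a hypothetical name) would be to drop the all-NA row before pivoting, which makes the intent explicit rather than relying on values_drop_na = TRUE:

# Sketch: drop rows where every destination count is missing, then pivot
data2_alt <- data |>
  filter(!if_all("Los Angeles":"Seattle", is.na)) |>
  pivot_longer("Los Angeles":"Seattle", names_to = "dest", values_to = "count") |>
  arrange(airline, dest)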

This data frame is mostly tidy, but the on_time_or_delayed column is not a variable; it stores the names of variables (on time or delayed).

To fix this, we’ll now need to use pivot_wider(), pivoting on_time_or_delayed and count back out across multiple columns:

data3 <- data2 |>
  pivot_wider(names_from = on_time_or_delayed, values_from = count)
colnames(data3)[3] <- "on_time" # Rename `on time` to `on_time`
print(data3)
## # A tibble: 10 × 4
##    airline dest          on_time delayed
##    <chr>   <chr>           <int>   <int>
##  1 ALASKA  Los Angeles       497      62
##  2 ALASKA  Phoenix           221      12
##  3 ALASKA  San Diego         212      20
##  4 ALASKA  San Francisco     503     102
##  5 ALASKA  Seattle          1841     305
##  6 AM WEST Los Angeles       694     117
##  7 AM WEST Phoenix          4840     415
##  8 AM WEST San Diego         383      65
##  9 AM WEST San Francisco     320     129
## 10 AM WEST Seattle           201      61
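As an aside, a less positional way to handle that rename (a sketch; data3_alt is a hypothetical name for an equivalent result) is to refer to the column by name with dplyr::rename() inside the same pipeline:

# Sketch: pivot and rename in one pipeline, referring to the column by name
data3_alt <- data2 |>
  pivot_wider(names_from = on_time_or_delayed, values_from = count) |>
  rename(on_time = `on time`)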

The values, variables, and observations are clearer now. The data frame contains 40 values representing 4 variables and 10 observations. The variables are:

  - airline: the airline (ALASKA or AM WEST)
  - dest: the destination city
  - on_time: the number of on-time arrivals
  - delayed: the number of delayed arrivals

This form is tidy since there’s one variable in each column, each row represents a single observational unit (an airline and destination combination), and each cell contains a single value.
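As a quick illustration (a sketch), this tidy form makes per-row calculations a single step; for example, an on-time rate for each airline and destination combination is one mutate():

# Sketch: each row is one airline/destination pair, so a per-row rate
# is a single mutate() away
data3 |>
  mutate(on_time_rate = on_time / (on_time + delayed))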

Tidying the data has also made it easier to compute summary statistics, which we’ll explore in the next section.

Analysis

  2. Perform analysis to compare the arrival delays for the two airlines.

Rates are more meaningful than raw counts, so let’s find the on-time arrival percentage for each airline and destination.

Let’s first examine the overall on-time arrival rates across the two airlines:

# Total on time arrivals percentage across all destinations
on_time_rate_data3_airline <- data3 |>
  group_by(airline) |>
  summarise(on_time_rate = sum(on_time) / (sum(on_time) + sum(delayed))) |>
  arrange(desc(on_time_rate))
on_time_rate_data3_airline
## # A tibble: 2 × 2
##   airline on_time_rate
##   <chr>          <dbl>
## 1 AM WEST        0.891
## 2 ALASKA         0.867
# Plot the data (one row per airline, so no further summarising is needed)
ggplot(on_time_rate_data3_airline, aes(x = airline, y = on_time_rate)) +
  geom_col() +
  labs(x = "Airline", y = "Overall On Time Arrival Rate")

At first glance, AM WEST looks more reliable, with a higher overall on-time arrival rate of about 89% versus 87% for ALASKA.

Just for reference, let’s see how many flights we’re looking at for each airline:

# Total number of flights observed for each airline
data3_total_count <- data3 |>
  group_by(airline) |>
  summarise(total_flights = sum(on_time) + sum(delayed)) |>
  arrange(desc(total_flights))
data3_total_count
## # A tibble: 2 × 2
##   airline total_flights
##   <chr>           <int>
## 1 AM WEST          7225
## 2 ALASKA           3775
# Plot the data
ggplot(data3_total_count, aes(x = airline, y = total_flights)) +
  geom_col()

AM WEST also has roughly twice as many observed flights as ALASKA, which is worth keeping in mind since its on-time rate is estimated from a larger sample.

Now let’s focus on destinations:

# Look at on time rates, grouped by destination and airline
on_time_rate_data3_airline_dest <- data3 |>
  group_by(dest, airline) |>
  summarise(on_time_rate = sum(on_time) / (sum(on_time) + sum(delayed)))
## `summarise()` has grouped output by 'dest'. You can override using the
## `.groups` argument.
on_time_rate_data3_airline_dest
## # A tibble: 10 × 3
## # Groups:   dest [5]
##    dest          airline on_time_rate
##    <chr>         <chr>          <dbl>
##  1 Los Angeles   ALASKA         0.889
##  2 Los Angeles   AM WEST        0.856
##  3 Phoenix       ALASKA         0.948
##  4 Phoenix       AM WEST        0.921
##  5 San Diego     ALASKA         0.914
##  6 San Diego     AM WEST        0.855
##  7 San Francisco ALASKA         0.831
##  8 San Francisco AM WEST        0.713
##  9 Seattle       ALASKA         0.858
## 10 Seattle       AM WEST        0.767
# Graph the on time arrival rates grouped by airline
on_time_rate_data3_airline_dest |>
  group_by(airline) |>
  summarise(mean_on_time_rate = mean(on_time_rate)) |>
  ggplot(aes(x = airline, y = mean_on_time_rate)) +
  geom_col() +
  labs(x = "Airline", y = "Mean On Time Arrival Rate for All Destinations")

# Graph the on time arrival rates grouped by airline and destination
ggplot(on_time_rate_data3_airline_dest, aes(x = dest, y = on_time_rate, fill = airline)) +
  geom_col(position = "dodge")

As you can see in the data frame, ALASKA now has the higher on-time arrival rate for every destination. We can also see this in the first graph, which shows each airline’s mean on-time arrival rate across destinations. The second graph, which plots the two airlines’ on-time arrival rates side by side for each destination, shows the same pattern.

This method of looking at the data returned a different result from the first method. To investigate, let’s look at the total number of flights per destination:

# Total flights per destination and airline
on_time_rate_data3_airline_dest_total <- data3 |>
  group_by(dest, airline) |>
  summarise(total_flights = sum(on_time) + sum(delayed))
## `summarise()` has grouped output by 'dest'. You can override using the
## `.groups` argument.
on_time_rate_data3_airline_dest_total
## # A tibble: 10 × 3
## # Groups:   dest [5]
##    dest          airline total_flights
##    <chr>         <chr>           <int>
##  1 Los Angeles   ALASKA            559
##  2 Los Angeles   AM WEST           811
##  3 Phoenix       ALASKA            233
##  4 Phoenix       AM WEST          5255
##  5 San Diego     ALASKA            232
##  6 San Diego     AM WEST           448
##  7 San Francisco ALASKA            605
##  8 San Francisco AM WEST           449
##  9 Seattle       ALASKA           2146
## 10 Seattle       AM WEST           262
# Plot the data
ggplot(on_time_rate_data3_airline_dest_total, aes(x = dest, y = total_flights, fill = airline)) +
  geom_col(position = "dodge")

As you can see, most of AM WEST’s flights go to Phoenix (and AM WEST overall has roughly twice as many observed flights). So ALASKA is the better option when you look at each destination individually, yet when all destinations are combined, it appears to perform worse than AM WEST. This is an instance of Simpson’s paradox: AM WEST flies far more flights, mostly to Phoenix, and Phoenix has an unusually high on-time arrival rate, so that single destination dominates AM WEST’s overall rate. In other words, where the flights go and how many there are per destination drive the aggregate on-time rates.
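To make the weighting explicit (a sketch), each airline’s overall on-time rate is just the flight-count-weighted mean of its per-destination rates, while the destination-by-destination comparison corresponds to the unweighted mean:

# Sketch: overall rate = flight-count-weighted mean of per-destination rates,
# which is what lets Phoenix dominate AM WEST's overall figure
data3 |>
  mutate(total = on_time + delayed,
         rate = on_time / total) |>
  group_by(airline) |>
  summarise(weighted_rate = sum(rate * total) / sum(total),
            unweighted_rate = mean(rate))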

If we remove the Phoenix flights, ALASKA comes out ahead overall:

# Overall on time arrival percentage, excluding Phoenix
on_time_rate_data3_airline_filtered <- data3 |>
  filter(dest != "Phoenix") |>
  group_by(airline) |>
  summarise(on_time_rate = sum(on_time) / (sum(on_time) + sum(delayed))) |>
  arrange(desc(on_time_rate))
on_time_rate_data3_airline_filtered
## # A tibble: 2 × 2
##   airline on_time_rate
##   <chr>          <dbl>
## 1 ALASKA         0.862
## 2 AM WEST        0.811
# Plot the data (one row per airline, so no further summarising is needed)
ggplot(on_time_rate_data3_airline_filtered, aes(x = airline, y = on_time_rate)) +
  geom_col() +
  labs(x = "Airline", y = "On Time Arrival Rate (Excluding Phoenix)")

This further emphasizes how heavily the Phoenix flights influence the overall comparison.

Conclusions

In tidy data:

  - Each variable forms its own column.
  - Each observation forms its own row.
  - Each value has its own cell.

Other findings:

  - AM WEST has the higher overall on-time arrival rate (about 89% vs. 87% for ALASKA), but ALASKA has the higher on-time rate at every individual destination.
  - This reversal is an example of Simpson’s paradox: AM WEST’s flights are heavily concentrated in Phoenix, which has an unusually high on-time rate, so Phoenix dominates AM WEST’s overall figure.
  - Excluding Phoenix, ALASKA has the higher overall on-time arrival rate (about 86% vs. 81%).