This assignment goes over tidying and transforming data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(dplyr)
data <- read.csv(url("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-607/refs/heads/main/DATA_607_Assignment_4.csv"))
print(data)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
The first major problem is that we have missing airline values in 2
rows. Additionally, the column names either have bad format or do not
make sense (X
for example). So let’s first fix both:
# Fill in the missing values
data[2,1] = "ALASKA"
data[5, 1] = "AM WEST"
# Rename columns
names(data) = c("airline", "on_time_or_delayed", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")
print(data)
## airline on_time_or_delayed Los Angeles Phoenix San Diego San Francisco
## 1 ALASKA on time 497 221 212 503
## 2 ALASKA delayed 62 12 20 102
## 3 NA NA NA NA
## 4 AM WEST on time 694 4840 383 320
## 5 AM WEST delayed 117 415 65 129
## Seattle
## 1 1841
## 2 305
## 3 NA
## 4 201
## 5 61
Now we have a data frame we can work with.
Let’s first turn the individual destination columns into values:
data2 <- data |>
pivot_longer("Los Angeles":"Seattle", names_to = "dest", values_to = "count", values_drop_na = TRUE) |>
arrange(airline, dest)
print(data2)
## # A tibble: 20 × 4
## airline on_time_or_delayed dest count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 ALASKA on time Phoenix 221
## 4 ALASKA delayed Phoenix 12
## 5 ALASKA on time San Diego 212
## 6 ALASKA delayed San Diego 20
## 7 ALASKA on time San Francisco 503
## 8 ALASKA delayed San Francisco 102
## 9 ALASKA on time Seattle 1841
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST delayed Los Angeles 117
## 13 AM WEST on time Phoenix 4840
## 14 AM WEST delayed Phoenix 415
## 15 AM WEST on time San Diego 383
## 16 AM WEST delayed San Diego 65
## 17 AM WEST on time San Francisco 320
## 18 AM WEST delayed San Francisco 129
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
pivot_longer
has made the data frame more clear. The
destinations are actually values, so we now have a new column
dest
representing these destinations. Since there was a
random empty row, I added values_drop_na = TRUE
since this
row was empty purely by mistake in the original dataset.
This data frame is mostly tidy, but the
on_time_or_delayed
column is not a variable; it stores the
names of variables (on time
or delayed
).
To fix this, we’ll now need to use pivot_wider()
,
pivoting on_time_or_delayed
and count
back out
across multiple columns:
data3 <- data2 |>
pivot_wider(names_from = on_time_or_delayed, values_from = count)
colnames(data3)[3] <- "on_time" # Rename `on time` to `on_time`
print(data3)
## # A tibble: 10 × 4
## airline dest on_time delayed
## <chr> <chr> <int> <int>
## 1 ALASKA Los Angeles 497 62
## 2 ALASKA Phoenix 221 12
## 3 ALASKA San Diego 212 20
## 4 ALASKA San Francisco 503 102
## 5 ALASKA Seattle 1841 305
## 6 AM WEST Los Angeles 694 117
## 7 AM WEST Phoenix 4840 415
## 8 AM WEST San Diego 383 65
## 9 AM WEST San Francisco 320 129
## 10 AM WEST Seattle 201 61
The values, variables, and observations are more clear now. The data frame contains 40 values representing 4 variables and 10 observations. The variables are:
airline
with 2 possible values (ALASKA and AM
WEST)dest
with 5 possible values (Los Angeles, Phoenix, San
Diego, San Francisco and Seattle)on time
is a countdelayed
is a countThis form is tidy since there’s one variable in each column, and each
row now represents a flight unit corresponding to an
airline
and destination
combination. Also,
each cell contains a single value.
Tidying the data has also made it easier to perform summary statistics which we’ll explore in the next section.
Rates are more meaningful than counts, so let’s find the
on time
% for each airline and destination.
Let’s examine the on time arrival rates overall across the 2 airlines:
# Total on time arrivals percentage across all destinations
on_time_rate_data3_airline <- data3 |>
group_by(airline) |>
summarise(on_time_rate = sum(on_time) / (sum(on_time) + sum(delayed))) |>
arrange(desc(on_time_rate))
on_time_rate_data3_airline
## # A tibble: 2 × 2
## airline on_time_rate
## <chr> <dbl>
## 1 AM WEST 0.891
## 2 ALASKA 0.867
# Plot the data
on_time_rate_data3_airline |>
group_by(airline) |>
summarise(mean_on_time_rate = mean(on_time_rate)) |>
ggplot(aes(x = airline, y = mean_on_time_rate)) +
geom_col() +
labs(x = "Airline", y = "Overall On Time Arrival Rate")
This shows AM WEST as being more reliable with a higher overall on time arrival percentage of 89%.
Just for reference, let’s see how many flights we’re looking at for each airline:
# Add total amount of flights observed for each airline
data3_total_count <- data3 |>
group_by(airline) |>
summarise(total_flights = sum(on_time) + sum(delayed)) |>
arrange(desc(total_flights))
data3_total_count
## # A tibble: 2 × 2
## airline total_flights
## <chr> <int>
## 1 AM WEST 7225
## 2 ALASKA 3775
# Plot the data
ggplot(data3_total_count, aes(x=airline, y=total_flights)) +
geom_bar(stat='identity', position='dodge')
AM WEST also has around twice the amount of data than ALASKA, which is something to keep in mind since more data is always beneficial.
Now let’s focus on destinations:
# Look at on time rates, grouped by destination and airline
on_time_rate_data3_airline_dest <- data3 |>
group_by(dest, airline) |>
summarise(on_time_rate = sum(on_time) / (sum(on_time) + sum(delayed)))
## `summarise()` has grouped output by 'dest'. You can override using the
## `.groups` argument.
on_time_rate_data3_airline_dest
## # A tibble: 10 × 3
## # Groups: dest [5]
## dest airline on_time_rate
## <chr> <chr> <dbl>
## 1 Los Angeles ALASKA 0.889
## 2 Los Angeles AM WEST 0.856
## 3 Phoenix ALASKA 0.948
## 4 Phoenix AM WEST 0.921
## 5 San Diego ALASKA 0.914
## 6 San Diego AM WEST 0.855
## 7 San Francisco ALASKA 0.831
## 8 San Francisco AM WEST 0.713
## 9 Seattle ALASKA 0.858
## 10 Seattle AM WEST 0.767
# Graph the on time arrival rates grouped by airline
on_time_rate_data3_airline_dest |>
group_by(airline) |>
summarise(mean_on_time_rate = mean(on_time_rate)) |>
ggplot(aes(x = airline, y = mean_on_time_rate)) +
geom_col() +
labs(x = "Airline", y = "Mean On Time Arrival Rate for All Destinations")
# Graph the on time arrival rates grouped by airline and destination
ggplot(on_time_rate_data3_airline_dest, aes(x=dest, y=on_time_rate, fill=airline)) +
geom_bar(stat='identity', position='dodge')
As you can see in the data frame, ALASKA now has the higher on time arrival rate for each destination. We can also see this visually in the first graph where only the average on time arrival rate for both airlines are shown. The second graph, which plots both airline’s on time arrival rates right next to each other for each destination, shows this as well.
This method of looking at the data returned a different result from the first method. To look into this, let’s now see the total amount of flights per destination:
# Add total flights columns
on_time_rate_data3_airline_dest_total <- data3 |>
group_by(dest, airline) |>
summarise(total_flights = sum(on_time) + sum(delayed))
## `summarise()` has grouped output by 'dest'. You can override using the
## `.groups` argument.
on_time_rate_data3_airline_dest_total
## # A tibble: 10 × 3
## # Groups: dest [5]
## dest airline total_flights
## <chr> <chr> <int>
## 1 Los Angeles ALASKA 559
## 2 Los Angeles AM WEST 811
## 3 Phoenix ALASKA 233
## 4 Phoenix AM WEST 5255
## 5 San Diego ALASKA 232
## 6 San Diego AM WEST 448
## 7 San Francisco ALASKA 605
## 8 San Francisco AM WEST 449
## 9 Seattle ALASKA 2146
## 10 Seattle AM WEST 262
# Plot the data
ggplot(on_time_rate_data3_airline_dest_total, aes(x=dest, y=total_flights, fill=airline)) +
geom_bar(stat='identity', position='dodge')
As you can see, most of AM WEST’s flights are coming from Phoenix (AM WEST overall has around twice the amount of flights observed). So ALASKA is the better option when you look at each destination individually, but when you combine all the destinations, it suddenly has worse results than AM WEST. I believe this is showing Simpson’s paradox as AM WEST has way more flights (mainly coming from Phoenix), and the Phoenix on time arrival rate is significantly high. In this example, where the data is coming from (destinations) and the amount of data is influencing the on time arrival rates.
If you removed Phoenix data, ALASKA would win overall:
# Total on time arrivals percentage across all destinations
on_time_rate_data3_airline_filtered <- data3 |>
filter(dest != "Phoenix") |>
group_by(airline) |>
summarise(on_time_rate = sum(on_time) / (sum(on_time) + sum(delayed))) |>
arrange(desc(on_time_rate))
on_time_rate_data3_airline_filtered
## # A tibble: 2 × 2
## airline on_time_rate
## <chr> <dbl>
## 1 ALASKA 0.862
## 2 AM WEST 0.811
# Plot the data
on_time_rate_data3_airline_filtered |>
group_by(airline) |>
summarise(mean_on_time_rate = mean(on_time_rate)) |>
ggplot(aes(x = airline, y = mean_on_time_rate)) +
geom_col() +
labs(x = "Airline", y = "Overall On Time Arrival Rate")
This further emphasizes that Phoenix is heavily influencing comparisons.
In tidy data:
Other findings: