607 Assignment 5A Dylan Gold

Approach

In this assignment we need to recreate a table given to us from an image, then tidy that data set.
When tidying a data set there are several things we have to do. Specifically we need to recreate the data in a way such that each column is a variable, row is a observation and each cell is a value. Then we need to count data and some sort of analysis with this, compare percentages and other things shown in the rubric.

My plan is to recreate the data in csv format. I will just do it by hand because it is only 20 cell values and afew row and column names.
I will add an extra row name, which is implied in the data but it would be easier to work with if it already had the row value.
Bring it into R and use tidyr and dplyr with to make it molten/long and go from there to perform the analysis. Functions like pivot_longer and pivot_wider will help.

Codebase

Setup libraries

library(tidyverse)
library(dplyr)
library(ggplot2)

First I manually make a csv from the image, then I import it in after pushing it to github.

url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Assignment-4/refs/heads/main/5A_Data.csv"

df <- read_csv(
  file = url,
  col_names = TRUE
)

df

# A tibble: 6 × 7
  ...1   ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
  <chr>  <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
1 ALASKA on time           497     221         212             503    1841
2 <NA>   delayed            62      12          20             102     305
3 <NA>   <NA>               NA      NA          NA              NA      NA
4 AMWEST on time           694    4840         383             320     201
5 <NA>   delayed           117     415          65             129      61
6 <NA>   <NA>               NA      NA          NA              NA      NA

Tidying

We can start to make it tidy. We can start by fixing the first two column names. These were empty in the csv.
We can then create the ALASKA and AMWEST value for the 2nd and 4th column.
Then drop_na to get rid of the empty rows.

airport_arrivals <- df %>% rename(c(airline = "...1", arrival_status = "...2"))
airport_arrivals[2,1] <- "ALASKA" #Place alaska value in 2nd row first column
airport_arrivals[5,1] <- "AMWEST" #Place amwest value 
airport_arrivals <- airport_arrivals %>% drop_na()
airport_arrivals

# A tibble: 4 × 7
  airline arrival_status `Los Angeles` Phoenix `San Diego` `San Francisco`
  <chr>   <chr>                  <dbl>   <dbl>       <dbl>           <dbl>
1 ALASKA  on time                  497     221         212             503
2 ALASKA  delayed                   62      12          20             102
3 AMWEST  on time                  694    4840         383             320
4 AMWEST  delayed                  117     415          65             129
# ℹ 1 more variable: Seattle <dbl>

Now we can make it tidy by combining all the columns into row values. We need to have a row specifying the destination, as well as a column containing the frequency displayed in the cells in the data frame above.

airport_arrivals <- airport_arrivals %>%
  pivot_longer(
    cols = !c(airline:arrival_status),
    names_to = "destination",
    values_to = "frequency"
    )

head(airport_arrivals, 8)

# A tibble: 8 × 4
  airline arrival_status destination   frequency
  <chr>   <chr>          <chr>             <dbl>
1 ALASKA  on time        Los Angeles         497
2 ALASKA  on time        Phoenix             221
3 ALASKA  on time        San Diego           212
4 ALASKA  on time        San Francisco       503
5 ALASKA  on time        Seattle            1841
6 ALASKA  delayed        Los Angeles          62
7 ALASKA  delayed        Phoenix              12
8 ALASKA  delayed        San Diego            20

Analysis

The data is now in tidy form. We can now do some analysis based on the count/frequency. We can start by getting the total arrivals.
We can mutate a new row after group_by the airline and destinations.

airport_arrivals<- airport_arrivals %>%
  group_by(airline, destination) %>%
    mutate(city_total = sum(frequency))
head(airport_arrivals, 8)

# A tibble: 8 × 5
# Groups:   airline, destination [5]
  airline arrival_status destination   frequency city_total
  <chr>   <chr>          <chr>             <dbl>      <dbl>
1 ALASKA  on time        Los Angeles         497        559
2 ALASKA  on time        Phoenix             221        233
3 ALASKA  on time        San Diego           212        232
4 ALASKA  on time        San Francisco       503        605
5 ALASKA  on time        Seattle            1841       2146
6 ALASKA  delayed        Los Angeles          62        559
7 ALASKA  delayed        Phoenix              12        233
8 ALASKA  delayed        San Diego            20        232

We can also get the total across all 5 cities as a column.

airport_arrivals<- airport_arrivals %>%
  group_by(airline) %>%
    mutate( airline_total = sum(frequency))
head(airport_arrivals, 8)

# A tibble: 8 × 6
# Groups:   airline [1]
  airline arrival_status destination   frequency city_total airline_total
  <chr>   <chr>          <chr>             <dbl>      <dbl>         <dbl>
1 ALASKA  on time        Los Angeles         497        559          3775
2 ALASKA  on time        Phoenix             221        233          3775
3 ALASKA  on time        San Diego           212        232          3775
4 ALASKA  on time        San Francisco       503        605          3775
5 ALASKA  on time        Seattle            1841       2146          3775
6 ALASKA  delayed        Los Angeles          62        559          3775
7 ALASKA  delayed        Phoenix              12        233          3775
8 ALASKA  delayed        San Diego            20        232          3775

Lets look at the total on-time and total delayed for each airline

airport_total_arrival <- airport_arrivals %>%
  group_by(airline, arrival_status) %>%
    summarise(n = sum(frequency), .groups = "keep")
airport_total_arrival

# A tibble: 4 × 3
# Groups:   airline, arrival_status [4]
  airline arrival_status     n
  <chr>   <chr>          <dbl>
1 ALASKA  delayed          501
2 ALASKA  on time         3274
3 AMWEST  delayed          787
4 AMWEST  on time         6438

We can see that Alaska has less overall arrivals, AMWEST has greater total delayed and on time arrivals.

Looking at the percentages would be more useful.

airport_arrivals_percentages <- airport_arrivals %>%
  mutate(
    city_percentage = frequency/city_total,
    airline_percentage = frequency/airline_total
  )
head(airport_arrivals_percentages, 8)

# A tibble: 8 × 8
# Groups:   airline [1]
  airline arrival_status destination   frequency city_total airline_total
  <chr>   <chr>          <chr>             <dbl>      <dbl>         <dbl>
1 ALASKA  on time        Los Angeles         497        559          3775
2 ALASKA  on time        Phoenix             221        233          3775
3 ALASKA  on time        San Diego           212        232          3775
4 ALASKA  on time        San Francisco       503        605          3775
5 ALASKA  on time        Seattle            1841       2146          3775
6 ALASKA  delayed        Los Angeles          62        559          3775
7 ALASKA  delayed        Phoenix              12        233          3775
8 ALASKA  delayed        San Diego            20        232          3775
# ℹ 2 more variables: city_percentage <dbl>, airline_percentage <dbl>

We can create graphics for these values. First lets do for the airline. Delayed and on time are complementary, we can just look at the delays

airline_delay_percentages <- airport_arrivals_percentages %>%
  filter(arrival_status == "delayed") %>%
    select(c(airline, airline_percentage)) %>% #Get the columns we need
      group_by(airline) %>%
        summarise(delay_percentage = sum(airline_percentage))

airline_delay_percentages

# A tibble: 2 × 2
  airline delay_percentage
  <chr>              <dbl>
1 ALASKA             0.133
2 AMWEST             0.109

From the table we can see that ALASKA airline has a greater percentage of delayed arrivals by approximately 3%

ggplot(airline_delay_percentages, aes(x = airline, y = delay_percentage)) +
  geom_bar(stat="identity", fill="steelblue") +
  geom_text(aes(label=delay_percentage), vjust=1.6, color="white", size=3.5) + 
# I got this formating from https://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization
  labs(title="ALASKA vs AMWEST Delay Percentages")

Now lets do it across the cities.

city_delay_percentages <- airport_arrivals_percentages %>%
  filter(arrival_status == "delayed") %>%
    select(airline, destination, delay_percentage = city_percentage)

head(city_delay_percentages, 8)

# A tibble: 8 × 3
# Groups:   airline [2]
  airline destination   delay_percentage
  <chr>   <chr>                    <dbl>
1 ALASKA  Los Angeles             0.111 
2 ALASKA  Phoenix                 0.0515
3 ALASKA  San Diego               0.0862
4 ALASKA  San Francisco           0.169 
5 ALASKA  Seattle                 0.142 
6 AMWEST  Los Angeles             0.144 
7 AMWEST  Phoenix                 0.0790
8 AMWEST  San Diego               0.145

Create a plot. We can see that AMWEST has a higher delay rate when comparing each of the cities indivudually.
This is a discrepancy when compared to what we saw in the total airline percentage of delays.
This is an example of Simpsons Paradox. While individually AMWEST has a higher delay in each city, due to the counts of the arrivals in each city.

ggplot(city_delay_percentages, aes(fill = airline, x = destination, y = delay_percentage)) +
  geom_bar(position="dodge", stat="identity") +
  geom_text(aes(label= round(delay_percentage, 4)), vjust=1.5, color="black", size=3.5) + 
# I got this formating from https://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization
  labs(title="ALASKA vs AMWEST Delay Percentages")

Conclusion

In this assignment we were given an image to recreate a csv. We then tidied the csv by fixing null values, as well as dropping fully empty columns. We made the csv longer by using pivot longer to convert the column data into more rows. After the data was tidy we looked at the total count of each of the necessary grouped rows, like across cities and the total. We then looked at the percentages across cities and across airlines to compare each of the airlines. We saw an example of simpsons paradox where when looking at each city’s percentage comparison we saw AMWEST have more delays. When looking at the total we say ALASKA have more delays as a percentage. This was due to how each of the cities had different total arrivals.
One way we could further improve this assignment could be to figure out how significant the difference in percentages are, or compare more airlines other than Alaska and Amwest