library(tidyverse)
library(dplyr)
library(ggplot2)607 Assignment 5A Dylan Gold
607 Assignment 5A Dylan Gold
Approach
In this assignment we need to recreate a table given to us from an image, then tidy that data set.
When tidying a data set there are several things we have to do. Specifically we need to recreate the data in a way such that each column is a variable, row is a observation and each cell is a value. Then we need to count data and some sort of analysis with this, compare percentages and other things shown in the rubric.
My plan is to recreate the data in csv format. I will just do it by hand because it is only 20 cell values and afew row and column names.
I will add an extra row name, which is implied in the data but it would be easier to work with if it already had the row value.
Bring it into R and use tidyr and dplyr with to make it molten/long and go from there to perform the analysis. Functions like pivot_longer and pivot_wider will help.
Codebase
Setup libraries
First I manually make a csv from the image, then I import it in after pushing it to github.
url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Assignment-4/refs/heads/main/5A_Data.csv"
df <- read_csv(
file = url,
col_names = TRUE
)
df# A tibble: 6 × 7
...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALASKA on time 497 221 212 503 1841
2 <NA> delayed 62 12 20 102 305
3 <NA> <NA> NA NA NA NA NA
4 AMWEST on time 694 4840 383 320 201
5 <NA> delayed 117 415 65 129 61
6 <NA> <NA> NA NA NA NA NA
Tidying
We can start to make it tidy. We can start by fixing the first two column names. These were empty in the csv.
We can then create the ALASKA and AMWEST value for the 2nd and 4th column.
Then drop_na to get rid of the empty rows.
airport_arrivals <- df %>% rename(c(airline = "...1", arrival_status = "...2"))
airport_arrivals[2,1] <- "ALASKA" #Place alaska value in 2nd row first column
airport_arrivals[5,1] <- "AMWEST" #Place amwest value
airport_arrivals <- airport_arrivals %>% drop_na()
airport_arrivals# A tibble: 4 × 7
airline arrival_status `Los Angeles` Phoenix `San Diego` `San Francisco`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ALASKA on time 497 221 212 503
2 ALASKA delayed 62 12 20 102
3 AMWEST on time 694 4840 383 320
4 AMWEST delayed 117 415 65 129
# ℹ 1 more variable: Seattle <dbl>
Now we can make it tidy by combining all the columns into row values. We need to have a row specifying the destination, as well as a column containing the frequency displayed in the cells in the data frame above.
airport_arrivals <- airport_arrivals %>%
pivot_longer(
cols = !c(airline:arrival_status),
names_to = "destination",
values_to = "frequency"
)
head(airport_arrivals, 8)# A tibble: 8 × 4
airline arrival_status destination frequency
<chr> <chr> <chr> <dbl>
1 ALASKA on time Los Angeles 497
2 ALASKA on time Phoenix 221
3 ALASKA on time San Diego 212
4 ALASKA on time San Francisco 503
5 ALASKA on time Seattle 1841
6 ALASKA delayed Los Angeles 62
7 ALASKA delayed Phoenix 12
8 ALASKA delayed San Diego 20
Analysis
The data is now in tidy form. We can now do some analysis based on the count/frequency. We can start by getting the total arrivals.
We can mutate a new row after group_by the airline and destinations.
airport_arrivals<- airport_arrivals %>%
group_by(airline, destination) %>%
mutate(city_total = sum(frequency))
head(airport_arrivals, 8)# A tibble: 8 × 5
# Groups: airline, destination [5]
airline arrival_status destination frequency city_total
<chr> <chr> <chr> <dbl> <dbl>
1 ALASKA on time Los Angeles 497 559
2 ALASKA on time Phoenix 221 233
3 ALASKA on time San Diego 212 232
4 ALASKA on time San Francisco 503 605
5 ALASKA on time Seattle 1841 2146
6 ALASKA delayed Los Angeles 62 559
7 ALASKA delayed Phoenix 12 233
8 ALASKA delayed San Diego 20 232
We can also get the total across all 5 cities as a column.
airport_arrivals<- airport_arrivals %>%
group_by(airline) %>%
mutate( airline_total = sum(frequency))
head(airport_arrivals, 8)# A tibble: 8 × 6
# Groups: airline [1]
airline arrival_status destination frequency city_total airline_total
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 ALASKA on time Los Angeles 497 559 3775
2 ALASKA on time Phoenix 221 233 3775
3 ALASKA on time San Diego 212 232 3775
4 ALASKA on time San Francisco 503 605 3775
5 ALASKA on time Seattle 1841 2146 3775
6 ALASKA delayed Los Angeles 62 559 3775
7 ALASKA delayed Phoenix 12 233 3775
8 ALASKA delayed San Diego 20 232 3775
Lets look at the total on-time and total delayed for each airline
airport_total_arrival <- airport_arrivals %>%
group_by(airline, arrival_status) %>%
summarise(n = sum(frequency), .groups = "keep")
airport_total_arrival# A tibble: 4 × 3
# Groups: airline, arrival_status [4]
airline arrival_status n
<chr> <chr> <dbl>
1 ALASKA delayed 501
2 ALASKA on time 3274
3 AMWEST delayed 787
4 AMWEST on time 6438
We can see that Alaska has less overall arrivals, AMWEST has greater total delayed and on time arrivals.
Looking at the percentages would be more useful.
airport_arrivals_percentages <- airport_arrivals %>%
mutate(
city_percentage = frequency/city_total,
airline_percentage = frequency/airline_total
)
head(airport_arrivals_percentages, 8)# A tibble: 8 × 8
# Groups: airline [1]
airline arrival_status destination frequency city_total airline_total
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 ALASKA on time Los Angeles 497 559 3775
2 ALASKA on time Phoenix 221 233 3775
3 ALASKA on time San Diego 212 232 3775
4 ALASKA on time San Francisco 503 605 3775
5 ALASKA on time Seattle 1841 2146 3775
6 ALASKA delayed Los Angeles 62 559 3775
7 ALASKA delayed Phoenix 12 233 3775
8 ALASKA delayed San Diego 20 232 3775
# ℹ 2 more variables: city_percentage <dbl>, airline_percentage <dbl>
We can create graphics for these values. First lets do for the airline. Delayed and on time are complementary, we can just look at the delays
airline_delay_percentages <- airport_arrivals_percentages %>%
filter(arrival_status == "delayed") %>%
select(c(airline, airline_percentage)) %>% #Get the columns we need
group_by(airline) %>%
summarise(delay_percentage = sum(airline_percentage))
airline_delay_percentages# A tibble: 2 × 2
airline delay_percentage
<chr> <dbl>
1 ALASKA 0.133
2 AMWEST 0.109
From the table we can see that ALASKA airline has a greater percentage of delayed arrivals by approximately 3%
ggplot(airline_delay_percentages, aes(x = airline, y = delay_percentage)) +
geom_bar(stat="identity", fill="steelblue") +
geom_text(aes(label=delay_percentage), vjust=1.6, color="white", size=3.5) +
# I got this formating from https://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization
labs(title="ALASKA vs AMWEST Delay Percentages")Now lets do it across the cities.
city_delay_percentages <- airport_arrivals_percentages %>%
filter(arrival_status == "delayed") %>%
select(airline, destination, delay_percentage = city_percentage)
head(city_delay_percentages, 8)# A tibble: 8 × 3
# Groups: airline [2]
airline destination delay_percentage
<chr> <chr> <dbl>
1 ALASKA Los Angeles 0.111
2 ALASKA Phoenix 0.0515
3 ALASKA San Diego 0.0862
4 ALASKA San Francisco 0.169
5 ALASKA Seattle 0.142
6 AMWEST Los Angeles 0.144
7 AMWEST Phoenix 0.0790
8 AMWEST San Diego 0.145
Create a plot. We can see that AMWEST has a higher delay rate when comparing each of the cities indivudually.
This is a discrepancy when compared to what we saw in the total airline percentage of delays.
This is an example of Simpsons Paradox. While individually AMWEST has a higher delay in each city, due to the counts of the arrivals in each city.
ggplot(city_delay_percentages, aes(fill = airline, x = destination, y = delay_percentage)) +
geom_bar(position="dodge", stat="identity") +
geom_text(aes(label= round(delay_percentage, 4)), vjust=1.5, color="black", size=3.5) +
# I got this formating from https://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization
labs(title="ALASKA vs AMWEST Delay Percentages")Conclusion
In this assignment we were given an image to recreate a csv. We then tidied the csv by fixing null values, as well as dropping fully empty columns. We made the csv longer by using pivot longer to convert the column data into more rows. After the data was tidy we looked at the total count of each of the necessary grouped rows, like across cities and the total. We then looked at the percentages across cities and across airlines to compare each of the airlines. We saw an example of simpsons paradox where when looking at each city’s percentage comparison we saw AMWEST have more delays. When looking at the total we say ALASKA have more delays as a percentage. This was due to how each of the cities had different total arrivals.
One way we could further improve this assignment could be to figure out how significant the difference in percentages are, or compare more airlines other than Alaska and Amwest