Tidying and Transforming Data

Install Packages

library(glue)
## Warning: package 'glue' was built under R version 4.4.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

Recreate Given Data

Data was recreated directly in excel. Then it was read into R and tidied

flight_data <- read.csv("flight_data.csv")


# remove third row 
flight_data <- flight_data[-3,]

# Adjust column names 
colnames(flight_data)[c(1,2)] <- c("Airlines", "Status")

# Fill in missing data
flight_data$Airlines[c(2,4)] <- c('Alaska', 'AM West')

Data Analysis

Compare respective percentages for on time and delayed flights for the two airlines across all 5 cities

ala_total <- sum(flight_data[c(1,2),3:7])
am_total <- sum(flight_data[c(3,4),3:7])

#repetitive code
ala_avg_delay <- (sum(flight_data[2,3:7]) / ala_total * 100) %>%
  round(2) %>%
  paste("%")
am_avg_delay <- (sum(flight_data[4,3:7]) / am_total * 100) %>%
  round(2) %>%
  paste("%")

# On time averages
ala_avg_ont <- (sum(flight_data[1,3:7]) / ala_total * 100) %>%
  round(2) %>%
  paste("%")
am_avg_ont <- (sum(flight_data[3,3:7]) / am_total* 100) %>%
  round(2) %>%
  paste("%")

com_percents <- data.frame(Alaska = c(ala_avg_delay, ala_avg_ont),
                           AM.West = c(am_avg_delay, am_avg_ont))

rownames(com_percents) <- c("Delayed", "On Time")

com_percents
##          Alaska AM.West
## Delayed 13.27 % 10.89 %
## On Time 86.73 % 89.11 %

On average, Alaska has more delays and AM West has more on time flights

Comparing percentages by each city

  ala_city_del <- (as.numeric(flight_data[2,3:7]) / ala_total * 100) %>%
  round(2) %>%
  paste("%")
ala_city_ont <- (as.numeric(flight_data[1,3:7]) / ala_total * 100) %>%
  round(2) %>%
  paste("%")

am_city_del <- (as.numeric(flight_data[4,3:7]) / am_total * 100) %>%
  round(2) %>%
  paste("%")
am_city_ont <- (as.numeric(flight_data[3,3:7]) / am_total * 100) %>%
  round(2) %>%
  paste("%")

city_percents <- data.frame(Alaska.On.Time = ala_city_ont,
                            Alaska.Delayed = ala_city_del,
                            AM.West.On.Time = am_city_ont,
                            AM.West.Delayed = am_city_del)

rownames(city_percents) <- c("Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seatle")

city_percents
##               Alaska.On.Time Alaska.Delayed AM.West.On.Time AM.West.Delayed
## Los Angeles          13.17 %         1.64 %          9.61 %          1.62 %
## Phoenix               5.85 %         0.32 %         66.99 %          5.74 %
## San Diego             5.62 %         0.53 %           5.3 %           0.9 %
## San Francisco        13.32 %          2.7 %          4.43 %          1.79 %
## Seatle               48.77 %         8.08 %          2.78 %          0.84 %

We can see that out of the on time flights for Alaska came from Seattle, which makes sense given their geographic proximity. The On time flights for AM West came mostly from Phoenix. Compared to their overall flight percentages, this does show in greater detail which specific cities contribute the most to the airline’s performance. If Alaksa is only on time for Seattle, then you would not want to rely on them for any other city, despite their overall on time percentage being 86%.