For this assignment, we’re given the below data and tasked with reading the file into R as a .CSV, preparing the data and performing a basic analysis comparing delay rates for two airlines:
First we load packages and read the data into R:
library(tidyverse)
library(ggplot2)
library(gridExtra)
url <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/HW6/flights.csv"
raw_data <- read.csv(url, stringsAsFactors=FALSE)
raw_data
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
# remove na row
# rename first two columns
# convert blank airline rows to NA in order to fill
# pivot data on cities
# pivot data on arrival
# create total_flights, delay_pct column
# replace periods in cities with spaces
df <- na.omit(raw_data) |>
rename(Airline = X, Arrival = X.1) |>
mutate(Airline = na_if(Airline, "")) |>
fill(Airline, .direction = "down") |>
pivot_longer(3:7, names_to = "City", values_to = "Total") |>
pivot_wider(names_from = Arrival, values_from = Total) |>
rename(on_time = 'on time') |>
mutate(total_flights = on_time + delayed,
delay_pct = round(delayed/(on_time + delayed) * 100,2),
City = gsub("\\.", " ", City))
df
## # A tibble: 10 × 6
## Airline City on_time delayed total_flights delay_pct
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Alaska Los Angeles 497 62 559 11.1
## 2 Alaska Phoenix 221 12 233 5.15
## 3 Alaska San Diego 212 20 232 8.62
## 4 Alaska San Francisco 503 102 605 16.9
## 5 Alaska Seattle 1841 305 2146 14.2
## 6 AM WEST Los Angeles 694 117 811 14.4
## 7 AM WEST Phoenix 4840 415 5255 7.9
## 8 AM WEST San Diego 383 65 448 14.5
## 9 AM WEST San Francisco 320 129 449 28.7
## 10 AM WEST Seattle 201 61 262 23.3
total_flights <- df |>
select(1, 3, 4) |>
group_by(Airline) |>
summarise(on_time = sum(on_time), delayed = sum(delayed),
total_flights = sum(on_time) + sum(delayed),
delay_pct = round(sum(delayed)/
(sum(on_time) + sum(delayed)) * 100, 2))
total_flights
## # A tibble: 2 × 5
## Airline on_time delayed total_flights delay_pct
## <chr> <int> <int> <int> <dbl>
## 1 Alaska 3274 501 3775 13.3
## 2 AM WEST 6438 787 7225 10.9
# avg delay
b1 <- ggplot(total_flights, aes(x=Airline, y=delay_pct, fill=Airline)) +
geom_bar(stat="identity") +
labs(title = "Average Delay") +
scale_x_discrete(guide = guide_axis(angle = 45))
# total flights
b2 <- ggplot(total_flights, aes(x=Airline, y=total_flights, fill=Airline)) +
geom_bar(stat="identity") +
labs(title = "Total Flights") +
scale_x_discrete(guide = guide_axis(angle = 45))
grid.arrange(b1, b2, ncol=2)
As we can see here, Alaska Air has nearly half as many flights as America West and a 2.5% higher delay percentage, but Alaska Air has lower percentages of delay in every city. Let’s take a closer look
f1 <- ggplot(df, aes(x=Airline, y=delay_pct, fill=Airline)) +
geom_bar(stat="identity") +
labs(title = "Delay Pct") +
facet_wrap(~ City)
f2 <- ggplot(df, aes(x=Airline, y=total_flights, fill=Airline)) +
geom_bar(stat="identity") +
labs(title = "Total Flights") +
facet_wrap(~ City)
grid.arrange(f1, f2)
As we can see here, although Alaska Air has lower delay percentages in each city than America West, the Phoenix market is weighing the overall total, as Phoenix holds each airlines lowest delay rate and America West dramatically outbalances Alaska in terms of total number of flights from Phoenix.