This assignment will use the following packages:
library(readr)
library(RCurl)
library(stringr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
The goal of this assignment is to utilize untidy airport data that were on time or delayed from 5 different locations (cities) from two different airlines. The following data will also be uploaded onto my github page.
To get the data, we will extract the data from the github file from here.
raw_file <- getURL("https://raw.githubusercontent.com/spacerome/Data607_Assignment_4/refs/heads/main/airline_data.csv")
csvfile <- data.frame(read.csv(text=raw_file, sep= "\t", stringsAsFactors = FALSE, check.names = FALSE))
csvfile
## Var.1 Var.2 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
Looking at the data the raw data gives X
and
X.1
for the first columns which should be
airline
and status
which should be fixed. The
third and fifth rows have an empty value which will be fixed in the next
section.
colnames(csvfile)[1] <- "airline"
colnames(csvfile)[2] <- "status"
csvfile
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
After fixing the columns, we will now fix the empty rows from rows 3
and 5, to have ALASKA
and AMWEST
to continue
prepping the data for analysis on the next section. The cities that have
a space (e.g. Los Angeles) were outputed as Los.Angeles
which will be fixed by utilizing gsub
to fix it.
csvfile[csvfile==" "] <- NA
colnames(csvfile) <- gsub("\\.", " ", colnames(csvfile))
csvfile <- fill(csvfile, airline)
csvfile
## airline status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
After cleaning the data then we utilize pivot_longer
to
prep csvfile
for further data analysis.
csvfile <- pivot_longer(csvfile, cols = !c(airline, status),
names_to = "city", values_to = "flights")
head(csvfile)
## # A tibble: 6 × 4
## airline status city flights
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
After cleaning the data we will have three variables utilized for
analysis: airlineData
, cityData
, and
cityairlineData
.
airlineData <- csvfile %>%
group_by(airline) %>% summarise(flightFrequency = sum(flights), flightDelay = sum(ifelse(status =='delayed',flights,0)))
airlineData <- airlineData %>%
mutate(ontimeFlights = flightFrequency - flightDelay,
delayRate = flightDelay/ontimeFlights,
ontimeRate = ontimeFlights/flightFrequency) %>% arrange(delayRate)
airlineData
## # A tibble: 2 × 6
## airline flightFrequency flightDelay ontimeFlights delayRate ontimeRate
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AM WEST 7225 787 6438 0.122 0.891
## 2 ALASKA 3775 501 3274 0.153 0.867
airlineData %>%
mutate(
delayRate = scales::percent(delayRate, accuracy = 0.01),
ontimeRate = scales::percent(ontimeRate, accuracy = 0.01)
) %>%
select(airline, delayRate, ontimeRate) %>%
rename(
Airline = airline,
`Delay Rate` = delayRate,
`Ontime Rate` = ontimeRate
) %>%
kable(format = "html", caption = "Delay Rate and On-Time Rate by Airline") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), position = "center") %>%
row_spec(0, bold = TRUE, color = "white", background = "#4CAF50")
Airline | Delay Rate | Ontime Rate |
---|---|---|
AM WEST | 12.22% | 89.11% |
ALASKA | 15.30% | 86.73% |
From analyzing the airlineData
, AM WEST
has
the highest ontime rate of 89.1%
, and lowest delay rate of
12.2%
, whereas ALASKA
has a higher delay rate
of 15.3%
, but lower ontime rate of 86.7%
.
cityData <- csvfile %>%
group_by(city) %>% summarise(flightFrequency = sum(flights), flightDelay = sum(ifelse(status =='delayed',flights,0)))
cityData <- cityData %>%
mutate(ontimeFlights = flightFrequency - flightDelay,
delayRate = flightDelay/ontimeFlights,
ontimeRate = ontimeFlights/flightFrequency) %>% arrange(delayRate)
cityData
## # A tibble: 5 × 6
## city flightFrequency flightDelay ontimeFlights delayRate ontimeRate
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Phoenix 5488 427 5061 0.0844 0.922
## 2 San Diego 680 85 595 0.143 0.875
## 3 Los Angeles 1370 179 1191 0.150 0.869
## 4 Seattle 2408 366 2042 0.179 0.848
## 5 San Francisco 1054 231 823 0.281 0.781
From analyzing the cityData
, Phoenix has the highest
ontime Rate of 92.2%
, and lowest delay rate of
8.44%
. San Francisco has the highest delay rate of
28.1%
and its ontime rate is 78.1%
.
cityairlineData <- csvfile %>%
group_by(city, airline) %>%
summarise(flightFrequency = sum(flights, na.rm = TRUE),
flightDelay = sum(ifelse(status == 'delayed', flights, 0), na.rm = TRUE)) %>%
mutate(ontimeFlights = flightFrequency - flightDelay,
delayRate = flightDelay / flightFrequency, ontimeRate = ontimeFlights/flightFrequency) %>%
ungroup()
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
cityairlineData
## # A tibble: 10 × 7
## city airline flightFrequency flightDelay ontimeFlights delayRate ontimeRate
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Los A… ALASKA 559 62 497 0.111 0.889
## 2 Los A… AM WEST 811 117 694 0.144 0.856
## 3 Phoen… ALASKA 233 12 221 0.0515 0.948
## 4 Phoen… AM WEST 5255 415 4840 0.0790 0.921
## 5 San D… ALASKA 232 20 212 0.0862 0.914
## 6 San D… AM WEST 448 65 383 0.145 0.855
## 7 San F… ALASKA 605 102 503 0.169 0.831
## 8 San F… AM WEST 449 129 320 0.287 0.713
## 9 Seatt… ALASKA 2146 305 1841 0.142 0.858
## 10 Seatt… AM WEST 262 61 201 0.233 0.767
cityairlineData %>%
mutate(
delayRate = scales::percent(delayRate, accuracy = 0.01),
ontimeRate = scales::percent(ontimeRate, accuracy = 0.01)
) %>%
select(city, airline, delayRate, ontimeRate) %>%
rename(
City = city,
Airline = airline,
`Delay Rate` = delayRate,
`Ontime Rate` = ontimeRate
) %>%
kable(format = "html", caption = "Delay Rate and On-Time Rate by City and Airline") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), position = "center") %>%
row_spec(0, bold = TRUE, color = "white", background = "#4CAF50")
City | Airline | Delay Rate | Ontime Rate |
---|---|---|---|
Los Angeles | ALASKA | 11.09% | 88.91% |
Los Angeles | AM WEST | 14.43% | 85.57% |
Phoenix | ALASKA | 5.15% | 94.85% |
Phoenix | AM WEST | 7.90% | 92.10% |
San Diego | ALASKA | 8.62% | 91.38% |
San Diego | AM WEST | 14.51% | 85.49% |
San Francisco | ALASKA | 16.86% | 83.14% |
San Francisco | AM WEST | 28.73% | 71.27% |
Seattle | ALASKA | 14.21% | 85.79% |
Seattle | AM WEST | 23.28% | 76.72% |
From this data, AM WEST
had a higher delay rate than
ALASKA
within 4 of the 5 cities, but ALASKA
had the highest delay rate at Seattle and had lesser delay rate on the
other locations, but the delay rate for AMWEST
is higher
than ALASKA
on all 5 cities. The on time rate was the
highest at Phoenix for both airlines which indicates it was most likely
the easiest location for both airlines.
The following code blocks below will be visualizations from the data.
ggplot(airlineData, aes(x = reorder(airline, -flightFrequency))) +
geom_bar(aes(y = ontimeRate, fill = "On-Time"), stat = "identity") +
geom_bar(aes(y = -delayRate, fill = "Delayed"), stat = "identity") +
labs(title = "On-Time and Delayed Rates Overall by Airline", x = "Airline", y = "Rate", fill = "Status") +
scale_fill_manual(values = c("On-Time" = "skyblue", "Delayed" = "salmon")) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
Though AM WEST
has the highest frequencies, it appears
to have the most delayed flights as opposed to ALASKA
.
ggplot(cityairlineData, aes(x = airline, y = delayRate, fill = airline)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Delay Rate by Airline and City",
x = "Airline", y = "Delay Rate") +
facet_wrap(~ city, scales = "free") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Airline"))
This bar plot shows that AM WEST
has a higher delay rate
than ALASKA
, and reinforces the statement about Phoenix
having the highest ontime rate, but lowest ontime rate, whereas San
Francisco has the highest delay rate for both airlines.
ggplot(cityairlineData, aes(x = airline, y = ontimeRate, fill = airline)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "On Time Rate by Airline and City",
x = "Airline", y = "On Time Rate") +
facet_wrap(~ city, scales = "free") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Airline"))
Over here this bar plot shows a pretty tight on-time rate between all
five cities, with ALASKA
being slightly higher than
AM WEST
.
Outputs a csv file to replicate anaylsis:
write.csv(csvfile,"airline_data.csv")
Overall, from the data used for this assignment, AM WEST
has a higher delay rate in all cities, but overall,
ALASKA
has a higher delay rate overall than
AM WEST
.
I believe if I had more data, such as more cities for each airline, or maybe more airlines, we can get a general understanding on what may be causing the delays for each city. I believe there is also an r library for flights that can be utilized to get a general understanding on why there are flights not arriving on time for San Francisco. I do believe if we include factors such as air traffic, weather, and other uncontrollable factors, this will give us a general understanding on why this is occurring.