The objective of this assignment is to read in a csv file that has
the flight delays for two airlines in a wide format. Then to use
tidyr
and dplyr
to tidy and transform the
data. Lastly, to perform an analysis to compare the arrival delays for
the two airlines.
First, load required packages.
library(tidyverse)
library(kableExtra)
Next, read in the csv file and take a look.
# df <- read.csv(choose.files())
df <-read.csv("https://raw.githubusercontent.com/klgriffen96/spring23_data607_hw5/main/flights.csv")
kable(head(df)) |>
kable_styling("striped")
X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1,841 |
delayed | 62 | 12 | 20 | 102 | 305 | |
NA | NA | NA | ||||
AMWEST | on time | 694 | 4,840 | 383 | 320 | 201 |
delayed | 117 | 415 | 65 | 129 | 61 |
Rename headers.
df <- df |> rename(Airline = colnames(df)[1], Status = colnames(df)[2])
Now, remove unnecessary blanks, fill in airlines, and ensure all are numeric.
# Take only the rows which have values
df <- subset(df, nchar(Status) > 0)
# Fill in airline, even when it isn't there
for (i in 1:dim(df)[1]){
temp <- nchar(df$Airline[i])
if (temp > 0) {
airline <- df$Airline[i]
} else {
df$Airline[i] <- airline
}
}
# Make sure all values are numeric
n <- colnames(df)
for (i in 3:length(n)){
df[[n[i]]] <- as.numeric(str_remove(df[[n[i]]], ","))
}
kable(head(df)) |>
kable_styling("striped")
Airline | Status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle | |
---|---|---|---|---|---|---|---|
1 | ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
2 | ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
4 | AMWEST | on time | 694 | 4840 | 383 | 320 | 201 |
5 | AMWEST | delayed | 117 | 415 | 65 | 129 | 61 |
Perform the pivot to go from wide format to long format so that each Airline, Airport combination is a row, then back to a wide format so that the on time and delay are their own columns.
df_long <-df |>
pivot_longer(cols = Los.Angeles:Seattle,
names_to = "Airport",
values_to = "count")
df_wide <- df_long |>
pivot_wider(names_from = Status,
values_from = count
)
kable(head(df_wide)) |>
kable_styling("striped")
Airline | Airport | on time | delayed |
---|---|---|---|
ALASKA | Los.Angeles | 497 | 62 |
ALASKA | Phoenix | 221 | 12 |
ALASKA | San.Diego | 212 | 20 |
ALASKA | San.Francisco | 503 | 102 |
ALASKA | Seattle | 1841 | 305 |
AMWEST | Los.Angeles | 694 | 117 |
To analyze the arrival delays, do a basic comparison of the overall percent of delayed flights were for each airline, then each airline by each location.
percent_delayed <- df_wide |>
group_by(Airline, Airport) |>
summarize(
Percent_Delayed = round(`delayed`*100/(`delayed` + `on time`))
, .groups = "drop") |>
arrange(desc(Percent_Delayed))
overall_delayed <- df_wide |>
group_by(Airline) |>
summarise(
Overall_Percent_Delayed = round(sum(`delayed`)*100/sum(`delayed` + `on time`))
)
# Display table - head
kable(head(percent_delayed)) |>
kable_styling("striped")
Airline | Airport | Percent_Delayed |
---|---|---|
AMWEST | San.Francisco | 29 |
AMWEST | Seattle | 23 |
ALASKA | San.Francisco | 17 |
AMWEST | San.Diego | 15 |
ALASKA | Seattle | 14 |
AMWEST | Los.Angeles | 14 |
# Display overall percent
kable(overall_delayed) |>
kable_styling("striped")
Airline | Overall_Percent_Delayed |
---|---|
ALASKA | 13 |
AMWEST | 11 |
Alaska airline overall delayed about 13% of it’s flights and AM West overall delayed about 11% of its flights. The highest percent of flights delayed by Alaska were to San Francisco at 17% and for AM West it was also San Francisco at 29%.
Create a bar chart showing the ratios of on-time and delayed for each location.
ggplot(percent_delayed, aes(x=Airport, y=Percent_Delayed, fill= Airline)) +
geom_bar(stat="identity" , position=position_dodge())
The bar chart makes it clear that across all destinations, AM West had a higher percent of delayed flights than Alaska did. This is an interesting observation because overall - Alaska had a greater percentage of delays than AM West - but when you look at each destination individually you can see that AM West had a greater percentage of delays by each destination.
In conclusion, I read in a csv file that had the flight delays for
two airlines in a wide format. Then I used tidyr
and
dplyr
to tidy and transform the data. Lastly, I performed
an analysis to compare the arrival delays for the two airlines.