library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Create the data manually
flights_data <- data.frame(
X = c("ALASKA", NA, NA, "AM WEST", NA),
Y = c("on time", "delayed", NA, "on time", "delayed"),
Los_Angeles = c(497, 62,NA, 694, 117),
Phoenix = c(221, 12,NA, 4840, 415),
San_Diego = c(212, 20, NA, 383, 65),
San_Francisco = c(503, 102, NA, 320, 129),
Seattle = c(1841, 305, NA , 210, 61)
)
# Write to CSV file
write.csv(flights_data, "airline_flights.csv", row.names = FALSE)
flights_data <- read.csv("airline_flights.csv", header = T)
flights_data
flights_data <- flights_data[c(1,2,4,5), ]
flights_data
flights_data[c(2,4),1] <- flights_data[c(1,3),1]
colnames(flights_data)[1:2] <- c('Airlines', 'flights_status')
flights_data
Long_flights_data <- gather(flights_data, key = 'Arrival_Destination', value = 'n', 3:7)
Long_flights_data
ANSWER: I would like to spread the flights status column into two categories including delayed/on time and by airlines using the tidyr package to facilitate the percent comparison
Long_flights_data <- spread(Long_flights_data,flights_status,n)
Long_flights_data
To calculate the “on time” and “delayed” percent per arrival, we will first find the total number of flights of each arrival destination which is equivalent to flights (on time + delayed) and then divide each number of flights delayed / on time data by the total number of flights for every arrival destination.
Long_flights_data <- Long_flights_data %>%
mutate(total_flights = delayed + `on time`,
percent_on_time = round(`on time`/total_flights*100, 1),
percent_delayed = round(delayed/total_flights*100, 1))
Long_flights_data
city_perf <- Long_flights_data %>%
mutate(total_flights = delayed + `on time`,
Delayed_rate = round(delayed/total_flights*100, 2))
ggplot(city_perf, aes(x = Arrival_Destination, y = Delayed_rate, fill = Airlines)) +
geom_col(position = "dodge") +
labs(title = "City-by-City Delays % by Airline",
y = "Delays Percentage", x = "City") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
compareOverAll <- Long_flights_data %>%
group_by(Airlines) %>%
summarize(Delayed_flights = sum(delayed),OnTime_flights = sum(`on time`),
Percent_Delayed = round(Delayed_flights/sum(OnTime_flights,Delayed_flights),2))%>%
select(Airlines, Percent_Delayed)
compareOverAll %>% arrange(desc(Percent_Delayed))
print(compareOverAll)
## # A tibble: 2 × 2
## Airlines Percent_Delayed
## <chr> <dbl>
## 1 ALASKA 0.13
## 2 AM WEST 0.11
ggplot(compareOverAll, aes(x = Airlines, y = Percent_Delayed, fill = Airlines)) +
geom_col() +
labs(title = "Overall Delayed_flights % by Airline",
y = "Delayed_flights Percentage", x = "Airline") +
theme_minimal()
Analysis: The overall percentage delays result between Alaska and AM West airlines is different from my expectations. Indeed, Alaska has less delayed flights in 4 out the five arrival destination including Los Angeles, Phoenix, San Diego and San Francisco compared to AM West but the Alaska airlines overall percentage delayed flights is higher than AM West.
The result paradox when comparing Alaska and AM West airlines delayed flights perfomance city by city percentages vs Overall percentages can be explained by the fact that both airlines do not have the same amount flights and the worst part is there is high significant total number of flights difference in some cities by airline which of course will impact the overall percent comparison. For instance, in Phoenix, although ALASKA has less delayed flights than AM WEST, the total number of ALASKA flights to Phoenix is less than 5% compared to the total number of AM WEST flights to Phoenix (233 flights for ALASKA vs 5255 flights for AM WEST ).