We have been provided with a chart that describes arrival delays for two airlines across five destinations. In this analysis, I will compare the arrival delays for the two airlines and for each airport: Los Angeles, Phoenix, San Diego, San Francisco, and Seattle.
First, let’s load the required packages.
library(RMySQL)
library(dplyr)
library(tidyr)
The data is stored in a SQL database. Here, we will load the data into R from SQL.
host <- "localhost"
source("logincredentials.R")
dbname <- "arrivals"
# Establish the database connection
con <- dbConnect(MySQL(), user = user, password = password, dbname = dbname, host = host)
query <- "SELECT * FROM alaska_am"
# Fetch data into a data frame
arrivals <- dbGetQuery(con, query)
show(arrivals)
## _ __ LosAngeles Phoenix SanDiego SanFrancisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Next, we will remove the empty rows and fill the empty cells.
# remove the empty row
arrivals <- arrivals[c(1,2,4,5),]
# rename the column names
colnames(arrivals) <- c("Airline", "Status",
"LosAngeles", "Phoenix",
"SanDiego", "SanFrancisco", "Seattle")
# rename the NAs in the Airline column
arrivals$Airline[2] <- "ALASKA"
arrivals$Airline[4] <- "AM WEST"
arrivals$Status<- ifelse(grepl("on time", arrivals$Status, ignore.case = TRUE), "on_time", arrivals$Status)
show(arrivals)
## Airline Status LosAngeles Phoenix SanDiego SanFrancisco Seattle
## 1 ALASKA on_time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on_time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Here, we will rearrange the data using tidyr to make it easier to perform the analysis on the delays.
# rearrange the data so that the destination columns are now put into one column
tidyarrivals <- pivot_longer(arrivals,
cols = c(LosAngeles, Phoenix, SanDiego, SanFrancisco, Seattle),
names_to = "Destination",
values_to = "NumberOfFlights")
# rearrange the data so that the "Number of Flights" is split between two columns: on time and delayed
tidyarrivals <- pivot_wider(tidyarrivals, names_from = Status, values_from = NumberOfFlights)
show(tidyarrivals)
## # A tibble: 10 × 4
## Airline Destination on_time delayed
## <chr> <chr> <int> <int>
## 1 ALASKA LosAngeles 497 62
## 2 ALASKA Phoenix 221 12
## 3 ALASKA SanDiego 212 20
## 4 ALASKA SanFrancisco 503 102
## 5 ALASKA Seattle 1841 305
## 6 AM WEST LosAngeles 694 117
## 7 AM WEST Phoenix 4840 415
## 8 AM WEST SanDiego 383 65
## 9 AM WEST SanFrancisco 320 129
## 10 AM WEST Seattle 201 61
Here, we will use the dplyr pacakage to analyze the data.
# create a column for total flights
tidyarrivals <- tidyarrivals %>%
mutate(
total_flights = delayed + on_time,
# calculate the percentage of flights that are delayed, show only 2 decimal points
percentage_delayed = round((delayed / total_flights) *100, digits = 2) ,
# calculate the percentage of flights that are on time, show only 2 decimal points
percentage_ontime = round((on_time / total_flights) * 100, digits = 2)
)
# rename Los Angeles
tidyarrivals$Destination <- ifelse(grepl("LosAngeles", tidyarrivals$Destination),
"Los Angeles", tidyarrivals$Destination)
# rename San Francisco
tidyarrivals$Destination <- ifelse(grepl("SanFrancisco", tidyarrivals$Destination,
ignore.case = TRUE), "San Francisco",
tidyarrivals$Destination)
# rename San Diego
tidyarrivals$Destination <- ifelse(grepl("SanDiego", tidyarrivals$Destination,
ignore.case = TRUE), "San Diego",
tidyarrivals$Destination)
show(tidyarrivals)
## # A tibble: 10 × 7
## Airline Destination on_time delayed total_flights percentage_delayed
## <chr> <chr> <int> <int> <int> <dbl>
## 1 ALASKA Los Angeles 497 62 559 11.1
## 2 ALASKA Phoenix 221 12 233 5.15
## 3 ALASKA San Diego 212 20 232 8.62
## 4 ALASKA San Francisco 503 102 605 16.9
## 5 ALASKA Seattle 1841 305 2146 14.2
## 6 AM WEST Los Angeles 694 117 811 14.4
## 7 AM WEST Phoenix 4840 415 5255 7.9
## 8 AM WEST San Diego 383 65 448 14.5
## 9 AM WEST San Francisco 320 129 449 28.7
## 10 AM WEST Seattle 201 61 262 23.3
## # ℹ 1 more variable: percentage_ontime <dbl>
Below, we will create a table showing the average delayed and on time flight percentages for each airline.
# create a table showing the average delayed and on time percentages for each airline
airlineArrivalPercent <- tidyarrivals %>%
group_by(Airline) %>%
summarize(
# show the average percentage delayed for each airline
"Average Delayed Percentage" = round(mean(percentage_delayed), digits = 2),
# show the average percentage on time for each airline
"Average On Time Percentage" = round(mean(percentage_ontime), digits = 2)
) %>%
as.data.frame()
show(airlineArrivalPercent)
## Airline Average Delayed Percentage Average On Time Percentage
## 1 ALASKA 11.19 88.81
## 2 AM WEST 17.77 82.23
From the above table, we can conlude that Alaska has the least amount of delays out of the two airlines.
Below, we will create a table that shows the average delayed and on time flight percentages for each destination.
# create a table showing the average delayed and on time percentages for each destination
destArrivalPercent <- tidyarrivals %>%
group_by(Destination) %>%
summarize(
# show the average percentage delayed for each destination
"Average Delayed Percentage" = round(mean(percentage_delayed), 2),
# show the average percentage on time for each destination
"Average On Time Percentage" = round(mean(percentage_ontime),2)
) %>%
as.data.frame()
show(destArrivalPercent)
## Destination Average Delayed Percentage Average On Time Percentage
## 1 Los Angeles 12.76 87.24
## 2 Phoenix 6.53 93.47
## 3 San Diego 11.56 88.44
## 4 San Francisco 22.80 77.20
## 5 Seattle 18.75 81.25
From the above table, we can conclude that Phoenix has the least amount of delays out of the 5 airports.
Below, we will create a graph that shows the distribution of flight status for each airline and each flight destination.
library(ggplot2)
percent_delayed <- tidyarrivals$percentage_delayed / 100
percent_ontime <- tidyarrivals$percentage_ontime / 100
ggplot(tidyarrivals, aes(x = Destination)) +
geom_bar(aes(y = percent_ontime * total_flights, fill = "On-Time"), stat = "identity") +
geom_bar(aes(y = percent_delayed * total_flights, fill = "Delayed"), stat = "identity") +
labs(title = "Delayed vs. On-Time Flights by Destination", x = "Destination", y = "Number of Flights") +
scale_fill_manual(values = c("On-Time" = "blue", "Delayed" = "red"), name = "Flight Status") +
theme_minimal() +
theme(legend.position = 'right',
axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(~ Airline, scales = "free")
The above graph shows the frequency of delayed and on time flights for each destination, split by airline.
We can tell from the analysis that Alaska has the lower amount of delays out of the two airlines. Out of the five airports, Phoenix has the least amount of delays. If you would like to avoid flight delays, based on the above analysis, the best option would be to fly Alaska airlines and fly into Phoenix.
total_flights <- select(tidyarrivals, Airline, Destination, total_flights) %>%
as.data.frame()
total_airline <- total_flights %>%
group_by(Airline) %>%
summarise(total_flights = sum(total_flights)) %>%
as.data.frame()
total_airline
## Airline total_flights
## 1 ALASKA 3775
## 2 AM WEST 7225
total_destination <- total_flights %>%
group_by(Destination) %>%
summarize(total_flights = sum(total_flights)) %>%
as.data.frame()
total_destination
## Destination total_flights
## 1 Los Angeles 1370
## 2 Phoenix 5488
## 3 San Diego 680
## 4 San Francisco 1054
## 5 Seattle 2408
As you can see, the sample sizes for each airline and each airport are not close in size. In order for future analyses to be more accurate, I would suggest that the sample sizes used to be closer in size. The larger the sample size, the more representative of the actual trend for the population. For example, the conclusions I came to in regards to Phoenix flight delays are likely more accurate than the conclusions I came to for San Diego flight delays.
Arrival Delay Chart: Numbersense, Kaiser Fung, McGraw Hill, 2013