# Here's a picture of the table
knitr::include_graphics("pic.jpg")The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2)Readtheinformationfromyour.CSVfileintoR,andusetidyrand dplyr asneededtotidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
library(RMySQL)
# Connect to the database using the environment variablescon <- dbConnect(MySQL(),
host = "localhost",
username = "root",
password = "Alex9297248844",
dbname = "Airport")con <- dbGetQuery(con, "SELECT * FROM airlines")str(con)## 'data.frame': 4 obs. of 7 variables:
## $ airport_name : chr "ALASKA" "ALASKA" "AM WEST" "AM WEST"
## $ arrival_performance: chr "on time" "delayed" "on time" "delayed"
## $ los_angeles : int 497 62 694 117
## $ phoenix : int 221 12 4840 415
## $ san_diego : int 212 20 383 65
## $ san_francisco : int 503 102 320 129
## $ seattle : int 1841 305 201 61
print(con)## airport_name arrival_performance los_angeles phoenix san_diego san_francisco
## 1 ALASKA on time 497 221 212 503
## 2 ALASKA delayed 62 12 20 102
## 3 AM WEST on time 694 4840 383 320
## 4 AM WEST delayed 117 415 65 129
## seattle
## 1 1841
## 2 305
## 3 201
## 4 61
To tidy and transform the data, we are going to use the tidyr and dplyr packages.
# convert the data from wide format to long format
library(tidyr)
airlines_long <- con %>%
pivot_longer(cols = c("los_angeles", "phoenix", "san_diego", "san_francisco", "seattle"),
names_to = "destination",
values_to = "arrivals")knitr::kable(airlines_long)| airport_name | arrival_performance | destination | arrivals |
|---|---|---|---|
| ALASKA | on time | los_angeles | 497 |
| ALASKA | on time | phoenix | 221 |
| ALASKA | on time | san_diego | 212 |
| ALASKA | on time | san_francisco | 503 |
| ALASKA | on time | seattle | 1841 |
| ALASKA | delayed | los_angeles | 62 |
| ALASKA | delayed | phoenix | 12 |
| ALASKA | delayed | san_diego | 20 |
| ALASKA | delayed | san_francisco | 102 |
| ALASKA | delayed | seattle | 305 |
| AM WEST | on time | los_angeles | 694 |
| AM WEST | on time | phoenix | 4840 |
| AM WEST | on time | san_diego | 383 |
| AM WEST | on time | san_francisco | 320 |
| AM WEST | on time | seattle | 201 |
| AM WEST | delayed | los_angeles | 117 |
| AM WEST | delayed | phoenix | 415 |
| AM WEST | delayed | san_diego | 65 |
| AM WEST | delayed | san_francisco | 129 |
| AM WEST | delayed | seattle | 61 |
# Then, use dplyr to calculate the total number of arrivals for each airline and destination
library(dplyr)
arrivals_summary <- airlines_long %>%
group_by(airport_name, destination) %>%
summarise(total_arrivals = sum(arrivals))knitr::kable(arrivals_summary)| airport_name | destination | total_arrivals |
|---|---|---|
| ALASKA | los_angeles | 559 |
| ALASKA | phoenix | 233 |
| ALASKA | san_diego | 232 |
| ALASKA | san_francisco | 605 |
| ALASKA | seattle | 2146 |
| AM WEST | los_angeles | 811 |
| AM WEST | phoenix | 5255 |
| AM WEST | san_diego | 448 |
| AM WEST | san_francisco | 449 |
| AM WEST | seattle | 262 |
# Finally, use dplyr to calculate the percentage of arrivals that were delayed for each airline and destination
delay_summary <- airlines_long %>%
filter(arrival_performance == "delayed") %>%
group_by(airport_name, destination) %>%
summarise(delay_percentage = sum(arrivals)/sum(airlines_long$arrivals[airlines_long$airport_name == airport_name]))# You can then join the two summary tables together if you want to see both the total number of arrivals and the percentage of delayed arrivals for each airline and destination
summary_table <- left_join(arrivals_summary, delay_summary)knitr::kable(summary_table)| airport_name | destination | total_arrivals | delay_percentage |
|---|---|---|---|
| ALASKA | los_angeles | 559 | 0.0164238 |
| ALASKA | phoenix | 233 | 0.0031788 |
| ALASKA | san_diego | 232 | 0.0052980 |
| ALASKA | san_francisco | 605 | 0.0270199 |
| ALASKA | seattle | 2146 | 0.0807947 |
| AM WEST | los_angeles | 811 | 0.0161938 |
| AM WEST | phoenix | 5255 | 0.0574394 |
| AM WEST | san_diego | 448 | 0.0089965 |
| AM WEST | san_francisco | 449 | 0.0178547 |
| AM WEST | seattle | 262 | 0.0084429 |
library(ggplot2)
ggplot(summary_table, aes(x = destination, y = total_arrivals, fill = airport_name)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Arrivals by Destination and Airline",
x = "Destination",
y = "Total Arrivals",
fill = "Airline")This code creates a bar chart that shows the total number of arrivals for each destination, broken down by airline. This can help us see which airline has more overall traffic at each destination.
ggplot(delay_summary, aes(x = destination, y = delay_percentage, fill = airport_name)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Percentage of Delayed Arrivals by Destination and Airline",
x = "Destination",
y = "Percentage of Delayed Arrivals",
fill = "Airline")The bar chart that shows the percentage of delayed arrivals for each destination, broken down by airline. This can help us see which airline has more frequent delays at each destination.
delay_times <- airlines_long %>%
filter(arrival_performance == "delayed") %>%
group_by(airport_name, destination) %>%
summarise(avg_delay_time = mean(arrivals)) %>%
ungroup()
ggplot(delay_times, aes(x = destination, y = avg_delay_time, fill = airport_name)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average Delay Time for Delayed Arrivals by Destination and Airline",
x = "Destination",
y = "Average Delay Time",
fill = "Airline")The bar chart shows the average delay time for delayed arrivals at each destination, broken down by airline. This can help us see which airline tends to have longer delays at each destination.
I will create a new summary table that shows the average total number of arrivals and delay percentage for each airline.
airline_summary <- summary_table %>%
group_by(airport_name) %>%
summarise(avg_total_arrivals = mean(total_arrivals),
avg_delay_percentage = mean(delay_percentage))
knitr:: kable(airline_summary)| airport_name | avg_total_arrivals | avg_delay_percentage |
|---|---|---|
| ALASKA | 755 | 0.0265430 |
| AM WEST | 1445 | 0.0217855 |
Based on the analyses, it apperas that Alaska has a higher delay percentage than Am West, but fewer total arrivals. Meanwhile, Am West has more total arrivals but a lower delay percentage. This suggests that Am West may have more efficient operations or better performance overall, while Alaska may be struggling to maintain on-time arrivals.