Introduction

# Here's a picture of the table 

knitr::include_graphics("pic.jpg")

The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2)Readtheinformationfromyour.CSVfileintoR,andusetidyrand dplyr asneededtotidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Data cleanup

Connect to the database using
library(RMySQL)
# Connect to the database using the environment variables
con <- dbConnect(MySQL(),
                 host = "localhost",
                 username = "root",
                 password = "Alex9297248844",
                 dbname = "Airport")
Load data from the database into an R dataframe
con <- dbGetQuery(con, "SELECT * FROM airlines")
str(con)
## 'data.frame':    4 obs. of  7 variables:
##  $ airport_name       : chr  "ALASKA" "ALASKA" "AM WEST" "AM WEST"
##  $ arrival_performance: chr  "on time" "delayed" "on time" "delayed"
##  $ los_angeles        : int  497 62 694 117
##  $ phoenix            : int  221 12 4840 415
##  $ san_diego          : int  212 20 383 65
##  $ san_francisco      : int  503 102 320 129
##  $ seattle            : int  1841 305 201 61
print(con)
##   airport_name arrival_performance los_angeles phoenix san_diego san_francisco
## 1       ALASKA             on time         497     221       212           503
## 2       ALASKA             delayed          62      12        20           102
## 3      AM WEST             on time         694    4840       383           320
## 4      AM WEST             delayed         117     415        65           129
##   seattle
## 1    1841
## 2     305
## 3     201
## 4      61
Tidy and transform the data

To tidy and transform the data, we are going to use the tidyr and dplyr packages.

# convert the data from wide format to long format
library(tidyr)
airlines_long <- con %>% 
  pivot_longer(cols = c("los_angeles", "phoenix", "san_diego", "san_francisco", "seattle"), 
               names_to = "destination", 
               values_to = "arrivals")
knitr::kable(airlines_long)
airport_name arrival_performance destination arrivals
ALASKA on time los_angeles 497
ALASKA on time phoenix 221
ALASKA on time san_diego 212
ALASKA on time san_francisco 503
ALASKA on time seattle 1841
ALASKA delayed los_angeles 62
ALASKA delayed phoenix 12
ALASKA delayed san_diego 20
ALASKA delayed san_francisco 102
ALASKA delayed seattle 305
AM WEST on time los_angeles 694
AM WEST on time phoenix 4840
AM WEST on time san_diego 383
AM WEST on time san_francisco 320
AM WEST on time seattle 201
AM WEST delayed los_angeles 117
AM WEST delayed phoenix 415
AM WEST delayed san_diego 65
AM WEST delayed san_francisco 129
AM WEST delayed seattle 61

Analysis

# Then, use dplyr to calculate the total number of arrivals for each airline and destination
library(dplyr)
arrivals_summary <- airlines_long %>% 
  group_by(airport_name, destination) %>% 
  summarise(total_arrivals = sum(arrivals))
knitr::kable(arrivals_summary)
airport_name destination total_arrivals
ALASKA los_angeles 559
ALASKA phoenix 233
ALASKA san_diego 232
ALASKA san_francisco 605
ALASKA seattle 2146
AM WEST los_angeles 811
AM WEST phoenix 5255
AM WEST san_diego 448
AM WEST san_francisco 449
AM WEST seattle 262
# Finally, use dplyr to calculate the percentage of arrivals that were delayed for each airline and destination
delay_summary <- airlines_long %>% 
  filter(arrival_performance == "delayed") %>% 
  group_by(airport_name, destination) %>% 
  summarise(delay_percentage = sum(arrivals)/sum(airlines_long$arrivals[airlines_long$airport_name == airport_name]))
# You can then join the two summary tables together if you want to see both the total number of arrivals and the percentage of delayed arrivals for each airline and destination
summary_table <- left_join(arrivals_summary, delay_summary)
knitr::kable(summary_table)
airport_name destination total_arrivals delay_percentage
ALASKA los_angeles 559 0.0164238
ALASKA phoenix 233 0.0031788
ALASKA san_diego 232 0.0052980
ALASKA san_francisco 605 0.0270199
ALASKA seattle 2146 0.0807947
AM WEST los_angeles 811 0.0161938
AM WEST phoenix 5255 0.0574394
AM WEST san_diego 448 0.0089965
AM WEST san_francisco 449 0.0178547
AM WEST seattle 262 0.0084429
Analysis to compare the arrival delays
  1. Compare the total number of arrivals for each airline and destination:
library(ggplot2)
ggplot(summary_table, aes(x = destination, y = total_arrivals, fill = airport_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Arrivals by Destination and Airline",
       x = "Destination",
       y = "Total Arrivals",
       fill = "Airline")

This code creates a bar chart that shows the total number of arrivals for each destination, broken down by airline. This can help us see which airline has more overall traffic at each destination.

  1. Compare the percentage of delayed arrivals for each airline and destination:
ggplot(delay_summary, aes(x = destination, y = delay_percentage, fill = airport_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Percentage of Delayed Arrivals by Destination and Airline",
       x = "Destination",
       y = "Percentage of Delayed Arrivals",
       fill = "Airline")

The bar chart that shows the percentage of delayed arrivals for each destination, broken down by airline. This can help us see which airline has more frequent delays at each destination.

  1. Compare the average delay time for each airline and destination:
delay_times <- airlines_long %>% 
  filter(arrival_performance == "delayed") %>% 
  group_by(airport_name, destination) %>% 
  summarise(avg_delay_time = mean(arrivals)) %>% 
  ungroup()

ggplot(delay_times, aes(x = destination, y = avg_delay_time, fill = airport_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Delay Time for Delayed Arrivals by Destination and Airline",
       x = "Destination",
       y = "Average Delay Time",
       fill = "Airline")

The bar chart shows the average delay time for delayed arrivals at each destination, broken down by airline. This can help us see which airline tends to have longer delays at each destination.

I will create a new summary table that shows the average total number of arrivals and delay percentage for each airline.

airline_summary <- summary_table %>%
  group_by(airport_name) %>%
  summarise(avg_total_arrivals = mean(total_arrivals),
            avg_delay_percentage = mean(delay_percentage))

knitr:: kable(airline_summary)
airport_name avg_total_arrivals avg_delay_percentage
ALASKA 755 0.0265430
AM WEST 1445 0.0217855

Conclusions

Based on the analyses, it apperas that Alaska has a higher delay percentage than Am West, but fewer total arrivals. Meanwhile, Am West has more total arrivals but a lower delay percentage. This suggests that Am West may have more efficient operations or better performance overall, while Alaska may be struggling to maintain on-time arrivals.