Introduction

# Here's a picture of the table 

knitr::include_graphics("pic.jpg")

The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2)Readtheinformationfromyour.CSVfileintoR,andusetidyrand dplyr asneededtotidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Data cleanup

Connect to the database using

library(RMySQL)
# Connect to the database using the environment variables

con <- dbConnect(MySQL(),
                 host = "localhost",
                 username = "root",
                 password = "Alex9297248844",
                 dbname = "Airport")

Load data from the database into an R dataframe

con <- dbGetQuery(con, "SELECT * FROM airlines")

str(con)

## 'data.frame':    4 obs. of  7 variables:
##  $ airport_name       : chr  "ALASKA" "ALASKA" "AM WEST" "AM WEST"
##  $ arrival_performance: chr  "on time" "delayed" "on time" "delayed"
##  $ los_angeles        : int  497 62 694 117
##  $ phoenix            : int  221 12 4840 415
##  $ san_diego          : int  212 20 383 65
##  $ san_francisco      : int  503 102 320 129
##  $ seattle            : int  1841 305 201 61

print(con)

##   airport_name arrival_performance los_angeles phoenix san_diego san_francisco
## 1       ALASKA             on time         497     221       212           503
## 2       ALASKA             delayed          62      12        20           102
## 3      AM WEST             on time         694    4840       383           320
## 4      AM WEST             delayed         117     415        65           129
##   seattle
## 1    1841
## 2     305
## 3     201
## 4      61

Tidy and transform the data

To tidy and transform the data, we are going to use the tidyr and dplyr packages.

# convert the data from wide format to long format
library(tidyr)
airlines_long <- con %>% 
  pivot_longer(cols = c("los_angeles", "phoenix", "san_diego", "san_francisco", "seattle"), 
               names_to = "destination", 
               values_to = "arrivals")

knitr::kable(airlines_long)

airport_name	arrival_performance	destination	arrivals
ALASKA	on time	los_angeles	497
ALASKA	on time	phoenix	221
ALASKA	on time	san_diego	212
ALASKA	on time	san_francisco	503
ALASKA	on time	seattle	1841
ALASKA	delayed	los_angeles	62
ALASKA	delayed	phoenix	12
ALASKA	delayed	san_diego	20
ALASKA	delayed	san_francisco	102
ALASKA	delayed	seattle	305
AM WEST	on time	los_angeles	694
AM WEST	on time	phoenix	4840
AM WEST	on time	san_diego	383
AM WEST	on time	san_francisco	320
AM WEST	on time	seattle	201
AM WEST	delayed	los_angeles	117
AM WEST	delayed	phoenix	415
AM WEST	delayed	san_diego	65
AM WEST	delayed	san_francisco	129
AM WEST	delayed	seattle	61

Analysis

# Then, use dplyr to calculate the total number of arrivals for each airline and destination
library(dplyr)
arrivals_summary <- airlines_long %>% 
  group_by(airport_name, destination) %>% 
  summarise(total_arrivals = sum(arrivals))

knitr::kable(arrivals_summary)

airport_name	destination	total_arrivals
ALASKA	los_angeles	559
ALASKA	phoenix	233
ALASKA	san_diego	232
ALASKA	san_francisco	605
ALASKA	seattle	2146
AM WEST	los_angeles	811
AM WEST	phoenix	5255
AM WEST	san_diego	448
AM WEST	san_francisco	449
AM WEST	seattle	262

# Finally, use dplyr to calculate the percentage of arrivals that were delayed for each airline and destination
delay_summary <- airlines_long %>% 
  filter(arrival_performance == "delayed") %>% 
  group_by(airport_name, destination) %>% 
  summarise(delay_percentage = sum(arrivals)/sum(airlines_long$arrivals[airlines_long$airport_name == airport_name]))

# You can then join the two summary tables together if you want to see both the total number of arrivals and the percentage of delayed arrivals for each airline and destination
summary_table <- left_join(arrivals_summary, delay_summary)

knitr::kable(summary_table)

airport_name	destination	total_arrivals	delay_percentage
ALASKA	los_angeles	559	0.0164238
ALASKA	phoenix	233	0.0031788
ALASKA	san_diego	232	0.0052980
ALASKA	san_francisco	605	0.0270199
ALASKA	seattle	2146	0.0807947
AM WEST	los_angeles	811	0.0161938
AM WEST	phoenix	5255	0.0574394
AM WEST	san_diego	448	0.0089965
AM WEST	san_francisco	449	0.0178547
AM WEST	seattle	262	0.0084429

Analysis to compare the arrival delays

Compare the total number of arrivals for each airline and destination:

library(ggplot2)
ggplot(summary_table, aes(x = destination, y = total_arrivals, fill = airport_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Total Arrivals by Destination and Airline",
       x = "Destination",
       y = "Total Arrivals",
       fill = "Airline")

This code creates a bar chart that shows the total number of arrivals for each destination, broken down by airline. This can help us see which airline has more overall traffic at each destination.

Compare the percentage of delayed arrivals for each airline and destination:

ggplot(delay_summary, aes(x = destination, y = delay_percentage, fill = airport_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Percentage of Delayed Arrivals by Destination and Airline",
       x = "Destination",
       y = "Percentage of Delayed Arrivals",
       fill = "Airline")

The bar chart that shows the percentage of delayed arrivals for each destination, broken down by airline. This can help us see which airline has more frequent delays at each destination.

Compare the average delay time for each airline and destination:

delay_times <- airlines_long %>% 
  filter(arrival_performance == "delayed") %>% 
  group_by(airport_name, destination) %>% 
  summarise(avg_delay_time = mean(arrivals)) %>% 
  ungroup()

ggplot(delay_times, aes(x = destination, y = avg_delay_time, fill = airport_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Average Delay Time for Delayed Arrivals by Destination and Airline",
       x = "Destination",
       y = "Average Delay Time",
       fill = "Airline")

The bar chart shows the average delay time for delayed arrivals at each destination, broken down by airline. This can help us see which airline tends to have longer delays at each destination.

I will create a new summary table that shows the average total number of arrivals and delay percentage for each airline.

airline_summary <- summary_table %>%
  group_by(airport_name) %>%
  summarise(avg_total_arrivals = mean(total_arrivals),
            avg_delay_percentage = mean(delay_percentage))

knitr:: kable(airline_summary)

airport_name	avg_total_arrivals	avg_delay_percentage
ALASKA	755	0.0265430
AM WEST	1445	0.0217855

Conclusions

Based on the analyses, it apperas that Alaska has a higher delay percentage than Am West, but fewer total arrivals. Meanwhile, Am West has more total arrivals but a lower delay percentage. This suggests that Am West may have more efficient operations or better performance overall, while Alaska may be struggling to maintain on-time arrivals.

Flights Analysis

waheeb Algabri