DATA 607 Homework 4

Introduction

We have been provided with a chart that describes arrival delays for two airlines across five destinations. In this analysis, I will compare the arrival delays for the two airlines and for each airport: Los Angeles, Phoenix, San Diego, San Francisco, and Seattle.

Load Required Packages

First, let’s load the required packages.

library(RMySQL)
library(dplyr)
library(tidyr)

Load Data from SQL Database

The data is stored in a SQL database. Here, we will load the data into R from SQL.

host <- "localhost"
source("logincredentials.R")
dbname <- "arrivals"

# Establish the database connection
con <- dbConnect(MySQL(), user = user, password = password, dbname = dbname, host = host)

query <- "SELECT * FROM alaska_am"

# Fetch data into a data frame
arrivals <- dbGetQuery(con, query)

show(arrivals)

##         _      __ LosAngeles Phoenix SanDiego SanFrancisco Seattle
## 1  ALASKA on time        497     221      212          503    1841
## 2    <NA> delayed         62      12       20          102     305
## 3    <NA>    <NA>         NA      NA       NA           NA      NA
## 4 AM WEST on time        694    4840      383          320     201
## 5    <NA> delayed        117     415       65          129      61

Clean the Data

Next, we will remove the empty rows and fill the empty cells.

# remove the empty row
arrivals <- arrivals[c(1,2,4,5),]

# rename the column names
colnames(arrivals) <- c("Airline", "Status",
                       "LosAngeles", "Phoenix",
                       "SanDiego", "SanFrancisco", "Seattle")

# rename the NAs in the Airline column
arrivals$Airline[2] <- "ALASKA"
arrivals$Airline[4] <- "AM WEST"

arrivals$Status<- ifelse(grepl("on time", arrivals$Status, ignore.case = TRUE), "on_time", arrivals$Status)

show(arrivals)

##   Airline  Status LosAngeles Phoenix SanDiego SanFrancisco Seattle
## 1  ALASKA on_time        497     221      212          503    1841
## 2  ALASKA delayed         62      12       20          102     305
## 4 AM WEST on_time        694    4840      383          320     201
## 5 AM WEST delayed        117     415       65          129      61

Rearrange the Data using tidyr

Here, we will rearrange the data using tidyr to make it easier to perform the analysis on the delays.

# rearrange the data so that the destination columns are now put into one column
tidyarrivals <- pivot_longer(arrivals, 
                             cols = c(LosAngeles, Phoenix, SanDiego, SanFrancisco, Seattle),
                             names_to = "Destination", 
                             values_to = "NumberOfFlights")

# rearrange the data so that the "Number of Flights" is split between two columns: on time and delayed
tidyarrivals <- pivot_wider(tidyarrivals, names_from = Status, values_from = NumberOfFlights)

show(tidyarrivals)

## # A tibble: 10 × 4
##    Airline Destination  on_time delayed
##    <chr>   <chr>          <int>   <int>
##  1 ALASKA  LosAngeles       497      62
##  2 ALASKA  Phoenix          221      12
##  3 ALASKA  SanDiego         212      20
##  4 ALASKA  SanFrancisco     503     102
##  5 ALASKA  Seattle         1841     305
##  6 AM WEST LosAngeles       694     117
##  7 AM WEST Phoenix         4840     415
##  8 AM WEST SanDiego         383      65
##  9 AM WEST SanFrancisco     320     129
## 10 AM WEST Seattle          201      61

Use dplyr to Analyze the Data

Here, we will use the dplyr pacakage to analyze the data.

# create a column for total flights
tidyarrivals <- tidyarrivals %>%
  mutate(
    total_flights = delayed + on_time,
    # calculate the percentage of flights that are delayed, show only 2 decimal points
    percentage_delayed = round((delayed / total_flights) *100, digits = 2) ,
    # calculate the percentage of flights that are on time, show only 2 decimal points
    percentage_ontime = round((on_time / total_flights) * 100, digits = 2)
  ) 

# rename Los Angeles
tidyarrivals$Destination <- ifelse(grepl("LosAngeles", tidyarrivals$Destination), 
                                     "Los Angeles", tidyarrivals$Destination)
# rename San Francisco
tidyarrivals$Destination <- ifelse(grepl("SanFrancisco", tidyarrivals$Destination, 
                                         ignore.case = TRUE), "San Francisco",
                                   tidyarrivals$Destination)

# rename San Diego
tidyarrivals$Destination <- ifelse(grepl("SanDiego", tidyarrivals$Destination,
                                         ignore.case = TRUE), "San Diego",
                                   tidyarrivals$Destination)

show(tidyarrivals)

## # A tibble: 10 × 7
##    Airline Destination   on_time delayed total_flights percentage_delayed
##    <chr>   <chr>           <int>   <int>         <int>              <dbl>
##  1 ALASKA  Los Angeles       497      62           559              11.1 
##  2 ALASKA  Phoenix           221      12           233               5.15
##  3 ALASKA  San Diego         212      20           232               8.62
##  4 ALASKA  San Francisco     503     102           605              16.9 
##  5 ALASKA  Seattle          1841     305          2146              14.2 
##  6 AM WEST Los Angeles       694     117           811              14.4 
##  7 AM WEST Phoenix          4840     415          5255               7.9 
##  8 AM WEST San Diego         383      65           448              14.5 
##  9 AM WEST San Francisco     320     129           449              28.7 
## 10 AM WEST Seattle           201      61           262              23.3 
## # ℹ 1 more variable: percentage_ontime <dbl>

Below, we will create a table showing the average delayed and on time flight percentages for each airline.

# create a table showing the average delayed and on time percentages for each airline
airlineArrivalPercent <- tidyarrivals %>%
  group_by(Airline) %>%
  summarize(
    # show the average percentage delayed for each airline
    "Average Delayed Percentage" = round(mean(percentage_delayed), digits = 2),
    # show the average percentage on time for each airline
    "Average On Time Percentage" = round(mean(percentage_ontime), digits = 2)
  ) %>%
  as.data.frame() 

show(airlineArrivalPercent)

##   Airline Average Delayed Percentage Average On Time Percentage
## 1  ALASKA                      11.19                      88.81
## 2 AM WEST                      17.77                      82.23

From the above table, we can conlude that Alaska has the least amount of delays out of the two airlines.

Below, we will create a table that shows the average delayed and on time flight percentages for each destination.

# create a table showing the average delayed and on time percentages for each destination

destArrivalPercent <- tidyarrivals %>%
  group_by(Destination) %>%
  summarize(
    # show the average percentage delayed for each destination
    "Average Delayed Percentage" = round(mean(percentage_delayed), 2),
    # show the average percentage on time for each destination
  "Average On Time Percentage" = round(mean(percentage_ontime),2)
  ) %>%
  as.data.frame()

show(destArrivalPercent)

##     Destination Average Delayed Percentage Average On Time Percentage
## 1   Los Angeles                      12.76                      87.24
## 2       Phoenix                       6.53                      93.47
## 3     San Diego                      11.56                      88.44
## 4 San Francisco                      22.80                      77.20
## 5       Seattle                      18.75                      81.25

From the above table, we can conclude that Phoenix has the least amount of delays out of the 5 airports.

Below, we will create a graph that shows the distribution of flight status for each airline and each flight destination.

library(ggplot2)
percent_delayed <- tidyarrivals$percentage_delayed / 100
percent_ontime <- tidyarrivals$percentage_ontime / 100

ggplot(tidyarrivals, aes(x = Destination)) +
  geom_bar(aes(y = percent_ontime * total_flights, fill = "On-Time"), stat = "identity") +
  geom_bar(aes(y = percent_delayed * total_flights, fill = "Delayed"), stat = "identity") +
  labs(title = "Delayed vs. On-Time Flights by Destination", x = "Destination", y = "Number of Flights") +
  scale_fill_manual(values = c("On-Time" = "blue", "Delayed" = "red"), name = "Flight Status") +
  theme_minimal() +
  theme(legend.position = 'right',
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~ Airline, scales = "free")

The above graph shows the frequency of delayed and on time flights for each destination, split by airline.

Conclusion

We can tell from the analysis that Alaska has the lower amount of delays out of the two airlines. Out of the five airports, Phoenix has the least amount of delays. If you would like to avoid flight delays, based on the above analysis, the best option would be to fly Alaska airlines and fly into Phoenix.

Recommendations

total_flights <- select(tidyarrivals, Airline, Destination, total_flights) %>%
  as.data.frame()

total_airline <- total_flights %>%
  group_by(Airline) %>%
  summarise(total_flights = sum(total_flights)) %>%
  as.data.frame()

total_airline

##   Airline total_flights
## 1  ALASKA          3775
## 2 AM WEST          7225

total_destination <- total_flights %>%
  group_by(Destination) %>%
  summarize(total_flights = sum(total_flights)) %>%
  as.data.frame()

total_destination

##     Destination total_flights
## 1   Los Angeles          1370
## 2       Phoenix          5488
## 3     San Diego           680
## 4 San Francisco          1054
## 5       Seattle          2408

As you can see, the sample sizes for each airline and each airport are not close in size. In order for future analyses to be more accurate, I would suggest that the sample sizes used to be closer in size. The larger the sample size, the more representative of the actual trend for the population. For example, the conclusions I came to in regards to Phoenix flight delays are likely more accurate than the conclusions I came to for San Diego flight delays.

Sources

Arrival Delay Chart: Numbersense, Kaiser Fung, McGraw Hill, 2013