ASSIGNMENT WEEK 5A

Loading R packages

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Create th CSV File

# Create the data manually
flights_data <- data.frame(
  X = c("ALASKA", NA, NA, "AM WEST", NA),
  Y = c("on time", "delayed", NA, "on time", "delayed"),
  Los_Angeles   = c(497, 62,NA, 694, 117),
  Phoenix       = c(221, 12,NA, 4840, 415),
  San_Diego     = c(212, 20, NA, 383, 65),
  San_Francisco = c(503, 102, NA, 320, 129),
  Seattle       = c(1841, 305, NA , 210, 61)
)

# Write to CSV file
write.csv(flights_data, "airline_flights.csv", row.names = FALSE)

Let’s read the CSV file into R

flights_data <- read.csv("airline_flights.csv", header = T)
flights_data

Let’s remove the blank row now.

flights_data <- flights_data[c(1,2,4,5), ]
flights_data

Let’s populate the missing values

flights_data[c(2,4),1] <- flights_data[c(1,3),1]

colnames(flights_data)[1:2] <- c('Airlines', 'flights_status')

flights_data

Let’s transformed data from wide to long format

Long_flights_data <- gather(flights_data, key = 'Arrival_Destination', value = 'n', 3:7)
Long_flights_data

Let’s split the flights status column into two sections and by airlines.

ANSWER: I would like to spread the flights status column into two categories including delayed/on time and by airlines using the tidyr package to facilitate the percent comparison

Long_flights_data <- spread(Long_flights_data,flights_status,n)
Long_flights_data

Let’s first calculate the percentage of “on time” and “delayed” of both ALASKA and AM WEST.

To calculate the “on time” and “delayed” percent per arrival, we will first find the total number of flights of each arrival destination which is equivalent to flights (on time + delayed) and then divide each number of flights delayed / on time data by the total number of flights for every arrival destination.

Long_flights_data <- Long_flights_data %>% 
  mutate(total_flights = delayed + `on time`,
         percent_on_time = round(`on time`/total_flights*100, 1),
         percent_delayed = round(delayed/total_flights*100, 1))
Long_flights_data

Plot of airlines delayed flights city by city.

city_perf <- Long_flights_data %>%
  mutate(total_flights = delayed + `on time`,
         Delayed_rate = round(delayed/total_flights*100, 2))

ggplot(city_perf, aes(x = Arrival_Destination, y = Delayed_rate, fill = Airlines)) +
  geom_col(position = "dodge") +
  labs(title = "City-by-City Delays % by Airline",
       y = "Delays Percentage", x = "City") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Let’s now compare the overall percent of delayed flights accross both ALASKA and AM WEST airlines.

compareOverAll <- Long_flights_data %>% 
  group_by(Airlines) %>%
  summarize(Delayed_flights = sum(delayed),OnTime_flights = sum(`on time`), 
            Percent_Delayed = round(Delayed_flights/sum(OnTime_flights,Delayed_flights),2))%>%
  select(Airlines, Percent_Delayed)
compareOverAll %>% arrange(desc(Percent_Delayed))

print(compareOverAll)

## # A tibble: 2 × 2
##   Airlines Percent_Delayed
##   <chr>              <dbl>
## 1 ALASKA              0.13
## 2 AM WEST             0.11

Overall flights delayed percent Plot

ggplot(compareOverAll, aes(x = Airlines, y = Percent_Delayed, fill = Airlines)) +
  geom_col() +
  labs(title = "Overall Delayed_flights % by Airline",
       y = "Delayed_flights Percentage", x = "Airline") +
  theme_minimal()

Analysis: The overall percentage delays result between Alaska and AM West airlines is different from my expectations. Indeed, Alaska has less delayed flights in 4 out the five arrival destination including Los Angeles, Phoenix, San Diego and San Francisco compared to AM West but the Alaska airlines overall percentage delayed flights is higher than AM West.

Discrepancy between comparing ALASKA and AM WEST airlines’ flight performances city-by-city and overall.

The result paradox when comparing Alaska and AM West airlines delayed flights perfomance city by city percentages vs Overall percentages can be explained by the fact that both airlines do not have the same amount flights and the worst part is there is high significant total number of flights difference in some cities by airline which of course will impact the overall percent comparison. For instance, in Phoenix, although ALASKA has less delayed flights than AM WEST, the total number of ALASKA flights to Phoenix is less than 5% compared to the total number of AM WEST flights to Phoenix (233 flights for ALASKA vs 5255 flights for AM WEST ).