For this assignment, I had to transform a wide table into a long table and analyze the data. Converting the wide dataset of airline information to a long dataset was simple with the pivot_longer function in tidyr. Initially, I plotted the data to see if anything stood out. Afterwards, I mutated the dataframe to look at the percentage of delayed flights from each airline. Then I plotted the percentage of delayed flights by city to see if there were any trends.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

#Created a csv in excel and uploaded it to github. Saved it as a dataframe(df)

df <- read.csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20607/week%205%20flights.csv")

df
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

#Added some values and column names to df

df[2,1] <- "ALASKA"
df[5,1] <- "AM WEST"
colnames(df) <- c("Airline", "Status", "Los Angeles", "Pheonix", "San Diego", "San Francisco", "Seattle")

df
##   Airline  Status Los Angeles Pheonix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

#Used the pivot_longer function in tidyr to convert wide data into long data

df_long <- df %>% 
  pivot_longer(
    cols = c("Los Angeles", "Pheonix", "San Diego", "San Francisco", "Seattle"),
    names_to = "City",
    values_to = "Value",
    values_drop_na = TRUE) %>% 
  select("Airline", "City", "Status", "Value") %>% 
  arrange(City, Status)

df_long
## # A tibble: 20 × 4
##    Airline City          Status  Value
##    <chr>   <chr>         <chr>   <int>
##  1 ALASKA  Los Angeles   delayed    62
##  2 AM WEST Los Angeles   delayed   117
##  3 ALASKA  Los Angeles   on time   497
##  4 AM WEST Los Angeles   on time   694
##  5 ALASKA  Pheonix       delayed    12
##  6 AM WEST Pheonix       delayed   415
##  7 ALASKA  Pheonix       on time   221
##  8 AM WEST Pheonix       on time  4840
##  9 ALASKA  San Diego     delayed    20
## 10 AM WEST San Diego     delayed    65
## 11 ALASKA  San Diego     on time   212
## 12 AM WEST San Diego     on time   383
## 13 ALASKA  San Francisco delayed   102
## 14 AM WEST San Francisco delayed   129
## 15 ALASKA  San Francisco on time   503
## 16 AM WEST San Francisco on time   320
## 17 ALASKA  Seattle       delayed   305
## 18 AM WEST Seattle       delayed    61
## 19 ALASKA  Seattle       on time  1841
## 20 AM WEST Seattle       on time   201

#Wanted to get a general sense of the data by plotting it. Looks like Phenoix has the largest amount of on time flights.

df_long %>% 
  ggplot(aes(x= City, y = Value, fill = Status))+
  facet_wrap(~Airline)+
  geom_bar(stat = "identity", position = "dodge")+
  labs(title = "Airline Delays", x = "City", y = "Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

###Analysis to compare the arrival delays for the two airlines

#Calculating the percentage of delayed flights on ALASKA airline

df_delayed_alaska <- df_long %>% 
  filter(Airline == "ALASKA") %>% 
  mutate(total_flights = sum(Value)) %>% 
  filter(Status == "delayed") %>% 
  mutate(pct_flight_delay = sum(Value)/total_flights*100)

df_delayed_alaska
## # A tibble: 5 × 6
##   Airline City          Status  Value total_flights pct_flight_delay
##   <chr>   <chr>         <chr>   <int>         <int>            <dbl>
## 1 ALASKA  Los Angeles   delayed    62          3775             13.3
## 2 ALASKA  Pheonix       delayed    12          3775             13.3
## 3 ALASKA  San Diego     delayed    20          3775             13.3
## 4 ALASKA  San Francisco delayed   102          3775             13.3
## 5 ALASKA  Seattle       delayed   305          3775             13.3

#Calculating the percentage of delayed flights on AM WEST airline

df_delay_am <- df_long %>% 
  filter(Airline == "AM WEST") %>% 
  mutate(total_flights = sum(Value)) %>% 
  filter(Status == "delayed") %>% 
  mutate(pct_flight_delay = sum(Value)/total_flights*100)

df_delay_am
## # A tibble: 5 × 6
##   Airline City          Status  Value total_flights pct_flight_delay
##   <chr>   <chr>         <chr>   <int>         <int>            <dbl>
## 1 AM WEST Los Angeles   delayed   117          7225             10.9
## 2 AM WEST Pheonix       delayed   415          7225             10.9
## 3 AM WEST San Diego     delayed    65          7225             10.9
## 4 AM WEST San Francisco delayed   129          7225             10.9
## 5 AM WEST Seattle       delayed    61          7225             10.9

#Finding the percentage of delayed flights by city and airline

df_delay_city <- df_long %>% 
  group_by(City,Airline) %>% 
  mutate(total_flights_city = sum(Value)) %>% 
  mutate(pct_delay_city = Value/total_flights_city*100)

df_delay_city
## # A tibble: 20 × 6
## # Groups:   City, Airline [10]
##    Airline City          Status  Value total_flights_city pct_delay_city
##    <chr>   <chr>         <chr>   <int>              <int>          <dbl>
##  1 ALASKA  Los Angeles   delayed    62                559          11.1 
##  2 AM WEST Los Angeles   delayed   117                811          14.4 
##  3 ALASKA  Los Angeles   on time   497                559          88.9 
##  4 AM WEST Los Angeles   on time   694                811          85.6 
##  5 ALASKA  Pheonix       delayed    12                233           5.15
##  6 AM WEST Pheonix       delayed   415               5255           7.90
##  7 ALASKA  Pheonix       on time   221                233          94.8 
##  8 AM WEST Pheonix       on time  4840               5255          92.1 
##  9 ALASKA  San Diego     delayed    20                232           8.62
## 10 AM WEST San Diego     delayed    65                448          14.5 
## 11 ALASKA  San Diego     on time   212                232          91.4 
## 12 AM WEST San Diego     on time   383                448          85.5 
## 13 ALASKA  San Francisco delayed   102                605          16.9 
## 14 AM WEST San Francisco delayed   129                449          28.7 
## 15 ALASKA  San Francisco on time   503                605          83.1 
## 16 AM WEST San Francisco on time   320                449          71.3 
## 17 ALASKA  Seattle       delayed   305               2146          14.2 
## 18 AM WEST Seattle       delayed    61                262          23.3 
## 19 ALASKA  Seattle       on time  1841               2146          85.8 
## 20 AM WEST Seattle       on time   201                262          76.7
df_delay_city %>% 
  filter(Status == "delayed") %>% 
  ggplot(aes(x= City, y = pct_delay_city, fill = Airline))+
  geom_bar(stat = "identity", position = "dodge")+
  labs(title = "Airline Delays by percentage", x = "City", y = "Percent Delayed")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusion: AM West airline had more flights at 7225 compared to Alaska Airlines at 3775. Even though they had almost double the number of flights as Alaska Airlines, AM West (10%) had a lower overall percentage of delayed flights than Alaska Airlines (13%).

However, if I were to compare the percentage of delays in each city, Alaska Airlines has a lower percentage of delayed flights in each city compared to AM West.

The reason AM West airline has a lower overall percentage of delayed flights is because Phoenix has almost 5,000 flights that were on time.