Data607 Assignment 4: Tidying and Transforming Data

This assignment will use the following packages:

library(readr)
library(RCurl)
library(stringr)
library(dplyr)
library(tidyr)
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)

General Overview

The goal of this assignment is to utilize untidy airport data that were on time or delayed from 5 different locations (cities) from two different airlines. The following data will also be uploaded onto my github page.

Initilization

To get the data, we will extract the data from the github file from here.

raw_file <- getURL("https://raw.githubusercontent.com/spacerome/Data607_Assignment_4/refs/heads/main/airline_data.csv")

csvfile <- data.frame(read.csv(text=raw_file, sep= "\t", stringsAsFactors = FALSE, check.names = FALSE))
csvfile

##     Var.1   Var.2 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61

Looking at the data the raw data gives X and X.1 for the first columns which should be airline and status which should be fixed. The third and fifth rows have an empty value which will be fixed in the next section.

Tidying the Data

colnames(csvfile)[1] <- "airline"
colnames(csvfile)[2] <- "status"
csvfile

##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61

After fixing the columns, we will now fix the empty rows from rows 3 and 5, to have ALASKA and AMWEST to continue prepping the data for analysis on the next section. The cities that have a space (e.g. Los Angeles) were outputed as Los.Angeles which will be fixed by utilizing gsub to fix it.

csvfile[csvfile=="    "] <- NA
colnames(csvfile) <- gsub("\\.", " ", colnames(csvfile))
csvfile <- fill(csvfile, airline)
csvfile

##   airline  status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

After cleaning the data then we utilize pivot_longer to prep csvfile for further data analysis.

csvfile <- pivot_longer(csvfile, cols = !c(airline, status),
                        names_to = "city", values_to = "flights")
head(csvfile)

## # A tibble: 6 × 4
##   airline status  city          flights
##   <chr>   <chr>   <chr>           <int>
## 1 ALASKA  on time Los Angeles       497
## 2 ALASKA  on time Phoenix           221
## 3 ALASKA  on time San Diego         212
## 4 ALASKA  on time San Francisco     503
## 5 ALASKA  on time Seattle          1841
## 6 ALASKA  delayed Los Angeles        62

Analyzing the Data

After cleaning the data we will have three variables utilized for analysis: airlineData, cityData, and cityairlineData.

airlineData <- csvfile %>%
  group_by(airline) %>% summarise(flightFrequency = sum(flights), flightDelay = sum(ifelse(status =='delayed',flights,0)))
airlineData <- airlineData %>%
  mutate(ontimeFlights = flightFrequency - flightDelay,
         delayRate = flightDelay/ontimeFlights,
         ontimeRate = ontimeFlights/flightFrequency) %>% arrange(delayRate)
airlineData

## # A tibble: 2 × 6
##   airline flightFrequency flightDelay ontimeFlights delayRate ontimeRate
##   <chr>             <int>       <dbl>         <dbl>     <dbl>      <dbl>
## 1 AM WEST            7225         787          6438     0.122      0.891
## 2 ALASKA             3775         501          3274     0.153      0.867

airlineData %>%
  mutate(
    delayRate = scales::percent(delayRate, accuracy = 0.01),
    ontimeRate = scales::percent(ontimeRate, accuracy = 0.01)
  ) %>%
  select(airline, delayRate, ontimeRate) %>%
  rename(       
    Airline = airline,     
    `Delay Rate` = delayRate,  
    `Ontime Rate` = ontimeRate 
  ) %>%
  kable(format = "html", caption = "Delay Rate and On-Time Rate by Airline") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), position = "center") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4CAF50")

Delay Rate and On-Time Rate by Airline
Airline	Delay Rate	Ontime Rate
AM WEST	12.22%	89.11%
ALASKA	15.30%	86.73%

From analyzing the airlineData, AM WEST has the highest ontime rate of 89.1%, and lowest delay rate of 12.2%, whereas ALASKA has a higher delay rate of 15.3%, but lower ontime rate of 86.7%.

cityData <- csvfile %>%
  group_by(city) %>% summarise(flightFrequency = sum(flights), flightDelay = sum(ifelse(status =='delayed',flights,0)))
cityData <- cityData %>%
  mutate(ontimeFlights = flightFrequency - flightDelay,
         delayRate = flightDelay/ontimeFlights,
         ontimeRate = ontimeFlights/flightFrequency) %>% arrange(delayRate)
cityData

## # A tibble: 5 × 6
##   city          flightFrequency flightDelay ontimeFlights delayRate ontimeRate
##   <chr>                   <int>       <dbl>         <dbl>     <dbl>      <dbl>
## 1 Phoenix                  5488         427          5061    0.0844      0.922
## 2 San Diego                 680          85           595    0.143       0.875
## 3 Los Angeles              1370         179          1191    0.150       0.869
## 4 Seattle                  2408         366          2042    0.179       0.848
## 5 San Francisco            1054         231           823    0.281       0.781

From analyzing the cityData, Phoenix has the highest ontime Rate of 92.2%, and lowest delay rate of 8.44%. San Francisco has the highest delay rate of 28.1% and its ontime rate is 78.1%.

cityairlineData <- csvfile %>%
  group_by(city, airline) %>%
  summarise(flightFrequency = sum(flights, na.rm = TRUE), 
            flightDelay = sum(ifelse(status == 'delayed', flights, 0), na.rm = TRUE)) %>%
  mutate(ontimeFlights = flightFrequency - flightDelay,
         delayRate = flightDelay / flightFrequency, ontimeRate = ontimeFlights/flightFrequency) %>%
  ungroup()

## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.

cityairlineData

## # A tibble: 10 × 7
##    city   airline flightFrequency flightDelay ontimeFlights delayRate ontimeRate
##    <chr>  <chr>             <int>       <dbl>         <dbl>     <dbl>      <dbl>
##  1 Los A… ALASKA              559          62           497    0.111       0.889
##  2 Los A… AM WEST             811         117           694    0.144       0.856
##  3 Phoen… ALASKA              233          12           221    0.0515      0.948
##  4 Phoen… AM WEST            5255         415          4840    0.0790      0.921
##  5 San D… ALASKA              232          20           212    0.0862      0.914
##  6 San D… AM WEST             448          65           383    0.145       0.855
##  7 San F… ALASKA              605         102           503    0.169       0.831
##  8 San F… AM WEST             449         129           320    0.287       0.713
##  9 Seatt… ALASKA             2146         305          1841    0.142       0.858
## 10 Seatt… AM WEST             262          61           201    0.233       0.767

cityairlineData %>%
  mutate(
    delayRate = scales::percent(delayRate, accuracy = 0.01),
    ontimeRate = scales::percent(ontimeRate, accuracy = 0.01)
  ) %>%
  select(city, airline, delayRate, ontimeRate) %>%
  rename(
    City = city,        
    Airline = airline,     
    `Delay Rate` = delayRate,  
    `Ontime Rate` = ontimeRate 
  ) %>%
  kable(format = "html", caption = "Delay Rate and On-Time Rate by City and Airline") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), position = "center") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4CAF50")

Delay Rate and On-Time Rate by City and Airline
City	Airline	Delay Rate	Ontime Rate
Los Angeles	ALASKA	11.09%	88.91%
Los Angeles	AM WEST	14.43%	85.57%
Phoenix	ALASKA	5.15%	94.85%
Phoenix	AM WEST	7.90%	92.10%
San Diego	ALASKA	8.62%	91.38%
San Diego	AM WEST	14.51%	85.49%
San Francisco	ALASKA	16.86%	83.14%
San Francisco	AM WEST	28.73%	71.27%
Seattle	ALASKA	14.21%	85.79%
Seattle	AM WEST	23.28%	76.72%

From this data, AM WEST had a higher delay rate than ALASKA within 4 of the 5 cities, but ALASKA had the highest delay rate at Seattle and had lesser delay rate on the other locations, but the delay rate for AMWEST is higher than ALASKA on all 5 cities. The on time rate was the highest at Phoenix for both airlines which indicates it was most likely the easiest location for both airlines.

Visualization

The following code blocks below will be visualizations from the data.

ggplot(airlineData, aes(x = reorder(airline, -flightFrequency))) +
  geom_bar(aes(y = ontimeRate, fill = "On-Time"), stat = "identity") +
  geom_bar(aes(y = -delayRate, fill = "Delayed"), stat = "identity") +
  labs(title = "On-Time and Delayed Rates Overall by Airline", x = "Airline", y = "Rate", fill = "Status") +
  scale_fill_manual(values = c("On-Time" = "skyblue", "Delayed" = "salmon")) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Though AM WEST has the highest frequencies, it appears to have the most delayed flights as opposed to ALASKA.

ggplot(cityairlineData, aes(x = airline, y = delayRate, fill = airline)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Delay Rate by Airline and City", 
       x = "Airline", y = "Delay Rate") +
  facet_wrap(~ city, scales = "free") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Airline"))

This bar plot shows that AM WEST has a higher delay rate than ALASKA, and reinforces the statement about Phoenix having the highest ontime rate, but lowest ontime rate, whereas San Francisco has the highest delay rate for both airlines.

ggplot(cityairlineData, aes(x = airline, y = ontimeRate, fill = airline)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "On Time Rate by Airline and City", 
       x = "Airline", y = "On Time Rate") +
  facet_wrap(~ city, scales = "free") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Airline"))

Over here this bar plot shows a pretty tight on-time rate between all five cities, with ALASKA being slightly higher than AM WEST.

CSV File Output

Outputs a csv file to replicate anaylsis:

write.csv(csvfile,"airline_data.csv")

Conclusion

Overall, from the data used for this assignment, AM WEST has a higher delay rate in all cities, but overall, ALASKA has a higher delay rate overall than AM WEST.

I believe if I had more data, such as more cities for each airline, or maybe more airlines, we can get a general understanding on what may be causing the delays for each city. I believe there is also an r library for flights that can be utilized to get a general understanding on why there are flights not arriving on time for San Francisco. I do believe if we include factors such as air traffic, weather, and other uncontrollable factors, this will give us a general understanding on why this is occurring.