Data Import, Tidying, and Transformation

The data is imported as a CVS file and converted to a data frame. The third row is dropped from the data frame because it is an empty row that served as a spacer between the data for the individual airlines in the CSV file. Next, the airline names are duplicated to fill in the missing values in rows 2 and 4 of column “X”, which is then renamed “Airline”. The data is then pivoted vertically so that the individual airport results are captured in a column containing the airport name and another column capturing the delay or on-time flight count. Lastly, two new columns are added to the data frame via the mutate() function: 1) the total flights at each airport for each air line and 2) the event rate (delay or on-time) for each airline at each airport. The completed tidy data set is displayed via kable() for clarity.

knitr::opts_chunk$set(echo = TRUE)

#Load libraries needed for assignment.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(dplyr)
library('RCurl')
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete
library(tidyverse)
library(ggplot2)
library(knitr)
library(ggpubr
        )
#Import data set from GitHub

getdata <- getURL('https://raw.githubusercontent.com/kr0710/Data607/refs/heads/main/Assignment_5_data.csv')

#Convert data to a data frame, drop empty rows, fill in missing row and column names

df <- read.csv(text = getdata)[-c(3),]
df <- read.csv('c:/data/Assignment_5_data.csv')[-c(3),]
df$X <- c(df$X[1], df$X[1], df$X[3], df$X[3])
colnames(df)[1] <- "Airline"
colnames(df)[2] <- "flight_fate"

#Pivot dat vertically 

df_pivot <-
  pivot_longer(df,
    cols = colnames(df)[3:7],
    names_to = 'Airport',
    values_to = 'fate_count',
  )

#Add columns to calculate total flights at each airport for each airline
#Calculate delayed or on-time flight rates
df_final <-
  df_pivot |>
  group_by(Airline, Airport) |>
  mutate(total_flights = sum(fate_count)) |>
  mutate(fate_rate = round((fate_count/total_flights)*100, digits = 2))

#display tidy and transformed data
kable(df_final)
Airline flight_fate Airport fate_count total_flights fate_rate
Alaska on time Los.Angeles 497 559 88.91
Alaska on time Phoenix 221 233 94.85
Alaska on time San.Diego 212 232 91.38
Alaska on time San.Francisco 503 605 83.14
Alaska on time Seattle 1841 2146 85.79
Alaska delayed Los.Angeles 62 559 11.09
Alaska delayed Phoenix 12 233 5.15
Alaska delayed San.Diego 20 232 8.62
Alaska delayed San.Francisco 102 605 16.86
Alaska delayed Seattle 305 2146 14.21
AM West on time Los.Angeles 694 811 85.57
AM West on time Phoenix 4840 5255 92.10
AM West on time San.Diego 383 448 85.49
AM West on time San.Francisco 320 449 71.27
AM West on time Seattle 201 262 76.72
AM West delayed Los.Angeles 117 811 14.43
AM West delayed Phoenix 415 5255 7.90
AM West delayed San.Diego 65 448 14.51
AM West delayed San.Francisco 129 449 28.73
AM West delayed Seattle 61 262 23.28

Data Visualization and Analysis

The data frame is then bifurcated into two data frames depending on the status of the flight (i.e. “flight_fate”). The “fate_rate” (i.e. rate of delay or on-time flights) of the airports and airlines in each respective data frame are both visualized via ggplot2.

At all five airports, it appears that Alaskan Airlines outperforms AM West in respect to lower rates of delayed flights. Naturally, Alaskan airlines also has higher on-time rates across the board at each airport surveyed. Lastly, the relationship between total flights and the delayed flight rate is examined in graphical form. However, there is not a strongly correlated relationship between these two variables, but obviously the data is quite limited (n=10).

Airline Airport flight_fate total_flights fate_rate
Alaska Los.Angeles delayed 559 11.09
Alaska Phoenix delayed 233 5.15
Alaska San.Diego delayed 232 8.62
Alaska San.Francisco delayed 605 16.86
Alaska Seattle delayed 2146 14.21
AM West Los.Angeles delayed 811 14.43
AM West Phoenix delayed 5255 7.90
AM West San.Diego delayed 448 14.51
AM West San.Francisco delayed 449 28.73
AM West Seattle delayed 262 23.28

Airline Airport flight_fate total_flights fate_rate
Alaska Los.Angeles on time 559 88.91
Alaska Phoenix on time 233 94.85
Alaska San.Diego on time 232 91.38
Alaska San.Francisco on time 605 83.14
Alaska Seattle on time 2146 85.79
AM West Los.Angeles on time 811 85.57
AM West Phoenix on time 5255 92.10
AM West San.Diego on time 448 85.49
AM West San.Francisco on time 449 71.27
AM West Seattle on time 262 76.72