The data is imported as a CVS file and converted to a data frame. The third row is dropped from the data frame because it is an empty row that served as a spacer between the data for the individual airlines in the CSV file. Next, the airline names are duplicated to fill in the missing values in rows 2 and 4 of column “X”, which is then renamed “Airline”. The data is then pivoted vertically so that the individual airport results are captured in a column containing the airport name and another column capturing the delay or on-time flight count. Lastly, two new columns are added to the data frame via the mutate() function: 1) the total flights at each airport for each air line and 2) the event rate (delay or on-time) for each airline at each airport. The completed tidy data set is displayed via kable() for clarity.
knitr::opts_chunk$set(echo = TRUE)
#Load libraries needed for assignment.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(dplyr)
library('RCurl')
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:tidyr':
##
## complete
library(tidyverse)
library(ggplot2)
library(knitr)
library(ggpubr
)
#Import data set from GitHub
getdata <- getURL('https://raw.githubusercontent.com/kr0710/Data607/refs/heads/main/Assignment_5_data.csv')
#Convert data to a data frame, drop empty rows, fill in missing row and column names
df <- read.csv(text = getdata)[-c(3),]
df <- read.csv('c:/data/Assignment_5_data.csv')[-c(3),]
df$X <- c(df$X[1], df$X[1], df$X[3], df$X[3])
colnames(df)[1] <- "Airline"
colnames(df)[2] <- "flight_fate"
#Pivot dat vertically
df_pivot <-
pivot_longer(df,
cols = colnames(df)[3:7],
names_to = 'Airport',
values_to = 'fate_count',
)
#Add columns to calculate total flights at each airport for each airline
#Calculate delayed or on-time flight rates
df_final <-
df_pivot |>
group_by(Airline, Airport) |>
mutate(total_flights = sum(fate_count)) |>
mutate(fate_rate = round((fate_count/total_flights)*100, digits = 2))
#display tidy and transformed data
kable(df_final)
Airline | flight_fate | Airport | fate_count | total_flights | fate_rate |
---|---|---|---|---|---|
Alaska | on time | Los.Angeles | 497 | 559 | 88.91 |
Alaska | on time | Phoenix | 221 | 233 | 94.85 |
Alaska | on time | San.Diego | 212 | 232 | 91.38 |
Alaska | on time | San.Francisco | 503 | 605 | 83.14 |
Alaska | on time | Seattle | 1841 | 2146 | 85.79 |
Alaska | delayed | Los.Angeles | 62 | 559 | 11.09 |
Alaska | delayed | Phoenix | 12 | 233 | 5.15 |
Alaska | delayed | San.Diego | 20 | 232 | 8.62 |
Alaska | delayed | San.Francisco | 102 | 605 | 16.86 |
Alaska | delayed | Seattle | 305 | 2146 | 14.21 |
AM West | on time | Los.Angeles | 694 | 811 | 85.57 |
AM West | on time | Phoenix | 4840 | 5255 | 92.10 |
AM West | on time | San.Diego | 383 | 448 | 85.49 |
AM West | on time | San.Francisco | 320 | 449 | 71.27 |
AM West | on time | Seattle | 201 | 262 | 76.72 |
AM West | delayed | Los.Angeles | 117 | 811 | 14.43 |
AM West | delayed | Phoenix | 415 | 5255 | 7.90 |
AM West | delayed | San.Diego | 65 | 448 | 14.51 |
AM West | delayed | San.Francisco | 129 | 449 | 28.73 |
AM West | delayed | Seattle | 61 | 262 | 23.28 |
The data frame is then bifurcated into two data frames depending on the status of the flight (i.e. “flight_fate”). The “fate_rate” (i.e. rate of delay or on-time flights) of the airports and airlines in each respective data frame are both visualized via ggplot2.
At all five airports, it appears that Alaskan Airlines outperforms AM West in respect to lower rates of delayed flights. Naturally, Alaskan airlines also has higher on-time rates across the board at each airport surveyed. Lastly, the relationship between total flights and the delayed flight rate is examined in graphical form. However, there is not a strongly correlated relationship between these two variables, but obviously the data is quite limited (n=10).
Airline | Airport | flight_fate | total_flights | fate_rate |
---|---|---|---|---|
Alaska | Los.Angeles | delayed | 559 | 11.09 |
Alaska | Phoenix | delayed | 233 | 5.15 |
Alaska | San.Diego | delayed | 232 | 8.62 |
Alaska | San.Francisco | delayed | 605 | 16.86 |
Alaska | Seattle | delayed | 2146 | 14.21 |
AM West | Los.Angeles | delayed | 811 | 14.43 |
AM West | Phoenix | delayed | 5255 | 7.90 |
AM West | San.Diego | delayed | 448 | 14.51 |
AM West | San.Francisco | delayed | 449 | 28.73 |
AM West | Seattle | delayed | 262 | 23.28 |
Airline | Airport | flight_fate | total_flights | fate_rate |
---|---|---|---|---|
Alaska | Los.Angeles | on time | 559 | 88.91 |
Alaska | Phoenix | on time | 233 | 94.85 |
Alaska | San.Diego | on time | 232 | 91.38 |
Alaska | San.Francisco | on time | 605 | 83.14 |
Alaska | Seattle | on time | 2146 | 85.79 |
AM West | Los.Angeles | on time | 811 | 85.57 |
AM West | Phoenix | on time | 5255 | 92.10 |
AM West | San.Diego | on time | 448 | 85.49 |
AM West | San.Francisco | on time | 449 | 71.27 |
AM West | Seattle | on time | 262 | 76.72 |