I created a csv file that has the data provided for the assignment and uploaded it to my github. I pull the data from github into a data frame, to allow for reproduction.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
fileURL = 'https://raw.githubusercontent.com/stoybis/DATA607Repo/main/assignment4data.csv'
flightData = read.csv(url(fileURL), header = TRUE)
head(flightData)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
I drop the blank third row and assign column names to the first two columns
#drop blank third row
flightData <-flightData[-3,]
rownames(flightData) <-NULL #reset the row names
#add column names to first two columns
colnames(flightData)[1:2] <- c('airline', 'status')
head(flightData)
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
I will be pivoting the data frame to a longer format. However, there are missing values in the data frame in row 2 column 1 and row 4 column 1. I fill in the blanks before pivoting.
#fill in missing data
flightData[2,1] <- "ALASKA"
flightData[4,1] <- "AM WEST"
head(flightData)
## airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
I want to pivot the data frame into a longer format where the columns are: airline, status, city, and count of flights.
flightDataLong <- flightData |> pivot_longer(
cols = !c('airline', 'status'),
names_to = 'city',
values_to = 'count'
)
head(flightDataLong)
## # A tibble: 6 × 4
## airline status city count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
I want to create two columns, on_time and delayed. This will allow me to sum across the two columns to get the total flights for each city by each airline. I rename “on time” to “on_time” to remove the space in the variable name
flightDataLong <- flightDataLong |> pivot_wider(
names_from = status,
values_from = count
)
colnames(flightDataLong)[3] <- 'on_time'
head(flightDataLong)
## # A tibble: 6 × 4
## airline city on_time delayed
## <chr> <chr> <int> <int>
## 1 ALASKA Los.Angeles 497 62
## 2 ALASKA Phoenix 221 12
## 3 ALASKA San.Diego 212 20
## 4 ALASKA San.Francisco 503 102
## 5 ALASKA Seattle 1841 305
## 6 AM WEST Los.Angeles 694 117
I add two columns: total flights and delay rate which is equal to delays divided by total flights.
flightDataLong <- flightDataLong |> mutate(total = on_time + delayed,
delay_rate = delayed/total)
head(flightDataLong)
## # A tibble: 6 × 6
## airline city on_time delayed total delay_rate
## <chr> <chr> <int> <int> <int> <dbl>
## 1 ALASKA Los.Angeles 497 62 559 0.111
## 2 ALASKA Phoenix 221 12 233 0.0515
## 3 ALASKA San.Diego 212 20 232 0.0862
## 4 ALASKA San.Francisco 503 102 605 0.169
## 5 ALASKA Seattle 1841 305 2146 0.142
## 6 AM WEST Los.Angeles 694 117 811 0.144
Below, I check to see which airline has the higher delay rate by each city.
ggplot(flightDataLong, aes(x = city, y = delay_rate, fill = airline))+
geom_bar(stat='identity', position = 'dodge') +
ggtitle('Delay Rate by City')
The chart shows that AM West has a higher delay rate than Alaska for each city. The highest delay rates are in San Francisco while the lowest delay rates are in Phoenix.
If we look at the total on time and delayed departures without breaking it down by city, AM West will seem to have less delays.
totalFlights <- flightDataLong |> group_by(airline) |> summarize(total_onTime = sum(on_time), total_Delayed = sum(delayed))
totalFlights <- totalFlights |> mutate(totalFlights = total_onTime + total_Delayed)
totalFlights <- totalFlights |> mutate(delay_rate = total_Delayed / totalFlights)
ggplot(totalFlights, aes(x = airline, y = delay_rate, fill = airline))+
geom_bar(stat='identity', position = 'dodge') +
ggtitle('Delay Rate by Airline')