Importing data

I created a csv file that has the data provided for the assignment and uploaded it to my github. I pull the data from github into a data frame, to allow for reproduction.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)

fileURL = 'https://raw.githubusercontent.com/stoybis/DATA607Repo/main/assignment4data.csv'

flightData = read.csv(url(fileURL), header = TRUE)

head(flightData)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Cleaning the data

I drop the blank third row and assign column names to the first two columns

#drop blank third row
flightData <-flightData[-3,]
rownames(flightData) <-NULL #reset the row names

#add column names to first two columns
colnames(flightData)[1:2] <- c('airline', 'status')

head(flightData)
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61

I will be pivoting the data frame to a longer format. However, there are missing values in the data frame in row 2 column 1 and row 4 column 1. I fill in the blanks before pivoting.

#fill in missing data
flightData[2,1] <- "ALASKA"
flightData[4,1] <- "AM WEST"

head(flightData)
##   airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

Pivoting the data frame

I want to pivot the data frame into a longer format where the columns are: airline, status, city, and count of flights.

flightDataLong <- flightData |> pivot_longer(
  cols = !c('airline', 'status'),
  names_to = 'city',
  values_to = 'count'
)

head(flightDataLong)
## # A tibble: 6 × 4
##   airline status  city          count
##   <chr>   <chr>   <chr>         <int>
## 1 ALASKA  on time Los.Angeles     497
## 2 ALASKA  on time Phoenix         221
## 3 ALASKA  on time San.Diego       212
## 4 ALASKA  on time San.Francisco   503
## 5 ALASKA  on time Seattle        1841
## 6 ALASKA  delayed Los.Angeles      62

I want to create two columns, on_time and delayed. This will allow me to sum across the two columns to get the total flights for each city by each airline. I rename “on time” to “on_time” to remove the space in the variable name

flightDataLong <- flightDataLong |> pivot_wider(
  names_from = status,
  values_from = count
)

colnames(flightDataLong)[3] <- 'on_time'

head(flightDataLong)
## # A tibble: 6 × 4
##   airline city          on_time delayed
##   <chr>   <chr>           <int>   <int>
## 1 ALASKA  Los.Angeles       497      62
## 2 ALASKA  Phoenix           221      12
## 3 ALASKA  San.Diego         212      20
## 4 ALASKA  San.Francisco     503     102
## 5 ALASKA  Seattle          1841     305
## 6 AM WEST Los.Angeles       694     117

Mutations

I add two columns: total flights and delay rate which is equal to delays divided by total flights.

flightDataLong <- flightDataLong |> mutate(total = on_time + delayed,
                                             delay_rate = delayed/total)

head(flightDataLong)
## # A tibble: 6 × 6
##   airline city          on_time delayed total delay_rate
##   <chr>   <chr>           <int>   <int> <int>      <dbl>
## 1 ALASKA  Los.Angeles       497      62   559     0.111 
## 2 ALASKA  Phoenix           221      12   233     0.0515
## 3 ALASKA  San.Diego         212      20   232     0.0862
## 4 ALASKA  San.Francisco     503     102   605     0.169 
## 5 ALASKA  Seattle          1841     305  2146     0.142 
## 6 AM WEST Los.Angeles       694     117   811     0.144

Analysis

Below, I check to see which airline has the higher delay rate by each city.

ggplot(flightDataLong, aes(x = city, y = delay_rate, fill = airline))+
  geom_bar(stat='identity', position = 'dodge') +
  ggtitle('Delay Rate by City')

The chart shows that AM West has a higher delay rate than Alaska for each city. The highest delay rates are in San Francisco while the lowest delay rates are in Phoenix.

Simpson’s Paradox

If we look at the total on time and delayed departures without breaking it down by city, AM West will seem to have less delays.

totalFlights <- flightDataLong |> group_by(airline) |> summarize(total_onTime = sum(on_time), total_Delayed = sum(delayed))

totalFlights <- totalFlights |> mutate(totalFlights = total_onTime + total_Delayed)

totalFlights <- totalFlights |> mutate(delay_rate = total_Delayed / totalFlights)


ggplot(totalFlights, aes(x = airline, y = delay_rate, fill = airline))+
  geom_bar(stat='identity', position = 'dodge') +
  ggtitle('Delay Rate by Airline')