My approach for Assignment 5A Air delays is to create the data frame based on the table in the pdf attachment as a CSV file. Upload that csv file then code then display it and perform count analyst. Checking for which airline and in which cities experienced the most delays and the most on times. After exploring more of the data I may also bring up any other interesting counts the I notice. Then transform the data from wide to long. The end I will show the percentage of on time and arrivals for each city split based on airlines.
I don’t expected the recreation of the file from the table will be too challenging but I will know when I attempted to recreate the same results. Transforming the data from wide to long is new to me but I do not expect that to be too difficult either. From what I’ve read I just need to use the pivot_longer function in tidyr to transform it. The last thing I expect may be tricky is to accurate calculate and interpreted the percentage ratios of the airline flight arrival and delay time. I worry of not accurately speaking correctly about what the data says and incorrectly calculating what I am trying to prove. Trying to avoid this it is best to take time and go over what I have written and getting a seconded opinion.
airline_data <- read.csv("C:/Users/typem/Documents/GitHub/Data607_Assignment5A/airline_data.csv")
airline_data
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
## Warning: package 'stringr' was built under R version 4.4.3
flight_wide <-na.omit(airline_data) %>%
rename(AirLine = Airline, Arrival = Status) %>%
mutate(AirLine = na_if(AirLine, '')) %>%
fill(AirLine, .direction = 'down') %>%
gather('Location', 'Flight.Cnt', 3:7) %>%
mutate(Flight.Cnt = as.integer(str_replace(Flight.Cnt, ',', '')),
Arrival = str_replace(Arrival, 'on time', 'On.Time'),
Arrival = str_replace(Arrival, 'delayed', 'Delayed')) %>%
spread(Arrival, Flight.Cnt)
flight_wide
## AirLine Location Delayed On.Time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(scales)
## Warning: package 'scales' was built under R version 4.4.3
AirLine_summary <- flight_wide %>%
group_by(AirLine) %>%
summarise(On.Time = sum(On.Time), Delayed = sum(Delayed)) %>%
mutate(Total = On.Time + Delayed,
Delay.Rate = Delayed / Total,
On.Time.Rate = On.Time / Total)
AirLine_summary
## # A tibble: 2 × 6
## AirLine On.Time Delayed Total Delay.Rate On.Time.Rate
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 ALASKA 3274 501 3775 0.133 0.867
## 2 AM WEST 6438 787 7225 0.109 0.891
countOT1 <- ggplot(data = AirLine_summary, aes(x = AirLine, y = On.Time, fill = AirLine)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = On.Time), vjust = -0.5) +
ggtitle('On Time by Airline') +
xlab('Airline') + ylab('On Time')
countDel1 <- ggplot(data = AirLine_summary, aes(x = AirLine, y = Delayed, fill = AirLine)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = Delayed), vjust = -0.5) +
ggtitle('Delay by Airline') +
xlab('Airline') + ylab('Delayed')
countOT1
countDel1
The counts of On time flights is significantly higher than delayed by a couple of thousand. AM West seems to have more delays but also more on times since AM West has more total flight over all than Alaska we need to calculate and run the percentage on arrivals and delays.
RateOT <- ggplot(data = AirLine_summary, aes(x = AirLine, y = On.Time.Rate, fill = AirLine)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = scales::percent(On.Time.Rate, accuracy = 0.01)), vjust = -0.5) +
ggtitle('On Time by Airline') +
xlab('Airline') + ylab('On Time')
RateDel <- ggplot(data = AirLine_summary, aes(x = AirLine, y = Delay.Rate, fill = AirLine)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = scales::percent(Delay.Rate, accuracy = 0.01)), vjust = -0.5) +
ggtitle('Delay by Airline') +
xlab('Airline') + ylab('Delayed')
RateOT
RateDel
Now with this percentage graph we can see the rates of both delayed and on time. Since these rates can tell us who is more delay or on time on average the totals should not skew our thoughts on which Airline is more reliable. We see that AM West the airline with the most delays actually performs better than Alaska because it is at a rate of about 11 percent while Alaska is at about 13 percent. Meaning that AM West has a lower rate of delays than Alaska and is more reliable to avoid delays.
city_summary <- flight_wide %>%
group_by(AirLine, Location) %>%
summarise(
On.Time = sum(On.Time),
Delayed = sum(Delayed),
Total = On.Time + Delayed,
Delay.Rate = round(Delayed / Total, 4),
On.Time.Rate = round(On.Time / Total, 4)
) %>%
arrange(Location, AirLine)
## `summarise()` has grouped output by 'AirLine'. You can override using the
## `.groups` argument.
city_summary
## # A tibble: 10 × 7
## # Groups: AirLine [2]
## AirLine Location On.Time Delayed Total Delay.Rate On.Time.Rate
## <chr> <chr> <int> <int> <int> <dbl> <dbl>
## 1 ALASKA Los.Angeles 497 62 559 0.111 0.889
## 2 AM WEST Los.Angeles 694 117 811 0.144 0.856
## 3 ALASKA Phoenix 221 12 233 0.0515 0.948
## 4 AM WEST Phoenix 4840 415 5255 0.079 0.921
## 5 ALASKA San.Diego 212 20 232 0.0862 0.914
## 6 AM WEST San.Diego 383 65 448 0.145 0.855
## 7 ALASKA San.Francisco 503 102 605 0.169 0.831
## 8 AM WEST San.Francisco 320 129 449 0.287 0.713
## 9 ALASKA Seattle 1841 305 2146 0.142 0.858
## 10 AM WEST Seattle 201 61 262 0.233 0.767
countOT2 <- ggplot(data = city_summary, aes(x = AirLine, y = On.Time, fill = AirLine))+
geom_bar(stat = 'identity')+
facet_grid(~Location)+
ggtitle('Total number of On Time by city')+
geom_text(aes(label = On.Time), vjust = -0.5)+
xlab('Airline') + ylab('On Time')
CountDel2 <- ggplot(data = city_summary, aes (x = AirLine, y = Delayed, fill = AirLine))+
geom_bar(stat = 'identity')+
facet_grid(~Location)+
ggtitle('Total number of Delayed by city')+
geom_text(aes(label = Delayed), vjust = -0.5)+
xlab('AirLine') + ylab('On Time')
countOT2
CountDel2
AM West has the higher total number of on times compared to Alaska
across 3 cites. AM West also has the most number of delays in four
cities. These numbers are totals so they do not reflect rate but where
these airlines are the most active in and how often they are delayed on
on time. For example it is hard to compare their performance in Phoenix
because Alaska has a total of 233 flights compared to AM West about
5,000 flights. We need rates.
RateOT2 <- ggplot(data = city_summary, aes(x = AirLine, y = On.Time.Rate, fill = AirLine)) +
geom_bar(stat = 'identity') +
facet_grid(~Location)+
geom_text(aes(label = scales::percent(On.Time.Rate, accuracy = 0.01)), vjust = -0.5) +
ggtitle('On Time by Airline') +
xlab('Airline') + ylab('On Time')
RateDel2 <- ggplot(data = city_summary, aes(x = AirLine, y = Delay.Rate, fill = AirLine)) +
geom_bar(stat = 'identity') +
facet_grid(~Location)+
geom_text(aes(label = scales::percent(Delay.Rate, accuracy = 0.01)), vjust = -0.5) +
ggtitle('Delay by Airline') +
xlab('Airline') + ylab('Delayed')
RateOT2
RateDel2
I am surprised to see the AM West does have a higher rate of delays
across all cities compared to Alaska when the average across all cities
was the opposite. Something notable that is a positive for AM West is
that the location where they have the most flights, Phoenix has the
lowest delay rate and could explain why AM West has a low delay rate on
average. Nearly all cities where AM West has high delay rates it has a
small amount of flights compared to Phoenix and Los Angeles. Alaska does
actually have its higher delay rates in the cities that have the most
flights and that is why its has a high flight delay overall.