The chart above describes arrival delays for two airlines across five destinations. Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.
#load the data
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
airline_delays <- read.csv('https://raw.githubusercontent.com/nk014914/Data-607/main/airline_destinations_data_untidy.csv')
#view the data
head(airline_delays)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
#Removing blank row and renaming columns
airline_delays <- airline_delays %>%
drop_na() %>%
rename(airline = X, status = X.1)
#Filling NAs for missing airline cells then filling with the previous value downwards
airline_delays <- airline_delays %>%
mutate(airline = na_if(airline, '')) %>%
fill(airline, .direction = c('down'))
#Converting City rows into a column
airline_delays <- airline_delays %>%
gather(key = 'city', value = 'total.flights', 3:7)
#airline_delays
#separate by flight status and create column for total flights
flightstatus <- spread(airline_delays, status, total.flights) %>%
mutate( total.flights = rowSums(across(where(is.numeric))))
#rename on time to not have space
flightstatus <- flightstatus %>%
rename('on.time' = 4)
flightstatus
## airline city delayed on.time total.flights
## 1 ALASKA Los.Angeles 62 497 559
## 2 ALASKA Phoenix 12 221 233
## 3 ALASKA San.Diego 20 212 232
## 4 ALASKA San.Francisco 102 503 605
## 5 ALASKA Seattle 305 1841 2146
## 6 AM WEST Los.Angeles 117 694 811
## 7 AM WEST Phoenix 415 4840 5255
## 8 AM WEST San.Diego 65 383 448
## 9 AM WEST San.Francisco 129 320 449
## 10 AM WEST Seattle 61 201 262
#analyzing delay rates between the two airlines
#create columns for % of flights delayed and % of flights on-time
airline_delays_summarized <- flightstatus %>%
group_by(airline, city) %>%
#creating sums of the delayed and ontime flights to use for rate formulas
summarise(total.delayed = sum(delayed), total.ontime = sum(on.time))%>%
mutate(total = rowSums(across(where(is.numeric))),
delay.rate = round(total.delayed / total*100, 2),
ontime.rate = round (total.ontime / total*100, 2))
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
airline_delays_summarized
## # A tibble: 10 × 7
## # Groups: airline [2]
## airline city total.delayed total.ontime total delay.rate ontime.rate
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 ALASKA Los.Angeles 62 497 559 11.1 88.9
## 2 ALASKA Phoenix 12 221 233 5.15 94.8
## 3 ALASKA San.Diego 20 212 232 8.62 91.4
## 4 ALASKA San.Francisco 102 503 605 16.9 83.1
## 5 ALASKA Seattle 305 1841 2146 14.2 85.8
## 6 AM WEST Los.Angeles 117 694 811 14.4 85.6
## 7 AM WEST Phoenix 415 4840 5255 7.9 92.1
## 8 AM WEST San.Diego 65 383 448 14.5 85.5
## 9 AM WEST San.Francisco 129 320 449 28.7 71.3
## 10 AM WEST Seattle 61 201 262 23.3 76.7
#creating Visuals
library(ggplot2)
#viewing delay rates by city per airline
ggplot(airline_delays_summarized, aes(fill= airline, y = delay.rate, x = city)) +
geom_bar(position="dodge", stat = "identity")
#viewing ontime rates by city per airline
ggplot(airline_delays_summarized, aes(fill= airline, y = ontime.rate, x = city)) +
geom_bar(position="dodge", stat = "identity")
Based on the tidied data and the stacked bar charts, we can conclude that for every destination, AM West had a higher rate of delayed flights in comparison to Alaska.