Assignment – Tidying and Transforming Data

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  3. Perform analysis to compare the arrival delays for the two airlines.

  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:  The URL to the .Rmd file in your GitHub repository. and  The URL for your rpubs.com web page.

#load the data
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)

airline_delays <- read.csv('https://raw.githubusercontent.com/nk014914/Data-607/main/airline_destinations_data_untidy.csv')

#view the data
head(airline_delays)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61
#Removing blank row and renaming columns
airline_delays <- airline_delays %>% 
  drop_na() %>%
  rename(airline = X, status = X.1)

#Filling NAs for missing airline cells then filling with the previous value downwards
airline_delays <- airline_delays %>% 
  mutate(airline = na_if(airline, '')) %>%
  fill(airline, .direction = c('down'))
#Converting City rows into a column
airline_delays <- airline_delays %>% 
  gather(key = 'city', value = 'total.flights', 3:7)
#airline_delays

#separate by flight status and create column for total flights
flightstatus <- spread(airline_delays, status, total.flights) %>%
  mutate( total.flights = rowSums(across(where(is.numeric))))

#rename on time to not have space
flightstatus <- flightstatus %>%
  rename('on.time' = 4)

flightstatus
##    airline          city delayed on.time total.flights
## 1   ALASKA   Los.Angeles      62     497           559
## 2   ALASKA       Phoenix      12     221           233
## 3   ALASKA     San.Diego      20     212           232
## 4   ALASKA San.Francisco     102     503           605
## 5   ALASKA       Seattle     305    1841          2146
## 6  AM WEST   Los.Angeles     117     694           811
## 7  AM WEST       Phoenix     415    4840          5255
## 8  AM WEST     San.Diego      65     383           448
## 9  AM WEST San.Francisco     129     320           449
## 10 AM WEST       Seattle      61     201           262
#analyzing delay rates between the two airlines

#create columns for % of flights delayed and % of flights on-time

airline_delays_summarized <- flightstatus %>%
  group_by(airline, city) %>%
  #creating sums of the delayed and ontime flights to use for rate formulas
  summarise(total.delayed = sum(delayed), total.ontime = sum(on.time))%>%
  mutate(total = rowSums(across(where(is.numeric))),
       delay.rate = round(total.delayed / total*100, 2),
       ontime.rate = round (total.ontime / total*100, 2))
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
airline_delays_summarized
## # A tibble: 10 × 7
## # Groups:   airline [2]
##    airline city          total.delayed total.ontime total delay.rate ontime.rate
##    <chr>   <chr>                 <int>        <int> <dbl>      <dbl>       <dbl>
##  1 ALASKA  Los.Angeles              62          497   559      11.1         88.9
##  2 ALASKA  Phoenix                  12          221   233       5.15        94.8
##  3 ALASKA  San.Diego                20          212   232       8.62        91.4
##  4 ALASKA  San.Francisco           102          503   605      16.9         83.1
##  5 ALASKA  Seattle                 305         1841  2146      14.2         85.8
##  6 AM WEST Los.Angeles             117          694   811      14.4         85.6
##  7 AM WEST Phoenix                 415         4840  5255       7.9         92.1
##  8 AM WEST San.Diego                65          383   448      14.5         85.5
##  9 AM WEST San.Francisco           129          320   449      28.7         71.3
## 10 AM WEST Seattle                  61          201   262      23.3         76.7
#creating Visuals
library(ggplot2)

#viewing delay rates by city per airline
ggplot(airline_delays_summarized, aes(fill= airline, y = delay.rate, x = city)) +
  geom_bar(position="dodge", stat = "identity")

#viewing ontime rates by city per airline
ggplot(airline_delays_summarized, aes(fill= airline, y = ontime.rate, x = city)) +
  geom_bar(position="dodge", stat = "identity")

Conclusion

Based on the tidied data and the stacked bar charts, we can conclude that for every destination, AM West had a higher rate of delayed flights in comparison to Alaska.