Overview / Introduction

For this assignment, we’ve been given an image of an untidy data set (shown below). Our task is to create a .CSV file that includes all of the information provided in the chart, read the information from the .CSV file into R, and use tidyr and dplyr to tidy and transform the data. Finally, we’ll perform simple analysis on the arrival delays for the two airlines.

Load Packages

As always, let’s start off by loading the necessary packages. Just the tidyverse, dplyr, and scales packages will be needed for this one.

library(tidyverse)
library(dplyr)
library(scales)

Load Data

Now let’s load the data from a .CSV I created and stored on GitHub.

url <- url('https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/flight_delays.csv')
flight_delays <- read.csv(url)

flight_delays
##         X     X.1 Los.Angeles Pheonix San.Diego San.Fransisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Remove Empty Row

As we can see, the table contains an empty row that needs to be removed. Let’s go ahead and do that using the complete.cases() function.

flight_delays <- flight_delays[complete.cases(flight_delays),]

flight_delays
##         X     X.1 Los.Angeles Pheonix San.Diego San.Fransisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Rename Columns and Pivot Table

Next, let’s address the two unnamed columns and call them carrier and status, respectively. Afterwards, we will go ahead and pivot the data longer in order to transform the destination headers into one column alongside their count and trip status.

flight_delays <- flight_delays %>% 
  rename(carrier = X, status = X.1)

flight_delays <- flight_delays %>% 
  pivot_longer(cols = c(3:7),
               names_to = 'destination',
               values_to = 'count')

flight_delays
## # A tibble: 20 × 4
##    carrier   status  destination   count
##    <chr>     <chr>   <chr>         <int>
##  1 "ALASKA"  on time Los.Angeles     497
##  2 "ALASKA"  on time Pheonix         221
##  3 "ALASKA"  on time San.Diego       212
##  4 "ALASKA"  on time San.Fransisco   503
##  5 "ALASKA"  on time Seattle        1841
##  6 ""        delayed Los.Angeles      62
##  7 ""        delayed Pheonix          12
##  8 ""        delayed San.Diego        20
##  9 ""        delayed San.Fransisco   102
## 10 ""        delayed Seattle         305
## 11 "AM WEST" on time Los.Angeles     694
## 12 "AM WEST" on time Pheonix        4840
## 13 "AM WEST" on time San.Diego       383
## 14 "AM WEST" on time San.Fransisco   320
## 15 "AM WEST" on time Seattle         201
## 16 ""        delayed Los.Angeles     117
## 17 ""        delayed Pheonix         415
## 18 ""        delayed San.Diego        65
## 19 ""        delayed San.Fransisco   129
## 20 ""        delayed Seattle          61

Fill Carriers and Adjust Destinations

Now we will fill in the remaining cells in the carrier field, but in order to do so we first need to insert values that can be recognized and replaced by the fill() function. Let us also replace the periods (.) that are contained within some of the values in the destination field. Once done, we will have our table, clean and tidy for analysis.

flight_delays$carrier[flight_delays$carrier == ""] <- NA
flight_delays <- flight_delays %>% 
  fill(carrier)

flight_delays$destination <- str_replace(flight_delays$destination, "\\.", " ")

flight_delays
## # A tibble: 20 × 4
##    carrier status  destination   count
##    <chr>   <chr>   <chr>         <int>
##  1 ALASKA  on time Los Angeles     497
##  2 ALASKA  on time Pheonix         221
##  3 ALASKA  on time San Diego       212
##  4 ALASKA  on time San Fransisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los Angeles      62
##  7 ALASKA  delayed Pheonix          12
##  8 ALASKA  delayed San Diego        20
##  9 ALASKA  delayed San Fransisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los Angeles     694
## 12 AM WEST on time Pheonix        4840
## 13 AM WEST on time San Diego       383
## 14 AM WEST on time San Fransisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los Angeles     117
## 17 AM WEST delayed Pheonix         415
## 18 AM WEST delayed San Diego        65
## 19 AM WEST delayed San Fransisco   129
## 20 AM WEST delayed Seattle          61

Summarize and Plot Data

As mentioned before, we will perform a simple analysis comparing the arrival delays for the two airlines. Let’s go ahead and create tibble grouping and summarizing our data by carrier and status. We will also create a plot to go right alongside it. As we can see, although the total flights for each airline differs by quite a bit, their proportions for ‘on time’ and ‘delayed’ flights are actually very similar.

flight_delays_summary <- flight_delays %>%
  group_by(carrier,status) %>%
  summarise(
    'total' = sum(count)
  ) %>% 
  mutate(percentage=percent((total/sum(total)),accuracy=0.1)) %>% 
  arrange(carrier,desc(status))

flight_delays_summary
## # A tibble: 4 × 4
## # Groups:   carrier [2]
##   carrier status  total percentage
##   <chr>   <chr>   <int> <chr>     
## 1 ALASKA  on time  3274 86.7%     
## 2 ALASKA  delayed   501 13.3%     
## 3 AM WEST on time  6438 89.1%     
## 4 AM WEST delayed   787 10.9%
ggplot(flight_delays_summary,aes(x=carrier,y=total,fill=status)) +
  geom_col() +
  geom_text(aes(label = paste(total,"(",percentage,")"),vjust=-0.75)) +
  scale_fill_discrete(breaks=c('on time','delayed'))

Conclusions / Findings and Recommendations

This assignment was great practice for cleaning and tidying really messy data. It was definitely helpful planning and visualizing the changes I wanted to make before actually making them. When taking time to think about the approach, one knows when it is best to clean, restructure, transform, and finally tidy the data. Thankfully, we had the option of recreating the chart rather than trying to pull it from the image, which could have proven much more challenging. Either way, much was learned throughout the assignment.