For this assignment, we’ve been given an image of an untidy data set (shown below). Our task is to create a .CSV file that includes all of the information provided in the chart, read the information from the .CSV file into R, and use tidyr and dplyr to tidy and transform the data. Finally, we’ll perform simple analysis on the arrival delays for the two airlines.
As always, let’s start off by loading the necessary packages. Just
the tidyverse, dplyr, and scales
packages will be needed for this one.
library(tidyverse)
library(dplyr)
library(scales)
Now let’s load the data from a .CSV I created and stored on GitHub.
url <- url('https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/flight_delays.csv')
flight_delays <- read.csv(url)
flight_delays
## X X.1 Los.Angeles Pheonix San.Diego San.Fransisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
As we can see, the table contains an empty row that needs to be
removed. Let’s go ahead and do that using the
complete.cases() function.
flight_delays <- flight_delays[complete.cases(flight_delays),]
flight_delays
## X X.1 Los.Angeles Pheonix San.Diego San.Fransisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Next, let’s address the two unnamed columns and call them
carrier and status, respectively. Afterwards,
we will go ahead and pivot the data longer in order to transform the
destination headers into one column alongside their count and trip
status.
flight_delays <- flight_delays %>%
rename(carrier = X, status = X.1)
flight_delays <- flight_delays %>%
pivot_longer(cols = c(3:7),
names_to = 'destination',
values_to = 'count')
flight_delays
## # A tibble: 20 × 4
## carrier status destination count
## <chr> <chr> <chr> <int>
## 1 "ALASKA" on time Los.Angeles 497
## 2 "ALASKA" on time Pheonix 221
## 3 "ALASKA" on time San.Diego 212
## 4 "ALASKA" on time San.Fransisco 503
## 5 "ALASKA" on time Seattle 1841
## 6 "" delayed Los.Angeles 62
## 7 "" delayed Pheonix 12
## 8 "" delayed San.Diego 20
## 9 "" delayed San.Fransisco 102
## 10 "" delayed Seattle 305
## 11 "AM WEST" on time Los.Angeles 694
## 12 "AM WEST" on time Pheonix 4840
## 13 "AM WEST" on time San.Diego 383
## 14 "AM WEST" on time San.Fransisco 320
## 15 "AM WEST" on time Seattle 201
## 16 "" delayed Los.Angeles 117
## 17 "" delayed Pheonix 415
## 18 "" delayed San.Diego 65
## 19 "" delayed San.Fransisco 129
## 20 "" delayed Seattle 61
Now we will fill in the remaining cells in the carrier
field, but in order to do so we first need to insert values that can be
recognized and replaced by the fill() function. Let us also
replace the periods (.) that are contained within some of the values in
the destination field. Once done, we will have our table,
clean and tidy for analysis.
flight_delays$carrier[flight_delays$carrier == ""] <- NA
flight_delays <- flight_delays %>%
fill(carrier)
flight_delays$destination <- str_replace(flight_delays$destination, "\\.", " ")
flight_delays
## # A tibble: 20 × 4
## carrier status destination count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Pheonix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Fransisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Pheonix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Fransisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Pheonix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Fransisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Pheonix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Fransisco 129
## 20 AM WEST delayed Seattle 61
As mentioned before, we will perform a simple analysis comparing the
arrival delays for the two airlines. Let’s go ahead and create tibble
grouping and summarizing our data by carrier and
status. We will also create a plot to go right alongside
it. As we can see, although the total flights for each airline differs
by quite a bit, their proportions for ‘on time’ and ‘delayed’ flights
are actually very similar.
flight_delays_summary <- flight_delays %>%
group_by(carrier,status) %>%
summarise(
'total' = sum(count)
) %>%
mutate(percentage=percent((total/sum(total)),accuracy=0.1)) %>%
arrange(carrier,desc(status))
flight_delays_summary
## # A tibble: 4 × 4
## # Groups: carrier [2]
## carrier status total percentage
## <chr> <chr> <int> <chr>
## 1 ALASKA on time 3274 86.7%
## 2 ALASKA delayed 501 13.3%
## 3 AM WEST on time 6438 89.1%
## 4 AM WEST delayed 787 10.9%
ggplot(flight_delays_summary,aes(x=carrier,y=total,fill=status)) +
geom_col() +
geom_text(aes(label = paste(total,"(",percentage,")"),vjust=-0.75)) +
scale_fill_discrete(breaks=c('on time','delayed'))
This assignment was great practice for cleaning and tidying really messy data. It was definitely helpful planning and visualizing the changes I wanted to make before actually making them. When taking time to think about the approach, one knows when it is best to clean, restructure, transform, and finally tidy the data. Thankfully, we had the option of recreating the chart rather than trying to pull it from the image, which could have proven much more challenging. Either way, much was learned throughout the assignment.