Overview

Given an untidy dataset, I have been tasked with cleaning it up in order to perform an analysis. This dataset include flight information from two different airlines flying out of five different cities. The data also breaks down if the flights were delayed or on time. We will look at the arrival delays between the two airlines

Getting Started

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
untidy_flights <- read.csv("https://raw.githubusercontent.com/scrummett/DATA607/refs/heads/main/UntidyFlights.csv")

Here we have loaded the proper packages, and saved the data provided to us (that I also stored on github) in a table. From here we need to clean up the data.

Tidying the Data

untidy_flights <- untidy_flights |> 
  rename("Airline" = "X",
         "Status" = "X.1")
untidy_flights <- untidy_flights |> 
  rename_with(~ gsub(".", "_", .x, fixed = TRUE))

Here we are renaming the first two columns with proper labels, “Airline” and “Status”. Additionally, we are replacing the “.” separating words with “_“, however we will shortly change it again.

untidy_flights <- untidy_flights |> 
  filter(!is.na(Los_Angeles))

Here we are removing the rows with no data in them by filtering out any row that doesn’t have a value for “Los_Angeles”.

untidy_flights <- untidy_flights |> 
  mutate(Airline = na_if(Airline, "")) |> 
  fill(Airline)

We need to replace the missing values in “Airline”. Here we are taking the value that came before the missing data and filling it into that empty spot.

untidy_flights <- untidy_flights |> 
  mutate(Seattle = as.numeric(gsub(",", "", Seattle)))

untidy_flights |> 
  mutate(Seattle = as.numeric(Seattle))
##   Airline  Status Los_Angeles Phoenix San_Diego San_Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

The value of the Alaskan flight from Seattle is currently entered as a character instead of a number due to the “,”. First, we can remove this, then make sure that the value is read as a number, followed by turning the entire column into a numerical column. We do this in order to create a single column of the number of flights, including those from other destinations.

untidy_flights <- untidy_flights |> 
  pivot_longer(
    cols = Los_Angeles:Seattle,
    names_to = "City",
    values_to = "n_Flights")

Here we have created a new column for all cities, as well as combined the columns that held information on total flights from these cities.

tidy_flights <- untidy_flights |> 
  mutate(City = gsub("_", " ", City))

Lastly, we replaced all “_” in the “City” column with spaces. Our data is now much tidier and ready for analysis.

Data Analysis First we can take a look at our data broken up by airline, delay status and city.

tidy_flights |> 
  ggplot(aes(x = City, y = n_Flights, fill = factor(Airline))) + 
  geom_bar(stat = "identity", position = "dodge") + 
  facet_wrap(~Status) +
  labs(y = "Total Flights", fill = "Airline") +
  ggtitle("Proportions of Delays by City") + 
  coord_flip()

Here we can quickly visualize where the most delays are happening and from which airlines they happen with. We can see a large number of on time flights coming out of Phoenix, seemingly disproportionate to the rest of the flights coming from other cities across both airlines.

flight_delays <- tidy_flights |> 
  group_by(Airline) |> 
  summarise(
    Total_Flights = sum(n_Flights),
    Delayed_Flights = sum(ifelse(Status == "delayed", n_Flights, 0)),
    OnTime_Flights = sum(ifelse(Status == "on time", n_Flights, 0)),
    Delayed_Percentage = (Delayed_Flights / Total_Flights) * 100
  )
flight_delays
## # A tibble: 2 × 5
##   Airline Total_Flights Delayed_Flights OnTime_Flights Delayed_Percentage
##   <chr>           <dbl>           <dbl>          <dbl>              <dbl>
## 1 ALASKA           3775             501           3274               13.3
## 2 AM WEST          7225             787           6438               10.9
flight_delays |> 
ggplot(aes(x = Airline, y = Delayed_Percentage, fill = Airline)) +
  geom_bar(stat = "identity") +
  labs(title = "Percentage of Delayed Flights by Airline", 
       x = "Airline", y = "Percentage of Delayed Flights")

Here we can see that while “ALASKA” doesn’t have as many total delayed flights as “AM WEST”, they have a higher delay percentage by nearly 3%. This could perhaps be influenced by the sheer number of on time flights from “AM WEST” coming out of Phoenix, dropping their delay percentage drastically.

flight_delays_phoenix <- tidy_flights |> 
  group_by(Airline) |> 
  filter(City != "Phoenix") |> 
  summarise(
    Total_Flights = sum(n_Flights),
    Delayed_Flights = sum(ifelse(Status == "delayed", n_Flights, 0)),
    OnTime_Flights = sum(ifelse(Status == "on time", n_Flights, 0)),
    Delayed_Percentage = (Delayed_Flights / Total_Flights) * 100
  )
flight_delays_phoenix
## # A tibble: 2 × 5
##   Airline Total_Flights Delayed_Flights OnTime_Flights Delayed_Percentage
##   <chr>           <dbl>           <dbl>          <dbl>              <dbl>
## 1 ALASKA           3542             489           3053               13.8
## 2 AM WEST          1970             372           1598               18.9
flight_delays_phoenix |> 
ggplot(aes(x = Airline, y = Delayed_Percentage, fill = Airline)) +
  geom_bar(stat = "identity") +
  labs(title = "Percentage of Delayed Flights by Airline", 
       x = "Airline", y = "Percentage of Delayed Flights")

Here we can see that when we exclude Phoenix, the percentage of delayed flights from “AM WEST” increases by 8%!

Conclusion

Data might come in many different forms, and you may not be able to anticipate the ways in which it needs to be tidied upon initial inspection. For instance, I did not anticipate needing to change “Seattle” to a numerical vector, but as I entered the data wrong, it became necessary to continue. Additionally, from this data we can glean that while “ALASKA” has a higher percentage of delays than “AM WEST”, we can narrow the data down a bit to see that across most cities “AM WEST” has a greater portion of their flights delayed.