Untidy Airport

Reader

We were provided with a chart detailing arrivals and delays for two airlines across five destinations. Since the chart is in a wide format, the first step is to import the data. We obtain the data from GitHub and process it using the readr package. Because the values are not separated by commas, we utilize a specific delimiter during the import to ensure that each destination is allocated its own column, rather than being merged into a single column.

knitr::opts_chunk$set(echo = TRUE)
#install.packages(c("readr", "dplyr", "tidyr"))
library(readr); library(dplyr); library(tidyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#import dataset from Github, had to use delimiter because not comma seperated 
wide_airport <- wide_airport <- read_delim("https://raw.githubusercontent.com/tiffhugh/DATA71200/refs/heads/main/wide_airport.csv", delim = "\t")

## New names:
## • `` -> `...1`
## • `` -> `...2`

## Rows: 4 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): ...1, ...2
## dbl (5): Los Angeles, Phoenix, San Diego, San Francisco, Seattle
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(wide_airport)

Tidy & Dplyr

While the data has been successfully imported, several significant issues must be addressed before we can proceed with any analysis. The dataset is untidy, lacking column names for “airline” and “status,” and contains empty cells. By utilizing the tidy package, we can make necessary adjustments to enhance the data’s quality and structure.

knitr::opts_chunk$set(echo = TRUE)
# Fix columns include airline and status 
colnames(wide_airport)[1] <- "airline"
colnames(wide_airport)[2] <- "status"
# Name empty cell to the correct airline 
wide_airport <- wide_airport %>%
  mutate(airline = ifelse(airline == "", NA, airline)) %>%  
  fill(airline, .direction = "down") %>%  
  mutate(airline = replace(airline, c(2, 4), c("ALASKA", "AM WEST")))
wide_airport

## # A tibble: 4 × 7
##   airline status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 ALASKA  delayed            62      12          20             102     305
## 3 AM WEST on time           694    4840         383             320     201
## 4 AM WEST delayed           117     415          65             129      61

Now that our table is complete, the next step is using pivot_longer() to transform our dataset from a wide format to a long format, allowing us to have a more organized structure where each destination is represented in a separate row along with its corresponding arrival status and frequency.

knitr::opts_chunk$set(echo = TRUE)
long_airport <- wide_airport %>%
  pivot_longer(
    cols = -c(airline, status), 
    names_to = 'destination',     
    values_to = 'frequency'       
  )
head(long_airport)

## # A tibble: 6 × 4
##   airline status  destination   frequency
##   <chr>   <chr>   <chr>             <dbl>
## 1 ALASKA  on time Los Angeles         497
## 2 ALASKA  on time Phoenix             221
## 3 ALASKA  on time San Diego           212
## 4 ALASKA  on time San Francisco       503
## 5 ALASKA  on time Seattle            1841
## 6 ALASKA  delayed Los Angeles          62

Analyze

The data is in a long format so analysis can be done.

knitr::opts_chunk$set(echo = TRUE)
# summary of airline flights 
summary_stats <- long_airport %>%
  group_by(airline, status) %>%
  summarise(total_frequency = sum(frequency, na.rm = TRUE), .groups = 'drop')
print(summary_stats)

## # A tibble: 4 × 3
##   airline status  total_frequency
##   <chr>   <chr>             <dbl>
## 1 ALASKA  delayed             501
## 2 ALASKA  on time            3274
## 3 AM WEST delayed             787
## 4 AM WEST on time            6438

Analyzing the data on airline performance, Alaska Airlines recorded a total of 3,775 flights, with 3,274 on-time arrivals and 501 delays. In comparison, West AM had a higher volume of flights at 7,225, with 6,438 arriving on time and 787 delayed. While this summary provides an overview of flight performance, it lacks depth, so we will now examine the data by destination for further insights.

knitr::opts_chunk$set(echo = TRUE)

destination_percentage <- long_airport %>%
  group_by(destination, status) %>%
  summarise(total_flights = sum(frequency, na.rm = TRUE), .groups = 'drop') %>%
  group_by(destination) %>%
  mutate(percentage = total_flights / sum(total_flights) * 100)  # Calculate percentage

# View the percentage comparison
print(destination_percentage)

## # A tibble: 10 × 4
## # Groups:   destination [5]
##    destination   status  total_flights percentage
##    <chr>         <chr>           <dbl>      <dbl>
##  1 Los Angeles   delayed           179      13.1 
##  2 Los Angeles   on time          1191      86.9 
##  3 Phoenix       delayed           427       7.78
##  4 Phoenix       on time          5061      92.2 
##  5 San Diego     delayed            85      12.5 
##  6 San Diego     on time           595      87.5 
##  7 San Francisco delayed           231      21.9 
##  8 San Francisco on time           823      78.1 
##  9 Seattle       delayed           366      15.2 
## 10 Seattle       on time          2042      84.8

#install.packages('ggplot2')
library(ggplot2)
# Create a bar plot comparing on-time and delayed flights by destination with pink and purple colors
ggplot(destination_percentage, aes(x = destination, y = total_flights, fill = status)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Comparison of On-Time and Delayed Flights by Destination",
       x = "Destination",
       y = "Number of Flights") +
  scale_fill_manual(values = c("on time" = "pink", "delayed" = "purple")) +  # Set colors for the fill
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Adjust x-axis text for readability

Overall, the data shows that on-time flights significantly outnumber delayed ones across all destinations. Phoenix stands out with an impressive on-time performance, as 5,061 out of 5,488 total flights (92.22%) arrived on time, while only 427 flights (about 7.78%) experienced delays. San Diego mirrored this trend, with 595 on-time flights (87.5%) out of 680 total flights, resulting in 85 delays (12.5%). However, San Francisco experienced a higher percentage of delayed flights, with 231 out of 1,054 total flights (approximately 21.92%) being delayed, suggesting potential issues with punctuality. In contrast, Seattle had 366 delays (15.20%) out of 2,408 flights, indicating a slightly higher delay rate than Los Angeles and Phoenix but still maintaining a relatively reliable schedule.

Reference Grolemund, Garrett, and Hadley Wickham. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, 2017.

Untidy Airport

2024-09-28

Reader

Tidy & Dplyr

Analyze