We were provided with a chart detailing arrivals and delays for two airlines across five destinations. Since the chart is in a wide format, the first step is to import the data. We obtain the data from GitHub and process it using the readr package. Because the values are not separated by commas, we utilize a specific delimiter during the import to ensure that each destination is allocated its own column, rather than being merged into a single column.
knitr::opts_chunk$set(echo = TRUE)
#install.packages(c("readr", "dplyr", "tidyr"))
library(readr); library(dplyr); library(tidyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#import dataset from Github, had to use delimiter because not comma seperated
wide_airport <- wide_airport <- read_delim("https://raw.githubusercontent.com/tiffhugh/DATA71200/refs/heads/main/wide_airport.csv", delim = "\t")
## New names:
## • `` -> `...1`
## • `` -> `...2`
## Rows: 4 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): ...1, ...2
## dbl (5): Los Angeles, Phoenix, San Diego, San Francisco, Seattle
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(wide_airport)
While the data has been successfully imported, several significant issues must be addressed before we can proceed with any analysis. The dataset is untidy, lacking column names for “airline” and “status,” and contains empty cells. By utilizing the tidy package, we can make necessary adjustments to enhance the data’s quality and structure.
knitr::opts_chunk$set(echo = TRUE)
# Fix columns include airline and status
colnames(wide_airport)[1] <- "airline"
colnames(wide_airport)[2] <- "status"
# Name empty cell to the correct airline
wide_airport <- wide_airport %>%
mutate(airline = ifelse(airline == "", NA, airline)) %>%
fill(airline, .direction = "down") %>%
mutate(airline = replace(airline, c(2, 4), c("ALASKA", "AM WEST")))
wide_airport
## # A tibble: 4 × 7
## airline status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
Now that our table is complete, the next step is using pivot_longer() to transform our dataset from a wide format to a long format, allowing us to have a more organized structure where each destination is represented in a separate row along with its corresponding arrival status and frequency.
knitr::opts_chunk$set(echo = TRUE)
long_airport <- wide_airport %>%
pivot_longer(
cols = -c(airline, status),
names_to = 'destination',
values_to = 'frequency'
)
head(long_airport)
## # A tibble: 6 × 4
## airline status destination frequency
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
The data is in a long format so analysis can be done.
knitr::opts_chunk$set(echo = TRUE)
# summary of airline flights
summary_stats <- long_airport %>%
group_by(airline, status) %>%
summarise(total_frequency = sum(frequency, na.rm = TRUE), .groups = 'drop')
print(summary_stats)
## # A tibble: 4 × 3
## airline status total_frequency
## <chr> <chr> <dbl>
## 1 ALASKA delayed 501
## 2 ALASKA on time 3274
## 3 AM WEST delayed 787
## 4 AM WEST on time 6438
Analyzing the data on airline performance, Alaska Airlines recorded a total of 3,775 flights, with 3,274 on-time arrivals and 501 delays. In comparison, West AM had a higher volume of flights at 7,225, with 6,438 arriving on time and 787 delayed. While this summary provides an overview of flight performance, it lacks depth, so we will now examine the data by destination for further insights.
knitr::opts_chunk$set(echo = TRUE)
destination_percentage <- long_airport %>%
group_by(destination, status) %>%
summarise(total_flights = sum(frequency, na.rm = TRUE), .groups = 'drop') %>%
group_by(destination) %>%
mutate(percentage = total_flights / sum(total_flights) * 100) # Calculate percentage
# View the percentage comparison
print(destination_percentage)
## # A tibble: 10 × 4
## # Groups: destination [5]
## destination status total_flights percentage
## <chr> <chr> <dbl> <dbl>
## 1 Los Angeles delayed 179 13.1
## 2 Los Angeles on time 1191 86.9
## 3 Phoenix delayed 427 7.78
## 4 Phoenix on time 5061 92.2
## 5 San Diego delayed 85 12.5
## 6 San Diego on time 595 87.5
## 7 San Francisco delayed 231 21.9
## 8 San Francisco on time 823 78.1
## 9 Seattle delayed 366 15.2
## 10 Seattle on time 2042 84.8
#install.packages('ggplot2')
library(ggplot2)
# Create a bar plot comparing on-time and delayed flights by destination with pink and purple colors
ggplot(destination_percentage, aes(x = destination, y = total_flights, fill = status)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Comparison of On-Time and Delayed Flights by Destination",
x = "Destination",
y = "Number of Flights") +
scale_fill_manual(values = c("on time" = "pink", "delayed" = "purple")) + # Set colors for the fill
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Adjust x-axis text for readability
Overall, the data shows that on-time flights significantly outnumber
delayed ones across all destinations. Phoenix stands out with an
impressive on-time performance, as 5,061 out of 5,488 total flights
(92.22%) arrived on time, while only 427 flights (about 7.78%)
experienced delays. San Diego mirrored this trend, with 595 on-time
flights (87.5%) out of 680 total flights, resulting in 85 delays
(12.5%). However, San Francisco experienced a higher percentage of
delayed flights, with 231 out of 1,054 total flights (approximately
21.92%) being delayed, suggesting potential issues with punctuality. In
contrast, Seattle had 366 delays (15.20%) out of 2,408 flights,
indicating a slightly higher delay rate than Los Angeles and Phoenix but
still maintaining a relatively reliable schedule.
Reference Grolemund, Garrett, and Hadley Wickham. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, 2017.