In this assigment, we were tasked with importing a messy (un-tidy) dataset, performing tiddying operations, and performing some basic analysis on our data.
First, I read the csv stored on my github page
# load messy dataframe from github page
messy_numbersense <- read_csv("https://raw.githubusercontent.com/mcastro64/d607-assignments/refs/heads/master/assignment-04/assigment4_data.csv")
## Rows: 2 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): airline
## dbl (10): On Time - Los Angeles, On Time - Phoenix, On Time - San Diego, On ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
messy_numbersense |>
gt() |>
tab_spanner(
label = 'On Time',
columns = matches("On Time*")
) |>
tab_spanner(
label = 'Delayed',
columns = matches("Delayed*")
)
| airline | On Time | Delayed | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| On Time - Los Angeles | On Time - Phoenix | On Time - San Diego | On Time - San Francisco | On Time - Seattle | Delayed - Los Angeles | Delayed - Phoenix | Delayed - San Diego | Delayed - San Francisco | Delayed - Seattle | |
| ALASKA | 497 | 221 | 212 | 503 | 1841 | 62 | 12 | 20 | 102 | 305 |
| AM WEST | 694 | 4840 | 383 | 320 | 201 | 117 | 415 | 65 | 129 | 61 |
Observations for both on time and delayed flight status for each destination are written along the columns of the dataset, making it difficult to analyze. Our next step is to pivot the table from wide to long, breaking up the columns to individual rows for flight status and destination using the names_to parameter to indicate which columns we want to create based on column names. The names_to parameter was set to the delimitor ” - ” in order for dplyr to know how to assign the values to the status and destination columns. Finally, the parament value_to was used to pass the value of the original column from the messy dataset to a new column called “flighs”.
numbersense_longer <- messy_numbersense |>
pivot_longer(
cols = !(airline),
names_to = c("status", "destination"),
names_sep = " - ",
values_to = "flights"
)
numbersense_longer |>
gt()
| airline | status | destination | flights |
|---|---|---|---|
| ALASKA | On Time | Los Angeles | 497 |
| ALASKA | On Time | Phoenix | 221 |
| ALASKA | On Time | San Diego | 212 |
| ALASKA | On Time | San Francisco | 503 |
| ALASKA | On Time | Seattle | 1841 |
| ALASKA | Delayed | Los Angeles | 62 |
| ALASKA | Delayed | Phoenix | 12 |
| ALASKA | Delayed | San Diego | 20 |
| ALASKA | Delayed | San Francisco | 102 |
| ALASKA | Delayed | Seattle | 305 |
| AM WEST | On Time | Los Angeles | 694 |
| AM WEST | On Time | Phoenix | 4840 |
| AM WEST | On Time | San Diego | 383 |
| AM WEST | On Time | San Francisco | 320 |
| AM WEST | On Time | Seattle | 201 |
| AM WEST | Delayed | Los Angeles | 117 |
| AM WEST | Delayed | Phoenix | 415 |
| AM WEST | Delayed | San Diego | 65 |
| AM WEST | Delayed | San Francisco | 129 |
| AM WEST | Delayed | Seattle | 61 |
Next, I wanted to be able to calculate the percentage of flights that were delayed by airline destination. To do this, I first calculated the total number of flights per airline destination using group_by and summarise to perform a sum calculation across all flights for every airline going to a specific destination, iregardless of status (on-time/delayed).
# calculate total number of flights per airline destination
total_flights_per_airline_destination <- numbersense_longer |>
group_by(airline, destination) |>
summarise(
total_flights = sum(flights)
)
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
total_flights_per_airline_destination
## # A tibble: 10 × 3
## # Groups: airline [2]
## airline destination total_flights
## <chr> <chr> <dbl>
## 1 ALASKA Los Angeles 559
## 2 ALASKA Phoenix 233
## 3 ALASKA San Diego 232
## 4 ALASKA San Francisco 605
## 5 ALASKA Seattle 2146
## 6 AM WEST Los Angeles 811
## 7 AM WEST Phoenix 5255
## 8 AM WEST San Diego 448
## 9 AM WEST San Francisco 449
## 10 AM WEST Seattle 262
Next, I used a left_join to combine our dataframe in long form and our dataframe with total flights per airline destination. As we are focusing on delayed flights, I used the filter function on the status column to limit our results to flights with a “Delayed” status.
# join long flights table and total flights table
# then calculate frequence of delays
delayed_flights_observations <- numbersense_longer |>
left_join(total_flights_per_airline_destination, join_by(airline == airline, destination == destination)) |>
filter(status == "Delayed") |>
rename(delayed_flights = flights)
delayed_flights_observations |>
gt() |>
tab_header("Delayed Flights by Airline Destination")
| Delayed Flights by Airline Destination | ||||
| airline | status | destination | delayed_flights | total_flights |
|---|---|---|---|---|
| ALASKA | Delayed | Los Angeles | 62 | 559 |
| ALASKA | Delayed | Phoenix | 12 | 233 |
| ALASKA | Delayed | San Diego | 20 | 232 |
| ALASKA | Delayed | San Francisco | 102 | 605 |
| ALASKA | Delayed | Seattle | 305 | 2146 |
| AM WEST | Delayed | Los Angeles | 117 | 811 |
| AM WEST | Delayed | Phoenix | 415 | 5255 |
| AM WEST | Delayed | San Diego | 65 | 448 |
| AM WEST | Delayed | San Francisco | 129 | 449 |
| AM WEST | Delayed | Seattle | 61 | 262 |
I wanted to get a sense of what percentage of flights were delayed overall per airline, regardless of destination. Working with the dataframe output from step 2, I again used group_by and summarize to calculate the number of delayed flights per airline, the total number of flights per airline and the percentage of flights delayed by airline.
summary_delayed_flights_by_airline <- delayed_flights_observations |>
group_by(airline) |>
summarize(
delayed_flights_by_airline = sum(delayed_flights),
total_flights_by_airline = sum(total_flights),
percent_delayed_by_airline = round(delayed_flights_by_airline / total_flights_by_airline, 3)
)
summary_delayed_flights_by_airline |>
gt() |>
fmt_percent(percent_delayed_by_airline) |>
tab_header("Overall Frequency of Delayed Flights by Airline") |>
cols_label(
airline = "Airlane",
delayed_flights_by_airline = "# Flights Delayed",
total_flights_by_airline = "Total Flights",
percent_delayed_by_airline = "% Flights Delayed"
)
| Overall Frequency of Delayed Flights by Airline | |||
| Airlane | # Flights Delayed | Total Flights | % Flights Delayed |
|---|---|---|---|
| ALASKA | 501 | 3775 | 13.30% |
| AM WEST | 787 | 7225 | 10.90% |
Next, I performed the same calculation on our delayed_flights_observations dataframe to calculate the number of delayed flights per airline destination, the total number of flights per airline destination and the percentage of flights delayed by airline destination.
delayed_flights_by_airline_destination <- delayed_flights_observations |>
mutate(percent_delayed = delayed_flights/total_flights) |>
subset(select = c(airline, destination, delayed_flights, percent_delayed))
delayed_flights_by_airline_destination |>
gt() |>
fmt_percent(percent_delayed) |>
tab_header("Frequency of Delayed Flights by Airline Destination") |>
cols_label(
airline = "Airlane",
delayed_flights = "# Flights Delayed",
percent_delayed = "% Flights Delayed"
)
| Frequency of Delayed Flights by Airline Destination | |||
| Airlane | destination | # Flights Delayed | % Flights Delayed |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 11.09% |
| ALASKA | Phoenix | 12 | 5.15% |
| ALASKA | San Diego | 20 | 8.62% |
| ALASKA | San Francisco | 102 | 16.86% |
| ALASKA | Seattle | 305 | 14.21% |
| AM WEST | Los Angeles | 117 | 14.43% |
| AM WEST | Phoenix | 415 | 7.90% |
| AM WEST | San Diego | 65 | 14.51% |
| AM WEST | San Francisco | 129 | 28.73% |
| AM WEST | Seattle | 61 | 23.28% |
Using ggplot, I created a simple bar chart plotting airlines and the count of delayed flights. I used the parameter position=“dodge” to transform the result from a stacked bar chart to a individual bars for each destination along the x-axis.
delayed_flights_count_chart <- ggplot(delayed_flights_by_airline_destination, aes(x=airline, y=delayed_flights, fill=destination)) +
geom_bar(
stat='identity',
position = position_dodge(.85),
width = .7
) +
labs(
title = "Counts of Airline Flights Delayed per Destination",
x = "Airline",
y = "Number of Flights Delayed"
) +
geom_text(
aes(label = delayed_flights),
vjust = -0.5,
size = 7,
size.unit = "pt",
position = position_dodge(.7)
) +
theme(
plot.title = element_text(size = 10),
legend.position = "bottom",
aspect.ratio = 4/5
) +
ylim(0, 450)
I also created a simple bar chart plotting airlines and the percentage of delayed flights. I used the parameter position=“dodge” to transform the result from a stacked bar chart to a individual bars for each destination along the x-axis.
delayed_flights_freq_chart <- ggplot(delayed_flights_by_airline_destination, aes(x=airline, y=percent_delayed, fill=destination)) +
geom_bar(
stat = 'identity',
position=position_dodge(.85),
width = 0.7
) +
labs(
title = "% of Airline Flights Delayed per Destination",
x = "Airline",
y = "Percent of Flights Delayed"
) +
scale_y_continuous(labels = scales::percent) +
geom_text(
aes(label = paste(round(percent_delayed* 100, 1), "%")),
vjust = -0.5,
size = 7,
size.unit = "pt",
position = position_dodge(.85)
) +
theme(
plot.title = element_text(size = 10),
legend.position = "bottom",
aspect.ratio = 4/5
) +
ylim(0, .35)
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
I used ggpubr to plot them side-by-side for comparison.
ggarrange(delayed_flights_count_chart, delayed_flights_freq_chart, ncol=2, common.legend = TRUE, legend="bottom")
## Conclusion Examining raw counts, AM West airline’s flights to Phoenix
had the most delays of any group with 415 delayed arrivals. However,
when we compare percent of flights delayed, we see that AM West to
Phoenix actually had the second lowest percent of delayed flights
(7.9%), second only to Alaska Airline’s flight to Phoenix (5.2%).
Looking at the total number of flights per airline destination table
helps us confirm that AM West to Phoenix had the most flights of any
airline to any destination with 5255. Additionally, flights to Phoenix
from AM West and Alaska airlines were below their overall percent of
delayed flights (10.9% and 13.3% respectively).