library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Loading the CSV
flights_csv <- read.csv("https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/Untidy_Flight_Table.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## 'https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/Untidy_Flight_Table.csv'
head(flights_csv)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 Delayed 117 415 65 129 61
There are some blank values in the original csv headers and rows. I first want to label those
# Renaming the columns
flights_csv <- flights_csv |>
rename(
Airline = X,
Flight_Status = X.1
)
head(flights_csv)
## Airline Flight_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 Delayed 117 415 65 129 61
There were also blank values in the Airline column. These values are not truly missing, we can infer which airline the corresponding Delayed values belong to.
flights_csv[2,"Airline"] <- "Alaska"
flights_csv[4,"Airline"] <- "AM West"
head(flights_csv)
## Airline Flight_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska On Time 497 221 212 503 1841
## 2 Alaska Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 AM West Delayed 117 415 65 129 61
The different destinations can be considered a variable “Destination” and placed into one column. The resulting dataframe is tidy in that each variable is a column, each row is an observation and each cell is a single value.
flights_tidy <- flights_csv |>
pivot_longer(
cols = !(Airline:Flight_Status),
names_to = "Destination",
values_to = "Count"
)
head(flights_tidy)
## # A tibble: 6 × 4
## Airline Flight_Status Destination Count
## <chr> <chr> <chr> <int>
## 1 Alaska On Time Los.Angeles 497
## 2 Alaska On Time Phoenix 221
## 3 Alaska On Time San.Diego 212
## 4 Alaska On Time San.Francisco 503
## 5 Alaska On Time Seattle 1841
## 6 Alaska Delayed Los.Angeles 62
If I want to treat the flight status of on time and delayed as separate variables I can make it wider, this is something I may want to use later on to view some data easier. I’ll save it as a new dataframe
flights_wide <- flights_tidy |>
pivot_wider(
names_from = Flight_Status,
values_from = Count
) |>
rename(
On_Time = `On Time`
)
head(flights_wide)
## # A tibble: 6 × 4
## Airline Destination On_Time Delayed
## <chr> <chr> <int> <int>
## 1 Alaska Los.Angeles 497 62
## 2 Alaska Phoenix 221 12
## 3 Alaska San.Diego 212 20
## 4 Alaska San.Francisco 503 102
## 5 Alaska Seattle 1841 305
## 6 AM West Los.Angeles 694 117
Below we can see some general trends of delayed and on time flights with relative frequency across the two airlines and different destinations. We can see that generally flights to San Francisco have relatively more delayed flights and flights to Phoenix have less delayed flights.
ggplot(flights_tidy,
aes(x = Destination,
y = Count,
fill = Flight_Status)) +
geom_bar(stat = 'identity',
position = 'fill') +
facet_wrap(~Airline,
ncol=1)
We can get a better idea by computing the percentage of delayed flights. It may be visually easier to use the wider dataframe from earlier.
flights_wide <- flights_wide |>
mutate(Perc_Delayed = round((Delayed / (On_Time + Delayed)) * 100, 2))
print(flights_wide)
## # A tibble: 10 × 5
## Airline Destination On_Time Delayed Perc_Delayed
## <chr> <chr> <int> <int> <dbl>
## 1 Alaska Los.Angeles 497 62 11.1
## 2 Alaska Phoenix 221 12 5.15
## 3 Alaska San.Diego 212 20 8.62
## 4 Alaska San.Francisco 503 102 16.9
## 5 Alaska Seattle 1841 305 14.2
## 6 AM West Los.Angeles 694 117 14.4
## 7 AM West Phoenix 4840 415 7.9
## 8 AM West San.Diego 383 65 14.5
## 9 AM West San.Francisco 320 129 28.7
## 10 AM West Seattle 201 61 23.3
From this we can see that the destination with the highest percentage of delays is San Francisco for both airlines. However, between the two AM West has a higher percentage of delayed flights to San Francisco.
The destination with the lowest percentage of delays is also the same for both airlines, which is Phoneix but AM West again has the higher percentage of delayed flights to Phoenix between the two.
We can try to view the delays by airline as a whole
airline_summary <- flights_tidy |>
group_by(Airline) |>
summarise(
Nbr_OnTime = sum(Count[Flight_Status == "On Time"]),
Nbr_Delayed = sum(Count[Flight_Status == "Delayed"]),
Total_Flights = sum(Count)
) |>
mutate(Perc_Delayed = round((Nbr_Delayed / Total_Flights) * 100,2))
print(airline_summary)
## # A tibble: 2 × 5
## Airline Nbr_OnTime Nbr_Delayed Total_Flights Perc_Delayed
## <chr> <int> <int> <int> <dbl>
## 1 AM West 6438 787 7225 10.9
## 2 Alaska 3274 501 3775 13.3
Alaska airlines has a slightly higher overall percentage of delayed flights, but a lower overall number of flights in total.