The data we are going to look at today is a small sample of arrival delays from two airlines across five destinations [Source: Numbersense, Kaiser Fung, McGraw Hill, 2013]. A csv file and the rmd file used to create this paper can be found on one of my personal github repositories.
Loading data and libraries
Here we will use the tidyverse for ease of manipulation and the RCurl package for retrieving the data directly from github.com.
library(tidyverse)
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Week%205/arrival_delays.csv")
arrival_delays <- read.csv(text = x, na.strings = c("NA",""))
arrival_delays
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1,841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA <NA> NA NA <NA>
## 4 AM WEST on time 694 4,840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Tidy and Transform Data
Now we’ll use tidyr and dplyr to clean up the data as needed. It
should be noted that fill()
requires column values to be NA
before it can transform them. This issue was solved by adding the
na.strings
argument when we read in the csv.
arr_delays_tidy <- arrival_delays |>
rename(airline = X, status = X.1, Los_Angeles = Los.Angeles, # rename columns
San_Diego = San.Diego, San_Francisco = San.Francisco) |>
slice(-3) |> # remove empty row
fill(airline, .direction = "down") |> # fill missing values
mutate(Phoenix = as.numeric(gsub(",", "", Phoenix)), # remove comma from numbers and transform to numeric
Seattle = as.numeric(gsub(",", "", Seattle))) |>
pivot_longer(cols = 3:7, names_to = "destination", values_to = "count") |> # transpose destinations
pivot_wider(names_from = status, values_from = count) |> # move count values to new columns based on status
rename(on_time = `on time`) |>
mutate(total = on_time + delayed, # create new columns for analysis
proportion_on_time = round(on_time/total,4),
proportion_delayed = round(delayed/total,4)) |>
arrange(destination)
arr_delays_tidy
## # A tibble: 10 × 7
## airline destination on_time delayed total proportion_on_time
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA Los_Angeles 497 62 559 0.889
## 2 AM WEST Los_Angeles 694 117 811 0.856
## 3 ALASKA Phoenix 221 12 233 0.948
## 4 AM WEST Phoenix 4840 415 5255 0.921
## 5 ALASKA San_Diego 212 20 232 0.914
## 6 AM WEST San_Diego 383 65 448 0.855
## 7 ALASKA San_Francisco 503 102 605 0.831
## 8 AM WEST San_Francisco 320 129 449 0.713
## 9 ALASKA Seattle 1841 305 2146 0.858
## 10 AM WEST Seattle 201 61 262 0.767
## # ℹ 1 more variable: proportion_delayed <dbl>
We can see that we started by renaming columns, filling in missing
data, and removing an empty row. Next converted count values to numeric
by first removing commas. We then reshaped the data by transposing
columns into one called destination
. We took the values and
spread them into two columns so we could create new columns for
analysis. The new columns are a sum of the on time and delayed counts to
then get a proportion of on time and delayed flights by airline. I
arranged the data by destination to compare quickly which airline has
proportionally more on time flights. For each destination it can be seen
that ALASKA airline has proportionally more on time flights than AM
WEST.
Conclusion
In conclusion, we can see that dplyr has a number of functions to transform the data and tidyr functions are great for reshaping the data. Afterwards, we used dplyr to add new columns for analysis and concluded that ALASKA airline has a better proportion of on time flights than AM WEST for each destination in the dataset.