Load With Clean Names

names that can be referred to when subsetting dataframes( no whitespaces between names or no invalid characters) and filling company column with appropriate values

untid <- read_csv("untidy.csv") |>
          janitor::clean_names() |> 
          rename(company = 1, ood = 2) |> 
          drop_na(ood) |> mutate(company = ifelse(is.na(company), lag(company), company))
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (3): Los Angeles, San Diego, San Francisco num (2):
## Phoenix, Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
kable(untid)
company ood los_angeles phoenix san_diego san_francisco seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Lengthen Data

States are listed as columns holding the amount of delays and on time flights they had however this not a appropriate way of handling analysis in vectorized programming. We would to have them as observations instead of variables.

untid <- untid |>
          pivot_longer(cols = los_angeles:seattle, names_to = "territory", values_to = "arrival")
kable(untid)
company ood territory arrival
ALASKA on time los_angeles 497
ALASKA on time phoenix 221
ALASKA on time san_diego 212
ALASKA on time san_francisco 503
ALASKA on time seattle 1841
ALASKA delayed los_angeles 62
ALASKA delayed phoenix 12
ALASKA delayed san_diego 20
ALASKA delayed san_francisco 102
ALASKA delayed seattle 305
AM WEST on time los_angeles 694
AM WEST on time phoenix 4840
AM WEST on time san_diego 383
AM WEST on time san_francisco 320
AM WEST on time seattle 201
AM WEST delayed los_angeles 117
AM WEST delayed phoenix 415
AM WEST delayed san_diego 65
AM WEST delayed san_francisco 129
AM WEST delayed seattle 61

Widen Data

On time and delayed are two observations that are common in almost every row, and are directly linked to how often it occured so we would like to make them variables and making their number occurences an observation.

untid <- untid |> 
         pivot_wider( names_from = ood, values_from = arrival) |>
         mutate(territory  = str_replace(territory, "_", " ")) |> 
         mutate(territory = str_to_title(territory))  
kable(untid)
company territory on time delayed
ALASKA Los Angeles 497 62
ALASKA Phoenix 221 12
ALASKA San Diego 212 20
ALASKA San Francisco 503 102
ALASKA Seattle 1841 305
AM WEST Los Angeles 694 117
AM WEST Phoenix 4840 415
AM WEST San Diego 383 65
AM WEST San Francisco 320 129
AM WEST Seattle 201 61