names that can be referred to when subsetting dataframes( no whitespaces between names or no invalid characters) and filling company column with appropriate values
untid <- read_csv("untidy.csv") |>
janitor::clean_names() |>
rename(company = 1, ood = 2) |>
drop_na(ood) |> mutate(company = ifelse(is.na(company), lag(company), company))
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (3): Los Angeles, San Diego, San Francisco num (2):
## Phoenix, Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
kable(untid)
| company | ood | los_angeles | phoenix | san_diego | san_francisco | seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
States are listed as columns holding the amount of delays and on time flights they had however this not a appropriate way of handling analysis in vectorized programming. We would to have them as observations instead of variables.
untid <- untid |>
pivot_longer(cols = los_angeles:seattle, names_to = "territory", values_to = "arrival")
kable(untid)
| company | ood | territory | arrival |
|---|---|---|---|
| ALASKA | on time | los_angeles | 497 |
| ALASKA | on time | phoenix | 221 |
| ALASKA | on time | san_diego | 212 |
| ALASKA | on time | san_francisco | 503 |
| ALASKA | on time | seattle | 1841 |
| ALASKA | delayed | los_angeles | 62 |
| ALASKA | delayed | phoenix | 12 |
| ALASKA | delayed | san_diego | 20 |
| ALASKA | delayed | san_francisco | 102 |
| ALASKA | delayed | seattle | 305 |
| AM WEST | on time | los_angeles | 694 |
| AM WEST | on time | phoenix | 4840 |
| AM WEST | on time | san_diego | 383 |
| AM WEST | on time | san_francisco | 320 |
| AM WEST | on time | seattle | 201 |
| AM WEST | delayed | los_angeles | 117 |
| AM WEST | delayed | phoenix | 415 |
| AM WEST | delayed | san_diego | 65 |
| AM WEST | delayed | san_francisco | 129 |
| AM WEST | delayed | seattle | 61 |
On time and delayed are two observations that are common in almost every row, and are directly linked to how often it occured so we would like to make them variables and making their number occurences an observation.
untid <- untid |>
pivot_wider( names_from = ood, values_from = arrival) |>
mutate(territory = str_replace(territory, "_", " ")) |>
mutate(territory = str_to_title(territory))
kable(untid)
| company | territory | on time | delayed |
|---|---|---|---|
| ALASKA | Los Angeles | 497 | 62 |
| ALASKA | Phoenix | 221 | 12 |
| ALASKA | San Diego | 212 | 20 |
| ALASKA | San Francisco | 503 | 102 |
| ALASKA | Seattle | 1841 | 305 |
| AM WEST | Los Angeles | 694 | 117 |
| AM WEST | Phoenix | 4840 | 415 |
| AM WEST | San Diego | 383 | 65 |
| AM WEST | San Francisco | 320 | 129 |
| AM WEST | Seattle | 201 | 61 |