For me, opening a spreadsheet can induce an Everything Everywhere All at Once feeling… where does it start? Where does it end?
I don't know Excel well, but it seems to me that you can:
This freedumb [sic] coupled with the fact that calculations are:
makes it difficult to follow the logic.
An analogy between data analysis and cooking:
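# load packages----------------------------------
library(tidyverse) # read_csv(), dplyr, tidyr, ggplot2
library(lubridate) # year(), today()
library(here)      # here()
library(ggthemes)  # theme_excel_new()
# read in the data-------------------------------
raw_data <- read_csv(here("data", "Population_Projections.csv"))
# my_dt() is not defined in this section; presumably a table-display helper defined elsewhere
my_dt(raw_data)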
needed_columns <- raw_data |>
  select(-Region, -`Local Health Area`, -Gender, -Total)
my_dt(needed_columns)
long_filtered <- needed_columns |>
  pivot_longer(cols = -Year, names_to = "age", values_to = "population") |>
  mutate(age = as.numeric(age)) |>
  filter(age %in% 15:65)
my_dt(long_filtered)
with_generation <- long_filtered |>
  mutate(
    birth_year = Year - age,
    # generation starts as a character variable, which R sorts alphabetically :(
    generation = case_when(
      birth_year %in% 1900:1927 ~ "Greatest",
      birth_year %in% 1928:1945 ~ "Interwar",
      birth_year %in% 1946:1965 ~ "Baby Boomers",
      birth_year %in% 1966:1980 ~ "Gen X",
      birth_year %in% 1981:1996 ~ "Millennials",
      birth_year %in% 1997:2012 ~ "Gen Z",
      birth_year %in% 2013:2023 ~ "Gen Alpha",
      TRUE ~ NA_character_
    ),
    # convert generation to a factor, ordered chronologically :)
    generation = ordered(
      generation,
      levels = c(
        "Greatest", "Interwar", "Baby Boomers", "Gen X",
        "Millennials", "Gen Z", "Gen Alpha"
      )
    )
  )
my_dt(with_generation)
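aggregated <- with_generation |>
  group_by(Year, generation) |>
  # .groups = "drop" returns an ungrouped result and silences the regrouping message
  summarise(population = sum(population), .groups = "drop") |>
  filter(Year > 2012)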
my_dt(aggregated)
plt <- aggregated |>
  ggplot(aes(Year, population, fill = generation)) +
  geom_col()
plt
plt <- plt +
  # vertical line at the boundary between last year and the current year
  geom_vline(xintercept = year(today()) - .5, lwd = 1.5, alpha = .5) +
  # semi-transparent bars drawn on top of the line
  geom_col(alpha = .75) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = NULL,
    y = NULL,
    fill = NULL,
    title = "Working age Population Projections for BC"
  ) +
  theme_excel_new()
wrapR::plotlify(plt, "", 15)
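The comments in the generation step say that R sorts character variables alphabetically, and that converting to an ordered factor fixes this. Here is a minimal base R sketch of that difference. The vector names `gens` and `gens_ordered` and the four labels are chosen for illustration only; they are not part of the analysis above.

```r
# Illustrative labels only: a subset of the generations used above.
gens <- c("Gen Z", "Baby Boomers", "Gen X", "Interwar")

# As plain character strings, sorting is alphabetical,
# so "Interwar" lands last even though it is the oldest group.
sort(gens)

# As an ordered factor, sorting follows the levels we supply.
gens_ordered <- ordered(
  gens,
  levels = c("Interwar", "Baby Boomers", "Gen X", "Gen Z")
)
sort(gens_ordered)

# A label that matches no level silently becomes NA,
# so the spelling of labels and levels must agree exactly.
ordered("GenX", levels = "Gen X")
```

ggplot2 uses factor levels to order the legend and the stacked fill, which is presumably why the conversion matters for the plot.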
"Happy families are all alike; every unhappy family is unhappy in its own way." (Leo Tolstoy, Anna Karenina)
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
Every column is a variable.
Every row is an observation.
Every cell is a single value.
Messy data is any other arrangement of the data.
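To make the three rules concrete, here is a small sketch with made-up numbers. The `messy` table stores a variable (the year) in its column names; `pivot_longer()` from tidyr moves it into a column of its own. The names `messy` and `tidy_pop` and all values are hypothetical, chosen only for illustration.

```r
library(tibble)
library(tidyr)

# Made-up data: the column names "2022" and "2023" are really values of a variable.
messy <- tibble(
  region = c("A", "B"),
  `2022` = c(100, 200),
  `2023` = c(110, 210)
)

# One row per region-year observation, one column per variable.
tidy_pop <- messy |>
  pivot_longer(
    cols = -region,
    names_to = "year",
    values_to = "population",
    names_transform = list(year = as.integer)
  )

tidy_pop
```

The tidy version has four rows (two regions by two years) and three columns: `region`, `year` and `population`.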