Demo

The goal: produce a plot like this… without Excel

Why no Excel?

For me, opening a spreadsheet can induce an Everything Everywhere All at Once feeling… where does it start? Where does it end?
I lack knowledge about Excel, but it seems to me that you can:
- Put the sheets in any order you wish.
- Put the contents of a sheet in any order you wish (e.g. bottom to top, right to left).
This freedumb [sic] coupled with the fact that calculations are:
- hidden from view by default
- based on cell location rather than named variables

makes it difficult to follow the logic.

An analogy between data analysis and cooking:

Being handed an Excel file you did not create (or one created by your former self) is like being handed a basket of ingredients and a picture of the desired dish.
An R script is like having the recipe.
For an expert chef, having a basket of ingredients and a picture of the final product might be enough to recreate the dish… still delicious, but unlikely to be an exact replica of the last time.
I am not an expert chef.

Step 1: get the ingredients

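# load packages----------------------------------
library(tidyverse) # read_csv(), dplyr, tidyr, ggplot2 (and lubridate as of tidyverse 2.0.0)
library(here) # file paths relative to the project root
# read in the data-------------------------------
raw_data <- read_csv(here("data", "Population_Projections.csv"))
my_dt(raw_data)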

Step 2: discard unneeded ingredients

needed_columns <- raw_data |>
  select(-Region, -`Local Health Area`, -Gender, -Total)

my_dt(needed_columns)

Step 3: pour into a long container then filter out clumps

long_filtered <- needed_columns |>
  pivot_longer(cols = -Year, names_to = "age", values_to = "population") |>
  mutate(age = as.numeric(age)) |>
  filter(age %in% 15:65)

my_dt(long_filtered)
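
To see what Step 3 does on something small enough to read at a glance, here is a sketch with a made-up table (`toy_wide` and its numbers are invented for illustration, not taken from the projections file). It shows that the age column names arrive as text after pivoting, which is why the `as.numeric()` step comes before the filter.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per year, one column per age
toy_wide <- tibble(
  Year = c(2023, 2024),
  `14` = c(10, 11),
  `15` = c(20, 21),
  `16` = c(30, 31)
)

toy_long <- toy_wide |>
  # every column except Year becomes an (age, population) pair
  pivot_longer(cols = -Year, names_to = "age", values_to = "population") |>
  # the old column names are character strings, so convert before comparing
  mutate(age = as.numeric(age)) |>
  # keep working ages only; the age-14 rows are the "clumps"
  filter(age %in% 15:65)

toy_long
```

Two rows and three age columns become six long rows, and the filter then drops the two age-14 rows.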

Step 4: mix ingredients together.

with_generation <- long_filtered |>
  mutate(
    birth_year = Year - age,
    generation = case_when( # generation is a character variable, which R sorts alphabetically :(
      birth_year %in% 1900:1927 ~ "Greatest",
      birth_year %in% 1928:1945 ~ "Interwar",
      birth_year %in% 1946:1965 ~ "Baby Boomers",
      birth_year %in% 1966:1980 ~ "Gen X",
      birth_year %in% 1981:1996 ~ "Millennials",
      birth_year %in% 1997:2012 ~ "Gen Z",
      birth_year %in% 2013:2023 ~ "Gen Alpha",
      TRUE ~ NA_character_
    ), # convert generation to an ordered factor, sorted chronologically :)
    generation = ordered(generation, levels = c("Greatest",
                                                "Interwar",
                                                "Baby Boomers",
                                                "Gen X",
                                                "Millennials",
                                                "Gen Z",
                                                "Gen Alpha"))
  )

my_dt(with_generation)
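
Why bother with the ordered factor? A minimal base R sketch, using three of the labels above in a deliberately scrambled order (the vector `gens` is made up for illustration). ggplot2 follows the same rule when it orders a legend or stacks bars: alphabetical for character vectors, level order for factors.

```r
# Three generation labels, deliberately out of chronological order
gens <- c("Gen Z", "Greatest", "Baby Boomers")

# A plain character vector sorts alphabetically
alphabetical <- sort(gens)
alphabetical

# An ordered factor sorts by the order of its levels instead
gens_ordered <- ordered(gens, levels = c("Greatest", "Baby Boomers", "Gen Z"))
chronological <- sort(gens_ordered)
chronological
```

The alphabetical sort puts "Baby Boomers" first and "Greatest" last; the ordered factor puts "Greatest" first, as the calendar would.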

Step 5: bake at 350 for one microsecond.

aggregated <- with_generation |>
  group_by(Year, generation) |>
  summarise(population = sum(population), .groups = "drop") |>
  filter(Year > 2012)

my_dt(aggregated)
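
The baking step collapses many age rows into one total per year and generation. A small sketch on invented numbers (the `toy` table is hypothetical). The `.groups = "drop"` argument is my own choice here: it hands back an ungrouped table and avoids the message dplyr otherwise prints about how the result is grouped.

```r
library(dplyr)

# Hypothetical toy data: two Gen X age rows in 2023, one row each otherwise
toy <- tibble(
  Year = c(2023, 2023, 2023, 2024),
  generation = c("Gen X", "Gen X", "Gen Z", "Gen X"),
  population = c(100, 150, 80, 120)
)

toy_totals <- toy |>
  group_by(Year, generation) |>
  # one output row per (Year, generation) pair, populations added up
  summarise(population = sum(population), .groups = "drop")

toy_totals
```

Four input rows become three: the two Gen X rows for 2023 are added together, and the other two rows pass through unchanged.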

Step 6: Check to see if cooked.

plt <- aggregated |>
  ggplot(aes(Year, population, fill = generation)) +
  geom_col()
plt

Step 7: plate it.

plt <- plt +
  geom_vline(xintercept = year(today()) - .5, lwd = 1.5, alpha = .5) +
  geom_col(alpha = .75) +
  scale_y_continuous(labels = scales::comma) +
  labs(
    x = NULL,
    y = NULL,
    fill = NULL,
    title = "Working-Age Population Projections for BC"
  ) +
  theme_excel_new()
wrapR::plotlify(plt, "", 15)

Aside: tidy data

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

Every column is a variable.
Every row is an observation.
Every cell is a single value.

Messy data is any other arrangement of the data.

So the raw data was messy… because?
After pivoting, the data is tidy… because?