01_population_data

Author

Shivana Thiru Moorthy

1 Introduction

This quarto file attempts to create a naughty/nice list for LGAs.

Specifically, it tries to answer the following questions

What has the population growth rate for each LGA in Greater Sydney been for the last 10 years (2014 to 2024)?

This analysis is done for each of the following populations: 1. Total population 2. Under 20s (as a proxy for children)

I use under 20s as a proxy for children as the ABS provides population data in 5-year buckets.

The structure of this file includes: 1. Load packages 2. Import and clean data 3. Run analytics for the Total Population 4. Run analytics for the Under 20s Population

2 Loading packages

Code

library(tidyverse)
library(sf)
library(readxl)
library(magrittr)
library(DT)

3 Cleaning data

3.1 Import GCCSA

I import GCCSA boundaries data and filter to extract Greater Sydney.

I also prepend a “gccsa_” prefix to help identify columns when using this data set later.

Data is sourced from Digital boundary files - Australian Statistical Geography Standard (ASGS) Edition 3.

Code

gccsa_syd <- st_read("./data/GCCSA_2021_AUST_SHP_GDA2020/GCCSA_2021_AUST_GDA2020.shp") |>
  filter(GCC_CODE21 == '1GSYD') |>
  rename_with(~ paste0('gccsa_', .x))

Reading layer `GCCSA_2021_AUST_GDA2020' from data source 
  `/Users/shivanathirumoorthy/Documents/scratch_projects/lga_pop_growth/data/GCCSA_2021_AUST_SHP_GDA2020/GCCSA_2021_AUST_GDA2020.shp' 
  using driver `ESRI Shapefile'
replacing null geometries with empty geometries
Simple feature collection with 35 features and 10 fields (with 19 geometries empty)
Geometry type: GEOMETRY
Dimension:     XY
Bounding box:  xmin: 96.81695 ymin: -43.7405 xmax: 167.998 ymax: -9.142163
Geodetic CRS:  GDA2020

3.2 Import LGA Boundaries

I import the 2024 LGA boundaries data.

I also prepend an “lga_” prefix to help identify columns when using this data set later.

Data is sourced from Digital boundary files - Australian Statistical Geography Standard (ASGS) Edition 3.

Code

lga_raw <- st_read("./data/LGA_2024_AUST_GDA2020/LGA_2024_AUST_GDA2020.shp") |>
  rename_with(~ paste0('lga_', .x))

Reading layer `LGA_2024_AUST_GDA2020' from data source 
  `/Users/shivanathirumoorthy/Documents/scratch_projects/lga_pop_growth/data/LGA_2024_AUST_GDA2020/LGA_2024_AUST_GDA2020.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 566 features and 8 fields (with 19 geometries empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 96.81695 ymin: -43.7405 xmax: 167.998 ymax: -9.142163
Geodetic CRS:  GDA2020

3.2.1 LGAs - Filter only to Greater Sydney

I then filter the LGA boundaries to create a table with only LGAs that are in Greater Sydney.

More specifically, I look for LGAs that have more than 50% of their surface area inside the Greater Sydney GCCSA.

Code

#Obtaining intersections between LGA and Greater Sydney GCCSA
gccsa_lga_syd_intersection <- lga_raw |>
  filter(lga_LGA_CODE24 != '19399') |> #filtering out Unincorporated NSW
  st_intersection(gccsa_syd) |>
  st_make_valid() |> #Repairing geometry
  rename(intersection_geometry = lga_geometry) |> #renaming geometry column
  mutate(intersection_sqkm = as.double(st_area(intersection_geometry) / (10^6))) |>
  mutate(intersection_share_lga = round(intersection_sqkm/lga_AREASQKM, 2))  # Calculating intersection as share of LGA area, rounded

I also present this table that shows each LGA that touches the Greater Sydney GCCSA, and shows the share of its surface area that is inside the Greater Sydney GCCSA.

Code

gccsa_lga_syd_intersection |>
  st_drop_geometry() |>
  #glimpse() %>%
  select(lga_LGA_NAME24, intersection_share_lga) |>
  rename(`LGA name` = lga_LGA_NAME24,
         `Share of area in Greater Sydney (%)` = intersection_share_lga) |>
  arrange(desc(`Share of area in Greater Sydney (%)`), `LGA name`) |>
  datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = c('csv'))) |>
  formatPercentage('Share of area in Greater Sydney (%)', 0)

I then use a 50% cutoff to decide if an LGA is in Greater Sydney. Note however, that this doesn’t really matter. Any cutoff between 94% and 13% would have the same effect.

Code

gccsa_lga_syd_intersection |>
  st_drop_geometry() |>
  #glimpse() %>%
  select(lga_LGA_NAME24, intersection_share_lga) |>
  ggplot(aes(x = reorder(lga_LGA_NAME24, -intersection_share_lga), y = intersection_share_lga)) +
  geom_col() +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  annotate("text", 
         x = 38, 
         y = 0.55, 
         label = "Cutoff") +
  ggtitle("Share of LGA within Greater Sydney (GCCSA)") +
  theme_classic() +
  scale_y_continuous(name = "Share of surface area (%)",
                   labels = scales::percent, limits = c(0,1), 
                   breaks = seq(from = 0, to = 1, by = 0.25), 
                   expand = c(0,0)) +
  scale_x_discrete(name = "LGA") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

I apply the 50% land area filter now.

Code

#Filtering out only LGAs 50% inside Sydney GCCSA
lga_syd <- gccsa_lga_syd_intersection |>
  filter(intersection_share_lga > 0.5) |>
  select(lga_LGA_CODE24) |>
  st_drop_geometry() |>
  left_join(lga_raw, by = join_by(lga_LGA_CODE24)) |>
  st_as_sf()

3.3 Import population data

I import population data for each LGA. This has population categorised into 5 year buckets.

Data is sourced from the ABS Regional population by age and sex.

This extract represents population as at 30 Jun 2024.

Code

population_lgas <- read_excel("./data/32350DS0006_2001-24.xlsx", sheet = 'Table 3', skip = 5) |>
  #Cleaning header and footer entries
  filter(`...1` != "Year") |>
  filter(`...1` != "(a) Based on 2024 LGA boundaries.") |>
  filter(`...1` != "© Commonwealth of Australia") |>
  pivot_longer(cols = -starts_with("..."),
               names_to = "age_range",
               values_to = "population") |>
  #Repairing column name
  rename(Year = ...1,
       `S/T code` = ...2,
       `S/T name`= ...3,
       `LGA code` = ...4,
       `LGA name` = ...5) |>
  mutate(Year = as.integer(Year),
         population = as.integer(population))

We then create a new field with numeric age_range columns to simplify the output

I then clean the data by: 1. Categorising the start and end of age ranges as numeric values. For example, the age range “0-4” is coded as age_range_lower = 0 and age_range_upper = 4 2. Create a version of the dataset without the “Total persons” entry. (population_lgas_analytics) I later infer total persons by simply adding up the population at all available age bands.

Code

clean_range <- population_lgas |>
  filter(!age_range %in% c("85 and over", "Total persons")) |>
  separate_wider_delim(age_range, 
                       delim = "–", 
                       names = c("age_range_lower", "age_range_higher"),
                       cols_remove = FALSE) |>
  mutate(age_range_lower = as.integer(age_range_lower),
         age_range_higher = as.integer(age_range_higher))

dirty_range <- population_lgas |>
  filter(age_range %in% c("85 and over", "Total persons")) |>
  mutate(age_range_lower = if_else(age_range == "85 and over", as.integer(85), as.integer(0))) |>
  mutate(age_range_higher = as.integer(999))

population_lgas_clean <- clean_range |>
  bind_rows(dirty_range)

population_lgas_analytics <- population_lgas_clean |>
  filter(age_range != "Total persons")

Cleaning up

Code

rm(clean_range)
rm(dirty_range)

4 Prepatory analysis

First we join our LGA list with the population data to build a breakdown of population for all LGAs in Greater Sydney.

Code

lga_analytics_syd <- lga_syd |>
  select(lga_LGA_CODE24) |>
  st_drop_geometry() |>
  as_tibble() %>%
  left_join(population_lgas_analytics, by = join_by(lga_LGA_CODE24 == `LGA code`))

5 Analysis - Total Population

In this section I analyse the growth of the total population in Greater Sydney.

First, I calculate the growth rate of the entire Greater Sydney. Second, I calculate the growth rate for each LGA.

5.1 Total Population - Greater Sydney - Growth Rate

This table looks at the population growth of the entirety of Greater Sydney.

In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024.

This allows us to look at: 1. growth_rate_10y - The cumulative population growth rate for the 10 years prior to each snapshot 2. CAGR_10y - the compound annual growth rate for the 10 years prior to each snapshot

Code

stats_total_population <- lga_analytics_syd |>
  filter(Year %in% c(2024, 2014, 2004)) %>%
  group_by(Year) |>
  summarise(population = sum(population)) %>%
  mutate(decade_lag = lag(population)) |>
  mutate(growth_10y = population - decade_lag,
         growth_rate_10y = growth_10y/decade_lag,
         CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1)

In the interest of representing data directly, I provide a table of this data that includes: 1. Population in 2024 2. Population in 2014 3. Population growth rate over 10 years (%) 4. Population growth rate converted into a CAGR

Code

stats_total_population %>%
  filter(Year == 2024) |>
  select(population, decade_lag, growth_rate_10y, CAGR_10y) |>
  rename(`Population (2024)` = population,
         `Population (2014)` = decade_lag,
         `Population Growth Rate 2014-2024 (%)` = growth_rate_10y,
         `CAGR 2014-2024 (%)` = CAGR_10y) |>
  datatable(extensions = 'Buttons', options =
              list(dom = 'Bfrtip', buttons = c('csv'))) %>%
  formatPercentage('Population Growth Rate 2014-2024 (%)', 2) %>%
  formatPercentage('CAGR 2014-2024 (%)', 2) %>%
  formatRound("Population (2024)", interval = 3,mark = ",", digits = 0) |>
  formatRound("Population (2014)", interval = 3,mark = ",", digits = 0)

5.2 Total Population - LGA - Growth Rate

This table looks at the population growth for each LGA in Greater Sydney.

In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024.

Code

stats_total_population_lga <- lga_analytics_syd |>
  filter(Year %in% c(2024, 2014, 2004)) %>%
  group_by(Year, lga_LGA_CODE24, `LGA name` ) |>
  summarise(population = sum(population)) |>
  ungroup() |>
  arrange(lga_LGA_CODE24, `LGA name`, Year) |>
  group_by(lga_LGA_CODE24, `LGA name`) |>
  mutate(decade_lag = lag(population)) |>
  mutate(growth_10y = population - decade_lag,
         growth_rate_10y = growth_10y/decade_lag,
         CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1) |>
  ungroup()

We can then construct a naughty/nice list to look at population growth by LGA.

I present the LGAs in order of ascending population growth rate. As such, the naughtiest (NIMBYiest?) LGAs appear up top.

Code

stats_total_population_lga |>
  filter(Year == 2024) |>
  select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |>
  arrange(CAGR_10y) %>%
  rename(`Population (2024)` = population,
       `Population (2014)` = decade_lag,
       `Population Growth Rate 2014-2024 (%)` = growth_rate_10y,
       `CAGR 2014-2024 (%)` = CAGR_10y) |>
  datatable(extensions = 'Buttons', options =
              list(dom = 'Bfrtip', buttons = c('csv'))) %>%
  formatPercentage('Population Growth Rate 2014-2024 (%)', 2) |>
  formatPercentage('CAGR 2014-2024 (%)', 2) %>%
  formatRound("Population (2024)", interval = 3,mark = ",", digits = 0) |>
  formatRound("Population (2014)", interval = 3,mark = ",", digits = 0)

5.2.1 Total Population - LGA - Charting

We can then construct a chart showing population growth by LGA, sorting by growth rate.

Code

syd_average_growth <- stats_total_population %>%
  filter(Year == 2024) %>%
  select(growth_rate_10y) %>%
  extract2(1)

total_population_growth_chart <- stats_total_population_lga |>
  filter(Year == 2024) |>
  select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |>
  arrange(CAGR_10y) %>%
  rename(`Population (2024)` = population,
       `Population (2014)` = decade_lag,
       `Population Growth Rate 2014-2024 (%)` = growth_rate_10y,
       `CAGR 2014-2024 (%)` = CAGR_10y) |>
  ggplot(aes(x = reorder(`LGA name`, -`Population Growth Rate 2014-2024 (%)`), 
             y = `Population Growth Rate 2014-2024 (%)`)) +
  geom_col() + 
  geom_hline(yintercept = syd_average_growth, linetype = "dashed") +
  ggtitle("Population - Growth (2014-2024) by LGA") +
  annotate("text", 
           x = 25, 
           y = syd_average_growth + 0.05, 
           label = "Greater Syd Average") +
  scale_y_continuous(name = "Population Growth  (%)",
                     labels = scales::percent, limits = c(-0.25,1.25), 
                     breaks = seq(from = -0.25, to = 1.25, by = 0.25)) +
  scale_x_discrete(name = "LGA") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

We can present this chart and save a copy as a PNG.

Code

total_population_growth_chart %>%
  ggsave(filename = "output/population_growth_by_lga_2014_2014.png", plot = ., width = 16, height = 12, units = "cm")

total_population_growth_chart

6 Analysis - Under 20s

The analysis in this section is motivated by a desire to see what happened to population of children in Sydney.

We measure Under 20s as all population 0-19, because the ABS gives us data in those categories. We don’t actually have separate data for “children”.

6.1 Under 20s - Greater Sydney - Growth Rate

First we measure the overall population of children in all LGAs observed

This table looks at the population growth of under 20s the entirety of Greater Sydney.

In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024.

Code

stats_under_20 <- lga_analytics_syd |>
  filter(age_range_higher <= 19) |>
  filter(Year %in% c(2024, 2014, 2004)) %>%
  group_by(Year) |>
  summarise(population = sum(population)) %>%
  mutate(decade_lag = lag(population)) |>
  mutate(growth_10y = population - decade_lag,
         growth_rate_10y = growth_10y/decade_lag,
         CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1)

Code

stats_under_20 %>%
  filter(Year == 2024) |>
  select(population, decade_lag, growth_rate_10y, CAGR_10y) |>
  rename(`Population - Under 20 (2024)` = population,
         `Population - Under 20 (2014)` = decade_lag,
         `Population - Under 20 - Growth Rate 2014-2024 (%)` = growth_rate_10y,
         `CAGR - Under 20 - 2014-2024 (%)` = CAGR_10y) |>
  datatable(extensions = 'Buttons', options =
              list(dom = 'Bfrtip', buttons = c('csv'))) %>%
  formatPercentage('Population - Under 20 - Growth Rate 2014-2024 (%)', 2) %>%
  formatPercentage('CAGR - Under 20 - 2014-2024 (%)', 2) %>%
  formatRound("Population - Under 20 (2024)", interval = 3,mark = ",", digits = 0) |>
  formatRound("Population - Under 20 (2014)", interval = 3,mark = ",", digits = 0)

6.2 Under 20s - LGA - Growth Rate

This table looks at the population growth in Under 20s for each LGA in Greater Sydney.

In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024.

Code

stats_under_20_lga <- lga_analytics_syd |>
  filter(age_range_higher <= 19) |>
  filter(Year %in% c(2024, 2014, 2004)) %>%
  group_by(Year, lga_LGA_CODE24, `LGA name` ) |>
  summarise(population = sum(population)) |>
  ungroup() |>
  arrange(lga_LGA_CODE24, `LGA name`, Year) |>
  group_by(lga_LGA_CODE24, `LGA name`) |>
  mutate(decade_lag = lag(population)) |>
  mutate(growth_10y = population - decade_lag,
         growth_rate_10y = growth_10y/decade_lag,
         CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1) |>
  ungroup()

We can then construct a naughty/nice list to look at under 20s population growth by LGA.

In the interest of representing data directly, I provide a table of this data that includes: 1. Population - Under 20s - in 2024 2. Population - Under 20s - in 2014 3. Population - Under 20 - growth rate over 10 years (%) 4. Population - Under 20 - growth rate converted into a CAGR

I present the LGAs in order of ascending population growth rate. We can therefore see where “a city without grandchildren” will happen first.

Code

stats_under_20_lga |>
  filter(Year == 2024) |>
  select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |>
  arrange(CAGR_10y) %>%
  rename(`Population - Under 20 (2024)` = population,
       `Population - Under 20 (2014)` = decade_lag,
       `Population - Under 20 - Growth Rate 2014-2024 (%)` = growth_rate_10y,
       `CAGR - Under 20 - 2014-2024 (%)` = CAGR_10y) |>
  datatable(extensions = 'Buttons', options =
              list(dom = 'Bfrtip', buttons = c('csv'))) %>%
  formatPercentage('Population - Under 20 - Growth Rate 2014-2024 (%)', 2) |>
  formatPercentage('CAGR - Under 20 - 2014-2024 (%)', 2) %>%
  formatRound("Population - Under 20 (2024)", interval = 3,mark = ",", digits = 0) |>
  formatRound("Population - Under 20 (2014)", interval = 3,mark = ",", digits = 0)

6.2.1 Under 20s - LGA - Charting

We can then construct a chart showing under 20s population growth by LGA, sorting by growth rate.

Code

syd_under_20_average_growth <- stats_under_20 %>%
  filter(Year == 2024) %>%
  select(growth_rate_10y) %>%
  extract2(1)

under_20_population_growth_chart <- stats_under_20_lga |>
  filter(Year == 2024) |>
  select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |>
  arrange(CAGR_10y) %>%
  rename(`Population (2024)` = population,
       `Population (2014)` = decade_lag,
       `Population Growth Rate 2014-2024 (%)` = growth_rate_10y,
       `CAGR 2014-2024 (%)` = CAGR_10y) |>
  ggplot(aes(x = reorder(`LGA name`, -`Population Growth Rate 2014-2024 (%)`), 
             y = `Population Growth Rate 2014-2024 (%)`)) +
  geom_col() + 
  geom_hline(yintercept = syd_under_20_average_growth, linetype = "dashed") +
  ggtitle("Population - Under 20 - Growth (2014-2024) by LGA") +
  annotate("text", 
           x = 25, 
           y = syd_average_growth + 0.05, 
           label = "Greater Syd Average") +
  scale_y_continuous(name = "Population Growth  (%)",
                     labels = scales::percent, limits = c(-0.25,1.25), 
                     breaks = seq(from = -0.25, to = 1.25, by = 0.25)) +
  scale_x_discrete(name = "LGA") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Code

under_20_population_growth_chart %>%
  ggsave(filename = "output/population_under_20_growth_by_lga_2014_2014.png", plot = ., width = 16, height = 12, units = "cm")

under_20_population_growth_chart

7 Specials

7.1 Burwood

In response to some group chat questions, I include some extra analysis on under 20s in Burwood. Specifically I look into the evolution of the under 20s population in Burwood.

Code

burwood_under_20_bars <- population_lgas_analytics %>%
  filter(`LGA name` == "Burwood") %>%
  filter(age_range_higher <= 19) |>
  arrange(age_range_lower, Year) %>%
  #glimpse() %>%
  ggplot(aes(x = Year, y = population, fill = reorder(age_range, -age_range_lower))) +
  geom_col(position = "stack") +
  ggtitle("Burwood - population under 20 - 2001 to 2024") +
  scale_y_continuous(name = "Population under 20", limits = c(0,8000), 
                     labels = scales::comma,
                     expand = c(0,0)) +
  scale_fill_discrete(guide = guide_legend(title = "Age range")) +
  theme_classic()

I then visualise the number of Under 20s in Burwood based on the 5-year age buckets the ABS provides.

Code

burwood_under_20_bars %>%
  ggsave(filename = "output/burwood_under_20_bars.png", plot = ., width = 16, height = 12, units = "cm")

burwood_under_20_bars

To make the evolution in each 5-year category more obvious, I generate the following line chart of the same data.

Code

burwood_under_20_line <- population_lgas_analytics %>%
  filter(`LGA name` == "Burwood") %>%
  filter(age_range_higher <= 19) |>
  arrange(age_range_lower, Year) %>%
  #glimpse() %>%
  ggplot(aes(x = Year, y = population, colour = reorder(age_range, - age_range_lower))) +
  geom_line() +
  ggtitle("Burwood - population under 20 - 2001 to 2024") +
  scale_y_continuous(name = "Population under 20", limits = c(1200,2400), 
                     labels = scales::comma,
                     expand = c(0,0)) +
  scale_colour_discrete(guide = guide_legend(title = "Age range")) +
  theme_classic()

Code

burwood_under_20_line %>%
  ggsave(filename = "output/burwood_under_20_line.png", plot = ., width = 16, height = 12, units = "cm")

burwood_under_20_line

7.2 Randwick

In response to some group chat questions, I include some extra analysis on under 20s in Randwick. Specifically I look into the evolution of the under 20s population in Burwood.

Code

randwick_under_20_bars <- population_lgas_analytics %>%
  filter(`LGA name` == "Randwick") %>%
  filter(age_range_higher <= 19) |>
  arrange(age_range_lower, Year) |>
  ggplot(aes(x = Year, y = population, fill = reorder(age_range, -age_range_lower))) +
  geom_col(position = "stack") +
  ggtitle("Randwick - population under 20 - 2001 to 2024") +
  scale_y_continuous(name = "Population under 20", limits = c(0,40000), 
                     labels = scales::comma,
                     expand = c(0,0)) +
  scale_fill_discrete(guide = guide_legend(title = "Age range")) +
  theme_classic()

I then visualise the number of Under 20s in Randwick based on the 5-year age buckets the ABS provides.

Code

randwick_under_20_bars %>%
  ggsave(filename = "output/randwick_under_20_bars.png", plot = ., width = 16, height = 12, units = "cm")

randwick_under_20_bars

To make the evolution in each 5-year category more obvious, I generate the following line chart of the same data.

Code

randwick_under_20_line <- population_lgas_analytics %>%
  filter(`LGA name` == "Randwick") %>%
  filter(age_range_higher <= 19) |>
  arrange(age_range_lower, Year) %>%
  #glimpse() %>%
  ggplot(aes(x = Year, y = population, colour = reorder(age_range, - age_range_lower))) +
  geom_line() +
  ggtitle("Randwick - population under 20 - 2001 to 2024") +
  scale_y_continuous(name = "Population under 20", limits = c(5000,10000), 
                     labels = scales::comma,
                     expand = c(0,0)) +
  scale_colour_discrete(guide = guide_legend(title = "Age range")) +
  theme_classic()

Code

randwick_under_20_line %>%
  ggsave(filename = "output/randwick_under_20_line.png", plot = ., width = 16, height = 12, units = "cm")

randwick_under_20_line

--- title: "01_population_data" author: "Shivana Thiru Moorthy" format: html: toc: true number-sections: true theme: cosmo code-tools: true self-contained: true code-fold: true editor: visual execute: warning: false --- # Introduction This quarto file attempts to create a naughty/nice list for LGAs. Specifically, it tries to answer the following questions What has the population growth rate for each LGA in Greater Sydney been for the last 10 years (2014 to 2024)? This analysis is done for each of the following populations: 1. Total population 2. Under 20s (as a proxy for children) I use under 20s as a proxy for children as the ABS provides population data in 5-year buckets. The structure of this file includes: 1. Load packages 2. Import and clean data 3. Run analytics for the Total Population 4. Run analytics for the Under 20s Population # Loading packages ```{r} library(tidyverse) library(sf) library(readxl) library(magrittr) library(DT) ``` # Cleaning data ## Import GCCSA I import GCCSA boundaries data and filter to extract Greater Sydney. I also prepend a "gccsa\_" prefix to help identify columns when using this data set later. Data is sourced from [Digital boundary files - Australian Statistical Geography Standard (ASGS) Edition 3](https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files). ```{r} gccsa_syd <- st_read("./data/GCCSA_2021_AUST_SHP_GDA2020/GCCSA_2021_AUST_GDA2020.shp") |> filter(GCC_CODE21 == '1GSYD') |> rename_with(~ paste0('gccsa_', .x)) ``` ## Import LGA Boundaries I import the 2024 LGA boundaries data. I also prepend an "lga\_" prefix to help identify columns when using this data set later. Data is sourced from [Digital boundary files - Australian Statistical Geography Standard (ASGS) Edition 3](https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files). ```{r} lga_raw <- st_read("./data/LGA_2024_AUST_GDA2020/LGA_2024_AUST_GDA2020.shp") |> rename_with(~ paste0('lga_', .x)) ``` ### LGAs - Filter only to Greater Sydney I then filter the LGA boundaries to create a table with only LGAs that are in Greater Sydney. More specifically, I look for LGAs that have more than 50% of their surface area inside the Greater Sydney GCCSA. ```{r} #Obtaining intersections between LGA and Greater Sydney GCCSA gccsa_lga_syd_intersection <- lga_raw |> filter(lga_LGA_CODE24 != '19399') |> #filtering out Unincorporated NSW st_intersection(gccsa_syd) |> st_make_valid() |> #Repairing geometry rename(intersection_geometry = lga_geometry) |> #renaming geometry column mutate(intersection_sqkm = as.double(st_area(intersection_geometry) / (10^6))) |> mutate(intersection_share_lga = round(intersection_sqkm/lga_AREASQKM, 2)) # Calculating intersection as share of LGA area, rounded ``` I also present this table that shows each LGA that touches the Greater Sydney GCCSA, and shows the share of its surface area that is inside the Greater Sydney GCCSA. ```{r} gccsa_lga_syd_intersection |> st_drop_geometry() |> #glimpse() %>% select(lga_LGA_NAME24, intersection_share_lga) |> rename(`LGA name` = lga_LGA_NAME24, `Share of area in Greater Sydney (%)` = intersection_share_lga) |> arrange(desc(`Share of area in Greater Sydney (%)`), `LGA name`) |> datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = c('csv'))) |> formatPercentage('Share of area in Greater Sydney (%)', 0) ``` \ I then use a 50% cutoff to decide if an LGA is in Greater Sydney. Note however, that this doesn't really matter. Any cutoff between 94% and 13% would have the same effect. ```{r} gccsa_lga_syd_intersection |> st_drop_geometry() |> #glimpse() %>% select(lga_LGA_NAME24, intersection_share_lga) |> ggplot(aes(x = reorder(lga_LGA_NAME24, -intersection_share_lga), y = intersection_share_lga)) + geom_col() + geom_hline(yintercept = 0.5, linetype = "dashed") + annotate("text", x = 38, y = 0.55, label = "Cutoff") + ggtitle("Share of LGA within Greater Sydney (GCCSA)") + theme_classic() + scale_y_continuous(name = "Share of surface area (%)", labels = scales::percent, limits = c(0,1), breaks = seq(from = 0, to = 1, by = 0.25), expand = c(0,0)) + scale_x_discrete(name = "LGA") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) ``` \ I apply the 50% land area filter now. ```{r} #Filtering out only LGAs 50% inside Sydney GCCSA lga_syd <- gccsa_lga_syd_intersection |> filter(intersection_share_lga > 0.5) |> select(lga_LGA_CODE24) |> st_drop_geometry() |> left_join(lga_raw, by = join_by(lga_LGA_CODE24)) |> st_as_sf() ``` ## Import population data I import population data for each LGA. This has population categorised into 5 year buckets. Data is sourced from the [ABS Regional population by age and sex](https://www.abs.gov.au/statistics/people/population/regional-population-age-and-sex/latest-release#data-downloads). This extract represents population as at 30 Jun 2024. ```{r} population_lgas <- read_excel("./data/32350DS0006_2001-24.xlsx", sheet = 'Table 3', skip = 5) |> #Cleaning header and footer entries filter(`...1` != "Year") |> filter(`...1` != "(a) Based on 2024 LGA boundaries.") |> filter(`...1` != "© Commonwealth of Australia") |> pivot_longer(cols = -starts_with("..."), names_to = "age_range", values_to = "population") |> #Repairing column name rename(Year = ...1, `S/T code` = ...2, `S/T name`= ...3, `LGA code` = ...4, `LGA name` = ...5) |> mutate(Year = as.integer(Year), population = as.integer(population)) ``` We then create a new field with numeric age_range columns to simplify the output I then clean the data by: 1. Categorising the start and end of age ranges as numeric values. For example, the age range "0-4" is coded as age_range_lower = 0 and age_range_upper = 4 2. Create a version of the dataset without the "Total persons" entry. (population_lgas_analytics) I later infer total persons by simply adding up the population at all available age bands. ```{r} clean_range <- population_lgas |> filter(!age_range %in% c("85 and over", "Total persons")) |> separate_wider_delim(age_range, delim = "–", names = c("age_range_lower", "age_range_higher"), cols_remove = FALSE) |> mutate(age_range_lower = as.integer(age_range_lower), age_range_higher = as.integer(age_range_higher)) dirty_range <- population_lgas |> filter(age_range %in% c("85 and over", "Total persons")) |> mutate(age_range_lower = if_else(age_range == "85 and over", as.integer(85), as.integer(0))) |> mutate(age_range_higher = as.integer(999)) population_lgas_clean <- clean_range |> bind_rows(dirty_range) population_lgas_analytics <- population_lgas_clean |> filter(age_range != "Total persons") ``` Cleaning up ```{r} rm(clean_range) rm(dirty_range) ``` # Prepatory analysis First we join our LGA list with the population data to build a breakdown of population for all LGAs in Greater Sydney. ```{r} lga_analytics_syd <- lga_syd |> select(lga_LGA_CODE24) |> st_drop_geometry() |> as_tibble() %>% left_join(population_lgas_analytics, by = join_by(lga_LGA_CODE24 == `LGA code`)) ``` # Analysis - Total Population In this section I analyse the growth of the total population in Greater Sydney. First, I calculate the growth rate of the entire Greater Sydney. Second, I calculate the growth rate for each LGA. ## Total Population - Greater Sydney - Growth Rate This table looks at the population growth of the entirety of Greater Sydney. In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024. This allows us to look at: 1. growth_rate_10y - The cumulative population growth rate for the 10 years prior to each snapshot 2. CAGR_10y - the compound annual growth rate for the 10 years prior to each snapshot ```{r} stats_total_population <- lga_analytics_syd |> filter(Year %in% c(2024, 2014, 2004)) %>% group_by(Year) |> summarise(population = sum(population)) %>% mutate(decade_lag = lag(population)) |> mutate(growth_10y = population - decade_lag, growth_rate_10y = growth_10y/decade_lag, CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1) ``` In the interest of representing data directly, I provide a table of this data that includes: 1. Population in 2024 2. Population in 2014 3. Population growth rate over 10 years (%) 4. Population growth rate converted into a CAGR ```{r} stats_total_population %>% filter(Year == 2024) |> select(population, decade_lag, growth_rate_10y, CAGR_10y) |> rename(`Population (2024)` = population, `Population (2014)` = decade_lag, `Population Growth Rate 2014-2024 (%)` = growth_rate_10y, `CAGR 2014-2024 (%)` = CAGR_10y) |> datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = c('csv'))) %>% formatPercentage('Population Growth Rate 2014-2024 (%)', 2) %>% formatPercentage('CAGR 2014-2024 (%)', 2) %>% formatRound("Population (2024)", interval = 3,mark = ",", digits = 0) |> formatRound("Population (2014)", interval = 3,mark = ",", digits = 0) ``` ## Total Population - LGA - Growth Rate This table looks at the population growth for each LGA in Greater Sydney. In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024. This allows us to look at: 1. growth_rate_10y - The cumulative population growth rate for the 10 years prior to each snapshot 2. CAGR_10y - the compound annual growth rate for the 10 years prior to each snapshot ```{r} stats_total_population_lga <- lga_analytics_syd |> filter(Year %in% c(2024, 2014, 2004)) %>% group_by(Year, lga_LGA_CODE24, `LGA name` ) |> summarise(population = sum(population)) |> ungroup() |> arrange(lga_LGA_CODE24, `LGA name`, Year) |> group_by(lga_LGA_CODE24, `LGA name`) |> mutate(decade_lag = lag(population)) |> mutate(growth_10y = population - decade_lag, growth_rate_10y = growth_10y/decade_lag, CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1) |> ungroup() ``` We can then construct a naughty/nice list to look at population growth by LGA. In the interest of representing data directly, I provide a table of this data that includes: 1. Population in 2024 2. Population in 2014 3. Population growth rate over 10 years (%) 4. Population growth rate converted into a CAGR I present the LGAs in order of ascending population growth rate. As such, the naughtiest (NIMBYiest?) LGAs appear up top. ```{r} stats_total_population_lga |> filter(Year == 2024) |> select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |> arrange(CAGR_10y) %>% rename(`Population (2024)` = population, `Population (2014)` = decade_lag, `Population Growth Rate 2014-2024 (%)` = growth_rate_10y, `CAGR 2014-2024 (%)` = CAGR_10y) |> datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = c('csv'))) %>% formatPercentage('Population Growth Rate 2014-2024 (%)', 2) |> formatPercentage('CAGR 2014-2024 (%)', 2) %>% formatRound("Population (2024)", interval = 3,mark = ",", digits = 0) |> formatRound("Population (2014)", interval = 3,mark = ",", digits = 0) ``` ### Total Population - LGA - Charting We can then construct a chart showing population growth by LGA, sorting by growth rate. ```{r} syd_average_growth <- stats_total_population %>% filter(Year == 2024) %>% select(growth_rate_10y) %>% extract2(1) total_population_growth_chart <- stats_total_population_lga |> filter(Year == 2024) |> select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |> arrange(CAGR_10y) %>% rename(`Population (2024)` = population, `Population (2014)` = decade_lag, `Population Growth Rate 2014-2024 (%)` = growth_rate_10y, `CAGR 2014-2024 (%)` = CAGR_10y) |> ggplot(aes(x = reorder(`LGA name`, -`Population Growth Rate 2014-2024 (%)`), y = `Population Growth Rate 2014-2024 (%)`)) + geom_col() + geom_hline(yintercept = syd_average_growth, linetype = "dashed") + ggtitle("Population - Growth (2014-2024) by LGA") + annotate("text", x = 25, y = syd_average_growth + 0.05, label = "Greater Syd Average") + scale_y_continuous(name = "Population Growth (%)", labels = scales::percent, limits = c(-0.25,1.25), breaks = seq(from = -0.25, to = 1.25, by = 0.25)) + scale_x_discrete(name = "LGA") + theme_classic() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) ``` We can present this chart and save a copy as a PNG. ```{r, fig.width= 8, fig.height= 6} total_population_growth_chart %>% ggsave(filename = "output/population_growth_by_lga_2014_2014.png", plot = ., width = 16, height = 12, units = "cm") total_population_growth_chart ``` # Analysis - Under 20s The analysis in this section is motivated by a desire to see what happened to population of children in Sydney. We measure Under 20s as all population 0-19, because the ABS gives us data in those categories. We don't actually have separate data for "children". ## Under 20s - Greater Sydney - Growth Rate First we measure the overall population of children in all LGAs observed This table looks at the population growth of under 20s the entirety of Greater Sydney. In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024. This allows us to look at: 1. growth_rate_10y - The cumulative population growth rate for the 10 years prior to each snapshot 2. CAGR_10y - the compound annual growth rate for the 10 years prior to each snapshot ```{r} stats_under_20 <- lga_analytics_syd |> filter(age_range_higher <= 19) |> filter(Year %in% c(2024, 2014, 2004)) %>% group_by(Year) |> summarise(population = sum(population)) %>% mutate(decade_lag = lag(population)) |> mutate(growth_10y = population - decade_lag, growth_rate_10y = growth_10y/decade_lag, CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1) ``` In the interest of representing data directly, I provide a table of this data that includes: 1. Population in 2024 2. Population in 2014 3. Population growth rate over 10 years (%) 4. Population growth rate converted into a CAGR ```{r} stats_under_20 %>% filter(Year == 2024) |> select(population, decade_lag, growth_rate_10y, CAGR_10y) |> rename(`Population - Under 20 (2024)` = population, `Population - Under 20 (2014)` = decade_lag, `Population - Under 20 - Growth Rate 2014-2024 (%)` = growth_rate_10y, `CAGR - Under 20 - 2014-2024 (%)` = CAGR_10y) |> datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = c('csv'))) %>% formatPercentage('Population - Under 20 - Growth Rate 2014-2024 (%)', 2) %>% formatPercentage('CAGR - Under 20 - 2014-2024 (%)', 2) %>% formatRound("Population - Under 20 (2024)", interval = 3,mark = ",", digits = 0) |> formatRound("Population - Under 20 (2014)", interval = 3,mark = ",", digits = 0) ``` ## Under 20s - LGA - Growth Rate This table looks at the population growth in Under 20s for each LGA in Greater Sydney. In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024. This allows us to look at: 1. growth_rate_10y - The cumulative population growth rate for the 10 years prior to each snapshot 2. CAGR_10y - the compound annual growth rate for the 10 years prior to each snapshot ```{r} stats_under_20_lga <- lga_analytics_syd |> filter(age_range_higher <= 19) |> filter(Year %in% c(2024, 2014, 2004)) %>% group_by(Year, lga_LGA_CODE24, `LGA name` ) |> summarise(population = sum(population)) |> ungroup() |> arrange(lga_LGA_CODE24, `LGA name`, Year) |> group_by(lga_LGA_CODE24, `LGA name`) |> mutate(decade_lag = lag(population)) |> mutate(growth_10y = population - decade_lag, growth_rate_10y = growth_10y/decade_lag, CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1) |> ungroup() ``` We can then construct a naughty/nice list to look at under 20s population growth by LGA. In the interest of representing data directly, I provide a table of this data that includes: 1. Population - Under 20s - in 2024 2. Population - Under 20s - in 2014 3. Population - Under 20 - growth rate over 10 years (%) 4. Population - Under 20 - growth rate converted into a CAGR I present the LGAs in order of ascending population growth rate. We can therefore see where "a city without grandchildren" will happen first. ```{r} stats_under_20_lga |> filter(Year == 2024) |> select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |> arrange(CAGR_10y) %>% rename(`Population - Under 20 (2024)` = population, `Population - Under 20 (2014)` = decade_lag, `Population - Under 20 - Growth Rate 2014-2024 (%)` = growth_rate_10y, `CAGR - Under 20 - 2014-2024 (%)` = CAGR_10y) |> datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = c('csv'))) %>% formatPercentage('Population - Under 20 - Growth Rate 2014-2024 (%)', 2) |> formatPercentage('CAGR - Under 20 - 2014-2024 (%)', 2) %>% formatRound("Population - Under 20 (2024)", interval = 3,mark = ",", digits = 0) |> formatRound("Population - Under 20 (2014)", interval = 3,mark = ",", digits = 0) ``` ### Under 20s - LGA - Charting We can then construct a chart showing under 20s population growth by LGA, sorting by growth rate. ```{r} syd_under_20_average_growth <- stats_under_20 %>% filter(Year == 2024) %>% select(growth_rate_10y) %>% extract2(1) under_20_population_growth_chart <- stats_under_20_lga |> filter(Year == 2024) |> select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |> arrange(CAGR_10y) %>% rename(`Population (2024)` = population, `Population (2014)` = decade_lag, `Population Growth Rate 2014-2024 (%)` = growth_rate_10y, `CAGR 2014-2024 (%)` = CAGR_10y) |> ggplot(aes(x = reorder(`LGA name`, -`Population Growth Rate 2014-2024 (%)`), y = `Population Growth Rate 2014-2024 (%)`)) + geom_col() + geom_hline(yintercept = syd_under_20_average_growth, linetype = "dashed") + ggtitle("Population - Under 20 - Growth (2014-2024) by LGA") + annotate("text", x = 25, y = syd_average_growth + 0.05, label = "Greater Syd Average") + scale_y_continuous(name = "Population Growth (%)", labels = scales::percent, limits = c(-0.25,1.25), breaks = seq(from = -0.25, to = 1.25, by = 0.25)) + scale_x_discrete(name = "LGA") + theme_classic() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) ``` ```{r, fig.width=8, fig.height=6} under_20_population_growth_chart %>% ggsave(filename = "output/population_under_20_growth_by_lga_2014_2014.png", plot = ., width = 16, height = 12, units = "cm") under_20_population_growth_chart ``` # Specials ## Burwood In response to some group chat questions, I include some extra analysis on under 20s in Burwood. Specifically I look into the evolution of the under 20s population in Burwood. ```{r} burwood_under_20_bars <- population_lgas_analytics %>% filter(`LGA name` == "Burwood") %>% filter(age_range_higher <= 19) |> arrange(age_range_lower, Year) %>% #glimpse() %>% ggplot(aes(x = Year, y = population, fill = reorder(age_range, -age_range_lower))) + geom_col(position = "stack") + ggtitle("Burwood - population under 20 - 2001 to 2024") + scale_y_continuous(name = "Population under 20", limits = c(0,8000), labels = scales::comma, expand = c(0,0)) + scale_fill_discrete(guide = guide_legend(title = "Age range")) + theme_classic() ``` I then visualise the number of Under 20s in Burwood based on the 5-year age buckets the ABS provides. ```{r, fig.width=8, fig.height=4} burwood_under_20_bars %>% ggsave(filename = "output/burwood_under_20_bars.png", plot = ., width = 16, height = 12, units = "cm") burwood_under_20_bars ``` To make the evolution in each 5-year category more obvious, I generate the following line chart of the same data. ```{r} burwood_under_20_line <- population_lgas_analytics %>% filter(`LGA name` == "Burwood") %>% filter(age_range_higher <= 19) |> arrange(age_range_lower, Year) %>% #glimpse() %>% ggplot(aes(x = Year, y = population, colour = reorder(age_range, - age_range_lower))) + geom_line() + ggtitle("Burwood - population under 20 - 2001 to 2024") + scale_y_continuous(name = "Population under 20", limits = c(1200,2400), labels = scales::comma, expand = c(0,0)) + scale_colour_discrete(guide = guide_legend(title = "Age range")) + theme_classic() ``` ```{r, fig.width=8, fig.height=4} burwood_under_20_line %>% ggsave(filename = "output/burwood_under_20_line.png", plot = ., width = 16, height = 12, units = "cm") burwood_under_20_line ``` ## Randwick In response to some group chat questions, I include some extra analysis on under 20s in Randwick. Specifically I look into the evolution of the under 20s population in Burwood. ```{r} randwick_under_20_bars <- population_lgas_analytics %>% filter(`LGA name` == "Randwick") %>% filter(age_range_higher <= 19) |> arrange(age_range_lower, Year) |> ggplot(aes(x = Year, y = population, fill = reorder(age_range, -age_range_lower))) + geom_col(position = "stack") + ggtitle("Randwick - population under 20 - 2001 to 2024") + scale_y_continuous(name = "Population under 20", limits = c(0,40000), labels = scales::comma, expand = c(0,0)) + scale_fill_discrete(guide = guide_legend(title = "Age range")) + theme_classic() ``` I then visualise the number of Under 20s in Randwick based on the 5-year age buckets the ABS provides. ```{r, fig.width=8, fig.height=4} randwick_under_20_bars %>% ggsave(filename = "output/randwick_under_20_bars.png", plot = ., width = 16, height = 12, units = "cm") randwick_under_20_bars ``` To make the evolution in each 5-year category more obvious, I generate the following line chart of the same data. ```{r} randwick_under_20_line <- population_lgas_analytics %>% filter(`LGA name` == "Randwick") %>% filter(age_range_higher <= 19) |> arrange(age_range_lower, Year) %>% #glimpse() %>% ggplot(aes(x = Year, y = population, colour = reorder(age_range, - age_range_lower))) + geom_line() + ggtitle("Randwick - population under 20 - 2001 to 2024") + scale_y_continuous(name = "Population under 20", limits = c(5000,10000), labels = scales::comma, expand = c(0,0)) + scale_colour_discrete(guide = guide_legend(title = "Age range")) + theme_classic() ``` ```{r, fig.width=8, fig.height=4} randwick_under_20_line %>% ggsave(filename = "output/randwick_under_20_line.png", plot = ., width = 16, height = 12, units = "cm") randwick_under_20_line ```