Code
library(tidyverse)
library(sf)
library(readxl)
library(magrittr)
library(DT)This quarto file attempts to create a naughty/nice list for LGAs.
Specifically, it tries to answer the following questions
What has the population growth rate for each LGA in Greater Sydney been for the last 10 years (2014 to 2024)?
This analysis is done for each of the following populations: 1. Total population 2. Under 20s (as a proxy for children)
I use under 20s as a proxy for children as the ABS provides population data in 5-year buckets.
The structure of this file includes: 1. Load packages 2. Import and clean data 3. Run analytics for the Total Population 4. Run analytics for the Under 20s Population
library(tidyverse)
library(sf)
library(readxl)
library(magrittr)
library(DT)I import GCCSA boundaries data and filter to extract Greater Sydney.
I also prepend a “gccsa_” prefix to help identify columns when using this data set later.
Data is sourced from Digital boundary files - Australian Statistical Geography Standard (ASGS) Edition 3.
gccsa_syd <- st_read("./data/GCCSA_2021_AUST_SHP_GDA2020/GCCSA_2021_AUST_GDA2020.shp") |>
filter(GCC_CODE21 == '1GSYD') |>
rename_with(~ paste0('gccsa_', .x)) Reading layer `GCCSA_2021_AUST_GDA2020' from data source
`/Users/shivanathirumoorthy/Documents/scratch_projects/lga_pop_growth/data/GCCSA_2021_AUST_SHP_GDA2020/GCCSA_2021_AUST_GDA2020.shp'
using driver `ESRI Shapefile'
replacing null geometries with empty geometries
Simple feature collection with 35 features and 10 fields (with 19 geometries empty)
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: 96.81695 ymin: -43.7405 xmax: 167.998 ymax: -9.142163
Geodetic CRS: GDA2020
I import the 2024 LGA boundaries data.
I also prepend an “lga_” prefix to help identify columns when using this data set later.
Data is sourced from Digital boundary files - Australian Statistical Geography Standard (ASGS) Edition 3.
lga_raw <- st_read("./data/LGA_2024_AUST_GDA2020/LGA_2024_AUST_GDA2020.shp") |>
rename_with(~ paste0('lga_', .x)) Reading layer `LGA_2024_AUST_GDA2020' from data source
`/Users/shivanathirumoorthy/Documents/scratch_projects/lga_pop_growth/data/LGA_2024_AUST_GDA2020/LGA_2024_AUST_GDA2020.shp'
using driver `ESRI Shapefile'
Simple feature collection with 566 features and 8 fields (with 19 geometries empty)
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 96.81695 ymin: -43.7405 xmax: 167.998 ymax: -9.142163
Geodetic CRS: GDA2020
I then filter the LGA boundaries to create a table with only LGAs that are in Greater Sydney.
More specifically, I look for LGAs that have more than 50% of their surface area inside the Greater Sydney GCCSA.
#Obtaining intersections between LGA and Greater Sydney GCCSA
gccsa_lga_syd_intersection <- lga_raw |>
filter(lga_LGA_CODE24 != '19399') |> #filtering out Unincorporated NSW
st_intersection(gccsa_syd) |>
st_make_valid() |> #Repairing geometry
rename(intersection_geometry = lga_geometry) |> #renaming geometry column
mutate(intersection_sqkm = as.double(st_area(intersection_geometry) / (10^6))) |>
mutate(intersection_share_lga = round(intersection_sqkm/lga_AREASQKM, 2)) # Calculating intersection as share of LGA area, roundedI also present this table that shows each LGA that touches the Greater Sydney GCCSA, and shows the share of its surface area that is inside the Greater Sydney GCCSA.
gccsa_lga_syd_intersection |>
st_drop_geometry() |>
#glimpse() %>%
select(lga_LGA_NAME24, intersection_share_lga) |>
rename(`LGA name` = lga_LGA_NAME24,
`Share of area in Greater Sydney (%)` = intersection_share_lga) |>
arrange(desc(`Share of area in Greater Sydney (%)`), `LGA name`) |>
datatable(extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = c('csv'))) |>
formatPercentage('Share of area in Greater Sydney (%)', 0)
I then use a 50% cutoff to decide if an LGA is in Greater Sydney. Note however, that this doesn’t really matter. Any cutoff between 94% and 13% would have the same effect.
gccsa_lga_syd_intersection |>
st_drop_geometry() |>
#glimpse() %>%
select(lga_LGA_NAME24, intersection_share_lga) |>
ggplot(aes(x = reorder(lga_LGA_NAME24, -intersection_share_lga), y = intersection_share_lga)) +
geom_col() +
geom_hline(yintercept = 0.5, linetype = "dashed") +
annotate("text",
x = 38,
y = 0.55,
label = "Cutoff") +
ggtitle("Share of LGA within Greater Sydney (GCCSA)") +
theme_classic() +
scale_y_continuous(name = "Share of surface area (%)",
labels = scales::percent, limits = c(0,1),
breaks = seq(from = 0, to = 1, by = 0.25),
expand = c(0,0)) +
scale_x_discrete(name = "LGA") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
I apply the 50% land area filter now.
#Filtering out only LGAs 50% inside Sydney GCCSA
lga_syd <- gccsa_lga_syd_intersection |>
filter(intersection_share_lga > 0.5) |>
select(lga_LGA_CODE24) |>
st_drop_geometry() |>
left_join(lga_raw, by = join_by(lga_LGA_CODE24)) |>
st_as_sf()I import population data for each LGA. This has population categorised into 5 year buckets.
Data is sourced from the ABS Regional population by age and sex.
This extract represents population as at 30 Jun 2024.
population_lgas <- read_excel("./data/32350DS0006_2001-24.xlsx", sheet = 'Table 3', skip = 5) |>
#Cleaning header and footer entries
filter(`...1` != "Year") |>
filter(`...1` != "(a) Based on 2024 LGA boundaries.") |>
filter(`...1` != "© Commonwealth of Australia") |>
pivot_longer(cols = -starts_with("..."),
names_to = "age_range",
values_to = "population") |>
#Repairing column name
rename(Year = ...1,
`S/T code` = ...2,
`S/T name`= ...3,
`LGA code` = ...4,
`LGA name` = ...5) |>
mutate(Year = as.integer(Year),
population = as.integer(population))We then create a new field with numeric age_range columns to simplify the output
I then clean the data by: 1. Categorising the start and end of age ranges as numeric values. For example, the age range “0-4” is coded as age_range_lower = 0 and age_range_upper = 4 2. Create a version of the dataset without the “Total persons” entry. (population_lgas_analytics) I later infer total persons by simply adding up the population at all available age bands.
clean_range <- population_lgas |>
filter(!age_range %in% c("85 and over", "Total persons")) |>
separate_wider_delim(age_range,
delim = "–",
names = c("age_range_lower", "age_range_higher"),
cols_remove = FALSE) |>
mutate(age_range_lower = as.integer(age_range_lower),
age_range_higher = as.integer(age_range_higher))
dirty_range <- population_lgas |>
filter(age_range %in% c("85 and over", "Total persons")) |>
mutate(age_range_lower = if_else(age_range == "85 and over", as.integer(85), as.integer(0))) |>
mutate(age_range_higher = as.integer(999))
population_lgas_clean <- clean_range |>
bind_rows(dirty_range)
population_lgas_analytics <- population_lgas_clean |>
filter(age_range != "Total persons")Cleaning up
rm(clean_range)
rm(dirty_range)First we join our LGA list with the population data to build a breakdown of population for all LGAs in Greater Sydney.
lga_analytics_syd <- lga_syd |>
select(lga_LGA_CODE24) |>
st_drop_geometry() |>
as_tibble() %>%
left_join(population_lgas_analytics, by = join_by(lga_LGA_CODE24 == `LGA code`)) In this section I analyse the growth of the total population in Greater Sydney.
First, I calculate the growth rate of the entire Greater Sydney. Second, I calculate the growth rate for each LGA.
This table looks at the population growth of the entirety of Greater Sydney.
In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024.
This allows us to look at: 1. growth_rate_10y - The cumulative population growth rate for the 10 years prior to each snapshot 2. CAGR_10y - the compound annual growth rate for the 10 years prior to each snapshot
stats_total_population <- lga_analytics_syd |>
filter(Year %in% c(2024, 2014, 2004)) %>%
group_by(Year) |>
summarise(population = sum(population)) %>%
mutate(decade_lag = lag(population)) |>
mutate(growth_10y = population - decade_lag,
growth_rate_10y = growth_10y/decade_lag,
CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1)In the interest of representing data directly, I provide a table of this data that includes: 1. Population in 2024 2. Population in 2014 3. Population growth rate over 10 years (%) 4. Population growth rate converted into a CAGR
stats_total_population %>%
filter(Year == 2024) |>
select(population, decade_lag, growth_rate_10y, CAGR_10y) |>
rename(`Population (2024)` = population,
`Population (2014)` = decade_lag,
`Population Growth Rate 2014-2024 (%)` = growth_rate_10y,
`CAGR 2014-2024 (%)` = CAGR_10y) |>
datatable(extensions = 'Buttons', options =
list(dom = 'Bfrtip', buttons = c('csv'))) %>%
formatPercentage('Population Growth Rate 2014-2024 (%)', 2) %>%
formatPercentage('CAGR 2014-2024 (%)', 2) %>%
formatRound("Population (2024)", interval = 3,mark = ",", digits = 0) |>
formatRound("Population (2014)", interval = 3,mark = ",", digits = 0)This table looks at the population growth for each LGA in Greater Sydney.
In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024.
This allows us to look at: 1. growth_rate_10y - The cumulative population growth rate for the 10 years prior to each snapshot 2. CAGR_10y - the compound annual growth rate for the 10 years prior to each snapshot
stats_total_population_lga <- lga_analytics_syd |>
filter(Year %in% c(2024, 2014, 2004)) %>%
group_by(Year, lga_LGA_CODE24, `LGA name` ) |>
summarise(population = sum(population)) |>
ungroup() |>
arrange(lga_LGA_CODE24, `LGA name`, Year) |>
group_by(lga_LGA_CODE24, `LGA name`) |>
mutate(decade_lag = lag(population)) |>
mutate(growth_10y = population - decade_lag,
growth_rate_10y = growth_10y/decade_lag,
CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1) |>
ungroup()We can then construct a naughty/nice list to look at population growth by LGA.
In the interest of representing data directly, I provide a table of this data that includes: 1. Population in 2024 2. Population in 2014 3. Population growth rate over 10 years (%) 4. Population growth rate converted into a CAGR
I present the LGAs in order of ascending population growth rate. As such, the naughtiest (NIMBYiest?) LGAs appear up top.
stats_total_population_lga |>
filter(Year == 2024) |>
select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |>
arrange(CAGR_10y) %>%
rename(`Population (2024)` = population,
`Population (2014)` = decade_lag,
`Population Growth Rate 2014-2024 (%)` = growth_rate_10y,
`CAGR 2014-2024 (%)` = CAGR_10y) |>
datatable(extensions = 'Buttons', options =
list(dom = 'Bfrtip', buttons = c('csv'))) %>%
formatPercentage('Population Growth Rate 2014-2024 (%)', 2) |>
formatPercentage('CAGR 2014-2024 (%)', 2) %>%
formatRound("Population (2024)", interval = 3,mark = ",", digits = 0) |>
formatRound("Population (2014)", interval = 3,mark = ",", digits = 0)We can then construct a chart showing population growth by LGA, sorting by growth rate.
syd_average_growth <- stats_total_population %>%
filter(Year == 2024) %>%
select(growth_rate_10y) %>%
extract2(1)
total_population_growth_chart <- stats_total_population_lga |>
filter(Year == 2024) |>
select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |>
arrange(CAGR_10y) %>%
rename(`Population (2024)` = population,
`Population (2014)` = decade_lag,
`Population Growth Rate 2014-2024 (%)` = growth_rate_10y,
`CAGR 2014-2024 (%)` = CAGR_10y) |>
ggplot(aes(x = reorder(`LGA name`, -`Population Growth Rate 2014-2024 (%)`),
y = `Population Growth Rate 2014-2024 (%)`)) +
geom_col() +
geom_hline(yintercept = syd_average_growth, linetype = "dashed") +
ggtitle("Population - Growth (2014-2024) by LGA") +
annotate("text",
x = 25,
y = syd_average_growth + 0.05,
label = "Greater Syd Average") +
scale_y_continuous(name = "Population Growth (%)",
labels = scales::percent, limits = c(-0.25,1.25),
breaks = seq(from = -0.25, to = 1.25, by = 0.25)) +
scale_x_discrete(name = "LGA") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))We can present this chart and save a copy as a PNG.
total_population_growth_chart %>%
ggsave(filename = "output/population_growth_by_lga_2014_2014.png", plot = ., width = 16, height = 12, units = "cm")
total_population_growth_chartThe analysis in this section is motivated by a desire to see what happened to population of children in Sydney.
We measure Under 20s as all population 0-19, because the ABS gives us data in those categories. We don’t actually have separate data for “children”.
First we measure the overall population of children in all LGAs observed
This table looks at the population growth of under 20s the entirety of Greater Sydney.
In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024.
This allows us to look at: 1. growth_rate_10y - The cumulative population growth rate for the 10 years prior to each snapshot 2. CAGR_10y - the compound annual growth rate for the 10 years prior to each snapshot
stats_under_20 <- lga_analytics_syd |>
filter(age_range_higher <= 19) |>
filter(Year %in% c(2024, 2014, 2004)) %>%
group_by(Year) |>
summarise(population = sum(population)) %>%
mutate(decade_lag = lag(population)) |>
mutate(growth_10y = population - decade_lag,
growth_rate_10y = growth_10y/decade_lag,
CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1)In the interest of representing data directly, I provide a table of this data that includes: 1. Population in 2024 2. Population in 2014 3. Population growth rate over 10 years (%) 4. Population growth rate converted into a CAGR
stats_under_20 %>%
filter(Year == 2024) |>
select(population, decade_lag, growth_rate_10y, CAGR_10y) |>
rename(`Population - Under 20 (2024)` = population,
`Population - Under 20 (2014)` = decade_lag,
`Population - Under 20 - Growth Rate 2014-2024 (%)` = growth_rate_10y,
`CAGR - Under 20 - 2014-2024 (%)` = CAGR_10y) |>
datatable(extensions = 'Buttons', options =
list(dom = 'Bfrtip', buttons = c('csv'))) %>%
formatPercentage('Population - Under 20 - Growth Rate 2014-2024 (%)', 2) %>%
formatPercentage('CAGR - Under 20 - 2014-2024 (%)', 2) %>%
formatRound("Population - Under 20 (2024)", interval = 3,mark = ",", digits = 0) |>
formatRound("Population - Under 20 (2014)", interval = 3,mark = ",", digits = 0)This table looks at the population growth in Under 20s for each LGA in Greater Sydney.
In particular, it snapshots Sydney at 3 time periods 2004, 2014 and 2024.
This allows us to look at: 1. growth_rate_10y - The cumulative population growth rate for the 10 years prior to each snapshot 2. CAGR_10y - the compound annual growth rate for the 10 years prior to each snapshot
stats_under_20_lga <- lga_analytics_syd |>
filter(age_range_higher <= 19) |>
filter(Year %in% c(2024, 2014, 2004)) %>%
group_by(Year, lga_LGA_CODE24, `LGA name` ) |>
summarise(population = sum(population)) |>
ungroup() |>
arrange(lga_LGA_CODE24, `LGA name`, Year) |>
group_by(lga_LGA_CODE24, `LGA name`) |>
mutate(decade_lag = lag(population)) |>
mutate(growth_10y = population - decade_lag,
growth_rate_10y = growth_10y/decade_lag,
CAGR_10y = ((1+growth_rate_10y)^(1/10)) - 1) |>
ungroup()We can then construct a naughty/nice list to look at under 20s population growth by LGA.
In the interest of representing data directly, I provide a table of this data that includes: 1. Population - Under 20s - in 2024 2. Population - Under 20s - in 2014 3. Population - Under 20 - growth rate over 10 years (%) 4. Population - Under 20 - growth rate converted into a CAGR
I present the LGAs in order of ascending population growth rate. We can therefore see where “a city without grandchildren” will happen first.
stats_under_20_lga |>
filter(Year == 2024) |>
select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |>
arrange(CAGR_10y) %>%
rename(`Population - Under 20 (2024)` = population,
`Population - Under 20 (2014)` = decade_lag,
`Population - Under 20 - Growth Rate 2014-2024 (%)` = growth_rate_10y,
`CAGR - Under 20 - 2014-2024 (%)` = CAGR_10y) |>
datatable(extensions = 'Buttons', options =
list(dom = 'Bfrtip', buttons = c('csv'))) %>%
formatPercentage('Population - Under 20 - Growth Rate 2014-2024 (%)', 2) |>
formatPercentage('CAGR - Under 20 - 2014-2024 (%)', 2) %>%
formatRound("Population - Under 20 (2024)", interval = 3,mark = ",", digits = 0) |>
formatRound("Population - Under 20 (2014)", interval = 3,mark = ",", digits = 0)We can then construct a chart showing under 20s population growth by LGA, sorting by growth rate.
syd_under_20_average_growth <- stats_under_20 %>%
filter(Year == 2024) %>%
select(growth_rate_10y) %>%
extract2(1)
under_20_population_growth_chart <- stats_under_20_lga |>
filter(Year == 2024) |>
select(`LGA name`, population, decade_lag, growth_rate_10y, CAGR_10y) |>
arrange(CAGR_10y) %>%
rename(`Population (2024)` = population,
`Population (2014)` = decade_lag,
`Population Growth Rate 2014-2024 (%)` = growth_rate_10y,
`CAGR 2014-2024 (%)` = CAGR_10y) |>
ggplot(aes(x = reorder(`LGA name`, -`Population Growth Rate 2014-2024 (%)`),
y = `Population Growth Rate 2014-2024 (%)`)) +
geom_col() +
geom_hline(yintercept = syd_under_20_average_growth, linetype = "dashed") +
ggtitle("Population - Under 20 - Growth (2014-2024) by LGA") +
annotate("text",
x = 25,
y = syd_average_growth + 0.05,
label = "Greater Syd Average") +
scale_y_continuous(name = "Population Growth (%)",
labels = scales::percent, limits = c(-0.25,1.25),
breaks = seq(from = -0.25, to = 1.25, by = 0.25)) +
scale_x_discrete(name = "LGA") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))under_20_population_growth_chart %>%
ggsave(filename = "output/population_under_20_growth_by_lga_2014_2014.png", plot = ., width = 16, height = 12, units = "cm")
under_20_population_growth_chartIn response to some group chat questions, I include some extra analysis on under 20s in Burwood. Specifically I look into the evolution of the under 20s population in Burwood.
burwood_under_20_bars <- population_lgas_analytics %>%
filter(`LGA name` == "Burwood") %>%
filter(age_range_higher <= 19) |>
arrange(age_range_lower, Year) %>%
#glimpse() %>%
ggplot(aes(x = Year, y = population, fill = reorder(age_range, -age_range_lower))) +
geom_col(position = "stack") +
ggtitle("Burwood - population under 20 - 2001 to 2024") +
scale_y_continuous(name = "Population under 20", limits = c(0,8000),
labels = scales::comma,
expand = c(0,0)) +
scale_fill_discrete(guide = guide_legend(title = "Age range")) +
theme_classic() I then visualise the number of Under 20s in Burwood based on the 5-year age buckets the ABS provides.
burwood_under_20_bars %>%
ggsave(filename = "output/burwood_under_20_bars.png", plot = ., width = 16, height = 12, units = "cm")
burwood_under_20_barsTo make the evolution in each 5-year category more obvious, I generate the following line chart of the same data.
burwood_under_20_line <- population_lgas_analytics %>%
filter(`LGA name` == "Burwood") %>%
filter(age_range_higher <= 19) |>
arrange(age_range_lower, Year) %>%
#glimpse() %>%
ggplot(aes(x = Year, y = population, colour = reorder(age_range, - age_range_lower))) +
geom_line() +
ggtitle("Burwood - population under 20 - 2001 to 2024") +
scale_y_continuous(name = "Population under 20", limits = c(1200,2400),
labels = scales::comma,
expand = c(0,0)) +
scale_colour_discrete(guide = guide_legend(title = "Age range")) +
theme_classic() burwood_under_20_line %>%
ggsave(filename = "output/burwood_under_20_line.png", plot = ., width = 16, height = 12, units = "cm")
burwood_under_20_lineIn response to some group chat questions, I include some extra analysis on under 20s in Randwick. Specifically I look into the evolution of the under 20s population in Burwood.
randwick_under_20_bars <- population_lgas_analytics %>%
filter(`LGA name` == "Randwick") %>%
filter(age_range_higher <= 19) |>
arrange(age_range_lower, Year) |>
ggplot(aes(x = Year, y = population, fill = reorder(age_range, -age_range_lower))) +
geom_col(position = "stack") +
ggtitle("Randwick - population under 20 - 2001 to 2024") +
scale_y_continuous(name = "Population under 20", limits = c(0,40000),
labels = scales::comma,
expand = c(0,0)) +
scale_fill_discrete(guide = guide_legend(title = "Age range")) +
theme_classic() I then visualise the number of Under 20s in Randwick based on the 5-year age buckets the ABS provides.
randwick_under_20_bars %>%
ggsave(filename = "output/randwick_under_20_bars.png", plot = ., width = 16, height = 12, units = "cm")
randwick_under_20_barsTo make the evolution in each 5-year category more obvious, I generate the following line chart of the same data.
randwick_under_20_line <- population_lgas_analytics %>%
filter(`LGA name` == "Randwick") %>%
filter(age_range_higher <= 19) |>
arrange(age_range_lower, Year) %>%
#glimpse() %>%
ggplot(aes(x = Year, y = population, colour = reorder(age_range, - age_range_lower))) +
geom_line() +
ggtitle("Randwick - population under 20 - 2001 to 2024") +
scale_y_continuous(name = "Population under 20", limits = c(5000,10000),
labels = scales::comma,
expand = c(0,0)) +
scale_colour_discrete(guide = guide_legend(title = "Age range")) +
theme_classic() randwick_under_20_line %>%
ggsave(filename = "output/randwick_under_20_line.png", plot = ., width = 16, height = 12, units = "cm")
randwick_under_20_line