This assignment visualises the demographic structure of Singapore's population by age cohort and by planning area in 2019. The objective of the visualisation is to examine Singapore's population and dwelling conditions.
Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data
The dataset contains 984,656 rows of data on Planning Area, Subzone, Age Group, Sex and Type of Dwelling from 2011 to 2019.
| Variable | Description |
|---|---|
| PA | Planning Area |
| SZ | Subzone |
| AG | Age Group |
| Sex | Sex |
| TOD | Type of Dwelling |
| Pop | Resident Count |
| Time | Time/Period |
Design of the pyramid chart: I had only designed pyramid charts in Tableau, so I searched for how to build one with ggplot2 and found some excellent examples on the Internet, which were a great help.
Executing the pyramid chart: after plotting it, I realized there was an issue with the order of the x-axis (i.e., age group). Because the age-group labels are text, the x-axis is sorted as text rather than by age, so the order comes out illogical, as:
…
50 to 59
5 to 9
45 to 49
…
Hence, I added an ifelse() call to change 5_to_9 to 05_to_9, which solved the ordering issue.
popGHcens$AG <- ifelse(popGHcens$AG == "5_to_9", "05_to_9", popGHcens$AG)
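An alternative to renaming a single label is to turn the age groups into a factor whose levels are sorted by the lower bound of each age band. The following is a minimal sketch and not the approach used in this assignment. The vector ag holds a few sample labels in the dataset's format, and the names lower_bound, ag_levels and ag_factor are made up for illustration.

```r
# Sample age-group labels in the dataset's "<low>_to_<high>" style (illustrative only)
ag <- c("50_to_54", "5_to_9", "45_to_49", "0_to_4", "90_and_over", "10_to_14")

# Keep only the text before the first underscore: the lower bound of each band
lower_bound <- as.integer(sub("_.*$", "", ag))

# Sort the labels by that lower bound and use the result as factor levels
ag_levels <- unique(ag[order(lower_bound)])
ag_factor <- factor(ag, levels = ag_levels)

# Levels now run from the youngest band to the oldest
levels(ag_factor)
```

ggplot2 draws a discrete axis in factor-level order, so a column converted this way would not need any zero-padding.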
For the planning areas, I decided to use a ternary plot, because it does not need to show every planning area as an axis label.
I proposed two charts: a population pyramid and a ternary plot of the population structure across planning areas.
dplyr provides a consistent set of verbs that help you solve the most common data manipulation challenges. I used filter() and select() to extract the 2019 data.
ggplot2 is the package used for data visualization.
ggtern is a package that extends ggplot2 with the capability to plot ternary diagrams for a subset of the ggplot2 geometries.
# Install any package that is missing, then attach it
packages <- c('dplyr', 'ggplot2', 'ggtern')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
data <- read.csv("data.csv")
#View(data)
Since the original data covers 2011 to 2019, I first select the age group, population and sex columns for 2019 for the pyramid chart.
popGHcens <- data %>%
  filter(Time == "2019") %>%
  select(AG, Pop, Sex)
I also created another subset for the ternary plot, in which I select the planning area, age group, population and sex columns for 2019.
data_pz <- data %>%
  filter(Time == "2019") %>%
  select(PA, AG, Pop, Sex)
A pyramid chart is two bar charts with the axes flipped, so the bars for the male population are given a negative sign and go to the left. To put the x-axis in a logical order, I also have to rename the age group “5_to_9”.
popGHcens$Pop <- ifelse(popGHcens$Sex == "Males", -1*popGHcens$Pop, popGHcens$Pop)
popGHcens$AG <- ifelse(popGHcens$AG == "5_to_9", "05_to_9", popGHcens$AG)
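Because the 2019 subset should still hold one row per subzone and dwelling type, each combination of age group and sex appears many times, and geom_bar(stat = "identity") stacks those rows into one bar. If one row per bar is preferred, the counts can be summed first. This is a minimal sketch with base R's aggregate() on a made-up toy data frame. The names pop_toy and pop_totals are illustrative and the numbers are not real data.

```r
# Toy stand-in for popGHcens after the sign flip (values are invented)
pop_toy <- data.frame(
  AG  = c("0_to_4", "0_to_4", "0_to_4", "05_to_9", "05_to_9"),
  Sex = c("Males", "Males", "Females", "Males", "Females"),
  Pop = c(-10, -20, 25, -15, 40)
)

# Sum Pop within every age group / sex combination
pop_totals <- aggregate(Pop ~ AG + Sex, data = pop_toy, FUN = sum)
pop_totals
```

The summed table has one row per age group and sex, which also makes the totals easy to inspect before plotting.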
To simplify the age groups, I define three broad groups: Young (aged 0 to 24), Economically Active (aged 25 to 69) and Aged (aged 70 and over).
young_groups <- c("0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24")
aged_groups <- c("70_to_74", "75_to_79", "80_to_84", "85_to_89", "90_and_over")
data_pz$Group <- ifelse(data_pz$AG %in% young_groups, "Young",
                        ifelse(data_pz$AG %in% aged_groups, "Aged", "Economically Active"))
After that, I create another dataset that groups the population by planning area and computes the share of each age group within each planning area.
data_pz_percent <- data_pz %>%
  group_by(PA) %>%
  summarize(young_percent = sum(Pop[Group == "Young"]) / sum(Pop),
            Econ_active_percent = sum(Pop[Group == "Economically Active"]) / sum(Pop),
            aged_percent = 1 - young_percent - Econ_active_percent)
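To make the share calculation concrete, here is a small worked example on invented numbers. It is only a sketch of the same idea in base R, so it runs without extra packages. The planning areas "A" and "B", the data frame toy, and the names totals and shares are all hypothetical.

```r
# Invented populations for two hypothetical planning areas
toy <- data.frame(
  PA    = c("A", "A", "A", "B", "B", "B"),
  Group = rep(c("Young", "Economically Active", "Aged"), times = 2),
  Pop   = c(20, 60, 20, 10, 70, 20)
)

# Total population per planning area (rows) and age group (columns)
totals <- tapply(toy$Pop, list(toy$PA, toy$Group), sum)

# Divide each row by its row total to get the share of each group
shares <- prop.table(totals, margin = 1)
shares

# The three shares of every planning area add up to 1
rowSums(shares)
```

Since the three shares always sum to 1, each planning area can be placed as a single point in a ternary diagram.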
With the datasets ready, I generate the visualizations with ggplot2 and ggtern.
pyramidGH <- ggplot(popGHcens, aes(x = AG, y = Pop, fill = Sex)) +
  geom_bar(data = subset(popGHcens, Sex == "Females"), stat = "identity") +
  geom_bar(data = subset(popGHcens, Sex == "Males"), stat = "identity") +
  # Breaks are resident counts, so label them in thousands (150000 -> "150k")
  scale_y_continuous(breaks = seq(-150000, 150000, 50000),
                     labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))), "k")) +
  coord_flip()
pyramidGH
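The mirrored axis labels can also be derived from the break positions themselves, so that the negative male counts are displayed as positive numbers. This is a minimal sketch, assuming the breaks are resident counts that should be shown in thousands. The names axis_breaks and axis_labels are illustrative.

```r
# Break positions on the flipped population axis (negative side = males)
axis_breaks <- seq(-150000, 150000, 50000)

# Drop the sign and express each break in thousands
axis_labels <- paste0(abs(axis_breaks) / 1000, "k")
axis_labels
```

Deriving the labels from the breaks keeps the two in step if the range of the axis is changed later.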
# Column names must match those created in data_pz_percent
ggtern(data = data_pz_percent,
       aes(x = young_percent, y = Econ_active_percent, z = aged_percent)) +
  geom_point() +
  labs(title = "Population Structure According to Planning Area, 2019") +
  theme_rgbw()
The pyramid chart gives the following insight:
The pyramid chart reveals the population structure by sex for each age interval.
The ternary plot gives the following insight:
The ternary plot reveals the share that each age group takes up in a planning area. Each dot represents one planning area, and its position gives the proportions of the Young, Economically Active and Aged groups in that area.