This report aims to visualise the demographic structure of Singapore population by age and planning area in 2019, using the population trend published by Singapore Department of Statistics. The dataset is published on the following website: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data.
The purpose of the visualisation is to study the proportion of the different age groups across the various planning areas. This will allow users, such as policymakers or campaign managers, to understand the needs of each planning area based on the demographics.
The raw dataset contains 19 classes of age groups in intervals of 5 years, starting from 0 to 4, to 90 and above. We will cluster the age groups according to the following table for more meaningful analysis. This clustering will also provide a consistent range of 25 years for the first 3 clusters.
| Original Age Group | Recoded Age Group | Remarks |
|---|---|---|
| 0 to 24 | 1. Youth | This group will cover newborns all the way to the tertiary students. |
| 25 to 49 | 2. Young Workforce | This group will cover the graduating students entering the workforce and young parents. |
| 50 to 75 | 3. Mature Workforce | This group will cover the mature workforce and those who have just retired. |
| 75 and above | 4. Elderly | This group will cover the retirees and elderlies in their silver years. |
The data has 55 planning areas, and some of them may have a low or zero resident count. Based on the objective and output of each visualisation, we may limit the planning areas and perform a cutoff by total resident count for a more meaningful analysis.
The raw dataset is also in the “long” format, where the planning areas, age group and population are all in individual columns. We will perform data wrangling to transform it into a “wide” format (i.e. population count by age groups in individual columns) as input into the various plot functions.
The proposed sketch design will mainly focus on visualising the propotion of age groups in each planning area in the form of a heatmap and parallel coordinates plot.
packages <- c('tidyverse', 'ggplot2', 'scales', 'parcoords', 'heatmaply', 'knitr')
for (p in packages){
if (!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}tidyverse is a set of packages to perform data wrangling and exploration.ggplot2 is a data visualisation package for visualisation and statistical programming language R.scales is used to provide methods to determine breaks and labels for axes and legends.parcoords is used to build parallel coordinates charts.heatmaply is used to build heat maps.knitr is used to enable integration of certain R codes into HTML.Check the number of unique values for each column.
check <- rbind(sapply(data1, function(x) length(unique(x))))
kable(check, format = "markdown", align='l')| PA | SZ | AG | Sex | TOD | Pop | Time |
|---|---|---|---|---|---|---|
| 55 | 323 | 19 | 2 | 8 | 261 | 9 |
We noted a total of 55 planning areas (PA) and 19 age groups (AG).
Recode the age groups in accordance to point 1.2.
data2 <- data1
data2$'Age Group' <- recode(data2$AG,
"0_to_4"="1. Youth",
"5_to_9"="1. Youth",
"10_to_14"="1. Youth",
"15_to_19"="1. Youth",
"20_to_24"="1. Youth",
"25_to_29"="2. Young Workforce",
"30_to_34"="2. Young Workforce",
"35_to_39"="2. Young Workforce",
"40_to_44"="2. Young Workforce",
"45_to_49"="2. Young Workforce",
"50_to_54"="3. Mature Workforce",
"55_to_59"="3. Mature Workforce",
"60_to_64"="3. Mature Workforce",
"65_to_69"="3. Mature Workforce",
"70_to_74"="3. Mature Workforce",
"75_to_79"="4. Elderly",
"80_to_84"="4. Elderly",
"85_to_89"="4. Elderly",
"90_and_over"="4. Elderly")Next we will remove rows that have 0 population count, and only retain the rows for year 2019. Check the number of unique values for each column again.
data3 <- data2[(data2$Time==2019 & data2$Pop!=0),]
check <- rbind(sapply(data3, function(x) length(unique(x))))
kable(check, format = "markdown", align='l')| PA | SZ | AG | Sex | TOD | Pop | Time | Age Group |
|---|---|---|---|---|---|---|---|
| 42 | 234 | 19 | 2 | 7 | 207 | 1 | 4 |
Select 5 random rows to check the dataset.
| PA | SZ | AG | Sex | TOD | Pop | Time | Age Group |
|---|---|---|---|---|---|---|---|
| Choa Chu Kang | Choa Chu Kang Central | 35_to_39 | Males | HDB 4-Room Flats | 220 | 2019 | 2. Young Workforce |
| Choa Chu Kang | Teck Whye | 75_to_79 | Males | HDB 4-Room Flats | 130 | 2019 | 4. Elderly |
| Queenstown | Holland Drive | 5_to_9 | Males | HDB 3-Room Flats | 70 | 2019 | 1. Youth |
| Outram | Pearl’s Hill | 80_to_84 | Males | HDB 3-Room Flats | 20 | 2019 | 4. Elderly |
| Geylang | Macpherson | 50_to_54 | Males | Condominiums and Other Apartments | 30 | 2019 | 3. Mature Workforce |
After inspecting the dataset, we noted that the data has correctly been filtered for year 2019, and the age groups have been categorised correctly. We will next explore the dataset.
Firstly, we will understand how is the population distributed across the planning areas in Singapore.
We will prepare the dataset for the bar chart to present the total population of each planning area in descending order.
barchartable <- xtabs(data3$Pop~data3$PA)
barchartable <- cbind(barchartable)
barchartable <- barchartable[order(-barchartable[,1]),]
barchartable <- as.data.frame(barchartable)
names(barchartable)[1] <- "Pop"
barchartable <- tibble::rownames_to_column(barchartable, "PA")
barchartable$percent <- paste(format(round((barchartable$Pop/sum(barchartable$Pop)*100),2),
nsmall=2),"%",sep = "")Visualise the bar chart with percentiles at 25%, 50% and 75% using ggplot.
ggplot(barchartable, aes(x = reorder(PA, Pop), y = Pop)) +
geom_bar(stat = "identity", fill = "lightblue") +
geom_text(aes(label = paste("",Pop," (",percent,")",sep = "")),
hjust = 0, size = 2.5) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 330000),labels = comma) +
coord_flip() +
ggtitle("Bar Chart of Total Population per Planning Area in 2019") +
labs(y="Total Population Count", x = "Planning Area") +
geom_hline(yintercept=quantile(barchartable$Pop,0.75), color="darkred") +
geom_hline(yintercept=quantile(barchartable$Pop,0.50), color="darkred") +
geom_hline(yintercept=quantile(barchartable$Pop,0.25), color="darkred") +
annotate("text",40,quantile(barchartable$Pop,0.75),
label = paste("75th Percentile: ",
quantile(barchartable$Pop,0.75)), size=3, color = "darkred") +
annotate("text",38,quantile(barchartable$Pop,0.50),
label = paste("50th Percentile: ",
quantile(barchartable$Pop,0.50)), size=3, color = "darkred") +
annotate("text",36,quantile(barchartable$Pop,0.25)+30000,
label = paste("25th Percentile: ",
quantile(barchartable$Pop,0.25)), size=3, color = "darkred") +
theme(
panel.background = element_blank(),
panel.grid.major.x = element_line(colour = "grey"),
axis.text.y.left = element_text(size = 8),
axis.text.x.bottom = element_text(size = 8),
axis.ticks.y = element_blank(),
axis.line.y = element_line(),
axis.line.x = element_line(),
plot.title = element_text(size = 10, hjust = 0.5, face = 'bold')
)
We noted that the first interquartile range is very small, with total population ranging from 70 to 4,205 per planning area. We also noted the ranking of the larger planning areas as we may shortlist them for a deepdive later. Through a separate calculation, we noted that the 15 largest planning areas account for approximately 75% of the total population.
Next we will study the distribution of the population across the age groups using ggplot.
agegrouptable <- aggregate(data3$Pop, by=list('Age Group'=data3$`Age Group`),FUN=sum)
agegrouptable$proportion <- agegrouptable$x/sum(agegrouptable$x)
ggplot(agegrouptable, aes(x = `Age Group`, y = x )) +
geom_bar(position = "dodge", stat = "identity", fill = "lightblue", width = 0.8) +
scale_y_continuous(expand = c(0,0),limits = c(0, 1550000), labels = comma) +
geom_text(aes(label = paste(as.character(x),
" (",as.character(round(proportion,2)*100), "%)",sep = "")),
hjust = 0.5,vjust = -0.5, size = 2.8) +
ggtitle("Population distribution by Age Groups in 2019") +
labs(y="Total Population Count", x = "Age Group") +
theme(
panel.background = element_blank(),
panel.grid.major.y = element_line(colour = "grey"),
axis.text.y.left = element_text(size = 8),
axis.text.x.bottom = element_text(size = 8),
axis.ticks.x = element_blank(),
axis.line.y = element_line(),
axis.line.x = element_line(),
plot.title = element_text(size = 10, hjust = 0.5, vjust = 3, face = 'bold')
)
Based on the national distribution, we will expect the young workforce (37%) and mature workforce (31%) to form the bulk of each planning area. Conversely, the elderlies should form the minority (5%). This distribution also shows that Singapore is facing an ageing population, as the youth propotion (26%) is much lower than the next age band of young workforce (37%).
As we will also visualise the dataset in proportion of age groups for each planning area, we will compute the proportion in the dataframe.
data4 <- aggregate(data3$Pop, by=list('Planning Area'=data3$PA,data3$`Age Group`), FUN=sum)
data4 <- data4 %>%
spread(Group.2,x)
data4[is.na(data4)] <- 0
rownames(data4) <- data4[,1]
data4$Total <- rowSums(data4[,c(2:5)])
data4$'1. Youth Proportion' <- data4$`1. Youth`/data4$Total
data4$'2. Young Workforce Proportion' <- data4$`2. Young Workforce`/data4$Total
data4$'3. Mature Workforce Proportion' <- data4$`3. Mature Workforce`/data4$Total
data4$'4. Elderly Proportion' <- data4$`4. Elderly`/data4$Total
data4$'Total Proportion Check' <- rowSums(data4[,c(7:10)])
data4 <- arrange(data4, -Total)We will select 5 random rows to inspect the proportion.
| 1. Youth | 2. Young Workforce | 3. Mature Workforce | 4. Elderly | Total | 1. Youth Proportion | 2. Young Workforce Proportion | 3. Mature Workforce Proportion | 4. Elderly Proportion | Total Proportion Check | |
|---|---|---|---|---|---|---|---|---|---|---|
| Ang Mo Kio | 35880 | 56400 | 59000 | 13150 | 164430 | 0.2182084 | 0.3430031 | 0.3588153 | 0.0799732 | 1 |
| Changi | 590 | 730 | 440 | 30 | 1790 | 0.3296089 | 0.4078212 | 0.2458101 | 0.0167598 | 1 |
| Bukit Batok | 40220 | 57960 | 49630 | 6330 | 154140 | 0.2609316 | 0.3760218 | 0.3219800 | 0.0410666 | 1 |
| Singapore River | 770 | 1400 | 680 | 90 | 2940 | 0.2619048 | 0.4761905 | 0.2312925 | 0.0306122 | 1 |
| Sembawang | 29350 | 40110 | 23690 | 2920 | 96070 | 0.3055064 | 0.4175081 | 0.2465910 | 0.0303945 | 1 |
We noted that the proportion has been correctly computed, and the sum of the proportion of the 4 age groups add up to 1.
In order to get an overview of all the population count for all the planning areas and age groups, we will perform a visualisation with heatmaply. The heatmap is a useful way to cross examine the planning areas and age groups, and is also able to visualise all the planning areas in one plot meaningfully.
heatmapd1 <- select(data4, c(7:10))
heatmap_matrix <- data.matrix(heatmapd1)
heatmaply(heatmap_matrix,
Colv = FALSE,
fontsize_row = 8,
fontsize_col = 8,
k_row = 8,
xlab = "Age Groups",
ylab = "Planning Areas") %>%
layout(height=600,width=800)
As we are visualising the proportion (between 0 to 1) of each planning area by age group, we will not perform further transformation to the dataset. We will use k_row = 8 to set the row dendogram to 8 clusters. As the column pertains to the 4 age groups, we will remove the column dendogram. Further analysis will be performed under point 3 below.
The mosaicplot is a useful way to visualise the contingency table. The height of the boxes are proportionate to the age group in each plannning area, and the width of the boxes are proportionate to the population in each planning area.
We will also use the share = TRUE to display the Pearson residuals. If the residual is negative (or red), it means that the box has few observations than expected. Conversely, if the resigual is negative (or blue), it means that the box has more observations than expected.
We will also order the dataset to display the most populous area on the left, and least propulous area on the right.
mosaictable <- xtabs(data3$Pop~data3$PA+data3$`Age Group`)
mosaictable <- cbind(mosaictable, Total = rowSums(mosaictable))
mosaictable <- mosaictable[order(-mosaictable[,5]),]
mosaictable <- mosaictable[,4:1]
mosaicplot(mosaictable, las=3, shade=TRUE, main = "Mosaic Plot of Planning Areas and Age Groups")We noted that when we plot for all planning areas, we will not be able to read the plot meaningfully. Hence we will limit the planning area to the top 15 by total population count.
mosaictable <- mosaictable[0:15,1:4]
mosaicplot(mosaictable, las=3, shade=TRUE,
main = "Mosaic Plot of Planning Areas (top 15) and Age Groups")Parallel coordinate plot parcoords is a useful in visualising multivariate numerical data and to understand the relationship between them. In this study, we will use this plot to visualise the relationships between the proportion of the 4 age groups for each planning area.
To be in line with the mosaicplot, we will also limit the planning area to the top 15 by total population count.
data5 <- data4[0:15,]
parcoords(
data5[c(1,7:10)],
color = list(
colorBy = "Planning Area",
colorScale = "scaleOrdinal"),
withD3 = TRUE,
rownames = FALSE,
margin = 45
)parcoords is that there is generally a substantial spread in the proportion of youth and elderlies across the top 15 planning areas. However all 13 of the planning areas (with the exception of Punggol and Seng Kang) have very similar proportion of young workforce (35%-39%) and mature workforce (29%-36%).heatmaply
parcoords
mosaicplot