1. Overview

This report aims to visualise the demographic structure of Singapore population by age and planning area in 2019, using the population trend published by Singapore Department of Statistics. The dataset is published on the following website: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data.

1.1. Purpose of Visualisation

The purpose of the visualisation is to study the proportion of the different age groups across the various planning areas. This will allow users, such as policymakers or campaign managers, to understand the needs of each planning area based on the demographics.

1.2. Data and Design Challenges

The raw dataset contains 19 classes of age groups in intervals of 5 years, starting from 0 to 4, to 90 and above. We will cluster the age groups according to the following table for more meaningful analysis. This clustering will also provide a consistent range of 25 years for the first 3 clusters.

Original Age Group	Recoded Age Group	Remarks
0 to 24	1. Youth	This group will cover newborns all the way to the tertiary students.
25 to 49	2. Young Workforce	This group will cover the graduating students entering the workforce and young parents.
50 to 75	3. Mature Workforce	This group will cover the mature workforce and those who have just retired.
75 and above	4. Elderly	This group will cover the retirees and elderlies in their silver years.

The data has 55 planning areas, and some of them may have a low or zero resident count. Based on the objective and output of each visualisation, we may limit the planning areas and perform a cutoff by total resident count for a more meaningful analysis.

The raw dataset is also in the “long” format, where the planning areas, age group and population are all in individual columns. We will perform data wrangling to transform it into a “wide” format (i.e. population count by age groups in individual columns) as input into the various plot functions.

1.3. Proposed Sketch Design

The proposed sketch design will mainly focus on visualising the propotion of age groups in each planning area in the form of a heatmap and parallel coordinates plot.

img1_path <- "a4sketch2.jpg"
knitr::include_graphics(img1_path)

2. Step-by-step Preparation

2.1. Install and Load R Packages

packages <- c('tidyverse', 'ggplot2', 'scales', 'parcoords', 'heatmaply', 'knitr')

for (p in packages){
  if (!require(p,character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

tidyverse is a set of packages to perform data wrangling and exploration.
ggplot2 is a data visualisation package for visualisation and statistical programming language R.
scales is used to provide methods to determine breaks and labels for axes and legends.
parcoords is used to build parallel coordinates charts.
heatmaply is used to build heat maps.
knitr is used to enable integration of certain R codes into HTML.

2.2. Load the Data

data1 <- read_csv('respopagesextod2011to2019.csv')

2.3. Data Wrangling

Check the number of unique values for each column.

check <- rbind(sapply(data1, function(x) length(unique(x))))
kable(check, format = "markdown", align='l')

PA	SZ	AG	Sex	TOD	Pop	Time
55	323	19	2	8	261	9

We noted a total of 55 planning areas (PA) and 19 age groups (AG).

Recode the age groups in accordance to point 1.2.

data2 <- data1
data2$'Age Group' <- recode(data2$AG, 
                         "0_to_4"="1. Youth",
                         "5_to_9"="1. Youth",
                         "10_to_14"="1. Youth",
                         "15_to_19"="1. Youth",
                         "20_to_24"="1. Youth",
                         "25_to_29"="2. Young Workforce",
                         "30_to_34"="2. Young Workforce",
                         "35_to_39"="2. Young Workforce",
                         "40_to_44"="2. Young Workforce",
                         "45_to_49"="2. Young Workforce",
                         "50_to_54"="3. Mature Workforce",
                         "55_to_59"="3. Mature Workforce",
                         "60_to_64"="3. Mature Workforce",
                         "65_to_69"="3. Mature Workforce",
                         "70_to_74"="3. Mature Workforce",
                         "75_to_79"="4. Elderly",
                         "80_to_84"="4. Elderly",
                         "85_to_89"="4. Elderly",
                         "90_and_over"="4. Elderly")

Next we will remove rows that have 0 population count, and only retain the rows for year 2019. Check the number of unique values for each column again.

data3 <- data2[(data2$Time==2019 & data2$Pop!=0),]
check <- rbind(sapply(data3, function(x) length(unique(x))))
kable(check, format = "markdown", align='l')

PA	SZ	AG	Sex	TOD	Pop	Time	Age Group
42	234	19	2	7	207	1	4

Select 5 random rows to check the dataset.

check <- data3[sample(nrow(data3), 5), ]
kable(check, format = "markdown", align='l')

PA	SZ	AG	Sex	TOD	Pop	Time	Age Group
Choa Chu Kang	Choa Chu Kang Central	35_to_39	Males	HDB 4-Room Flats	220	2019	2. Young Workforce
Choa Chu Kang	Teck Whye	75_to_79	Males	HDB 4-Room Flats	130	2019	4. Elderly
Queenstown	Holland Drive	5_to_9	Males	HDB 3-Room Flats	70	2019	1. Youth
Outram	Pearl’s Hill	80_to_84	Males	HDB 3-Room Flats	20	2019	4. Elderly
Geylang	Macpherson	50_to_54	Males	Condominiums and Other Apartments	30	2019	3. Mature Workforce

After inspecting the dataset, we noted that the data has correctly been filtered for year 2019, and the age groups have been categorised correctly. We will next explore the dataset.

2.4. Exploring the dataset

2.4.1. Planning Areas

Firstly, we will understand how is the population distributed across the planning areas in Singapore.

We will prepare the dataset for the bar chart to present the total population of each planning area in descending order.

barchartable <- xtabs(data3$Pop~data3$PA)
barchartable <- cbind(barchartable)
barchartable <- barchartable[order(-barchartable[,1]),]
barchartable <- as.data.frame(barchartable)
names(barchartable)[1] <- "Pop"
barchartable <- tibble::rownames_to_column(barchartable, "PA")
barchartable$percent <- paste(format(round((barchartable$Pop/sum(barchartable$Pop)*100),2),
                                  nsmall=2),"%",sep = "")

Visualise the bar chart with percentiles at 25%, 50% and 75% using ggplot.

ggplot(barchartable, aes(x = reorder(PA, Pop), y = Pop)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  geom_text(aes(label = paste("",Pop," (",percent,")",sep = "")),
            hjust = 0, size = 2.5) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 330000),labels = comma) +
  coord_flip() +
  ggtitle("Bar Chart of Total Population per Planning Area in 2019") +
  labs(y="Total Population Count", x = "Planning Area") +
  geom_hline(yintercept=quantile(barchartable$Pop,0.75), color="darkred") + 
  geom_hline(yintercept=quantile(barchartable$Pop,0.50), color="darkred") + 
  geom_hline(yintercept=quantile(barchartable$Pop,0.25), color="darkred") +
  annotate("text",40,quantile(barchartable$Pop,0.75),
           label = paste("75th Percentile: ",
                         quantile(barchartable$Pop,0.75)), size=3, color = "darkred") +
  annotate("text",38,quantile(barchartable$Pop,0.50),
           label = paste("50th Percentile: ",
                         quantile(barchartable$Pop,0.50)), size=3, color = "darkred") +
  annotate("text",36,quantile(barchartable$Pop,0.25)+30000,
           label = paste("25th Percentile: ",
                         quantile(barchartable$Pop,0.25)), size=3, color = "darkred") +
  theme(
    panel.background = element_blank(),
    panel.grid.major.x = element_line(colour = "grey"),
    axis.text.y.left = element_text(size = 8),
    axis.text.x.bottom = element_text(size = 8),
    axis.ticks.y = element_blank(),
    axis.line.y = element_line(),
    axis.line.x = element_line(),
    plot.title = element_text(size = 10, hjust = 0.5, face = 'bold')
  )

We noted that the first interquartile range is very small, with total population ranging from 70 to 4,205 per planning area. We also noted the ranking of the larger planning areas as we may shortlist them for a deepdive later. Through a separate calculation, we noted that the 15 largest planning areas account for approximately 75% of the total population.

2.4.2. Age Groups

Next we will study the distribution of the population across the age groups using ggplot.

agegrouptable <- aggregate(data3$Pop, by=list('Age Group'=data3$`Age Group`),FUN=sum)
agegrouptable$proportion <- agegrouptable$x/sum(agegrouptable$x)

ggplot(agegrouptable, aes(x = `Age Group`, y = x )) +
  geom_bar(position = "dodge", stat = "identity", fill = "lightblue", width = 0.8) +
  scale_y_continuous(expand = c(0,0),limits = c(0, 1550000), labels = comma) +
  geom_text(aes(label = paste(as.character(x),
                              " (",as.character(round(proportion,2)*100), "%)",sep = "")),
            hjust = 0.5,vjust = -0.5, size = 2.8) +
  ggtitle("Population distribution by Age Groups in 2019") +
  labs(y="Total Population Count", x = "Age Group") +
  theme(
    panel.background = element_blank(),
    panel.grid.major.y = element_line(colour = "grey"),
    axis.text.y.left = element_text(size = 8),
    axis.text.x.bottom = element_text(size = 8),
    axis.ticks.x = element_blank(),
    axis.line.y = element_line(),
    axis.line.x = element_line(),
    plot.title = element_text(size = 10, hjust = 0.5, vjust = 3, face = 'bold')
  )

Based on the national distribution, we will expect the young workforce (37%) and mature workforce (31%) to form the bulk of each planning area. Conversely, the elderlies should form the minority (5%). This distribution also shows that Singapore is facing an ageing population, as the youth propotion (26%) is much lower than the next age band of young workforce (37%).

2.5. Preparing the Visualisation

As we will also visualise the dataset in proportion of age groups for each planning area, we will compute the proportion in the dataframe.

data4 <- aggregate(data3$Pop, by=list('Planning Area'=data3$PA,data3$`Age Group`), FUN=sum)
data4 <- data4 %>%
  spread(Group.2,x)
data4[is.na(data4)] <- 0
rownames(data4) <- data4[,1]
data4$Total <- rowSums(data4[,c(2:5)])
data4$'1. Youth Proportion' <- data4$`1. Youth`/data4$Total
data4$'2. Young Workforce Proportion' <- data4$`2. Young Workforce`/data4$Total
data4$'3. Mature Workforce Proportion' <- data4$`3. Mature Workforce`/data4$Total
data4$'4. Elderly Proportion' <- data4$`4. Elderly`/data4$Total
data4$'Total Proportion Check' <- rowSums(data4[,c(7:10)])
data4 <- arrange(data4, -Total)

We will select 5 random rows to inspect the proportion.

check <- data4[sample(nrow(data4), 5), ]
kable(check[2:11], format = "markdown", align='l')

	1. Youth	2. Young Workforce	3. Mature Workforce	4. Elderly	Total	1. Youth Proportion	2. Young Workforce Proportion	3. Mature Workforce Proportion	4. Elderly Proportion	Total Proportion Check
Ang Mo Kio	35880	56400	59000	13150	164430	0.2182084	0.3430031	0.3588153	0.0799732	1
Changi	590	730	440	30	1790	0.3296089	0.4078212	0.2458101	0.0167598	1
Bukit Batok	40220	57960	49630	6330	154140	0.2609316	0.3760218	0.3219800	0.0410666	1
Singapore River	770	1400	680	90	2940	0.2619048	0.4761905	0.2312925	0.0306122	1
Sembawang	29350	40110	23690	2920	96070	0.3055064	0.4175081	0.2465910	0.0303945	1

We noted that the proportion has been correctly computed, and the sum of the proportion of the 4 age groups add up to 1.

2.5.1. Heatmap

In order to get an overview of all the population count for all the planning areas and age groups, we will perform a visualisation with heatmaply. The heatmap is a useful way to cross examine the planning areas and age groups, and is also able to visualise all the planning areas in one plot meaningfully.

heatmapd1 <- select(data4, c(7:10))
heatmap_matrix <- data.matrix(heatmapd1)

heatmaply(heatmap_matrix,
          Colv = FALSE,
          fontsize_row = 8,
          fontsize_col = 8,
          k_row = 8,
          xlab = "Age Groups",
          ylab = "Planning Areas")  %>% 
  layout(height=600,width=800)

As we are visualising the proportion (between 0 to 1) of each planning area by age group, we will not perform further transformation to the dataset. We will use k_row = 8 to set the row dendogram to 8 clusters. As the column pertains to the 4 age groups, we will remove the column dendogram. Further analysis will be performed under point 3 below.

2.5.2. Mosaic Plot

The mosaicplot is a useful way to visualise the contingency table. The height of the boxes are proportionate to the age group in each plannning area, and the width of the boxes are proportionate to the population in each planning area.

We will also use the share = TRUE to display the Pearson residuals. If the residual is negative (or red), it means that the box has few observations than expected. Conversely, if the resigual is negative (or blue), it means that the box has more observations than expected.

We will also order the dataset to display the most populous area on the left, and least propulous area on the right.

mosaictable <- xtabs(data3$Pop~data3$PA+data3$`Age Group`)
mosaictable <- cbind(mosaictable, Total = rowSums(mosaictable))
mosaictable <- mosaictable[order(-mosaictable[,5]),]
mosaictable <- mosaictable[,4:1]

mosaicplot(mosaictable, las=3, shade=TRUE, main = "Mosaic Plot of Planning Areas and Age Groups")

We noted that when we plot for all planning areas, we will not be able to read the plot meaningfully. Hence we will limit the planning area to the top 15 by total population count.

mosaictable <- mosaictable[0:15,1:4]

mosaicplot(mosaictable, las=3, shade=TRUE,
           main = "Mosaic Plot of Planning Areas (top 15) and Age Groups")

2.5.3. Parallel Coordinates Plot

Parallel coordinate plot parcoords is a useful in visualising multivariate numerical data and to understand the relationship between them. In this study, we will use this plot to visualise the relationships between the proportion of the 4 age groups for each planning area.

To be in line with the mosaicplot, we will also limit the planning area to the top 15 by total population count.

data5 <- data4[0:15,]

parcoords(
 data5[c(1,7:10)],
 color = list(
   colorBy = "Planning Area",
   colorScale = "scaleOrdinal"),
 withD3 = TRUE,
 rownames = FALSE,
 margin = 45
 )

Further analysis will be performed under point 3 below.

3. Final Data Visualisation

The final data visualisation will focus on the proportion of age group across the planning areas.

Heat Map of Planning Areas by Proportion of Age Groups

From the heat map, we observed that Singapore is facing an issue of ageing population. The young workforce and mature workforce are generally lighter in colour and have a proportion of around 0.37(37%) and 0.31(31%) respectively, meaning that there is high proportion of these 2 groups in most of the planning areas. The youth proportion is slight dark and generally have a proportion of around 0.26(26%), and lastly followed by the elderly proportion of around 0.05(5%).

We noted that the following boxes have an unusually high proportion of more than 50% (bright yellow) for a specific age group; Lim Chu Kang (71% mature workforce), Museum (53% young workforce) and Downtown Core (51% young workforce). However from our earlier data exploration, we noted that this might be due a deviation in distribution from the national norm because they have a much lower population count of 70, 480 and 2,500 respectively.

The dendogram of 8 clusters allows users to identify areas with similar proportions of age group. For example, Changi, Sembawang, SengKang and Punggol have relatively higher proportion of youth and younger workforce. On the other hand, the bright green dendogram from Toa Payoh to Sungei Kadut have relatively higher proportion of mature workforce and elderlies.

Parallel Coordinates Plot of Planning Areas (top 15) and Proportion of Age Groups

We will now perform a further dive into the top 15 most populous areas. From the exploration previously, we understand that these accounts for approximately 75% of the population.

The first thing we observed from the mosaic plot and parallel coordinates plot is that Punggol, being the 9th most populous area, has a significant lead in the proportion of youth (33%) and young workforce (46%), and conversely significantly lower mature workforce (18%) and elderlies (2%). This next area with such a trend is SengKang.

On the other hand, areas like Bukit Merah and Toa Payoh has low youth (22%) and young workforce (35%), and higher mature workforce (34%) and elderlies (9%).

One interesting point from parcoords is that there is generally a substantial spread in the proportion of youth and elderlies across the top 15 planning areas. However all 13 of the planning areas (with the exception of Punggol and Seng Kang) have very similar proportion of young workforce (35%-39%) and mature workforce (29%-36%).

4. References

Data Source
- https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data
Resources on Heat Map heatmaply
- https://rpubs.com/geock/hvr
- https://cran.r-project.org/web/packages/heatmaply/vignettes/heatmaply.html
Resources on Parallel Coordinates Plot parcoords
- https://rpubs.com/geock/pcp
- https://cran.r-project.org/web/packages/parcoords/parcoords.pdf
Resources on Mosaic Plot mosaicplot
- https://rpubs.com/tskam/mosaic_plot
- https://www.tutorialgateway.org/mosaic-plot-in-r/#:~:text=The%20Mosaic%20Plot%20in%20R,selling%20in%20all%20the%20countries.
Other General Resources
- https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
- http://www.cookbook-r.com/Graphs/

Understanding Singapore’s distribution of Age Groups by Planning Areas

ISSS608 Visual Analytics and Applications | Assignment 4

Author: Khoo Wen Yong

Date: 24 July 2020