1. Overview

This report aims to visualise the demographic structure of Singapore's population by age and planning area in 2019, using the population trends dataset published by the Singapore Department of Statistics. The dataset is available at: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data.

1.1. Purpose of Visualisation

The purpose of the visualisation is to study the proportion of the different age groups across the various planning areas. This will allow users, such as policymakers or campaign managers, to understand the needs of each planning area based on the demographics.

1.2. Data and Design Challenges

The raw dataset contains 19 age groups in 5-year intervals, from "0 to 4" up to "90 and above". We will cluster the age groups according to the following table for more meaningful analysis. This clustering also gives each of the first 3 clusters a consistent range of 25 years.

| Original Age Group | Recoded Age Group | Remarks |
|---|---|---|
| 0 to 24 | 1. Youth | Covers newborns all the way to tertiary students. |
| 25 to 49 | 2. Young Workforce | Covers graduating students entering the workforce and young parents. |
| 50 to 74 | 3. Mature Workforce | Covers the mature workforce and those who have just retired. |
| 75 and above | 4. Elderly | Covers retirees and the elderly in their silver years. |

The data has 55 planning areas, and some of them may have a low or zero resident count. Based on the objective and output of each visualisation, we may limit the planning areas and perform a cutoff by total resident count for a more meaningful analysis.

The raw dataset is also in the “long” format, where the planning areas, age group and population are all in individual columns. We will perform data wrangling to transform it into a “wide” format (i.e. population count by age groups in individual columns) as input into the various plot functions.
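The reshaping from "long" to "wide" can be sketched as follows. This is a minimal illustration using base R's xtabs rather than the tidyverse functions loaded later in this report; the column name AgeGroup and all values in the toy rows are assumptions made for the example, not part of the dataset.

```r
# Toy long-format rows (hypothetical values, for illustration only)
long_df <- data.frame(
  PA       = c("Area A", "Area A", "Area A", "Area B", "Area B"),
  AgeGroup = c("1. Youth", "1. Youth", "2. Young Workforce",
               "1. Youth", "2. Young Workforce"),
  Pop      = c(10, 20, 40, 5, 15)
)

# Sum Pop for every PA x AgeGroup combination: rows become planning areas,
# columns become age groups (the "wide" layout)
wide_tab <- xtabs(Pop ~ PA + AgeGroup, data = long_df)

# Convert the contingency table into an ordinary data frame
wide_df <- as.data.frame.matrix(wide_tab)
print(wide_df)
```

Rows that share the same planning area and age group (for example, different subzones or dwelling types) are summed into a single cell.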

1.3. Proposed Sketch Design

The proposed sketch design will mainly focus on visualising the proportion of age groups in each planning area in the form of a heatmap and a parallel coordinates plot.

2. Step-by-step Preparation

2.1. Install and Load R Packages

  • tidyverse is a set of packages to perform data wrangling and exploration.
  • ggplot2 is a data visualisation package for the statistical programming language R.
  • scales is used to provide methods to determine breaks and labels for axes and legends.
  • parcoords is used to build parallel coordinates charts.
  • heatmaply is used to build heat maps.
  • knitr is used to integrate R code and its output into HTML.

2.3. Data Wrangling

Check the number of unique values for each column.

| PA | SZ | AG | Sex | TOD | Pop | Time |
|---|---|---|---|---|---|---|
| 55 | 323 | 19 | 2 | 8 | 261 | 9 |

We noted a total of 55 planning areas (PA) and 19 age groups (AG).
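A check of this kind can be sketched in base R as below. The column names follow the dataset, but the toy rows are illustrative values only.

```r
# Toy rows using the dataset's column names (values are illustrative only)
df <- data.frame(
  PA   = c("Ang Mo Kio", "Ang Mo Kio", "Bedok"),
  AG   = c("0_to_4", "5_to_9", "0_to_4"),
  Pop  = c(10, 0, 10),
  Time = c(2011, 2019, 2019)
)

# Number of distinct values in each column
unique_counts <- sapply(df, function(col) length(unique(col)))
print(unique_counts)
```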

Recode the age groups in accordance with section 1.2.
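One possible way to do the recoding is sketched below. The helper recode_age_group is a hypothetical name, and it assumes that every AG label begins with the lower bound of its age band followed by an underscore (as in "35_to_39"); the label used for the oldest band is likewise an assumption.

```r
# Map a 5-year age label (e.g. "35_to_39") to one of the four recoded groups.
# Assumes every label starts with the band's lower bound followed by "_".
recode_age_group <- function(ag) {
  lower <- as.integer(sub("_.*$", "", ag))
  cut(lower,
      breaks = c(0, 25, 50, 75, Inf),
      right  = FALSE,  # intervals are [0, 25), [25, 50), [50, 75), [75, Inf)
      labels = c("1. Youth", "2. Young Workforce",
                 "3. Mature Workforce", "4. Elderly"))
}

recoded <- recode_age_group(c("5_to_9", "35_to_39", "50_to_54", "75_to_79"))
print(as.character(recoded))
```

Working from the lower bound avoids listing all 19 labels by hand, at the cost of relying on the label format.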

Next we will remove rows that have 0 population count, and only retain the rows for year 2019. Check the number of unique values for each column again.

| PA | SZ | AG | Sex | TOD | Pop | Time | Age Group |
|---|---|---|---|---|---|---|---|
| 42 | 234 | 19 | 2 | 7 | 207 | 1 | 4 |
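The filtering described above can be sketched in base R as follows. The column names Pop and Time follow the dataset; the toy rows are illustrative values only.

```r
# Toy rows (illustrative values only)
pop_df <- data.frame(
  PA   = c("Area A", "Area A", "Area B", "Area B"),
  Pop  = c(120, 0, 80, 60),
  Time = c(2019, 2019, 2018, 2019)
)

# Keep only 2019 rows with a non-zero population count
pop_2019 <- subset(pop_df, Time == 2019 & Pop > 0)
print(pop_2019)
```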

Select 5 random rows to check the dataset.

| PA | SZ | AG | Sex | TOD | Pop | Time | Age Group |
|---|---|---|---|---|---|---|---|
| Choa Chu Kang | Choa Chu Kang Central | 35_to_39 | Males | HDB 4-Room Flats | 220 | 2019 | 2. Young Workforce |
| Choa Chu Kang | Teck Whye | 75_to_79 | Males | HDB 4-Room Flats | 130 | 2019 | 4. Elderly |
| Queenstown | Holland Drive | 5_to_9 | Males | HDB 3-Room Flats | 70 | 2019 | 1. Youth |
| Outram | Pearl’s Hill | 80_to_84 | Males | HDB 3-Room Flats | 20 | 2019 | 4. Elderly |
| Geylang | Macpherson | 50_to_54 | Males | Condominiums and Other Apartments | 30 | 2019 | 3. Mature Workforce |

After inspecting the dataset, we noted that the data has been correctly filtered for year 2019, and the age groups have been categorised correctly. We will next explore the dataset.

2.4. Exploring the dataset

2.4.1. Planning Areas

Firstly, we will examine how the population is distributed across the planning areas in Singapore.

We will prepare the dataset for the bar chart to present the total population of each planning area in descending order.

Visualise the bar chart with percentiles at 25%, 50% and 75% using ggplot.
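A minimal sketch of these two steps is given below. It uses only the five planning-area totals that appear in the sample table in section 2.5, so the percentiles it prints are not those of the full dataset. The plot is built only if ggplot2 is installed, and the axis labels are our own choice.

```r
# Totals for five planning areas taken from the sample table in section 2.5
pa_totals <- data.frame(
  PA  = c("Ang Mo Kio", "Changi", "Bukit Batok", "Singapore River", "Sembawang"),
  Pop = c(164430, 1790, 154140, 2940, 96070)
)

# Sort in descending order of total population
pa_totals <- pa_totals[order(-pa_totals$Pop), ]

# 25th, 50th and 75th percentiles of the totals
pcts <- quantile(pa_totals$Pop, probs = c(0.25, 0.5, 0.75))
print(pcts)

# Bar chart with the percentiles as dashed reference lines
# (built only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(pa_totals,
                       ggplot2::aes(x = reorder(PA, -Pop), y = Pop)) +
    ggplot2::geom_col() +
    ggplot2::geom_hline(yintercept = unname(pcts), linetype = "dashed") +
    ggplot2::labs(x = "Planning area", y = "Total population")
}
```

Printing p (or calling print(p)) renders the chart.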



We noted that the first quartile covers a very narrow range, with total population between 70 and 4,205 per planning area. We also noted the ranking of the larger planning areas, as we may shortlist them for a deep dive later. Through a separate calculation, we noted that the 15 largest planning areas account for approximately 75% of the total population.
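The separate calculation is not shown in this report. One way it could be done is sketched below, where top_n_share is a hypothetical helper and the totals are illustrative values, not the real dataset.

```r
# Share of the total population held by the n largest areas.
# `totals` is a numeric vector of per-area totals (hypothetical helper).
top_n_share <- function(totals, n) {
  sorted <- sort(totals, decreasing = TRUE)
  sum(head(sorted, n)) / sum(sorted)
}

# Illustrative totals only, not the real dataset
totals <- c(50, 30, 10, 6, 4)
print(top_n_share(totals, 2))
```

On the real per-area totals, top_n_share(totals, 15) would give the share quoted above.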

2.5. Preparing the Visualisation

As we will also visualise the dataset in proportion of age groups for each planning area, we will compute the proportion in the dataframe.
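The proportion is each age group's count divided by the planning area's total. A minimal base R sketch, using the counts of three of the sample rows from this section, is:

```r
# Counts for three planning areas, taken from the sample rows of this section
counts <- rbind(
  "Ang Mo Kio" = c(35880, 56400, 59000, 13150),
  "Changi"     = c(590, 730, 440, 30),
  "Sembawang"  = c(29350, 40110, 23690, 2920)
)
colnames(counts) <- c("1. Youth", "2. Young Workforce",
                      "3. Mature Workforce", "4. Elderly")

# Divide each row by its row total so that every row sums to 1
props <- prop.table(counts, margin = 1)
print(round(props, 4))
```

For Ang Mo Kio, for instance, the youth proportion is 35880 / 164430, or about 0.218.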

We will select 5 random rows to inspect the proportion.

| Planning Area | 1. Youth | 2. Young Workforce | 3. Mature Workforce | 4. Elderly | Total | 1. Youth Proportion | 2. Young Workforce Proportion | 3. Mature Workforce Proportion | 4. Elderly Proportion | Total Proportion Check |
|---|---|---|---|---|---|---|---|---|---|---|
| Ang Mo Kio | 35880 | 56400 | 59000 | 13150 | 164430 | 0.2182084 | 0.3430031 | 0.3588153 | 0.0799732 | 1 |
| Changi | 590 | 730 | 440 | 30 | 1790 | 0.3296089 | 0.4078212 | 0.2458101 | 0.0167598 | 1 |
| Bukit Batok | 40220 | 57960 | 49630 | 6330 | 154140 | 0.2609316 | 0.3760218 | 0.3219800 | 0.0410666 | 1 |
| Singapore River | 770 | 1400 | 680 | 90 | 2940 | 0.2619048 | 0.4761905 | 0.2312925 | 0.0306122 | 1 |
| Sembawang | 29350 | 40110 | 23690 | 2920 | 96070 | 0.3055064 | 0.4175081 | 0.2465910 | 0.0303945 | 1 |

We noted that the proportions have been correctly computed, and that the proportions of the 4 age groups sum to 1 for each planning area.

2.5.1. Heatmap

To get an overview of the population across all the planning areas and age groups, we will build a heatmap with heatmaply. The heatmap is a useful way to cross-examine the planning areas and age groups, and can show all the planning areas meaningfully in one plot.

 




As we are visualising the proportion (between 0 and 1) of each planning area by age group, we will not perform further transformation of the dataset. We will use k_row = 8 to cut the row dendrogram into 8 clusters. As the columns are the 4 age groups, we will remove the column dendrogram. Further analysis will be performed in section 3 below.
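To illustrate what cutting the row dendrogram into clusters does, the sketch below clusters the five sample rows from section 2.5 with base R's dist, hclust and cutree, using k = 2 because only five rows are available here. We assume, without having verified it, that these calls approximate the clustering heatmaply performs by default. The wrapper plot_age_heatmap is a hypothetical name, and is defined but not called.

```r
# Proportions for the five sample rows shown in section 2.5
props <- rbind(
  "Ang Mo Kio"      = c(0.2182084, 0.3430031, 0.3588153, 0.0799732),
  "Changi"          = c(0.3296089, 0.4078212, 0.2458101, 0.0167598),
  "Bukit Batok"     = c(0.2609316, 0.3760218, 0.3219800, 0.0410666),
  "Singapore River" = c(0.2619048, 0.4761905, 0.2312925, 0.0306122),
  "Sembawang"       = c(0.3055064, 0.4175081, 0.2465910, 0.0303945)
)
colnames(props) <- c("1. Youth", "2. Young Workforce",
                     "3. Mature Workforce", "4. Elderly")

# Hierarchical clustering of the rows, then cut the tree into k groups
row_tree     <- hclust(dist(props))
row_clusters <- cutree(row_tree, k = 2)
print(row_clusters)

# Hypothetical wrapper around heatmaply: row dendrogram only, k_row clusters
plot_age_heatmap <- function(props, k_row = 8) {
  heatmaply::heatmaply(props, dendrogram = "row", k_row = k_row)
}
```

With heatmaply installed and the full table of proportions, plot_age_heatmap(props) would draw the heatmap with 8 row clusters.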

2.5.2. Mosaic Plot

The mosaic plot is a useful way to visualise a contingency table. The height of each box is proportional to the share of the age group in each planning area, and the width of each box is proportional to the population of each planning area.

We will also use shade = TRUE to display the Pearson residuals. If the residual is negative (red), the box has fewer observations than expected. Conversely, if the residual is positive (blue), the box has more observations than expected.

We will also order the dataset to display the most populous area on the left, and the least populous area on the right.

We noted that when we plot all the planning areas, the plot cannot be read meaningfully. Hence we will limit the planning areas to the top 15 by total population count.
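A minimal sketch of such a plot with base R's mosaicplot is given below, using the counts of the five sample rows from section 2.5 rather than the top 15 areas. In mosaicplot the argument that shades boxes by residual is named shade. The residuals are also computed explicitly with chisq.test, so that the sign behind each colour can be inspected. The plot title is our own choice.

```r
# Counts for five planning areas, from the sample rows in section 2.5
counts <- rbind(
  "Ang Mo Kio"      = c(35880, 56400, 59000, 13150),
  "Changi"          = c(590, 730, 440, 30),
  "Bukit Batok"     = c(40220, 57960, 49630, 6330),
  "Singapore River" = c(770, 1400, 680, 90),
  "Sembawang"       = c(29350, 40110, 23690, 2920)
)
colnames(counts) <- c("1. Youth", "2. Young Workforce",
                      "3. Mature Workforce", "4. Elderly")

# Most populous area first, so that it is drawn on the left
counts <- counts[order(rowSums(counts), decreasing = TRUE), ]

tab <- as.table(counts)
names(dimnames(tab)) <- c("Planning Area", "Age Group")

# shade = TRUE colours each box by its Pearson residual
mosaicplot(tab, shade = TRUE, las = 2, main = "Age groups by planning area")

# The same residuals computed explicitly:
# (observed - expected) / sqrt(expected) under independence
pearson <- chisq.test(tab)$residuals
print(round(pearson, 1))
```

In this small example, Ang Mo Kio has a positive residual for the elderly group (more elderly residents than expected) and Sembawang a negative one.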

2.5.3. Parallel Coordinates Plot

The parallel coordinates plot (parcoords) is useful for visualising multivariate numerical data and understanding the relationships between variables. In this study, we will use this plot to visualise the relationships between the proportions of the 4 age groups for each planning area.

To be in line with the mosaic plot, we will also limit the planning areas to the top 15 by total population count.
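The selection of the top 15 areas and the plot call can be sketched as below. Both top_n_areas and plot_parcoords are hypothetical helper names; the sketch assumes a wide data frame with a numeric Total column and planning areas as row names, and the brushMode and reorderable settings are our own choices rather than those of the original plot. The wrapper is defined but not called, and the toy totals are illustrative only.

```r
# Keep the n most populous rows of a wide table (hypothetical helper).
# `wide` is expected to have a numeric `Total` column.
top_n_areas <- function(wide, n = 15) {
  head(wide[order(wide$Total, decreasing = TRUE), , drop = FALSE], n)
}

# Hypothetical wrapper: one axis per column, one line per planning area
plot_parcoords <- function(props_df) {
  parcoords::parcoords(props_df, rownames = TRUE,
                       brushMode = "1D-axes", reorderable = TRUE)
}

# Toy example with illustrative totals only
wide <- data.frame(Total = c(10, 300, 25), row.names = c("A", "B", "C"))
print(rownames(top_n_areas(wide, n = 2)))
```

With the parcoords package installed, passing the proportion columns of top_n_areas(wide) to plot_parcoords would draw one line per planning area.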

Further analysis will be performed under point 3 below.

3. Final Data Visualisation

The final data visualisation will focus on the proportion of age group across the planning areas.

Heat Map of Planning Areas by Proportion of Age Groups






From the heat map, we observed that Singapore is facing an ageing population. The young workforce and mature workforce columns are generally lighter in colour, with proportions of around 0.37 (37%) and 0.31 (31%) respectively, meaning that these 2 groups make up a high proportion of residents in most of the planning areas. The youth column is slightly darker, with a proportion of around 0.26 (26%), followed by the elderly at around 0.05 (5%).

We noted that the following boxes show an unusually high proportion of more than 50% (bright yellow) for a specific age group: Lim Chu Kang (71% mature workforce), Museum (53% young workforce) and Downtown Core (51% young workforce). However, from our earlier data exploration, these deviations from the national norm might be due to their much lower population counts of 70, 480 and 2,500 respectively.

The dendrogram of 8 clusters allows users to identify areas with similar proportions of age groups. For example, Changi, Sembawang, Sengkang and Punggol have relatively higher proportions of youth and young workforce. On the other hand, the bright green dendrogram branch from Toa Payoh to Sungei Kadut groups areas with relatively higher proportions of mature workforce and elderly residents.

Parallel Coordinates Plot of Planning Areas (top 15) and Proportion of Age Groups

We will now take a deeper dive into the top 15 most populous areas. From the earlier exploration, we understand that these account for approximately 75% of the population.

The first thing we observed from the mosaic plot and parallel coordinates plot is that Punggol, the 9th most populous area, has a significantly higher proportion of youth (33%) and young workforce (46%), and conversely a significantly lower proportion of mature workforce (18%) and elderly (2%). The next area with such a trend is Sengkang.

On the other hand, areas like Bukit Merah and Toa Payoh have a low proportion of youth (22%) and young workforce (35%), and a higher proportion of mature workforce (34%) and elderly (9%).

One interesting point from parcoords is that there is generally a substantial spread in the proportions of youth and elderly across the top 15 planning areas. However, the other 13 planning areas (all except Punggol and Sengkang) have very similar proportions of young workforce (35%-39%) and mature workforce (29%-36%).