1. Overview

This assignment visualises the demographic structure of Singapore’s population by age cohort and by planning area in 2019. The objective of this visualisation is to examine Singapore’s population and dwelling conditions.

Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data.

1.1 Major data

The dataset contains 984,656 rows of resident counts broken down by Planning Area, Subzone, Age Group, Sex and Type of Dwelling, covering 2011 to 2019.

Variable   Description
PA         Planning Area
SZ         Subzone
AG         Age Group
Sex        Sex
TOD        Type of Dwelling
Pop        Resident Count
Time       Time/Period

1.2 Challenges

1.2.1 Challenges in designing the Pyramid Population chart:

  • Design of Pyramid Chart: I only had experience designing Pyramid charts with Tableau, so I searched for how to build one with ggplot2. There are some excellent examples on the Internet, which were a great help.

  • Execute Pyramid Chart: after plotting the Pyramid Chart, I realised there was an issue with the order of the x-axis (i.e., Age group). Because the age groups are stored as text, the x-axis sorts them as character strings rather than by numeric age, so the order comes out illogical:

    50 to 59
    5 to 9
    45 to 49

Hence, I added an ifelse() call that renames “5_to_9” to “05_to_9”, which solved the ordering issue.

popGHcens$AG <- ifelse(popGHcens$AG == "5_to_9", "05_to_9", popGHcens$AG)
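Zero-padding one label works only as long as the session's sort order places the underscore where expected, so an alternative is to give AG explicit factor levels. The following is a minimal sketch on a hypothetical vector of labels; `ag_labels` and `order_age_groups()` are names made up for this illustration, and it assumes every label starts with the lower bound of its age band.

```r
# Hypothetical sample of age-group labels in the style of the AG column.
ag_labels <- c("50_to_54", "5_to_9", "45_to_49", "0_to_4", "90_and_over", "10_to_14")

# Order labels by their leading number (the lower bound of each age band).
order_age_groups <- function(labels) {
  lower_bound <- as.integer(sub("_.*$", "", labels))
  unique(labels[order(lower_bound)])
}

ag_levels <- order_age_groups(ag_labels)

# A factor with explicit levels keeps this order on a discrete axis.
ag_factor <- factor(ag_labels, levels = ag_levels)
print(levels(ag_factor))
```

Applied to the real data, the same idea would be `factor(popGHcens$AG, levels = order_age_groups(popGHcens$AG))`, which leaves the original labels untouched.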

1.2.2 Challenges in designing the Planning Area chart:

  • Design of Planning Area chart: I planned to draw a box plot of age group and population for each planning area. However, there are so many planning areas that the individual boxplots and their axis labels cannot be read clearly.

Hence, I decided to use a ternary plot, because it does not need to show every planning area as an axis label.

1.3 Proposed sketch design

I proposed two charts: a Pyramid Population chart and a Ternary Plot showing the population structure of the different planning areas.

2. Step-by-Step Description

2.1 Install and Load R packages

  • dplyr provides a consistent set of verbs that help you solve the most common data manipulation challenges. I used filter() and select() to extract the 2019 data.

  • ggplot2 is the package used for data visualization.

  • ggtern is a package that extends the functionality of ggplot2, giving the capability to plot ternary diagrams for (subset of) the ggplot2 proto geometries.

packages <- c('dplyr', 'ggplot2', 'ggtern')

# Install any package that is not available yet, then attach it.
for (p in packages){
  if (!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

2.2 Load Dataset

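# Keep text columns as character so ifelse() on AG returns labels,
# not factor codes (this matters on R versions before 4.0).
data <- read.csv("data.csv", stringsAsFactors = FALSE)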
#View(data)

2.3 Filter Dataset

Since the original data covers 2011 to 2019, I first select the age group, population and sex columns for 2019 for the Pyramid chart.

popGHcens <- data %>%
  filter(Time=="2019") %>%
  select(AG,Pop,Sex)

I also created another subset for the Ternary Plot, selecting the Planning Area, age group, population and sex columns for 2019.

data_pz <- data %>%
  filter(Time=="2019") %>%
  select(PA,AG,Pop,Sex)

2.4 Data Wrangling

2.4.1 Pyramid Chart

A Pyramid Chart is two bar charts with flipped axes, so the bars for the male population are given a negative sign and extend to the left. To put the x-axis in a logical order, I also have to rename the age group “5_to_9”.

popGHcens$Pop <- ifelse(popGHcens$Sex == "Males", -1*popGHcens$Pop, popGHcens$Pop)
popGHcens$AG <- ifelse(popGHcens$AG == "5_to_9", "05_to_9", popGHcens$AG)
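Since the source data has one row per subzone and dwelling type, each age group and sex combination appears many times in popGHcens, and the bar geometry ends up stacking all of those rows. An optional extra step is to collapse the data to one total per age group and sex before flipping the sign. The sketch below uses hypothetical toy rows (`toy` and `totals` are illustrative names with made-up counts) and base R's aggregate().

```r
# Hypothetical toy rows mimicking popGHcens after filtering to 2019.
toy <- data.frame(
  AG  = c("0_to_4", "0_to_4", "0_to_4", "5_to_9", "5_to_9"),
  Sex = c("Males", "Males", "Females", "Males", "Females"),
  Pop = c(100, 50, 120, 80, 90),
  stringsAsFactors = FALSE
)

# Collapse to one total per age group and sex.
totals <- aggregate(Pop ~ AG + Sex, data = toy, FUN = sum)

# Mirror the male bars by making their totals negative.
totals$Pop <- ifelse(totals$Sex == "Males", -totals$Pop, totals$Pop)
print(totals)
```

The plotted totals are the same either way; aggregating first simply gives ggplot2 one bar segment per group instead of many.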

2.4.2 Ternary Plot

To simplify the age groups, I define three broad groups: Young (aged 0 to 24), Economically Active (aged 25 to 69) and Aged (aged 70 and over).

young_groups <- c("0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24")
aged_groups  <- c("70_to_74", "75_to_79", "80_to_84", "85_to_89", "90_and_over")

# Every age group not listed above falls into "Economically Active".
data_pz$Group <- ifelse(data_pz$AG %in% young_groups, "Young",
                 ifelse(data_pz$AG %in% aged_groups, "Aged", "Economically Active"))

After that, I create another dataset that groups the population by Planning Area and computes the share of each planning area's population that falls into each age group.

data_pz_percent <- data_pz %>%
  group_by(PA)  %>%
  summarize(young_percent = sum(Pop[Group=="Young"]) / sum(Pop),
            Econ_active_percent = sum(Pop[Group=="Economically Active"]) / sum(Pop),
            aged_percent=1-young_percent-Econ_active_percent)
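As a quick check of the share calculation, the same arithmetic can be reproduced on a small table. The sketch below uses hypothetical planning areas "A" and "B" with made-up counts and base R's xtabs() and prop.table() rather than the dplyr pipeline; it is only meant to show that the three shares of each planning area sum to 1.

```r
# Hypothetical toy data: two planning areas with made-up counts.
toy <- data.frame(
  PA    = c("A", "A", "A", "B", "B", "B"),
  Group = c("Young", "Economically Active", "Aged",
            "Young", "Economically Active", "Aged"),
  Pop   = c(20, 70, 10, 30, 50, 20),
  stringsAsFactors = FALSE
)

# Cross-tabulate population by planning area and group, then convert
# each row to shares of that planning area's total.
counts <- xtabs(Pop ~ PA + Group, data = toy)
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))

# Each planning area's three shares should sum to 1.
print(rowSums(shares))
```

Note that a planning area whose total population is zero would give 0/0, which R reports as NaN, so such rows may need to be filtered out before plotting.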

2.5 Data Visualization

With the datasets ready, I generate the visualizations with ggplot2 and ggtern.

2.5.1 Pyramid Chart

pyramidGH <- ggplot(popGHcens, aes(x = AG, y = Pop, fill = Sex)) + 
  geom_bar(data = subset(popGHcens, Sex == "Females"), stat = "identity") + 
  geom_bar(data = subset(popGHcens, Sex == "Males"), stat = "identity") + 
  scale_y_continuous(breaks = seq(-150000, 150000, 50000), 
                     labels = paste0(c(seq(150, 0, -50), seq(50, 150, 50)), "k")) +  
  coord_flip()
pyramidGH
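Because the male counts are stored as negative numbers, the axis labels have to show absolute values. One way to avoid typing the labels by hand is a small labelling function; `abs_thousands()` below is a hypothetical helper written for this illustration, not part of the original script.

```r
# Hypothetical helper: format a signed count as an unsigned label in thousands.
abs_thousands <- function(x) paste0(abs(x) / 1000, "k")

breaks <- seq(-150000, 150000, 50000)
print(abs_thousands(breaks))
```

ggplot2 scales accept a function for the labels argument, so this could be passed as `labels = abs_thousands` in scale_y_continuous().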

2.5.2 Ternary Plot

ggtern(data=data_pz_percent,aes(x=young_percent,y=Econ_active_percent, z=aged_percent)) +
  geom_point()+
  labs(title="Population Structure According to Planning Area, 2019") +
  theme_rgbw()

3. Visualization and Insights

3.1 Pyramid Chart

The Pyramid Chart gives the following insights:

The Pyramid chart reveals the population structure by sex for each age interval.

  • The sex distribution is balanced for ages below 70. For the population above 70, however, we can observe that females outnumber males.
  • From the shape of the Pyramid Chart, we can see that the population structure of Singapore in 2019 is constrictive. Constrictive pyramids have smaller percentages of people in the younger age cohorts and are typically characteristic of countries with higher levels of social and economic development, where access to quality education and health care is available to a large portion of the population.

3.2 Ternary Plot

The Ternary Plot gives the following insights:

The ternary plot reveals the share of each age group within a planning area. Each dot represents one planning area, and its position shows the proportions of the young, active and aged groups in that planning area.

  • Most points in the plot are clustered, which means the age-group distributions of the planning areas are broadly similar.
  • We can observe that in one planning area the active group makes up more than 80% of the population, the highest share in the plot (from the static plot we are unable to tell which planning area it is).
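The planning area behind a particular dot can be looked up in the summary table instead of being read off the plot. The following is a minimal sketch on a hypothetical stand-in for data_pz_percent (`toy_percent` and its values are made up); it picks the row with the largest economically active share.

```r
# Hypothetical stand-in for data_pz_percent with made-up shares.
toy_percent <- data.frame(
  PA = c("A", "B", "C"),
  young_percent = c(0.20, 0.10, 0.30),
  Econ_active_percent = c(0.70, 0.85, 0.55),
  aged_percent = c(0.10, 0.05, 0.15),
  stringsAsFactors = FALSE
)

# Row with the largest economically active share.
top_active <- toy_percent[which.max(toy_percent$Econ_active_percent), ]
print(top_active)
```

Running the same which.max() lookup on data_pz_percent would name the planning area in question.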