Singapore Demographics by Age and Planning Area

Singapore

Relevant Installations and Imports

Packages

# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1) 

# installing packages
packages = c('corrplot', 'ggpubr', 'tidyverse', 'dplyr')

for(p in packages){
  if(!require(p, character.only = T)){install.packages(p)}}

options(warn = defaultW)

Libraries

# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1) 

# importing libraries
library(data.table)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(viridis)

options(warn = defaultW)

Data

# setting working directory
setwd("F:/Arya/MITB/Term2/VA/Assignment 4/Arya_Akhil_Assignment4/Arya_Akhil_Assignment4")  

# importing data
df <- read.csv(file = 'respopagesextod2011to2019.csv')

What do we want to achieve?

Collating the problem statement and the available data, we are going to use the following features for our design:

Period (Time)
Age Groups (AG)
Gender (Sex)
Planning Area (PA)

We will also create additional features derived from the ones stated above to tackle the design challenges.

To reveal the demographic structure of Singapore’s population by age-cohort and planning area, we will try to achieve the following designs:

1. Pyramidal plot

The pyramidal plot will help us understand the distribution of males and females across different age groups at finer levels such as the ones already present in the data. We are not posed with any major challenges from the data to design the pyramidal plots.

Pyramidal Plot

2. Circular/Radial stacked bar chart

A circular stacked chart is a difficult yet enlightening feast. Through it, we wish to understand the following:

Distribution of planning areas by regions in Singapore and how many people inhabit each planning area
Distribution of age cohort among the planning areas

We are indeed posed with quite many data challenges for this design. We will try and tackle one by one in the subsequent steps.

Circular Stacked Bar Chart

What are the challenges and how would we resolve them?

The proposed design faces some challenges from the available data. We have listed the challenges and their solutions below:

Challenge 1:

The problem statement requires demographics as reflected in the year 2019 but the given data contains nine years of information from 2011 to 2019. Below is the code to show the same:

# obtaining unique values of years
unique(df$Time)

## [1] 2011 2012 2013 2014 2015 2016 2017 2018 2019

Solution 1:

We will create a subset of the dataframe to contain only the year 2019. Below is the code for the same:

# creating a subset, year = 2019
df = df[df$Time == 2019,]

Challenge 2:

If we pivot the data and try to find the total population by Planning Area, we observe that 13 planning areas have 0 inhabitance. It could be due to two major reasons - (i) The planning areas are re-categorized into sub-zones now. For example, “Boon Lay” to “Boon Lay Place”. (ii) The places are actually uninhabited.

Below is the code to show the same:

# grouping population by planning area
sum_pop = group_by(df, PA)
sum_pop = summarise(sum_pop, Population = sum(Pop))

# filtering out the planning areas with zero inhabitance
sum_pop_zero = filter(sum_pop, Population == 0)
print(sum_pop_zero)

## # A tibble: 13 x 2
##    PA                      Population
##    <fct>                        <int>
##  1 Boon Lay                         0
##  2 Central Water Catchment          0
##  3 Changi Bay                       0
##  4 Marina East                      0
##  5 Marina South                     0
##  6 North-Eastern Islands            0
##  7 Paya Lebar                       0
##  8 Pioneer                          0
##  9 Simpang                          0
## 10 Straits View                     0
## 11 Tengah                           0
## 12 Tuas                             0
## 13 Western Islands                  0

Solution 2:

These planning areas will not be able to contribute to our understanding of demographics because they are uninhabited and so it is imperative that we should filter them out of the dataframe used for visualization. Below is the code for the same:

# converting the values of zero inhabitance planning areas to vector
zero_inhab <- sum_pop_zero[['PA']]

# filtering the dataframe to remove zero_inhab planning areas
df <- df %>% dplyr::filter(!PA %in% zero_inhab)

Challenge 3:

As proposed in the design, we want to follow a broader age group categorization as follows: 0-14 years old, 15-64 years old and 65 years old and over.

This categorization is obtained from the following link under the section, “Statistical Classification” -

https://www.singstat.gov.sg/-/media/files/standards_and_classifications/nsa.pdf.

The given data contains finer divisions of age groups at an interval of 5 years such as 0-4, 5-9…65-69.. etc., which are not suitable for our purpose. Below is the code to show the same:

# obtaining unique values of age groups
unique(df$AG)

##  [1] 0_to_4      5_to_9      10_to_14    15_to_19    20_to_24    25_to_29   
##  [7] 30_to_34    35_to_39    40_to_44    45_to_49    50_to_54    55_to_59   
## [13] 60_to_64    65_to_69    70_to_74    75_to_79    80_to_84    85_to_89   
## [19] 90_and_over
## 19 Levels: 0_to_4 10_to_14 15_to_19 20_to_24 25_to_29 30_to_34 ... 90_and_over

Solution 3:

We will group and recode the age groups to reflect the broader age divisions we have envisioned for our visualization. Below is the code for the same:

df$Age_Cohort <- with(df, dplyr::case_when(AG %in% c("0_to_4",
                                                     "5_to_9",
                                                     "10_to_14"
                                                     ) ~ 'Young',
                                           AG %in% c("60_to_64",
                                                     "65_to_69",
                                                     "70_to_74",
                                                     "75_to_79",
                                                     "80_to_84",
                                                     "85_to_89",
                                                     "90_and_over"
                                                     ) ~ 'Old',
                                           TRUE ~ 'Active'))

Challenge 4:

The planning areas themselves do not provide specific information to an outsider as to which region of Singapore e.g., central, east, west, etc., are more populated than the others. Moreover, the huge number of planning areas per the size of Singapore also makes it difficult to narrow down the region-specific demographics and understand the distribution on a broader level.

Below is the code to show the number of planning areas:

# obtaining unique values of age groups
unique(df$PA)

##  [1] Ang Mo Kio              Bedok                   Bishan                 
##  [4] Bukit Batok             Bukit Merah             Bukit Panjang          
##  [7] Bukit Timah             Changi                  Choa Chu Kang          
## [10] Clementi                Downtown Core           Geylang                
## [13] Hougang                 Jurong East             Jurong West            
## [16] Kallang                 Lim Chu Kang            Mandai                 
## [19] Marine Parade           Museum                  Newton                 
## [22] Novena                  Orchard                 Outram                 
## [25] Pasir Ris               Punggol                 Queenstown             
## [28] River Valley            Rochor                  Seletar                
## [31] Sembawang               Sengkang                Serangoon              
## [34] Singapore River         Southern Islands        Sungei Kadut           
## [37] Tampines                Tanglin                 Toa Payoh              
## [40] Western Water Catchment Woodlands               Yishun                 
## 55 Levels: Ang Mo Kio Bedok Bishan Boon Lay Bukit Batok ... Yishun

Solution 4:

Therefore, as shown in the design sketch, we wish to recode the planning areas to regions’ categorization of Singapore. Then use the regions in conjunction with the planning areas to reveal the region-wise demographics of Singapore. The reference is taken from the link below -

https://en.wikipedia.org/wiki/Regions_of_Singapore

Below is the code to do the same:

df$Region <- NA
df$Region <- with(df, dplyr::case_when(PA %in% c('Central Water Catchment','Lim Chu Kang','Mandai','Sembawang',
                                                 'Simpang','Sungei Kadult','Woodlands','Yishun') ~'North',
                                       PA %in% c('Ang Mo Kio','Hougang','North-Eastern Islands','Punggol',
                                                 'Seletar','Sengkang','Serangoon')~'N.East',
                                       PA %in% c('Bedok','Changi','Changi Bay', 'Paya Lebar','Pasir Ris', 
                                                 'Tampines') ~ 'East',
                                       PA %in% c('Boon Lay','Bukit Batok','Bukit Panjang','Choo Chu Kang', 
                                                 'Clementi','Jurong East','Jurong West','Pioneer', 
                                                 'Tengah','Tuas','Western Islands', 
                                                 'Western Water Catchment') ~ 'West',
                                       TRUE ~ 'Central'))

How would we prepare the design?

1. Pyramidal plot

We have used ggplot2 to achive the goal. We have used the bar chart feature in conjunction with the coordinate flip feature, “coord_flip” to construct of pyramid of population distribution over agr groups.

1. Data preparation

We subset the relevant columns from the original data and create a new dataframe to be called as “py_data”, or pyramid data.
We cast data by population to aggregate it at the level of males and females by age groups.
We manually sorted the data to avoid any discrepancies later as the data is categorical
We converted the number to percentages because it is easier to compare categorical data in percentages than frequencies.
We melted back the data to shape the data for the pyramid - stacking each age group on top of the other.
At last, since we want to create diverging bars to represent males and females, we need to push males to the negative axis to give the effect of divergence.

The combined code is as shown below:

# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1) 

# creating a subset dataframe
py_data = df %>%
  select(AG, Sex, Pop)

#  Casting data by population
py_data = dcast(setDT(py_data), Sex~AG, fun = list(sum), value.var = "Pop")

# manually sorting columns so avoid any discrepancy as this is categorical data
py_data = py_data[, c(1, 2, 11, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20)]

# converting to percentages for comparison
py_data = py_data %>%
  mutate_if(is.numeric, funs(.*100/sum(.))) 

# melting data by sex and consolidating the sum of population 
py_data = melt(py_data,id=c("Sex"))

# reshaping data to align with diverging bars of the pyramid, 
# shifting males to negative axis 
py_data$value[py_data$Sex=="Males"] = py_data$value[py_data$Sex=="Males"]*-1

options(warn = defaultW)

2. Design preparation

We used the “geom_bar” feature of ggplot2 to plot the bars of age groups on top of each other.
We also specified the x-axis scales to meet the requirements of data i.e., 250k.
At last, we did the coordinate flip to align the pyramid vertically rather than horizontally bu using the feature, “coord_flip”.

The combined code is as shown below:

# plotting the pyramid plot 
ggplot(py_data, aes(x = variable , y = value, fill = Sex)) +
  geom_bar(stat="identity", position="identity") +
  scale_y_continuous(breaks = seq(-100, 100, 25),
                     labels = paste0(as.character(c(seq(100,0,-25),                                seq(25,100,25)))))+
       labs(x = "Age Groups", 
       y = "Percent Population",
       title = "Percentage Distribution of Males and Females across Age groups") +coord_flip()

2. Circular Stacked Bar Chart

We have used ggplot2 to achive the goal. We have used stacked bar chart in conjunction with the “coord_polar” feature to organise the stacked bar chart in a circular manner.

1. Data preparation

We subset the relevant columns from the original data and create a new dataframe to be called as “csb_data”, or circular stacked bar data.
We explicitly changed the data type of categorical columns to factor just as a failsafe.
At last, we consolidated the population by region, planning area, and age cohort.
We defined the width of the gap between the regions to be 2 empty bars.
We added an “id” column to the csb_data. This will be later used to identify records uniquely.
We calculated the angle and position of each region and planning area to be shown on the circular stacked bar chart. We will save this information as “label_csb_data”.
We created two dataframes - “base_csb_data” and “grid_csb_data” to save the starting point of the region and the split of regions across 360 degrees by x and y coordinate.

The combined code is as shown below:

# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1) 

# creating a subset dataframe
csb_data = df %>%
  select(PA, Region, Age_Cohort ,Pop)

# changing data type to factor
csb_data$PA <- as.factor(csb_data$PA)
csb_data$Region <- as.factor(csb_data$Region)
csb_data$Age_Cohort <- as.factor(csb_data$Age_Cohort)

# grouping by PA, region and age cohort
csb_data = group_by(csb_data, PA, Region, Age_Cohort)
csb_data = data.frame(summarise(csb_data, Pop = sum(Pop)))

# Setting a number of 'empty bar' to add at the end of each Region
empty_bar <- 2
nObsType <- nlevels(as.factor(csb_data$Age_Cohort))
to_add <- data.frame( matrix(NA, empty_bar*nlevels(csb_data$Region)*nObsType, ncol(csb_data)) )
colnames(to_add) <- colnames(csb_data)
to_add$Region <- rep(levels(csb_data$Region), each=empty_bar*nObsType )
csb_data <- rbind(csb_data, to_add)
csb_data <- csb_data %>% arrange(Region, PA)
csb_data$id <- rep( seq(1, nrow(csb_data)/nObsType) , each=nObsType)

# Getting the name and the y position of each label
label_csb_data <- csb_data %>% group_by(id, PA) %>% summarize(tot=sum(Pop))
number_of_bar <- nrow(label_csb_data)
angle <- 90 - 360 * (label_csb_data$id-0.5) /number_of_bar     # I substract 0.5 because the letter must have the angle of the center of the bars. Not extreme right(1) or extreme left (0)
label_csb_data$hjust <- ifelse( angle < -90, 1, 0)
label_csb_data$angle <- ifelse(angle < -90, angle+180, angle)

# preparing a csb_data frame for base lines
base_csb_data <- csb_data %>% 
  group_by(Region) %>% 
  summarize(start=min(id), end=max(id) - empty_bar) %>% 
  rowwise() %>% 
  mutate(title=mean(c(start, end)))

# preparing a csb_data frame for grid (scales)
grid_csb_data <- base_csb_data
grid_csb_data$end <- grid_csb_data$end[ c( nrow(grid_csb_data), 1:nrow(grid_csb_data)-1)] + 1
grid_csb_data$start <- grid_csb_data$start - 1
grid_csb_data <- grid_csb_data[-1,]

options(warn = defaultW)

2. Design preparation

We created a stacked bar chart of the age cohort as “p” using the “geom_bar” feature of ggplot2.
We added the population values on the y-axis to be used as axis values to read population data. From the data, we figured that the scale should stretch from 0 to 250k.
We also added the labels of the scale indicated above, for user readability.
We stretched the circular plot to meet the aesthetic requirements and eas out use readability. Stretched up to 250k on the y-axis.
At last, we added labels on top of stacked bar charts to specify regions and also around the circle to specify the regions.

The combined code is as shown below:

# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1) 

# Plotting the circular stacked bar chart
p <- ggplot(csb_data) +      
  
  # Adding the stacked bar
  geom_bar(aes(x=as.factor(id), y=Pop, fill=Age_Cohort), stat="identity", alpha=0.5) +
  scale_fill_viridis(discrete=TRUE) +
  
  # Adding a val=100/75/50/25 lines
  geom_segment(data=grid_csb_data, aes(x = end, y = 0, xend = start, yend = 0), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=grid_csb_data, aes(x = end, y = 50000, xend = start, yend = 50000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=grid_csb_data, aes(x = end, y = 100000, xend = start, yend = 100000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=grid_csb_data, aes(x = end, y = 150000, xend = start, yend = 150000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=grid_csb_data, aes(x = end, y = 200000, xend = start, yend = 200000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  geom_segment(data=grid_csb_data, aes(x = end, y = 250000, xend = start, yend = 250000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
  
  # Adding text showing the Pop of each 100/75/50/25 lines
  ggplot2::annotate("text", x = rep(max(csb_data$id),6), y = c(0, 50000, 100000, 150000, 200000,250000), label = c("0k", "50k", "100k", "150k", "200k","250k") , color="grey", size=3.8 , angle=0, fontface="bold", hjust=0.9) +
  
  ylim(-450000,max(label_csb_data$tot, na.rm=T)) +
  theme_minimal() +
  
  # Adding plot title
  ggtitle("Distribution of Age Cohorts by Planning Area and Region")+
  
  theme(
    plot.title = element_text(color="black",size=15),
    legend.title = element_blank() ,
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
    #plot.margin = unit(rep(-1,5), "cm") 
  ) +
  coord_polar() +
  
  # Adding labels on top of each bar
  geom_text(data=label_csb_data, aes(x=id, y=tot+19, label=PA, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=3, angle= label_csb_data$angle, inherit.aes = FALSE )+
  
  # Adding base line information
  geom_segment(data=base_csb_data, aes(x = start, y = -6, xend = end, yend = -6), colour = "black", alpha=0.8, size=0.6 , inherit.aes = FALSE ) +

  geom_text(data=base_csb_data, aes(x = title, y = -30, label=Region), hjust=c(1,-0.1,-0.1,-0.1,-0.1), colour = "black", alpha=0.8, size=4, fontface="bold", inherit.aes = FALSE) 

p

options(warn = defaultW)

What are the insights gained from the design?

Following major insights were obtained from the two designs about the demographics in Singapore in the year 2019:

The distribution of males and females in Singapore remains comparable until the age of 70 years with a small bulge on the female side during the 40s and a similar pattern for the males during the 60s.
After the age of 70, a significant shift in the population can be observed towards females. The female population over 70 years of age seems to be increasing constantly with a whopping 71% in the age group “90 and over”. On the contrary, the male population over 70 years of age follow a declining trend, reaching less than 30% in the age group, “90 and over”.
A maximum number of planning areas are mapped under the “Central” region of Singapore while the minimum is in the “East” region. This should not come as a surprise given that the major population and workspaces reside in the city centre than the outskirts.
Jurong West was observed to be one of the most inhabited planning areas of Singapore. Other most populated planning areas include Bedok, Woodlands and Sengkang. This can begin to make sense if we think in terms of dwellings. These areas mark mostly the outskirts of Singapore and hence the places for humongous dwellings constructions.
The “Central” region is distributed into quite several planning areas. Most of them are not highly populated. This kind of strategy helps distribute the crowd and systematically develop parts of the region one by one without affecting the neighbouring areas.
If we narrow the observation down to each stack bar, we can observe that the “Central” region is observed to have the lowest number of “Youngs” or children between ages 0 and 15. That actually makes sense because most workplaces are situated in the central region so one must expect working/active crowd to live nearby. Also, the active crowd with kids would like to shift to a more peaceful area and condos which are mostly available at lesser rates in the outskirts such as Jurong West.
Taking from insight #6, indeed the most “Young” population can be observed in the “North East” region marking the outskirt. Planning areas like Punggol are highly inhabited by children.

References

https://www.r-graph-gallery.com/299-circular-stacked-barplot.html

https://en.wikipedia.org/wiki/Regions_of_Singapore

https://www.singstat.gov.sg/-/media/files/standards_and_classifications/nsa.pdf.

Singapore Demographics by Age and Planning Area

Arya Akhil

7/21/2020

Relevant Installations and Imports

Packages

Libraries

Data

What do we want to achieve?

1. Pyramidal plot

2. Circular/Radial stacked bar chart

What are the challenges and how would we resolve them?

Challenge 1:

Solution 1:

Challenge 2:

Solution 2:

Challenge 3:

Solution 3:

Challenge 4:

Solution 4:

How would we prepare the design?

1. Pyramidal plot

1. Data preparation

2. Design preparation

2. Circular Stacked Bar Chart

1. Data preparation

2. Design preparation

What are the insights gained from the design?

References