Singapore
# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# installing packages
packages = c('corrplot', 'ggpubr', 'tidyverse', 'dplyr')
for(p in packages){
if(!require(p, character.only = T)){install.packages(p)}}
options(warn = defaultW)
# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# importing libraries
library(data.table)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(viridis)
options(warn = defaultW)
# setting working directory
setwd("F:/Arya/MITB/Term2/VA/Assignment 4/Arya_Akhil_Assignment4/Arya_Akhil_Assignment4")
# importing data
df <- read.csv(file = 'respopagesextod2011to2019.csv')
Collating the problem statement and the available data, we are going to use the following features for our design:
We will also create additional features derived from the ones stated above to tackle the design challenges.
To reveal the demographic structure of Singapore’s population by age-cohort and planning area, we will try to achieve the following designs:
The pyramidal plot will help us understand the distribution of males and females across different age groups at finer levels such as the ones already present in the data. We are not posed with any major challenges from the data to design the pyramidal plots.
Pyramidal Plot
A circular stacked chart is a difficult yet enlightening feast. Through it, we wish to understand the following:
We are indeed posed with quite many data challenges for this design. We will try and tackle one by one in the subsequent steps.
Circular Stacked Bar Chart
The proposed design faces some challenges from the available data. We have listed the challenges and their solutions below:
The problem statement requires demographics as reflected in the year 2019 but the given data contains nine years of information from 2011 to 2019. Below is the code to show the same:
# obtaining unique values of years
unique(df$Time)
## [1] 2011 2012 2013 2014 2015 2016 2017 2018 2019
We will create a subset of the dataframe to contain only the year 2019. Below is the code for the same:
# creating a subset, year = 2019
df = df[df$Time == 2019,]
If we pivot the data and try to find the total population by Planning Area, we observe that 13 planning areas have 0 inhabitance. It could be due to two major reasons - (i) The planning areas are re-categorized into sub-zones now. For example, “Boon Lay” to “Boon Lay Place”. (ii) The places are actually uninhabited.
Below is the code to show the same:
# grouping population by planning area
sum_pop = group_by(df, PA)
sum_pop = summarise(sum_pop, Population = sum(Pop))
# filtering out the planning areas with zero inhabitance
sum_pop_zero = filter(sum_pop, Population == 0)
print(sum_pop_zero)
## # A tibble: 13 x 2
## PA Population
## <fct> <int>
## 1 Boon Lay 0
## 2 Central Water Catchment 0
## 3 Changi Bay 0
## 4 Marina East 0
## 5 Marina South 0
## 6 North-Eastern Islands 0
## 7 Paya Lebar 0
## 8 Pioneer 0
## 9 Simpang 0
## 10 Straits View 0
## 11 Tengah 0
## 12 Tuas 0
## 13 Western Islands 0
These planning areas will not be able to contribute to our understanding of demographics because they are uninhabited and so it is imperative that we should filter them out of the dataframe used for visualization. Below is the code for the same:
# converting the values of zero inhabitance planning areas to vector
zero_inhab <- sum_pop_zero[['PA']]
# filtering the dataframe to remove zero_inhab planning areas
df <- df %>% dplyr::filter(!PA %in% zero_inhab)
As proposed in the design, we want to follow a broader age group categorization as follows: 0-14 years old, 15-64 years old and 65 years old and over.
This categorization is obtained from the following link under the section, “Statistical Classification” -
https://www.singstat.gov.sg/-/media/files/standards_and_classifications/nsa.pdf.
The given data contains finer divisions of age groups at an interval of 5 years such as 0-4, 5-9…65-69.. etc., which are not suitable for our purpose. Below is the code to show the same:
# obtaining unique values of age groups
unique(df$AG)
## [1] 0_to_4 5_to_9 10_to_14 15_to_19 20_to_24 25_to_29
## [7] 30_to_34 35_to_39 40_to_44 45_to_49 50_to_54 55_to_59
## [13] 60_to_64 65_to_69 70_to_74 75_to_79 80_to_84 85_to_89
## [19] 90_and_over
## 19 Levels: 0_to_4 10_to_14 15_to_19 20_to_24 25_to_29 30_to_34 ... 90_and_over
We will group and recode the age groups to reflect the broader age divisions we have envisioned for our visualization. Below is the code for the same:
df$Age_Cohort <- with(df, dplyr::case_when(AG %in% c("0_to_4",
"5_to_9",
"10_to_14"
) ~ 'Young',
AG %in% c("60_to_64",
"65_to_69",
"70_to_74",
"75_to_79",
"80_to_84",
"85_to_89",
"90_and_over"
) ~ 'Old',
TRUE ~ 'Active'))
The planning areas themselves do not provide specific information to an outsider as to which region of Singapore e.g., central, east, west, etc., are more populated than the others. Moreover, the huge number of planning areas per the size of Singapore also makes it difficult to narrow down the region-specific demographics and understand the distribution on a broader level.
Below is the code to show the number of planning areas:
# obtaining unique values of age groups
unique(df$PA)
## [1] Ang Mo Kio Bedok Bishan
## [4] Bukit Batok Bukit Merah Bukit Panjang
## [7] Bukit Timah Changi Choa Chu Kang
## [10] Clementi Downtown Core Geylang
## [13] Hougang Jurong East Jurong West
## [16] Kallang Lim Chu Kang Mandai
## [19] Marine Parade Museum Newton
## [22] Novena Orchard Outram
## [25] Pasir Ris Punggol Queenstown
## [28] River Valley Rochor Seletar
## [31] Sembawang Sengkang Serangoon
## [34] Singapore River Southern Islands Sungei Kadut
## [37] Tampines Tanglin Toa Payoh
## [40] Western Water Catchment Woodlands Yishun
## 55 Levels: Ang Mo Kio Bedok Bishan Boon Lay Bukit Batok ... Yishun
Therefore, as shown in the design sketch, we wish to recode the planning areas to regions’ categorization of Singapore. Then use the regions in conjunction with the planning areas to reveal the region-wise demographics of Singapore. The reference is taken from the link below -
https://en.wikipedia.org/wiki/Regions_of_Singapore
Below is the code to do the same:
df$Region <- NA
df$Region <- with(df, dplyr::case_when(PA %in% c('Central Water Catchment','Lim Chu Kang','Mandai','Sembawang',
'Simpang','Sungei Kadult','Woodlands','Yishun') ~'North',
PA %in% c('Ang Mo Kio','Hougang','North-Eastern Islands','Punggol',
'Seletar','Sengkang','Serangoon')~'N.East',
PA %in% c('Bedok','Changi','Changi Bay', 'Paya Lebar','Pasir Ris',
'Tampines') ~ 'East',
PA %in% c('Boon Lay','Bukit Batok','Bukit Panjang','Choo Chu Kang',
'Clementi','Jurong East','Jurong West','Pioneer',
'Tengah','Tuas','Western Islands',
'Western Water Catchment') ~ 'West',
TRUE ~ 'Central'))
We have used ggplot2 to achive the goal. We have used the bar chart feature in conjunction with the coordinate flip feature, “coord_flip” to construct of pyramid of population distribution over agr groups.
We subset the relevant columns from the original data and create a new dataframe to be called as “py_data”, or pyramid data.
We cast data by population to aggregate it at the level of males and females by age groups.
We manually sorted the data to avoid any discrepancies later as the data is categorical
We converted the number to percentages because it is easier to compare categorical data in percentages than frequencies.
We melted back the data to shape the data for the pyramid - stacking each age group on top of the other.
At last, since we want to create diverging bars to represent males and females, we need to push males to the negative axis to give the effect of divergence.
The combined code is as shown below:
# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# creating a subset dataframe
py_data = df %>%
select(AG, Sex, Pop)
# Casting data by population
py_data = dcast(setDT(py_data), Sex~AG, fun = list(sum), value.var = "Pop")
# manually sorting columns so avoid any discrepancy as this is categorical data
py_data = py_data[, c(1, 2, 11, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20)]
# converting to percentages for comparison
py_data = py_data %>%
mutate_if(is.numeric, funs(.*100/sum(.)))
# melting data by sex and consolidating the sum of population
py_data = melt(py_data,id=c("Sex"))
# reshaping data to align with diverging bars of the pyramid,
# shifting males to negative axis
py_data$value[py_data$Sex=="Males"] = py_data$value[py_data$Sex=="Males"]*-1
options(warn = defaultW)
We used the “geom_bar” feature of ggplot2 to plot the bars of age groups on top of each other.
We also specified the x-axis scales to meet the requirements of data i.e., 250k.
At last, we did the coordinate flip to align the pyramid vertically rather than horizontally bu using the feature, “coord_flip”.
The combined code is as shown below:
# plotting the pyramid plot
ggplot(py_data, aes(x = variable , y = value, fill = Sex)) +
geom_bar(stat="identity", position="identity") +
scale_y_continuous(breaks = seq(-100, 100, 25),
labels = paste0(as.character(c(seq(100,0,-25), seq(25,100,25)))))+
labs(x = "Age Groups",
y = "Percent Population",
title = "Percentage Distribution of Males and Females across Age groups") +coord_flip()
We have used ggplot2 to achive the goal. We have used stacked bar chart in conjunction with the “coord_polar” feature to organise the stacked bar chart in a circular manner.
We subset the relevant columns from the original data and create a new dataframe to be called as “csb_data”, or circular stacked bar data.
We explicitly changed the data type of categorical columns to factor just as a failsafe.
At last, we consolidated the population by region, planning area, and age cohort.
We defined the width of the gap between the regions to be 2 empty bars.
We added an “id” column to the csb_data. This will be later used to identify records uniquely.
We calculated the angle and position of each region and planning area to be shown on the circular stacked bar chart. We will save this information as “label_csb_data”.
We created two dataframes - “base_csb_data” and “grid_csb_data” to save the starting point of the region and the split of regions across 360 degrees by x and y coordinate.
The combined code is as shown below:
# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# creating a subset dataframe
csb_data = df %>%
select(PA, Region, Age_Cohort ,Pop)
# changing data type to factor
csb_data$PA <- as.factor(csb_data$PA)
csb_data$Region <- as.factor(csb_data$Region)
csb_data$Age_Cohort <- as.factor(csb_data$Age_Cohort)
# grouping by PA, region and age cohort
csb_data = group_by(csb_data, PA, Region, Age_Cohort)
csb_data = data.frame(summarise(csb_data, Pop = sum(Pop)))
# Setting a number of 'empty bar' to add at the end of each Region
empty_bar <- 2
nObsType <- nlevels(as.factor(csb_data$Age_Cohort))
to_add <- data.frame( matrix(NA, empty_bar*nlevels(csb_data$Region)*nObsType, ncol(csb_data)) )
colnames(to_add) <- colnames(csb_data)
to_add$Region <- rep(levels(csb_data$Region), each=empty_bar*nObsType )
csb_data <- rbind(csb_data, to_add)
csb_data <- csb_data %>% arrange(Region, PA)
csb_data$id <- rep( seq(1, nrow(csb_data)/nObsType) , each=nObsType)
# Getting the name and the y position of each label
label_csb_data <- csb_data %>% group_by(id, PA) %>% summarize(tot=sum(Pop))
number_of_bar <- nrow(label_csb_data)
angle <- 90 - 360 * (label_csb_data$id-0.5) /number_of_bar # I substract 0.5 because the letter must have the angle of the center of the bars. Not extreme right(1) or extreme left (0)
label_csb_data$hjust <- ifelse( angle < -90, 1, 0)
label_csb_data$angle <- ifelse(angle < -90, angle+180, angle)
# preparing a csb_data frame for base lines
base_csb_data <- csb_data %>%
group_by(Region) %>%
summarize(start=min(id), end=max(id) - empty_bar) %>%
rowwise() %>%
mutate(title=mean(c(start, end)))
# preparing a csb_data frame for grid (scales)
grid_csb_data <- base_csb_data
grid_csb_data$end <- grid_csb_data$end[ c( nrow(grid_csb_data), 1:nrow(grid_csb_data)-1)] + 1
grid_csb_data$start <- grid_csb_data$start - 1
grid_csb_data <- grid_csb_data[-1,]
options(warn = defaultW)
We created a stacked bar chart of the age cohort as “p” using the “geom_bar” feature of ggplot2.
We added the population values on the y-axis to be used as axis values to read population data. From the data, we figured that the scale should stretch from 0 to 250k.
We also added the labels of the scale indicated above, for user readability.
We stretched the circular plot to meet the aesthetic requirements and eas out use readability. Stretched up to 250k on the y-axis.
At last, we added labels on top of stacked bar charts to specify regions and also around the circle to specify the regions.
The combined code is as shown below:
# suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# Plotting the circular stacked bar chart
p <- ggplot(csb_data) +
# Adding the stacked bar
geom_bar(aes(x=as.factor(id), y=Pop, fill=Age_Cohort), stat="identity", alpha=0.5) +
scale_fill_viridis(discrete=TRUE) +
# Adding a val=100/75/50/25 lines
geom_segment(data=grid_csb_data, aes(x = end, y = 0, xend = start, yend = 0), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=grid_csb_data, aes(x = end, y = 50000, xend = start, yend = 50000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=grid_csb_data, aes(x = end, y = 100000, xend = start, yend = 100000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=grid_csb_data, aes(x = end, y = 150000, xend = start, yend = 150000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=grid_csb_data, aes(x = end, y = 200000, xend = start, yend = 200000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
geom_segment(data=grid_csb_data, aes(x = end, y = 250000, xend = start, yend = 250000), colour = "grey", alpha=1, size=0.3 , inherit.aes = FALSE ) +
# Adding text showing the Pop of each 100/75/50/25 lines
ggplot2::annotate("text", x = rep(max(csb_data$id),6), y = c(0, 50000, 100000, 150000, 200000,250000), label = c("0k", "50k", "100k", "150k", "200k","250k") , color="grey", size=3.8 , angle=0, fontface="bold", hjust=0.9) +
ylim(-450000,max(label_csb_data$tot, na.rm=T)) +
theme_minimal() +
# Adding plot title
ggtitle("Distribution of Age Cohorts by Planning Area and Region")+
theme(
plot.title = element_text(color="black",size=15),
legend.title = element_blank() ,
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
#plot.margin = unit(rep(-1,5), "cm")
) +
coord_polar() +
# Adding labels on top of each bar
geom_text(data=label_csb_data, aes(x=id, y=tot+19, label=PA, hjust=hjust), color="black", fontface="bold",alpha=0.6, size=3, angle= label_csb_data$angle, inherit.aes = FALSE )+
# Adding base line information
geom_segment(data=base_csb_data, aes(x = start, y = -6, xend = end, yend = -6), colour = "black", alpha=0.8, size=0.6 , inherit.aes = FALSE ) +
geom_text(data=base_csb_data, aes(x = title, y = -30, label=Region), hjust=c(1,-0.1,-0.1,-0.1,-0.1), colour = "black", alpha=0.8, size=4, fontface="bold", inherit.aes = FALSE)
p
options(warn = defaultW)
Following major insights were obtained from the two designs about the demographics in Singapore in the year 2019:
The distribution of males and females in Singapore remains comparable until the age of 70 years with a small bulge on the female side during the 40s and a similar pattern for the males during the 60s.
After the age of 70, a significant shift in the population can be observed towards females. The female population over 70 years of age seems to be increasing constantly with a whopping 71% in the age group “90 and over”. On the contrary, the male population over 70 years of age follow a declining trend, reaching less than 30% in the age group, “90 and over”.
A maximum number of planning areas are mapped under the “Central” region of Singapore while the minimum is in the “East” region. This should not come as a surprise given that the major population and workspaces reside in the city centre than the outskirts.
Jurong West was observed to be one of the most inhabited planning areas of Singapore. Other most populated planning areas include Bedok, Woodlands and Sengkang. This can begin to make sense if we think in terms of dwellings. These areas mark mostly the outskirts of Singapore and hence the places for humongous dwellings constructions.
The “Central” region is distributed into quite several planning areas. Most of them are not highly populated. This kind of strategy helps distribute the crowd and systematically develop parts of the region one by one without affecting the neighbouring areas.
If we narrow the observation down to each stack bar, we can observe that the “Central” region is observed to have the lowest number of “Youngs” or children between ages 0 and 15. That actually makes sense because most workplaces are situated in the central region so one must expect working/active crowd to live nearby. Also, the active crowd with kids would like to shift to a more peaceful area and condos which are mostly available at lesser rates in the outskirts such as Jurong West.
Taking from insight #6, indeed the most “Young” population can be observed in the “North East” region marking the outskirt. Planning areas like Punggol are highly inhabited by children.