Demographic structure by Subzones and Races in Singapore

1. Introduction

Since Singapore is home to many different races, one of Singapore’s goals is to ensure that Singaporeans understand cultural differences and embrace racial harmony.

This data visualisation aims to discover and tell the story of race demographic patterns across the subzones of Singapore, based on the General Household Survey conducted in 2015.

There are 3 types of visualisations generated: an interactive pie chart, interactive maps, and a population pyramid.

2. Major data and design challenges

Two Excel sheets were used to generate the plots. Both were taken from https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data .

The first Excel sheet is the 2015 General Household Survey and shows the resident population by planning area/subzone, ethnic group, and sex.

The second Excel sheet contains the resident population by planning area, subzone, age group, sex, dwelling type, and year.

Type of challenge	Description
Data	The second Excel sheet covers 2011-2020, so the data must be filtered to keep only 2015.
Data	In the second Excel sheet, each age bin spans 5 years, except the last bin, “90 & Over”. This produces many age bins, and visualising each one separately would not help the reader. Grouping the bins into demographic age groups makes the chart easier to read.
Data	The first Excel sheet contains headers and subheaders. These must be filtered out in Excel to clean the data.
Data	To prepare a pyramid chart for two subzones, the data must be filtered in Excel to include only those two subzones.
Data	The value shown when the user hovers over the map comes from the first column of the dataframe, which may not be useful or representative of the data shown.
Design	To give the user context for the highlighted subzones, each subzone must also show the population of the different races.
Design	Singapore has many subzones, so the data must be visualised in a way that gives the reader an easy overview of race demographics by subzone.
Design	For the pyramid chart, the axes for the two subzones end at different values. Both sides of the axis must end at the same value to avoid confusing the reader.
Design	The default ggplot palette uses red and green for the bars. Since red and green should not appear together in the same pyramid chart, the palette must be changed for the population pyramid.

2.1 Data cleaning and preparation

Data cleaning and preparation were performed in Excel using the filter function. For the first Excel sheet, the filter function was applied to the first row. Below is a snippet of the first Excel sheet.

First Excel sheet

Filters were applied to columns B, C, and D. Under column B, the title and subtitle were filtered out. Under column C, blanks and the word “Total” were filtered out. Under column D, the column headers were filtered out. To create a common column for joining with the shapefile later on, a column called “SUBZONE_N” was created, and the subzone names were converted to upper case using Excel’s UPPER() function.
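The same upper-casing step could also be done in R instead of Excel. The following is only a minimal sketch: the column name `Subzone` and the toy values are hypothetical stand-ins, since the real header in the sheet is not shown here.

```r
# Toy stand-in for the cleaned first sheet.
# The column name "Subzone" is hypothetical; the counts are placeholders.
raw <- data.frame(
  Subzone = c("Tampines East", " Woodlands East"),
  Total   = c(100, 200)
)

# Equivalent of Excel's UPPER(): strip stray whitespace, then upper-case,
# so the key matches the SUBZONE_N values in the shapefile.
raw$SUBZONE_N <- toupper(trimws(raw$Subzone))
raw$SUBZONE_N
```

Trimming whitespace before upper-casing is an extra precaution, because a join on text keys fails silently when one side has a stray space.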

For the second Excel sheet, a filter was also applied to the first row. Below is a snippet of the second Excel sheet.

Second Excel sheet

Under column B, the subzones were filtered to only include “Tampines East” and “Woodlands East”.

2.2 Proposed Visualisation

3. Step-by-step description

3.1 Installing packages and reading data

The packages needed for this analysis (sf, tmap, tidyverse and plotly) are installed if they are missing and then loaded. The shapefile containing the subzone boundaries of Singapore is read into RStudio using the st_read function.

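# Packages used: sf (shapefiles), tmap (maps), tidyverse (data wrangling), plotly (pie chart)
packages <- c('sf', 'tmap', 'tidyverse', 'plotly')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}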

mpsz <- st_read(dsn = "data/geospatial", 
                layer = "MP14_SUBZONE_WEB_PL")
mpsz

Next, the subzone_races.csv file which contains the filtered data is read into RStudio using the read_csv function.

subzones <- read_csv("subzone_races.csv")
subzones

Next, a left join is performed to make sure both files are combined using the common “SUBZONE_N” column. The combined dataframe is called “races”.

races <- left_join(mpsz, subzones,
                              by = c("SUBZONE_N" = "SUBZONE_N"))
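Because a left join keeps every polygon in the shapefile, any subzone whose name has no match in the survey data ends up with NA population values. A quick way to spot such cases is to compare the two key columns before joining. The sketch below uses toy key vectors in place of `mpsz$SUBZONE_N` and `subzones$SUBZONE_N`; the values are illustrative only.

```r
# Toy keys standing in for mpsz$SUBZONE_N and subzones$SUBZONE_N (illustrative only).
shape_keys <- c("TAMPINES EAST", "WOODLANDS EAST", "NORTH-EASTERN ISLANDS")
data_keys  <- c("TAMPINES EAST", "WOODLANDS EAST")

# Keys present in the shapefile but missing from the survey data.
# After a left join, these polygons would carry NA in every population column.
unmatched <- setdiff(shape_keys, data_keys)
unmatched
```

With the real data, the equivalent call would be `setdiff(mpsz$SUBZONE_N, subzones$SUBZONE_N)`.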

Using the plotly library, an interactive pie chart is created to show the percentage of each race in Singapore. The number of people of each race is summed across the subzones dataframe. When the mouse hovers over a section, the percentage and number of people of that race are shown.

library(plotly)
labels <- c("Chinese", "Malay", "Indians", "Others")
# Sum each race across all subzones
values <- c(sum(subzones$`Chinese Total`), sum(subzones$`Malay Total`), sum(subzones$`Indians Total`), sum(subzones$`Others Total`))
fig <- plot_ly(type='pie', labels=labels, values=values, 
               textinfo='label+percent',
               insidetextorientation='radial')
fig <- fig %>% layout(title = "Percentage of different races in Singapore")
fig
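The percentages that plotly displays can be checked by hand. The sketch below is a minimal example on a toy dataframe that reuses the four column names from the code above. The counts are placeholders, not survey figures.

```r
# Toy stand-in for the subzones dataframe (placeholder counts, not survey data).
toy <- data.frame(
  `Chinese Total` = c(10, 30),
  `Malay Total`   = c(5, 5),
  `Indians Total` = c(4, 6),
  `Others Total`  = c(15, 25),
  check.names = FALSE   # keep the spaces in the column names
)

totals <- colSums(toy)                           # people per race across all subzones
shares <- round(100 * totals / sum(totals), 1)   # percentage per race
shares
```

`colSums()` replaces the four separate `sum()` calls, and the resulting named vector could be passed to `plot_ly()` as `values = totals`.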

A dataframe called “races2” is generated. “races2” reorders the columns in the “races” dataframe to put the SUBZONE_N column first so that the subzone name appears when the user hovers over the subzone on the map.

Using tmap, an interactive map showing the most populated subzones of Singapore is generated. Clicking on a subzone will allow the user to see the population of different races in that area.

races2 <- races %>% select(SUBZONE_N, everything())

tmap_mode("view")
tm_shape(races2) +
  tm_polygons("Total",
                # popup definition
                popup.vars=c(
                    "SZ: "="SUBZONE_N",
                    "Chinese: " = "Chinese Total",
                    "Malay: " = "Malay Total",
                    "Indians: " = "Indians Total")  
                )

The same “races2” dataframe is used to generate an interactive map for the different races.

The interactive map below is for the Chinese race.

races2 <- races %>% select(SUBZONE_N, everything())
tmap_mode("view")
tm_shape(races2) +
  tm_polygons("Chinese Total",
                # popup definition
                popup.vars=c(
                    "SZ: "="SUBZONE_N",
                    "Chinese: " = "Chinese Total",
                    "Malay: " = "Malay Total",
                    "Indians: " = "Indians Total")  
                )

The interactive map below is for the Malay race.

races2 <- races %>% select(SUBZONE_N, everything())
tmap_mode("view")
tm_shape(races2) +
  tm_polygons("Malay Total",
                # popup definition
                popup.vars=c(
                    "SZ: "="SUBZONE_N",
                    "Chinese: " = "Chinese Total",
                    "Malay: " = "Malay Total",
                    "Indians: " = "Indians Total")  
                )

The interactive map below is for the Indian race.

races2 <- races %>% select(SUBZONE_N, everything())
tmap_mode("view")
tm_shape(races2) +
  tm_polygons("Indians Total",
                # popup definition
                popup.vars=c(
                    "SZ: "="SUBZONE_N",
                    "Chinese: " = "Chinese Total",
                    "Malay: " = "Malay Total",
                    "Indians: " = "Indians Total")  
                )

The interactive map below is for people of other races.

races2 <- races %>% select(SUBZONE_N, everything())
tmap_mode("view")
tm_shape(races2) +
  tm_polygons("Others Total",
                # popup definition
                popup.vars=c(
                    "SZ: "="SUBZONE_N",
                    "Chinese: " = "Chinese Total",
                    "Malay: " = "Malay Total",
                    "Indians: " = "Indians Total")  
                )
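The five map blocks above differ only in the column passed to tm_polygons, so they could be wrapped in a small function. The helper below, `race_map`, is a hypothetical name and not part of the original code. It is a sketch that assumes the tmap package is installed and that `races2` has the columns used above.

```r
# Hypothetical helper (not part of the original code): draws one interactive
# choropleth for the given population column, with the same popup on every map.
race_map <- function(data, column) {
  tmap::tm_shape(data) +
    tmap::tm_polygons(column,
                      popup.vars = c("SZ: " = "SUBZONE_N",
                                     "Chinese: " = "Chinese Total",
                                     "Malay: " = "Malay Total",
                                     "Indians: " = "Indians Total"))
}

# Usage, assuming races2 exists and tmap_mode("view") has been set:
# race_map(races2, "Malay Total")
```

Keeping the popup definition in one place means a change to it, such as adding the “Others Total” column, only has to be made once.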

The second Excel sheet was read into RStudio using the read_csv function.

subzones <- read_csv("subzones.csv")

The data is then filtered to only contain data from 2015 as the previous visualisations are based on the 2015 General Household Survey.

subzones <- subzones %>% filter(Time == 2015)

We can now create demographic age groups for the data in the second Excel sheet. According to the Center for Generational Kinetics (https://genhq.com/faq-info-about-generations/), there are currently 5 primary generations that make up our society. The ages that make up each generation are shown in the table below.

Generation group	Age
Generation Z	0-24
Millennials	25-43
Generation X	44-55
Baby Boomers	56-74
Silent Generation	Above 74

Based on the table above, the generation age groups are created. The first step is to separate the ages in the age column into their respective generation age groups.

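# Assumes AG holds a numeric age; bounds are inclusive so that the boundary
# ages in the table (0, 24, 25, 43, ...) are not left unassigned
ind1 = subzones$AG >= 0 & subzones$AG <= 24
ind2 = subzones$AG >= 25 & subzones$AG <= 43
ind3 = subzones$AG >= 44 & subzones$AG <= 55
ind4 = subzones$AG >= 56 & subzones$AG <= 74
ind5 = subzones$AG > 74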

After the rows are categorized to the different demographic age groups, a new column called “generation” is created. A name is given to the respective generation age groups to categorize each of the rows.

subzones$generation[ind1] = 'Generation Z'
subzones$generation[ind2] = 'Millennials'
subzones$generation[ind3] = 'Generation X'
subzones$generation[ind4] = 'Baby Boomers'
subzones$generation[ind5] = 'Silent Generation'
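The two steps above (building index vectors, then assigning labels) can also be done in a single call to base R's `cut()`. The sketch below works on a toy vector of ages and assumes, as the index code does, that the age column is numeric. If `AG` is stored as text such as an age-range label, it would need to be converted first.

```r
# Toy ages standing in for subzones$AG; assumes the column is numeric.
age <- c(0, 24, 25, 43, 44, 55, 56, 74, 75, 90)

generation <- cut(
  age,
  breaks = c(0, 24, 43, 55, 74, Inf),
  labels = c("Generation Z", "Millennials", "Generation X",
             "Baby Boomers", "Silent Generation"),
  include.lowest = TRUE,  # keep age 0 in the first interval
  right = TRUE            # upper bound inclusive: 24 is Generation Z, 25 is Millennials
)

data.frame(age, generation)
```

Because the intervals are defined by one vector of breaks, no age can fall into a gap between two groups.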

The pyramid chart will show the number of people per generation for the 2 most densely populated areas in Singapore, Tampines East and Woodlands East.

A limit was set on the y axis so that the scale extends equally far on both sides of “0”. Since the original colours of the bars were red and green, they were changed to pink and blue so that readers with red-green colour blindness can tell the two subzones apart.

ggplot(subzones, aes(x = generation, fill = SZ,
                 y = ifelse(test = SZ == "Tampines East",
                            yes = -Pop, no = Pop))) + 
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = abs, limits = c(-40000,40000)) +
  labs(title = "Number of people per generation group in Tampines East and Woodlands East", x = "Generation group", y = "Number of people") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_colour_manual(values = c("pink", "steelblue"),
                      aesthetics = c("colour", "fill")) +
  coord_flip()
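The limits of -40000 and 40000 are hard-coded in the plot above. One possible alternative is to derive a symmetric limit from the data, so that the axis still fits if other subzones are chosen. The sketch below uses placeholder values in place of the signed `Pop` values. Rounding up to the next 10,000 is an arbitrary choice made for illustration.

```r
# Toy signed populations (negative = left-hand subzone); placeholder numbers only.
pop <- c(-12500, 31800, -27400, 9100)

# Take the largest absolute value and round it up to the next 10,000,
# so both sides of the pyramid share the same limit.
lim <- ceiling(max(abs(pop)) / 10000) * 10000
limits <- c(-lim, lim)
limits
```

The resulting vector could then be passed as `limits = limits` to `scale_y_continuous()`.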

4. Insights

Several insights emerge from the visualisations generated.

The first visualisation (pie chart) shows that the Chinese race is the largest ethnic group in Singapore. This suggests that, however much the Singapore government wants to prevent any one race from concentrating in a particular area, doing so will be very difficult. The government should therefore not overlook the need for other activities that let Singaporeans mingle with people of other races and feel a sense of racial harmony.

The interactive maps show that people of different races tend to live in “Tampines East” and “Woodlands East”. The government should take steps to prevent possible overpopulation in those areas. It should also put measures in place to ensure that people of different races are better spread out across Singapore, so that more Singaporeans can learn to understand different cultures and the goal of racial harmony is better achieved.

The last visualisation shows that Millennials are the largest generation group in “Tampines East” and “Woodlands East”. However, the populations of the other generation groups are still relatively high. This means that we will face an ageing population soon, and the government will have to make sure these 2 densely populated areas of Singapore have enough space to accommodate elderly homes and activities for both the elderly and Millennials.