1 Major Data and Design Challenges

1.1 Data needs to be filtered

The requirement is to visualise the demographic structure of the Singapore population by age cohort and planning area in 2019, so the data must first be filtered. The downloaded dataset covers the years 2011 to 2019, so only the rows where year == 2019 are kept. Other information such as subzone and type_of_dwelling is not relevant to the visualisations and can be dropped from the dataframe.
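
The filtering described above can be sketched in base R on a small toy data frame. The rows below are hypothetical stand-ins that only reuse the column names of the real dataset, and toy and toy_2019 are illustrative names.

```r
# Toy stand-in for the downloaded CSV (hypothetical rows, real column names)
toy <- data.frame(
  planning_area    = c("Ang Mo Kio", "Ang Mo Kio", "Bedok"),
  subzone          = c("Ang Mo Kio Town Centre", "Ang Mo Kio Town Centre", "Bedok North"),
  age_group        = c("0_to_4", "0_to_4", "5_to_9"),
  sex              = c("Males", "Males", "Females"),
  type_of_dwelling = c("HDB 3-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats"),
  resident_count   = c(10, 10, 20),
  year             = c(2011, 2019, 2019)
)

# Keep only the 2019 rows
toy_2019 <- toy[toy$year == 2019, ]

# Drop the columns that the visualisations do not use
toy_2019 <- toy_2019[, !(names(toy_2019) %in% c("subzone", "type_of_dwelling"))]

print(toy_2019)
```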

1.2 Data is in an inconsistent format

In the original dataset, the age range for 5 to 9 years is represented as “5_to_9”. Because the age groups are sorted as text, this label ends up among the labels that start with 5, after “45_to_49”, which disrupts the ordering of the age groups in the population pyramid. To preserve the ascending order of the age groups, change the representation of this age range to “05_to_9”.
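
As an alternative to renaming the label, the order can be taken from the number at the start of each label. This is a minimal sketch of that idea rather than the approach used in this write-up, and the names lower_bound, age_levels and age_factor are illustrative.

```r
# A few age-group labels in arbitrary order, including the original "5_to_9"
age_group <- c("10_to_14", "5_to_9", "0_to_4", "90_and_over", "45_to_49")

# Extract the leading number of each label, e.g. "45_to_49" -> 45
lower_bound <- as.numeric(sub("_.*$", "", age_group))

# Order the labels by that number and store the order as factor levels
age_levels <- unique(age_group[order(lower_bound)])
age_factor <- factor(age_group, levels = age_levels)

levels(age_factor)
```

ggplot2 places the values of a factor along a discrete axis in level order, so a factor built this way would not depend on how the labels sort as text.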

1.3 Too many planning areas to visualise

There are a total of 55 planning areas in Singapore, which will make the visualisation difficult to interpret because of many overlapping plots. As such, we can categorise the planning areas according to their regions (Central, North, Northeast, East, West) and visualise the demographic structure within the region itself. Information on which region each planning area belongs to can be found here. Each planning area is matched to its respective region.

1.4 Individual age groups are less meaningful to analyse

Similar to planning areas, there are many age groups provided in the dataset. However, this level of detail is less useful because the age groups are broken down into small intervals, which makes it hard to identify macro patterns in the demographic structure. As such, it is more meaningful to combine the age groups into 3 broader categories. The following shows the age range that is used for each category:

  • Young: 0-29 years old

  • Active: 30-64 years old

  • Old: 65 years old and above

Together with regional data mentioned above, the broader age categories can be used to generate ternary plots which illustrate the overall demographic structure of each planning area, within each region.


2 Proposed Sketched Design

Two visualisations are proposed to gain insights into the demographic structure of Singapore:

  1. Age-sex pyramid, also known as a population pyramid:

The age-sex pyramid provides a good overview of the distribution of the Singapore population by age cohort, showing which age groups make up the largest or smallest proportions of the total population. Each age group is further broken down by sex to show its composition.

  2. Ternary Plots

Ternary plots analyse compositional data in the three-dimensional case. Given the three broader age groups that were identified earlier, ternary plots can be used to display the distribution of young, economically active and old residents within a planning area. There will be one ternary plot for each region to prevent overcrowding of points on the ternary diagram, so that each point can be read more easily.
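
To illustrate how a three-part composition becomes a single point in a triangle, the sketch below converts the three counts to shares and then to 2D coordinates. The corner layout is an assumption made for illustration (YOUNG at the bottom left, ACTIVE at the bottom right, OLD at the top) and is not necessarily the layout a plotting package uses. The function name ternary_to_xy is hypothetical.

```r
# Map a three-part composition to 2D coordinates inside an equilateral triangle.
# Assumed corners: YOUNG at (0, 0), ACTIVE at (1, 0), OLD at (0.5, sqrt(3) / 2).
ternary_to_xy <- function(young, active, old) {
  total <- young + active + old
  active_share <- active / total
  old_share <- old / total
  # Weighted average of the three corner coordinates (the YOUNG corner adds 0)
  data.frame(
    x = active_share + old_share / 2,
    y = old_share * sqrt(3) / 2
  )
}

# Three pure compositions (the corners) and one perfectly balanced composition
corners_and_centre <- ternary_to_xy(
  young  = c(1, 0, 0, 1),
  active = c(0, 1, 0, 1),
  old    = c(0, 0, 1, 1)
)
print(corners_and_centre)
```

Only the relative shares matter: multiplying all three counts by the same factor leaves the point unchanged, which is why areas of very different sizes can be compared on the same diagram.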


3 Data Visualisation

3.1 Preparation of Data Visualisation

3.1.1 Loading packages

Install and load the following R packages, which are used to read the dataset and to plot the visualisations. The code below installs any package that is missing and then loads it into the RStudio environment.

  • tidyverse for data transformation and manipulation

  • ggthemes for the purpose of changing the themes of the visualisations below

packages <- c('tidyverse', 'ggthemes')

# Install each package if it is missing, then load it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

3.1.2 The dataset

The dataset used for this visualisation is Singapore Residents by Subzone and Type of Dwelling, 2011-2019, from data.gov.sg. The file is named planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv and is in CSV format.

3.1.3 Reading the dataset

Read the dataset using the read.csv() function, and print the head() to get a clear overview of what information the dataset provides.

demographic_data <- read.csv("planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv")
head(demographic_data)
##   planning_area                subzone age_group   sex
## 1    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 2    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 3    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 4    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 5    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 6    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
##                          type_of_dwelling resident_count year
## 1                 HDB 1- and 2-Room Flats              0 2011
## 2                        HDB 3-Room Flats             10 2011
## 3                        HDB 4-Room Flats             30 2011
## 4          HDB 5-Room and Executive Flats             50 2011
## 5 HUDC Flats (excluding those privatised)              0 2011
## 6                       Landed Properties              0 2011

3.2 Data Cleaning

  • Filter data relevant only to the year 2019 using the filter() function

  • Change the naming of the age group “5_to_9” to “05_to_9” using the mutate() function

After these steps, the “year” column contains only 2019, as the output below shows.

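yr2019 <- demographic_data %>%
  # Keep only the 2019 records
  filter(year == 2019) %>%
  # Zero-pad the 5-9 label; the anchors ^ and $ restrict the match to that exact label
  mutate(age_group = gsub("^5_to_9$", "05_to_9", age_group))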

yr2019 <- data.frame(yr2019)
head(yr2019)
##   planning_area                subzone age_group   sex
## 1    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 2    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 3    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 4    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 5    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 6    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
##                          type_of_dwelling resident_count year
## 1                 HDB 1- and 2-Room Flats              0 2019
## 2                        HDB 3-Room Flats             10 2019
## 3                        HDB 4-Room Flats             10 2019
## 4          HDB 5-Room and Executive Flats             20 2019
## 5 HUDC Flats (excluding those privatised)              0 2019
## 6                       Landed Properties              0 2019

3.3 Visualisation 1: Age-sex Pyramid

3.3.1 Creating a dataframe to be used for age-sex pyramid

  • Group the data by age_group and sex using the group_by() function, since planning area is not relevant in this plot

  • Summarise the resident_count as population to aggregate the resident count in each sex, within each age group using the summarise() function

  • Negate the population numbers for males (multiply them by -1) using an ifelse() condition so that their bars point in the opposite direction. This ensures that the male population is represented on the left while the female population is represented on the right for the pyramid.

The dataframe is now ready to be used for plotting of the age-sex pyramid.

pyramid_data <- yr2019 %>% 
  group_by(age_group, sex) %>%
  summarise(population=sum(resident_count)) %>% 
  mutate(pyramid_pop = ifelse(sex=="Males", -population, population))

pyramid_df <- data.frame(pyramid_data)
head(pyramid_df)
##   age_group     sex population pyramid_pop
## 1    0_to_4 Females      90850       90850
## 2    0_to_4   Males      94730      -94730
## 3   05_to_9 Females      97040       97040
## 4   05_to_9   Males     101290     -101290
## 5  10_to_14 Females     102550      102550
## 6  10_to_14   Males     105830     -105830

3.3.2 Plotting the age-sex pyramid

  • Use ggplot and geom_bar() to plot the pyramid.

  • stat="identity" is set in geom_bar because y values are provided, unlike a normal geom_bar(), which by default counts the rows for each x value

  • Set the scale for the population in thousands using scale_y_continuous and include the corresponding labels

  • coord_flip is used to change the orientation of the bars to horizontal

pyramid <- ggplot(pyramid_df, aes(x=age_group, y=pyramid_pop, fill=sex)) + 
  geom_bar(data = subset(pyramid_df, sex == "Females"), stat = "identity") +
  geom_bar(data = subset(pyramid_df, sex == "Males"), stat = "identity") +
  scale_y_continuous(breaks = seq(-150000, 150000, 50000), 
                     labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))))) + 
  coord_flip() + 
  scale_fill_manual(values = c("lightpink2", "steelblue3")) +
  ggtitle("Singapore Population Pyramid, 2019") + 
  xlab("Age Group") + 
  ylab("Population in thousands") +
  theme_fivethirtyeight() +
  theme(axis.title = element_text())

pyramid
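
The axis labels above are written out by hand. A possible alternative, sketched here with the hypothetical helper name label_thousands, is to derive each label from its break value. This works because the male values are stored as negative numbers only to flip the bars.

```r
# Same breaks as in the plot above
breaks <- seq(-150000, 150000, 50000)

# Turn a break value into a label: drop the sign and express it in thousands
label_thousands <- function(x) abs(x) / 1000

label_thousands(breaks)
```

A function like this can be passed directly as the labels argument of scale_y_continuous(), so the labels stay in step with the breaks if the breaks change.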


3.4 Visualisation 2: Ternary Plots

3.4.1 Segmenting Planning Area into Regions

As mentioned before, there are 55 planning areas, which makes it difficult to visualise the ternary plot because of the overlapping plots.

  • Keep only the relevant columns by dropping the columns subzone, sex, type_of_dwelling and year

  • Group by planning area and age group, and summarise resident_count as population to aggregate the resident count in each age group within each planning area

  • Create a vector for each region, with its corresponding planning areas. The vector for the 5th region (West) is not created because an ifelse() function is used to assign planning areas to that region.

  • Match the planning areas to their respective regions using the %in% operator and the ifelse() condition.

# Aggregate resident counts by planning area and age group.
# The result is named area_pop so that it does not shadow base R's subset().
area_pop <- yr2019 %>%
  select(-c(subzone, sex, type_of_dwelling, year)) %>%
  group_by(planning_area, age_group) %>%
  summarise(population = sum(resident_count))
area_pop <- data.frame(area_pop)

# Planning areas in each region; anything not listed here is assigned to West
Central <- c("Bishan", "Bukit Merah", "Bukit Timah", "Downtown Core", "Geylang", "Kallang", "Marina East", "Marina South", "Marine Parade", "Museum", "Newton", "Novena", "Orchard", "Outram", "Queenstown", "Dover", "Ghim Moh", "River Valley", "Rochor", "Singapore River", "Southern Islands", "Straits View", "Tanglin", "Toa Payoh")
East <- c("Bedok", "Changi", "Changi Bay", "Pasir Ris", "Paya Lebar", "Tampines")
North <- c("Central Water Catchment", "Lim Chu Kang", "Mandai", "Sembawang", "Simpang", "Sungei Kadut", "Woodlands", "Yishun")
Northeast <- c("Ang Mo Kio", "Hougang", "North-Eastern Islands", "Punggol", "Seletar", "Sengkang", "Serangoon")

# Match each planning area to its region
regional <- mutate(area_pop,
  Region = ifelse(planning_area %in% Central, "Central",
           ifelse(planning_area %in% East, "East",
           ifelse(planning_area %in% North, "North",
           ifelse(planning_area %in% Northeast, "North-east", "West")))))

regional_df <- data.frame(regional)
head(regional_df)
##   planning_area age_group population     Region
## 1    Ang Mo Kio    0_to_4       5420 North-east
## 2    Ang Mo Kio   05_to_9       6230 North-east
## 3    Ang Mo Kio  10_to_14       7380 North-east
## 4    Ang Mo Kio  15_to_19       7930 North-east
## 5    Ang Mo Kio  20_to_24       8920 North-east
## 6    Ang Mo Kio  25_to_29      10620 North-east

We can see that there are 5 unique regions now.

unique(regional_df$Region)
## [1] "North-east" "East"       "Central"    "West"       "North"
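
In the nested ifelse() above, any planning area that is not listed falls through to West, which would also happen to a misspelled name. A minimal sketch of an alternative lists West explicitly and returns NA for anything unmatched. The region lists are shortened for illustration, "Atlantis" is a made-up name, and region_lookup and region_of are hypothetical names.

```r
# Shortened region lists for illustration only
region_lookup <- list(
  East  = c("Bedok", "Tampines"),
  North = c("Woodlands", "Yishun"),
  West  = c("Clementi", "Jurong East")
)

# Named vector: planning area -> region
region_of <- setNames(
  rep(names(region_lookup), lengths(region_lookup)),
  unlist(region_lookup, use.names = FALSE)
)

# "Atlantis" is not in any list, so it maps to NA instead of silently becoming West
areas <- c("Bedok", "Yishun", "Clementi", "Atlantis")
region <- unname(region_of[areas])
region
```

Checking the result with anyNA(region) would then flag planning areas that are missing from the lists.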

3.4.2 Combining age groups into larger age brackets (Young, Active, Old)

To create the ternary plot,

  • Use the spread() function to pivot the dataframe. This makes it easier to aggregate the population for each age category by just selecting the relevant columns of age group

  • Use the mutate() function to select the relevant age groups for each category to derive the three new measures.

  • Remove planning areas which have no residential population (TOTAL=0) by using the filter() function

The dataframe is now ready to be used for the ternary plot.

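agpop_mutated <- regional_df %>%
  # One row per planning area, one column per age group
  spread(age_group, population) %>%
  # Columns 3 to 8: 0_to_4 up to 25_to_29
  mutate(YOUNG = rowSums(.[3:8])) %>%
  # Columns 9 to 15: 30_to_34 up to 60_to_64
  mutate(ACTIVE = rowSums(.[9:15])) %>%
  # Columns 16 to 21: 65_to_69 up to 90_and_over
  mutate(OLD = rowSums(.[16:21])) %>%
  # Columns 22 to 24: YOUNG, ACTIVE and OLD
  mutate(TOTAL = rowSums(.[22:24])) %>%
  # Drop planning areas with no residents
  filter(TOTAL > 0)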

agpop_mutated_df <- data.frame(agpop_mutated)
head(agpop_mutated_df)
##   planning_area     Region X0_to_4 X05_to_9 X10_to_14 X15_to_19 X20_to_24
## 1    Ang Mo Kio North-east    5420     6230      7380      7930      8920
## 2         Bedok       East   10020    11640     13300     14640     16660
## 3        Bishan    Central    2850     3850      4430      4740      5570
## 4   Bukit Batok       West    7130     6640      7800      8800      9850
## 5   Bukit Merah    Central    6100     6650      6640      6380      6850
## 6 Bukit Panjang       West    6700     7230      7680      8500      9570
##   X25_to_29 X30_to_34 X35_to_39 X40_to_44 X45_to_49 X50_to_54 X55_to_59
## 1     10620     10510     10940     11760     12570     12170     13090
## 2     19530     17940     18310     20070     21290     20870     22550
## 3      7090      5430      5290      5940      6860      6510      7220
## 4     12510     12480     10600     10690     11680     12010     12450
## 5      9140     10550     11050     11830     11780     10790     11100
## 6     10560     10740     10230      9610     10610     10450     11410
##   X60_to_64 X65_to_69 X70_to_74 X75_to_79 X80_to_84 X85_to_89 X90_and_over
## 1     12810     11970      8960      6160      3840      2110         1040
## 2     21830     18810     13660      8300      5600      3130         1820
## 3      7140      5730      3880      2540      1670       970          520
## 4     11590      8560      5020      2930      1820      1020          560
## 5     11270     10370      8310      5990      4190      2220         1390
## 6      9970      6910      4230      2470      1560       820          450
##   YOUNG ACTIVE   OLD  TOTAL
## 1 46500  83850 34080 164430
## 2 85790 142860 51320 279970
## 3 28530  44390 15310  88230
## 4 52730  81500 19910 154140
## 5 41760  78370 32470 152600
## 6 50240  73020 16440 139700
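
The positional indices used above (3:8, 9:15, 16:21) depend on the column order produced by spread(). A name-based alternative, sketched here in base R on hypothetical toy numbers, derives each bracket from the lower bound of the age-group label. The cut-offs 0, 30 and 65 mirror the column ranges summed above (0_to_4 to 25_to_29, 30_to_34 to 60_to_64, 65_to_69 onwards). The toy data frame and the names lower_bound, bracket and bracket_totals are illustrative assumptions.

```r
# Toy long-format data for one made-up planning area
toy <- data.frame(
  planning_area = "Area A",
  age_group = c("0_to_4", "25_to_29", "30_to_34", "60_to_64", "65_to_69", "90_and_over"),
  population = c(10, 20, 30, 40, 50, 60)
)

# Leading number of each label, e.g. "90_and_over" -> 90
lower_bound <- as.numeric(sub("_.*$", "", toy$age_group))

# Assign brackets by lower bound: [0, 30), [30, 65), [65, Inf)
toy$bracket <- cut(
  lower_bound,
  breaks = c(0, 30, 65, Inf),
  labels = c("YOUNG", "ACTIVE", "OLD"),
  right = FALSE
)

# Total population per planning area and bracket
bracket_totals <- tapply(toy$population, list(toy$planning_area, toy$bracket), sum)
bracket_totals
```

Changing a bracket definition then means editing one number in breaks rather than recounting column positions.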

3.4.3 Plotting an Overall ternary diagram

Use the library ggtern to create the overall ternary plot of the Singapore population in 2019.

  • Load the library ggtern. This is not done in the earlier code chunk that loads the other R packages because of conflicts between ggtern and tidyverse

  • Color is set to Region so that each region is represented by a different colour

  • Size is set to TOTAL such that the size of each point is proportional to the population size of the planning area it represents

library(ggtern)
overall_ternary <- ggtern(data=agpop_mutated_df, aes(x=YOUNG,y=ACTIVE, z=OLD, color=Region, size=TOTAL)) +
  geom_point(alpha=0.5)+
  labs(title="Overall Demographic Structure in Singapore by Region, 2019") +
  theme_tropical() 

overall_ternary


3.4.4 Plotting ternary diagram for each region

While the overall ternary diagram provides an overview of the demographic composition within each region, we can plot individual regional ternary plots to see the demographic composition for each planning area.

  • Use the filter() function to select the data for a specific region

North Region

#NORTH region
north <- agpop_mutated %>% filter(Region=="North")
north_tp <- ggtern(data= north, aes(x=YOUNG,y=ACTIVE, z=OLD, color=planning_area, size=TOTAL)) +
  geom_point(alpha=0.5)+
  labs(title="Demographic Structure in North region, 2019") +
  theme_tropical() 

north_tp


Central Region

#CENTRAL region
central <- agpop_mutated %>% filter(Region=="Central")
central_tp <- ggtern(data= central, aes(x=YOUNG,y=ACTIVE, z=OLD, color=planning_area, size=TOTAL)) +
  geom_point(alpha=0.5)+
  labs(title="Demographic Structure in Central region, 2019") +
  theme_tropical() 

central_tp


West Region

#WEST region
west <- agpop_mutated %>% filter(Region=="West")
west_tp <- ggtern(data= west, aes(x=YOUNG,y=ACTIVE, z=OLD, color=planning_area, size=TOTAL)) +
  geom_point(alpha=0.5)+
  labs(title="Demographic Structure in West region, 2019") +
  theme_tropical() 

west_tp


Northeast Region

#NORTH-EAST region
northeast <- agpop_mutated %>% filter(Region=="North-east")
northeast_tp <- ggtern(data= northeast, aes(x=YOUNG,y=ACTIVE, z=OLD, color=planning_area, size=TOTAL)) +
  geom_point(alpha=0.5)+
  labs(title="Demographic Structure in North-east region, 2019") +
  theme_tropical() 

northeast_tp


East Region

#EAST region
east <- agpop_mutated %>% filter(Region=="East")
east_tp <- ggtern(data= east, aes(x=YOUNG,y=ACTIVE, z=OLD, color=planning_area, size=TOTAL)) +
  geom_point(alpha=0.5)+
  labs(title="Demographic Structure in East region, 2019") +
  theme_tropical() 

east_tp
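
The five regional chunks above differ only in the region name. One way to avoid the repetition, sketched under the assumption that ggtern is attached as in section 3.4.3, is to split the data by region and apply a single plotting function. The toy numbers and the names by_region and plot_region are hypothetical.

```r
# Toy stand-in for agpop_mutated_df (made-up areas and counts)
toy <- data.frame(
  planning_area = c("Area A", "Area B", "Area C"),
  Region = c("North", "North", "East"),
  YOUNG = c(30, 40, 20),
  ACTIVE = c(50, 45, 60),
  OLD = c(20, 15, 20),
  stringsAsFactors = FALSE
)
toy$TOTAL <- toy$YOUNG + toy$ACTIVE + toy$OLD

# One data frame per region, named by region
by_region <- split(toy, toy$Region)

# Same layers as the regional chunks above; needs library(ggtern) when called
plot_region <- function(region_df, region_name) {
  ggtern(data = region_df,
         aes(x = YOUNG, y = ACTIVE, z = OLD, color = planning_area, size = TOTAL)) +
    geom_point(alpha = 0.5) +
    labs(title = paste0("Demographic Structure in ", region_name, " region, 2019")) +
    theme_tropical()
}

# With ggtern attached, this would build one plot per region:
# region_plots <- Map(plot_region, by_region, names(by_region))
```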


4 Final Visualisation & Insights


  1. From the age-sex pyramid, we can tell that Singapore has an ageing population: the largest share of the population is economically active, and an increasing proportion belongs to the aged population, reflecting longer life expectancy among Singaporeans. The relatively narrow base reflects declining birth rates and thus a smaller proportion of young people. Additionally, females outnumber males among Singaporeans aged 75 and above, whereas the proportions of males and females in the other age groups are similar.

  2. From the overall ternary plot of the demographic structure in Singapore, we can tell that the demographic composition of most regions follows the age-sex pyramid of Singapore, with around 30-35% of the population being young, 18-20% being old and the remaining 45-55% being economically active. There are, however, some exceptions, especially in the Central region (coloured pink). Planning areas belonging to this region seem to have a lower proportion of both the young and old population, and a higher proportion of the economically active, compared to the majority of other planning areas.

  3. From the individual regional ternary plots, we can gain insights into Singapore’s demographics geographically, as each point represents a planning area within the region. We can identify the planning areas mentioned earlier that have a higher proportion of economically active population, such as Marine Parade in the Central region. In the North region, Sungei Kadut has a much higher proportion of older residents (around 21%) than the other planning areas within the region (around 15%). In the West region, Western Water Catchment has a much smaller proportion of older residents (1-2%) than the other planning areas within the region (around 15-20%). These insights are useful for policymakers and urban planners, who can better take into consideration the needs of the particular demographic within each planning area. For example, for Sungei Kadut, urban planners can examine whether the eldercare facilities there are adequate to meet the needs of an older population.
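
Percentages like those quoted above can be computed from the YOUNG, ACTIVE and OLD columns. As a minimal sketch, the code below does this for the six planning areas printed in section 3.4.2, with the counts copied from that output. The names brackets, counts and shares are illustrative.

```r
# Bracket totals for the six planning areas shown in section 3.4.2
brackets <- data.frame(
  planning_area = c("Ang Mo Kio", "Bedok", "Bishan", "Bukit Batok", "Bukit Merah", "Bukit Panjang"),
  YOUNG  = c(46500, 85790, 28530, 52730, 41760, 50240),
  ACTIVE = c(83850, 142860, 44390, 81500, 78370, 73020),
  OLD    = c(34080, 51320, 15310, 19910, 32470, 16440)
)

# Share of each bracket within its planning area, in percent
counts <- as.matrix(brackets[, c("YOUNG", "ACTIVE", "OLD")])
shares <- round(100 * counts / rowSums(counts), 1)
rownames(shares) <- brackets$planning_area

shares
```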