This makeover visualises the demographic structure of Singapore in 2019 by age group and by planning area. The dataset used in this makeover was gathered by the Singapore Department of Statistics on 24th September 2020. The main focus of this visualisation is to understand the demographic distribution of Singapore residents.
A ternary plot is used to visualise the distribution of three-part compositional data. Here, the three parts are built from the age groups provided in the dataset, since together they describe the age composition of the residents. Each axis of the diagram is scaled from 0 to 1 to represent a proportion.
While visualising the dataset, I faced several data and design challenges.
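To make the scaling concrete, here is a minimal sketch using made-up counts. The area names and numbers are hypothetical and are not taken from the dataset. Each row is divided by its total so that the three parts sum to 1, which is the form in which a ternary plot positions its points.

```r
# Hypothetical counts for three made-up areas (not from the real dataset)
counts <- data.frame(
  Gen.X.BB = c(500, 300, 0),
  Gen.Y    = c(300, 300, 50),
  Gen.Z    = c(200, 400, 50),
  row.names = c("Area A", "Area B", "Area C")
)

# Divide each row by its total so the three components sum to 1
shares <- counts / rowSums(counts)

print(round(shares, 2))
```

A row such as "Area C", with no Generation X and Baby Boomer residents, ends up with a share of 0 on that axis.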
1. Data Challenges
2. Design Challenge
The dataset provides a list of age groups ranging from “0_to_4” to “90_and_over”. However, 19 age groups are too many to show in a single visualisation. I will be using a ternary plot to visualise the distribution of Singapore residents. To do so, I started by grouping the age groups into 3 categories, namely “Generation X and the Baby Boomers”, “Generation Y” and “Generation Z”.
Singapore residents aged between 40 and 54 years old are categorised under Generation X, while Singapore residents aged between 55 and 74 years old are categorised under Baby Boomers. Generation X and Baby Boomers are grouped as one category so that the data fits the three components of a ternary plot. Together they form the first component of the ternary diagram.
Singapore residents aged between 25 and 39 years old are categorised under Generation Y. They are the second component of the ternary diagram.
Generation Z comprises Singapore residents aged between 5 and 24 years old. It is the third component of the ternary diagram.
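The cut-offs above can be sketched as a small lookup. This is only an illustration: the function name to_generation is hypothetical, and it assumes the age-group labels start with the lower bound of the band (as in “5_to_9” or “90_and_over”). Age groups outside the three categories return NA.

```r
# Map an age-group label such as "5_to_9" to one of the three cohorts.
# The cut-offs follow the definitions in the text; the function name is hypothetical.
to_generation <- function(age_group) {
  lower <- as.integer(sub("_.*", "", age_group))  # lower bound of the band
  ifelse(lower >= 40 & lower <= 70, "Gen X BB",
  ifelse(lower >= 25 & lower <= 35, "Gen Y",
  ifelse(lower >= 5  & lower <= 20, "Gen Z", NA_character_)))
}

to_generation(c("0_to_4", "5_to_9", "25_to_29", "70_to_74", "90_and_over"))
```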
When the dataset is loaded in R Markdown, the age groups are sorted as text, character by character. As “5_to_9” starts with “5”, it is placed after “45_to_49”. Although this is ascending order for characters, it is not the correct order for the age groups. Therefore, when the “5_to_9” age group has to be selected together with the other age groups (e.g. “10_to_14”), it cannot be done using a single range (i.e. rowSums (.[6:9])) since “5_to_9” does not fall within that range. The column positions have to be checked and counted manually. Refer to Figure 1 to see the arrangement of the dataset.
Figure 1. Arrangement of age groups when the dataset is imported
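The sorting behaviour can be reproduced with a few labels. This is a minimal sketch on a shortened, illustrative list of age groups. The radix method is used so that the text sort does not depend on the locale, and extracting the numeric lower bound is one possible way to recover the age order.

```r
age_groups <- c("0_to_4", "5_to_9", "10_to_14", "45_to_49", "90_and_over")

# Character sorting compares text left to right, so "5_to_9" lands after "45_to_49"
sort(age_groups, method = "radix")

# Sorting by the numeric lower bound restores the age order
lower_bound <- as.integer(sub("_.*", "", age_groups))
in_age_order <- age_groups[order(lower_bound)]
in_age_order
```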
When the ternary diagram was created, “Gen.X.BB” was cut off at the side as seen in Figure 2.
Figure 2. Ternary Diagram created with three-part compositional age groups
Since there are 19 age groups, I will use the mutate() function and select the age-group columns manually. I will reference the column numbers in Figure 3 and add the columns up using rowSums(). As the age group “5_to_9” is not sorted in increasing order of age, column 15 has to be specified separately when grouping Singapore residents aged between 5 and 9 years old into Generation Z.
The overlap between the axis titles of the ternary diagram and the plot title can be adjusted using the margin.
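As an alternative to counting positions, the columns could also be selected by name, which does not depend on how they are sorted. The sketch below is only an illustration on a small made-up table; the values are hypothetical and only a few age-group columns are included.

```r
# Hypothetical wide table: one column per age group, in character-sorted order
wide <- data.frame(
  `10_to_14` = c(10, 20), `15_to_19` = c(10, 0), `20_to_24` = c(30, 10),
  `45_to_49` = c(40, 5),  `5_to_9`   = c(20, 10),
  check.names = FALSE
)

# Name the Generation Z columns explicitly instead of counting positions
gen_z_cols <- c("5_to_9", "10_to_14", "15_to_19", "20_to_24")
wide$Gen.Z <- rowSums(wide[, gen_z_cols])
wide$Gen.Z
```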
Figure 3. Proposed Sketch Design with the amended changes
tidyverse is a collection of core packages that can be installed and loaded with one command.
readr is designed to read and import most types of rectangular data (e.g. .csv) in an efficient way.
ggtern is an extension to ggplot2 for plotting ternary diagrams.
dplyr is used to manipulate the data. It is designed to abstract over how the data is stored.
packages = c('tidyverse', 'readr', 'ggtern', 'dplyr')
for(p in packages){
if(!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
As I am only interested in the visualisation for 2019, I will keep only the rows where Time is 2019 using the filter() function.
final_data = read_csv("./areasubzone20112020/respopagesextod2011to2020.csv")
filtered <- filter(final_data, Time == 2019)
final_data <- filtered
Use the head() function to display the first few rows of the data.
head(final_data)
## # A tibble: 6 x 7
## PA SZ AG Sex TOD Pop Time
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 1- and 2-Room Flats 0 2019
## 2 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 3-Room Flats 10 2019
## 3 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 4-Room Flats 10 2019
## 4 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 5-Room and Executive F… 20 2019
## 5 Ang Mo K… Ang Mo Kio Town… 0_to… Males HUDC Flats (excluding thos… 0 2019
## 6 Ang Mo K… Ang Mo Kio Town… 0_to… Males Landed Properties 0 2019
To group the age groups into three cohorts, I will first create three variables, namely “Gen X BB”, “Gen Y” and “Gen Z”, using the mutate() function. The mutate() function adds new variables/columns and preserves the existing ones.
Subsequently, I will refer to the data table and count the column positions of the selected age groups.
Grouping Age Cohorts
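mutated <- final_data %>%
# one column per age group; columns 6 onwards are sorted as text
spread(AG, Pop) %>%
# columns 13:14 are 40 to 49 and columns 16:20 are 50 to 74
mutate("Gen X BB" = rowSums(.[13:14]) + rowSums(.[16:20])) %>%
# columns 10:12 are 25 to 39
mutate("Gen Y" = rowSums(.[10:12])) %>%
# columns 7:9 are 10 to 24 and column 15 is 5_to_9
mutate("Gen Z" = rowSums(.[7:9]) + rowSums(.[15]))
df <- data.frame(mutated)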
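The variables are created as “Gen X BB”, “Gen Y” and “Gen Z”, but they are referred to as Gen.X.BB, Gen.Y and Gen.Z when plotting. The small sketch below, on a made-up one-row table, shows where the dots come from: by default data.frame() runs the column names through make.names(), which replaces spaces with dots.

```r
# Column names with spaces, as created in the mutate() step
wide <- data.frame(1, 2, 3)
names(wide) <- c("Gen X BB", "Gen Y", "Gen Z")

# data.frame() applies make.names() by default (check.names = TRUE),
# so spaces become dots
converted <- data.frame(wide)
names(converted)
```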
Plot a static ternary diagram using the ggtern() function. This gives users a clear view of the spread of the population in Singapore.
axis <- function(txt) {
list(
title = txt, tickformat = "%", tickfont = list(size = 10)
)
}
ternaryAxes <- list(
aaxis = axis("Gen X BB"),
baxis = axis("Gen Y"),
caxis = axis("Gen Z")
)
# ggtern visualization
ggtern(data = df, aes(x = Gen.X.BB, y = Gen.Y, z = Gen.Z)) +
geom_point(color = "pink") +
labs(title = "Age Cohort Generation Distribution 2019") +
# plot margins are set through theme(plot.margin = ...)
theme(plot.margin = margin(t = 80, l = 15),
tern.axis.title.L = element_text(hjust = 0))
The ternary diagram is created using the age groups and the population in every planning area. As the dataset is organised by planning area, using the population as the components of the ternary diagram shows the demographic distribution of each area. By using the age groups and population, the distribution of the demographics can be visualised easily (i.e. when most of the points are concentrated towards one component, most of the areas have a higher percentage of that component).
From the ternary diagram, some of the areas in Singapore consist entirely of Singapore residents who belong to Generation X and Baby Boomers. In other words, a portion of the planning areas consists only of Singapore residents aged 40 and above. While the majority of the planning areas have a mixture of all 3 generations, they have a higher percentage of residents who belong to Generation X and Baby Boomers.
There are several areas with no Generation X and Baby Boomer residents. According to the ternary plot, a few areas lie on the edge between Generation Y and Generation Z. This means that such an area either has a mixture of residents from only Generation Y and Generation Z, or consists of a single generation (either Generation Y or Z).
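If one point per planning area is wanted, the counts could first be summed within each planning area before the proportions are taken. The sketch below is a minimal illustration with made-up numbers, assuming the three cohort columns already exist; it uses base R rowsum() and is not the code used for the plot above.

```r
# Hypothetical rows: several records per planning area (PA), already grouped by cohort
df <- data.frame(
  PA       = c("Area A", "Area A", "Area B", "Area B"),
  Gen.X.BB = c(100, 300, 50, 50),
  Gen.Y    = c(100, 100, 150, 50),
  Gen.Z    = c(100, 100, 50, 50)
)

# Sum the cohort counts within each planning area (base R rowsum())
totals <- rowsum(df[, c("Gen.X.BB", "Gen.Y", "Gen.Z")], group = df$PA)

# Convert totals to proportions so each planning area sums to 1
shares <- totals / rowSums(totals)
round(shares, 2)
```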
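The reading of vertices and edges can be checked numerically. The sketch below uses hypothetical proportions for four made-up areas: a point sits on the Generation X and Baby Boomers vertex when that share is 1, and on the edge between Generation Y and Generation Z when that share is 0.

```r
# Hypothetical proportions for four made-up areas (each row sums to 1)
shares <- data.frame(
  Gen.X.BB = c(1.0, 0.5, 0.0, 0.0),
  Gen.Y    = c(0.0, 0.3, 0.6, 1.0),
  Gen.Z    = c(0.0, 0.2, 0.4, 0.0)
)

# A point sits on a vertex when one component is 1,
# and on the Generation Y-Z edge when the Gen.X.BB share is 0
only_gen_x_bb <- shares$Gen.X.BB == 1
on_yz_edge    <- shares$Gen.X.BB == 0

which(only_gen_x_bb)
which(on_yz_edge)
```

In this made-up example the fourth area lies on the edge and also on the Generation Y vertex, since it consists of a single generation.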