Over the years, one of favourite social topic that are frequently discussed is how the population in Singapore has been aging.
In this DataViz, I will explore the changing patterns of Singapore demographics by geographical hierarchy from 2011 to 2019. To do so, I will be using the population dataset from Singstat website.
Fig 1: News article taken from Today’s news article on 11 March 2020
Challenge 1 - The data doesn’t contain the relevant demographic composition (i.e. Young, Economically Active, Aged Group)
There is no demographic composition in the dataset as shown in Fig 2 below.
Fig 2: Screen shot of the data format
So, I will create an indicator later on to group the different age groups into different demographic compositions.
Challenge 2 - Static plot
The usual way of plotting such graph to illustrate the changes over time is to illustrate through data table format or plot the change in demographic composition through the years by using different graph, where each graph will represent one year.
Fig 3: Usual static plot
Challenge 3 - Current data format
Currently the data is in transaction format as shown under Fig 2 earlier. Further grouping is required to split the demographic composition into different columns, instead of sharing the same columns. This is to allow us to plot demographic composition onto different axis.
Below are the prototypes and the explanation of respective components can be found below as well:
Fig 4: Prototype of the graph
I will be using Plotly to plot the graph to address the challenge 2 stated above. Plotly is a great tool to allow the users to interact with the graph. Besides, Plotly also provides users additional functions such as zoom in/out the graph, highlight the data points and so on.
I will also leverage on the animation function in plotly to illustrate how demographic composition by planning area have changed over time. This will also resolve the challenge 2 mentioned above.
Fig 5: Prototype of the data
To address the challenge 1 above, I will create an economic dependency indicator as shown in Fig 5 above.
As we would like to plot the proportion of young, active and elderly under different axis, the data will need to be transformed from current transaction format to data table format. This is illustrated under point 4 in the Fig 5.
After deciding on the graph to be plotted and how to structure the data, below are the list of required R packages in order to plot the graph:
packages = c('tidyverse', 'plotly')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
All the necessary libraries will be loaded into the R environment. By using the following codes, R will first search whether the user has installed packages in R. If the selected packages are not installed, R will install them. Once this is done, all the selected libraries will be loaded into R environment.
I will read the dataset that I have downloaded from Singstat.
data_2000_2010 <- read_csv("data/respopagesextod2000to2010.csv")
data_2011_2019 <- read_csv("data/respopagesextod2011to2019.csv")
Then, I will combine the two files by using rbind. This would allow me to stack one data file onto another one.
total_data <- rbind(data_2000_2010, data_2011_2019)
After stacking the data, I will do a quick check on the number of data points of the merged data. This is to ensure the number of data in the merged model still stack up to the sum of data of the two original datasets.
count_totaldata <- nrow(total_data) * ncol(total_data)
count_data_2000_2010 <- nrow(data_2000_2010) * ncol(data_2000_2010)
count_data_2011_2019 <- nrow(data_2011_2019) * ncol(data_2011_2019)
count_totaldata - count_data_2000_2010 - count_data_2011_2019
## [1] 0
Now, I will start plotting the graph. I will describe the steps taken in order to plot the relevant graphs.
To do so, first we need to group the variables. To do so, I first create a mapping table for the different economic dependencies.
AG <- c("0_to_4",
"5_to_9",
"10_to_14",
"15_to_19",
"20_to_24",
"25_to_29",
"30_to_34",
"35_to_39",
"40_to_44",
"45_to_49",
"50_to_54",
"55_to_59",
"60_to_64",
"65_to_69",
"70_to_74",
"75_to_79",
"80_to_84",
"85_to_89",
"90_and_over")
Economic_Dependency_Indicator <- c(rep("young", 5), rep("active", 8), rep("elderly", 6))
econ_dep_table <- as.data.frame(cbind(AG, Economic_Dependency_Indicator))
Then, I will map the economic status to the data we have downloaded from Singstat website.
total_data_1 <- left_join(total_data, econ_dep_table, by = "AG")
glimpse(total_data_1)
## Observations: 1,924,320
## Variables: 8
## $ PA <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio…
## $ SZ <chr> "Cheng San", "Cheng San", "Cheng San", …
## $ AG <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4",…
## $ Sex <chr> "Males", "Males", "Males", "Males", "Ma…
## $ TOD <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room …
## $ Pop <dbl> 20, 480, 220, 80, 0, 0, 0, 0, 20, 390, …
## $ Time <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 200…
## $ Economic_Dependency_Indicator <fct> young, young, young, young, young, youn…
I will also rename the columns so that the column namings are more intuitive. Column names will be called by using names functions in R to ensure the names are applied to the dataset.
colnames(total_data_1) <- c("Planning_Area", "Subzone", "Age_Group", "Gender", "Type_of_Dwelling", "Population", "Year", "Economic_Dependency_Indicator")
names(total_data_1)
## [1] "Planning_Area" "Subzone"
## [3] "Age_Group" "Gender"
## [5] "Type_of_Dwelling" "Population"
## [7] "Year" "Economic_Dependency_Indicator"
Next, I will derive the proportion of economic dependency for each demographic composition.
# calculate the population by economic group
total_data_2 <- total_data_1 %>%
group_by(Year, Economic_Dependency_Indicator, Planning_Area) %>%
summarise(Subtotal_Population = sum(Population))
# calculate the population by year & planning area
total_data_3 <- total_data_1 %>%
group_by(Year, Planning_Area) %>%
summarise(Total_Population = sum(Population))
# Join the data set together
total_data_merged <- left_join(total_data_2, total_data_3, by = c("Year", "Planning_Area"))
total_data_merged$Proportion_Economic_Dependency <-
ifelse(total_data_merged$Total_Population == 0, 0,
total_data_merged$Subtotal_Population/total_data_merged$Total_Population)
Finally I will prepare the data into relevant formats so that we can plot the interactive graphs.
# Split into two columns
total_data_merged$Proportion_Young <-
ifelse(total_data_merged$Economic_Dependency_Indicator == "young",
total_data_merged$Proportion_Economic_Dependency,
0)
total_data_merged$Proportion_Active <-
ifelse(total_data_merged$Economic_Dependency_Indicator == "active",
total_data_merged$Proportion_Economic_Dependency,
0)
total_data_merged$Proportion_elderly <-
ifelse(total_data_merged$Economic_Dependency_Indicator == "elderly",
total_data_merged$Proportion_Economic_Dependency,
0)
Graph 1 - Proportion of Young vs elderly
Next, I will filter out the portion of the data that will use to plot the graph. In the code below, I have done the following to prep the data:
- Filter out those populations under ‘Active’
- Only keep those data on Year 2011 and afterwards
- To reduce the noise of the graph, I have kept on those planning areas with populations more than 1,100
- Group the data by year & planning area
total_data_young_elderly <- total_data_merged %>%
group_by(Year, Planning_Area) %>%
filter(Economic_Dependency_Indicator != "active", Year >= 2011, Total_Population>=1100) %>%
select(-Proportion_Economic_Dependency, -Economic_Dependency_Indicator) %>%
summarise(Overall_Population = mean(Total_Population),
Proportion_Young = sum(Proportion_Young),
Proportion_elderly = sum(Proportion_elderly))
total_data_young_elderly %>%
plot_ly(x = ~Proportion_elderly,
y = ~Proportion_Young,
size = ~Overall_Population,
color = ~Overall_Population,
hoverinfo = "text",
text = ~paste("For the residents in ", Planning_Area,
" in ", Year,
"<br />- About ", round(Proportion_Young*100,1),
"% are less than age 25.",
"<br />- About ", round(Proportion_elderly*100,1),
"% are aged 65 and above."),
frame = ~Year,
type = 'scatter',
mode = 'markers',
marker = list(opacity = 0.5, sizemode = 'diameter')) %>%
layout(xaxis = list(title = "Proportion of Elderly",
range = c(0, 0.4),
tickformat = "%"),
yaxis = list(title = "Porportion of Young",
range = c(0, 0.4),
tickformat = "%"),
title = "<b>How does the proportion of Young and Elderly changed over <br />the last 9 years by different planning areas?</b>",
titlefont = list(size = 16)) %>%
animation_slider(currentvalue = list(prefix = "YEAR "))
Fig 6: Graph on Proportion of Elderly vs Young between 2011 till 2019
Fig 6 shows how the relationship between the proportion of elderly vs young have changed over time. In general, we observe the bubbles in the graph are going towards the bottom right corner. This means that from 2011 to 2019, the proportion of elderly has increased but the proportion of young has dropped.
Besides, the change in demographic composition for different planning areas can be quite different from one another.
Fig 7: Proportion of Elderly vs Young stayed in Ang Mo Kio in 2011 & 2019
Fig 8: Proportion of Elderly vs Young stayed in Punggol in 2011 & 2019
As shown under Fig 7 and Fig 8, we noted the number of population in Ang Mo Kio remains quite stable, while there is an increase in population in Punggol. This is expected as Ang Mo Kio is mature estate while more new HDBs were built in non-matured areas such Punggol in the past 9 years.
Meanwhile, we also do observe that there is a slight increase in proportion of young in Punggol from 2011 to 2019 in Fig 8. This is probably due to the families stay in Punggol are mostly younger couples, resulting in an increase in proportion of young over the years.
Graph 2 - Proportion of Active vs Elderly
Similarly, I have followed the same steps as per Graph 1 to prep for my data. But instead of filtering out the active group, I will keep the active group and filter out the young.
total_data_active_elderly <- total_data_merged %>%
group_by(Year, Planning_Area) %>%
filter(Economic_Dependency_Indicator != "young", Year >= 2011, Total_Population>=1100) %>%
select(-Proportion_Economic_Dependency, -Economic_Dependency_Indicator) %>%
summarise(Overall_Population = mean(Total_Population),
Proportion_Active = sum(Proportion_Active),
Proportion_elderly = sum(Proportion_elderly))
total_data_active_elderly %>%
plot_ly(x = ~Proportion_elderly,
y = ~Proportion_Active,
size = ~Overall_Population,
color = ~Overall_Population,
hoverinfo = "text",
text = ~paste("For the residents in ", Planning_Area,
" in ", Year,
"<br />- About ", round(Proportion_Active*100,1),
"% are between age 25 to age 64.",
"<br />- About ", round(Proportion_elderly*100,1),
"% are aged 65 and above."),
frame = ~Year,
type = 'scatter',
mode = 'markers',
marker = list(opacity = 0.5, sizemode = 'diameter')) %>%
layout(xaxis = list(title = "Proportion of Elderly",
range = c(0, 0.8),
tickformat = "%"),
yaxis = list(title = "Porportion of Active",
range = c(0, 0.8),
tickformat = "%"),
title = "<b>How does the proportion of Active and Elderly changed <br />over the last 9 years by different planning areas?</b>",
titlefont = list(size = 16)) %>%
animation_slider(currentvalue = list(prefix = "YEAR "))
Fig 9: Graph on Proportion of Elderly vs Active between 2011 and 2019
Fig 9 shows how the proportion of elderly and proportion of active have changed from 2011 till 2019. Although the proportion of elderly has increased, the proportion of active remains quite stable over the 9 years study period. This increases the pressure on the workforce and Singapore society to support the elderly.
Below are my reflections on the advantages R over Tableau:
Advantage 1 - Flexibility
R allows the users to output the writeup and graphs in different formats, including HTML, PDF, word and so on.
Advantage 2 - R markdown allows the different output
R markdown allows the users to write out the output in various formats.
Advantage 3 - Easier to share
By building in R, the users do not need a licence in order to view the plot. Unlike Tableau, the user need to helderly a licence to modify or view the graphs. Otherwise, the users may need to post in Tableau Public in order to share the graphs.
However, this may not be ideal as the graphs may contain some confidential information that one may not want to publish online.
Advantage 4 - Cheaper cost
R studio is free, while we will need to pay for the licence fees if we were to use Tableau. Besides, a lot of the Tableau functions are also available in R.
Advantage 5 - Machine learning model
As R can be used for Plotly & Shiny, this allows one to build a machine learning model with some interactivities. Parameters can be built at the side to allow for interactivity.
For Tableau, this can only be done by linking to another software, which we need another licence for the machine learning software.