Revisions: Updated the Dumbbell Plots


1 Overview

This DataViz makeover aims to uncover the changing patterns of demographic composition in Singapore, i.e. the young (age 0-24), the economically active group (age 25-64) and the aged group (age 65 and above), by geographical hierarchy (i.e. region and planning area) over time (i.e. 2011-2019).

The data used in this makeover, “Singapore Residents by Planning Area, Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019”, was sourced from the SingStat website.

2 Major Data and Design Challenges

Below are the three major data and design challenges identified.

2.1 Challenge 1: Visualising Demographic Composition Changes by Region/Planning Area over Time

As we are creating a static visualisation, showing changes over time is a challenge. We also have a total of 55 planning areas whose demographic changes must be shown across the years (i.e. 2011 to 2019). Therefore, there are 3 factors to consider: demographic composition, geography (region / planning area) and time.

2.2 Challenge 2: Missing Region Information

The original data does not contain the region information (i.e. Central, East, North-East, North and West). We need to source external data and incorporate this information into the data.

2.3 Challenge 3: Visualising Multiple Planning Areas and Subzones and NULL values

There are a total of 55 planning areas and 323 subzones. To visualise the proportion of young, economically active and aged at one glance would be difficult.

Furthermore, some planning areas and subzones have no population across the years (i.e. from 2011 to 2019), as seen below. Although this might not be a major issue, it is good to keep it in mind and filter these areas out where needed.

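# dplyr provides the pipe and grouping verbs; DT provides datatable()
library(dplyr)
library(DT)

data <- read.csv("data/respopagesextod2011to2019.csv", header = TRUE)

# Planning area, subzone and year combinations whose total population is zero
check <- data %>% 
          group_by(PA,SZ,Time) %>% 
          summarise(total = sum(Pop)) %>%
          filter(total==0) 

datatable(head(check), options = list(dom = 't'))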


3 Suggestions to Overcome Challenges

Challenge 1: Visualising demographic composition changes by Region / Planning Area over Time

Proposed solutions:

1. Create Economy Dependency Groups
Group the population into Young (age 0-24), Economically Active (age 25-64) and Aged (age 65 and above).

2. Calculate Dependency Ratio
Create new calculated variables, such as the dependency ratio, to visualise the changing proportion of the Young and Aged population against the Economically Active population for each region / planning area.

3. Create Appropriate Visualisations
- Use ggridges to visualise the changes in the dependency ratio distribution over the years for each region / planning area.
- Create line graphs to visualise the changes in population growth rates across time and regions.
- Create Ternary plots with ggtern and Bubble plots with plotly.
- Use the sunburstR package to create Sunburst Charts with the following levels: Planning Area, Subzone, Economy Dependency and Percentage (i.e. % of Young, Economically Active or Aged in that particular Planning Area and Subzone).

Challenge 2: Missing Region Information

Proposed solution: First, download the URA Master Plan subzone boundary in shapefile format (i.e. MP14_SUBZONE_WEB_PL), available from Data.gov.sg, to get the region information. Then use the tmap package to load the data and merge it with the population data, using Subzone as the identifier, to get the Region information.

Challenge 3: Visualising Multiple Planning Areas and Subzones and NULL values

Proposed solution: Use dplyr from the tidyverse for data manipulation, and use the filter function where needed to exclude planning areas with NULL values when calculating aggregate values such as the mean.


3.1 Sketch of Proposed DataViz


Sketch of Proposed DataViz



4 DataViz Step-by-step guide

4.1 Load R packages and Data

First, load the necessary R packages in RStudio.

  • tidyverse contains a set of essential packages for data manipulation and exploration.
  • gridExtra to arrange multiple grid-based plots on a page, and draw tables.
  • DT to present rectangular R data objects (such as data frames and matrices) as HTML tables.
  • sf to encode spatial vector data.
  • tmap to create thematic maps, such as choropleths and bubble maps.
  • ggridges to visualise changes in distributions over time or space.
  • ggtern to create ternary diagrams as an extension to ‘ggplot2’.
  • sunburstR to make interactive ‘d3.js’ sequence sunburst diagrams in R.
  • plotly to create interactive web graphics from ‘ggplot2’ graphs.
packages <- c('tidyverse','gridExtra','DT','sf','tmap','ggridges','ggtern','sunburstR','plotly')

# Install any package that is missing, then attach it
for (p in packages){
  if (!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

Second, load the data and check the structure and data type of fields.

data <- read.csv("data/respopagesextod2011to2019.csv", header = T)
glimpse(data)
## Observations: 883,728
## Variables: 7
## $ PA   <fct> Ang Mo Kio, Ang Mo Kio, Ang Mo Kio, Ang Mo Kio, Ang Mo Kio, Ang …
## $ SZ   <fct> Ang Mo Kio Town Centre, Ang Mo Kio Town Centre, Ang Mo Kio Town …
## $ AG   <fct> 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, …
## $ Sex  <fct> Males, Males, Males, Males, Males, Males, Males, Males, Females,…
## $ TOD  <fct> HDB 1- and 2-Room Flats, HDB 3-Room Flats, HDB 4-Room Flats, HDB…
## $ Pop  <int> 0, 10, 30, 50, 0, 0, 40, 0, 0, 10, 30, 60, 0, 0, 40, 0, 0, 10, 3…
## $ Time <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…

4.2 Data Preprocessing

4.2.1 Map Regions

Download URA Master Plan subzone boundary in shapefile format (i.e. MP14_SUBZONE_WEB_PL) from Data.gov.sg and load the data in.

As the values for planning area and subzone are in upper case, we have to convert them to title case in order to merge with our data.

mapsz$SUBZONE_N <- str_to_title(mapsz$SUBZONE_N)
mapsz$PLN_AREA_N <- str_to_title(mapsz$PLN_AREA_N)
mapsz$REGION_N <- str_to_title(mapsz$REGION_N)

Merge with data to include region names for each planning area.

data <- left_join(mapsz, data, by = c("SUBZONE_N" = "SZ"))
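The case conversion matters because joins match strings exactly. As a minimal sketch on made-up toy tables (the table names `lookup` and `pop` and their values are assumptions for illustration, not the real data), the following shows the join after title-casing and a quick check for rows that found no region. Here the population table is kept on the left, which is a choice made for this sketch only.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for the boundary attributes and the population table
# (hypothetical values, for illustration only)
lookup <- tibble(
  SUBZONE_N = c("ANG MO KIO TOWN CENTRE", "ADMIRALTY"),
  REGION_N  = c("NORTH-EAST REGION", "NORTH REGION")
)
pop <- tibble(
  SZ  = c("Ang Mo Kio Town Centre", "Admiralty", "Unknown Subzone"),
  Pop = c(10, 20, 30)
)

# Convert the upper-case names to title case so the keys match exactly
lookup <- lookup %>%
  mutate(SUBZONE_N = str_to_title(SUBZONE_N),
         REGION_N  = str_to_title(REGION_N))

joined <- left_join(pop, lookup, by = c("SZ" = "SUBZONE_N"))

# Rows without a region are worth inspecting before aggregating
unmatched <- joined %>% filter(is.na(REGION_N))
unmatched
```

An empty `unmatched` table would suggest that every subzone in the population data found its region.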

Select relevant columns and rename accordingly.

data <- data %>% select(PA, SUBZONE_N, REGION_N, AG, Sex, Pop, Time) %>%
          rename(Planning_area = PA, Sub_zone = SUBZONE_N, Region = REGION_N, Age_group = AG, 
                 Gender = Sex, Population = Pop, Year = Time)

4.2.2 Create Economy Dependency Groups

Filter according to the Economy dependency groups (i.e. Young, Economically Active and Aged).

data_young <- data %>% filter(Age_group %in% c('0_to_4','5_to_9','10_to_14','15_to_19',
                                               '20_to_24')) %>%
                       group_by(Planning_area, Sub_zone, Region, Year) %>%
                       summarise(Young = sum(Population))

data_ea <- data %>% filter(Age_group %in% c('25_to_29','30_to_34','35_to_39','40_to_44','45_to_49',
                          '50_to_54','55_to_59','60_to_64')) %>%
                          group_by(Planning_area, Sub_zone, Region, Year) %>%
                          summarise(`Economically Active` = sum(Population))

# Aged covers every age band from 65 upwards, including 85_to_89
data_aged <- data %>% filter(Age_group %in% c('65_to_69','70_to_74','75_to_79','80_to_84',
                                              '85_to_89','90_and_over')) %>%
                          group_by(Planning_area, Sub_zone, Region, Year) %>%
                          summarise(`Aged` = sum(Population))

Merge all into one file as data_combined.
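The merge step is not shown here. One possible way to do it, sketched under the assumption that the three summaries share the same four key columns and that the key columns are renamed to `Planning Area` and `Subzone` at this point (as the later code expects), is a pair of joins. The single toy row below reuses the first row of the structure printout further down and is for illustration only.

```r
library(dplyr)

# Key columns shared by the three summaries
keys <- c("Planning_area", "Sub_zone", "Region", "Year")

# Toy one-row versions of the three summaries (illustrative only)
toy_young <- tibble(Planning_area = "Sembawang", Sub_zone = "Admiralty",
                    Region = "North Region", Year = 2011L, Young = 4130L)
toy_ea    <- tibble(Planning_area = "Sembawang", Sub_zone = "Admiralty",
                    Region = "North Region", Year = 2011L,
                    `Economically Active` = 7360L)
toy_aged  <- tibble(Planning_area = "Sembawang", Sub_zone = "Admiralty",
                    Region = "North Region", Year = 2011L, Aged = 760L)

# Join on the shared keys, drop any grouping, and rename the key columns
toy_combined <- toy_young %>%
  inner_join(toy_ea, by = keys) %>%
  inner_join(toy_aged, by = keys) %>%
  ungroup() %>%
  rename(`Planning Area` = Planning_area, Subzone = Sub_zone)

toy_combined
```

With the real data, `data_young`, `data_ea` and `data_aged` would take the place of the toy tables.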

Calculate the percentage for each economy dependency group and calculate the Dependency Ratio. Filter out the rows with ‘NaN’ or infinite values for Dependency Ratio.

data_combined <- data_combined %>% group_by(`Planning Area`, Subzone, Year) %>% 
                      mutate(`Young %` = (Young/(Young+`Economically Active`+Aged))*100) %>%
                      mutate(`Economically Active %`= 
                               (`Economically Active`/(Young+`Economically Active`+Aged))*100) %>%
                      mutate(`Aged %` = (Aged/(Young+`Economically Active`+Aged))*100)

data_combined <- data_combined %>% group_by(`Planning Area`, Subzone, Year) %>%
                                  mutate(Total=Young+`Economically Active`+Aged) %>%
                                  mutate(Dependency=(Young+Aged)/`Economically Active`) %>%
                                  filter(!is.na(Dependency) & !is.infinite(Dependency)) 

# View the data structure
str(data_combined, give.attr=F)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  2083 obs. of  12 variables:
##  $ Planning Area        : Factor w/ 55 levels "Ang Mo Kio","Bedok",..: 39 39 39 39 39 39 39 39 39 6 ...
##  $ Subzone              : Factor w/ 323 levels "Admiralty","Airport Road",..: 1 1 1 1 1 1 1 1 1 3 ...
##  $ Region               : Factor w/ 5 levels "Central Region",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ Year                 : int  2011 2012 2013 2014 2015 2016 2017 2018 2019 2011 ...
##  $ Young                : int  4130 4120 4030 4300 4630 4630 4580 4440 4370 4250 ...
##  $ Economically Active  : int  7360 7390 7370 8070 8680 8770 8670 8500 8380 9690 ...
##  $ Aged                 : int  760 850 870 970 1080 1130 1170 1190 1280 2310 ...
##  $ Young %              : num  33.7 33.3 32.8 32.2 32.2 ...
##  $ Economically Active %: num  60.1 59.8 60.1 60.5 60.3 ...
##  $ Aged %               : num  6.2 6.88 7.09 7.27 7.51 ...
##  $ Total                : int  12250 12360 12270 13340 14390 14530 14420 14130 14030 16250 ...
##  $ Dependency           : num  0.664 0.673 0.665 0.653 0.658 ...
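As a quick sanity check, the first row of the output above (Young 4130, Economically Active 7360, Aged 760) can be recomputed by hand. The sketch below also shows why the filter is needed: a subzone with no Economically Active residents yields NaN or Inf. The variable names are chosen for this illustration only.

```r
# Values taken from the first row of the structure printout above
young  <- 4130
active <- 7360
aged   <- 760

total <- young + active + aged

# Percentage share of each group, rounded to one decimal place
pct <- round(100 * c(young, active, aged) / total, 1)

# Dependency ratio: dependants per economically active resident
dependency <- (young + aged) / active

# Division by zero gives NaN (0/0) or Inf (5/0), which the filter removes
edge <- c(0, 5) / 0
keep <- !is.na(edge) & !is.infinite(edge)

pct
round(dependency, 3)
```

The rounded results agree with the `Young %`, `Economically Active %`, `Aged %` and `Dependency` columns shown above.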

4.3 Visualisation 1: Dependency Ratio across Regions over Time

4.3.1 Using Histogram and Facet Grid

One way to display the distribution of dependency ratio is to use a facet grid of histograms by Region and Year. However, this design is not very compact, so we will use ggridges instead.



4.3.2 Using ggridges

Plot the distribution of Dependency Ratio across Regions using ggridges.

library(grid)

grid.arrange(central_ridges,
             north_ridges,
             north_east_ridges,
             west_ridges, 
             east_ridges,
              nrow = 1,
              bottom = textGrob(
                "* Values represent Median Dependency Ratio",
                gp = gpar(fontface = 1, fontsize = 11),
                hjust = 1,
                x = 1)
            ) 
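The panel objects passed to grid.arrange() above are not shown. As a rough sketch of how one such panel might be built, assuming simulated dependency ratios in place of the real subzone values, geom_density_ridges() from ggridges can draw one ridge per year with the median marked. The object names `toy` and `toy_ridges` are hypothetical.

```r
library(ggplot2)
library(ggridges)

# Simulated dependency ratios for one region (illustrative only)
set.seed(1)
toy <- data.frame(
  Year = rep(2011:2013, each = 50),
  Dependency = c(rnorm(50, 0.60, 0.10),
                 rnorm(50, 0.65, 0.10),
                 rnorm(50, 0.70, 0.10))
)

# One ridge per year; quantiles = 2 draws a line at the median
toy_ridges <- ggplot(toy, aes(x = Dependency, y = factor(Year))) +
  geom_density_ridges(quantile_lines = TRUE, quantiles = 2, alpha = 0.6) +
  labs(title = "Toy region", x = "Dependency ratio", y = "Year") +
  theme_ridges()

toy_ridges
```

Repeating this for each region's subset would give the five panels that are arranged side by side.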


Looking at the ggridges visualisation above, it is difficult to see the changes in dependency ratio over time. In view of this, we will instead use a dumbbell plot, created with the ggalt package, to see the changes in dependency ratio across the different regions.

4.3.3 Dumbbell Plot

4.3.3.1 Dumbbell Plot by Region

4.3.3.1.1 Data Preparation
# Load ggalt
library(ggalt)

# Prepare data for Dumbbell Plot
db_data <- data_combined[,c(1:7)]

db_data_region <- db_data %>% group_by(Region, Year) %>%
                              summarise("Total Young"=sum(Young), 
                                        "Total Economically Active"=sum(`Economically Active`),
                                         "Total Aged"=sum(Aged)) %>%
                              mutate("Dependency"=((`Total Young`+`Total Aged`)/
                                                     `Total Economically Active`)*100)

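# Keep only the id, key and value columns so that spread() returns one row per region
db_region_spread <- db_data_region %>% filter(Year==2011 | Year==2019) %>%
                                        select(Region, Year, Dependency) %>%
                                        spread(Year, `Dependency`) 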

db_region_spread[is.na(db_region_spread)] <- 0
4.3.3.1.2 Plot Dumbbell Plot by Region Level
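A dumbbell plot needs one row per region with the two years side by side. The sketch below uses made-up values and plain ggplot2 geoms (a segment plus two points) as a stand-in for ggalt's geom_dumbbell(), so it is an approximation of the idea rather than the code behind the figure. The names `toy`, `wide` and `dumbbell` are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up dependency ratios (%) for two regions and two years
toy <- tibble(
  Region     = rep(c("Region A", "Region B"), each = 2),
  Year       = rep(c(2011L, 2019L), times = 2),
  Dependency = c(60, 66, 70, 68)
)

# Keep only id, key and value so each region collapses to a single row
wide <- toy %>%
  select(Region, Year, Dependency) %>%
  spread(Year, Dependency)

# Segment joins the two years; points mark each end of the dumbbell
dumbbell <- ggplot(wide, aes(y = Region)) +
  geom_segment(aes(x = `2011`, xend = `2019`, yend = Region),
               colour = "grey60") +
  geom_point(aes(x = `2011`, colour = "2011"), size = 3) +
  geom_point(aes(x = `2019`, colour = "2019"), size = 3) +
  labs(x = "Dependency ratio (%)", y = NULL, colour = "Year")

dumbbell
```

If extra columns that differ between the two years are left in before spread(), each region splits into two rows padded with NA, which is worth checking with nrow() before plotting.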


As seen from the Dumbbell plot at the Region level above, the changes in dependency ratio between 2011 and 2019 are much clearer. The Central region has the highest dependency ratio in 2019. Interestingly, the East region’s dependency ratio remains the same, and the North region’s dependency ratio has declined.

Looking at the dependency ratio by region is useful for a brief overview. However, we can also look at the Planning Area level, as seen in the next section.


4.3.3.2 Dumbbell Plot for Planning Area Level

4.3.3.2.1 Data Preparation
db_data_pa <- db_data %>% group_by(`Planning Area`, Year) %>%
                          summarise("Total Young"=sum(Young), 
                                     "Total Economically Active"=sum(`Economically Active`),
                                     "Total Aged"=sum(Aged)) %>%
                          mutate("Dependency"=((`Total Young`+`Total Aged`)/
                                                 `Total Economically Active`)*100)

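# Keep only the id, key and value columns so that spread() returns one row per planning area
db_pa_spread <- db_data_pa %>% filter(Year==2011 | Year==2019) %>%
                              select(`Planning Area`, Year, Dependency) %>%
                              spread(Year, `Dependency`) 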

db_pa_spread[is.na(db_pa_spread)] <- 0
4.3.3.2.2 Plot Dumbbell Plot for Planning Area

This visualisation has two downsides. Firstly, we are unable to see the dependency ratio for each year, as we are only comparing two years. Secondly, we are looking at a single dependency ratio per area instead of the distribution of dependency ratios.

However, compared to the ggridges visualisation, this is a much better visualisation, as one is able to see the changes at a glance. With ggridges, it is difficult to compare the distributions of the dependency ratio when the panels are placed side by side.

4.4 Visualisation 2: Population Growth Trend

In this visualisation, we will explore the population growth trend of each economy dependency group across Regions in Singapore.

4.4.1 Data Preparation

Prepare the data by calculating the population totals, and from them the population growth rates, for each of the Economy dependency groups across the regions.

# summarise() collapses the subzone rows into one row per Region and Year
pop <- data_combined %>% group_by(Region, Year) %>%
                         summarise(Total = sum(Total), `Young Total`=sum(Young), 
                                   `Economically Active Total`= sum(`Economically Active`), 
                                   `Aged Total`=sum(Aged)) %>%
                         select(Region, Year,`Young Total`,`Economically Active Total`,
                                `Aged Total`, Total)
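The growth-rate step itself is not shown. A minimal sketch of one way to derive a year-on-year percentage growth rate, assuming one row per Region and Year and using made-up totals, is to compare each year with the previous one via lag(). The column name `Aged Growth %` and the table `toy_pop` are hypothetical.

```r
library(dplyr)

# Made-up yearly totals for two regions (illustrative only)
toy_pop <- tibble(
  Region       = rep(c("Region A", "Region B"), each = 3),
  Year         = rep(2011:2013, times = 2),
  `Aged Total` = c(100, 110, 121, 200, 210, 231)
)

# Year-on-year growth (%) within each region; the first year has no
# previous value and is therefore NA
growth <- toy_pop %>%
  group_by(Region) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(`Aged Growth %` =
           (`Aged Total` - lag(`Aged Total`)) / lag(`Aged Total`) * 100) %>%
  ungroup()

growth
```

The same mutate() could be repeated for the Young and Economically Active totals before plotting the rates as line graphs.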

4.4.2 Plot the Diagram


4.5 Visualisation 3: Ternary Plot

4.5.1 Ternary Plot with ggtern

Create a Ternary plot with the ggtern package. First, create a Ternary plot for one year (i.e. 2011).

# Prepare data for Ternary Plot
tern_2011 <- data_combined %>% filter(`Young %`!=0, `Economically Active %`!=0, `Aged %`!=0, 
                                      Year==2011)

tern_plot_2011 <- ggtern(data=tern_2011, aes(x=`Young %`, y=`Economically Active %`, z=`Aged %`,
                        col=Region, size=Total)) +
                   geom_point(alpha=0.5) +
                   ggtitle(label= "Ternary Plot of Singapore Demographics in 2011") +
                   xlab("Young") +                       
                   ylab("Economically Active") +
                   zlab("Aged") + 
                   theme_showarrows()

tern_plot_2011

To visualise across the different years, use the facet_wrap function in ggplot2.



This gives a good overview. However, it is a static visualisation, and it is difficult to make out each subzone (i.e. the bubbles). We will therefore use plotly to create the ternary plot in the next section.


4.5.2 Ternary Plot with plotly

tern2 <- data_combined %>% filter(`Young %`!=0, `Economically Active %`!=0, `Aged %`!=0)

tern_plot2 <- plot_ly(tern2, a= ~`Economically Active %`, b= ~`Young %`, c= ~`Aged %`, frame=~Year,
                     color= ~Region, type = "scatterternary", size = ~Total,
                     # hoverinfo is an argument of plot_ly(), not part of the pasted label
                     text = ~paste('Young: ', round(`Young %`,1),'%',
                                   '<br>Economically Active: ',round(`Economically Active %`,1),'%',
                                   '<br>Aged: ',round(`Aged %`,1),'%',
                                   '<br>Subzone: ', Subzone,
                                   '<br>Planning Area: ', `Planning Area`, sep=''),
                     hoverinfo = "text",
                     marker = list(symbol='circle', opacity=0.4, sizemode="diameter", sizeref=2,
                                   line=list(width=1, color='#666666'))) %>%
                     layout(
                       title=list(text="Singapore Demographics by Region from 2011-2019", x=0.4),
                       titlefont = list(family = "Arial",size = 16),
                       ternary=list(aaxis=list(title="Economically Active",min=0.5,
                                               titlefont = list(family = "Arial")),
                                    baxis=list(title="Young", min=0.2,
                                               titlefont = list(family = "Arial")),
                                    caxis=list(title="Aged",titlefont = list(family = "Arial")))
                       ) %>%
              animation_slider(
                      currentvalue = list(prefix="Year ",font=list(color="DimGray",
                                                                   family = "Arial"))
                       ) %>%
    
              animation_opts(
                      2000, redraw = FALSE
                            )
tern_plot2 


4.6 Visualisation 4: Bubble Plot

4.6.1 Bubble Plot with ggplot2

Use ggplot2 to plot a bubble plot of Young % vs. Aged % in 2011.

To see the bubble plot across the years, we can use the facet_grid function in ggplot2.
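The plotting code for the static version is not shown. A minimal sketch on made-up values, assuming the same column names as data_combined, could look like this. The objects `toy_bubble` and `bubble_static` are hypothetical.

```r
library(ggplot2)

# Made-up subzone shares for two regions and two years (illustrative only)
toy_bubble <- data.frame(
  Region    = rep(c("Region A", "Region B"), times = 2),
  Year      = rep(c(2011, 2019), each = 2),
  `Young %` = c(33, 28, 30, 24),
  `Aged %`  = c(6, 12, 9, 16),
  Total     = c(12000, 8000, 14000, 9000),
  check.names = FALSE
)

# Bubble area encodes the total population; one panel per year
bubble_static <- ggplot(toy_bubble,
                        aes(x = `Aged %`, y = `Young %`,
                            size = Total, colour = Region)) +
  geom_point(alpha = 0.6) +
  scale_size_area(max_size = 12) +
  facet_grid(. ~ Year) +
  labs(title = "Young % vs Aged %", x = "Aged %", y = "Young %")

bubble_static
```

scale_size_area() is used so that a bubble's area, rather than its radius, is proportional to the population.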


Here, we can only see a snapshot of the Young % and Aged % across the regions for each year, which makes it hard for the reader to follow the change over time.

Hence, we use plotly instead and include a slider, so that the reader is able to move the Bubble plot between the different years and see the changes in the proportion of Young against Aged over time.

4.6.2 Bubble Plot with plotly

Use plotly to recreate Bubble Plot below.

bubble_plot_ani <- plot_ly(
                    bubble_plot, x = ~`Aged %`, y = ~`Young %`, frame=~Year,
                    color = ~`Region`, colors='Pastel2', type = "scatter",
                    mode="markers", size= ~`Total`,
                    marker = list(symbol = 'circle', sizemode = 'diameter',
                      line = list(width = 1.1, color = '#666666'), opacity=0.6),
                      text = ~paste(sep='','Subzone: ', `Subzone`,
                                    '<br>Planning Area: ', `Planning Area`,
                                    '<br>Aged: ', round(`Aged %`,1),'%',
                                    '<br>Young : ', round(`Young %`,1),'%', 
                                    '<br>Population: ',Total)) %>%
                    layout(
                            title=list(text="Bubble Plot of Young % vs Aged % in Singapore",
                                       x=0.4), 
                            titlefont = list(family = "Arial",size = 16),
                            xaxis = list(title = 'Aged %',
                              gridcolor = 'rgb(255, 255, 255)',
                              range=c(0,35),
                              zerolinewidth = 1,
                              ticklen = 5,
                              gridwidth = 2, titlefont = list(family = "Arial")),
                            yaxis = list(title = 'Young %',
                              gridcolor = 'rgb(255, 255, 255)',
                              range=c(0,57),
                              zerolinewidth = 1,
                              ticklen = 5,
                              gridwidth = 2, titlefont = list(family = "Arial")),
                            plot_bgcolor = 'rgb(243, 243, 243)'
                          )%>%
                    animation_opts(
                            2000, redraw = FALSE
                          ) %>%
                    
                    animation_slider(
                          currentvalue = list(prefix = "Year ", font=list(color="DimGray",
                                                                        family="Arial"))
                          )
bubble_plot_ani



4.7 Visualisation 5: Sunburst Chart

To create a Sunburst chart, we need to decide which levels of the hierarchy we want to visualise.

Here we want to create the hierarchy based on the following levels:

  1. Region
  2. Planning Area
  3. Subzone
  4. Type of Economy Dependency (i.e. Young, Economically Active or Aged)

4.7.1 Data Preparation for Sunburst Chart

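# Transform data: drop the grouping and the existing Total column before reshaping,
# so that the gathered value column can be named Total without a clash
data_2019 <- data_combined %>% ungroup() %>%
                               filter(Year==2019) %>%
                               select(Region,`Planning Area`, Subzone, Young:Aged) %>%
                               gather(Type, Total, Young:Aged, factor_key=TRUE) %>%
                               rename(Planning_area = `Planning Area`)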

# Check data
datatable(head(data_2019), options = list(dom = 't'))

4.7.2 Plot Sunburst Chart
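The chart code is not shown. As a hedged sketch, sunburstR's sunburst() accepts a two-column data frame of hyphen-separated paths and counts, so the hierarchy columns need to be pasted into a single path. Because the hyphen is the level separator, hyphens inside names (e.g. "North-East Region") are replaced first. The toy rows and the helper `clean()` below are assumptions for illustration.

```r
library(dplyr)

# Made-up rows in the same shape as data_2019 (illustrative only)
toy_2019 <- tibble(
  Region        = c("North-East Region", "North-East Region", "North Region"),
  Planning_area = c("Planning Area A", "Planning Area A", "Planning Area B"),
  Subzone       = c("Subzone A1", "Subzone A1", "Subzone B1"),
  Type          = c("Young", "Aged", "Young"),
  Total         = c(300, 100, 200)
)

# Hypothetical helper: remove hyphens so they are not read as level breaks
clean <- function(x) gsub("-", " ", as.character(x), fixed = TRUE)

# Build one path per row: Region-Planning Area-Subzone-Type
seq_df <- toy_2019 %>%
  transmute(
    path  = paste(clean(Region), clean(Planning_area),
                  clean(Subzone), clean(Type), sep = "-"),
    count = Total
  ) %>%
  as.data.frame()

# Draw the chart only if sunburstR is installed
if (requireNamespace("sunburstR", quietly = TRUE)) {
  chart <- sunburstR::sunburst(seq_df)
  chart
}
```

With the real data, `data_2019` would take the place of `toy_2019`.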

Sunburst Chart of Singapore Population by Region, Planning Area and Subzone in 2019

(Click or hover on chart to show proportion)


While this Sunburst chart is useful for visualising the 2019 population by region, planning area, subzone and economy dependency group, it covers only one year. Creating a separate Sunburst chart for every year would not be practical, and multiple Sunburst charts would make it difficult to see how the proportions change across years.

As such, the bubble plot and ternary plot are more appropriate.

5 Final Visualisation

The whole visualisation journey of this makeover has been an iterative process of figuring out which visualisations work and which do not. After much trial and error, Visualisation 1 (Dumbbell Plot), 2 (Population Growth Trend), 3 (Ternary Plot) and 4 (Bubble Plot) were chosen for the final visualisation of this DataViz makeover.

Below is the final visualisation.


Singapore’s Demographic Composition from 2011 to 2019





Data Source: SingStat, Singapore Residents by Planning Area, Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019


6 Insights from Visualisation

6.1 Insight 1: General Increase in Dependency Ratio across Regions


Looking at the Dumbbell plot, we see that the dependency ratio has increased across the regions between 2011 and 2019. The exceptions are the North region, where the dependency ratio has decreased, and the East region, where it has not changed. An increase in dependency ratio means that there is increasing pressure or burden on the Economically Active in supporting the Young and Aged.

The decrease in dependency ratio in the North region could be due to new housing being built in the region over the years, such as Build-To-Order (BTO) flats, which brings an influx of Economically Active residents into the region. This is an interesting finding, and more can be explored as to why the dependency ratio has decreased in the North region.


6.2 Insight 2: Overall High Aged Population Growth and Low Young Population Growth



Looking at the population growth rate trends across the different regions and years, we observe the following trends:

  1. In general, from 2011 to 2019, the Young group has the lowest mean population growth rate (-0.73%), followed by the Economically Active group (0.9%), while the Aged group has the highest at 7.32%.
  2. The Central, North-East and West regions display a decreasing population growth rate trend. In particular, the Central region’s population growth rate falls below the mean population growth rate across all economy dependency groups.
  3. Interestingly, there has been a huge spike in the population growth rate in the East region from 2018 to 2019. Also, the North region has an overall increasing population growth rate trend from 2014 to 2019 across all economy dependency groups.

In view of observations 1 and 2, we see that Singapore’s demographics are moving towards an aging population. Observation 3 prompts us to explore further and enquire why there has been an increase (i.e. a steep increase from 2018 to 2019 for the East region and an increasing trend from 2014 to 2019 for the North region).

To explore observation 3 further, we can look at the HDB / DBSS completed flats data from 2008 to 2018, available from Data.gov.sg. Filtering for years from 2011 onwards, we can see from the bar chart below that Punggol (North-East region), Sengkang (North-East region) and Yishun (North region) have the highest numbers of newly completed flats from 2011 to 2018.

However, given that this HDB data does not cover 2019, more investigation is needed to explain the increase in population growth rates in the North and East regions.


6.3 Insight 3: Shifting to an Aging Population

Looking at the Bubble Plot and Ternary Plot, we see a shift towards an aging population.

From the Ternary Plot, we observe that most of the subzones are moving towards the “Aged” corner. The Bubble Plot shows the percentage of Young against the percentage of Aged, with bubbles sized by total population and coloured by region. Here, most of the points are moving towards a higher proportion of Aged and a lower proportion of Young.


7 Reflection on the Advantages of R over Tableau

7.1 Wide Variety of Packages dedicated to building Specific type of Visualisations

In Tableau, you have to build visualisations like dumbbell plots, ternary plots, bubble plots and sunburst charts from scratch (e.g. by creating multiple calculated fields). In R, many packages allow us to do this much more efficiently: the tidyverse packages for data manipulation and visualisation, plotly for interactive plots, and specialised packages like ggtern or sunburstR for specialised data visualisations like the Ternary plot or Sunburst Chart.

7.2 Faster Data Manipulation

With R, everything is written in code, which means that one does not need to redo all the actions in order to tweak some part of the visualisation, as one does in Tableau. For example, in Tableau you need to drag and drop fields into rows or columns, edit or create new calculated fields, add colours or change the size of points under the Marks pane, add levels of detail and so on. In R, all these actions can be completed in one go with a chunk of code. Therefore, using R to create visualisations is much more efficient and saves us time.

Furthermore, there are many in-built functions in R that allow one to easily see summary statistics and data types of the data in one glance.

7.3 Highly Customisable

R allows us to create highly customisable visualisations. For instance, the layout, axes or aesthetics of a plot can be changed with a few lines of code (e.g. facet grids, themes and so on). In Tableau, these changes have to be made manually. For instance, one has to go through menu options to change the size or layout of the visualisations.