Revisions: Updated the Dumbbell Plots
This DataViz makeover aims to uncover the changing patterns of demographic composition in Singapore, i.e. the young (age 0-24), the economically active group (age 25-64) and the aged group (age 65 and above), by geographical hierarchy (region and planning area) over time (2011-2019).
The data used in this makeover, “Singapore Residents by Planning Area, Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019”, was sourced from the SingStat website.
Below are the three major data and design challenges identified.
As we are creating a static visualisation, visualising changes over time is a challenge. We also have a total of 55 planning areas whose demographic changes have to be shown across the years (i.e. 2011 to 2019). There are therefore three factors to consider: demographic composition, region / planning area, and time.
The original data does not contain the region information (i.e. Central, East, North-East, North and West). We need to source external data and incorporate this information into the data.
There are a total of 55 planning areas and 323 subzones. Visualising the proportion of young, economically active and aged residents at a glance would be difficult.
Furthermore, some planning areas have no population across the years (i.e. from 2011 to 2019), as seen below. Though this might not be a major issue, it is good to keep in mind and to filter out these planning areas where needed.
data <- read.csv("data/respopagesextod2011to2019.csv", header = T)
check <- data %>%
group_by(PA,SZ,Time) %>%
summarise(total = sum(Pop)) %>%
filter(total==0)
datatable(head(check), options = list(dom = 't'))
| No. | Challenge | Proposed Solutions |
|---|---|---|
| 1 | Visualising demographic composition changes by Region / Planning Area over Time | 1. Create Economy Dependency Groups: Young (age 0-24), Economically Active (age 25-64) and Aged (age 65 and above). 2. Calculate Dependency Ratio: create a new calculated variable, the dependency ratio, to visualise the changing proportion of the Young and Aged population against the Economically Active population for each region / planning area. 3. Create Appropriate Visualisations: use ggridges to visualise the changes in the dependency ratio distribution over the years for each region / planning area; create line graphs to visualise the changes in population growth rates across time and regions; create Ternary plots with ggtern and Bubble plots with plotly; use the sunburstR package to create Sunburst Charts with the following levels: Planning Area, Subzone, Economy Dependency and Percentage (i.e. % of Young, Economically Active or Aged in that particular Planning Area and Subzone). |
| 2 | Missing Region Information | First, download the URA Master Plan subzone boundary in shapefile format (i.e. MP14_SUBZONE_WEB_PL) from Data.gov.sg to obtain the region information. Then use the tmap package to load the data and merge it with the population data, using Subzone as the identifier, to add the Region information. |
| 3 | Visualising Multiple Planning Areas and Subzones and NULL values | Use dplyr from tidyverse for data manipulation, and use the filter function where needed to exclude planning areas with NULL values when calculating aggregate values such as the mean. |
Sketch of Proposed DataViz
First, load the necessary R packages in RStudio.
- tidyverse contains a set of essential packages for data manipulation and exploration.
- gridExtra to arrange multiple grid-based plots on a page, and draw tables.
- DT to present rectangular R data objects (such as data frames and matrices) as HTML tables.
- sf to encode spatial vector data.
- tmap to create thematic maps, such as choropleths and bubble maps.
- ggridges to visualise changes in distributions over time or space.
- sunburstR to make interactive ‘d3.js’ sequence sunburst diagrams in R.
- plotly to create interactive web graphics from ‘ggplot2’ graphs.
packages <- c('tidyverse','gridExtra','DT','sf','tmap','ggridges','ggtern','sunburstR','plotly')
for (p in packages){
if (!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
Second, load the data and check the structure and data type of fields.
data <- read.csv("data/respopagesextod2011to2019.csv", header = T)
glimpse(data)
## Observations: 883,728
## Variables: 7
## $ PA <fct> Ang Mo Kio, Ang Mo Kio, Ang Mo Kio, Ang Mo Kio, Ang Mo Kio, Ang …
## $ SZ <fct> Ang Mo Kio Town Centre, Ang Mo Kio Town Centre, Ang Mo Kio Town …
## $ AG <fct> 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, 0_to_4, …
## $ Sex <fct> Males, Males, Males, Males, Males, Males, Males, Males, Females,…
## $ TOD <fct> HDB 1- and 2-Room Flats, HDB 3-Room Flats, HDB 4-Room Flats, HDB…
## $ Pop <int> 0, 10, 30, 50, 0, 0, 40, 0, 0, 10, 30, 60, 0, 0, 40, 0, 0, 10, 3…
## $ Time <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
Download URA Master Plan subzone boundary in shapefile format (i.e. MP14_SUBZONE_WEB_PL) from Data.gov.sg and load the data in.
As the planning area and subzone values are in upper case, we have to change them to title case in order to merge with our data.
mapsz$SUBZONE_N <- str_to_title(mapsz$SUBZONE_N)
mapsz$PLN_AREA_N <- str_to_title(mapsz$PLN_AREA_N)
mapsz$REGION_N <- str_to_title(mapsz$REGION_N)
Merge with data to include region names for each planning area.
data <- left_join(mapsz, data, by = c("SUBZONE_N" = "SZ"))
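Because the merge relies on subzone names matching exactly after the case conversion, it can be worth checking for names that appear in one source but not the other. The following is a minimal sketch in base R, using hypothetical name vectors rather than the real `mapsz` and population data:

```r
# Hypothetical subzone names, for illustration only
map_names <- c("Admiralty", "Airport Road", "Hypothetical Subzone")
pop_names <- c("Admiralty", "Airport Road")

# Names present in the map data but absent from the population data;
# in a left join these rows would carry NA in the population columns
unmatched <- setdiff(map_names, pop_names)
print(unmatched)
```

An empty result would suggest that every subzone in the boundary file found a match in the population data.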
Select relevant columns and rename accordingly.
data <- data %>% select(PA, SUBZONE_N, REGION_N, AG, Sex, Pop, Time) %>%
rename(Planning_area = PA, Sub_zone = SUBZONE_N, Region = REGION_N, Age_group = AG,
Gender = Sex, Population = Pop, Year = Time)
Filter according to the economy dependency groups (i.e. Young, Economically Active and Aged).
data_young <- data %>% filter(Age_group %in% c('0_to_4','5_to_9','10_to_14','15_to_19',
'20_to_24')) %>%
group_by(Planning_area, Sub_zone, Region, Year) %>%
summarise(Young = sum(Population))
data_ea <- data %>% filter(Age_group %in% c('25_to_29','30_to_34','35_to_39','40_to_44','45_to_49',
'50_to_54','55_to_59','60_to_64')) %>%
group_by(Planning_area, Sub_zone, Region, Year) %>%
summarise(`Economically Active` = sum(Population))
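# Aged = 65 and above, so every band from 65_to_69 up to 90_and_over is listed
# (assumes the data labels the 85-89 band as '85_to_89', following the pattern of the other bands)
data_aged <- data %>% filter(Age_group %in% c('65_to_69','70_to_74','75_to_79','80_to_84',
'85_to_89','90_and_over')) %>%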
group_by(Planning_area, Sub_zone, Region, Year) %>%
summarise(`Aged` = sum(Population))
Merge all into one file as data_combined.
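The merge step is not shown here. As a minimal sketch, assuming the three summaries share the same four grouping columns, they could be combined as follows. Base R `merge()` is used so the example has no dependencies, and the toy values are made up for illustration:

```r
# Toy summaries standing in for data_young, data_ea and data_aged (made-up values)
keys <- c("Planning_area", "Sub_zone", "Region", "Year")
young_toy <- data.frame(Planning_area = "Area A", Sub_zone = c("Subzone A1", "Subzone A2"),
                        Region = "Region A", Year = 2011, Young = c(100, 50))
ea_toy <- data.frame(Planning_area = "Area A", Sub_zone = c("Subzone A1", "Subzone A2"),
                     Region = "Region A", Year = 2011, `Economically Active` = c(300, 120),
                     check.names = FALSE)
aged_toy <- data.frame(Planning_area = "Area A", Sub_zone = c("Subzone A1", "Subzone A2"),
                       Region = "Region A", Year = 2011, Aged = c(40, 30))

# Join the three tables one after another on the shared key columns
combined_toy <- Reduce(function(x, y) merge(x, y, by = keys),
                       list(young_toy, ea_toy, aged_toy))
print(combined_toy)
```

The later code refers to columns named `Planning Area` and Subzone, so a rename would presumably follow the merge; that step is an assumption and is not shown in the sketch.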
Calculate the percentage for each economy dependency group and the Dependency Ratio. Filter out rows where the Dependency Ratio is ‘NaN’ or infinite (i.e. where the Economically Active count is zero).
data_combined <- data_combined %>% group_by(`Planning Area`, Subzone, Year) %>%
mutate(`Young %` = (Young/(Young+`Economically Active`+Aged))*100) %>%
mutate(`Economically Active %`=
(`Economically Active`/(Young+`Economically Active`+Aged))*100) %>%
mutate(`Aged %` = (Aged/(Young+`Economically Active`+Aged))*100)
data_combined <- data_combined %>% group_by(`Planning Area`, Subzone, Year) %>%
mutate(Total=Young+`Economically Active`+Aged) %>%
mutate(Dependency=(Young+Aged)/`Economically Active`) %>%
filter(!is.na(Dependency) & !is.infinite(Dependency))
# View the data structure
str(data_combined, give.attr=F)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 2083 obs. of 12 variables:
## $ Planning Area : Factor w/ 55 levels "Ang Mo Kio","Bedok",..: 39 39 39 39 39 39 39 39 39 6 ...
## $ Subzone : Factor w/ 323 levels "Admiralty","Airport Road",..: 1 1 1 1 1 1 1 1 1 3 ...
## $ Region : Factor w/ 5 levels "Central Region",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ Year : int 2011 2012 2013 2014 2015 2016 2017 2018 2019 2011 ...
## $ Young : int 4130 4120 4030 4300 4630 4630 4580 4440 4370 4250 ...
## $ Economically Active : int 7360 7390 7370 8070 8680 8770 8670 8500 8380 9690 ...
## $ Aged : int 760 850 870 970 1080 1130 1170 1190 1280 2310 ...
## $ Young % : num 33.7 33.3 32.8 32.2 32.2 ...
## $ Economically Active %: num 60.1 59.8 60.1 60.5 60.3 ...
## $ Aged % : num 6.2 6.88 7.09 7.27 7.51 ...
## $ Total : int 12250 12360 12270 13340 14390 14530 14420 14130 14030 16250 ...
## $ Dependency : num 0.664 0.673 0.665 0.653 0.658 ...
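As a quick sanity check, the first row of the output above (Young = 4130, Economically Active = 7360, Aged = 760) can be recomputed by hand. The sketch below simply repeats the formulas from the code above on those three numbers:

```r
# Values taken from the first row of the str() output above
young  <- 4130
active <- 7360
aged   <- 760

total      <- young + active + aged      # 12250
young_pct  <- young / total * 100        # share of Young
aged_pct   <- aged / total * 100         # share of Aged
dependency <- (young + aged) / active    # dependency ratio

print(c(total = total,
        young_pct = round(young_pct, 1),
        aged_pct = round(aged_pct, 1),
        dependency = round(dependency, 3)))
```

The results (12250, 33.7, 6.2 and 0.664) agree with the Total, Young %, Aged % and Dependency columns shown in the output.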
One way to display the distribution of dependency ratio is to use a facet grid to display across Region and by Year. But this design is not as compact. Hence, we will use ggridges instead.
ggridges
Plot the distribution of the Dependency Ratio across Regions using ggridges.
library(grid)
grid.arrange(central_ridges,
north_ridges,
north_east_ridges,
west_ridges,
east_ridges,
nrow = 1,
bottom = textGrob(
"* Values represent Median Dependency Ratio",
gp = gpar(fontface = 1, fontsize = 11),
hjust = 1,
x = 1)
)
Looking at the ggridges visualisation above, it is difficult to see the changes in dependency ratio over time. In view of this, we will instead use a dumbbell plot, created with the ggalt package, to see the changes in dependency ratio across the different regions.
# Load ggalt
library(ggalt)
# Prepare data for Dumbbell Plot
db_data <- data_combined[,c(1:7)]
db_data_region <- db_data %>% group_by(Region, Year) %>%
summarise("Total Young"=sum(Young),
"Total Economically Active"=sum(`Economically Active`),
"Total Aged"=sum(Aged)) %>%
mutate("Dependency"=((`Total Young`+`Total Aged`)/
`Total Economically Active`)*100)
db_region_spread <- db_data_region %>% filter(Year==2011 | Year==2019) %>%
spread(Year, `Dependency`)
db_region_spread[is.na(db_region_spread)] <- 0
As seen from the Dumbbell plot at the Region level above, the changes in dependency ratio between 2011 and 2019 are much clearer. The Central region has the highest dependency ratio in 2019. Interestingly, the East region’s dependency ratio remains the same, while the North region’s declines.
Looking at the dependency ratio by region is useful for a brief overview. However, we can also look at the Planning Area level, as seen in the next section.
db_data_pa <- db_data %>% group_by(`Planning Area`, Year) %>%
summarise("Total Young"=sum(Young),
"Total Economically Active"=sum(`Economically Active`),
"Total Aged"=sum(Aged)) %>%
mutate("Dependency"=((`Total Young`+`Total Aged`)/
`Total Economically Active`)*100)
db_pa_spread <- db_data_pa %>% filter(Year==2011 | Year==2019) %>%
spread(Year, `Dependency`)
db_pa_spread[is.na(db_pa_spread)] <- 0
This visualisation has two downsides. Firstly, we are unable to see the dependency ratio for each year, as we are comparing only two years (2011 and 2019). Secondly, we are looking at the dependency ratio itself instead of its distribution.
However, compared to the ggridges visualisation, this is a much better visualisation, as one is able to see the changes at a glance. It is also difficult to compare the distributions of the dependency ratio when they are placed side by side, as in the ggridges visualisation.
In this visualisation, we will explore the population growth trend of each economy dependency group across Regions in Singapore.
Prepare the data by calculating the population growth rates for each of the economy dependency groups across the regions.
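For the dumbbell plots, the key data step is reshaping the ratios so that each area has one row with a 2011 value and a 2019 value. The following is a minimal sketch of that step in base R, with made-up ratios and hypothetical column names (`dep_2011`, `dep_2019`, `change`). Keeping only the identifier, year and ratio columns before reshaping ensures one row per area:

```r
# Illustrative values only (not taken from the SingStat data)
dep_long <- data.frame(
  Region = rep(c("Region A", "Region B"), each = 2),
  Year = rep(c(2011, 2019), times = 2),
  Dependency = c(60, 66, 72, 70)
)

# Long to wide: one row per Region, one column per Year
dep_wide <- reshape(dep_long, idvar = "Region", timevar = "Year", direction = "wide")
names(dep_wide) <- c("Region", "dep_2011", "dep_2019")

# Change between the two end points, i.e. the length of each dumbbell
dep_wide$change <- dep_wide$dep_2019 - dep_wide$dep_2011
print(dep_wide)
```

A table in this shape, with a start column and an end column, is what a dumbbell geom such as `geom_dumbbell()` from ggalt would typically be given.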
pop <- data_combined %>% group_by(Region, Year) %>%
mutate(Total = sum(Total), `Young Total`=sum(Young),
`Economically Active Total`= sum(`Economically Active`),
`Aged Total`=sum(Aged)) %>%
select(Region, Year,`Young Total`,`Economically Active Total`,
`Aged Total`, Total)
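The growth-rate calculation itself is not shown here. One common definition, assumed for this sketch, is the year-over-year percentage change within each region. The toy totals below are made up, and base R is used to keep the example self-contained:

```r
# Made-up regional totals, for illustration only
pop_toy <- data.frame(
  Region = rep(c("Region A", "Region B"), each = 3),
  Year = rep(2011:2013, times = 2),
  Total = c(100, 110, 121, 200, 190, 209)
)

# Sort so that each region's years are in order before differencing
pop_toy <- pop_toy[order(pop_toy$Region, pop_toy$Year), ]

# Year-over-year growth in percent, computed within each region;
# the first year of each region has no previous year, so it is NA
pop_toy$Growth <- ave(pop_toy$Total, pop_toy$Region,
                      FUN = function(x) c(NA, diff(x) / head(x, -1) * 100))
print(pop_toy)
```

The same pattern could be repeated for the Young, Economically Active and Aged totals to obtain one growth series per group.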
ggtern
Create a Ternary Plot with the ggtern package. First, create a Ternary plot for one year (i.e. 2011).
# Prepare data for Ternary Plot
tern_2011 <- data_combined %>% filter(`Young %`!=0, `Economically Active %`!=0, `Aged %`!=0,
Year==2011)
tern_plot_2011 <- ggtern(data=tern_2011, aes(x=`Young %`, y=`Economically Active %`, z=`Aged %`,
col=Region, size=Total)) +
geom_point(alpha=0.5) +
ggtitle(label= "Ternary Plot of Singapore Demographics in 2011") +
xlab("Young") +
ylab("Economically Active") +
zlab("Aged") +
theme_showarrows()
tern_plot_2011
To visualise across the different years, use the facet_wrap function in ggplot2.
This is a good overview; however, it is a static visualisation and it is difficult to make out each planning area (i.e. the bubbles). We will therefore use plotly to visualise the ternary plot in the next section.
plotly
tern2 <- data_combined %>% filter(`Young %`!=0, `Economically Active %`!=0, `Aged %`!=0)
tern_plot2 <- plot_ly(tern2, a= ~`Economically Active %`, b= ~`Young %`, c= ~`Aged %`, frame=~Year,
color= ~Region, type = "scatterternary", size = ~Total,
# Build the hover text; hoverinfo is an argument of plot_ly(), so it sits outside paste()
text = ~paste('Young: ', round(`Young %`,1),'%',
'<br>Economically Active: ',round(`Economically Active %`,1),'%',
'<br>Aged: ',round(`Aged %`,1),'%',
'<br>Subzone: ', Subzone,
'<br>Planning Area: ', `Planning Area`, sep=''),
hoverinfo="text",
marker = list(symbol='circle', opacity=0.4, sizemode="diameter", sizeref=2,
line=list(width=1, color='#666666'))) %>%
layout(
title=list(text="Singapore Demographics by Region from 2011-2019", x=0.4),
titlefont = list(family = "Arial",size = 16),
ternary=list(aaxis=list(title="Economically Active",min=0.5,
titlefont = list(family = "Arial")),
baxis=list(title="Young", min=0.2,
titlefont = list(family = "Arial")),
caxis=list(title="Aged",titlefont = list(family = "Arial")))
) %>%
animation_slider(
currentvalue = list(prefix="Year ",font=list(color="DimGray",
family = "Arial"))
) %>%
animation_opts(
2000, redraw = FALSE
)
tern_plot2
ggplot2
Use ggplot2 to plot a bubble plot of Young % vs. Aged % in 2011.
To see the bubble plot across the years, we can use the facet_grid function in ggplot2.
Here, we can only see a snapshot of the Young % and Aged % across the regions for each year, which does not make it easy for the reader to see the change over time.
Hence, we use plotly instead and include a slider, so that the reader is able to toggle the Bubble plot between the different years and see the changes in the proportion of Young against the Aged over time.
plotly
Use plotly to recreate the Bubble Plot below.
bubble_plot_ani <- plot_ly(
bubble_plot, x = ~`Aged %`, y = ~`Young %`, frame=~Year,
color = ~`Region`, colors='Pastel2', type = "scatter",
mode="markers", size= ~`Total`,
marker = list(symbol = 'circle', sizemode = 'diameter',
line = list(width = 1.1, color = '#666666'), opacity=0.6),
text = ~paste(sep='','Subzone: ', `Subzone`,
'<br>Planning Area: ', `Planning Area`,
'<br>Aged: ', round(`Aged %`,1),'%',
'<br>Young : ', round(`Young %`,1),'%',
'<br>Population: ',Total)) %>%
layout(
title=list(text="Bubble Plot of Young % vs Aged % in Singapore",
x=0.4),
titlefont = list(family = "Arial",size = 16),
xaxis = list(title = 'Aged %',
gridcolor = 'rgb(255, 255, 255)',
range=c(0,35),
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2, titlefont = list(family = "Arial")),
yaxis = list(title = 'Young %',
gridcolor = 'rgb(255, 255, 255)',
range=c(0,57),
zerolinewidth = 1,
ticklen = 5,
gridwidth = 2, titlefont = list(family = "Arial")),
plot_bgcolor = 'rgb(243, 243, 243)'
)%>%
animation_opts(
2000, redraw = FALSE
) %>%
animation_slider(
currentvalue = list(prefix = "Year ", font=list(color="DimGray",
family="Arial"))
)
bubble_plot_ani
To create a Sunburst chart, we need to know the levels of the hierarchy that we want to visualise.
Here we want to create different levels or hierarchy based on:
# Transform data
data_2019 <- data_combined %>% gather(Type, Total, Young:Aged, factor_key=TRUE) %>%
filter(Year==2019) %>%
select(Region,`Planning Area`, Subzone, Type, Total) %>%
rename(Planning_area = `Planning Area`)
# Check data
datatable(head(data_2019), options = list(dom = 't'))
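The call that draws the chart is not shown here. As far as we understand the sunburstR input format, it expects a two-column table: a path whose levels are joined by "-" and a count. The sketch below builds such a table from toy rows. The names, values and the `clean()` helper are hypothetical, and hyphens inside names are replaced so they are not mistaken for level separators:

```r
# Toy rows in the same shape as data_2019 (made-up names and values)
toy <- data.frame(
  Region = c("North Region", "North-East Region"),
  Planning_area = c("Area A", "Area B"),
  Subzone = c("Subzone A1", "Subzone B1"),
  Type = c("Young", "Aged"),
  Total = c(120, 45),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace hyphens inside names with spaces
clean <- function(x) gsub("-", " ", x, fixed = TRUE)

# One "-"-separated path per row: Region-Planning area-Subzone-Type
toy$path <- paste(clean(toy$Region), clean(toy$Planning_area),
                  clean(toy$Subzone), clean(toy$Type), sep = "-")
sunburst_input <- toy[, c("path", "Total")]
print(sunburst_input)
```

A table like `sunburst_input` could then be passed to the `sunburst()` function from sunburstR; the exact arguments used for the chart in this makeover are not known from the text.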
While this Sunburst chart is useful for one to visualise the levels of the population in 2019 according to regions, planning areas, subzone and the economy dependency group, this is only for one year. It would not be wise to create multiple Sunburst charts for every year. Furthermore, having multiple Sunburst Charts would be difficult for one to visualise the changes in the proportion over different years.
As such, bubble plot and ternary charts would be more appropriate.
The whole visualisation journey of this makeover has been an iterative process of figuring out which visualisations work and which do not. After much trial and error, Visualisation 1 (Dumbbell Plot), 2 (Population Growth Trend), 3 (Ternary Plot) and 4 (Bubble Plot) were chosen for the final visualisation of this DataViz makeover.
Below is the final visualisation.
Looking at the Dumbbell plot, we see that the dependency ratio increased across the regions between 2011 and 2019, with two exceptions: the North region, where the dependency ratio decreased, and the East region, where it did not change. An increase in dependency ratio means that there is increasing pressure or burden on the Economically Active in supporting the Young and Aged.
The decrease in dependency ratio in the North region could be due to new housing being built in the region over the years, such as Build-To-Order (BTO) flats, which brings an influx of Economically Active residents into the region. This is an interesting finding, and more can be explored as to why there is a decrease in dependency ratio in the North region.
Looking at the Population growth rate trends across the different regions and years, we observe the following trends:
In view of observations 1 and 2, we see that Singapore’s demographics are moving towards an aging population. Observation 3, meanwhile, prompts us to explore further and ask why there has been an increase (i.e. a steep increase from 2018 to 2019 for the East region and an increasing trend from 2014 to 2019 for the North region).
To explore observation 3 further, we can look at the HDB / DBSS completed flats data from 2008 to 2018, available from Data.gov.sg. Filtering for 2011 onwards, we can see from the bar chart below that Punggol (North-East region), Sengkang (North-East region) and Yishun (North region) have the highest numbers of newly completed flats from 2011 to 2018.
However, given that this HDB data does not include 2019, more can be done to investigate the increase in population growth rates in the North and East regions.
Looking at the Bubble Plot and Ternary Plot, we see that there is a shifting trend towards an aging population.
From the Ternary Plot, we observe that most of the subzones are moving towards the “Aged” corner of the plot. The Bubble Plot shows the percentage of young against the percentage of aged, with bubbles sized by total population and coloured by region. There we see a trend where most of the points are moving towards a higher proportion of Aged and a lower proportion of Young.
In Tableau, you have to build from scratch (e.g. creating multiple calculated fields) to create visualisations like dumbbell plots, ternary plots, bubble plots and sunburst charts. In R, by contrast, there are many available packages that allow us to do this in a much more efficient way: the tidyverse packages for data manipulation and visualisation, plotly for interactive plots, and specialised packages like ggtern or sunburstR for specialised visualisations like the Ternary plot or Sunburst Chart.
With R, everything is written in code, which means that one does not need to redo all the actions in order to tweak some part of the visualisation, as in Tableau. For example, in Tableau you need to drag and drop fields into rows or columns, edit or create new calculated fields, add colours or change the size of points under the Marks pane, add levels of detail and so on. In R, however, all these actions can be completed in one go with a chunk of code. Using R to create visualisations is therefore much more efficient and saves us time.
Furthermore, there are many in-built functions in R that allow one to easily see summary statistics and data types of the data in one glance.
R allows us to create highly customisable visualisations. For instance, the layout, axes or aesthetics of a plot can be changed with a few lines of code (e.g. facet grids, themes and so on). In Tableau, by contrast, you would need to change these manually, for example by going through the options to change the size or layout of the visualisations.