• 1. Major data and design challenges
    • 1.1 How to visualise multiple dimensions in a single graphic
    • 1.2 How to derive useful measures for insights
    • 1.3 How to clean and transform data in R to support the visualisation
  • 2. Methods to enhance the data and design
    • 2.1 Use motion chart to visualise the data and changes
    • 2.2 Formulate derived measures for further insights
    • 2.3 Group and transform data
    • 2.4 Proposed design
  • 3. Detailed steps on reproduction
    • 3.1 Data wrangling
    • 3.2 Basic Bubble Chart
    • 3.3 Animate with gganimate
  • 4. Final Visualisation
  • 5. Reflection on R vs Tableau
    • 5.1 Large numbers of packages
    • 5.2 Strong community support
    • 5.3 Depth of programmability

1. Major data and design challenges

1.1 How to visualise multiple dimensions in a single graphic

How can we effectively visualise changes in demographic structure across four dimensions (age group, economic dependency, geography and year) and one measure (resident population size) in a single graphic? The data has two further dimensions, sex and type of dwelling, but they are not part of the requirements. The SZ (subzone) dimension has 229 levels and the PA (planning area) dimension has 42 levels, which could make the visualisation cluttered and hard to read.

1.2 How to derive useful measures for insights

The data has only one measure, resident population size, which on its own offers limited insight. We also need to consider whether absolute or percentage representations of this measure are more effective with respect to the different dimensions.

1.3 How to clean and transform data in R to support the visualisation

Data cleansing and transformation must be performed in R for reproducibility. Fortunately, the data is restricted to the nine years from 2011 to 2019 and is already tabulated in one file. However, there are numerous rows with zero population, which may slow down processing or clutter the presentation. The data is in a tall format, requiring grouping, summation and division to derive the required measures.

2. Methods to enhance the data and design

2.1 Use motion chart to visualise the data and changes

A bubble plot is a scatter plot with a third and fourth variable mapped to circle size and colour. A motion chart adds a fifth variable, time, to the bubble plot and animates the changes through time.
For clarity, we make use of ggplot2 aesthetics, geometries, facets, coordinates and themes, together with gganimate animation:

  1. Size: resident population size
  2. Colour: SZ (subzone)
  3. Position: values and relative values along axes
  4. Frame: year changes
  5. Facets: region (to design additional grouping of planning areas into regions)

We may also explore dropping some dimensions in the visualisations if they are not meaningful or if their corresponding measures can be better represented using derived measures.

2.2 Formulate derived measures for further insights

For standardisation, we can use the percentage of young and the percentage of old as measures instead of the actual number of residents. Further research on Singstat's population statistics revealed the use of metrics such as the old-age support ratio, dependency ratio, old-age dependency ratio and density, which we can derive from our data. Density requires further data on the land area of each geography to calculate. However, for this exercise, we will focus on the percentage of young and the percentage of old as the two measures to compare on a static graphic.
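
As a rough illustration of the other metrics mentioned above, the sketch below derives dependency-style ratios from counts of young, active and aged residents. The data frame `toy` and its values are made up for illustration. The formulas are the commonly used definitions, and the age bands are the ones used in this post (Active: 25 to 64), which may differ from the bands in Singstat's official definitions.

```r
# Hypothetical counts for two subzones (values invented for illustration)
toy <- data.frame(
  SZ     = c("Subzone A", "Subzone B"),
  Young  = c(3000, 1000),
  Active = c(6000, 5000),
  Aged   = c(1000, 2500)
)

# Old-age support ratio: active residents per aged resident
toy$old_age_support <- toy$Active / toy$Aged
# Old-age dependency ratio: aged residents per 100 active residents
toy$old_age_dependency <- toy$Aged / toy$Active * 100
# Total dependency ratio: young plus aged residents per 100 active residents
toy$dependency <- (toy$Young + toy$Aged) / toy$Active * 100

toy
```

Like the percentages, these ratios are independent of the absolute population size, so subzones of very different sizes remain comparable.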

2.3 Group and transform data

We shall group the age cohorts into age groups of young, active and aged by Time, PA and SZ, and apply summations and divisions to derive the percentage of aged and the percentage of young for the x and y measures respectively. The data will need to be pivoted to the wide format for the R calculations. We shall also add a region grouping and bin the total resident population, for use in facets and possibly colour palettes.

2.4 Proposed design

This work attempts to recreate in R a similar visualisation previously done in Tableau Desktop. To limit the scope, it will not include the interactivity features, but it shall include all the enhancement methods identified above.

Similar visualisation done in Tableau Desktop

3. Detailed steps on reproduction

3.1 Data wrangling

3.1.1 Install required R packages

This code chunk installs the tidyverse, gganimate, gifski and RColorBrewer packages if they are missing and loads them into the R session.

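packages <- c('tidyverse','gganimate','gifski','RColorBrewer')
for(p in packages){
  # Install the package only if it cannot be loaded
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  # Attach the package to the current session
  library(p, character.only = TRUE)
}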

3.1.2 Import data file

The CSV file is placed in the data sub-directory and imported using the read_csv function. Column types are specified explicitly so that the ‘Time’ column is read as an integer.

data <- read_csv("data/respopagesextod2011to2019.csv",
                 col_names = TRUE,
                 col_types = "cccccdi")

3.1.3 Replace AG string

The AG (Age Group) values are replaced with strings describing three age groups:

  • Young: 0 to 24
  • Active: 25 to 64
  • Aged: 65 onwards

Note: The least specific patterns, 0_to_4 and 5_to_9, have to be replaced last to avoid ambiguity, e.g. 0_to_4 also matches part of 40_to_44.

# Replacements are applied in order, so the short patterns 0_to_4 and 5_to_9 come last
data$AG <- str_replace_all(data$AG, c(
  "10_to_14" = "Young", "15_to_19" = "Young", "20_to_24" = "Young",
  "25_to_29" = "Active", "30_to_34" = "Active", "35_to_39" = "Active",
  "40_to_44" = "Active", "45_to_49" = "Active", "50_to_54" = "Active",
  "55_to_59" = "Active", "60_to_64" = "Active",
  "65_to_69" = "Aged", "70_to_74" = "Aged", "75_to_79" = "Aged",
  "80_to_84" = "Aged", "85_to_89" = "Aged", "90_and_over" = "Aged",
  "0_to_4" = "Young", "5_to_9" = "Young"))
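
Because the named replacements are applied one after another, the result depends on their order. One order-independent alternative, sketched below, is to read the lower bound of each age band and bin it. This is not the method used in this post. The helper name `age_group` is hypothetical, and the sketch assumes every AG value starts with the lower age bound followed by an underscore, as in the values listed above.

```r
# Hypothetical helper: map an AG label such as "40_to_44" to Young/Active/Aged
age_group <- function(ag) {
  # Keep only the digits before the first underscore: "40_to_44" -> 40
  lower <- as.integer(sub("_.*$", "", ag))
  # Intervals are (-Inf, 24], (24, 64] and (64, Inf]
  as.character(cut(lower,
                   breaks = c(-Inf, 24, 64, Inf),
                   labels = c("Young", "Active", "Aged")))
}

age_group(c("0_to_4", "40_to_44", "65_to_69", "90_and_over"))
```

Under these assumptions, the replacement step could be written as `data$AG <- age_group(data$AG)`.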

3.1.4 Group and summarise data

The data is grouped by Time, PA, SZ and AG, and summarised by Pop into a new dataframe.

data2 <- data %>%
  group_by(Time, PA, SZ, AG) %>%
  # Sum the population within each group and return an ungrouped data frame
  summarise(pop = sum(Pop), .groups = "drop")

3.1.5 Pivot Wider

The data is pivoted to a wide format: the three AG categories (Young, Active and Aged) become columns, filled with the values from pop.

data2 <- data2 %>%
  pivot_wider(names_from = AG,
              values_from = pop)

3.1.6 Filter zeros row

Rows in which any of the three age groups has zero residents are filtered out, since these are not needed and would only add to the number of rows.

data2 <- data2 %>%
  filter(Young > 0,
         Active > 0,
         Aged > 0)

3.1.7 Add calculated fields

Calculated fields of percentages of young, active and aged are derived as measures.

data2 <- data2 %>%
  mutate(Total = Young + Active + Aged,
         pct.Young = Young / Total * 100,
         pct.Active = Active / Total * 100,
         pct.Aged = Aged / Total * 100)

3.1.8 Split ‘Total’ into 11 bins

A calculated field “Order” is derived by binning the resident population. This field will be used to assign the fill colour of the circles, so that the resident population can be read from the colour as well as the circle size.

Note: Letters are assigned so that the value does not have to be converted to a factor. The bin size is chosen to approximate the median and mean of the distribution. A total of 11 bins is chosen in order to use the diverging colour palettes in RColorBrewer, which support at most 11 classes.

data2 <- data2 %>%
  mutate(Order = case_when(
    (Total >= 0 & Total < 4000) ~ "a",
    (Total >= 4000  & Total < 8000) ~ "b",
    (Total >= 8000  & Total < 12000) ~ "c",
    (Total >= 12000  & Total < 16000) ~ "d",
    (Total >= 16000  & Total < 20000) ~ "e",
    (Total >= 20000  & Total < 24000) ~ "f",
    (Total >= 24000  & Total < 28000) ~ "g",
    (Total >= 28000  & Total < 32000) ~ "h",
    (Total >= 32000  & Total < 36000) ~ "i",
    (Total >= 36000  & Total < 40000) ~ "j",
    (Total >= 40000) ~ "k"))
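
The same equal-width binning can be sketched more compactly with base R's `cut()`. This is an alternative to the `case_when()` chain above, not the code used in this post, and the helper name `bin_total` is hypothetical.

```r
# Hypothetical helper: bin a population total into the letters "a" to "k"
bin_total <- function(total) {
  # Break points 0, 4000, ..., 40000, Inf give 11 left-closed intervals
  breaks <- c(seq(0, 40000, by = 4000), Inf)
  as.character(cut(total, breaks = breaks, labels = letters[1:11], right = FALSE))
}

bin_total(c(0, 3999, 4000, 39999, 40000, 123456))
```

Setting `right = FALSE` makes each interval include its lower bound and exclude its upper bound, which matches the `>=` and `<` comparisons in the `case_when()` version.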

3.1.9 Add Region

A third geographical dimension ‘Region’ is added. This will be used for facets in order to break down the plot for comparison.

Note: It is probably easier to join the file with the region information rather than hard-coding the values :)

data2 <- data2 %>%
  mutate(Region = case_when(
    PA == "Bishan" ~ "Central Region",
    PA == "Bukit Merah" ~ "Central Region",
    PA == "Bukit Timah" ~ "Central Region",
    PA == "Downtown Core" ~ "Central Region",
    PA == "Geylang" ~ "Central Region",
    PA == "Kallang" ~ "Central Region",
    PA == "Marina East" ~ "Central Region",
    PA == "Marina South" ~ "Central Region",
    PA == "Marine Parade" ~ "Central Region",
    PA == "Museum" ~ "Central Region",
    PA == "Newton" ~ "Central Region",
    PA == "Novena" ~ "Central Region",
    PA == "Orchard" ~ "Central Region",
    PA == "Outram" ~ "Central Region",
    PA == "Queenstown" ~ "Central Region",
    PA == "River Valley" ~ "Central Region",
    PA == "Rochor" ~ "Central Region",
    PA == "Singapore River" ~ "Central Region",
    PA == "Southern Islands" ~ "Central Region",
    PA == "Straits View" ~ "Central Region",
    PA == "Tanglin" ~ "Central Region",
    PA == "Toa Payoh" ~ "Central Region",
    PA == "Bedok" ~ "East Region",
    PA == "Changi" ~ "East Region",
    PA == "Changi Bay" ~ "East Region",
    PA == "Pasir Ris" ~ "East Region",
    PA == "Paya Lebar" ~ "East Region",
    PA == "Tampines" ~ "East Region",
    PA == "Central Water Catchment" ~ "North Region",
    PA == "Lim Chu Kang" ~ "North Region",
    PA == "Mandai" ~ "North Region",
    PA == "Sembawang" ~ "North Region",
    PA == "Simpang" ~ "North Region",
    PA == "Sungei Kadut" ~ "North Region",
    PA == "Woodlands" ~ "North Region",
    PA == "Yishun" ~ "North Region",
    PA == "Ang Mo Kio" ~ "North-East Region",
    PA == "Hougang" ~ "North-East Region",
    PA == "North-Eastern Islands" ~ "North-East Region",
    PA == "Punggol" ~ "North-East Region",
    PA == "Seletar" ~ "North-East Region",
    PA == "Sengkang" ~ "North-East Region",
    PA == "Serangoon" ~ "North-East Region",
    PA == "Boon Lay" ~ "West Region",
    PA == "Bukit Batok" ~ "West Region",
    PA == "Bukit Panjang" ~ "West Region",
    PA == "Choa Chu Kang" ~ "West Region",
    PA == "Clementi" ~ "West Region",
    PA == "Jurong East" ~ "West Region",
    PA == "Jurong West" ~ "West Region",
    PA == "Pioneer" ~ "West Region",
    PA == "Tengah" ~ "West Region",
    PA == "Tuas" ~ "West Region",
    PA == "Western Islands" ~ "West Region",
    PA == "Western Water Catchment" ~ "West Region"))
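
As the note above suggests, a join against a lookup table is an alternative to hard-coding. The sketch below uses a five-row excerpt of the mapping and a toy stand-in for `data2`. The names `region_lookup`, `toy` and `toy_with_region` are hypothetical, as is the CSV path mentioned in the comment.

```r
library(dplyr)

# Small excerpt of a PA-to-region lookup table; in practice this could be
# read from a CSV file (for example a hypothetical "data/region_lookup.csv")
region_lookup <- tibble::tribble(
  ~PA,         ~Region,
  "Bishan",    "Central Region",
  "Bedok",     "East Region",
  "Woodlands", "North Region",
  "Punggol",   "North-East Region",
  "Clementi",  "West Region"
)

# Toy stand-in for data2 with only the columns needed here (values invented)
toy <- tibble::tibble(PA = c("Bedok", "Clementi", "Bishan"),
                      Total = c(12000, 8000, 5000))

# left_join keeps every row of toy and adds the matching Region
toy_with_region <- left_join(toy, region_lookup, by = "PA")
toy_with_region
```

A planning area missing from the lookup table would get `NA` as its Region, which is the same result `case_when()` gives for an unmatched value.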

3.2 Basic Bubble Chart

3.2.1 Arrange data

Arrange the data by region and by descending resident population, so that smaller circles are drawn after, and therefore in front of, larger circles and do not get hidden.

arrange(Region,desc(Total))

3.2.2 Set aesthetics

ggplot(aes(x = pct.Aged, y = pct.Young, size = Total, fill = Order))

3.2.3 Set geometries

geom_point(alpha = 0.8, shape = 21, color = "black")

3.2.4 Set scales, coordinates and themes

scale_fill_brewer(type = 'div', palette = 'RdYlBu', aesthetics = "fill") +
ylab("Percentage of Young") +
xlab("Percentage of Aged") +
scale_x_continuous(limits=c(0,50), expand=c(0,0)) +               
scale_y_continuous(limits=c(0,50), expand=c(0,0)) +
scale_size(range = c(0, 10)) +
coord_equal() +
theme(legend.position = "none",
plot.title = element_text(color="dark red", size=14, face="bold.italic"),
plot.subtitle = element_text(color="red", size=12, face="bold.italic"),
axis.title.x = element_text(color="dark blue", size=10),
axis.title.y = element_text(color="dark blue", size=10),
axis.text.x = element_text(size=7),
axis.text.y = element_text(size=7))

3.2.5 Static graphic

Note: Graphic shows all years instead of a selected year, so that the object can later be used for gganimate.

p <- data2 %>%
  arrange(Region,desc(Total)) %>%
  ggplot(aes(x = pct.Aged, y = pct.Young, size = Total, fill = Order)) +
  geom_point(alpha = 0.8, shape = 21, color = "black") +
  facet_wrap(~Region, scales = "fixed", shrink = FALSE) +
  scale_fill_brewer(type = 'div', palette = 'RdYlBu', aesthetics = "fill") +
  labs(title = "How has Singapore changed since 2011?",
       caption = "Data source: Singstat") +
  ylab("Percentage of Young") +
  xlab("Percentage of Aged") +
  scale_x_continuous(limits=c(0,50), expand=c(0,0)) +               
  scale_y_continuous(limits=c(0,50), expand=c(0,0)) +
  scale_size(range = c(0, 10)) +
  coord_equal() +
  theme(legend.position = "none",
        plot.title = element_text(color="dark red", size=14, face="bold.italic"),
        plot.subtitle = element_text(color="red", size=12, face="bold.italic"),
        axis.title.x = element_text(color="dark blue", size=10),
        axis.title.y = element_text(color="dark blue", size=10),
        axis.text.x = element_text(size=7),
        axis.text.y = element_text(size=7))
p

3.3 Animate with gganimate

Animate over ‘Time’, with a shadow wake to show the recent history of each circle.

p + transition_time(Time) +
  labs(subtitle = "Year: {frame_time}") +
  shadow_wake(wake_length = 0.1, alpha = FALSE)
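
The animation can also be rendered explicitly and written to a GIF file. The sketch below uses a small made-up data frame so that it runs on its own. The helper `render_gif`, the frame count, frame rate, dimensions and output file name are all assumptions for illustration, not settings from this post.

```r
library(ggplot2)
library(gganimate)

# Made-up data: two bubbles observed over three years
toy <- data.frame(
  Time      = rep(2011:2013, times = 2),
  pct.Aged  = c(10, 12, 14, 20, 19, 18),
  pct.Young = c(30, 28, 26, 20, 22, 24),
  Total     = c(5000, 5200, 5400, 20000, 19500, 19000),
  id        = rep(c("A", "B"), each = 3)
)

# group = id tells gganimate which points belong to the same bubble over time
anim <- ggplot(toy, aes(x = pct.Aged, y = pct.Young, size = Total, group = id)) +
  geom_point(alpha = 0.8) +
  transition_time(Time) +
  labs(subtitle = "Year: {frame_time}")

# Hypothetical helper: render with the gifski renderer and save to a file
render_gif <- function(anim, file = "motion_chart.gif") {
  rendered <- animate(anim, nframes = 30, fps = 10,
                      width = 600, height = 400,
                      renderer = gifski_renderer())
  anim_save(file, animation = rendered)
  invisible(file)
}

# render_gif(anim)  # uncomment to write motion_chart.gif
```

More frames give smoother motion between years at the cost of a longer render and a larger file.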

4. Final Visualisation

Overall, Singapore's population is ageing: the percentage of aged increased while the percentage of young decreased from 2011 to 2019, as can be seen from the movement of the circles from top left to bottom right. However, some subzones are rejuvenating, with their circles moving in the opposite direction, as part of government town renewal. Most of this renewal occurs in subzones with smaller resident populations, represented by the smaller red circles, predominantly in the Central Region. Some mid-size subzones, represented by the light yellow and light blue circles, can also be seen rejuvenating in the other regions. The more populated subzones, shown as dark blue circles, are generally ageing. The Central Region has more subzones, but each tends to be smaller, compared with the fewer but larger subzones in the other regions.

5. Reflection on R vs Tableau

5.1 Large numbers of packages

R has a large number of available packages, compared with the in-house development of Tableau. For example, the tidyverse is a collection of packages for plotting, data wrangling and data connectivity, which can be further enhanced with gganimate for animation, plotly for interactivity, RColorBrewer for colour palettes and Shiny for dashboards. Tableau has also tried to mimic third-party development through its extensions. However, the available extensions are currently limited, as developers have to weigh the returns of developing for proprietary software against those of developing for an open system.

5.2 Strong community support

R has strong community support. Documentation on functions and arguments, reusable code and examples, know-how and solutions to problems are readily available and actively shared within the R community. There are also numerous free and paid training materials, online and offline, covering beginner to expert courses on R. These make for a shorter learning curve. Tableau has such resources too, albeit in more limited quantities. This is partly due to the more limited functions available in Tableau; there is only so much that you can learn.

5.3 Depth of programmability

R allows every function to be programmed in depth, limited only by one’s creativity. This extends into emerging technologies such as machine learning and big data. However, at the time of writing, Tableau is also developing new machine learning capabilities.