1. Introduction

Data Source: Singapore Residents by Planning Area/Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 (Singapore Department of Statistics, 2019)

This notebook uses the data listed above to reveal the demographic structure of Singapore’s population by age group and planning area. Singstat groups residents’ ages into 5-year bins (e.g. 0-4, 5-9, …), and the planning areas are geographical boundaries around the island defined by the Urban Redevelopment Authority, a government statutory board. In total, there are 55 distinct Planning Areas in Singapore, each divided into smaller subzones. The dataset contains the population size of these areas, broken down by type of dwelling.

This notebook focuses on the population size of the larger Planning Areas, broken down by Gender and Age Group. Three visualisations are used: a Population Pyramid of the overall population, a Ternary Plot of Age Groups, and a Treemap of Age Groups within Planning Areas.

2. Major Data & Design Challenges

2.1. Data Cleaning

Data cleaning removes or fixes inaccurate or bad data within the given dataset. This dataset requires minimal cleaning: mainly encoding the ordinal variable (Age Group) and checking for null values. Values of 0 are not removed, as a population of 0 is still meaningful for analysis.

2.2. Data Wrangling

To create effective visualisations, the data must be transformed and reshaped to obtain the needed figures, for example by grouping by Planning Area and aggregating the total population. R allows data wrangling to be carried out programmatically, which supports more complex operations than a dashboarding tool like Tableau. However, chaining pipes (%>%) correctly can be difficult, as mistakes in either syntax or logic can easily produce the wrong output.
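
One way to reduce pipe-chaining mistakes is to build the chain one step at a time and inspect each intermediate result. The following is a minimal sketch on made-up toy data that only borrows the column names PA, Sex and Pop from the dataset; the object names toy, grouped and pop_by_pa are illustrative and do not appear elsewhere in this notebook.

```r
library(dplyr)

# Toy data using the dataset's column names (values are made up)
toy <- data.frame(
  PA  = c("Bedok", "Bedok", "Tampines", "Tampines"),
  Sex = c("Males", "Females", "Males", "Females"),
  Pop = c(10, 20, 30, 40)
)

# Step 1: group only, then inspect the grouping before aggregating
grouped <- toy %>% group_by(PA)
group_vars(grouped)

# Step 2: extend the chain with the aggregation and the sort
pop_by_pa <- grouped %>%
  summarise(Total = sum(Pop)) %>%
  arrange(desc(Total))

pop_by_pa
```

Once each step has been checked, the intermediate objects can be folded back into a single chain.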

2.3. Number of dimensions in dataset

This dataset has 4 main variables: Population, Planning Area, Age Group and Sex (6 if type of dwelling and subzone are included, but they are not used for analysis in this notebook). Visualisations that encode more than 3 dimensions are extremely difficult for users to interpret. We therefore limit the number of dimensions in each visualisation and use more visualisations to ensure that potential insights are not lost.

3. Description & Insights

3.1. Description

Overall Population Pyramid: The population pyramid allows us to view the age distribution of the overall Singapore population in 5-year bins, which are already defined in the dataset. It also allows us to compare the population distribution between the two genders in Singapore. As this is a relatively simple visualisation, ggplot is used to create a static image.
Ternary Plot of overall population by Age Group: A ternary plot allows us to visualise the age group distribution in three distinct parts: Young (age 24 and below), Active (working age, 25 to 64), and Old (age 65 and above). Plotly is used so that extra information is available on hovering over the scatter points within the ternary plot, as the points tend to cluster together and overlap, which can make them difficult to read. A hover tooltip containing additional information such as Gender, Planning Area and the values of the 3 categories makes this information more accessible.
Treemap of population distribution by Planning Area and Age Group: The data has a hierarchical structure, as the Planning Areas are part of the overall Singapore population and can be further broken down into the 3 age groups defined for the ternary plot (e.g. Singapore –> Ang Mo Kio –> Ang Mo Kio, Young). This hierarchical relationship allows us to form a treemap, which represents the population proportions within Singapore more effectively than a Pie or Sunburst chart because there are 55 different Planning Areas in the dataset. Plotly is used for this chart as well, for the ability to zoom in and out of sections of the treemap and to show additional information through hover tooltips. This helps to mitigate a major drawback of treemaps, where cells with small proportions are difficult to read.

3.2. Sketch of proposed design

4. EDA

4.1. Install and load required R packages

tidyverse is a set of packages used for data cleaning and wrangling
skimr is a package designed to provide summary statistics about variables, with more information than base R’s summary methods
ggplot2 is a plotting library used to create static plots, used here for the population pyramid
plotly is another plotting library, used to create interactive JavaScript-based visualisations for the Treemap and Ternary Plot

# Install any missing packages, then load all required packages
packages <- c('tidyverse', 'ggplot2', 'plotly', 'skimr')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

## Loading required package: tidyverse

## -- Attaching packages ------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  3.0.3     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts --------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## Loading required package: plotly

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

## Loading required package: skimr

4.2. Data Cleaning and filtering

Read in the data and filter it to 2019 only, then validate its structure.

# Read in data into dataframe
pop_data <- utils::read.csv('./respopagesextod2011to2019.csv')
pop_data <- pop_data[pop_data$Time==2019, ]
str(pop_data)

## 'data.frame':    98192 obs. of  7 variables:
##  $ PA  : Factor w/ 55 levels "Ang Mo Kio","Bedok",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SZ  : Factor w/ 323 levels "Admiralty","Airport Road",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ AG  : Factor w/ 19 levels "0_to_4","10_to_14",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex : Factor w/ 2 levels "Females","Males": 2 2 2 2 2 2 2 2 1 1 ...
##  $ TOD : Factor w/ 8 levels "Condominiums and Other Apartments",..: 2 3 4 5 6 7 1 8 2 3 ...
##  $ Pop : int  0 10 10 20 0 0 50 0 0 10 ...
##  $ Time: int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...

Get a sense of what the actual data looks like using head()

head(pop_data)

Generate Summary Statistics

skim(pop_data)

Data summary
Name	pop_data
Number of rows	98192
Number of columns	7
_______________________
Column type frequency:
factor	5
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
PA	1	FALSE	55	Buk: 5168, Que: 4560, Ang: 3648, Dow: 3648
SZ	1	FALSE	323	Adm: 304, Air: 304, Ale: 304, Ale: 304
AG	1	FALSE	19	0_t: 5168, 10_: 5168, 15_: 5168, 20_: 5168
Sex	1	FALSE	2	Fem: 49096, Mal: 49096
TOD	1	FALSE	8	Con: 12274, HDB: 12274, HDB: 12274, HDB: 12274

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Pop	0	1	41.08	130.07	0	0	0	20	2440	▇▁▁▁▁
Time	0	1	2019.00	0.00	2019	2019	2019	2019	2019	▁▁▇▁▁

Check for null values

sum(is.na(pop_data))

## [1] 0
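
A single total does not show which column a missing value comes from. As a minimal sketch on made-up data (the toy data frame and the names na_per_column and zero_rows are illustrative, not part of this notebook), the check can be done per column, with zero counts tallied separately because they are valid observations:

```r
# Toy data with one missing value in each column (values are made up)
toy <- data.frame(
  PA  = c("Bedok", NA, "Tampines"),
  Pop = c(10, 0, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Zero populations are real observations, so count them separately from NAs
zero_rows <- sum(toy$Pop == 0, na.rm = TRUE)
zero_rows
```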

AG is an ordinal variable, therefore we need to convert it from an unordered factor to an ordered one using factor(). We do not filter out Planning Areas with population values of 0, as a “0” value is still relevant for analysis.

pop_data$AG <- factor(pop_data$AG, ordered = TRUE, 
                      levels = c('0_to_4', '5_to_9', '10_to_14', '15_to_19', '20_to_24', '25_to_29', 
                                 '30_to_34', '35_to_39', '40_to_44', '45_to_49', '50_to_54', '55_to_59', 
                                 '60_to_64','65_to_69', '70_to_74','75_to_79', '80_to_84','85_to_89', '90_and_over'))

str(pop_data)

## 'data.frame':    98192 obs. of  7 variables:
##  $ PA  : Factor w/ 55 levels "Ang Mo Kio","Bedok",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SZ  : Factor w/ 323 levels "Admiralty","Airport Road",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ AG  : Ord.factor w/ 19 levels "0_to_4"<"5_to_9"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex : Factor w/ 2 levels "Females","Males": 2 2 2 2 2 2 2 2 1 1 ...
##  $ TOD : Factor w/ 8 levels "Condominiums and Other Apartments",..: 2 3 4 5 6 7 1 8 2 3 ...
##  $ Pop : int  0 10 10 20 0 0 50 0 0 10 ...
##  $ Time: int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...

By running str() on the data again, we can see that AG is now Ordinal (Ord.factor). The rest of the factors are categorical.
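
The explicit level order matters because the default factor levels are sorted alphabetically, which is why "10_to_14" appears straight after "0_to_4" in the first str() output. The following sketch shows the difference on a three-element toy vector; the names ag and ag_ord are illustrative only.

```r
ag <- c("5_to_9", "0_to_4", "10_to_14")

# Default levels are sorted as text, so "10_to_14" comes before "5_to_9"
default_levels <- levels(factor(ag))
default_levels

# Explicit levels restore the natural age order
ag_ord <- factor(ag, levels = c("0_to_4", "5_to_9", "10_to_14"), ordered = TRUE)
sort(ag_ord)

# Ordered factors also support comparisons between age groups
ag_ord[1] < ag_ord[3]
```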

5. Population Pyramid

5.1. Preparing the Data

males <- dplyr::filter(pop_data, Sex == "Males")
females <- dplyr::filter(pop_data, Sex == "Females")

Generate the aggregated population values as a table, to check that the plot is generated correctly.

age_sex_vals <- aggregate(pop_data$Pop, by = list(Category = pop_data$Sex, AG = pop_data$AG), FUN = sum)
male_vals <- dplyr::filter(age_sex_vals, Category == "Males")
male_vals

female_vals <- dplyr::filter(age_sex_vals, Category == "Females")
female_vals
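
As an alternative cross-check, the same totals can be laid out as a single age-by-sex table, which is easier to compare against the two sides of the pyramid. This is a sketch on made-up data using base R's xtabs(); the toy data frame and the name age_by_sex are illustrative and not used elsewhere in this notebook.

```r
# Toy data using the dataset's column names (values are made up)
toy <- data.frame(
  AG  = factor(c("0_to_4", "0_to_4", "5_to_9", "5_to_9", "5_to_9"),
               levels = c("0_to_4", "5_to_9"), ordered = TRUE),
  Sex = c("Males", "Females", "Males", "Females", "Females"),
  Pop = c(10, 20, 30, 40, 50)
)

# Sum Pop for every combination of age group (rows) and sex (columns)
age_by_sex <- xtabs(Pop ~ AG + Sex, data = toy)
age_by_sex
```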

5.2. Plotting

# Plot two bar charts, one with negated values, to form the two opposite sides of the pyramid
pyramid <- ggplot() +
  geom_col(data = males, aes(y = AG, x = Pop, fill = Sex)) +
  geom_col(data = females, aes(y = AG, x = -Pop, fill = Sex))

# Format Axes
pyramid <- pyramid +
  scale_x_continuous(breaks = c(-150000, -100000, -50000, 0, 50000, 100000, 150000), 
                     labels = c("150000", "100000", "50000", "0", "50000", "100000", "150000"))

# Format labels
pyramid <- pyramid + ggtitle("Overall Singapore Population Pyramid by Age Group, 2019") +
  xlab("Population") +
  ylab("Age Group") +
  labs(fill = "Gender") +
  scale_fill_brewer(palette = "Set1")

# This plot may display differently depending on your package versions; the version I usually see is on RPubs
pyramid

6. Ternary Plot of population age groups

6.1. Preparing the data

Ternary Plot of population structure. To create a ternary plot, the population must be split into 3 distinct groups:

Group	Age Range
Young	0 - 24
Active	25 - 64
Old	65 and above

Preparing the data

# Pivot age groups into their own columns
spread_pop_data <- pop_data %>% spread(AG, Pop)

# Group the 5-year age groups into the 3 main groups according to the table above
# (after spread(), columns 6 to 24 hold the 19 age groups in level order)
spread_pop_data <- spread_pop_data %>%
  mutate(Young = rowSums(.[6:10])) %>%
  mutate(Active = rowSums(.[11:18])) %>%
  mutate(Old = rowSums(.[19:24]))

# Group by Planning Area and Sex
ternary_data <- spread_pop_data %>%
  group_by(PA, Sex) %>%
  summarise_at(vars(Young,Active,Old), sum)

ternary_data
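
The column positions used above depend on the exact column order after spread(). As a hedged alternative, the group can be derived from the age-group label itself, so the mapping does not rely on positions. This sketch uses a subset of the AG labels and base R only; the names ag_levels, lower_bound and group are illustrative.

```r
# A subset of the AG labels from the dataset
ag_levels <- c("0_to_4", "5_to_9", "20_to_24", "25_to_29",
               "60_to_64", "65_to_69", "90_and_over")

# The lower bound of each bin is the number before the first underscore
lower_bound <- as.integer(sub("_.*$", "", ag_levels))

# Young: lower bound up to 24, Active: 25 to 64, Old: 65 and above
group <- cut(lower_bound,
             breaks = c(-Inf, 24, 64, Inf),
             labels = c("Young", "Active", "Old"))

data.frame(ag_levels, lower_bound, group)
```

The resulting lookup could then be joined to the long-format data and summed by group, instead of summing fixed column ranges.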

6.2. Plotting

# Function to format axes
axis <- function(title) {
  list(title = title,
       tickformat = ".0%")
}

fig <- plot_ly(
  ternary_data,
  type = 'scatterternary',
  mode = 'markers',
  a = ~Young,
  b = ~Active,
  c = ~Old,
  marker = list(
    color = '#DB7365',
    line = list('width' = 2)
  ),
  hovertemplate = ~paste( "<br>",
    "Area: ", `PA`, "<br>",
    "Gender: ", `Sex`, "<br>",
    "Young: ", `Young`, "<br>",
    "Active: ", `Active`, "<br>",
    "Old: ", `Old`
  )
)

# Title formatting
title_format <- list(text="Singapore Population by Age Groups, 2019",
             xanchor="left",
             xref="container",
             x = 0)

# Layout formatting
fig <- fig %>% layout(
  title = title_format,
  ternary = list(
    aaxis = axis('Young'),
    baxis = axis('Active'),
    caxis = axis('Old')
  )
)

fig
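
The tooltip above shows raw counts, while the axes are formatted as percentages. To read a point against the axes, the counts can be converted into row-wise shares. The following is a minimal sketch on made-up numbers; the toy data frame and the names counts and shares are illustrative and are not used by the plot above.

```r
# Toy counts for two hypothetical areas (values are made up)
toy <- data.frame(
  PA     = c("A", "B"),
  Young  = c(30, 10),
  Active = c(50, 60),
  Old    = c(20, 30)
)

# Divide each count by its row total so the three shares sum to 1
counts <- as.matrix(toy[, c("Young", "Active", "Old")])
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```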

7. Treemap

7.1. Preparing the data

# Group by Planning Area
pop_by_area <- spread_pop_data %>%
  group_by(PA) %>%
  summarise_at(vars(Young,Active,Old), sum) %>%
  mutate(Total = rowSums(.[2:4]))

# Pivot the Age Groups to individual rows
pop_by_area <- gather(pop_by_area, Category, Pop, Young:Total)

# cast PA into string for manipulation
i <- sapply(pop_by_area, is.factor)
pop_by_area[i] <- lapply(pop_by_area[i], as.character)
# Create a new ID string for treemap
pop_by_area$id <- paste0(pop_by_area$PA, "-",pop_by_area$Category)
# Assignment of parent IDs, young/active/old to total, totals to singapore, singapore to empty string
pop_by_area$parent <- paste0(pop_by_area$PA, "-Total")
pop_by_area$parent[pop_by_area$parent == pop_by_area$id] <- "Singapore"
pop_by_area <- pop_by_area %>% add_row(PA= "Singapore", 
                                       Category = "", 
                                       Pop = sum(pop_by_area$Pop[pop_by_area$Category != "Total"]), 
                                       id = "Singapore", parent = "")
pop_by_area
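
Because the treemap in the next section is drawn with branchvalues = "total", each parent's value is expected to cover the sum of its children, and every parent must exist as an id. A quick consistency check before plotting can catch mistakes in the id and parent columns. This is a sketch on a made-up one-area hierarchy that mirrors the id, parent and Pop columns; the names toy, missing_parents, child_sums and mismatched are illustrative.

```r
# Toy hierarchy with one hypothetical area "A" (values are made up)
toy <- data.frame(
  id     = c("Singapore", "A-Total", "A-Young", "A-Active", "A-Old"),
  parent = c("", "Singapore", "A-Total", "A-Total", "A-Total"),
  Pop    = c(100, 100, 30, 50, 20),
  stringsAsFactors = FALSE
)

has_parent <- toy$parent != ""

# Every non-root parent must also appear as an id
missing_parents <- setdiff(toy$parent[has_parent], toy$id)

# Sum of the children under each parent, compared with the parent's own value
child_sums  <- tapply(toy$Pop[has_parent], toy$parent[has_parent], sum)
parent_vals <- toy$Pop[match(names(child_sums), toy$id)]
mismatched  <- names(child_sums)[child_sums != parent_vals]

missing_parents
mismatched
```

Both results should be empty before the data is passed to plot_ly().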

7.2. Plotting

fig2 <- plot_ly(
  type = "treemap",
  branchvalues="total",
  data = pop_by_area,
  labels = pop_by_area$id,
  textinfo = "label+value+percent parent+percent root",
  hoverinfo = "label+value+percent parent+percent root",
  parents = pop_by_area$parent,
  values = pop_by_area$Pop,
  marker=list(colorscale='Rainbow')
)

# No title for this chart because the navigation bar would cut into it
fig2

8. Insights

The following insights can be gained from the visualisations:

The population pyramid shows that there are more females than males aged 70 and over, with longer bars at the top of the pyramid, whereas the bar lengths for the two genders are much closer for ages below 70. This is consistent with other sources stating that females generally have a longer lifespan than males, and the factors behind this are worth further investigation.
From the population pyramid and the ternary plot, the majority of Singaporeans are of Active working age (25 to 64), with the southern areas of Singapore like Museum and Downtown Core trending more towards Active adults (Bottom left corner) for both males and females.
The top 3 most populated Planning Areas in Singapore are Bedok, Jurong West and Tampines, accounting for 20% of the country’s total local population spread out across 55 different Planning Areas.
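
A figure such as the share held by the top 3 Planning Areas can be computed directly from the per-area totals rather than read off the treemap. The following sketch uses made-up totals for five hypothetical areas, so the resulting share is not the real figure; the names totals, top3 and share are illustrative.

```r
# Made-up totals for five hypothetical areas
totals <- c(A = 50, B = 40, C = 45, D = 5, E = 60)

# Three largest areas and their combined share of the overall total
top3  <- sort(totals, decreasing = TRUE)[1:3]
share <- sum(top3) / sum(totals)

top3
share
```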

9. Future Improvements

With additional data such as geospatial information, more visualisations can be created by merging the new data with the existing data, for example replacing the Treemap with a Choropleth map by using shapefiles to map the planning areas onto a geographical map. More interactivity can also be added by embedding the charts in a Shiny dashboard.

ISSS608 Assignment 4

Darren Gan