This assignment aims to explore Singapore’s demographic patterns by age groups and by the country’s planning areas in 2019 .
The dataset used is taken from Singapore Residents by Planning Area*, Subzone, Age Group, Sex and Type of Dwelling June 2011-2019 data series published by the Singapore Department of Statistics.
*Planning areas refer to the areas demarcated in the Urban Redevelopment Authority’s Master Plan 2014. 2019 data was used for this data visualisation.
The census contains demographic details of Singapore’s residents (Age Groups, Sex, location and their respective Planning Area, Subzones and Type of Dwellings) and contains approximately 883,728 residents’ details (2011- 2019) in total.
For this purpose of this visualisation, we will look at 2019 dataset and explore:
Currently the dataset is grouped according to these different categories as seen from the table below.
| PA | SZ | AG | Sex | TOD | POP | Time |
|---|---|---|---|---|---|---|
| Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 1-and 2 Room Flats | 0 | 2011 |
| Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 3-Room Flats | 10 | 2011 |
1. Dataset consists of data from 2011 to 2019
Filtering of data from 2019 is required for a indepth analysis on 2019 resident’s data
2. Difficulties in aggregating resident’s count according to the residents’ age group and planning areas
The data is currently segmented according to the subzones and the type of dwellings, making it difficult for data aggregation and analysis via age groups and planning areas. Further data wrangling would be required to aggregate and group resident count by Age Group, Planning areas and Gender.
3. Segmenting age groups into different generation groups
The data is currently segmented according to different age groups. However as the purpose of the visualisation is to derive insights and possible actions from population density and population of planning areas, having too granular segmentation of age-groups would hinder instead of drive insights. Hence combining of age groups into different generation groups (e.g. young, young working adults, economic active, retirees, old) is required for a broader view.
4. Conflicts between R packages
The original idea was to leverage on the ggpol facet_share function to build a butterfly chart with a shared axis to explore the relationship between planning areas, gender and generation groups. However as the ggtern package used for designing ternary plots also changes the coordinates, it overrides the ggpol package’s functionality, rendering it unusable. Hence the idea was mooted.
library('knitr')
include_graphics("/Users/jayneteo/Dropbox/SMU MITB/Term 2 2020/Visual Analytics/Assignments/Assignment 4/Untitled Notebook (4)/Sketch1.jpg")
include_graphics("/Users/jayneteo/Dropbox/SMU MITB/Term 2 2020/Visual Analytics/Assignments/Assignment 4/Untitled Notebook (4)/Sketch2.jpg")
Given that there are categorical and numeric variables at play:
A butterfly barplot (population pyramid) was plotted to illustrate the distribution of various age groups in Singapore. If a population is a young population, it should mirror the shape of a pyramid
To explore the relationship between Planning areas, Generation Groups and the corresponding Population count, a ranked stacked barplot was plotted
To explore if there are planning areas preferred by Singaporeans, a treemap has been plotted
Lastly, to explore the relationship between generation groups, a ternary plot was plotted
Tidyverse contains a set of essential packages for data wrangling and data visualisation
ggpubr" is an extension of ggplot2 and provides easy-to-use functions for creating and customizing ggplot2 based publication ready plots
Treemap is an extension of ggplot2 and offers great flexibility in drawing treemaps
kableExtra and knitr are a packages that offer flexibility in creating tables
ggtern is an extension of ggplot2 and provides easy to use function for creating ternary plots.
Please do not install or run the ggtern package con-currently as it overrides the ggplot coordinates. Please install / load the ggtern package only at the very last
packages = c('tidyverse','ggpubr','ggthemes', 'treemap','kableExtra','knitr')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The data is imported using read_csv() function which converts the data into a tibble dataframe.
popdata <- read_csv("/Users/jayneteo/Dropbox/SMU MITB/Term 2 2020/Visual Analytics/Assignments/Assignment 4/Singapore Residents by Planning AreaSubzone Age Group Sex and Type of Dwelling June 20112019/Population2011to2019.csv")
## Parsed with column specification:
## cols(
## PA = col_character(),
## SZ = col_character(),
## AG = col_character(),
## Sex = col_character(),
## TOD = col_character(),
## Pop = col_double(),
## Time = col_double()
## )
The glimpse() and the head() functions were used to understand the characteristic of the dataset.
glimpse(popdata)
## Rows: 883,728
## Columns: 7
## $ PA <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo …
## $ SZ <chr> "Ang Mo Kio Town Centre", "Ang Mo Kio Town Centre", "Ang Mo Kio …
## $ AG <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_t…
## $ Sex <chr> "Males", "Males", "Males", "Males", "Males", "Males", "Males", "…
## $ TOD <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats…
## $ Pop <dbl> 0, 10, 30, 50, 0, 0, 40, 0, 0, 10, 30, 60, 0, 0, 40, 0, 0, 10, 3…
## $ Time <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
head(popdata,10)
## # A tibble: 10 x 7
## PA SZ AG Sex TOD Pop Time
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 1- and 2-Room Flats 0 2011
## 2 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 3-Room Flats 10 2011
## 3 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 4-Room Flats 30 2011
## 4 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 5-Room and Executive… 50 2011
## 5 Ang Mo K… Ang Mo Kio Town… 0_to… Males HUDC Flats (excluding th… 0 2011
## 6 Ang Mo K… Ang Mo Kio Town… 0_to… Males Landed Properties 0 2011
## 7 Ang Mo K… Ang Mo Kio Town… 0_to… Males Condominiums and Other A… 40 2011
## 8 Ang Mo K… Ang Mo Kio Town… 0_to… Males Others 0 2011
## 9 Ang Mo K… Ang Mo Kio Town… 0_to… Femal… HDB 1- and 2-Room Flats 0 2011
## 10 Ang Mo K… Ang Mo Kio Town… 0_to… Femal… HDB 3-Room Flats 10 2011
Before any data wrangling is done, any rows with NA has been dropped with the drop_na() function.
drop_na(popdata)
## # A tibble: 883,728 x 7
## PA SZ AG Sex TOD Pop Time
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 1- and 2-Room Flats 0 2011
## 2 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 3-Room Flats 10 2011
## 3 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 4-Room Flats 30 2011
## 4 Ang Mo K… Ang Mo Kio Town… 0_to… Males HDB 5-Room and Executive… 50 2011
## 5 Ang Mo K… Ang Mo Kio Town… 0_to… Males HUDC Flats (excluding th… 0 2011
## 6 Ang Mo K… Ang Mo Kio Town… 0_to… Males Landed Properties 0 2011
## 7 Ang Mo K… Ang Mo Kio Town… 0_to… Males Condominiums and Other A… 40 2011
## 8 Ang Mo K… Ang Mo Kio Town… 0_to… Males Others 0 2011
## 9 Ang Mo K… Ang Mo Kio Town… 0_to… Femal… HDB 1- and 2-Room Flats 0 2011
## 10 Ang Mo K… Ang Mo Kio Town… 0_to… Femal… HDB 3-Room Flats 10 2011
## # … with 883,718 more rows
As we would only want to show data in 2019, the filter() function was used to select 2019 data.
popdata2019 <- popdata %>%
filter(`Time` == 2019)
glimpse(popdata2019)
## Rows: 98,192
## Columns: 7
## $ PA <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo …
## $ SZ <chr> "Ang Mo Kio Town Centre", "Ang Mo Kio Town Centre", "Ang Mo Kio …
## $ AG <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_t…
## $ Sex <chr> "Males", "Males", "Males", "Males", "Males", "Males", "Males", "…
## $ TOD <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats…
## $ Pop <dbl> 0, 10, 10, 20, 0, 0, 50, 0, 0, 10, 10, 20, 0, 0, 40, 0, 0, 10, 1…
## $ Time <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019…
The variables - Planning Areas, Age Groups, Population Count and Sex - were selected using the select() function for further data wrangling.
The variables are then grouped using the group_by() function.
popdata2019_select_gender <- popdata2019 %>%
select(PA,AG,Pop,Sex) %>%
group_by(PA,AG,Pop,Sex) %>%
mutate(AG = as_factor(AG),PA = as_factor(PA), Sex = as_factor(Sex))
popdata2019_select_gender
## # A tibble: 98,192 x 4
## # Groups: PA, AG, Pop, Sex [18,873]
## PA AG Pop Sex
## <fct> <fct> <dbl> <fct>
## 1 Ang Mo Kio 0_to_4 0 Males
## 2 Ang Mo Kio 0_to_4 10 Males
## 3 Ang Mo Kio 0_to_4 10 Males
## 4 Ang Mo Kio 0_to_4 20 Males
## 5 Ang Mo Kio 0_to_4 0 Males
## 6 Ang Mo Kio 0_to_4 0 Males
## 7 Ang Mo Kio 0_to_4 50 Males
## 8 Ang Mo Kio 0_to_4 0 Males
## 9 Ang Mo Kio 0_to_4 0 Females
## 10 Ang Mo Kio 0_to_4 10 Females
## # … with 98,182 more rows
To create the generation groups, the data was first transformed from a thin dataframe to a wide dataframe using the pivot_wider() function and then aggregated using the summarise() function. The variable “Sex” was also renamed to Gender for ease.
popdata2019_gender <- popdata2019_select_gender %>%
group_by(PA,AG, Sex) %>%
summarise_all(sum) %>%
pivot_wider(names_from = AG,
values_from = Pop) %>%
rename(Gender = Sex)
popdata2019_gender
## # A tibble: 110 x 21
## # Groups: PA [55]
## PA Gender `0_to_4` `10_to_14` `15_to_19` `20_to_24` `25_to_29` `30_to_34`
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ang … Femal… 2660 3670 3890 4390 5410 5440
## 2 Ang … Males 2760 3710 4040 4530 5210 5070
## 3 Bedok Femal… 4940 6580 7210 8080 9710 9180
## 4 Bedok Males 5080 6720 7430 8580 9820 8760
## 5 Bish… Femal… 1350 2100 2260 2680 3430 2770
## 6 Bish… Males 1500 2330 2480 2890 3660 2660
## 7 Boon… Femal… 0 0 0 0 0 0
## 8 Boon… Males 0 0 0 0 0 0
## 9 Buki… Femal… 3460 3880 4380 4930 6450 6270
## 10 Buki… Males 3670 3920 4420 4920 6060 6210
## # … with 100 more rows, and 13 more variables: `35_to_39` <dbl>,
## # `40_to_44` <dbl>, `45_to_49` <dbl>, `5_to_9` <dbl>, `50_to_54` <dbl>,
## # `55_to_59` <dbl>, `60_to_64` <dbl>, `65_to_69` <dbl>, `70_to_74` <dbl>,
## # `75_to_79` <dbl>, `80_to_84` <dbl>, `85_to_89` <dbl>, `90_and_over` <dbl>
The functions mutate() and rowSum() were used to create the various generation groups.
The row numbers used are the variables’ order according to the new data frame after spreading the data frame.
The age groups are binned into 5 new generation groups:
processeddata1 <- ungroup(popdata2019_gender) %>%
mutate(Children = rowSums(.[,c("0_to_4","5_to_9","10_to_14","15_to_19")]),
Young_Working_Adults = rowSums(.[,c("20_to_24","25_to_29","30_to_34","35_to_39")]),
Economic_Active = rowSums(.[,c("40_to_44","45_to_49","50_to_54","55_to_59")]),
Retirees = rowSums(.[,c("60_to_64","65_to_69","70_to_74","75_to_79")]),
Old = rowSums(.[,c("80_to_84","85_to_89","90_and_over")])) %>%
select(PA, Gender, Children, Young_Working_Adults, Economic_Active, Retirees,Old) %>%
group_by(PA) %>%
pivot_longer(cols = Children:Old,
names_to = "Age_Group",
values_to = "Population",
values_drop_na = TRUE) %>%
mutate(Age_Group = as_factor(Age_Group))
processeddata1
## # A tibble: 550 x 4
## # Groups: PA [55]
## PA Gender Age_Group Population
## <fct> <fct> <fct> <dbl>
## 1 Ang Mo Kio Females Children 13330
## 2 Ang Mo Kio Females Young_Working_Adults 21040
## 3 Ang Mo Kio Females Economic_Active 25660
## 4 Ang Mo Kio Females Retirees 21550
## 5 Ang Mo Kio Females Old 4190
## 6 Ang Mo Kio Males Children 13630
## 7 Ang Mo Kio Males Young_Working_Adults 19950
## 8 Ang Mo Kio Males Economic_Active 23930
## 9 Ang Mo Kio Males Retirees 18350
## 10 Ang Mo Kio Males Old 2800
## # … with 540 more rows
To create a ternary chart, three variables are required. Hence a secondary dataframe was created with three generation groups.
The age groups are binned into 3 new generation groups:
processeddata2 <- ungroup(popdata2019_gender) %>%
mutate(Young = rowSums(.[,c("0_to_4","5_to_9","10_to_14","15_to_19","20_to_24")]),
Economically_Active = rowSums(.[,c("25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59")]),
Old = rowSums(.[,c("60_to_64","65_to_69","70_to_74","75_to_79","80_to_84","85_to_89","90_and_over")]))
processeddata2
## # A tibble: 110 x 24
## PA Gender `0_to_4` `10_to_14` `15_to_19` `20_to_24` `25_to_29` `30_to_34`
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ang … Femal… 2660 3670 3890 4390 5410 5440
## 2 Ang … Males 2760 3710 4040 4530 5210 5070
## 3 Bedok Femal… 4940 6580 7210 8080 9710 9180
## 4 Bedok Males 5080 6720 7430 8580 9820 8760
## 5 Bish… Femal… 1350 2100 2260 2680 3430 2770
## 6 Bish… Males 1500 2330 2480 2890 3660 2660
## 7 Boon… Femal… 0 0 0 0 0 0
## 8 Boon… Males 0 0 0 0 0 0
## 9 Buki… Femal… 3460 3880 4380 4930 6450 6270
## 10 Buki… Males 3670 3920 4420 4920 6060 6210
## # … with 100 more rows, and 16 more variables: `35_to_39` <dbl>,
## # `40_to_44` <dbl>, `45_to_49` <dbl>, `5_to_9` <dbl>, `50_to_54` <dbl>,
## # `55_to_59` <dbl>, `60_to_64` <dbl>, `65_to_69` <dbl>, `70_to_74` <dbl>,
## # `75_to_79` <dbl>, `80_to_84` <dbl>, `85_to_89` <dbl>, `90_and_over` <dbl>,
## # Young <dbl>, Economically_Active <dbl>, Old <dbl>
Butterflychart <- popdata2019_select_gender %>%
filter(Pop != 0) %>%
mutate(AG = factor(AG, levels = c("0_to_4","5_to_9","10_to_14","15_to_19","20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59","60_to_64","65_to_69","70_to_74","75_to_79","80_to_84","85_to_89","90_and_over"))) %>%
ggplot(aes(x = AG, y = ifelse(test = Sex == "Males",
yes = - Pop, no = Pop),fill = Sex))+
geom_bar(stat = 'identity',width = 0.7)+
scale_fill_manual(values=c('lightpink2','steelblue3'))+
labs(title = "Singapore's demographic structure by age groups",
y = "Population",
fill = "Gender")+
theme(
axis.title.x = element_text(size = 9, face = 'bold'),
axis.title.y = element_blank(),
legend.title = element_text(size =8,face = 'bold'),
legend.text = element_text(size =8),
plot.title = element_text(size = 11, hjust = 0.5,face = 'bold',margin = margin(5,0,10,0))) +
scale_y_continuous(breaks = seq(-150000, 150000, 50000),
labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))))) +
coord_flip()
Butterflychart
Populationcount <- processeddata1 %>%
summarise(Total = sum(Population)) %>%
arrange(-Total) %>%
rename(Planning_Areas = PA)
kable(Populationcount) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Planning_Areas | Total |
|---|---|
| Bedok | 279970 |
| Jurong West | 265010 |
| Tampines | 257020 |
| Woodlands | 255000 |
| Sengkang | 244910 |
| Hougang | 227110 |
| Yishun | 220810 |
| Choa Chu Kang | 191100 |
| Punggol | 170920 |
| Ang Mo Kio | 164430 |
| Bukit Batok | 154140 |
| Bukit Merah | 152600 |
| Pasir Ris | 148210 |
| Bukit Panjang | 139700 |
| Toa Payoh | 121060 |
| Serangoon | 116490 |
| Geylang | 110520 |
| Kallang | 101940 |
| Queenstown | 96470 |
| Sembawang | 96070 |
| Clementi | 92910 |
| Bishan | 88230 |
| Jurong East | 79230 |
| Bukit Timah | 77720 |
| Novena | 49390 |
| Marine Parade | 46450 |
| Tanglin | 21710 |
| Outram | 19050 |
| Rochor | 12860 |
| River Valley | 10180 |
| Newton | 8000 |
| Singapore River | 2940 |
| Downtown Core | 2500 |
| Mandai | 2060 |
| Southern Islands | 1880 |
| Changi | 1790 |
| Orchard | 900 |
| Sungei Kadut | 700 |
| Western Water Catchment | 680 |
| Museum | 430 |
| Seletar | 260 |
| Lim Chu Kang | 70 |
| Boon Lay | 0 |
| Central Water Catchment | 0 |
| Changi Bay | 0 |
| Marina East | 0 |
| Marina South | 0 |
| North-Eastern Islands | 0 |
| Paya Lebar | 0 |
| Pioneer | 0 |
| Simpang | 0 |
| Straits View | 0 |
| Tengah | 0 |
| Tuas | 0 |
| Western Islands | 0 |
barchart <- processeddata1 %>%
mutate(PA = fct_reorder(PA,Population)) %>%
mutate(Age_Group = fct_rev(Age_Group)) %>%
mutate(Population = as.numeric(Population)) %>%
filter(Population != 0) %>%
ggplot(aes(y = reorder(PA,Population,sum),x = Population,fill = Age_Group))+
geom_bar(stat = 'identity')+
coord_cartesian(xlim = c(0,300000))+
scale_fill_brewer(palette = "RdBu")+
labs(title = "Singapore's demographics structure by age and by planning area",
x = "Population",
y = "Planning areas in Singapore",
fill = "Age Group")+
theme(
panel.background = element_blank(),
axis.line.y = element_line(),
axis.ticks.y = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y.left = element_text(size = 7),
axis.text.x.bottom = element_text(size = 7),
axis.title.x = element_text(size = 9, face = 'bold'),
axis.title.y = element_text(size = 9, face = 'bold'),
legend.title = element_text(size =8,face = 'bold'),
legend.text = element_text(size =8),
plot.title = element_text(size = 11, hjust = 0.5,face = 'bold',margin = margin(5,0,10,0)))+
scale_x_continuous(breaks = seq(0,300000,50000))
#geom_text(aes(Populationcount$Planning_Areas, Populationcount$Total, label = Populationcount$Total),size = 3, hjust = -0.2, data = Populationcount)
barchart
treemap package.tree_map <- processeddata1 %>%
treemap(index = c("PA"),
vSize = "Population",
type = "index",
fontsize.labels = 10,
fontcolor.labels = "black",
fontface.labels = 2,
align.labels = list(
c('center','center')
),
inflate.labels = F,
palette = 'Dark2',
title = "Planning Areas by population count",
fontsize.title = 12,
border.col = "white",
border.lwds = 0.5)
ggtern function.packages = c("ggtern")
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
## Loading required package: ggtern
## Registered S3 methods overwritten by 'ggtern':
## method from
## grid.draw.ggplot ggplot2
## plot.ggplot ggplot2
## print.ggplot ggplot2
## --
## Remember to cite, run citation(package = 'ggtern') for further info.
## --
##
## Attaching package: 'ggtern'
## The following objects are masked from 'package:ggplot2':
##
## aes, annotate, ggplot, ggplot_build, ggplot_gtable, ggplotGrob,
## ggsave, layer_data, theme_bw, theme_classic, theme_dark,
## theme_gray, theme_light, theme_linedraw, theme_minimal, theme_void
ternplot<- ggtern(data = processeddata2, aes(x=Young,y=Economically_Active, z=Old)) +
geom_point()
ternplot+
ggtitle("Ternary Plot of Population Structure, 2019")+
xlab("Young")+
ylab("Active")+
zlab("Old")+
theme_rgbw()+
theme(plot.title = element_text(hjust=0.5))
ternplot
There are four graphs plotted in total:
Looking at the population pyramid, it can be observed that Singapore’s population is aging rapidly. The bulk of the population is concentrated at the 45-59 age groups with decreasing birthrates. This demographic shift places pressure on the society as a shrinking workforce is tasked to support an rapidly aging population. Additionally, it can be observed that women tend to have longer life expectancy, judging by the proportion of females in the 75 and above age brackets.
Looking at the stacked barplot, it can be observed that the top five most populous areas are Bedok, Jurong West, Tampines, Woodlands and Sengkang. Not only are these areas the most populous, these areas also reflect a relatively younger demographic profile compared to the other planning areas. Amongst these five most populous areas, Bedok stands out for having both a balanced demographic profile - both young and old. This will have impact on the authorities’ amenities planning and housing developments for the future.