1. Overview

The purpose of this visualisation is to explore the demographic patterns in Singapore from 2011 to 2019. This has been done through visualisations such as an age-sex pyramid and slope charts to display the various relationships between regions/planning areas, population, and gender.

The first part of this analysis includes an age-sex pyramid that provides a snapshot of the 2019 Singapore population segmented by males and females. The second part of the analysis explore the relationship between region/planning area and the different population parameters using the slope chart.

The data for this assignment has been sourced from Singstat and Wikipedia.

2. Challenges

2.1 Major Challenges Faced

  1. The raw data extracted from the Singstat website had not been demarcated by geographical regions (such as north, south, east, west) or any other simplifying characteristic. Such demarcation is required because it makes patterns within the data a lot easier to discern. ‘Planning Region’ was included in the original dataset but contained far too many variables to visualise effectively.

  2. Since every visualisation requires the data to be structured in a specific manner, data manipulation is required every time. For instance, the age-sex pyramid requires the population to be aggregated by age and sex, whereas the slope chart requires other customisations such as feature engineering and external data merging.

  3. Divining economic patterns in addition to demographic trends, in order to make appropriate policy suggestions based on data. Since the data does not contain any obvious economic parameters, they will have to be inferred through feature engineering and the creation of variables such as Dependency Rates.

2.2 Proposed Solutions

  1. External data from Wikipedia is used to address the issue of demarcation. The data has been merged with the Sing stat dataset using the ‘Planning Area/PA’ column.

  2. Data manipulation has been conducted and new data frames have been created before each visualisation. This includes the creation of new variables and grouping and aggregrating data at different levels.

  3. A new variable called dependency ratio has been generated using the age group and population columns. This has been aggregated to a Region/Planning Area level.

2.3 Proposed Sketch

alt text here

alt text here

3. Step by Step Guide for Visualising the Data

3.1 Installing & Launching Packages

packages = c('dplyr', 'tidyverse', 'CGPfunctions', 'tidyr', 'gridExtra')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3.2 Importing & Cleaning the dataset

The following dataset has been sourced from Singstat. (https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data)

data <- read_csv("C:/SMU Term 3/Visual Analytics/Assignment_4/Data/respopagesextod2011to2019.csv")

head(data)
## # A tibble: 6 x 7
##   PA        SZ               AG    Sex   TOD                           Pop  Time
##   <chr>     <chr>            <chr> <chr> <chr>                       <dbl> <dbl>
## 1 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 1- and 2-Room Flats         0  2011
## 2 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 3-Room Flats               10  2011
## 3 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 4-Room Flats               30  2011
## 4 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 5-Room and Executive F~    50  2011
## 5 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HUDC Flats (excluding thos~     0  2011
## 6 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males Landed Properties               0  2011

Changing the variable names to make it easier to viewers of the visualisation to understand the charts. So PA, SZ, AG, Pop, and Time have been renamed as Planning_Area, Subzone, Age_Group, Population, and Year respectively.

names(data)[names(data) == "PA"] <- "Planning_Area"
names(data)[names(data) == "SZ"] <- "Subzone"
names(data)[names(data) == "AG"] <- "Age_Group"
names(data)[names(data) == "Pop"] <- "Population"
names(data)[names(data) == "Time"] <- "Year"
data
## # A tibble: 883,728 x 7
##    Planning_Area Subzone      Age_Group Sex    TOD              Population  Year
##    <chr>         <chr>        <chr>     <chr>  <chr>                 <dbl> <dbl>
##  1 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Males  HDB 1- and 2-Ro~          0  2011
##  2 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Males  HDB 3-Room Flats         10  2011
##  3 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Males  HDB 4-Room Flats         30  2011
##  4 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Males  HDB 5-Room and ~         50  2011
##  5 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Males  HUDC Flats (exc~          0  2011
##  6 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Males  Landed Properti~          0  2011
##  7 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Males  Condominiums an~         40  2011
##  8 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Males  Others                    0  2011
##  9 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Femal~ HDB 1- and 2-Ro~          0  2011
## 10 Ang Mo Kio    Ang Mo Kio ~ 0_to_4    Femal~ HDB 3-Room Flats         10  2011
## # ... with 883,718 more rows

3.3 Age-Sex Pyramid

Creating a new dataframe and filtering it for only the year 2019 and the columns ‘Age Group’, ‘Sex’ and ‘Population’. Additionally, population has been aggregated based on Age Group and Sex.

pyramid <- data %>%
  mutate(Year=as.character(Year)) %>%
  filter(Year=="2019") %>%
  select(-Planning_Area,-Subzone,-TOD,-Year)

pyramid$Age_Group<-str_replace(as.character(pyramid$Age_Group),"5_to_9","05_to_09")
pyramid_final <- aggregate(Population~Age_Group+Sex,data=pyramid,FUN=sum)

Plotting the chart by creating a variable for formatting and adding it to the code for the age-sex pyramid. The data source has also been added to the caption.

Formatting <- list( 
  theme_classic(),
  theme(panel.grid.major.x = element_blank()),
  theme(axis.text.x.top = element_text(size=12)),
  theme(plot.title = element_text(size=14, face = "bold", hjust = 0.5)),
  theme(plot.subtitle = element_text(hjust = 0.5)), theme(plot.caption = element_text(hjust = 0, face= "italic"), #Default is hjust=1
        plot.caption.position =  "plot")
)

asp<-ggplot(pyramid_final,aes(x = Age_Group, y = ifelse(Sex == "Males", yes = -Population, no = Population),fill = Sex))+
  geom_col()+
  coord_flip()+
  scale_y_continuous(labels = abs, limits = max(pyramid_final$Population)*c(-1,1))+   
  scale_fill_manual(values=as.vector(c("red","sky blue"))) +
  Formatting +
  labs(
    x="Age",
    y = "Population",
    title = "Singapore's Age-Sex Pyramid, 2019",
    subtitle = ,
    caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data"
  )
asp

3.4 Adding External Data

Data pertaining to the geographical regions has been sourced from Wikipedia. (https://en.wikipedia.org/wiki/Planning_Areas_of_Singapore)

planning_region = read_csv("C:/SMU Term 3/Visual Analytics/Assignment_4/Data/planning_region.csv")

names(planning_region)[names(planning_region) == "Name"] <- "Planning_Area"
planning_region
## # A tibble: 55 x 2
##    Planning_Area           Region    
##    <chr>                   <chr>     
##  1 Ang Mo Kio              North-East
##  2 Bedok                   East      
##  3 Bishan                  Central   
##  4 Boon Lay                West      
##  5 Bukit Batok             West      
##  6 Bukit Merah             Central   
##  7 Bukit Panjang           West      
##  8 Bukit Timah             Central   
##  9 Central Water Catchment North     
## 10 Changi                  East      
## # ... with 45 more rows

The planning data has then been merged with the original dataset.

my_data <- left_join(data, planning_region, by="Planning_Area")

3.5 Data Preparation & Feature Engineering

The different age groups have been aggregated further to discern meaningful trends and in order to calculate the dependency ratios. Age categories have been aggregated in accordance with the groupings by ourworldindata. (https://ourworldindata.org/age-structure)

Young <- c("0_to_4", "5_to_9", "10_to_14")
Working_Age <- c("15_to_19","20_to_24","25_to_29","30_to_34", "35_to_39","40_to_44","45_to_49", "50_to_54", "55_to_59", "60_to_64")
Elderly <- c("65_to_69", "70_to_74", "75_to_79", "80_to_84","85_to_89", "90_and_over")

These categories have been incorporated into the dataframe and a new variable named Age_Category has been created.

my_data <- my_data %>%
  mutate(Age_Category = case_when(
    Age_Group %in% Young ~ "Young",
    Age_Group %in% Working_Age ~ "Working_Age",
    Age_Group %in% Elderly ~ "Elderly")) %>%
  select(-Age_Group)

A new dataframe is created with the age categories to allow for easier data wrangling. The data has been grouped by Year, Region, Planning_Area, Age_Category and a sum of the population has been calculated.

my_data1 <- my_data %>% 
  select(c("Year","Region","Planning_Area","Age_Category","Population")) %>%
  group_by(Year,Region,Planning_Area,Age_Category) %>% 
  summarise(Population=sum(Population))

my_data1$Year <- as.character(my_data1$Year)

The following chunk of code partially transposes the age category columns. However, as can be seen, there are several null values in this new data.

dr1 <- my_data1 %>%
  mutate(i = row_number()) %>%
  spread(Age_Category, Population) %>%
  select(-i)
head(dr1)
## # A tibble: 6 x 6
## # Groups:   Year, Region, Planning_Area [2]
##   Year  Region  Planning_Area Elderly Working_Age Young
##   <chr> <chr>   <chr>           <dbl>       <dbl> <dbl>
## 1 2011  Central Bishan           9290          NA    NA
## 2 2011  Central Bishan             NA       68040    NA
## 3 2011  Central Bishan             NA          NA 13630
## 4 2011  Central Bukit Merah     24280          NA    NA
## 5 2011  Central Bukit Merah        NA      111120    NA
## 6 2011  Central Bukit Merah        NA          NA 21580

In order to address the issue of null values, they have all been replaced with 0 and subsequently aggregated on a Year, Region and Planning_Area level. The population column is recalculated and added to the dataframe.

dr1[is.na(dr1)] <- 0
dr2 <-dr1 %>%
  group_by(Year,Region,Planning_Area) %>% 
  summarise(Elderly=sum(Elderly), Working_Age=sum(Working_Age), Young=sum(Young)) %>%
  mutate(Population= Elderly + Working_Age + Young)
head(dr2)
## # A tibble: 6 x 7
## # Groups:   Year, Region [1]
##   Year  Region  Planning_Area Elderly Working_Age Young Population
##   <chr> <chr>   <chr>           <dbl>       <dbl> <dbl>      <dbl>
## 1 2011  Central Bishan           9290       68040 13630      90960
## 2 2011  Central Bukit Merah     24280      111120 21580     156980
## 3 2011  Central Bukit Timah      8010       50360 12210      70580
## 4 2011  Central Downtown Core     560        2440   330       3330
## 5 2011  Central Geylang         14870       88200 17330     120400
## 6 2011  Central Kallang         14860       73460 13980     102300

The code below is used to calculate the dependency rates across the different regions in Singapore. After grouping by Year & Region, the elderly, young and total dependency ratios have been calculated.

The formula used is as follows:

Young Dependency Rate = (Young Population/Working Age Population) * 100 Elderly Dependency Rate = (Elderly Population/Working Age Population) * 100 Total Dependency Rate = Young Dependency Rate + Elderly Dependency Rate

dr3 <- dr2 %>%
  select(c("Year","Region","Elderly","Working_Age","Young","Population")) %>%
  group_by(Year, Region) %>% 
  summarise(Elderly=sum(Elderly),Working_Age=sum(Working_Age),Young=sum(Young),Population=sum(Population)) %>%
  mutate(Elderly_Dependency_Ratio= Elderly/Working_Age) %>%
  mutate(Young_Dependency_Ratio= Young/Working_Age) %>%
  mutate(Elderly_Dependency_Ratio = round(Elderly_Dependency_Ratio*100, digits=2),
         Young_Dependency_Ratio = round(Young_Dependency_Ratio*100, digits=2)) %>% 
  mutate(Total_Dependency_Ratio= Elderly_Dependency_Ratio + Young_Dependency_Ratio) %>%
  ungroup(Year) %>%
  mutate(Year = factor(Year)) %>%
  filter(Year %in% c(2011, 2015, 2019))

head(dr3)
## # A tibble: 6 x 9
##   Year  Region Elderly Working_Age  Young Population Elderly_Depende~
##   <fct> <chr>    <dbl>       <dbl>  <dbl>      <dbl>            <dbl>
## 1 2011  Centr~  129240      672190 135280     936710            19.2 
## 2 2011  East     60200      518910 114250     693360            11.6 
## 3 2011  North    31140      380160  98050     509350             8.19
## 4 2011  North~   68670      558480 129950     757100            12.3 
## 5 2011  West     63720      673590 160320     897630             9.46
## 6 2015  Centr~  154810      654810 132020     941640            23.6 
## # ... with 2 more variables: Young_Dependency_Ratio <dbl>,
## #   Total_Dependency_Ratio <dbl>

To enable more detailed analysis of patterns within planning areas, the following chunk of code replicates the previous dataframe, with the addition of Planning_Area as one of the grouping factors.

dr4 <- dr2 %>%
  select(c("Year","Region", "Planning_Area","Elderly","Working_Age","Young","Population")) %>%
  group_by(Year, Region, Planning_Area) %>% 
  summarise(Elderly=sum(Elderly),Working_Age=sum(Working_Age),Young=sum(Young),Population=sum(Population)) %>%
  mutate(Elderly_Dependency_Ratio= Elderly/Working_Age) %>%
  mutate(Young_Dependency_Ratio= Young/Working_Age) %>%
  mutate(Elderly_Dependency_Ratio = round(Elderly_Dependency_Ratio*100, digits=2),
         Young_Dependency_Ratio = round(Young_Dependency_Ratio*100, digits=2)) %>% 
  mutate(Total_Dependency_Ratio= Elderly_Dependency_Ratio + Young_Dependency_Ratio) %>%
  ungroup(Year) %>%
  mutate(Year = factor(Year))

head(dr4)
## # A tibble: 6 x 10
##   Year  Region Planning_Area Elderly Working_Age Young Population
##   <fct> <chr>  <chr>           <dbl>       <dbl> <dbl>      <dbl>
## 1 2011  Centr~ Bishan           9290       68040 13630      90960
## 2 2011  Centr~ Bukit Merah     24280      111120 21580     156980
## 3 2011  Centr~ Bukit Timah      8010       50360 12210      70580
## 4 2011  Centr~ Downtown Core     560        2440   330       3330
## 5 2011  Centr~ Geylang         14870       88200 17330     120400
## 6 2011  Centr~ Kallang         14860       73460 13980     102300
## # ... with 3 more variables: Elderly_Dependency_Ratio <dbl>,
## #   Young_Dependency_Ratio <dbl>, Total_Dependency_Ratio <dbl>

3.6 Slope Chart

Using the newggslopegraph package to plot the total dependency rates across regions for the years 2011, 2015 and 2019.

Formatting <- list( 
   theme(plot.caption = element_text(hjust = 0, face= "italic"), #Default is hjust=1
        plot.caption.position =  "plot")
)

dr_chart1 <- newggslopegraph(dataframe = dr3,
                Times = Year,
                Measurement = Total_Dependency_Ratio,
                Grouping = Region,
                Title = "Total Dependency Rates by Region across time",
                SubTitle = "Dependency Rates in 2011, 2015 and 2019.",
                Caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data", LineThickness = 2, XTextSize = 10, YTextSize = 3, 
                WiderLabels=TRUE
                ) + Formatting

dr_chart1

In order to view the breakup between the dependency rates between the elderly and young population, the above chart has been replicated for both age categories.

y_ch<- newggslopegraph(dataframe = dr3,
                Times = Year,
                Measurement = Young_Dependency_Ratio,
                Grouping = Region,
                Title = "Young Dependency Rates by Region across time",
                SubTitle = "Dependency Rates in 2011, 2015 and 2019.",
                Caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data", LineThickness = 2, XTextSize = 10, YTextSize = 3, 
                WiderLabels=TRUE
                ) + Formatting
e_ch <- newggslopegraph(dataframe = dr3,
                Times = Year,
                Measurement = Elderly_Dependency_Ratio,
                Grouping = Region,
                Title = "Elderly Dependency Rates by Region across time",
                SubTitle = "Dependency Rates in 2011, 2015 and 2019.",
                Caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data", LineThickness = 2, XTextSize = 10, YTextSize = 3,     
                WiderLabels=TRUE
                ) + Formatting
grid.arrange(y_ch,e_ch,ncol =2)

Finally, the slope chart has also been replicated to plot the population of Singapore across time and regions.

pop <- newggslopegraph(dataframe = dr3,
                Times = Year,
                Measurement = Population,
                Grouping = Region,
                Title = "Total Population by Region across time",
                SubTitle = "Population in 2011, 2015 and 2019.",
                Caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data", LineThickness = 2, XTextSize = 10, YTextSize = 3, 
                WiderLabels=TRUE
                ) + Formatting

pop

4. Final Visualisation

asp

dr_chart1

grid.arrange(y_ch,e_ch,ncol =2)

pop

5. Insights & Implications

  1. From the Age-Sex Pyramid, it appears that the majority of Singapore’s poplation in 2019 is older than 19 years of age and younger than 65 years. This is a positive indication in terms of the current earning potential and productivity of the population and the economy as a whole. However, the population below 19 is lower for each age bracket and brings into question the future earning potential and productivity from a human capital standpoint.The distribution by sex seems relatively similar.

  2. By looking at the dependency rate slope charts, one can discern that the dependency rate has been increasing steadily over time. The rate is also the highest in the Central region and lowest in the Northern region as of 2019. A possible explaination for this is the rent disparity between the two areas. Due to higher rents prevailing in the central region, it is likely that only individuals who have ammassed a certain amount of wealth can afford it. This logic is also substantiated by the second slope chart where it can be seen that the higher dependency rate in the central region is driven primarily by the elderly. However, the increasing dependency rates overall point towards an aging population.

  3. The slope chart pertaining to population shows that after 2015, there have been shifts in population density across the different regions. Population has decreased slightly in the central region and increased considerably in the North-East region. Simultaneously, population has increased slightly in the Western region and decreased slightly in the Eastern region. Population has also been increasing in the Northern region. ```