The purpose of this visualisation is to explore the demographic patterns in Singapore from 2011 to 2019. This has been done through visualisations such as an age-sex pyramid and slope charts to display the various relationships between regions/planning areas, population, and gender.
The first part of this analysis includes an age-sex pyramid that provides a snapshot of the 2019 Singapore population segmented by males and females. The second part of the analysis explore the relationship between region/planning area and the different population parameters using the slope chart.
The data for this assignment has been sourced from Singstat and Wikipedia.
The raw data extracted from the Singstat website had not been demarcated by geographical regions (such as north, south, east, west) or any other simplifying characteristic. Such demarcation is required because it makes patterns within the data a lot easier to discern. ‘Planning Region’ was included in the original dataset but contained far too many variables to visualise effectively.
Since every visualisation requires the data to be structured in a specific manner, data manipulation is required every time. For instance, the age-sex pyramid requires the population to be aggregated by age and sex, whereas the slope chart requires other customisations such as feature engineering and external data merging.
Divining economic patterns in addition to demographic trends, in order to make appropriate policy suggestions based on data. Since the data does not contain any obvious economic parameters, they will have to be inferred through feature engineering and the creation of variables such as Dependency Rates.
External data from Wikipedia is used to address the issue of demarcation. The data has been merged with the Sing stat dataset using the ‘Planning Area/PA’ column.
Data manipulation has been conducted and new data frames have been created before each visualisation. This includes the creation of new variables and grouping and aggregrating data at different levels.
A new variable called dependency ratio has been generated using the age group and population columns. This has been aggregated to a Region/Planning Area level.
alt text here
alt text here
packages = c('dplyr', 'tidyverse', 'CGPfunctions', 'tidyr', 'gridExtra')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The following dataset has been sourced from Singstat. (https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data)
data <- read_csv("C:/SMU Term 3/Visual Analytics/Assignment_4/Data/respopagesextod2011to2019.csv")
head(data)
## # A tibble: 6 x 7
## PA SZ AG Sex TOD Pop Time
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 1- and 2-Room Flats 0 2011
## 2 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 3-Room Flats 10 2011
## 3 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 4-Room Flats 30 2011
## 4 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 5-Room and Executive F~ 50 2011
## 5 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HUDC Flats (excluding thos~ 0 2011
## 6 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males Landed Properties 0 2011
Changing the variable names to make it easier to viewers of the visualisation to understand the charts. So PA, SZ, AG, Pop, and Time have been renamed as Planning_Area, Subzone, Age_Group, Population, and Year respectively.
names(data)[names(data) == "PA"] <- "Planning_Area"
names(data)[names(data) == "SZ"] <- "Subzone"
names(data)[names(data) == "AG"] <- "Age_Group"
names(data)[names(data) == "Pop"] <- "Population"
names(data)[names(data) == "Time"] <- "Year"
data
## # A tibble: 883,728 x 7
## Planning_Area Subzone Age_Group Sex TOD Population Year
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Males HDB 1- and 2-Ro~ 0 2011
## 2 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Males HDB 3-Room Flats 10 2011
## 3 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Males HDB 4-Room Flats 30 2011
## 4 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Males HDB 5-Room and ~ 50 2011
## 5 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Males HUDC Flats (exc~ 0 2011
## 6 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Males Landed Properti~ 0 2011
## 7 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Males Condominiums an~ 40 2011
## 8 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Males Others 0 2011
## 9 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Femal~ HDB 1- and 2-Ro~ 0 2011
## 10 Ang Mo Kio Ang Mo Kio ~ 0_to_4 Femal~ HDB 3-Room Flats 10 2011
## # ... with 883,718 more rows
Creating a new dataframe and filtering it for only the year 2019 and the columns ‘Age Group’, ‘Sex’ and ‘Population’. Additionally, population has been aggregated based on Age Group and Sex.
pyramid <- data %>%
mutate(Year=as.character(Year)) %>%
filter(Year=="2019") %>%
select(-Planning_Area,-Subzone,-TOD,-Year)
pyramid$Age_Group<-str_replace(as.character(pyramid$Age_Group),"5_to_9","05_to_09")
pyramid_final <- aggregate(Population~Age_Group+Sex,data=pyramid,FUN=sum)
Plotting the chart by creating a variable for formatting and adding it to the code for the age-sex pyramid. The data source has also been added to the caption.
Formatting <- list(
theme_classic(),
theme(panel.grid.major.x = element_blank()),
theme(axis.text.x.top = element_text(size=12)),
theme(plot.title = element_text(size=14, face = "bold", hjust = 0.5)),
theme(plot.subtitle = element_text(hjust = 0.5)), theme(plot.caption = element_text(hjust = 0, face= "italic"), #Default is hjust=1
plot.caption.position = "plot")
)
asp<-ggplot(pyramid_final,aes(x = Age_Group, y = ifelse(Sex == "Males", yes = -Population, no = Population),fill = Sex))+
geom_col()+
coord_flip()+
scale_y_continuous(labels = abs, limits = max(pyramid_final$Population)*c(-1,1))+
scale_fill_manual(values=as.vector(c("red","sky blue"))) +
Formatting +
labs(
x="Age",
y = "Population",
title = "Singapore's Age-Sex Pyramid, 2019",
subtitle = ,
caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data"
)
asp
Data pertaining to the geographical regions has been sourced from Wikipedia. (https://en.wikipedia.org/wiki/Planning_Areas_of_Singapore)
planning_region = read_csv("C:/SMU Term 3/Visual Analytics/Assignment_4/Data/planning_region.csv")
names(planning_region)[names(planning_region) == "Name"] <- "Planning_Area"
planning_region
## # A tibble: 55 x 2
## Planning_Area Region
## <chr> <chr>
## 1 Ang Mo Kio North-East
## 2 Bedok East
## 3 Bishan Central
## 4 Boon Lay West
## 5 Bukit Batok West
## 6 Bukit Merah Central
## 7 Bukit Panjang West
## 8 Bukit Timah Central
## 9 Central Water Catchment North
## 10 Changi East
## # ... with 45 more rows
The planning data has then been merged with the original dataset.
my_data <- left_join(data, planning_region, by="Planning_Area")
The different age groups have been aggregated further to discern meaningful trends and in order to calculate the dependency ratios. Age categories have been aggregated in accordance with the groupings by ourworldindata. (https://ourworldindata.org/age-structure)
Young <- c("0_to_4", "5_to_9", "10_to_14")
Working_Age <- c("15_to_19","20_to_24","25_to_29","30_to_34", "35_to_39","40_to_44","45_to_49", "50_to_54", "55_to_59", "60_to_64")
Elderly <- c("65_to_69", "70_to_74", "75_to_79", "80_to_84","85_to_89", "90_and_over")
These categories have been incorporated into the dataframe and a new variable named Age_Category has been created.
my_data <- my_data %>%
mutate(Age_Category = case_when(
Age_Group %in% Young ~ "Young",
Age_Group %in% Working_Age ~ "Working_Age",
Age_Group %in% Elderly ~ "Elderly")) %>%
select(-Age_Group)
A new dataframe is created with the age categories to allow for easier data wrangling. The data has been grouped by Year, Region, Planning_Area, Age_Category and a sum of the population has been calculated.
my_data1 <- my_data %>%
select(c("Year","Region","Planning_Area","Age_Category","Population")) %>%
group_by(Year,Region,Planning_Area,Age_Category) %>%
summarise(Population=sum(Population))
my_data1$Year <- as.character(my_data1$Year)
The following chunk of code partially transposes the age category columns. However, as can be seen, there are several null values in this new data.
dr1 <- my_data1 %>%
mutate(i = row_number()) %>%
spread(Age_Category, Population) %>%
select(-i)
head(dr1)
## # A tibble: 6 x 6
## # Groups: Year, Region, Planning_Area [2]
## Year Region Planning_Area Elderly Working_Age Young
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2011 Central Bishan 9290 NA NA
## 2 2011 Central Bishan NA 68040 NA
## 3 2011 Central Bishan NA NA 13630
## 4 2011 Central Bukit Merah 24280 NA NA
## 5 2011 Central Bukit Merah NA 111120 NA
## 6 2011 Central Bukit Merah NA NA 21580
In order to address the issue of null values, they have all been replaced with 0 and subsequently aggregated on a Year, Region and Planning_Area level. The population column is recalculated and added to the dataframe.
dr1[is.na(dr1)] <- 0
dr2 <-dr1 %>%
group_by(Year,Region,Planning_Area) %>%
summarise(Elderly=sum(Elderly), Working_Age=sum(Working_Age), Young=sum(Young)) %>%
mutate(Population= Elderly + Working_Age + Young)
head(dr2)
## # A tibble: 6 x 7
## # Groups: Year, Region [1]
## Year Region Planning_Area Elderly Working_Age Young Population
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2011 Central Bishan 9290 68040 13630 90960
## 2 2011 Central Bukit Merah 24280 111120 21580 156980
## 3 2011 Central Bukit Timah 8010 50360 12210 70580
## 4 2011 Central Downtown Core 560 2440 330 3330
## 5 2011 Central Geylang 14870 88200 17330 120400
## 6 2011 Central Kallang 14860 73460 13980 102300
The code below is used to calculate the dependency rates across the different regions in Singapore. After grouping by Year & Region, the elderly, young and total dependency ratios have been calculated.
The formula used is as follows:
Young Dependency Rate = (Young Population/Working Age Population) * 100 Elderly Dependency Rate = (Elderly Population/Working Age Population) * 100 Total Dependency Rate = Young Dependency Rate + Elderly Dependency Rate
dr3 <- dr2 %>%
select(c("Year","Region","Elderly","Working_Age","Young","Population")) %>%
group_by(Year, Region) %>%
summarise(Elderly=sum(Elderly),Working_Age=sum(Working_Age),Young=sum(Young),Population=sum(Population)) %>%
mutate(Elderly_Dependency_Ratio= Elderly/Working_Age) %>%
mutate(Young_Dependency_Ratio= Young/Working_Age) %>%
mutate(Elderly_Dependency_Ratio = round(Elderly_Dependency_Ratio*100, digits=2),
Young_Dependency_Ratio = round(Young_Dependency_Ratio*100, digits=2)) %>%
mutate(Total_Dependency_Ratio= Elderly_Dependency_Ratio + Young_Dependency_Ratio) %>%
ungroup(Year) %>%
mutate(Year = factor(Year)) %>%
filter(Year %in% c(2011, 2015, 2019))
head(dr3)
## # A tibble: 6 x 9
## Year Region Elderly Working_Age Young Population Elderly_Depende~
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2011 Centr~ 129240 672190 135280 936710 19.2
## 2 2011 East 60200 518910 114250 693360 11.6
## 3 2011 North 31140 380160 98050 509350 8.19
## 4 2011 North~ 68670 558480 129950 757100 12.3
## 5 2011 West 63720 673590 160320 897630 9.46
## 6 2015 Centr~ 154810 654810 132020 941640 23.6
## # ... with 2 more variables: Young_Dependency_Ratio <dbl>,
## # Total_Dependency_Ratio <dbl>
To enable more detailed analysis of patterns within planning areas, the following chunk of code replicates the previous dataframe, with the addition of Planning_Area as one of the grouping factors.
dr4 <- dr2 %>%
select(c("Year","Region", "Planning_Area","Elderly","Working_Age","Young","Population")) %>%
group_by(Year, Region, Planning_Area) %>%
summarise(Elderly=sum(Elderly),Working_Age=sum(Working_Age),Young=sum(Young),Population=sum(Population)) %>%
mutate(Elderly_Dependency_Ratio= Elderly/Working_Age) %>%
mutate(Young_Dependency_Ratio= Young/Working_Age) %>%
mutate(Elderly_Dependency_Ratio = round(Elderly_Dependency_Ratio*100, digits=2),
Young_Dependency_Ratio = round(Young_Dependency_Ratio*100, digits=2)) %>%
mutate(Total_Dependency_Ratio= Elderly_Dependency_Ratio + Young_Dependency_Ratio) %>%
ungroup(Year) %>%
mutate(Year = factor(Year))
head(dr4)
## # A tibble: 6 x 10
## Year Region Planning_Area Elderly Working_Age Young Population
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2011 Centr~ Bishan 9290 68040 13630 90960
## 2 2011 Centr~ Bukit Merah 24280 111120 21580 156980
## 3 2011 Centr~ Bukit Timah 8010 50360 12210 70580
## 4 2011 Centr~ Downtown Core 560 2440 330 3330
## 5 2011 Centr~ Geylang 14870 88200 17330 120400
## 6 2011 Centr~ Kallang 14860 73460 13980 102300
## # ... with 3 more variables: Elderly_Dependency_Ratio <dbl>,
## # Young_Dependency_Ratio <dbl>, Total_Dependency_Ratio <dbl>
Using the newggslopegraph package to plot the total dependency rates across regions for the years 2011, 2015 and 2019.
Formatting <- list(
theme(plot.caption = element_text(hjust = 0, face= "italic"), #Default is hjust=1
plot.caption.position = "plot")
)
dr_chart1 <- newggslopegraph(dataframe = dr3,
Times = Year,
Measurement = Total_Dependency_Ratio,
Grouping = Region,
Title = "Total Dependency Rates by Region across time",
SubTitle = "Dependency Rates in 2011, 2015 and 2019.",
Caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data", LineThickness = 2, XTextSize = 10, YTextSize = 3,
WiderLabels=TRUE
) + Formatting
dr_chart1
In order to view the breakup between the dependency rates between the elderly and young population, the above chart has been replicated for both age categories.
y_ch<- newggslopegraph(dataframe = dr3,
Times = Year,
Measurement = Young_Dependency_Ratio,
Grouping = Region,
Title = "Young Dependency Rates by Region across time",
SubTitle = "Dependency Rates in 2011, 2015 and 2019.",
Caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data", LineThickness = 2, XTextSize = 10, YTextSize = 3,
WiderLabels=TRUE
) + Formatting
e_ch <- newggslopegraph(dataframe = dr3,
Times = Year,
Measurement = Elderly_Dependency_Ratio,
Grouping = Region,
Title = "Elderly Dependency Rates by Region across time",
SubTitle = "Dependency Rates in 2011, 2015 and 2019.",
Caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data", LineThickness = 2, XTextSize = 10, YTextSize = 3,
WiderLabels=TRUE
) + Formatting
grid.arrange(y_ch,e_ch,ncol =2)
Finally, the slope chart has also been replicated to plot the population of Singapore across time and regions.
pop <- newggslopegraph(dataframe = dr3,
Times = Year,
Measurement = Population,
Grouping = Region,
Title = "Total Population by Region across time",
SubTitle = "Population in 2011, 2015 and 2019.",
Caption = "Data Source: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data", LineThickness = 2, XTextSize = 10, YTextSize = 3,
WiderLabels=TRUE
) + Formatting
pop
asp
dr_chart1
grid.arrange(y_ch,e_ch,ncol =2)
pop
From the Age-Sex Pyramid, it appears that the majority of Singapore’s poplation in 2019 is older than 19 years of age and younger than 65 years. This is a positive indication in terms of the current earning potential and productivity of the population and the economy as a whole. However, the population below 19 is lower for each age bracket and brings into question the future earning potential and productivity from a human capital standpoint.The distribution by sex seems relatively similar.
By looking at the dependency rate slope charts, one can discern that the dependency rate has been increasing steadily over time. The rate is also the highest in the Central region and lowest in the Northern region as of 2019. A possible explaination for this is the rent disparity between the two areas. Due to higher rents prevailing in the central region, it is likely that only individuals who have ammassed a certain amount of wealth can afford it. This logic is also substantiated by the second slope chart where it can be seen that the higher dependency rate in the central region is driven primarily by the elderly. However, the increasing dependency rates overall point towards an aging population.
The slope chart pertaining to population shows that after 2015, there have been shifts in population density across the different regions. Population has decreased slightly in the central region and increased considerably in the North-East region. Simultaneously, population has increased slightly in the Western region and decreased slightly in the Eastern region. Population has also been increasing in the Northern region. ```