This static visualization aims to uncover the demographic structure of the Singapore population by age cohort (e.g., 0-4, 5-9, ……) and by planning area in 2019. The Source of the data is from Singapore Department of Statistics. (https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data)
The dataset contains Singapore residents population by age group, gender, area, and type of dwelling from the year 2011 to 2019.
For this makeover assignment, I will only be limiting the data to the year 2019 and have filtered the data in excel.
The main focus of the this makeover challenge is to visualise and uncover patterns regarding the planning area and type of dwelling across the different age groups.
In this makeover assignment, I will first categorise the age into different age groups so that insights and analysis can be more easily done.
According to Index Mundi, the age structure of Singapore is as follows:
0-14 years (children), 15-24 years (early working age), 25-54 years (prime working age), 55-64 years (mature working age), 65 years and over (elderly).
The first design challenge is to have an overview of the population distribution and demographic structure in the different planning area in 2019 in a chart. To do so, a stacked bar charted, stacked based on the age groups defined and sorted based on the total population in the planning area in a descending order and a heatmap is used.
To visualise the top planning areas for the different age groups, a bar chart showing the top 10 planning areas calculated based on the population percentage is used for each age group. Another stack bar chart is used to visualise the distribution of the type of dwellings in the 10 locations. This will allow us to find any patterns between the type of dwelling and the population distribution.
The last design challenge is to ientify the population structure by age group combined with Gender in 2019.
knitr::opts_chunk$set(echo = TRUE)
packages = c( 'dplyr', 'tidyverse', 'heatmaply', 'gridExtra', "RColorBrewer")
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
After installing and loading necessary libraries, we can load the dataset, then ascertain the dataframe dimensions, variable datatypes and number of missing values.
pop_data <- read_csv("data/respopagesextod2019.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## PA = col_character(),
## SZ = col_character(),
## AG = col_character(),
## Sex = col_character(),
## TOD = col_character(),
## Pop = col_double(),
## Time = col_double()
## )
print(paste('DataFrame contains ', dim(pop_data)[1], 'rows and ', dim(pop_data)[2], 'columns.'))
## [1] "DataFrame contains 162111 rows and 7 columns."
In this dataset, there are 7 variables which are shown in each column.
The column names refers to the following:
| Column Names | Refers to |
|---|---|
| PA | Planning Area |
| SZ | Subzone |
| AG | Age Group |
| Sex | Sex |
| TOD | Type of Dwelling |
| Pop | Resident Count |
| Time | Time / Period |
| SZ | Content Cell |
| Age Group | Age |
|---|---|
| children | 0-14 years |
| early working age | 15-24 years |
| prime working age | 25-54 years |
| mature working age | 55-64 years |
| elderly | 65 years and over |
Adding a new column called “Group”
pop_data <- pop_data %>% mutate(Group =
case_when((AG == "0_to_4"| AG == "5_to_9"| AG == "10_to_14") ~ "Children",
( AG == "15_to_19"| AG == "20_to_24") ~ "Early Working Age",
( AG == "25_to_29"| AG == "30_to_34"| AG == "35_to_39"| AG == "40_to_44"| AG == "45_to_49"| AG == "50_to_54") ~ "Prime Working Age",
( AG == "55_to_59"| AG == "60_to_64") ~ "Mature Working Age",
( AG == "65_to_69"| AG == "70_to_74"| AG == "75_to_79"| AG == "80_to_84"| AG == "85_to_89"| AG == "90_and_over") ~ "Elderly",
TRUE ~ ""))
Aggregate data by Planning Area for each Group and adding of a new column “Population Percent”.
pop_by_pa <- pop_data %>% group_by(PA) %>% summarise(Pop=sum(Pop))
pa_list <- pop_by_pa[order(pop_by_pa$Pop, decreasing = TRUE),]$PA
pop_by_age_pa <- pop_data %>% group_by(PA, Group) %>% summarise(Pop=sum(Pop))
## `summarise()` has grouped output by 'PA'. You can override using the `.groups` argument.
df_by_pa <- group_by(.data = pop_by_age_pa, PA)
df_summarize2 <- mutate(.data = df_by_pa, Population_percent = Pop/sum(Pop))
print(df_summarize2)
## # A tibble: 276 x 4
## # Groups: PA [56]
## PA Group Pop Population_percent
## <chr> <chr> <dbl> <dbl>
## 1 Ang Mo Kio Children 19030 0.116
## 2 Ang Mo Kio Early Working Age 16850 0.102
## 3 Ang Mo Kio Elderly 34080 0.207
## 4 Ang Mo Kio Mature Working Age 25900 0.158
## 5 Ang Mo Kio Prime Working Age 68570 0.417
## 6 Bedok Children 34960 0.125
## 7 Bedok Early Working Age 31300 0.112
## 8 Bedok Elderly 51320 0.183
## 9 Bedok Mature Working Age 44380 0.159
## 10 Bedok Prime Working Age 118010 0.422
## # ... with 266 more rows
The top 30 planning area covers more than 99% of population in Singapore hence I have limit the Bar Chart to the top 30. This will also reduce the amount of noise in the stacked bar chart. Next, i fixed the stack order based on Age.
top30 <- subset(df_summarize2, PA %in% pa_list[1:30])
top30_orderedstack <- top30 %>%
mutate (Group=factor(Group, levels = c("Elderly", "Mature Working Age", "Prime Working Age", "Early Working Age", "Children" ), ordered = TRUE))
ggplot(top30_orderedstack, aes(x = reorder(PA, Pop), y = Pop, fill = Group)) +
geom_bar(stat = "identity") +
geom_smooth(method="lm") +
coord_flip() +
scale_fill_brewer(palette = "RdYlGn")
## `geom_smooth()` using formula 'y ~ x'
First, the dataset is filtered to the top 30 planning areas, with the population percent to be more than 0%.
Next, the order of the columns of the age group is fixed according to their respective age.
Normalisation is done to the dataset according to the age groups
Lastly, the package Heatmaply is used to plot the Heatmap.
Normalisation is required to create the Heatmap as the around half of the population fall within the Prime Working Age Group and the difference in the distribution of the population may not be as visible without the normalisation.
df2 <- top30%>%
group_by(PA, Group)%>%
summarise(Population_percent=sum(Population_percent))%>%
filter(Population_percent>'0')
wide <- df2 %>%
spread(Group, Population_percent)
col.order <- c("Children","Early Working Age","Prime Working Age","Mature Working Age","Elderly")
wide2<-wide[,col.order]
row.names(wide2) <- wide$PA
df2_matrix <- data.matrix(wide2)
heatmaply(normalize(df2_matrix),
colors = RdYlGn, Colv=NA
)
Adding a new column called “location_p”
pop_by_pa <- pop_data %>% group_by(PA) %>% summarise(Pop=sum(Pop))
pa_list <- pop_by_pa[order(pop_by_pa$Pop, decreasing = TRUE),]$PA
pop_by_age_pa <- pop_data %>% group_by(PA, TOD) %>% summarise(Pop=sum(Pop))
## `summarise()` has grouped output by 'PA'. You can override using the `.groups` argument.
df_by_pa <- group_by(.data = pop_by_age_pa, PA)
df_location_summarize2 <- mutate(.data = df_by_pa, location_percent = Pop/sum(Pop))
print(df_location_summarize2)
## # A tibble: 441 x 4
## # Groups: PA [56]
## PA TOD Pop location_percent
## <chr> <chr> <dbl> <dbl>
## 1 Ang Mo Kio Condominiums and Other Apartments 13880 0.0844
## 2 Ang Mo Kio HDB 1- and 2-Room Flats 11230 0.0683
## 3 Ang Mo Kio HDB 3-Room Flats 56830 0.346
## 4 Ang Mo Kio HDB 4-Room Flats 43550 0.265
## 5 Ang Mo Kio HDB 5-Room and Executive Flats 21650 0.132
## 6 Ang Mo Kio HUDC Flats (excluding those privatised) 0 0
## 7 Ang Mo Kio Landed Properties 16070 0.0977
## 8 Ang Mo Kio Others 1220 0.00742
## 9 Bedok Condominiums and Other Apartments 54430 0.194
## 10 Bedok HDB 1- and 2-Room Flats 10120 0.0361
## # ... with 431 more rows
A Bar Chart depicting the 10 Planning Areas with the highest number of children
A Stacked distribution bar chart of the dwelling types in the 10 locations
The dataset is first filtered to the Top 30 Planning Areas similar to the previous sections, then filtered to the group ‘Children’.
The dataset is then sorted in a descending order based on the population percentage of each Planning Area.
Lastly, the top 10 planning areas is taken to make the 2 charts.
top_children <- subset(df_summarize2, PA %in% pa_list[1:30])
top_children <- top_children %>%
select(PA,Group,Pop,Population_percent) %>%
filter( Group == "Children")
top_children_ordered <-top_children[order(top_children$Population_percent, decreasing = TRUE),]
children_list <- top_children[order(top_children$Population_percent, decreasing = TRUE),]$PA
top10_children <- subset(top_children_ordered, PA %in% children_list[1:10])
children_bar<-ggplot(data = top10_children, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
geom_bar(stat = "sum", alpha = 0.7) +
ggtitle("Distribution of Children in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Children (0-14 years)')+
xlab("Group")+
ylab("Population Proportion")+
geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none")+
coord_flip()
df_by_loc_children <- subset(df_location_summarize2, PA %in% children_list[1:10])
df_by_loc_children <- df_by_loc_children %>%
mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties", "Others" ), ordered = TRUE))
children_location_stacked <- ggplot(df_by_loc_children, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
geom_bar(stat = "identity", position = "fill") +
coord_flip()+
scale_fill_brewer(palette = "Blues")
children_bar
children_location_stacked
From the charts, we can deduce that Punggol has a significantly larger population of children as compared to the other Planning Areas in Singapore. Also, the top 10 planning area have a population percentage of children higher than 15%.
Next, the top 2 PLanning Areas with highest percentage of children, Punggol and Sengkang, have a low percentage of 1-Room, 2-Room and 3-Room HDB Flats. Hence, for Singaporeans who are parents or soon-to-be parents, housing size plays a significant factor in their purchasing decision.
A Bar Chart depicting the 10 Planning Areas with the highest number of Early Working Age
A Stacked distribution bar chart of the dwelling types in the 10 locations
The dataset for the Early Working Age Group is prepared in the same way as the dataset used for the Children age group.
top_Early_Working_Age <- subset(df_summarize2, PA %in% pa_list[1:30])
top_Early_Working_Age <- top_Early_Working_Age %>%
select(PA,Group,Pop,Population_percent) %>%
filter( Group == "Early Working Age")
top_Early_Working_Age_ordered <-top_Early_Working_Age[order(top_Early_Working_Age$Population_percent, decreasing = TRUE),]
Early_Working_Age_list <- top_Early_Working_Age[order(top_Early_Working_Age$Population_percent, decreasing = TRUE),]$PA
top10_Early_Working_Age <- subset(top_Early_Working_Age_ordered, PA %in% Early_Working_Age_list[1:10])
early_bar<-ggplot(data = top10_Early_Working_Age, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
geom_bar(stat = "sum", alpha = 0.7) +
ggtitle("Distribution of Early Working Age Group in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Early Working Age Group (15-24 years)')+
xlab("Group")+
ylab("Population Proportion")+
geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none")+
coord_flip()
df_by_loc_early <- subset(df_location_summarize2, PA %in% Early_Working_Age_list[1:10])
df_by_loc_early <- df_by_loc_early %>%
mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties", "Others" ), ordered = TRUE))
early_location_stacked <- ggplot(df_by_loc_early, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
geom_bar(stat = "identity", position = "fill") +
coord_flip()+
scale_fill_brewer(palette = "Blues")
early_bar
early_location_stacked
The Planning Areas with highest population percentage of early working age group are Paris Ris, Woodlands and Chua Chu Kang.
With the exception of Bukit Timah and Serangoon, a common observation across these 10 Planning Areas are the lower percentage of Condominiums and Landed Properties.
A Bar Chart depicting the 10 Planning Areas with the highest number of Prime Working Age
A Stacked distribution bar chart of the dwelling types in the 10 locations
The dataset is also prepared in the same way as the Children dataset.
top_Prime_Working_Age <- subset(df_summarize2, PA %in% pa_list[1:30])
top_Prime_Working_Age <- top_Prime_Working_Age %>%
select(PA,Group,Pop,Population_percent) %>%
filter( Group == "Prime Working Age")
top_Prime_Working_Age_ordered <-top_Prime_Working_Age[order(top_Prime_Working_Age$Population_percent, decreasing = TRUE),]
Prime_Working_Age_list <- top_Prime_Working_Age[order(top_Prime_Working_Age$Population_percent, decreasing = TRUE),]$PA
top10_Prime_Working_Age <- subset(top_Prime_Working_Age_ordered, PA %in% Prime_Working_Age_list[1:10])
prime_bar<-ggplot(data = top10_Prime_Working_Age, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
geom_bar(stat = "sum", alpha = 0.7) +
ggtitle("Distribution of Prime Working Age Group in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Prime Working Age Group (25-54 years)')+
xlab("Group")+
ylab("Population Proportion")+
geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none")+
coord_flip()
df_by_loc_prime <- subset(df_location_summarize2, PA %in% Prime_Working_Age_list[1:10])
df_by_loc_prime <- df_by_loc_prime %>%
mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties", "Others" ), ordered = TRUE))
prime_location_stacked <- ggplot(df_by_loc_prime, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
geom_bar(stat = "identity", position = "fill") +
coord_flip()+
scale_fill_brewer(palette = "Blues")
prime_bar
prime_location_stacked
Punggol and River Valley are the top 2 Planning Areas where half of the population consists of Prime Working Age Group. From the stacked bar chart for the type of dwelling, with the exception of Tanglin and River Valley, most of them live in 4-Room and 5-Room HDB Flats.
A Bar Chart depicting the 10 Planning Areas with the highest number of Mature Working Age
A Stacked distribution bar chart of the dwelling types in the 10 locations
The dataset is also prepared in the same way as the Children dataset.
top_Mature_Working_Age <- subset(df_summarize2, PA %in% pa_list[1:30])
top_Mature_Working_Age <- top_Mature_Working_Age %>%
select(PA,Group,Pop,Population_percent) %>%
filter( Group == "Mature Working Age")
top_Mature_Working_Age_ordered <-top_Mature_Working_Age[order(top_Mature_Working_Age$Population_percent, decreasing = TRUE),]
Mature_Working_Age_list <- top_Mature_Working_Age[order(top_Mature_Working_Age$Population_percent, decreasing = TRUE),]$PA
top10_Mature_Working_Age <- subset(top_Mature_Working_Age_ordered, PA %in% Mature_Working_Age_list[1:10])
mature_bar<-ggplot(data = top10_Mature_Working_Age, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
geom_bar(stat = "sum", alpha = 0.7) +
ggtitle("Distribution of Mature Working Age Group in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Mature Working Age Group (55-64 years)')+
xlab("Group")+
ylab("Population Proportion")+
geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none")+
coord_flip()
df_by_loc_mature <- subset(df_location_summarize2, PA %in% Mature_Working_Age_list[1:10])
df_by_loc_mature <- df_by_loc_mature %>%
mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties", "Others" ), ordered = TRUE))
mature_location_stacked <- ggplot(df_by_loc_mature, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
geom_bar(stat = "identity", position = "fill") +
coord_flip()+
scale_fill_brewer(palette = "Blues")
mature_bar
mature_location_stacked
The top 3 planning areas with the highest percentage of Mature Working Age Group are Serangoon, Tampines and Hougang. A common observation that can be seen from the stacked bar chart across these 10 planning areas is the overall higher percentage of them living in Condominiums and Landed Properties as compared to the other Age Group.
A Bar Chart depicting the 10 Planning Areas with the highest number of Elderly
A Stacked distribution bar chart of the dwelling types in the 10 locations
The dataset is also prepared in the same way as the Children dataset.
top_Elderly <- subset(df_summarize2, PA %in% pa_list[1:30])
top_Elderly <- top_Elderly %>%
select(PA,Group,Pop,Population_percent) %>%
filter( Group == "Elderly")
top_Elderly_ordered <-top_Elderly[order(top_Elderly$Population_percent, decreasing = TRUE),]
Elderly_list <- top_Elderly[order(top_Elderly$Population_percent, decreasing = TRUE),]$PA
top10_Elderly <- subset(top_Elderly_ordered, PA %in% Elderly_list[1:10])
elderly_bar<-ggplot(data = top10_Elderly, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
geom_bar(stat = "sum", alpha = 0.7) +
ggtitle("Distribution of Elderly in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Elderly (65 years and over)')+
xlab("Group")+
ylab("Population Proportion")+
geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none")+
coord_flip()
df_by_loc_Elderly <- subset(df_location_summarize2, PA %in% Elderly_list[1:10])
df_by_loc_Elderly <- df_by_loc_Elderly %>%
mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties", "Others" ), ordered = TRUE))
elderly_location_stacked <- ggplot(df_by_loc_Elderly, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
geom_bar(stat = "identity", position = "fill") +
coord_flip()+
scale_fill_brewer(palette = "Blues")
elderly_bar
elderly_location_stacked
Outram is top location with the highest percentage of Elderly. One observation that can be seen in the stacked bar chart is the high percentage of elderly living in 1-Room, 2-Room and 3-Room HDB Flats.
Rename the age for 0-4 and 5-9 years old so that the order in the population pyramid is sorted correctly.
The age group 0_to_4 and 5_to_9 is first modified so that all the age groups can be ordered. The females dataset is then separated and flipped and the population pyramid is created using the library ggplot.
pop_data$AG <- str_replace(as.factor(pop_data$AG),"5_to_9","05_to_09")
pop_data$AG <- str_replace(as.factor(pop_data$AG),"0_to_4","0_to_04")
pop_data$Pop <- ifelse(pop_data$Sex == "Females",-1*pop_data$Pop, pop_data$Pop)
age_cohort <- ggplot(pop_data, aes(x = AG, y = Pop, fill = Sex))+
geom_bar(data = subset(pop_data, Sex == "Females"), stat = "identity")+
geom_bar(data = subset(pop_data, Sex == "Males"), stat = "identity") +
scale_y_continuous(breaks = seq(-150000, 150000, 50000),
labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))))) +
coord_flip()
age_cohort+
ggtitle("Singapore Residents Pyramid by Age Cohort, 2019")+
xlab("Age group")+
ylab("Population in thousands")+
scale_fill_brewer(palette = "Set1")+
theme_minimal()+
theme(legend.position='right')
From the population pyramid, we can observe that Singapore faces an aging population and a similar proportion of Males and Females.