IS428 - Assignment 4

1.Overview

This static visualization aims to uncover the demographic structure of the Singapore population by age cohort (e.g., 0-4, 5-9, ……) and by planning area in 2019. The Source of the data is from Singapore Department of Statistics. (https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data)

The dataset contains Singapore residents population by age group, gender, area, and type of dwelling from the year 2011 to 2019.

For this makeover assignment, I will only be limiting the data to the year 2019 and have filtered the data in excel.

1.1 Design Challenges

The main focus of the this makeover challenge is to visualise and uncover patterns regarding the planning area and type of dwelling across the different age groups.

In this makeover assignment, I will first categorise the age into different age groups so that insights and analysis can be more easily done.

According to Index Mundi, the age structure of Singapore is as follows:
0-14 years (children), 15-24 years (early working age), 25-54 years (prime working age), 55-64 years (mature working age), 65 years and over (elderly).

The first design challenge is to have an overview of the population distribution and demographic structure in the different planning area in 2019 in a chart. To do so, a stacked bar charted, stacked based on the age groups defined and sorted based on the total population in the planning area in a descending order and a heatmap is used.
To visualise the top planning areas for the different age groups, a bar chart showing the top 10 planning areas calculated based on the population percentage is used for each age group. Another stack bar chart is used to visualise the distribution of the type of dwellings in the 10 locations. This will allow us to find any patterns between the type of dwelling and the population distribution.
The last design challenge is to ientify the population structure by age group combined with Gender in 2019.

2. Proposed Design Sketch

3. Data Viz Step-By-Step

3.1 Install and Load R packages

knitr::opts_chunk$set(echo = TRUE)

packages = c( 'dplyr', 'tidyverse', 'heatmaply', 'gridExtra', "RColorBrewer")

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3.2 Dataset Overview

After installing and loading necessary libraries, we can load the dataset, then ascertain the dataframe dimensions, variable datatypes and number of missing values.

pop_data <- read_csv("data/respopagesextod2019.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   PA = col_character(),
##   SZ = col_character(),
##   AG = col_character(),
##   Sex = col_character(),
##   TOD = col_character(),
##   Pop = col_double(),
##   Time = col_double()
## )

print(paste('DataFrame contains ', dim(pop_data)[1], 'rows and ', dim(pop_data)[2], 'columns.'))

## [1] "DataFrame contains  162111 rows and  7 columns."

In this dataset, there are 7 variables which are shown in each column.

The column names refers to the following:

Column Names	Refers to
PA	Planning Area
SZ	Subzone
AG	Age Group
Sex	Sex
TOD	Type of Dwelling
Pop	Resident Count
Time	Time / Period
SZ	Content Cell

3.3 Creating the Age Groups

Age Group	Age
children	0-14 years
early working age	15-24 years
prime working age	25-54 years
mature working age	55-64 years
elderly	65 years and over

Adding a new column called “Group”

pop_data <- pop_data %>% mutate(Group = 
                        case_when((AG == "0_to_4"| AG == "5_to_9"| AG == "10_to_14") ~ "Children",
                                  ( AG == "15_to_19"| AG == "20_to_24") ~ "Early Working Age",
                                  ( AG == "25_to_29"| AG == "30_to_34"| AG == "35_to_39"| AG == "40_to_44"| AG == "45_to_49"| AG == "50_to_54") ~ "Prime Working Age",
                                  ( AG == "55_to_59"| AG == "60_to_64") ~ "Mature Working Age",
                                  ( AG == "65_to_69"| AG == "70_to_74"| AG == "75_to_79"| AG == "80_to_84"| AG == "85_to_89"| AG == "90_and_over") ~ "Elderly",
                                   TRUE ~ ""))

3.4 Preparing the Dataset

Aggregate data by Planning Area for each Group and adding of a new column “Population Percent”.

pop_by_pa <- pop_data %>% group_by(PA) %>% summarise(Pop=sum(Pop))
pa_list <- pop_by_pa[order(pop_by_pa$Pop, decreasing = TRUE),]$PA
pop_by_age_pa <- pop_data %>% group_by(PA, Group) %>% summarise(Pop=sum(Pop))

## `summarise()` has grouped output by 'PA'. You can override using the `.groups` argument.

df_by_pa <-  group_by(.data = pop_by_age_pa, PA)
df_summarize2 <- mutate(.data = df_by_pa, Population_percent = Pop/sum(Pop))

print(df_summarize2)

## # A tibble: 276 x 4
## # Groups:   PA [56]
##    PA         Group                 Pop Population_percent
##    <chr>      <chr>               <dbl>              <dbl>
##  1 Ang Mo Kio Children            19030              0.116
##  2 Ang Mo Kio Early Working Age   16850              0.102
##  3 Ang Mo Kio Elderly             34080              0.207
##  4 Ang Mo Kio Mature Working Age  25900              0.158
##  5 Ang Mo Kio Prime Working Age   68570              0.417
##  6 Bedok      Children            34960              0.125
##  7 Bedok      Early Working Age   31300              0.112
##  8 Bedok      Elderly             51320              0.183
##  9 Bedok      Mature Working Age  44380              0.159
## 10 Bedok      Prime Working Age  118010              0.422
## # ... with 266 more rows

3.5 Creating the Top 30 Planning Area Stacked Bar Chart.

The top 30 planning area covers more than 99% of population in Singapore hence I have limit the Bar Chart to the top 30. This will also reduce the amount of noise in the stacked bar chart. Next, i fixed the stack order based on Age.

top30 <- subset(df_summarize2, PA %in% pa_list[1:30])
top30_orderedstack <- top30 %>%
  mutate (Group=factor(Group, levels = c("Elderly", "Mature Working Age", "Prime Working Age", "Early Working Age", "Children" ), ordered = TRUE))


ggplot(top30_orderedstack, aes(x = reorder(PA, Pop), y = Pop, fill = Group)) +
  geom_bar(stat = "identity") +  
  geom_smooth(method="lm") + 
  coord_flip() +
  scale_fill_brewer(palette = "RdYlGn")

## `geom_smooth()` using formula 'y ~ x'

3.6 Creating the Heatmap

First, the dataset is filtered to the top 30 planning areas, with the population percent to be more than 0%.

Next, the order of the columns of the age group is fixed according to their respective age.

Normalisation is done to the dataset according to the age groups

Lastly, the package Heatmaply is used to plot the Heatmap.

Normalisation is required to create the Heatmap as the around half of the population fall within the Prime Working Age Group and the difference in the distribution of the population may not be as visible without the normalisation.

df2 <- top30%>%
  group_by(PA, Group)%>%
  summarise(Population_percent=sum(Population_percent))%>%
  filter(Population_percent>'0')

wide <- df2 %>%
  spread(Group, Population_percent)

col.order <- c("Children","Early Working Age","Prime Working Age","Mature Working Age","Elderly")
wide2<-wide[,col.order]

row.names(wide2) <- wide$PA

df2_matrix <- data.matrix(wide2)

heatmaply(normalize(df2_matrix),
          colors = RdYlGn, Colv=NA
          )

3.7 Creating the Dataset for the Planning Area and the type of dwelling.

Adding a new column called “location_p”

pop_by_pa <- pop_data %>% group_by(PA) %>% summarise(Pop=sum(Pop))
pa_list <- pop_by_pa[order(pop_by_pa$Pop, decreasing = TRUE),]$PA
pop_by_age_pa <- pop_data %>% group_by(PA, TOD) %>% summarise(Pop=sum(Pop))

## `summarise()` has grouped output by 'PA'. You can override using the `.groups` argument.

df_by_pa <-  group_by(.data = pop_by_age_pa, PA)
df_location_summarize2 <- mutate(.data = df_by_pa, location_percent = Pop/sum(Pop))

print(df_location_summarize2)

## # A tibble: 441 x 4
## # Groups:   PA [56]
##    PA         TOD                                       Pop location_percent
##    <chr>      <chr>                                   <dbl>            <dbl>
##  1 Ang Mo Kio Condominiums and Other Apartments       13880          0.0844 
##  2 Ang Mo Kio HDB 1- and 2-Room Flats                 11230          0.0683 
##  3 Ang Mo Kio HDB 3-Room Flats                        56830          0.346  
##  4 Ang Mo Kio HDB 4-Room Flats                        43550          0.265  
##  5 Ang Mo Kio HDB 5-Room and Executive Flats          21650          0.132  
##  6 Ang Mo Kio HUDC Flats (excluding those privatised)     0          0      
##  7 Ang Mo Kio Landed Properties                       16070          0.0977 
##  8 Ang Mo Kio Others                                   1220          0.00742
##  9 Bedok      Condominiums and Other Apartments       54430          0.194  
## 10 Bedok      HDB 1- and 2-Room Flats                 10120          0.0361 
## # ... with 431 more rows

3.8 Charts for the Children Group

A Bar Chart depicting the 10 Planning Areas with the highest number of children

A Stacked distribution bar chart of the dwelling types in the 10 locations

The dataset is first filtered to the Top 30 Planning Areas similar to the previous sections, then filtered to the group ‘Children’.
The dataset is then sorted in a descending order based on the population percentage of each Planning Area.
Lastly, the top 10 planning areas is taken to make the 2 charts.

top_children <- subset(df_summarize2, PA %in% pa_list[1:30])
top_children <- top_children %>% 
  select(PA,Group,Pop,Population_percent) %>%
  filter( Group == "Children")
top_children_ordered <-top_children[order(top_children$Population_percent, decreasing = TRUE),]
children_list <- top_children[order(top_children$Population_percent, decreasing = TRUE),]$PA
top10_children <- subset(top_children_ordered, PA %in% children_list[1:10])


children_bar<-ggplot(data = top10_children, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
  geom_bar(stat = "sum", alpha = 0.7) + 
  ggtitle("Distribution of Children in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Children (0-14 years)')+
  xlab("Group")+
  ylab("Population Proportion")+
  geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none")+
  coord_flip()

df_by_loc_children <- subset(df_location_summarize2, PA %in% children_list[1:10])
df_by_loc_children <- df_by_loc_children %>%
  mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties",  "Others" ), ordered = TRUE))

children_location_stacked <- ggplot(df_by_loc_children, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
  geom_bar(stat = "identity", position = "fill")  +
  coord_flip()+
  scale_fill_brewer(palette = "Blues")


children_bar

children_location_stacked

From the charts, we can deduce that Punggol has a significantly larger population of children as compared to the other Planning Areas in Singapore. Also, the top 10 planning area have a population percentage of children higher than 15%.

Next, the top 2 PLanning Areas with highest percentage of children, Punggol and Sengkang, have a low percentage of 1-Room, 2-Room and 3-Room HDB Flats. Hence, for Singaporeans who are parents or soon-to-be parents, housing size plays a significant factor in their purchasing decision.

3.9 Charts for the Early Working Age Group

A Bar Chart depicting the 10 Planning Areas with the highest number of Early Working Age

A Stacked distribution bar chart of the dwelling types in the 10 locations

The dataset for the Early Working Age Group is prepared in the same way as the dataset used for the Children age group.

top_Early_Working_Age <- subset(df_summarize2, PA %in% pa_list[1:30])
top_Early_Working_Age <- top_Early_Working_Age %>% 
  select(PA,Group,Pop,Population_percent) %>%
  filter( Group == "Early Working Age")
top_Early_Working_Age_ordered <-top_Early_Working_Age[order(top_Early_Working_Age$Population_percent, decreasing = TRUE),]
Early_Working_Age_list <- top_Early_Working_Age[order(top_Early_Working_Age$Population_percent, decreasing = TRUE),]$PA
top10_Early_Working_Age <- subset(top_Early_Working_Age_ordered, PA %in% Early_Working_Age_list[1:10])



early_bar<-ggplot(data = top10_Early_Working_Age, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
  geom_bar(stat = "sum", alpha = 0.7) + 
  ggtitle("Distribution of Early Working Age Group in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Early Working Age Group (15-24 years)')+
  xlab("Group")+
  ylab("Population Proportion")+
  geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none")+
  coord_flip()


df_by_loc_early <- subset(df_location_summarize2, PA %in% Early_Working_Age_list[1:10])
df_by_loc_early <- df_by_loc_early %>%
  mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties",  "Others" ), ordered = TRUE))

early_location_stacked <- ggplot(df_by_loc_early, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
  geom_bar(stat = "identity", position = "fill")  +
  coord_flip()+
  scale_fill_brewer(palette = "Blues")

early_bar

early_location_stacked

The Planning Areas with highest population percentage of early working age group are Paris Ris, Woodlands and Chua Chu Kang.

With the exception of Bukit Timah and Serangoon, a common observation across these 10 Planning Areas are the lower percentage of Condominiums and Landed Properties.

3.9 Charts for the Prime Working Age Group

A Bar Chart depicting the 10 Planning Areas with the highest number of Prime Working Age

A Stacked distribution bar chart of the dwelling types in the 10 locations

The dataset is also prepared in the same way as the Children dataset.

top_Prime_Working_Age <- subset(df_summarize2, PA %in% pa_list[1:30])
top_Prime_Working_Age <- top_Prime_Working_Age %>% 
  select(PA,Group,Pop,Population_percent) %>%
  filter( Group == "Prime Working Age")
top_Prime_Working_Age_ordered <-top_Prime_Working_Age[order(top_Prime_Working_Age$Population_percent, decreasing = TRUE),]
Prime_Working_Age_list <- top_Prime_Working_Age[order(top_Prime_Working_Age$Population_percent, decreasing = TRUE),]$PA
top10_Prime_Working_Age <- subset(top_Prime_Working_Age_ordered, PA %in% Prime_Working_Age_list[1:10])



prime_bar<-ggplot(data = top10_Prime_Working_Age, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
  geom_bar(stat = "sum", alpha = 0.7) + 
  ggtitle("Distribution of Prime Working Age Group in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Prime Working Age Group (25-54 years)')+
  xlab("Group")+
  ylab("Population Proportion")+
  geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none")+
  coord_flip()


df_by_loc_prime <- subset(df_location_summarize2, PA %in% Prime_Working_Age_list[1:10])
df_by_loc_prime <- df_by_loc_prime %>%
  mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties",  "Others" ), ordered = TRUE))

prime_location_stacked <- ggplot(df_by_loc_prime, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
  geom_bar(stat = "identity", position = "fill")  +
  coord_flip()+
  scale_fill_brewer(palette = "Blues")

prime_bar

prime_location_stacked

Punggol and River Valley are the top 2 Planning Areas where half of the population consists of Prime Working Age Group. From the stacked bar chart for the type of dwelling, with the exception of Tanglin and River Valley, most of them live in 4-Room and 5-Room HDB Flats.

3.10 Charts for the Mature Working Age Group

A Bar Chart depicting the 10 Planning Areas with the highest number of Mature Working Age

A Stacked distribution bar chart of the dwelling types in the 10 locations

The dataset is also prepared in the same way as the Children dataset.

top_Mature_Working_Age <- subset(df_summarize2, PA %in% pa_list[1:30])
top_Mature_Working_Age <- top_Mature_Working_Age %>% 
  select(PA,Group,Pop,Population_percent) %>%
  filter( Group == "Mature Working Age")
top_Mature_Working_Age_ordered <-top_Mature_Working_Age[order(top_Mature_Working_Age$Population_percent, decreasing = TRUE),]
Mature_Working_Age_list <- top_Mature_Working_Age[order(top_Mature_Working_Age$Population_percent, decreasing = TRUE),]$PA
top10_Mature_Working_Age <- subset(top_Mature_Working_Age_ordered, PA %in% Mature_Working_Age_list[1:10])



mature_bar<-ggplot(data = top10_Mature_Working_Age, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
  geom_bar(stat = "sum", alpha = 0.7) + 
  ggtitle("Distribution of Mature Working Age Group in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Mature Working Age Group (55-64 years)')+
  xlab("Group")+
  ylab("Population Proportion")+
  geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none")+
  coord_flip()


df_by_loc_mature <- subset(df_location_summarize2, PA %in% Mature_Working_Age_list[1:10])
df_by_loc_mature <- df_by_loc_mature %>%
  mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties",  "Others" ), ordered = TRUE))

mature_location_stacked <- ggplot(df_by_loc_mature, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
  geom_bar(stat = "identity", position = "fill")  +
  coord_flip()+
  scale_fill_brewer(palette = "Blues")

mature_bar

mature_location_stacked

The top 3 planning areas with the highest percentage of Mature Working Age Group are Serangoon, Tampines and Hougang. A common observation that can be seen from the stacked bar chart across these 10 planning areas is the overall higher percentage of them living in Condominiums and Landed Properties as compared to the other Age Group.

3.11 Charts for the Elderly Group

A Bar Chart depicting the 10 Planning Areas with the highest number of Elderly

A Stacked distribution bar chart of the dwelling types in the 10 locations

The dataset is also prepared in the same way as the Children dataset.

top_Elderly <- subset(df_summarize2, PA %in% pa_list[1:30])
top_Elderly <- top_Elderly %>% 
  select(PA,Group,Pop,Population_percent) %>%
  filter( Group == "Elderly")
top_Elderly_ordered <-top_Elderly[order(top_Elderly$Population_percent, decreasing = TRUE),]
Elderly_list <- top_Elderly[order(top_Elderly$Population_percent, decreasing = TRUE),]$PA
top10_Elderly <- subset(top_Elderly_ordered, PA %in% Elderly_list[1:10])

elderly_bar<-ggplot(data = top10_Elderly, aes(x = reorder(PA, Population_percent), y = Population_percent )) +
  geom_bar(stat = "sum", alpha = 0.7) + 
  ggtitle("Distribution of Elderly in the Top 10 Planning Areas, 2019", subtitle='The 10 planning areas with the largest population percentage of Elderly (65 years and over)')+
  xlab("Group")+
  ylab("Population Proportion")+
  geom_text(aes(label=round(Population_percent, digits = 2), vjust = 1)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none")+
  coord_flip()


df_by_loc_Elderly <- subset(df_location_summarize2, PA %in% Elderly_list[1:10])
df_by_loc_Elderly <- df_by_loc_Elderly %>%
  mutate (TOD=factor(TOD, levels = c("HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)", "Condominiums and Other Apartments", "Landed Properties",  "Others" ), ordered = TRUE))

elderly_location_stacked <- ggplot(df_by_loc_Elderly, aes(x = reorder(PA, Pop), y = location_percent, fill = TOD)) +
  geom_bar(stat = "identity", position = "fill")  +
  coord_flip()+
  scale_fill_brewer(palette = "Blues")

elderly_bar

elderly_location_stacked

Outram is top location with the highest percentage of Elderly. One observation that can be seen in the stacked bar chart is the high percentage of elderly living in 1-Room, 2-Room and 3-Room HDB Flats.

3.12 Creating the Population Pyramid based on Gender

Rename the age for 0-4 and 5-9 years old so that the order in the population pyramid is sorted correctly.

The age group 0_to_4 and 5_to_9 is first modified so that all the age groups can be ordered. The females dataset is then separated and flipped and the population pyramid is created using the library ggplot.

pop_data$AG <- str_replace(as.factor(pop_data$AG),"5_to_9","05_to_09")
pop_data$AG <- str_replace(as.factor(pop_data$AG),"0_to_4","0_to_04")


pop_data$Pop <- ifelse(pop_data$Sex == "Females",-1*pop_data$Pop, pop_data$Pop)

age_cohort <- ggplot(pop_data, aes(x = AG, y = Pop, fill = Sex))+
  geom_bar(data = subset(pop_data, Sex == "Females"), stat = "identity")+
      geom_bar(data = subset(pop_data, Sex == "Males"), stat = "identity") +
  scale_y_continuous(breaks = seq(-150000, 150000, 50000), 
                     labels = paste0(as.character(c(seq(150, 0, -50), seq(50, 150, 50))))) +
  coord_flip()

age_cohort+
  ggtitle("Singapore Residents Pyramid by Age Cohort, 2019")+
            xlab("Age group")+
            ylab("Population in thousands")+
  scale_fill_brewer(palette = "Set1")+
  theme_minimal()+ 
  theme(legend.position='right')

From the population pyramid, we can observe that Singapore faces an aging population and a similar proportion of Males and Females.

4 Conclusion

There is a relationship between the type of Dwelling and the different Age Group. The population of children in a planning area is affected by housing size.This can be observed from the stacked bar chart for the top 10 Planning Areas with the highest percentage of children. There are a higher percentage of mature working age group living in private estate. There is also a high percentage of elderly living in 1-Room, 2-Room and 3-Room HDB Flats.

The Bar Charts and the Heatmap gives insights into how Singapore’s population is generally distributed geographically. In the northeast region, planning areas like Seng Kang and Punggol houses have a higher percentage of children and Prime working age group. This suggest that there are many young families there. The planning areas with high percentage of early working age group are located in non-mature estates such as Woodlands, Choa Chu Kang and Jurong. Planning Areas in the Central Region such as Queenstown, Geylang, Ang Mo Kio, Toa Payoh have a higher percentage of mature working age group and elderly. These are normally mature estates.

From the population pyramid, Singapore population is mostly economically active, with ageing population increasing and declining birth rates as it has a narrow base. This information will be helpful to those who are planning policies for the future of Singapore. For instance, the general healthcare expenses in the future will likely increase in the future.