This dataset Singapore Residents by Subzone and Type of Dwelling, 2011 - 2019 is retrieved from Data.gov.sg. This publication aims to reveal the demographic structure of Singapore’s population by age cohort (e.g., 0-4, 5-9, ……) and identify trends in residence in non-mature estates in Singapore.
Major challenge | Description |
---|---|
Data | The age group is recorded as x_to_y, and as a result the age group 5_to_9 will not appear in the correct order when sorted ascending. |
Design | Generating a population pyramid is not straightforward with ggplot. This involves searching for codes for parameters like coord_flip . |
Design | Creating a pie chart with ggplot is involves creating a bar chart first and preventing overlaps of percentage labels. |
Design | Due to the multiple lines present in the line graph, fig.height and fig.width must be specified larger, and expand the x axis if needed. |
Proposed Sketch Design
Steps:
1. Set working directory & load essential libraries
2. Store the dataset into df
3. View the dimensions of the dataset, along with the structure of variables and an overview of each column and its records
4. Check for NA values
knitr::opts_chunk$set(warning = FALSE, error = TRUE, message = FALSE, results = 'asis')
setwd("C:/Users/X Lin/Desktop/Notes/4.1/VA/Assignment 4")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
df <- read.csv("planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv")
dim(df) #check no. of observations
## [1] 883728 7
head(df, n = 20) # view first 20 rows
## planning_area subzone age_group sex
## 1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 7 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 8 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 9 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females
## 10 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females
## 11 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females
## 12 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females
## 13 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females
## 14 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females
## 15 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females
## 16 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females
## 17 Ang Mo Kio Ang Mo Kio Town Centre 5_to_9 Males
## 18 Ang Mo Kio Ang Mo Kio Town Centre 5_to_9 Males
## 19 Ang Mo Kio Ang Mo Kio Town Centre 5_to_9 Males
## 20 Ang Mo Kio Ang Mo Kio Town Centre 5_to_9 Males
## type_of_dwelling resident_count year
## 1 HDB 1- and 2-Room Flats 0 2011
## 2 HDB 3-Room Flats 10 2011
## 3 HDB 4-Room Flats 30 2011
## 4 HDB 5-Room and Executive Flats 50 2011
## 5 HUDC Flats (excluding those privatised) 0 2011
## 6 Landed Properties 0 2011
## 7 Condominiums and Other Apartments 40 2011
## 8 Others 0 2011
## 9 HDB 1- and 2-Room Flats 0 2011
## 10 HDB 3-Room Flats 10 2011
## 11 HDB 4-Room Flats 30 2011
## 12 HDB 5-Room and Executive Flats 60 2011
## 13 HUDC Flats (excluding those privatised) 0 2011
## 14 Landed Properties 0 2011
## 15 Condominiums and Other Apartments 40 2011
## 16 Others 0 2011
## 17 HDB 1- and 2-Room Flats 0 2011
## 18 HDB 3-Room Flats 10 2011
## 19 HDB 4-Room Flats 30 2011
## 20 HDB 5-Room and Executive Flats 60 2011
table(df$planning_area) # count of unique values of planning area
##
## Ang Mo Kio Bedok Bishan
## 32832 21888 8208
## Boon Lay Bukit Batok Bukit Merah
## 10944 24624 46512
## Bukit Panjang Bukit Timah Central Water Catchment
## 19152 21888 2736
## Changi Changi Bay Choa Chu Kang
## 8208 2736 16416
## Clementi Downtown Core Geylang
## 24624 32832 13680
## Hougang Jurong East Jurong West
## 27360 27360 24624
## Kallang Lim Chu Kang Mandai
## 24624 2736 8208
## Marina East Marina South Marine Parade
## 2736 2736 13680
## Museum Newton North-Eastern Islands
## 8208 16416 2736
## Novena Orchard Outram
## 13680 8208 10944
## Pasir Ris Paya Lebar Pioneer
## 21888 13680 13680
## Punggol Queenstown River Valley
## 19152 41040 13680
## Rochor Seletar Sembawang
## 27360 10944 24624
## Sengkang Serangoon Simpang
## 19152 19152 10944
## Singapore River Southern Islands Straits View
## 8208 5472 2736
## Sungei Kadut Tampines Tanglin
## 13680 13680 10944
## Tengah Toa Payoh Tuas
## 2736 32832 16416
## Western Islands Western Water Catchment Woodlands
## 8208 2736 24624
## Yishun
## 24624
table(df$age_group) # count of unique values of age group
##
## 0_to_4 10_to_14 15_to_19 20_to_24 25_to_29 30_to_34
## 46512 46512 46512 46512 46512 46512
## 35_to_39 40_to_44 45_to_49 5_to_9 50_to_54 55_to_59
## 46512 46512 46512 46512 46512 46512
## 60_to_64 65_to_69 70_to_74 75_to_79 80_to_84 85_to_89
## 46512 46512 46512 46512 46512 46512
## 90_and_over
## 46512
table(df$type_of_dwelling) # count of unique values for type of dwelling
##
## Condominiums and Other Apartments HDB 1- and 2-Room Flats
## 110466 110466
## HDB 3-Room Flats HDB 4-Room Flats
## 110466 110466
## HDB 5-Room and Executive Flats HUDC Flats (excluding those privatised)
## 110466 110466
## Landed Properties Others
## 110466 110466
colSums(is.na(df)) # check for NA values
## planning_area subzone age_group sex
## 0 0 0 0
## type_of_dwelling resident_count year
## 0 0 0
is.numeric(df$resident_count) # check if resident_count is numeric
## [1] TRUE
geom_bar
and coord_flip
pyramid
#split the data set to females and males by resident count
males <- filter(df, sex == "Males" & year == 2019) %>% select(age_group, resident_count) #filter df to males and year 2019
males_agg <- males %>% group_by(age_group) %>% summarise(Male = sum(resident_count)) #sum resident counts by age group
males_agg[10,1] <- "05_to_9"
males_agg[1,1] <- "00_to_4"
#repeat for female
females <- filter(df, sex == "Females" & year == 2019) %>% select(age_group, resident_count)
females_agg <- females %>% group_by(age_group) %>% summarise( Female = sum(resident_count))
females_agg[10,1] <- "05_to_9"
females_agg[1,1] <- "00_to_4"
library(sqldf)
agg <- sqldf("SELECT age_group, Female, Male from males_agg inner join females_agg using(age_group) order by age_group asc") #inner join using SQL
agg # age group 5-9 would be in the wrong order, thus I renamed it to 05-9 and 0-4 to 00-4
age_group Female Male
1 00_to_4 90850 94730 2 05_to_9 97040 101290 3 10_to_14 102550 105830 4 15_to_19 108910 113730 5 20_to_24 122480 127040 6 25_to_29 145960 142640 7 30_to_34 153460 140360 8 35_to_39 158850 142310 9 40_to_44 157120 144130 10 45_to_49 160230 151800 11 50_to_54 152750 149360 12 55_to_59 153590 153850 13 60_to_64 140770 138490 14 65_to_69 112900 108920 15 70_to_74 79190 71450 16 75_to_79 52680 42460 17 80_to_84 36230 26230 18 85_to_89 21430 12490 19 90_and_over 13730 5590
final_agg <- agg %>% gather(key = sex,residents, -age_group)
final_agg$age_group <- gsub("_to_","-", final_agg$age_group) #remove underscores in age group
final_agg$age_group
[1] “00-4” “05-9” “10-14” “15-19” “20-24”
[6] “25-29” “30-34” “35-39” “40-44” “45-49”
[11] “50-54” “55-59” “60-64” “65-69” “70-74”
[16] “75-79” “80-84” “85-89” “90_and_over” “00-4”
[21] “05-9” “10-14” “15-19” “20-24” “25-29”
[26] “30-34” “35-39” “40-44” “45-49” “50-54”
[31] “55-59” “60-64” “65-69” “70-74” “75-79”
[36] “80-84” “85-89” “90_and_over”
library(RColorBrewer)
as.factor(final_agg$sex)
[1] Female Female Female Female Female Female Female Female Female Female [11] Female Female Female Female Female Female Female Female Female Male
[21] Male Male Male Male Male Male Male Male Male Male
[31] Male Male Male Male Male Male Male Male
Levels: Female Male
options(scipen = 999)
pyramid<-ggplot(final_agg, aes(x = as.factor(age_group), y = residents, fill = sex)) +
geom_bar(data = subset(final_agg, sex == "Female"), stat = "identity") +
geom_bar(data = subset(final_agg, sex == "Male"), stat = "identity",
mapping = aes(y = -residents)) +
scale_colour_manual(values = c("pink", "steelblue"),
aesthetics = c("colour", "fill")) +
ggtitle("Population pyramid of Singapore in 2019") +
ylab("Number of residents") +
scale_y_continuous(labels = abs) +
xlab("Age group") +
coord_flip()
Over the years, the Singapore government has been implementing ways to develop non-mature housing estates in a bid to attract potential homeowners to buy homes in these non-mature areas (which are not as favorable compared to mature estates). Non mature estates are not as attractive as they are either considered remote. More information on mature vs. non-mature estates can be found here non mature estates by Redbrick, a mortgage advisor. Here, we want to see how resident counts have changed between 2011 to 2019, and possibly identify which non-mature housing estates are gaining popularity amongst home buyers.
In 2019, the list of non-mature estates identified are:
1. Bukit Batok
2. Bukit Panjang
3. Choa Chu Kang
4. Hougang
5. Jurong East
6. Jurong West
7. Punggol
8. Sembawang
9. Sengkang
10. Woodlands
11. Yishun
Steps:
1. Narrow the data df to only those where planning area is a non-mature estatem, saved as df_nonmature
2. Group df_nonmature
by planning area and year 3. Use geom_line
to plot a time series graph of resident count
3. The code for the line graph is saved to line
.
df_nonmature <- df %>% filter(planning_area %in% c("Bukit Batok", "Bukit Panjang", "Choa Chu Kang", "Hougang", "Jurong East",
"Jurong West", "Punggol", "Sembawang","Sengkang", "Woodlands","Yishun"))
df_nonmature2 <- sqldf("select sum(resident_count) as total_residents, planning_area, year from df_nonmature group by planning_area, year")
options(scipen = 999)
line <- df_nonmature2 %>% ggplot(aes( x = year, y = total_residents, group = planning_area)) + geom_line(aes(color = planning_area, alpha = 1, show.legend = FALSE), size = 3) + geom_text(data = df_nonmature2 %>% filter(year == 2019), aes(label = planning_area), color = "white", hjust = -.01, size = 7) +
ggtitle("Line graph of resident count in non mature estates") + ylab("Total number of residents") + theme_dark(base_size = 25) + theme(legend.position="none") + scale_x_continuous(expand = c(.125, 0), breaks = seq(2011, 2019,1))
We are now concerned with two characteristics of households residing in non-matures estates.
1. Age group
Non-mature estates are mostly home to young couples and families, thus we aim to find out if this is the case by plotting a bar plot of the distribution of age groups for each non-mature estate.
Steps:
1. Filter df_nonmature
to only records with year 2019 and group the resulting data set by age group and planning area
2. Create a bar chart using geom_bar
3. The code for the bar chart is saved as barchart
df_nonmature_2019 <- df_nonmature %>% filter(year == 2019)
df_nonmature_20192 <- sqldf("select sum(resident_count) as total_residents, planning_area, age_group from df_nonmature_2019 group by planning_area, age_group")
df_nonmature_20192$age_group <- gsub("5_to_9", "05_to_9", df_nonmature_20192$age_group)
df_nonmature_20192$age_group <- gsub("_to_", "-", df_nonmature_20192$age_group)
df_nonmature_20192$age_group <- gsub("_", " ", df_nonmature_20192$age_group)
# bar plot of each estate
library(RColorBrewer)
barchart <-ggplot(df_nonmature_20192, aes(x = as.factor(age_group), y = total_residents, fill = -total_residents)) + geom_bar(stat = "identity") +
facet_grid(planning_area~.) + theme_bw(base_size = 16) + ggtitle("Bar plot of residents in non-mature estates by age group") +
ylab("Number of residents") + xlab("Age group") + scale_fill_continuous(type = "gradient") + theme_gray(base_size = 22) + guides(fill=guide_legend(title="Total Residents"))
2. Type of dwelling
We are also interested to see which house (i.e. HDB/Landed house/Condominium) is the predominant dwelling in non-mature estates in 2019.
Steps:
1. Using the dataset filtered by year 2019 and for non mature estates df_nonmature_dwelling2019
, group the dataset by type of dwelling
2. Create a new column perc, which calculates the percentage of each type of dwelling in 2019
3. Create a stacked bar chart bp
using geom_bar
4. The geom_text_repel
as part of the ggrepel
package prevents any overlapping of labels. This is used in place of geom_text
for a neater pie chart
5. Using the bar chart bp
created in step 3, apply coord_polar
to turn the bar chart into a circular pie chart
6. The code for the pie chart is saved as pie
df_nonmature_dwelling2019 <- df_nonmature_2019 %>% group_by(type_of_dwelling) %>% summarise(total_residents = sum(resident_count))
df_nonmature_dwelling2019$perc <- round((df_nonmature_dwelling2019$total_residents/ sum(df_nonmature_dwelling2019$total_residents))*100,1)
df_nonmature_dwelling20192 <- df_nonmature_dwelling2019[-c(6,8),] # remove HUDC and other apartments which are not large figures
bp <- df_nonmature_dwelling20192 %>% ggplot(aes(x = "", y = total_residents, fill = type_of_dwelling)) + geom_bar(width = 1, stat = "identity", color = "black") #create a stacked bar chart before making a pie chart
library(ggrepel)
pie <- bp + coord_polar("y", start=0) + geom_text_repel(aes(label = paste(perc, "%"), x = 1.6),position = position_stack(vjust = 0.3),size=3) + scale_x_discrete(expand = c(0,0)) + guides(fill=guide_legend(title="Type of Dwelling")) +
theme_void() + scale_fill_brewer(palette="Pastel2") #make pie chart using coord_polar
pyramid
The graph shows that as of 2019, the top 3 non mature estate by resident count are (in order): Jurong West, Woodlands, and Sengkang.
Notably, the non mature estates Sengkang, Punggol, and Sembawang have seen a significant increase in the number of residents residing in these locations (seen by the steeper slopes in recent years). This is expected, as the governmnent has been developing these areas extensively. For example, connectivity within Punggol has improved tremendously over the years, with multiple Light Rail Transit (LRT) stations built to provide greater accessibility to condominiums and public housing located further from Punggol MRT station. Other major projects in Punggol include the Punggol Digital District, which is envisioned to be a vibrant learning hub with shared spaces for industry and academia; the Punggol Town Hub, which brings together services and programmes to serve its residents. More amenities and lifestyle destinations are expected to be developed in Sengkang and Sembawang, including recreational and green spaces for residents.
line
From the bar graph, we see that younger households predominate in Sengkang and Punggol - most residents living in Sengkang and Punggol are aged between 35 to 39. We also do see more children (i.e. those aged below 10) in Sengkang and Punggol in comparison to the other non-mature estates. This suggests that Sengkang and Punggol are popular non-mature estate choices amongst younger couples and families. In areas such as Yishun, Hougang, and Woodlands, majority of the residents belong to older age groups (i.e. 50-60).
barchart
From the pie chart, HDB flats is the dominant type of housing in non-mature estates. In particular, most residents live in HDB 4-room flats (39.7%) or HDB 5-room flats (33.1%). The number of residents living in Condominiums, landed properties, and 1 or 2 room HDB flats are considerably lower. This is aligned with the government’s plan to increase the supply public housing (HDB) in non-mature estates.
pie