Assignment 4 - Discover Trends in Singapore’s Population

The aim of this visualization is to discover trends in Singapore’s population by age structure (e.g., 00-04, 05-09, …), region and type of dwelling, focusing mainly on 2019. Additionally, I will zoom in on the non-mature estates (Planning Areas) of Singapore to discover insights about these areas. I focus on the non-mature estates as they are less preferred by buyers than mature estates, according to 99 and RedBrick. It would be interesting to see if this opinion has changed over the years.

The dataset used is the Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 data series, which can be found on SingStat’s website.

1. Load the Necessary Libraries, Data and Conduct Exploratory Data Analysis

library(tidyverse); library(RColorBrewer); library(ggplot2)

residents <- read_csv("respopagesextod2011to2020.csv")
glimpse(residents)

## Rows: 984,656
## Columns: 7
## $ PA   <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang M...
## $ SZ   <chr> "Ang Mo Kio Town Centre", "Ang Mo Kio Town Centre", "Ang Mo Ki...
## $ AG   <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0...
## $ Sex  <chr> "Males", "Males", "Males", "Males", "Males", "Males", "Males",...
## $ TOD  <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Fla...
## $ Pop  <dbl> 0, 10, 30, 50, 0, 0, 40, 0, 0, 10, 30, 60, 0, 0, 40, 0, 0, 10,...
## $ Time <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...

2. Major Design/Data Challenges

Challenge 1 (Data): Mislabeled values in “AG”
The “AG” field holds the age groups. However, two of the labels cause problems when plotting. To illustrate, when constructing the pyramid plot of the age distribution, the age group labeled “5_to_9” appears in the center of the pyramid instead of its correct position near the bottom (the pyramid is meant to be in ascending order of age). This is because the labels are character strings that are ordered alphabetically, and the label is written as 5 instead of 05.

# Notice it is "0_to_4" and "5_to_9" 
unique(residents$AG)

##  [1] "0_to_4"      "5_to_9"      "10_to_14"    "15_to_19"    "20_to_24"   
##  [6] "25_to_29"    "30_to_34"    "35_to_39"    "40_to_44"    "45_to_49"   
## [11] "50_to_54"    "55_to_59"    "60_to_64"    "65_to_69"    "70_to_74"   
## [16] "75_to_79"    "80_to_84"    "85_to_89"    "90_and_over"

To remedy this, I will rename the label to “05_to_09”. I will do this for “0_to_4” as well to remain consistent even though it does not affect the pyramid.

# The following code edits the names
residents$AG[residents$AG == "0_to_4"] <- "00_to_04"
residents$AG[residents$AG == "5_to_9"] <- "05_to_09"

unique(residents$AG)

##  [1] "00_to_04"    "05_to_09"    "10_to_14"    "15_to_19"    "20_to_24"   
##  [6] "25_to_29"    "30_to_34"    "35_to_39"    "40_to_44"    "45_to_49"   
## [11] "50_to_54"    "55_to_59"    "60_to_64"    "65_to_69"    "70_to_74"   
## [16] "75_to_79"    "80_to_84"    "85_to_89"    "90_and_over"
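The ordering problem can be reproduced without the dataset. The following is a minimal sketch on a hand-typed subset of the labels; the vector `ag` and the regular expression are illustrative and not part of the original analysis. `sort(method = "radix")` is used so that the result does not depend on the session locale.

```r
# A hand-typed subset of the age-group labels (illustrative only)
ag <- c("0_to_4", "5_to_9", "10_to_14", "50_to_54", "55_to_59")

# Character labels sort alphabetically, so "5_to_9" lands next to the 50s.
# method = "radix" sorts in the C locale, making the result reproducible.
sorted_raw <- sort(ag, method = "radix")

# Zero-pad the single-digit labels (same effect as the direct assignment above)
ag_padded <- sub("^([0-9])_to_([0-9])$", "0\\1_to_0\\2", ag)
sorted_padded <- sort(ag_padded, method = "radix")

print(sorted_raw)
print(sorted_padded)
```

An alternative that keeps the original labels would be to convert AG to a factor with explicitly ordered levels.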

Challenge 2 (Data): Too many categories in the PA column
There are too many categories in the PA (Planning Area) column. With 55 Planning Areas, it is difficult to show all of them in a single visualization.

unique(residents$PA)

##  [1] "Ang Mo Kio"              "Bedok"                  
##  [3] "Bishan"                  "Boon Lay"               
##  [5] "Bukit Batok"             "Bukit Merah"            
##  [7] "Bukit Panjang"           "Bukit Timah"            
##  [9] "Central Water Catchment" "Changi"                 
## [11] "Changi Bay"              "Choa Chu Kang"          
## [13] "Clementi"                "Downtown Core"          
## [15] "Geylang"                 "Hougang"                
## [17] "Jurong East"             "Jurong West"            
## [19] "Kallang"                 "Lim Chu Kang"           
## [21] "Mandai"                  "Marina East"            
## [23] "Marina South"            "Marine Parade"          
## [25] "Museum"                  "Newton"                 
## [27] "North-Eastern Islands"   "Novena"                 
## [29] "Orchard"                 "Outram"                 
## [31] "Pasir Ris"               "Paya Lebar"             
## [33] "Pioneer"                 "Punggol"                
## [35] "Queenstown"              "River Valley"           
## [37] "Rochor"                  "Seletar"                
## [39] "Sembawang"               "Sengkang"               
## [41] "Serangoon"               "Simpang"                
## [43] "Singapore River"         "Southern Islands"       
## [45] "Straits View"            "Sungei Kadut"           
## [47] "Tampines"                "Tanglin"                
## [49] "Tengah"                  "Toa Payoh"              
## [51] "Tuas"                    "Western Islands"        
## [53] "Western Water Catchment" "Woodlands"              
## [55] "Yishun"

Consequently, I group the Planning Areas into Regions according to Wikipedia’s definition. The groupings are Central, East, North, North East and West.

# create new column called Region
residents$Region <- residents$PA %>%
  fct_collapse(Central = c("Bishan", "Bukit Merah", "Bukit Timah", "Downtown Core", "Geylang", "Kallang", "Marina East", "Marina South", "Marine Parade", "Museum", "Newton", "Novena", "Orchard", "Outram", "Queenstown", "River Valley", "Rochor", "Singapore River", "Southern Islands", "Straits View", "Tanglin", "Toa Payoh"), 
         East = c("Bedok", "Changi", "Changi Bay", "Pasir Ris", "Paya Lebar", "Tampines"), 
         North = c("Central Water Catchment", "Lim Chu Kang", "Mandai", "Sembawang", "Simpang", "Sungei Kadut", "Woodlands", "Yishun"), 
         "North East" = c("Ang Mo Kio","Hougang", "North-Eastern Islands", "Punggol", "Seletar", "Sengkang", "Serangoon"), 
         West = c("Boon Lay", "Bukit Batok", "Bukit Panjang", "Choa Chu Kang", "Clementi", "Jurong East", "Jurong West", "Pioneer", "Tengah", "Tuas", "Western Islands", "Western Water Catchment" )) 

unique(residents$Region)

## [1] North East East       Central    West       North     
## Levels: North East East Central West North
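Because fct_collapse leaves any level that is not listed unchanged, a misspelled or forgotten Planning Area would silently remain as its own "region". The sketch below shows one way to check for this on a toy vector; the names `pa`, `expected_regions` and `unmapped` are hypothetical, and "Punggol" is left out of the mapping on purpose.

```r
library(forcats)

# Toy planning areas; "Punggol" is deliberately left out of the mapping
pa <- c("Bishan", "Bedok", "Yishun", "Tuas", "Punggol")

region <- fct_collapse(pa,
  Central = "Bishan",
  East    = "Bedok",
  North   = "Yishun",
  West    = "Tuas"
)

# Any level that is not one of the intended regions was never collapsed
expected_regions <- c("Central", "East", "North", "North East", "West")
unmapped <- setdiff(levels(region), expected_regions)
print(unmapped)
```

Applied to residents$Region, an empty result would suggest that every Planning Area was assigned to one of the five regions.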

Challenge 3 (Data): Misleading value in “Pop”
In the year 2019, there are 68,193 rows where Pop (population) is equal to 0. The 0 is misleading, as it can be misconstrued either as a missing value or as no one living in that type of dwelling. It leaves room for misunderstanding the graphs generated.

residents %>%
  filter(Time == 2019, Pop == 0) %>%
  summarise(Count = n())

## # A tibble: 1 x 1
##   Count
##   <int>
## 1 68193

As such, I will only use values that are more than 0 as shown in the following code.

residents <- residents %>%
  filter(Pop > 0)

residents %>%
  filter(Time == 2019, Pop == 0) %>%
  summarise(Count = n())

## # A tibble: 1 x 1
##   Count
##   <int>
## 1     0
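Since the later plots rely on sum(Pop), dropping the zero rows should not change any totals; it only affects row counts and per-row averages. A small sketch on made-up data (the `toy` dataframe is hypothetical) illustrates the difference:

```r
# Toy data: three dwelling types, one of them with zero residents
toy <- data.frame(
  TOD = c("HDB 3-Room Flats", "HDB 4-Room Flats", "Landed Properties"),
  Pop = c(120, 270, 0)
)

toy_nonzero <- toy[toy$Pop > 0, ]

# Totals are unchanged by dropping zero rows
print(sum(toy$Pop) == sum(toy_nonzero$Pop))

# Row counts and per-row averages are not
print(c(nrow(toy), nrow(toy_nonzero)))
print(c(mean(toy$Pop), mean(toy_nonzero$Pop)))
```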

Challenge 4 (Design): Difficulty in Properly Displaying Annotations
After creating multiple graphs using facet_wrap, I used the geom_label function to add annotations to the graphs. However, some of the labels were cut off or overlapped, making them unreadable. After extensive Google research, I found out that the size of the graph can be adjusted in the R chunk options, e.g. {r fig.height = 6, fig.width = 12}.
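As a sketch of the same fix applied once for the whole document rather than per chunk, the figure size could be set globally through knitr's opts_chunk$set() in a setup chunk. The values simply reuse the ones mentioned above; whether they suit every plot is an assumption.

```r
# Set default figure dimensions (in inches) for all subsequent chunks
knitr::opts_chunk$set(fig.width = 12, fig.height = 6)
```

Individual chunks can still override these defaults in their own chunk header.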

Lastly, since most of the graphs/charts will focus on 2019, I will create a separate dataframe for it.

residents_2019 <- residents %>%
  filter(Time==2019)

unique(residents_2019$Time)

## [1] 2019

3. Proposed Sketch Designs

4. Generated Plots

I first look at the population of Singapore to get a sense of the overall age distribution, which gives an indication of the city’s birth and death rates. I will be using functions like ggplot and geom_bar from the ggplot2 package. The steps are as follows:
1. Define the dataframe (residents_2019) as the first argument in ggplot. Set x to “AG”, y to “Pop” and fill to “Sex” in the aes function.
2. Define 2 geom_bar functions. Set the first one to use the subset of the dataframe where “Sex” is “Males”.
3. Similarly, set the second geom_bar to use the subset where “Sex” is “Females”, and add a mapping that sets y to “-Pop” so that these bars extend in the opposite direction.
4. Add scale_y_continuous to show the axis labels as positive values in thousands, and coord_flip to make the bars horizontal.
5. Using a palette from the RColorBrewer package (via scale_fill_brewer), set the colours for males and females.
6. Lastly, add final touches like a title and proper axis labels using labs.

ggplot(residents_2019, aes(x=AG, fill=Sex, y=Pop)) +
  geom_bar(data = subset(residents_2019, Sex == "Males"), stat = "identity") + 
  geom_bar(data = subset(residents_2019, Sex == "Females"), stat = "identity", mapping = aes(y = -Pop)) +
  scale_y_continuous(breaks = seq(-150000, 150000, 50000), 
                     labels=abs(seq(-150, 150, 50))) +
  coord_flip() + 
  scale_fill_brewer(palette = "Pastel1") + 
  theme_bw(base_size = 13) +
  labs(title = "Age-sex Distribution of the Population in Singapore", x = "Age Groups", y="Number of Residents (thousands)", fill="Gender")
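The mirroring trick in the steps above (negating one sex so its bars extend the other way) can also be expressed as a data preparation step. The sketch below uses made-up rows in the same shape as residents_2019; `toy`, `pyramid_df` and `signed_pop` are hypothetical names. It aggregates first, so each age group and sex ends up as a single value rather than many stacked rows.

```r
library(dplyr)

# Toy rows in the same shape as residents_2019 (values are made up)
toy <- tibble(
  AG  = c("00_to_04", "00_to_04", "00_to_04", "05_to_09", "05_to_09"),
  Sex = c("Males", "Males", "Females", "Males", "Females"),
  Pop = c(10, 20, 25, 40, 35)
)

pyramid_df <- toy %>%
  group_by(AG, Sex) %>%
  summarise(Pop = sum(Pop), .groups = "drop") %>%
  # Females go on the negative side of zero, males on the positive side
  mutate(signed_pop = if_else(Sex == "Females", -Pop, Pop))

print(pyramid_df)
```

With a table like this, a single geom_col(aes(y = signed_pop)) layer would draw both halves of the pyramid.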

Next, I take a deeper look at the percentage of residents living in the 5 regions and the types of dwelling. I will create a faceted bar graph based on the 5 regions. Each graph will show the types of dwellings and the proportion of residents living in them. The package used to create the graphs is ggplot2. The steps are as follows:
1. Calculate the sum of “Pop” and assign it to “total_pop”.
2. Create a new dataframe: TOD_Region_pop.
3. In the dataframe, group the original dataframe (residents_2019) by “Region” and “TOD”.
4. Calculate the population percentage by taking sum(Pop), dividing by total_pop (the 2019 total across all regions) and multiplying by 100. Round the value to 2 decimal places.
5. Create the graph using the dataframe TOD_Region_pop and defining the x and y axes in the aes function of ggplot. Then add geom_col to create the bar graph.
6. Add facet_wrap(~Region) to split the bar graph into the 5 regions.
7. Add final touch-ups like the title and axis labels.

The code is shown below:

total_pop <- sum(residents_2019$Pop)

TOD_Region_pop <- residents_2019 %>%
  group_by(Region, TOD) %>%
  summarise(Pop_Proportion = round(sum(Pop)/total_pop*100, digits=2))

ggplot(TOD_Region_pop, aes(x=Pop_Proportion, y=TOD, label=Pop_Proportion)) +
  geom_col(fill = "#9B5540") +
  facet_wrap(~Region) +
  theme_bw(base_size=14) +
  theme(panel.spacing = unit(0.5, "lines")) +
  labs(title = "Percentage of Residents by Type of Dwelling in each Region", 
       x="Residents (%)",
       y="") +
  geom_label(size=3)
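Note that the percentages above are shares of the total 2019 population, so they sum to 100 across all panels rather than within each region. The sketch below contrasts this with a within-region share on made-up data; `toy`, `pct_of_total` and `pct_of_region` are hypothetical names, and the within-region variant is an alternative that is not used in this report.

```r
library(dplyr)

# Made-up counts for two regions and two dwelling types
toy <- tibble(
  Region = c("A", "A", "B", "B"),
  TOD    = c("HDB 4-Room Flats", "Condominiums", "HDB 4-Room Flats", "Condominiums"),
  Pop    = c(60, 40, 80, 20)
)

shares <- toy %>%
  # Share of the overall total (the approach used in this report)
  mutate(pct_of_total = Pop / sum(Pop) * 100) %>%
  # Share within each region (sums to 100 per facet)
  group_by(Region) %>%
  mutate(pct_of_region = Pop / sum(Pop) * 100) %>%
  ungroup()

print(shares)
```

The share of the total is useful for comparing bars across regions, while the within-region share makes each facet readable on its own.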

Next, I will look at the number of residents living in non-mature estates to discover whether the opinion that non-mature estates are less desirable than mature estates has dissipated over the years. I will create a new dataframe that includes only the non-mature estates, following RedBrick’s website to determine which planning areas are considered non-mature. I create the graph using the ggplot2 package and the following steps:
1. Create a new dataframe, nonmature_df, which contains the non-mature estates along with the original variables.
2. Filter “Time” so that 2020 is not included, then group by “PA” and “Time” and calculate the total population using sum(Pop).

# create a new dataframe called nonmature_df
nonmature_df <- residents %>% 
  filter(PA %in% c("Bukit Batok", "Bukit Panjang", "Choa Chu Kang", "Hougang", "Jurong East", "Jurong West", "Punggol", "Sembawang","Sengkang", "Woodlands","Yishun"))

glimpse(nonmature_df)

## Rows: 101,654
## Columns: 8
## $ PA     <chr> "Bukit Batok", "Bukit Batok", "Bukit Batok", "Bukit Batok", ...
## $ SZ     <chr> "Bukit Batok Central", "Bukit Batok Central", "Bukit Batok C...
## $ AG     <chr> "00_to_04", "00_to_04", "00_to_04", "00_to_04", "00_to_04", ...
## $ Sex    <chr> "Males", "Males", "Males", "Males", "Females", "Females", "F...
## $ TOD    <chr> "HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 5-Room and Exec...
## $ Pop    <dbl> 120, 270, 200, 70, 120, 280, 200, 70, 130, 320, 240, 80, 110...
## $ Time   <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, ...
## $ Region <fct> West, West, West, West, West, West, West, West, West, West, ...

# Notice 2020 is no longer in the dataframe
nonmature_df <- nonmature_df %>% 
  filter(Time != 2020) %>%
  group_by(PA, Time) %>% 
  summarise(total_residents = sum(Pop))

unique(nonmature_df$Time)

## [1] 2011 2012 2013 2014 2015 2016 2017 2018 2019

3. Create the timeline graph using the nonmature_df dataframe. Set x to Time and y to total_residents.
4. Include scale_x_continuous and scale_y_continuous to change the labels of the x and y axis ticks accordingly.
5. Lastly, add final touches like setting the background of the plot to gray and including proper labels.

ggplot(nonmature_df, aes(x=Time, y=total_residents)) +
  geom_line(aes(color = PA), show.legend = FALSE, size = 1.8) +
  geom_text(data = nonmature_df %>% filter(Time == 2019), aes(label = PA), size=3.9, hjust = -.1) +
  scale_x_continuous(breaks = seq(2011, 2019, 1), expand = c(0.12, 0)) +
  scale_y_continuous(breaks = seq(100000, 250000, 50000), 
                     labels=abs(seq(100, 250, 50))) +
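  # theme_gray() is added directly; theme_set() would change the global default
  # and return the previous theme, which is not what should be added to the plot
  theme_gray(base_size = 25) +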
  theme(axis.title = element_text(size=13),
        plot.title = element_text(size = 20))+
  labs(title="Timeline of Residency Changes in Non-mature Estates over 9 Years", 
       x = "\nYear",
       y = "Number of Residents (thousands)\n")
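To put numbers on the slopes of the lines, the change between the first and last year can be computed per Planning Area. The sketch below uses made-up figures in the same shape as nonmature_df; the area names, values and the `growth` dataframe are hypothetical and are not results from the dataset.

```r
library(dplyr)

# Made-up totals in the same shape as nonmature_df
toy <- tibble(
  PA              = rep(c("Area X", "Area Y"), each = 3),
  Time            = rep(c(2011, 2015, 2019), times = 2),
  total_residents = c(50000, 90000, 160000, 200000, 205000, 210000)
)

growth <- toy %>%
  group_by(PA) %>%
  summarise(
    # Absolute change between the earliest and latest year
    change     = total_residents[Time == max(Time)] - total_residents[Time == min(Time)],
    # Change relative to the earliest year, in percent
    pct_change = round(change / total_residents[Time == min(Time)] * 100, 1),
    .groups    = "drop"
  ) %>%
  arrange(desc(pct_change))

print(growth)
```

Running the same summary on nonmature_df would rank the estates by growth and make the comparison between them explicit.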

5. Insights Gleaned from the Visualizations

1. Age-sex Pyramid:
For the most part, the numbers of females and males are relatively similar. However, the balance shifts towards females in the older age groups, as females tend to live longer. The pyramid can also be described as having a constrictive form: it is wider in the middle than at the top and bottom, indicating that a large percentage of people are in the working-age group.
The shape of the pyramid also indicates that Singapore has low birth and death rates. The pyramid only starts to narrow from age 60 onwards, which shows that people are living longer lives and that the death rate is low. Similarly, the base of the pyramid is narrower than the middle, which means that birth rates are low.

2. Type of Dwelling in each Region Faceted Plot:
With the highest percentage of residents in every region, the most lived-in type of dwelling in Singapore is the HDB 4-Room Flat. Except in the Central region, the second most lived-in type of dwelling is “HDB 5-Room and Executive Flats”. As the Central area is known to be on the pricier side, it is no surprise that “Condominiums and Other Apartments” is the second most lived-in type there.

3. Timeline of Residency in Non-mature Estates over 9 Years:
Some of the most notable increases in residency are in Punggol and Sengkang. A possible reason could be that these areas are becoming more modernized and accessible. For instance, since 2007, plans have been rolled out to turn Punggol into a “waterfront town of the 21st century”, which will likely draw more people into the area. It seems that the opinion that non-mature estates are less desirable is starting to change. Even other areas like Yishun and Sembawang have seen a slight increase in residency in recent years.