1 Introduction

The data used for this assignment is the “Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019” data series from https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data

There are a total of 7 variables: Planning Area, Subzone, Age Group, Sex, Type of Dwellings, Population Count, Time. The task is to build a visualisation that reveals the demographic structure of Singapore's population by age cohort and planning area.

1.1 Challenges Faced

1.1.1 Variables

Each variable has a large number of unique values: there are 19 age groups and 55 planning areas. If age group and planning area each took one axis of a single chart, the result would be too cluttered to analyse.

1.1.2 Cleanliness of Data

As the dataset comes from a government website, it is fairly clean but not perfect. Exploratory data analysis (EDA) shows that some planning areas have no residents at all. The dataset also covers roughly 10 years, so it has to be filtered down to the year needed for the visualisation.

1.1.3 Sorting Age Group

Because categorical values are sorted alphabetically, age group 5_to_9 is placed between 45_to_49 and 50_to_54 instead of directly after 0_to_4.
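The sorting problem can be reproduced with a few labels. This is a minimal sketch on a hypothetical vector `ag` (not the real dataset); the exact position of 5_to_9 relative to 50_to_54 may vary with the locale, so the code only checks that it lands late in the order.

```r
# Hypothetical sample of age-group labels in the format of the AG column
ag <- c("0_to_4", "5_to_9", "10_to_14", "45_to_49", "50_to_54")

# Alphabetical sort, as applied to character categories
sorted <- sort(ag)
sorted

# Position of "5_to_9" after sorting; numerically it should be 2
pos <- match("5_to_9", sorted)
pos
```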

1.2 Solution

Variables: To avoid a cluttered chart, we will build one pyramid chart per planning area by using facet_wrap on planning area. This shows the population distribution of each planning area without squeezing everything into a single chart.
Cleanliness of Data: We will use the filter function to keep only the year 2019 and to remove planning areas with no residents.
Sorting Age Group: This can be fixed by renaming the age group “5_to_9” to “05_to_09”.

1.3 Proposed Design

A rough sketch of the proposed visualisation is shown below; please pardon the handwriting and drawing. It shows multiple population pyramid charts, one for each planning area.

knitr::include_graphics("sketch.jpg")

2 Step-by-step Instructions

2.1 Install and load the libraries

First, we install and load all the libraries needed for the analysis and for building the visualisation.

packages = c('tidyverse','ggplot2','stringr', 'ggpol')
for (p in packages){
  # install the package only if it cannot be loaded
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

2.2 Data Wrangling

2.2.1 Loading data

As the data is in CSV format, we use read_csv to load it as a data frame.

pop_data <- read_csv("data/respopagesextod2011to2020.csv")

Take a glimpse at the data to understand its structure.

glimpse(pop_data)

## Rows: 984,656
## Columns: 7
## $ PA   <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang M...
## $ SZ   <chr> "Ang Mo Kio Town Centre", "Ang Mo Kio Town Centre", "Ang Mo Ki...
## $ AG   <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0...
## $ Sex  <chr> "Males", "Males", "Males", "Males", "Males", "Males", "Males",...
## $ TOD  <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-Room Fla...
## $ Pop  <dbl> 0, 10, 30, 50, 0, 0, 40, 0, 0, 10, 30, 60, 0, 0, 40, 0, 0, 10,...
## $ Time <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
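To check how many unique values each variable has (the concern raised in the challenges), the distinct values per column can be counted. The sketch below uses a small hypothetical data frame `toy` with the same column names; on the real data the same `sapply` call would be applied to `pop_data`.

```r
# Hypothetical toy data with the same column names as the real dataset
toy <- data.frame(
  PA  = c("Ang Mo Kio", "Ang Mo Kio", "Bedok", "Bishan"),
  AG  = c("0_to_4", "5_to_9", "0_to_4", "0_to_4"),
  Sex = c("Males", "Females", "Males", "Females"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column
n_unique <- sapply(toy, function(col) length(unique(col)))
n_unique
```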

2.2.2 Filter data for analysis

Use the filter function to keep only the rows that belong to the year 2019.

clean_data <- pop_data %>% 
  filter(Time == 2019)

As stated in the challenges, some planning areas do not have any population. Removing them leaves less distraction in the final visualisation. The code chunk below does the following:
* Sum the population count for each planning area and keep only the planning areas whose total is not zero
* Create a list of the planning areas that have residents
The list of planning areas with residents is printed below.

clean_data_summary <- clean_data %>%
  group_by(PA)%>%
  summarise(Pop = sum(Pop))%>%
  filter(Pop != 0)

## `summarise()` ungrouping output (override with `.groups` argument)

non_zero_pa <- unique(clean_data_summary$PA)
non_zero_pa

##  [1] "Ang Mo Kio"              "Bedok"                  
##  [3] "Bishan"                  "Bukit Batok"            
##  [5] "Bukit Merah"             "Bukit Panjang"          
##  [7] "Bukit Timah"             "Changi"                 
##  [9] "Choa Chu Kang"           "Clementi"               
## [11] "Downtown Core"           "Geylang"                
## [13] "Hougang"                 "Jurong East"            
## [15] "Jurong West"             "Kallang"                
## [17] "Lim Chu Kang"            "Mandai"                 
## [19] "Marine Parade"           "Museum"                 
## [21] "Newton"                  "Novena"                 
## [23] "Orchard"                 "Outram"                 
## [25] "Pasir Ris"               "Punggol"                
## [27] "Queenstown"              "River Valley"           
## [29] "Rochor"                  "Seletar"                
## [31] "Sembawang"               "Sengkang"               
## [33] "Serangoon"               "Singapore River"        
## [35] "Southern Islands"        "Sungei Kadut"           
## [37] "Tampines"                "Tanglin"                
## [39] "Toa Payoh"               "Western Water Catchment"
## [41] "Woodlands"               "Yishun"

Keep only the rows whose planning area has residents by filtering on that list.

clean_data <- clean_data %>%
  filter(PA %in% non_zero_pa)
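The two-step removal of empty planning areas can be checked on a small example. This is a sketch on hypothetical data (`toy`, with made-up areas "A", "B" and "C"), assuming dplyr 1.0 or later for the `.groups` argument.

```r
library(dplyr)

# Hypothetical toy data: area "B" has no residents at all
toy <- tibble(
  PA  = c("A", "A", "B", "B", "C"),
  Pop = c(10, 0, 0, 0, 5)
)

# Step 1: total population per area, keeping only non-zero totals
non_zero <- toy %>%
  group_by(PA) %>%
  summarise(Pop = sum(Pop), .groups = "drop") %>%
  filter(Pop != 0) %>%
  pull(PA)

# Step 2: keep only the rows of areas that have residents
toy_kept <- toy %>%
  filter(PA %in% non_zero)
```

Rows with a zero count inside a populated area (the second row of "A") are kept; only areas whose total is zero are dropped.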

2.2.3 Change row value

As stated in the challenges, age group “5_to_9” will not be sorted correctly. The code chunk below replaces every “5_to_9” with “05_to_09”.

clean_data$AG[clean_data$AG == "5_to_9"] <- "05_to_09"
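Renaming relies on alphabetical ordering, whose details can vary with the locale. An alternative, shown here only as a sketch and not the approach used in this assignment, is to turn the age group into a factor whose levels are ordered by the leading number of each label. The vector `ag` is hypothetical sample data.

```r
# Hypothetical sample of age-group labels, in the format of the AG column
ag <- c("45_to_49", "5_to_9", "0_to_4", "90_and_over", "10_to_14", "50_to_54")

# Extract the leading number of each label (its lower age bound)
lower_bound <- as.integer(sub("_.*$", "", ag))

# Order the unique labels by that number and use them as factor levels
ag_levels <- unique(ag[order(lower_bound)])
ag_factor <- factor(ag, levels = ag_levels)

levels(ag_factor)
```

ggplot2 places discrete values in the order of the factor levels, so the axis would follow the numeric age order without renaming any label.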

2.2.4 Mutate Features

As the population pyramid is split by Sex, the population count of one sex has to be made negative so that its bars extend in the opposite direction. Here, the Females population count is made negative in the code chunk below.

clean_data <- clean_data %>%
  mutate(Pop = ifelse(Sex == "Females", -Pop,Pop))

2.2.5 Aggregate Data for Visualisation

The code chunk below does the following:
* Group the data by Planning Area, Age Group and Sex
* Aggregate the data with the summarise function by summing the population count within each group
This allows us to build the pyramid chart with the correct population count.

clean_data <- clean_data %>%
  group_by(PA,AG,Sex) %>%
  summarise(Pop = sum(Pop))

## `summarise()` regrouping output by 'PA', 'AG' (override with `.groups` argument)
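The message above is informational: the result is still grouped by PA and AG. If an ungrouped result is preferred, `.groups = "drop"` can be passed to summarise. A minimal sketch on hypothetical data (`toy`), assuming dplyr 1.0 or later:

```r
library(dplyr)

# Hypothetical toy data; Females counts are already negative
toy <- tibble(
  PA  = c("A", "A", "A", "B"),
  AG  = c("0_to_4", "0_to_4", "0_to_4", "0_to_4"),
  Sex = c("Males", "Males", "Females", "Males"),
  Pop = c(10, 20, -15, 5)
)

# Sum within each PA / AG / Sex combination and drop the grouping
toy_summary <- toy %>%
  group_by(PA, AG, Sex) %>%
  summarise(Pop = sum(Pop), .groups = "drop")

toy_summary
```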

2.2.6 Final Visualisation

The code chunk below will plot the final visualisation:

Pass the data and the x, y and fill aesthetics to the ggplot function. The data is the cleaned data, x is the age group, y is the population count and fill is the Sex.
Add geom_bar to the ggplot object. stat is set to identity because the bar length should equal the population count.
Flip the axes with coord_flip() so that age group is on the vertical axis and population count is on the horizontal axis.
facet_share() is a function from “ggpol” that lets the male and female panels share one axis, forming a single pyramid chart. dir = “h” means the direction is horizontal. Note that a ggplot object keeps only one facet specification, so the facet_wrap() call added afterwards replaces facet_share() in the code below; the pyramid shape then comes from the negative Females values.
Apply facet_wrap on planning area so that there is one population pyramid per planning area. Even after removing the planning areas with no population, many planning areas remain, so ncol is used to make sure each distribution can be seen properly.
Fix the y-axis scale to the range -15000 to 15000. labels = abs removes the negative sign on the Females side of the axis (Females contains all the negative values).
Set the labels for the x-axis and y-axis as well as the title of the visualisation.

Note that the code chunk uses the settings fig.width = 12 and fig.height = 24 so that the chart is displayed properly.

ggplot(clean_data, aes(x = AG, y = Pop, fill = Sex)) +
  geom_bar(stat = "identity")+ 
  coord_flip() +
  facet_share(~Sex, dir = "h", reverse_num = TRUE)  +
  facet_wrap(~ PA, ncol = 5)+
  scale_y_continuous(labels = abs,
                     limits = c(-15000, 15000))+
  labs(x = "Age Group", y = "Population Count", title = "Singapore Population Distribution by Age Group and Planning Area in 2019", caption = "\n\n Data source: Singstat, https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data")+
  theme(plot.title = element_text(size = 18, face = "bold",hjust = 0.5),
        axis.title = element_text(face = "bold"),
        plot.caption = element_text(hjust =0))
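The same structure can be tried on a small example that needs only ggplot2. This is a sketch on hypothetical data (`toy`, with made-up areas and counts), not the chart produced above. Instead of a fixed range, it derives a symmetric axis limit from the data, since by default ggplot2 treats values outside the scale limits as missing.

```r
library(ggplot2)

# Hypothetical toy data: two planning areas, three age groups, both sexes
toy <- data.frame(
  PA  = rep(c("Area A", "Area B"), each = 6),
  AG  = rep(rep(c("0_to_4", "05_to_09", "10_to_14"), each = 2), times = 2),
  Sex = rep(c("Males", "Females"), times = 6),
  Pop = c(120, 110, 150, 140, 90, 95, 60, 70, 80, 75, 40, 45)
)

# Females are drawn to the left of zero
toy$Pop <- ifelse(toy$Sex == "Females", -toy$Pop, toy$Pop)

# Symmetric limit taken from the largest absolute count
limit <- max(abs(toy$Pop))

p <- ggplot(toy, aes(x = AG, y = Pop, fill = Sex)) +
  geom_col() +                        # same as geom_bar(stat = "identity")
  coord_flip() +
  facet_wrap(~ PA, ncol = 2) +
  scale_y_continuous(labels = abs, limits = c(-limit, limit))

p
```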

3 Short Description of Visualisation

The population pyramids help us to visualise the population distribution across the planning areas. However, even after removing planning areas with no population, some areas are still sparsely populated (e.g. Orchard, Singapore River).

Most of the population pyramids, such as those of Choa Chu Kang, Serangoon and Tampines, have a dumbbell shape with two peaks, at 50-64 years and 20-34 years. The 50-64 peak is expected, as this group belongs to the baby boomer generation. The second peak at 20-34 years is more interesting: Singapore's fertility rate is generally understood to have been declining over the years, yet there is still a large proportion of young adults across most planning areas.
Unlike most planning areas, Punggol and Sengkang have a very different population distribution, where the peaks are concentrated between the mid-thirties and forties. They also have a relatively high share of children below 15 years. This suggests that most residents in these areas are young couples, many of whom are having children, which explains the second peak for children below 15 years. With so many young couples concentrated in these planning areas, this might also explain why other planning areas have their second peak at 20-34 years instead of a smooth downward slope.
In more mature estates such as Ang Mo Kio and Toa Payoh, the distribution looks roughly bell-shaped and leans towards the elderly population. These could be areas where older residents remain after their children have moved out to start their own families, or where families tend not to have children, so that these areas are starting to face the problems of an ageing population.
Across all planning areas and age groups, the population appears to be evenly split between the two sexes. Despite the popular belief that Chinese families prefer sons, Singapore seems to have achieved a good balance between both genders.

4 Conclusion

Different planning areas have different population structures. Some of the hypotheses above will require more in-depth analysis to verify.

IS 428: Assignment_4

Fabian

10/20/2020