IS428 Visual Analytics for Business Intelligence

Overview

The objective of this data visualisation is to give insights into the demographic structure of Singapore’s population by age cohort and planning area in 2019. The data set, Singapore Residents by Planning Area/Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020, was obtained from the Department of Statistics, Singapore.

Major data and design challenges

The first major challenge was figuring out how to aggregate the data to suit the different visualisations. Plotting the charts I wanted directly from the raw data would have been difficult, so each graph uses its own aggregated data frame. Additionally, 13 of the 55 planning areas in Singapore had no residents recorded in 2019, and these were excluded from the charts.

The second major challenge relates to the design of the visualisation. For example, I had to ensure that the age groups are ordered correctly and that the ticks on the axes are labelled and oriented in a way the reader can easily understand. I also changed the theme of some of the charts to make the visualisation clearer.

Proposed sketched design

design sketch

Step-by-step description of how the data visualisation was prepared

Step 1: load the required packages and read the data

# Packages needed for this report
packages <- c('tidyverse')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

## Loading required package: tidyverse

## Warning: package 'tidyverse' was built under R version 3.6.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

raw_data <- read_csv("data.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   PA = col_character(),
##   SZ = col_character(),
##   AG = col_character(),
##   Sex = col_character(),
##   TOD = col_character(),
##   Pop = col_double(),
##   Time = col_double()
## )

Step 2: data preparation

First, I filtered the data to keep only 2019 data.

raw_data_2019 <- raw_data %>% filter(Time == 2019)

Then, I aggregated the data in different ways to suit the visualisations I wanted to show. For example, to obtain the data for the population pyramid, I summed the population column for each combination of age group and gender.

The “AG” column is stored as text, so when the graph is plotted the age groups are sorted alphabetically and “5_to_9” is placed after “45_to_49” on the axis, which is incorrect. To fix this, I created another column called startAge that stores the starting age of each age range, converted to a numeric value with the as.numeric function.
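To illustrate the ordering problem, the minimal sketch below uses a few hypothetical labels in the same format as the “AG” column. The vector `ag` is made up for illustration and is not taken from the data set.

```r
# Illustrative labels in the same format as the AG column (not the real data)
ag <- c("0_to_4", "45_to_49", "5_to_9", "10_to_14", "90_and_over")

# Character sort: strings are compared position by position,
# so "5_to_9" ends up after "45_to_49"
char_sorted <- sort(ag)
char_sorted

# Keep everything before the first underscore and convert it to a number
start_age <- as.numeric(sub("_.*", "", ag))

# Ordering by the numeric starting age gives the intended sequence
ag_sorted <- ag[order(start_age)]
ag_sorted
```

The same idea is what reorder(AG, startAge) relies on in the plots: the text labels are kept for display, while the numeric column decides their position.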

# Population by age group and sex, for the population pyramid
total_pop_2019 <- aggregate(Pop ~ AG + Sex, raw_data_2019, sum) %>%
  mutate(startAge = as.numeric(sub("_.*", "", AG)))

# Total population per planning area
pa_pop <- aggregate(Pop ~ PA, raw_data_2019, sum)

# Population by planning area and age group
pa_age <- aggregate(Pop ~ PA + AG, raw_data_2019, sum) %>%
  mutate(startAge = as.numeric(sub("_.*", "", AG)))

# Population by type of dwelling; the first word of TOD gives the broad type
tod_pop <- aggregate(Pop ~ TOD, raw_data_2019, sum) %>%
  mutate(type_of_dwelling = sub(" .*", "", TOD))

# Total population per broad dwelling type
tod_pop2 <- aggregate(Pop ~ type_of_dwelling, tod_pop, sum)
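The grouping by type of dwelling in the code above can be sketched on a small example. The labels and counts in `toy` are hypothetical stand-ins for the TOD and Pop columns, chosen only to show how keeping the first word collapses detailed categories into broader ones.

```r
# Hypothetical dwelling labels and counts (not the real data)
toy <- data.frame(
  TOD = c("HDB 3-Room Flats", "HDB 4-Room Flats", "Landed Properties", "Others"),
  Pop = c(100, 200, 50, 10),
  stringsAsFactors = FALSE
)

# Keep only the text before the first space, e.g. "HDB 3-Room Flats" becomes "HDB".
# A label without a space, such as "Others", is left unchanged.
toy$type_of_dwelling <- sub(" .*", "", toy$TOD)

# Sum the population within each broad dwelling type
toy_grouped <- aggregate(Pop ~ type_of_dwelling, toy, sum)
toy_grouped
```

One caveat of this approach is that it depends entirely on the wording of the labels: two categories are merged only if they share the same first word.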

I realised that some planning areas have 0 residents in total. I used the filter function to identify these planning areas so that they can be excluded from the data visualisation.

no_pop <- filter(pa_pop, Pop == 0)
print(no_pop$PA)

##  [1] "Boon Lay"                "Central Water Catchment"
##  [3] "Changi Bay"              "Marina East"            
##  [5] "Marina South"            "North-Eastern Islands"  
##  [7] "Paya Lebar"              "Pioneer"                
##  [9] "Simpang"                 "Straits View"           
## [11] "Tengah"                  "Tuas"                   
## [13] "Western Islands"
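The exclusion itself happens later, when each plot subsets its data with %in%. A minimal sketch of that pattern is shown below, using base R subsetting and made-up totals; the planning area names are real, but the numbers in `toy_pa_pop` are not.

```r
# Made-up totals shaped like pa_pop (the numbers are illustrative only)
toy_pa_pop <- data.frame(
  PA  = c("Bedok", "Tuas", "Punggol", "Simpang"),
  Pop = c(300, 0, 150, 0),
  stringsAsFactors = FALSE
)

# Planning areas whose total population is zero
toy_no_pop <- toy_pa_pop[toy_pa_pop$Pop == 0, ]

# Keep only the rows whose PA is not in the zero-population list
toy_kept <- toy_pa_pop[!toy_pa_pop$PA %in% toy_no_pop$PA, ]
toy_kept$PA
```

Keeping the list of empty planning areas in its own data frame means the same list can be reused for every data frame that has a PA column, not just the one it was derived from.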

Step 3: plotting the graphs

Population Pyramid

The first graph I will plot is the population pyramid. I use the newly created “startAge” column to reorder the original “AG” column, so the age column is now sorted in the correct order.

ggplot(data=total_pop_2019,aes(x=reorder(AG, startAge), y=Pop, fill=Sex)) + 
  geom_bar(stat="identity", 
           data=subset(total_pop_2019, Sex=="Females"),
           color="grey30") + 
  geom_bar(stat="identity", 
           data=subset(total_pop_2019, Sex=="Males"),
           aes(y=Pop*(-1)),
           color="grey30") +
  scale_y_continuous(
    breaks=seq(-150000,150000,50000),
    labels=paste0(as.character(c(3:0*50, 1:3*50)), "K")) +
  coord_flip() +
  xlab("Age") +
  theme_bw()
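The axis labels in the code above rely on R evaluating `:` before `*`, which is easy to misread. The sketch below rebuilds the labels step by step; the variable names are introduced here for illustration only.

```r
# `:` is evaluated before `*`, so 3:0 * 50 means (3, 2, 1, 0) * 50
left_and_centre <- 3:0 * 50
right <- 1:3 * 50

# Labels in thousands, mirrored around zero
axis_labels <- paste0(c(left_and_centre, right), "K")
axis_breaks <- seq(-150000, 150000, 50000)

# One label per break, so the two vectors must have the same length
length(axis_labels) == length(axis_breaks)

# An alternative that derives the labels from the breaks directly
alt_labels <- paste0(abs(axis_breaks) / 1000, "K")
identical(axis_labels, alt_labels)
```

Deriving the labels from the breaks keeps the two in sync if the range of the axis is changed later.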

The population pyramid shows the distribution of age and gender in Singapore, and it reveals several insights. Firstly, the pyramid is widest in the middle, indicating a large number of adults aged 20 to 64 years relative to seniors aged 65 years and above. The ratio between these two groups is known as the old-age support ratio. Secondly, there are more females than males aged 65 and above. Since the numbers of males and females in the younger age groups are roughly equal, this suggests that females have a longer life expectancy than males. Thirdly, the base of the pyramid narrows, meaning that the youngest cohorts are smaller than the cohorts before them. This corresponds with the downward trend in the resident total fertility rate, as seen below.

Figure: Resident Total Fertility Rate in Singapore (Source: SingStat Infographic)
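The old-age support ratio mentioned above can be computed from a data frame that has a startAge column. The sketch below assumes the ratio is defined as residents aged 20 to 64 per resident aged 65 and above, and it uses made-up counts in `toy_pop` rather than the 2019 figures.

```r
# Made-up counts by starting age of each age band (not the 2019 figures)
toy_pop <- data.frame(
  startAge = c(15, 20, 40, 60, 65, 80),
  Pop      = c(50, 300, 400, 200, 120, 30)
)

# Five-year bands starting at 20 up to 60 cover ages 20 to 64
working_age <- sum(toy_pop$Pop[toy_pop$startAge >= 20 & toy_pop$startAge <= 60])

# Bands starting at 65 or later are counted as seniors
seniors <- sum(toy_pop$Pop[toy_pop$startAge >= 65])

# Working-age residents per senior
support_ratio <- working_age / seniors
support_ratio
```

With data shaped like total_pop_2019, the same sums would run across both sexes, since each age group appears once for males and once for females.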

Bar chart

The second graph is a bar chart showing the total population within each planning area.

ggplot(data=pa_pop[!pa_pop$PA %in% no_pop$PA, ],
       aes(x=Pop, y=reorder(PA,Pop))) +
  geom_bar(stat="identity") +
  ylab("Planning Area") + 
  scale_x_continuous(labels = scales::comma, breaks=c(0, 50000, 100000, 150000, 200000, 250000)) +
  theme_bw()

Bedok has the largest population, followed by Jurong West and Tampines.

Trellis Plot (age group)

The third visualisation is a trellis plot that shows the number of residents per age group for each planning area.

ggplot(data=pa_age[!pa_age$PA %in% no_pop$PA, ],
       aes(x=reorder(AG, startAge), y=Pop)) +
  geom_bar(stat="identity") + 
  facet_wrap(~PA, nrow = 11, ncol = 5) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  xlab("Age")

It is interesting to note that newer estates like Punggol have a large number of young children (age 0 to 9) and adults (age 30 to 44), which indicates that many young families live there. In contrast, more mature estates like Ang Mo Kio and Bedok have a larger population of seniors than young children.

Trellis Plot (type of dwelling)

This graph shows the number of residents by type of dwelling in each planning area.

ggplot(data=raw_data_2019[!raw_data_2019$PA %in% no_pop$PA, ],
       aes(x=TOD, y=Pop)) +
  geom_bar(stat="identity") + 
  facet_wrap(~PA, nrow = 11, ncol = 5) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We can see that most residents stay in HDB flats, which corresponds with the 78.42% statistic shown in the stacked bar chart below. Most of these HDB flats are in the heartlands, for example, Tampines and Jurong West. Bukit Timah stands out for having more residents living in condominiums and landed properties, which is in line with its reputation as an affluent neighbourhood popular with expats and well-to-do Singaporeans.

To plot the stacked bar chart below, I created a dummy variable called ‘row’ and another column ‘percent’ to store the percentage of residents staying in each type of dwelling.

# dummy variable
tod_pop2$row <- 1

tod_pop2$percent <- round((tod_pop2$Pop/sum(tod_pop2$Pop))*100,2)
  
ggplot(data=tod_pop2,
       aes(x=row, fill=type_of_dwelling, y=Pop)) +
  geom_bar(stat="identity") +
  geom_text(aes(label = percent), 
        position = position_stack(vjust = 0.5)) + 
  theme_bw() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) +
  ylab("Percentage of Total Population (%)")
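The percentage calculation used for the labels can be checked on a small example. The totals in `toy_tod` are made up so that the arithmetic is easy to follow, and the `label` column is an optional addition that is not part of the original code.

```r
# Made-up totals shaped like tod_pop2 (not the 2019 figures)
toy_tod <- data.frame(
  type_of_dwelling = c("HDB", "Condominiums", "Landed", "Others"),
  Pop = c(780, 140, 60, 20),
  stringsAsFactors = FALSE
)

# Share of the total population, as a percentage rounded to two decimal places
toy_tod$percent <- round(toy_tod$Pop / sum(toy_tod$Pop) * 100, 2)

# Optional: append a percent sign so the text labels carry their unit
toy_tod$label <- paste0(toy_tod$percent, "%")
toy_tod
```

Because each share is rounded separately, the percentages in real data may not add up to exactly 100.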

IS428 Visual Analytics for Business Intelligence | Assignment 4

Leandra Lee

3/14/2021