IS428 Visual Analytics and Applications - Assignment 4

Overview

The data was collected from the Singapore Department of Statistics. It shows the distribution of the resident population (rounded to the nearest 10) across the different areas of Singapore for each year from 2011 to 2020. The distribution is broken down by year, age group, gender and type of dwelling in each planning area. The data consists of the following fields:

It would be interesting to look at how the population of the different age groups is distributed across the planning areas and the types of dwelling that residents live in.

Major data and design challenges

There are a few data and design challenges when doing the visualisation:

Proposed sketched design

Source: Singapore Department of Statistics

Step by Step Guide

Step 1: Install and load packages

ggplot2 is a part of the tidyverse, an ecosystem of packages designed with common APIs and a shared philosophy.

# Packages used in this guide
packages <- c("tidyverse", "plotly", "scales", "grid")

# Install any package that is missing, then load it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

Step 2: Load and extract 2019 data

demographics <- read_csv("data/respopagesextod2011to2020.csv")

# Keep only the rows for 2019
demo_2019 <- filter(demographics, Time == 2019)
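
Before moving on, it can help to confirm that the filtered data looks as expected. The sketch below is only an illustration: it uses base R and a small made-up data frame (toy) instead of the real file, and the column names are the ones that the later steps of this guide refer to.

```r
# Toy stand-in for the real CSV; all values are made up for illustration.
toy <- data.frame(
  PA   = c("Bedok", "Bedok", "Yishun", "Yishun"),
  AG   = c("0_to_4", "0_to_4", "5_to_9", "5_to_9"),
  Sex  = c("Males", "Males", "Females", "Females"),
  TOD  = c("HDB 4-Room Flats", "HDB 4-Room Flats",
           "Landed Properties", "Landed Properties"),
  Pop  = c(100, 110, 200, 210),
  Time = c(2018, 2019, 2018, 2019),
  stringsAsFactors = FALSE
)

# Columns that the later steps of this guide refer to.
required_cols <- c("PA", "AG", "Sex", "TOD", "Pop", "Time")
missing_cols  <- setdiff(required_cols, names(toy))

# Base R equivalent of filter(toy, Time == 2019).
toy_2019 <- toy[toy$Time == 2019, ]

missing_cols               # empty if every expected column is present
unique(toy_2019$Time)      # should contain 2019 only
sum(is.na(toy_2019$Pop))   # number of missing population values
```

On the real data, the same three checks can be run on demographics and demo_2019.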

Step 3: Create new columns in dataframe

3.1. Categorise each age cohort into their Age categories

There are a total of 19 age cohorts (‘AG’ column) in the data. It would be too tedious for readers to compare all 19 age cohorts across the different planning areas. Thus, we will create a new column that segments the population into 5 age categories based on the table below.

TABLE 1

Age Category    Age cohorts (5-year range)
Children        ‘0_to_4’, ‘5_to_9’, ‘10_to_14’
Young Adult     ‘15_to_19’, ‘20_to_24’, ‘25_to_29’
Adult           ‘30_to_34’, ‘35_to_39’, ‘40_to_44’
Middle age      ‘45_to_49’, ‘50_to_54’, ‘55_to_59’
Elderly         ‘60_to_64’, ‘65_to_69’, ‘70_to_74’, ‘75_to_79’, ‘80_to_84’, ‘85_to_89’, ‘90_and_over’

Below is the code for grouping the different age cohorts into their respective age category based on the table above.

#group the age cohort into the different age category
children <- c('0_to_4','5_to_9','10_to_14')
young_adult <- c('15_to_19','20_to_24','25_to_29')
adult <- c('30_to_34', '35_to_39', '40_to_44')
middleage<-c('45_to_49', '50_to_54', '55_to_59')
elderly <- c('60_to_64', '65_to_69', '70_to_74', '75_to_79', '80_to_84', '85_to_89', '90_and_over')
  
#create a new column in the dataframe
demo_2019_ <- demo_2019 %>% mutate(Age_category = 
                                case_when(`AG` %in% children ~ "Children (0-14)",
                                  `AG` %in% young_adult ~ "Young Adult (15-29)",
                                  `AG` %in% adult ~ "Adult (30-44)",
                                  `AG` %in% middleage ~ "Middle age (45-59)",
                                  `AG` %in% elderly ~ "Elderly (>60)",
                                   TRUE ~ ""))
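
Because the case_when() call falls back to an empty string, a cohort label that is missing from the five vectors would silently end up uncategorised. The sketch below is one possible check, written in base R. The vector ag_in_data is a made-up stand-in for unique(demo_2019$AG), and its last entry is a hypothetical label added only to show what a mismatch looks like.

```r
# The five groups from Table 1.
children    <- c("0_to_4", "5_to_9", "10_to_14")
young_adult <- c("15_to_19", "20_to_24", "25_to_29")
adult       <- c("30_to_34", "35_to_39", "40_to_44")
middleage   <- c("45_to_49", "50_to_54", "55_to_59")
elderly     <- c("60_to_64", "65_to_69", "70_to_74", "75_to_79",
                 "80_to_84", "85_to_89", "90_and_over")

known <- c(children, young_adult, adult, middleage, elderly)

# Made-up stand-in for unique(demo_2019$AG); "95_and_over" is hypothetical.
ag_in_data <- c("0_to_4", "45_to_49", "90_and_over", "95_and_over")

# Labels in the data that no group covers (these would get "").
unmatched <- setdiff(ag_in_data, known)

length(known)           # 19, matching the number of age cohorts
anyDuplicated(known)    # 0 means no cohort is listed in two groups
unmatched
```

An empty unmatched vector on the real data means every row receives one of the five categories.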

3.2. Categorise each Planning Areas based on Region

Since there are many planning areas in Singapore, it would be difficult for readers to take in so much information. Thus, we group the planning areas (PA column) into 5 regions based on the table below.

TABLE 2

Region              Planning Areas
North Region        Sungei Kadut, Yishun, Mandai, Woodlands, Simpang, Sembawang, Central Water Catchment, Lim Chu Kang
North-east Region   Hougang, Punggol, Ang Mo Kio, Sengkang, Seletar, North-Eastern Islands, Serangoon
Central Region      Kallang, Marina East, Museum, Novena, Singapore River, Rochor, Bukit Timah, Downtown Core, Marina South, Newton, Orchard, Queenstown, Southern Islands, Toa Payoh, Bishan, Bukit Merah, Geylang, Marine Parade, Outram, River Valley, Tanglin, Straits View
East Region         Changi, Pasir Ris, Bedok, Changi Bay, Tampines, Paya Lebar
West Region         Boon Lay, Bukit Panjang, Clementi, Tengah, Western Islands, Bukit Batok, Jurong East, Western Water Catchment, Choa Chu Kang, Jurong West, Tuas, Pioneer

If you want to look at all the planning areas from the data file, use the code below.

unique(demo_2019_$PA)

Below is the code for grouping the different planning areas into their respective regions based on the table above.

#group the planning areas into the different regions
north<- c( "Sungei Kadut","Yishun", "Mandai", "Woodlands", "Simpang", "Sembawang","Lim Chu Kang","Central Water Catchment")
northeast<- c("Hougang","Punggol","Ang Mo Kio","Sengkang", "Seletar", "North-Eastern Islands", "Serangoon" )
central<-c("Kallang","Marina East","Museum","Novena","Singapore River","Rochor", "Bukit Timah","Downtown Core", "Marina South", "Newton", "Orchard","Queenstown", "Southern Islands", "Toa Payoh","Bishan", "Bukit Merah","Geylang", "Marine Parade", "Outram" , "River Valley", "Tanglin", "Straits View")
east<-c("Changi","Pasir Ris","Bedok", "Changi Bay","Tampines","Paya Lebar" )
west<-c("Boon Lay" ,"Bukit Panjang", "Clementi","Tengah","Western Islands", "Bukit Batok", "Jurong East", "Western Water Catchment", "Choa Chu Kang","Jurong West", "Tuas","Pioneer")

#create a new column in the dataframe
demo_2019__ <- demo_2019_ %>% mutate(Region = 
                                case_when(`PA` %in% north ~ "North Region",
                                  `PA` %in% northeast ~ "North-east Region",
                                  `PA` %in% central ~ "Central Region",
                                  `PA` %in% east ~ "East Region",
                                  `PA` %in% west ~ "West Region",
                                   TRUE ~ ""))
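
As with the age categories, a planning area that is misspelt or missing from the five vectors would receive an empty Region. The sketch below shows one way to check for this in base R using a named lookup vector. It lists only a few planning areas per region to keep the example short, pa_in_data is a made-up stand-in for unique(demo_2019_$PA), and "Atlantis" is a deliberately invalid name.

```r
# Shortened region lists for illustration; the full lists are in Table 2.
regions <- list(
  "North Region" = c("Yishun", "Woodlands", "Sembawang"),
  "East Region"  = c("Bedok", "Tampines", "Pasir Ris")
)

# Named lookup vector: names are planning areas, values are regions.
lookup <- setNames(rep(names(regions), lengths(regions)),
                   unlist(regions, use.names = FALSE))

# Made-up stand-in for unique(demo_2019_$PA); "Atlantis" is not a real area.
pa_in_data <- c("Bedok", "Yishun", "Atlantis")

# Indexing by a name that is not in the lookup returns NA.
region    <- unname(lookup[pa_in_data])
unmatched <- pa_in_data[is.na(region)]

anyDuplicated(names(lookup))   # 0 means no area is listed under two regions
region
unmatched
```

On the real data, an empty unmatched vector confirms that every planning area was assigned to a region.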

Step 4: Check for unusual data

This step confirms that it is safe to proceed with the grouping done in the previous step. If there were any unusual findings, for example if 80% of the age cohort 45_to_49 were female, we might not be able to group the age cohorts into categories, because the grouped visualisation would hide that information.

The code below is to check for unusual data in the age cohorts.

#Reorder the age cohort based on the age
demo_2019 %>% 
  dplyr::mutate(AG = factor(AG, 
                            levels = c('0_to_4','5_to_9','10_to_14','15_to_19','20_to_24', '25_to_29', '30_to_34', '35_to_39', '40_to_44', '45_to_49', '50_to_54', '55_to_59', '60_to_64', '65_to_69', '70_to_74', '75_to_79', '80_to_84', '85_to_89', '90_and_over'))) %>%    

#Plot graph
ggplot(aes(x=AG, y=Pop))+geom_col(aes(fill=Sex))+ scale_y_continuous(labels = comma)+theme(axis.text.x = element_text(angle=45, vjust=1,hjust=1))

Based on the chart above, the age cohorts do not contain any unusual data, so it is safe to group them into age categories (Children, Young Adult, etc.).
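
The visual check can also be backed by numbers. The sketch below computes the share of females in each age cohort and flags cohorts that are far from an even split. It uses base R and made-up values. The labels "Females" and "Males" and the 10-percentage-point threshold are assumptions made for this illustration.

```r
# Made-up data: the second cohort mirrors the 80% example mentioned earlier.
toy <- data.frame(
  AG  = c("40_to_44", "40_to_44", "45_to_49", "45_to_49"),
  Sex = c("Females", "Males", "Females", "Males"),
  Pop = c(520, 480, 800, 200),
  stringsAsFactors = FALSE
)

# Population totals as a cohort-by-sex matrix.
totals <- tapply(toy$Pop, list(toy$AG, toy$Sex), sum)

# Share of females within each cohort.
female_share <- totals[, "Females"] / rowSums(totals)

# Flag cohorts more than 10 percentage points away from an even split
# (the threshold is arbitrary and chosen only for this sketch).
flagged <- names(female_share)[abs(female_share - 0.5) > 0.1]

round(female_share, 2)
flagged
```

With the real data, toy would be replaced by demo_2019, and an empty flagged vector would support the conclusion drawn from the chart.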

As mentioned earlier in the data and design challenges, the chart below shows that the visualisation is indeed very cluttered when all planning areas are shown.

The code below is to show that there are too many planning areas which might be confusing for readers.

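# Order the planning areas by total population (reorder() uses the mean unless FUN is given)
ggplot(data = demo_2019_,
       aes(x = reorder(PA, -Pop, FUN = sum), y = Pop)) +
  geom_col(aes(fill = Age_category)) +
  scale_y_continuous(labels = comma) +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1), legend.position = "bottom")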
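
One detail worth knowing when ordering stacked bars: reorder() summarises the ordering variable with the mean by default, so planning areas with many small rows can be ranked below areas with a few large rows even when their totals are higher. Ordering by total requires FUN = sum. The sketch below shows the difference on made-up values.

```r
# Made-up example: area "A" has three small rows, area "B" has one large row.
pa  <- c("A", "A", "A", "B")
pop <- c(10, 10, 10, 25)

# Default: ordered by the mean of -pop, so "B" (mean 25) comes before "A" (mean 10).
by_mean <- reorder(pa, -pop)

# With FUN = sum: ordered by the total, so "A" (total 30) comes before "B" (total 25).
by_sum <- reorder(pa, -pop, FUN = sum)

levels(by_mean)
levels(by_sum)
```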

Step 5: Other findings

The code below shows that the numbers of males and females are almost the same in each age category, except for the Adult and Elderly categories, which have relatively more females than males.

# Order the age categories, then total the population by age category and sex
p <- ggplot(demo_2019__ %>%
              mutate(Age_category = factor(Age_category,
                            levels = c('Children (0-14)','Young Adult (15-29)','Adult (30-44)','Middle age (45-59)','Elderly (>60)'))) %>%
              group_by(Age_category, Sex) %>%
              summarise(Pop_total = sum(Pop), .groups = "drop"),
            aes(x = Age_category, y = Pop_total, fill = Sex))
p + geom_col(position = "dodge") + scale_y_continuous(labels = comma)

The following two pie charts are plotted using the plotly library.

The code below will show the pie chart of the population distribution by types of dwelling in 2019.

TOD_distribution <- plot_ly(data=demo_2019, labels = ~TOD, values = ~Pop, type = 'pie')
TOD_distribution <- TOD_distribution %>% layout(title = 'Types of Dwelling population distribution in 2019',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

TOD_distribution

The code below will show the pie chart of the population distribution by age category in 2019.

AG_distribution <- plot_ly(data=demo_2019_, labels = ~Age_category, values = ~Pop, type = 'pie')
AG_distribution <- AG_distribution %>% layout(title = 'Age category population distribution in 2019',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

AG_distribution
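
To cross-check the percentages shown in the pie charts, the shares can also be computed directly. The sketch below totals the population by type of dwelling and converts the totals to percentages. It uses base R and made-up values, so the numbers are not the real 2019 figures.

```r
# Made-up data: several rows can share the same type of dwelling.
toy <- data.frame(
  TOD = c("HDB 4-Room Flats", "HDB 4-Room Flats", "Landed Properties", "Others"),
  Pop = c(300, 100, 80, 20),
  stringsAsFactors = FALSE
)

# Total population for each type of dwelling.
totals <- tapply(toy$Pop, toy$TOD, sum)

# Convert the totals to percentages of the overall population.
share <- round(100 * prop.table(totals), 1)

share
```

Replacing toy with demo_2019 (or TOD with Age_category on demo_2019_) gives figures that can be compared with the labels in the pie charts.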

Step 6: Plot the charts

6.1 Plot bar chart based on population’s age category by region.

The code below will show the charts on the population distribution of the different age categories by regions in 2019.

#Reorder the age cohort based on the age
demo_2019__ %>% 
  dplyr::mutate(Age_category = factor(Age_category, 
                            levels = c('Children (0-14)','Young Adult (15-29)','Adult (30-44)','Middle age (45-59)','Elderly (>60)'))) %>%
  
#Plot the graph
ggplot(
aes(fill=Region, y=Pop, x=Region))+ 
geom_col() + 
  theme(axis.text.x = element_blank(), legend.position = "none") + facet_wrap(~Age_category,ncol=5) + scale_y_continuous(labels = comma)+
  ggtitle("Demographic structure of Singapore's Population in 2019") ->db1
# "-> db1" assigns the finished plot to the object db1 so that the two charts can be combined in the final visualisation

To see what the visualisation looks like, remove "-> db1" and set legend.position = "bottom" in the code above.
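
The mutate() call at the top of the code above converts Age_category to a factor so that the facets appear in age order instead of alphabetical order. The sketch below illustrates the idea in base R with made-up labels. The last label is a deliberate mismatch, included to show that any value not listed in levels becomes NA.

```r
# Desired display order of the age categories.
cat_levels <- c("Children (0-14)", "Young Adult (15-29)", "Adult (30-44)",
                "Middle age (45-59)", "Elderly (>60)")

# Made-up labels; "Elderly (60+)" is deliberately not one of the levels.
labels_in_data <- c("Elderly (>60)", "Adult (30-44)",
                    "Children (0-14)", "Elderly (60+)")

as_factor <- factor(labels_in_data, levels = cat_levels)

levels(as_factor)       # follows cat_levels, not alphabetical order
sum(is.na(as_factor))   # counts labels that did not match any level
```

A non-zero NA count on the real data would point to a label that is spelt differently in the case_when() step and in the levels vector.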

6.2 Plot bar chart based on population’s Type of dwelling (TOD) by region.

The code below will show the charts on the population distribution of the types of dwelling by regions in 2019.

#the dplyr mutate function is to reorder the different fields according to how you want it to be ordered

demo_2019__ %>% 
  dplyr::mutate(Age_category = factor(Age_category, 
                            levels = c('Children (0-14)','Young Adult (15-29)','Adult (30-44)','Middle age (45-59)','Elderly (>60)'))) %>%dplyr::mutate(TOD = factor(TOD, 
                            levels = c("HDB 1- and 2-Room Flats","HDB 3-Room Flats","HDB 4-Room Flats","HDB 5-Room and Executive Flats", "HUDC Flats (excluding those privatised)","Condominiums and Other Apartments", "Landed Properties", "Others"))) %>%     
  
#plot the graph  
ggplot(
aes(fill=Region, y=Pop, x=Region))+ 
geom_col() + 
  theme(axis.text.x = element_blank(), legend.position = "bottom", legend.text=element_text(size=7)) + facet_wrap(~TOD,ncol=8,labeller = label_wrap_gen(width = 8)) + scale_y_continuous(labels = comma) ->db2 

# "-> db2" assigns the finished plot to the object db2 so that the two charts can be combined in the final visualisation

To see what the visualisation looks like, remove "-> db2" in the code above.

Step 7: Create the Final visualisation with the code below

The library grid is used to combine the visualisations.

grid.newpage()
pushViewport(viewport(layout=grid.layout(nrow=25, ncol=60)))

define_region<- function(row,col){
  viewport(layout.pos.row=row, layout.pos.col=col)
}

print(db1,vp=define_region(row=1:10,col=1:60))
print(db2,vp=define_region(row=11:25,col=1:60))

Final data visualisation and description

Source: Singapore Department of Statistics

In general, the final visualisation shows that in most regions the largest groups of residents live in HDB 4-Room Flats and HDB 5-Room and Executive Flats, while no one lives in HUDC Flats (excluding those privatised). Far fewer residents fall under HDB 1- and 2-Room Flats and Others.

Looking at specific regions, the East and North regions have a lower population than the other regions. In both of these regions, most residents live in HDB 4-Room Flats and HDB 5-Room and Executive Flats.

Another interesting insight is that the Central Region has the highest elderly population of all the regions. The Central Region also has the highest population living in Landed Properties and Condominiums and Other Apartments. The relative wealth of the population living in the Central Region may explain why more people there live in private housing. Although more people there also live in HDB 1- and 2-Room Flats and HDB 3-Room Flats, we cannot infer that there are more poor people in the Central Region, as housing in the Central Region is usually more expensive than in other regions of Singapore.

Lastly, the North-east and West regions have a higher population living in HDB 4-Room Flats. There is also a relatively higher population of adults aged 30 to 44 living in the North-east Region. This could be because, in the past few years, Punggol has become a planning area where new flats are being built, and Sengkang is also relatively new. Adults in their 30s, an age at which many marry or start a family, are therefore applying for their new homes in these newer planning areas.

Citations