Visual Analytics - Assignment 4

A. Assignment Overview

This data visualisation assignment aims to understand and visualise the demographic pattern of Singapore by age and location in 2019.

For this assignment, an age-sex pyramid and a heatmap are created to show the demographic structure of Singapore in 2019: by age cohort and gender (population pyramid) and by age cohort and planning area (heatmap).

B. Challenges faced during visualisation

B.1. Data and Design challenges :

The data and design challenges faced are listed below:

The raw data is not in a format that can be used directly to create the visualisations, so some pre-processing is required. Some fields, such as TOD (Type of Dwelling), are not used in any of the visualisations. The data therefore had to be prepared separately for each visualisation so that it contains only the fields that visualisation needs.
Adding interactive elements such as tooltips is a challenge. In Tableau, filters, tooltips and legends can be added to a visualisation directly. In R, ggplot2 on its own produces static charts, so tooltips could not be added, and labelling every value would make the chart look cluttered. Adding this kind of interactivity in an intuitive way is therefore a challenge.
Customising the appearance of the charts in ggplot. The default ggplot output, with its grey background, is not appealing and looks less professional than the charts produced with Tableau.

B.2. Overcoming challenges :

For the data preparation challenges, dplyr from the tidyverse is used to filter and aggregate the fields. Data preparation is performed separately for each visualisation.
Adding interactivity in R is not as intuitive as it is in Tableau, but there is a lot of scope to add it by combining ggplot2 with plotly. A ggplot2 chart can be converted into a plotly chart with tooltips that show the underlying values, and plotly lets the user save a snapshot of the visualisation as a png. The title, subtitle and caption (source credits) are added through labs() in ggplot.
A code chunk called Formatting, shown in the visualisation steps below, applies basic formatting such as background colour, themes and, where required, hiding of axes and legends. It is a reusable piece of code that is used in more than one visualisation and can be reused in future.
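
The ggplot2-to-plotly conversion described above can be sketched as follows. This is a minimal illustration rather than the chunk used later in this report: the data frame toy_pop and its values are made up for the example, and it assumes the ggplot2 and plotly packages are installed.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data, invented for this illustration only
toy_pop <- data.frame(
  AG  = c("0_to_4", "0_to_4", "05_to_09", "05_to_09"),
  Sex = c("Males", "Females", "Males", "Females"),
  Pop = c(100, 95, 110, 105)
)

# Static ggplot2 chart; the title is added through labs()
static_plot <- ggplot(toy_pop, aes(x = AG, y = Pop, fill = Sex)) +
  geom_col(position = "dodge") +
  labs(title = "Toy population by age group", x = "Age", y = "Population")

# Convert to an interactive plotly chart; tooltip picks the aesthetics shown on hover
interactive_plot <- ggplotly(static_plot, tooltip = c("x", "y", "fill"))
interactive_plot
```

Hovering over a bar in the resulting chart should show the age group, population and sex for that bar.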

B.3. Sketches of the visualizations

An age-sex population pyramid is used to show the population distribution of Singapore split by gender for the year 2019.

A heatmap is used to show the demographic structure of the top 10 most populous planning areas of Singapore in 2019, split by age bracket.

C. Steps for Creating Visualisations

This segment describes in detail the steps involved in creating the visualisations. It is split into 3 sections: R Packages & Libraries, Data Preparation and Creating Visualisation.

C.1. Installation & Initialization of R packages

This code chunk installs (if missing) and loads the tidyverse, heatmaply and plotly packages in a single loop, so that they do not have to be installed and loaded one by one. These packages are loaded into the RStudio environment because the visualisations depend on them.

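# Packages required for data preparation and visualisation
packages = c('tidyverse','heatmaply','plotly')

for(p in packages){
  # require() returns FALSE when the package is not installed yet
  if(!require(p,character.only = TRUE)){
    install.packages(p)
  }
  library(p,character.only = TRUE)
}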

C.2. Data Processing & Preparation (for visualisations)

The data source for this dataviz is Singstat.com. The dataset used is Singapore Residents by Planning Area, Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019. The data set was downloaded from the link provided and saved in csv format in the data sub-folder of the DataViz Makeover 8 project folder.

Importing Data

We use the read_csv() function of the readr package to read the raw data and store it in the “raw_data” dataframe.

raw_data <- read_csv("C:/Users/Rajnish Julka/Downloads/SMU Term 3/Visual Analytics/Assignments/Assignment 4/respopagesextod2011to2019.csv")

## Parsed with column specification:
## cols(
##   PA = col_character(),
##   SZ = col_character(),
##   AG = col_character(),
##   Sex = col_character(),
##   TOD = col_character(),
##   Pop = col_double(),
##   Time = col_double()
## )

Data Preparation steps for “Age-Sex Pyramid”

The data is prepared using the dplyr package in tidyverse. Functions such as mutate(), filter() and select() are used to derive the variables needed for each visualisation.

“pop_pyramid” is created by filtering the data for 2019 and using select() to drop the columns that are not required. Because Time is read as col_double() by default, it is first converted to character through mutate().

pop_pyramid <- raw_data %>%
  mutate(`Time`=as.character(Time)) %>%
  filter(Time=="2019") %>%
  select(-PA,-SZ,-TOD,-Time)
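
As a side note, the conversion of Time to character is a choice rather than a requirement: a numeric column can be compared with a number directly. The sketch below illustrates this on a made-up data frame (toy_raw is hypothetical and only mimics the column types of raw_data); it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy rows with the same column types as raw_data (Time is numeric)
toy_raw <- data.frame(
  AG   = c("0_to_4", "0_to_4", "5_to_9"),
  Sex  = c("Males", "Males", "Females"),
  Pop  = c(10, 12, 9),
  Time = c(2018, 2019, 2019)
)

# Compare Time with the number 2019 directly, then drop the column
toy_2019 <- toy_raw %>%
  filter(Time == 2019) %>%
  select(-Time)

toy_2019
```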

To keep the age brackets in increasing order, we rename the value “5_to_9” to “05_to_09”. Otherwise “5_to_9” is sorted alphabetically just before the “50_to_54” bracket, which distorts the visualisation.

We then aggregate the data by age group and sex and store it in the final dataframe, “pop_pyramid_final”, as shown below.

pop_pyramid$AG<-str_replace(as.character(pop_pyramid$AG),"5_to_9","05_to_09")
pop_pyramid_final <- aggregate(Pop~AG+Sex,data=pop_pyramid,FUN=sum)
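
An alternative to renaming “5_to_9” is to set the order of the age brackets explicitly by turning AG into a factor whose levels are sorted by the lower bound of each bracket. The sketch below is only an illustration of that idea on a made-up vector (age_groups is hypothetical); it uses base R only and assumes every label starts with its lower bound followed by an underscore.

```r
# Hypothetical age-group labels in arbitrary order
age_groups <- c("10_to_14", "0_to_4", "5_to_9", "90_and_over", "50_to_54")

# Extract the leading number of each label, e.g. "50_to_54" -> 50
lower_bound <- as.numeric(sub("_.*$", "", age_groups))

# Build a factor whose levels follow the numeric order of the lower bounds
age_factor <- factor(age_groups,
                     levels = unique(age_groups[order(lower_bound)]))

levels(age_factor)
```

ggplot draws a discrete axis in the order of the factor levels, so a factor built this way keeps the brackets in increasing order without changing the labels.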

Data Preparation steps for “Population distribution heatmap for Top 10 Regions”

The data is prepared to create a heatmap showing the population distribution in the top 10 planning areas of Singapore (by population).

First, we roll up the 2019 data at planning area (PA) and age bracket level and store it in heatmap_data. We then total the population of each planning area and sort the totals in descending order to identify the top 10. Finally, to create the population distribution heatmap split by age bracket, only the rows for these 10 planning areas are kept and stored in heatmap_data_10.

heatmap_data <- raw_data %>%
  mutate(`Time`=as.character(Time)) %>%
  filter(Time=="2019") %>%
  select(-SZ,-Sex,-TOD,-Time) %>%
  group_by(PA,AG) %>% 
  summarize(Pop_Agg = sum(Pop))

heatmap_data$AG<-str_replace(as.character(heatmap_data$AG),"5_to_9","05_to_09")

Keep only the data for the top 10 planning areas (by population)

heatmap_data_top10region = heatmap_data %>%
  group_by(PA) %>% 
  summarize(Population = sum(Pop_Agg)) %>% 
  arrange(desc(Population))

# Keep only the 10 most populous planning areas identified above.
# %in% is vectorised; the scalar operator || cannot test a whole column inside filter().
heatmap_data_10 <- heatmap_data %>%
  filter(PA %in% c("Bedok", "Jurong West", "Tampines", "Woodlands", "Sengkang",
                   "Hougang", "Yishun", "Choa Chu Kang", "Punggol", "Ang Mo Kio"))
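
The hard-coded list of planning areas above could also be derived from the data itself. The following is a minimal sketch of that approach on made-up data, not the method used in this report: toy_heatmap and its values are hypothetical, the top 2 areas are kept instead of the top 10 to keep the example short, and it assumes dplyr 1.0.0 or later for slice_max().

```r
library(dplyr)

# Hypothetical toy data with the same columns as heatmap_data
toy_heatmap <- data.frame(
  PA      = c("A", "A", "B", "B", "C", "C"),
  AG      = c("0_to_4", "05_to_09", "0_to_4", "05_to_09", "0_to_4", "05_to_09"),
  Pop_Agg = c(10, 20, 5, 5, 40, 1)
)

# Total population per planning area, keeping the n largest totals
top_pa <- toy_heatmap %>%
  group_by(PA) %>%
  summarise(Population = sum(Pop_Agg), .groups = "drop") %>%
  slice_max(Population, n = 2, with_ties = FALSE) %>%
  pull(PA)

# Keep only the rows that belong to those planning areas
toy_top <- toy_heatmap %>%
  filter(PA %in% top_pa)

toy_top
```

With n = 10, the same pattern would select the ten most populous planning areas without typing their names.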

C.3. Creating Visualizations

A basic formatting chunk is prepared so that the theme, gridlines and title styling are standardised and the steps do not have to be repeated for each visualisation.

#reusable code for formatting the plot
Formatting <- list( 
  theme_bw(),
  theme(panel.grid.major.x = element_blank()),
  theme(axis.text.x.top = element_text(size=12)),
  theme(plot.title = element_text(size=14, face = "bold", hjust = 0.5)),
  theme(plot.subtitle = element_text(hjust = 0.5))
)

Age-Sex Population Pyramid

The visualisation shows the age and gender distribution in Singapore for 2019. We use ggplot from the tidyverse to create it. The following elements are added:

geom_col() is added to ggplot to draw the bars. We need two sets of bars, one for males and one for females. In the aesthetics, the X axis is the age group column “AG”. For the Y axis, the male population values are multiplied by -1 so that their bars extend in the opposite direction, and the fill is based on Sex so that the two categories have different colours.
coord_flip() is used to flip the X and Y axes. The Y axis is scaled using scale_y_continuous() with labels = abs, so that the negative male populations are labelled as positive values, and with symmetric limits so that both sides share the same scale.
labs() is used to label the axes and to give the visualisation a title, a subtitle and a caption.

pyramidplot<-ggplot(pop_pyramid_final,aes(x = AG, y = ifelse(Sex == "Males", yes = -Pop, no = Pop),fill = Sex))+
  geom_col()+
  coord_flip()+
  scale_y_continuous(labels = abs, limits = max(pop_pyramid_final$Pop)*c(-1,1))+
  Formatting +
  labs(
    x="Age",
    y = "Population",
    title = "Population Pyramid for Singapore, 2019",
    subtitle = "Similar distribution for males & females, major chunk of Singaporeans fall in the 25-60 Years age bracket",
    caption = "Data Source: Singstat.com"
  )

pyramidplot

Population Distribution Heatmap (for Top 10 Populous regions of Singapore)

Base Visualization

This visualisation shows the demographic structure of the Singapore population by age cohort and planning area. The population of each combination is represented by the colour of its tile. Although base R provides a heatmap() function that could be used for this purpose, ggplot with geom_tile() is used here to build the heatmap. The dataframe created above, heatmap_data_10, is used. In the aesthetics, age group “AG” is plotted on the X axis and planning area “PA” on the Y axis, and the population is used as the fill for geom_tile(). The general formatting defined previously is retained, with additional formatting to rotate the labels on the X axis and reduce the font size on the Y axis so that all the planning areas fit. Through labs(), the title, subtitle, caption and axis labels are set.

heatmapviz<- ggplot(heatmap_data_10,aes(AG,PA,fill=Pop_Agg))+
  geom_tile(position = "identity",stat = "identity")+
  Formatting+
  labs(title = "Population Distribution Heatmap for Top 10 populous regions in Singapore",
    subtitle = "Distribution of Singapore's Population by Age and Planning Area",
    caption = "Data Source: Singstat.com",
    x="Age Group",
    y="Planning Area")+
  theme(axis.text.x = element_text(angle = 90))+
  theme(axis.text.y = element_text(size=5))

heatmapviz

D. Key Observations from the 2 Visualizations:

Population Pyramid: The plot makes it clear that the bulk of the Singaporean population falls in the 25-60 years age bracket. The distribution is skewed towards the older, more experienced age groups, with a declining young population. This may be linked to a low birth rate, possibly owing to the high cost of living. Singapore also has a lot of expatriates who have moved here for work. The split between males and females is broadly similar.

pyramidplot

Population Distribution Heatmap: The heatmap for the 10 most populous planning areas of Singapore shows which age brackets are concentrated in which areas. Areas such as Punggol, Sengkang and Jurong West have a high number of young and middle-aged residents. Such a heatmap can be of great benefit to companies in their marketing and promotion initiatives.

heatmapviz