1.0 Overview

The purpose of this visualization document is to reveal some elements of the demographic structure of Singapore, based on data published by the Singapore Department of Statistics (Singstat.gov.sg). The data describes the resident population by planning area, subzone, type of dwelling, age group and sex for the years 2011-2019. For this visualization task, we use only the 2019 data.

Although many visualizations could be built from these demographic characteristics, this document focuses on age group and studies how it varies with the other characteristics.

Given this focus, we have made four visualizations. The first is a stacked bar chart showing the percentage of the population in each dwelling type per age group. The second is a density ridgeline chart showing the population density of the different age groups. The third is a population pyramid showing how the population varies across age groups for each gender, and the final one is a population heatmap of planning area against age group.

2.0 Major data and design challenges

  1. Data Challenge: The initial data file provided for this assignment is ‘respopagesextod2011to2019.csv’. This file stores the demographic characteristics under abbreviated column names such as ‘PA’, ‘AG’ and ‘TOD’. The metadata explaining these abbreviations was given in a separate csv file. If we plot directly from the main file, we would have to look up the meaning of each column every time, and charts labelled with the abbreviated names would be ambiguous.

  2. Data/Design Challenge: The raw data cannot be used directly for certain visualizations such as the density ridgeline chart and the heatmap, because it is disaggregated and the ggplot package does not aggregate it automatically the way Tableau does. We therefore have to aggregate the data ourselves, in a form suited to each use case, before using charts like the heatmap (tile chart).

  3. Data/Design Challenge: The ordinal columns in the data also cause problems in some charts. The column ‘Age group’ (AG) holds values such as 0_to_4, 5_to_9 etc. When plotting ordinal data like this, we want the values to follow their natural order, either ascending or descending. R, however, treats them as plain text and sorts them alphabetically (for example, 10_to_14 comes before 5_to_9), which can distort the insight that the visualization tries to convey. Tableau lets us fix this easily through its interactivity, unlike R.

  4. Design Challenge: The appearance of the charts is not as easily customizable in R as in Tableau. For example, the age group labels displayed on the axis overlap each other, and there is no immediately obvious option to rotate them in R. Also, charts such as the age-sex pyramid require a dual axis with one side inverted, for which R has no ready-made option, unlike Tableau.

3.0 Plan to overcome the data and design challenges

3.1 The Plan

  1. The column-naming challenge in point 1 of the above section can be addressed with a data manipulation package in R such as dplyr, whose rename() function lets us change the column names.

  2. The disaggregated data problem in point 2 of the above section can be solved by aggregating the data for our use case with the aggregate() function.

  3. The ordinal ordering problem in point 3 of the above section can be solved by converting the column to a factor and listing the levels in the desired order through the levels parameter.

  4. For the customization problem in point 4 of the above section, the ggplot package supports a good degree of customization, such as setting themes, rotating the axis labels through the theme() function, changing the colour scale etc.

3.2 The Sketch

4.0 Step by step description of the Data Visualization

4.1 Installing and loading the required packages

Before creating the visualizations, we need to install and load the R packages that provide the required functions. The packages tidyverse, dplyr, tidyr, ggthemes and ggridges are installed (if missing) and loaded using the following loop.

packages = c('tidyverse','dplyr','tidyr','ggthemes','ggridges')

for(p in packages){
  # Install the package only if it cannot be loaded
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
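As a possible variant of the loop above, the sketch below first checks which packages are missing with requireNamespace(), so that nothing is attached until we know what has to be installed. The variable names is_installed and missing_packages are our own, and the install and library lines are left commented out so that the snippet has no side effects. Since tidyverse already attaches dplyr and tidyr, they are not listed separately here.

```r
# Packages needed for this document (dplyr and tidyr come with tidyverse).
packages <- c("tidyverse", "ggthemes", "ggridges")

# requireNamespace() returns TRUE/FALSE without attaching the package.
is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
missing_packages <- packages[!is_installed]

# Assumption: install only what is missing, then attach everything.
# install.packages(missing_packages)
# invisible(lapply(packages, library, character.only = TRUE))
print(missing_packages)
```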

4.2 Loading the data

The initial data provided for the visualization task was ‘respopagesextod2011to2019.csv’ and it had demographic data for all the years from 2011 to 2019. Since we intend to use only 2019 data, the records pertaining to 2019 are extracted and saved in another file called ‘2019_demo.csv’. Thus we will load this data into ‘demo’ using the read_csv function.

demo = read_csv("Singapore Residents by Planning AreaSubzone Age Group Sex and Type of Dwelling June 20112019/2019_demo.csv")
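The extraction of the 2019 records is described above but not shown. A minimal sketch of how it could be done in R is given below, assuming the year is stored in the ‘Time’ column of the original file. The toy data frame all_years and its values are made up for illustration; in practice it would come from reading ‘respopagesextod2011to2019.csv’.

```r
# Toy stand-in for the multi-year file; the values are made up.
all_years <- data.frame(
  PA   = c("Ang Mo Kio", "Ang Mo Kio", "Bedok", "Bedok"),
  AG   = c("0_to_4", "0_to_4", "5_to_9", "5_to_9"),
  Pop  = c(10, 20, 30, 40),
  Time = c(2018, 2019, 2018, 2019)
)

# Keep only the rows that belong to 2019.
demo_2019 <- all_years[all_years$Time == 2019, ]
demo_2019

# Assumption: the subset would then be saved for later use, for example
# write.csv(demo_2019, "2019_demo.csv", row.names = FALSE)
```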

4.3 Data Manipulation Part 1

To avoid ambiguity, the rename function is used to replace abbreviated column names such as ‘PA’ and ‘AG’ with descriptive names taken from the metadata.

demo <- demo %>%
  rename('Planning_Area'='PA',
         'Subzone'='SZ',
         'Age_Group'='AG',
         'Population'='Pop',
         'Year'='Time',
         'Dwelling_Type'= 'TOD')
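In rename(), each argument is written as new_name = old_name. The sketch below shows the same renaming on a small made-up data frame, using a named lookup vector with all_of() so that the mapping is kept in one place. The name lookup and the toy values are ours and are not part of the original data.

```r
library(dplyr)

# Toy data with the abbreviated column names; the values are made up.
toy <- data.frame(PA = "Ang Mo Kio", SZ = "Ang Mo Kio Town Centre",
                  AG = "0_to_4", Sex = "Males",
                  TOD = "HDB 3-Room Flats", Pop = 10, Time = 2019)

# Names of the vector are the new column names, values are the old ones.
lookup <- c(Planning_Area = "PA", Subzone = "SZ", Age_Group = "AG",
            Population = "Pop", Year = "Time", Dwelling_Type = "TOD")

# Columns keep their positions; only the names change.
toy_renamed <- toy %>% rename(all_of(lookup))
names(toy_renamed)
```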

The demographic data is converted to a dataframe to ensure that a dataframe structure is used throughout the visualization.

demo <- as.data.frame(demo)
head(demo)
##   Planning_Area                Subzone Age_Group   Sex
## 1    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 2    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 3    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 4    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 5    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
## 6    Ang Mo Kio Ang Mo Kio Town Centre    0_to_4 Males
##                             Dwelling_Type Population Year
## 1                 HDB 1- and 2-Room Flats          0 2019
## 2                        HDB 3-Room Flats         10 2019
## 3                        HDB 4-Room Flats         10 2019
## 4          HDB 5-Room and Executive Flats         20 2019
## 5 HUDC Flats (excluding those privatised)          0 2019
## 6                       Landed Properties          0 2019

Next, we look at the summary statistics and the class of each field in the dataframe.

summary(demo)
##  Planning_Area        Subzone           Age_Group             Sex           
##  Length:98192       Length:98192       Length:98192       Length:98192      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Dwelling_Type        Population           Year     
##  Length:98192       Min.   :   0.00   Min.   :2019  
##  Class :character   1st Qu.:   0.00   1st Qu.:2019  
##  Mode  :character   Median :   0.00   Median :2019  
##                     Mean   :  41.08   Mean   :2019  
##                     3rd Qu.:  20.00   3rd Qu.:2019  
##                     Max.   :2440.00   Max.   :2019

Next, we know that the column Age_Group holds ordinal data, so its categories should be arranged to keep their order in the visualizations. We do this by converting the Age_Group column to a factor and listing the levels in the order in which we intend to display them.

demo$Age_Group <- factor(demo$Age_Group, levels=c("0_to_4","5_to_9","10_to_14","15_to_19","20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59","60_to_64","65_to_69", "70_to_74", "75_to_79","80_to_84","85_to_89","90_and_over"))
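To illustrate why the explicit levels matter, the sketch below compares the default factor ordering with a manually specified one on a handful of age group labels. It uses only base R; the variable names are our own. It also shows that a label missing from the levels silently becomes NA, so it is worth checking for typos after the conversion.

```r
# A few age group labels in arbitrary order.
age_groups <- c("10_to_14", "0_to_4", "90_and_over", "5_to_9", "85_to_89")

# Default factor: levels are sorted as text, so 10_to_14 precedes 5_to_9.
default_factor <- factor(age_groups)
levels(default_factor)

# Explicit levels: the order we list is the order used on the axis.
age_levels <- c("0_to_4", "5_to_9", "10_to_14", "85_to_89", "90_and_over")
ordered_factor <- factor(age_groups, levels = age_levels)
levels(ordered_factor)

# A label that does not appear in `levels` is converted to NA.
typo_factor <- factor(c("0_to_5", "5_to_9"), levels = age_levels)
anyNA(typo_factor)
```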

4.4 Creating the Stacked bar chart showing percentage of population of dwelling type for each age group

Now, the stacked bar chart is created. Here we intend to show the population proportion of each dwelling type for each age group. To create this we use Age group as the x axis variable and Population as the y axis variable. Dwelling type is used as the ‘fill’ parameter in the aes function. geom_bar() is used to create the bar chart, with position = ‘fill’ so that each bar shows proportions of the y axis variable rather than raw totals. theme_minimal() is used to remove the gray background (a non-data component), which is the default for ggplot charts. The theme() function is used with the axis.text.x parameter to tilt the labels by 45 degrees in order to prevent them from overlapping. The labs() function is used to provide a title, caption and the x and y axis labels for the visualization.

# Proportion of each dwelling type within every age group
ggplot(demo, aes(x = Age_Group, y = Population, fill = Dwelling_Type)) +
  geom_bar(position = "fill", stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10)) +
  labs(title = "Stacked Bar Chart showing Population proportion of Type of dwelling in each Age group",
       caption = "Source: Singstat.gov.sg",
       x = "Age group", y = "Population proportion expressed in decimal")
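With position = "fill", ggplot stacks the rows of each age group and rescales every bar to a total of 1. The sketch below reproduces that calculation by hand on a small made-up data frame, which can be useful for checking the percentages read off the chart. The toy values and the column name Proportion are assumptions for illustration.

```r
# Toy data: several rows per age group and dwelling type (made-up values).
toy <- data.frame(
  Age_Group     = c("0_to_4", "0_to_4", "0_to_4", "5_to_9", "5_to_9"),
  Dwelling_Type = c("HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 4-Room Flats",
                    "HDB 3-Room Flats", "HDB 4-Room Flats"),
  Population    = c(10, 20, 10, 30, 10)
)

# Total population per age group and dwelling type.
totals <- aggregate(Population ~ Age_Group + Dwelling_Type, data = toy, FUN = sum)

# Share of each dwelling type within its age group (what the bar heights show).
totals$Proportion <- totals$Population /
  ave(totals$Population, totals$Age_Group, FUN = sum)
totals
```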

4.5 Data Manipulation Part 2

For the purpose of using the demographic data for creating the ridgelines and the heatmap, we need the data in an aggregated form. Thus we use the function called aggregate to create an aggregated dataframe. The population is aggregated per planning area and age group.

demo_agg = aggregate(Population ~ Planning_Area+Age_Group,data = demo, FUN = sum)
demo_agg = as.data.frame(demo_agg)
head(demo_agg)
##   Planning_Area Age_Group Population
## 1    Ang Mo Kio    0_to_4       5420
## 2         Bedok    0_to_4      10020
## 3        Bishan    0_to_4       2850
## 4      Boon Lay    0_to_4          0
## 5   Bukit Batok    0_to_4       7130
## 6   Bukit Merah    0_to_4       6100
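For readers who prefer to stay within dplyr, the same aggregation can be sketched with group_by() and summarise(). The example below runs both versions on a small made-up data frame so that the results can be compared; the names toy, agg_base and agg_dplyr are ours.

```r
library(dplyr)

# Toy disaggregated data: several rows per planning area and age group.
toy <- data.frame(
  Planning_Area = c("Ang Mo Kio", "Ang Mo Kio", "Bedok", "Bedok", "Bedok"),
  Age_Group     = c("0_to_4", "0_to_4", "0_to_4", "5_to_9", "5_to_9"),
  Population    = c(10, 20, 30, 40, 50)
)

# Base R, as used above.
agg_base <- aggregate(Population ~ Planning_Area + Age_Group, data = toy, FUN = sum)

# dplyr equivalent: one row per planning area and age group.
agg_dplyr <- toy %>%
  group_by(Planning_Area, Age_Group) %>%
  summarise(Population = sum(Population), .groups = "drop")

agg_dplyr
```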

For this aggregated dataset too, we repeat the conversion of the age group to a factor, listing the labels in order in the levels parameter so that the age groups appear correctly ordered in the visualizations.

demo_agg$Age_Group <- factor(demo_agg$Age_Group, levels=c("0_to_4","5_to_9","10_to_14","15_to_19","20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59","60_to_64","65_to_69", "70_to_74", "75_to_79","80_to_84","85_to_89","90_and_over"))

4.6 Creation of the Density ridgelines chart

Ridgeline plots are partially overlapping line plots that create the impression of a mountain range. We use one here to show the population density for each age group, that is, how the planning-area population totals are distributed within each age group. To create this we use the aggregated data (demo_agg), with Age group as the y axis variable and Population as the x axis variable. geom_density_ridges() is used to create the ridgeline chart. The density estimate is filled with lightblue. theme_minimal() is used to remove the gray background (a non-data component), which is the default for ggplot charts. The labs() function is used to provide a title and caption for the visualization.

# One density ridge per age group, built from the planning-area totals
ggplot(demo_agg, aes(x = Population, y = Age_Group)) +
  geom_density_ridges(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Density Ridgeline Chart showing population density for each age group",
       caption = "Source: Singstat.gov.sg")
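Each ridge is a kernel density estimate of the Population values in demo_agg for one age group, that is, of the planning-area totals. As a rough illustration, the sketch below splits a toy version of the aggregated data by age group, compares the spread of the two groups, and computes one density curve with base R's density(). The 0_to_4 values are the first five totals printed in section 4.5, while the 90_and_over values are made up. density() is only a stand-in here and may use a different bandwidth than ggridges does.

```r
# Toy aggregated data: one row per planning area and age group.
# The 90_and_over values are made up for illustration.
toy_agg <- data.frame(
  Age_Group  = rep(c("0_to_4", "90_and_over"), each = 5),
  Population = c(5420, 10020, 2850, 0, 7130,
                 200, 350, 120, 0, 260)
)

# Spread of the planning-area totals within each age group.
by_group <- split(toy_agg$Population, toy_agg$Age_Group)
spread <- sapply(by_group, function(p) c(median = median(p), iqr = IQR(p)))
spread

# Kernel density estimate for one group (base R stand-in for one ridge).
ridge <- density(by_group[["0_to_4"]])
range(ridge$x)
```

A group whose planning-area totals are close together gives a narrow ridge, while a group whose totals differ widely gives a wide one.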

4.7 Creation of the Age Sex population pyramid

Now, the age-sex pyramid is created. To create this we initially use Age group as the x axis variable and Population as the y axis variable to create a stacked bar representation, and flip it later to create the pyramid. Sex is used as the ‘fill’ parameter in the aes function. The y variable (Population) in the aes function is wrapped in an ifelse statement that negates the values for males, so that the bars for the two sexes extend in opposite directions from a shared axis. geom_bar() is used to create the bar chart first. theme_minimal() is used to remove the gray background (a non-data component), which is the default for ggplot charts. scale_y_continuous() is used with the limits parameter set to ‘max(demo$Population) * c(-80,80)’ to size the population axis according to the values in the column, and with labels = abs so that both sides show positive numbers. The multiplier ‘c(-80,80)’ was found through trial and error to ensure that the bars are shown fully on both sides. coord_flip() is used to swap the x and y axes to create a pyramid. The labs() function is used to provide a title, caption and the x and y axis labels for the visualization. options(scipen = 999) is used to prevent the population numbers on the axis from being shown in scientific notation.

# Show axis numbers in full rather than in scientific notation
options(scipen = 999)
# Males are plotted as negative values so that their bars extend to the left
ggplot(demo, aes(x = Age_Group, fill = Sex,
                 y = ifelse(test = Sex == "Males",
                            yes = -Population, no = Population))) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = abs, limits = max(demo$Population) * c(-80, 80)) +
  theme_minimal() +
  coord_flip() +
  labs(title = "Age-Sex Population pyramid",
       caption = "Source: Singstat.gov.sg", x = "Age Group", y = "Population")
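geom_bar(stat = "identity") stacks all rows of an age group and sex, so the length of a bar is the total for that combination, while max(demo$Population) is only the largest single row. This is why a multiplier had to be found by trial and error. One possible alternative, sketched below on made-up numbers, is to aggregate by age group and sex first and derive symmetric limits from the largest total. The names pyramid, Signed and y_limits, and the 5% padding, are our assumptions.

```r
# Toy disaggregated data: two rows per age group and sex (made-up values).
toy <- data.frame(
  Age_Group  = rep(c("0_to_4", "5_to_9"), each = 4),
  Sex        = rep(c("Males", "Males", "Females", "Females"), times = 2),
  Population = c(10, 20, 10, 10, 30, 20, 20, 40)
)

# Total per age group and sex: this is the length of each pyramid bar.
pyramid <- aggregate(Population ~ Age_Group + Sex, data = toy, FUN = sum)

# Males on the negative side, females on the positive side.
pyramid$Signed <- ifelse(pyramid$Sex == "Males",
                         -pyramid$Population, pyramid$Population)

# Symmetric axis limits from the longest bar, with 5% padding.
y_limits <- c(-1, 1) * max(abs(pyramid$Signed)) * 1.05
y_limits
```

With the real data, y_limits could then be passed as the limits argument of scale_y_continuous() in place of the hand-tuned multiplier.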

4.8 Creation of the Population heatmap

Now, the population heatmap is created. To create this we use the aggregated data (demo_agg), with Age group as the x axis variable and Planning area as the y axis variable. Population is used as the ‘fill’ parameter in the aes function. geom_tile() is used to create the tiles that make up the heatmap. The tile border colour is set to white so that neighbouring tiles are separated, which makes the colour gradient easier to read. scale_fill_gradient2() is used to fill the tiles with a colour gradient according to the population. Tiles for planning area - age group combinations with low population are coloured ‘blue’ and those with the highest population are coloured ‘red’. The middle colour is left at its default, ‘white’, and the ‘midpoint’ is set to ‘10000’. theme_minimal() is used to remove the gray background (a non-data component), which is the default for ggplot charts. The theme() function is used with the axis.text.x parameter to tilt the labels by 45 degrees in order to prevent them from overlapping. The labs() function is used to provide a title, caption and the x and y axis labels for the visualization.

# Tile colour encodes the population of each planning area / age group pair
ggplot(data = demo_agg, aes(x = Age_Group, y = Planning_Area, fill = Population)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", midpoint = 10000) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10)) +
  labs(title = "Heatmap of population for Planning area and Age group characteristics",
       caption = "Source: Singstat.gov.sg", x = "Age group", y = "Planning Area")
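The midpoint of 10000 is a fixed value. If the colour scale should instead be centred on the data, one option is to compute the midpoint from the aggregated populations. The sketch below uses the six totals printed in section 4.5 as a stand-in for demo_agg$Population. Whether the median or the mid-range is the better centre is a judgement call, and neither is what the chart above uses.

```r
# First six aggregated totals from section 4.5, standing in for
# demo_agg$Population.
population <- c(5420, 10020, 2850, 0, 7130, 6100)

midpoint_median   <- median(population)       # half of the values lie below
midpoint_midrange <- mean(range(population))  # halfway between min and max

c(median = midpoint_median, midrange = midpoint_midrange)

# Assumption: either value could then be passed on, for example
# scale_fill_gradient2(low = "blue", high = "red", midpoint = midpoint_median)
```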

5.0 Final data visualization with description and insights

Stacked bar chart showing percentage of population of dwelling type for each age group

Description:

The purpose of this visualization is to show the population proportion of each dwelling type for each age group. For example, the blue stack in the first bar, for age group 0_to_4, indicates that almost 25% of the population aged 0 to 4 lives in HDB 5-room and executive flats.

Insights:

  1. The dwelling type ‘HDB 4-room flats’ accounts for the largest proportion of the population in all age groups, and its proportion is almost the same across all age groups.
  2. The population proportion of the condominiums and other apartments dwelling type decreases as we move towards the older age groups.

Density ridgelines chart

Description:

The purpose of this visualization is to show the population density of each age group in Singapore for 2019. This enables us to see which age groups have a wider distribution of population and which have a narrower one.

Insights:

  1. We see that the population density curve becomes narrower as age increases. The age group 90 and over has the narrowest population density curve.
  2. The middle age groups such as 30 to 34, 35 to 39, 40 to 44 etc. have the widest population density curves.

Age Sex population pyramid

Description:

The purpose of this visualization is to compare the population proportion of males and females in different age groups. Bars of the same length for males and females indicate a balanced proportion, whereas bars of unequal length indicate a proportion skewed towards a particular gender.

Insights:

  1. From the visualization, we see that the sex ratio for the young and the middle age groups is almost balanced, with the middle age groups having the highest population counts for both males and females.
  2. We notice that the bars become skewed towards the female side as we progress towards the older age groups. This indicates that the population consists of more females than males in the older age groups.

Population heatmap for Planning area-Age group

Description:

The purpose of this visualization is to show the heatmap for the age group and planning area combination. This enables us to see the areas where the population is high or low, and which age groups are more or less present in which planning area. Blue indicates lower population, so a darker blue indicates a very low population. On the other hand, red indicates high population, and the darker the red, the more populous that age group is in that area.

Insights:

  1. From the heatmap, we see that Tampines, Sengkang, Jurong West and Bedok are the most densely populated areas, as shown by their larger number of ‘Red’ filled tiles, which indicate higher population values.
  2. We see that the most densely populated tile is the Sengkang - 35 to 39 age group combination. We already know from the previous charts that the 35 to 39 age group is the most populous group, and from here we can see that most of them reside in Sengkang.