The purpose of this visualization document is to reveal some elements of the demographic structure of Singapore, based on data published by the Singapore Department of Statistics (through Singstat.gov.sg). The data covers planning area, subzone, type of dwelling, age group, sex and population for the years 2011-2019. For this visualization task, we use only the 2019 data.
Although many visualizations could be built from these demographic characteristics, this document focuses on age group and studies how it varies with the other characteristics.
Given this focus, we have made four visualizations. The first is a stacked bar chart showing the percentage of the population in each dwelling type per age group. The second is a density ridgeline chart showing the population density of the different age groups. The third is a population pyramid showing how the population varies across age groups for each sex, and the last is a population heatmap of planning area against age group.
Data Challenge: The initial data file provided for this assignment is ‘respopagesextod2011to2019.csv’. This file stores the demographic characteristics under abbreviated column names such as ‘PA’, ‘AG’, ‘TOD’ etc. The metadata for these abbreviations was given in a separate csv file. If we used only the main file for visualization, we would have to look up what each column name means every time, and charts plotted with the existing column names could be ambiguous.
Data/Design Challenge: The raw data cannot be used directly for certain visualizations, such as the density ridgeline chart and the heatmap, because it is disaggregated and the ggplot2 package does not aggregate it automatically the way Tableau does. We therefore have to aggregate the data to suit our use case before building charts such as the heatmap (tile chart).
Data/Design Challenge: Ordinal columns in the data also pose problems in some charts. Here, the column ‘Age group’ (AG) holds values such as 0_to_4, 5_to_9 etc. When plotting such ordinal data, we want the values to follow their natural order, either ascending or descending. R, however, treats them as plain text and orders them alphabetically (for example, 10_to_14 comes before 5_to_9), which can distort the insight that the visualization tries to convey. In Tableau this is easily fixed interactively; in R it requires an explicit step.
Design Challenge: The appearance of the charts is not as easily customizable in R as in Tableau. For example, the age group labels displayed on the axis overlap each other, and rotating them is not a point-and-click option in R. Also, charts such as the age-sex pyramid require a dual axis with one side inverted, for which R has no ready-made option, unlike Tableau.
The data manipulation challenge indicated in point 1 of the above section can be addressed with data manipulation packages in R such as dplyr, whose rename() function lets us change the column names.
The disaggregated data problem indicated in point 2 of the above section can be solved by building aggregated data for our use case with the aggregate() function.
The ordinal ordering problem indicated in point 3 of the above section can be solved by converting the column to a factor and listing the ordered values explicitly in the levels parameter.
For the customization problem indicated in point 4 of the above section, the ggplot2 package supports a good degree of customization, such as setting themes, changing the angle of the axis labels through the theme() function, and changing the colour scale.
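To make the ordering issue from point 3 concrete, here is a minimal sketch on a toy vector (the vector `age` is illustrative only, not taken from the data file). It contrasts the alphabetical ordering of plain character values with the explicit ordering of a factor.

```r
# Toy age-group labels (a small illustrative subset of the labels in the data)
age <- c("5_to_9", "10_to_14", "0_to_4", "90_and_over")

# As plain character data, sorting is alphabetical rather than by age,
# so "10_to_14" is placed before "5_to_9"
sort(age)

# As a factor with explicit levels, the intended order is preserved
age_f <- factor(age, levels = c("0_to_4", "5_to_9", "10_to_14", "90_and_over"))
levels(age_f)
sort(age_f)
```

ggplot2 uses the factor levels to order a discrete axis, which is why the conversion is applied before plotting.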
Before building the visualizations, we need to install and load the R packages that provide the required functions. The packages tidyverse, dplyr, tidyr, ggthemes and ggridges are installed (if missing) and loaded with the following loop.
packages <- c('tidyverse', 'dplyr', 'tidyr', 'ggthemes', 'ggridges')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The initial data provided for the visualization task was ‘respopagesextod2011to2019.csv’ and it had demographic data for all the years from 2011 to 2019. Since we intend to use only 2019 data, the records pertaining to 2019 are extracted and saved in another file called ‘2019_demo.csv’. Thus we will load this data into ‘demo’ using the read_csv function.
demo = read_csv("Singapore Residents by Planning AreaSubzone Age Group Sex and Type of Dwelling June 20112019/2019_demo.csv")
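The extraction of the 2019 records is described above but not shown. A minimal sketch of how it could be done with dplyr follows; it uses a small in-memory stand-in (`raw`, hypothetical) rather than the real file, and assumes the year is held in the `Time` column, as in the raw data.

```r
library(dplyr)

# Toy stand-in for the full 2011-2019 file; column names follow the raw data
raw <- data.frame(
  PA   = c("Ang Mo Kio", "Ang Mo Kio", "Bedok"),
  AG   = c("0_to_4", "0_to_4", "5_to_9"),
  Pop  = c(10, 20, 30),
  Time = c(2018, 2019, 2019)
)

# Keep only the 2019 records
demo_2019 <- raw %>% filter(Time == 2019)
demo_2019

# The subset could then be written out, e.g.:
# readr::write_csv(demo_2019, "2019_demo.csv")
```

Filtering in R instead of editing the file by hand keeps the step reproducible if the source data is updated.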
To rename the abbreviated column names such as ‘PA’,‘AG’ etc in order to avoid ambiguity, the rename function is used to rename these columns as per their metadata.
demo <- demo %>%
  rename('Planning_Area' = 'PA',
         'Subzone' = 'SZ',
         'Age_Group' = 'AG',
         'Population' = 'Pop',
         'Year' = 'Time',
         'Dwelling_Type' = 'TOD')
The demographic data is converted to a data frame so that the same structure is used throughout the visualization.
demo <- as.data.frame(demo)
head(demo)
## Planning_Area Subzone Age_Group Sex
## 1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## Dwelling_Type Population Year
## 1 HDB 1- and 2-Room Flats 0 2019
## 2 HDB 3-Room Flats 10 2019
## 3 HDB 4-Room Flats 10 2019
## 4 HDB 5-Room and Executive Flats 20 2019
## 5 HUDC Flats (excluding those privatised) 0 2019
## 6 Landed Properties 0 2019
Next, we inspect the summary statistics and the class of each field in the data frame.
summary(demo)
## Planning_Area Subzone Age_Group Sex
## Length:98192 Length:98192 Length:98192 Length:98192
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Dwelling_Type Population Year
## Length:98192 Min. : 0.00 Min. :2019
## Class :character 1st Qu.: 0.00 1st Qu.:2019
## Mode :character Median : 0.00 Median :2019
## Mean : 41.08 Mean :2019
## 3rd Qu.: 20.00 3rd Qu.:2019
## Max. :2440.00 Max. :2019
The column Age_Group holds ordinal data, so its values must be arranged in order for that order to be kept in the visualizations. We do this by converting the Age_Group column to a factor and listing the levels in the order in which we want them to appear.
demo$Age_Group <- factor(demo$Age_Group, levels=c("0_to_4","5_to_9","10_to_14","15_to_19","20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59","60_to_64","65_to_69", "70_to_74", "75_to_79","80_to_84","85_to_89","90_and_over"))
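As an alternative to typing all the levels by hand, the order can be derived from the labels themselves. The sketch below assumes every label starts with its lower age bound followed by an underscore (as in "10_to_14" and "90_and_over"); `order_age_levels` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: order age-group labels by their leading number
order_age_levels <- function(labels) {
  labels <- unique(as.character(labels))
  # Strip everything from the first underscore onwards: "10_to_14" -> 10
  lower_bound <- as.numeric(sub("_.*$", "", labels))
  labels[order(lower_bound)]
}

age <- c("10_to_14", "0_to_4", "90_and_over", "5_to_9", "0_to_4")
age_levels <- order_age_levels(age)
age_levels

# The result can be passed to factor() in place of a hand-written vector
age_f <- factor(age, levels = age_levels)
```

This avoids typing mistakes in a long levels vector, at the cost of relying on the label format staying the same.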
Now, the stacked bar chart is created. It shows the proportion of the population in each dwelling type for each age group. Age group is the x axis variable and Population is the y axis variable. Dwelling type is passed as the ‘fill’ parameter in the aes function. geom_bar() creates the bar chart; it is used with position = ‘fill’ so that each bar is scaled to the proportions of the y axis variable. theme_minimal() removes the gray background (a non-data component), which is the ggplot2 default. The theme() function is used with the axis.text.x parameter to tilt the labels by 45 degrees so that they do not overlap. The labs() function provides the title, caption and x and y axis labels.
ggplot(demo, aes(fill = Dwelling_Type, y = Population, x = Age_Group)) +
  geom_bar(position = "fill", stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1)) +
  labs(title = "Stacked Bar Chart showing Population proportion of Type of dwelling in each Age group",
       caption = "Source: Singstat.gov.sg", x = "Age group", y = "Population proportion expressed in decimal")
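The proportions that position = "fill" draws can also be computed as numbers, which helps when reading exact values off the chart. The following is a minimal sketch on toy data (the data frame `toy` and its values are made up for illustration); it uses the renamed columns from this document.

```r
library(dplyr)

# Toy data with the renamed columns used in this document
toy <- data.frame(
  Age_Group     = c("0_to_4", "0_to_4", "0_to_4", "5_to_9", "5_to_9"),
  Dwelling_Type = c("HDB 3-Room Flats", "HDB 4-Room Flats", "HDB 4-Room Flats",
                    "HDB 3-Room Flats", "HDB 4-Room Flats"),
  Population    = c(10, 20, 10, 30, 10)
)

props <- toy %>%
  # Total population per age group and dwelling type
  group_by(Age_Group, Dwelling_Type) %>%
  summarise(Population = sum(Population), .groups = "drop") %>%
  # Share of each dwelling type within its age group
  group_by(Age_Group) %>%
  mutate(Proportion = Population / sum(Population)) %>%
  ungroup()

props
```

Within each age group the Proportion values sum to 1, matching the full height of each stacked bar.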
For the purpose of using the demographic data for creating the ridgelines and the heatmap, we need the data in an aggregated form. Thus we use the function called aggregate to create an aggregated dataframe. The population is aggregated per planning area and age group.
demo_agg = aggregate(Population ~ Planning_Area+Age_Group,data = demo, FUN = sum)
demo_agg = as.data.frame(demo_agg)
head(demo_agg)
## Planning_Area Age_Group Population
## 1 Ang Mo Kio 0_to_4 5420
## 2 Bedok 0_to_4 10020
## 3 Bishan 0_to_4 2850
## 4 Boon Lay 0_to_4 0
## 5 Bukit Batok 0_to_4 7130
## 6 Bukit Merah 0_to_4 6100
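Since dplyr is already loaded, the same aggregation can also be written with group_by() and summarise(). The sketch below runs both forms on toy data (the data frame `toy` is made up for illustration) to show that they give the same totals.

```r
library(dplyr)

# Toy data with the renamed columns used in this document
toy <- data.frame(
  Planning_Area = c("Bedok", "Bedok", "Bishan", "Bishan"),
  Age_Group     = c("0_to_4", "0_to_4", "0_to_4", "5_to_9"),
  Population    = c(10, 20, 5, 15)
)

# Base R form, as used in this document
agg_base <- aggregate(Population ~ Planning_Area + Age_Group, data = toy, FUN = sum)

# dplyr form of the same aggregation
agg_dplyr <- toy %>%
  group_by(Planning_Area, Age_Group) %>%
  summarise(Population = sum(Population), .groups = "drop")

agg_base
agg_dplyr
```

The row order of the two results may differ, but each planning area and age group pair carries the same population total.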
For this aggregated dataset too, we repeat the conversion of the age group to a factor with explicitly ordered levels, so that the age groups appear in the proper order in the visualizations.
demo_agg$Age_Group <- factor(demo_agg$Age_Group, levels=c("0_to_4","5_to_9","10_to_14","15_to_19","20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59","60_to_64","65_to_69", "70_to_74", "75_to_79","80_to_84","85_to_89","90_and_over"))
Ridgeline plots are partially overlapping line plots that create the impression of a mountain range. We use one here to show the population density for each age group, that is, how the planning-area population totals are distributed within each age group. To create it we use the aggregated data (demo_agg), with Age group as the y axis variable and Population as the x axis variable. geom_density_ridges() creates the ridgeline chart, and the density estimate is filled with lightblue. theme_minimal() removes the gray background (a non-data component), which is the ggplot2 default. The labs() function provides the title and caption.
ggplot(demo_agg, aes(x = Population, y = Age_Group)) +
  geom_density_ridges(fill = "lightblue") +
  theme_minimal() +
  labs(title = "Density Ridgeline Chart showing population density for each age group",
       caption = "Source: Singstat.gov.sg")
Now, the age-sex pyramid is created. We first use Age group as the x axis variable and Population as the y axis variable to build a stacked bar chart, and flip it afterwards to form the pyramid. Sex is passed as the ‘fill’ parameter in the aes function. The y variable (Population) is wrapped in an ifelse statement that negates the male values, so that the bars for the two sexes extend in opposite directions from a shared axis. geom_bar() creates the bar chart. theme_minimal() removes the gray background (a non-data component), which is the ggplot2 default. scale_y_continuous() is used with the limits parameter set to ‘max(demo$Population) * c(-80,80)’ to size the population axis from the values in the column, and with labels = abs so that both sides show positive numbers. The multiplier ‘c(-80,80)’ was found by trial and error so that the bars are shown fully on both sides. coord_flip() swaps the x and y axes to create the pyramid. The labs() function provides the title, caption and x and y axis labels. options(scipen = 999) prevents the population axis numbers from being shown in scientific notation.
options(scipen=999)
ggplot(demo, aes(x = Age_Group, fill = Sex,
                 y = ifelse(test = Sex == "Males",
                            yes = -Population, no = Population))) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = abs, limits = max(demo$Population) * c(-80, 80)) +
  theme_minimal() +
  coord_flip() +
  labs(title = "Age-Sex Population pyramid",
       caption = "Source: Singstat.gov.sg", x = "Age Group", y = "Population")
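The trial-and-error multiplier could be avoided by deriving the axis limits from the data. Each bar's length is the total population for one age group and sex, so the limits only need to cover the largest of those totals. The sketch below shows the idea on toy data (the data frame `toy` and the 5% padding are assumptions for illustration, not the values used for the chart above).

```r
library(dplyr)

# Toy data with the renamed columns used in this document
toy <- data.frame(
  Age_Group  = c("0_to_4", "0_to_4", "0_to_4", "5_to_9", "5_to_9"),
  Sex        = c("Males", "Males", "Females", "Males", "Females"),
  Population = c(10, 20, 25, 15, 40)
)

# Length of each stacked bar: total population per age group and sex
totals <- toy %>%
  group_by(Age_Group, Sex) %>%
  summarise(Population = sum(Population), .groups = "drop")

# Symmetric limits around zero with 5% padding (the padding is an arbitrary choice)
bar_max <- max(totals$Population)
axis_limits <- c(-1, 1) * bar_max * 1.05
axis_limits
```

The resulting vector could be passed as limits = axis_limits in scale_y_continuous(), and it would adapt automatically if the data changed.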
Now, the population heatmap is created. To create it we use the aggregated data (demo_agg), with Age group as the x axis variable and Planning area as the y axis variable. Population is passed as the ‘fill’ parameter in the aes function. geom_tile() draws the tiles that make up the heatmap. The tile colour is set to white so that the tile edges are white, which makes the colour gradient easier to see. scale_fill_gradient2() fills the tiles with a colour gradient according to the population. Tiles for planning area and age group combinations with a lower population are coloured ‘blue’ and those with the highest population are coloured ‘red’. The middle colour is ‘white’ (the function's default) and the ‘midpoint’ is set to ‘10000’. theme_minimal() removes the gray background (a non-data component), which is the ggplot2 default. The theme() function is used with the axis.text.x parameter to tilt the labels by 45 degrees so that they do not overlap. The labs() function provides the title, caption and x and y axis labels.
ggplot(data = demo_agg, aes(x = Age_Group, y = Planning_Area, fill = Population)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", midpoint = 10000) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1)) +
  labs(title = "Heatmap of population for Planning area and Age group characteristics",
       caption = "Source: Singstat.gov.sg", x = "Age group", y = "Planning Area")
Description:
The purpose of this visualization is to show the proportion of the population in each dwelling type for each age group. For example, the blue stack in the first bar, for age group 0_to_4, indicates that almost 25% of the population aged 0 to 4 lives in HDB 5-room and executive flats.
Insights:
Description:
The purpose of this visualization is to show the population density of each age group in Singapore for 2019. It lets us see which age groups have a wider distribution of population and which have a narrower one.
Insights:
Description:
The purpose of this visualization is to compare the proportion of males and females in the different age groups. Bars of equal length for males and females indicate a balanced proportion, whereas bars of unequal length indicate a proportion skewed towards one sex.
Insights:
Description:
The purpose of this visualization is to show a heatmap of population for each combination of age group and planning area. It lets us see the areas where the population is high or low, and which age groups are more or less present in each planning area. Blue indicates a lower population: the darker the blue, the smaller the population. Red indicates a higher population: the darker the red, the larger that age group is in that area.
Insights: