1. Overview

This DataViz assignment captures the change in the demographic pattern of Singapore's population by age cohort, economic dependency and geographical hierarchy over time (2011 to 2019). There are two parts to this assignment:

1. Reveal the demographic structure of Singapore's population by age cohort (i.e. 0-4, 5-9, ...) and by planning area in 2019.
2. Reveal the changing patterns of demographic composition (i.e. the young (age 0-24), the economically active group (age 25-64) and the aged group (65 and above)) in Singapore by geographical hierarchy (i.e. region and planning area) over time (i.e. 2011-2019).

For the purposes of this assignment:

- An age-sex pyramid and a heatmap are created to show the demographic structure of Singapore in 2019, by age cohort and gender and by age cohort and planning area respectively. (Alternative 1)
- To reveal the changing patterns of demographic composition by economic dependency and by geographical area over time, a slope chart is created to show the change from 2011 to 2019, grouped by the economic dependency groups. (Alternative 2)

2. Challenges faced during visualization

Previous DataViz assignments were done in Tableau, a user-friendly, no-code platform. From this week onwards the assignments are to be done in R, which brings several challenges.

2.1. Major data and design challenges

Some of the major data and design challenges faced in this assignment:

The raw data was not in a form that could be consumed directly by all the visualizations. Each visualization needed its own data manipulation, and some fields, such as TOD (Type of Dwelling), are not used in any of them. Hence, the data had to be prepared separately for each visualization. For instance, the age-sex pyramid requires the population aggregated by age and sex, whereas the heatmap requires the population aggregated by age group for each planning area. The cleaned data for one could therefore not be reused for the other. Because only one table was used, no join issues were faced; filtering and aggregating were the major challenges.
Bringing interactivity such as tooltips into the visualizations is a challenge. In Tableau, filters, tooltips and legends can be added to a visualization directly and interactively. A static ggplot2 chart has no tooltips, and labelling every value would clutter the chart. Hence, adding such interactivity in an intuitive way is a challenge.
Customizing the look of the charts in ggplot is another challenge. The default output of ggplot, with its grey background, is less polished than that of Tableau, and the basic themes need to be customized. In addition, the slope chart produces a large legend that takes space away from the visualization itself, so it also needs to be customized.

2.2. Ways to overcome challenges mentioned above

For the data preparation challenges, dplyr and tidyr from tidyverse have been used to filter and reshape the fields. Data preparation is performed separately for each visualization for clarity, and code that could be reused has been reused. The required columns are retained and transformed with spread() or mutate(), and aggregate() is then used to sum the rows that share the same categories.
Even though adding interactivity in R is not as intuitive as it is in Tableau, there is still a lot of scope to add it through ggplot and plotly. A ggplot2 chart can be converted into a plotly chart with tooltips that show the underlying values, and the plotly toolbar lets the user save a snapshot of the visualization as a png. The title, subtitle and caption (source credits) are added through labs() in ggplot.
A reusable code chunk called Formatting is defined at the start of the visualization steps below. It applies basic formatting such as the theme, the grid lines and the alignment of the title and subtitle; further tweaks, such as hiding an axis title or the legend, are added per chart where required. This piece of code is used in more than one visualization and can be reused in future as well.

2.3. Sketches of the visualizations

An age-sex population pyramid is used to show Singapore's population distribution by age and gender for the year 2019.

A heatmap is used to show the demographic structure of Singapore in 2019 by age cohort and planning area.

A slope line chart is used to show the changing patterns of demographic composition by economic dependency and by geographical area over time.

3. Step by Step Description of the process

This section describes the step by step process for creating the visualizations. It is divided into 3 sections - Installation of R packages, Data Preparation and Creating Visualizations. Customizations and aesthetics are covered in the third section because they are added within the ggplot code itself. The aesthetics added are mentioned explicitly for clarity.

3.1. Installation of R packages

This code chunk checks whether the tidyverse, heatmaply and plotly packages are installed on the user's machine, installs any that are missing, and loads all of them into the RStudio environment in one loop instead of one by one. These packages are needed for the visualizations.

# Packages needed for the visualizations
packages <- c('tidyverse', 'heatmaply', 'plotly')

for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

3.2. Data Preparation

The data source for this dataviz is Singstat.com. The dataset used is Singapore Residents by Planning Area/Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019. The dataset was downloaded from the link provided and placed, in csv format, in the data sub-folder of the DataViz Makeover 8 project folder.

Importing Data

To import respopagesextod2011to2019.csv into R, the read_csv() function of the readr package is used. This is the raw data, so it is stored as the “raw_data” dataframe.

raw_data <- read_csv("data/respopagesextod2011to2019.csv")

## Parsed with column specification:
## cols(
##   PA = col_character(),
##   SZ = col_character(),
##   AG = col_character(),
##   Sex = col_character(),
##   TOD = col_character(),
##   Pop = col_double(),
##   Time = col_double()
## )
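
The column specification above shows that Time is parsed as a number, which is why it is converted with as.character() in the later steps. As a hedged alternative (not the approach used in this makeover), the column type could be declared at import. The sketch below writes a small sample file with invented values to a temporary path so that it runs on its own; sample_path and sample_data are hypothetical names.

```r
library(readr)

# Hypothetical two-row sample with the same column names as the raw file;
# the values are invented for illustration only.
sample_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PA,SZ,AG,Sex,TOD,Pop,Time",
  "Area A,Subzone 1,0_to_4,Males,Dwelling type 1,0,2011",
  "Area A,Subzone 1,5_to_9,Females,Dwelling type 1,10,2019"
), sample_path)

# Read Time as character up front; all other columns are still guessed.
sample_data <- read_csv(sample_path, col_types = cols(Time = col_character()))

str(sample_data$Time)
```

With Time already stored as text, a filter such as Time == "2019" works without a separate mutate() step.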

Preparation of the data is done using the dplyr and tidyr packages in tidyverse. Functions like mutate(), spread(), filter() and select() are used to derive the appropriate variables for each of the visualizations.

Data Preparation for Age-Sex Pyramid (Gender and Age grouped)

“poppyramid” is created by filtering the data for 2019 and removing the columns that are not required with select(). Because Time is read as col_double() by default, it is first converted to character through mutate().

poppyramid <- raw_data %>%
  mutate(`Time`=as.character(Time)) %>%
  filter(Time=="2019") %>%
  select(-PA,-SZ,-TOD,-Time)

Because the age groups in “AG” are sorted as text, “5_to_9” sorts next to “50_to_54” rather than after “0_to_4”, which is not the order we want. Hence, it is replaced with “05_to_09” through str_replace() so that the order is maintained in the next steps. Finally, because there are multiple rows for the same age group and gender, they are summed with the aggregate() function and the final dataframe is stored in “poppyramidfinal” as shown below.

poppyramid$AG<-str_replace(as.character(poppyramid$AG),"5_to_9","05_to_09")
poppyramidfinal <- aggregate(Pop~AG+Sex,data=poppyramid,FUN=sum)
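
How text labels sort depends on the locale, so padding a label may not give the same order on every machine. As a hedged alternative (a sketch, not the method used in this makeover), the age groups could be turned into a factor whose levels follow the numeric lower bound of each group. The vector ag below is a made-up subset of labels in the style of the AG column.

```r
# Age-group labels in the style of the AG column (a subset, for illustration).
ag <- c("0_to_4", "10_to_14", "5_to_9", "50_to_54", "90_and_over")

# Lower bound of each group: the digits before the first underscore.
lower_bound <- as.integer(sub("_.*", "", ag))

# Build a factor whose levels follow the numeric lower bound,
# so plots order the groups by age regardless of locale.
ag_ordered <- factor(ag, levels = ag[order(lower_bound)])

levels(ag_ordered)
```

ggplot2 places discrete values in the order of the factor levels, so a factor built this way would keep the age groups in age order on the axis without renaming any label.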

Data Preparation for Heatmap (PA and Age grouped)

Data preparation for the heatmap is similar to that for the age-sex population pyramid. Code that could be reused has been reused, but the steps are kept separate for clarity. Time is again converted to character and filtered for 2019, because the aim is to reveal the demographic structure of Singapore's population by age cohort and by planning area in 2019. Because only the planning area, the age cohort and the population are needed, select() is used to remove all unwanted columns. The result is stored in a dataframe called heatmap.

heatmap<- raw_data %>%
  mutate(`Time`=as.character(Time)) %>%
  filter(Time=="2019") %>%
  select(-SZ,-Time,-Sex,-TOD)

The issue with heatmap is the same as with poppyramid. There are multiple rows with the same planning area and age group, whereas what we need is one row for each combination of planning area and age cohort. This is done through aggregate(), and the value ‘5_to_9’ is changed to ‘05_to_09’ through str_replace().

heatmapnew <- aggregate(Pop~PA+AG,data=heatmap,FUN=sum)
heatmapnew$AG<-str_replace(as.character(heatmapnew$AG),"5_to_9","05_to_09")
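
For readers who prefer to stay within dplyr, the same kind of summation can be sketched with group_by() and summarise(). This is an illustrative equivalent on invented toy rows, not the code used for the charts; toy and toy_sum are hypothetical names.

```r
library(dplyr)

# Toy rows (invented values) shaped like the filtered 2019 data:
# several rows per planning area and age group.
toy <- tibble(
  PA  = c("Area A", "Area A", "Area A", "Area B"),
  AG  = c("0_to_4", "0_to_4", "5_to_9", "0_to_4"),
  Pop = c(10, 15, 20, 5)
)

# One row per PA/AG combination, with Pop summed within each combination.
toy_sum <- toy %>%
  group_by(PA, AG) %>%
  summarise(Pop = sum(Pop), .groups = "drop")

toy_sum
```

The result has one row per planning area and age group, which is the long shape that geom_tile() expects.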

Data Preparation for Slope Chart (Economic dependency grouped)

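This data preparation is for the slope chart, where the age groups are combined by economic dependency: young (0-24 years), active (25-64 years) and old (65+ years). The age groups are first spread into columns with spread(). mutate() with rowSums() then adds the relevant columns together by position; the helper columns Young1, Young2, Active1 and Active2 are needed because the age-group columns are in text order, so the columns for each group are not all adjacent. The extra columns are removed with select(). Finally, pivot_longer() turns the Young, Active and Old columns into an Economic_dependency column, with the values stored in Population.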

popslopeline <- raw_data %>%
  mutate(`Time`=as.character(Time)) %>%
  spread(AG,Pop) %>%
  mutate (Young1=rowSums(.[6:9])) %>%
  mutate (Young2=rowSums(.[15])) %>% 
  mutate (Young=rowSums(.[25:26])) %>%
  mutate (Active1=rowSums(.[10:14])) %>%
  mutate (Active2=rowSums(.[16:18])) %>%
  mutate (Active=rowSums(.[28:29])) %>%
  mutate (Old=rowSums(.[19:24])) %>%
  select(PA,Time,Young,Active,Old) %>%
  pivot_longer(-c(Time,PA),names_to = "Economic_dependency",values_to = "Population")

The same issue of multiple rows with the same categories arises here too, so aggregate() is used again to sum the Population by PA, Time and Economic_dependency. Also, because the chart should be faceted in the order Young, Active and Old, the Economic_dependency column is converted to a factor with its levels in that order.

popnew1<- aggregate(Population~PA+Time+Economic_dependency,data=popslopeline,FUN=sum)
popnew1$Economic_dependency=factor(popnew1$Economic_dependency,levels=c("Young","Active","Old"))
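
Summing columns by position depends on the exact column order after spread(). As a hedged alternative (a sketch on invented toy rows, not the code used for the chart), the group could be derived from the age label itself with case_when(). The names toy, lower and toy_grouped are hypothetical.

```r
library(dplyr)

# Toy rows (invented values) with AG labels in the style of the raw data.
toy <- tibble(
  PA   = "Area A",
  Time = "2019",
  AG   = c("0_to_4", "5_to_9", "20_to_24", "25_to_29",
           "60_to_64", "65_to_69", "90_and_over"),
  Pop  = c(10, 20, 30, 40, 50, 60, 70)
)

toy_grouped <- toy %>%
  # Lower bound of the age group: the digits before the first underscore.
  mutate(lower = as.integer(sub("_.*", "", AG))) %>%
  mutate(Economic_dependency = case_when(
    lower <= 20 ~ "Young",   # age groups 0-4 up to 20-24
    lower <= 60 ~ "Active",  # age groups 25-29 up to 60-64
    TRUE        ~ "Old"      # 65 and above
  )) %>%
  # Keep the facet order Young, Active, Old.
  mutate(Economic_dependency = factor(Economic_dependency,
                                      levels = c("Young", "Active", "Old"))) %>%
  group_by(PA, Time, Economic_dependency) %>%
  summarise(Population = sum(Pop), .groups = "drop")

toy_grouped
```

Because the grouping is read from the label, this version would not need to change if a column were added or reordered upstream.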

3.3. Creating Visualizations

This is a basic formatting code chunk that is reused in the different visualizations. It sets the theme, removes the vertical major grid lines, sets the size of the top X-axis text, and centres the title and subtitle.

#reusable code for formatting the plot
Formatting <- list( 
  theme_bw(),
  theme(panel.grid.major.x = element_blank()),
  theme(axis.text.x.top = element_text(size=12)),
  theme(plot.title = element_text(size=14, face = "bold", hjust = 0.5)),
  theme(plot.subtitle = element_text(hjust = 0.5))
)

Age-Sex Pyramid

Base Visualization

This visualization shows the age and gender distribution in Singapore in 2019. ggplot2 from tidyverse is used to create it. geom_col(), which draws bar charts, is added to ggplot(). Two sets of bars are needed, one for males and one for females. In the aesthetics, the X axis is the age group “AG”, while for the Y axis the male population values are multiplied by -1 so that their bars point in the opposite direction; the fill is based on Sex so that the two categories get different colours. coord_flip() is used to flip the X and Y axes. The Y axis is scaled with scale_y_continuous() so that the negative populations are labelled as positive values. Using labs(), the axes are labelled and a title, a subtitle and a caption are given to the visualization. The standardized Formatting defined before is also applied.

p<-ggplot(poppyramidfinal,aes(x = AG, y = ifelse(Sex == "Males", yes = -Pop, no = Pop),fill = Sex))+
  geom_col()+
  coord_flip()+
  scale_y_continuous(labels = abs, limits = max(poppyramidfinal$Pop)*c(-1,1))+
  Formatting +
  labs(
    x="Age",
    y = "Population",
    title = "Singapore's Age-Sex Pyramid Structure, 2019",
    subtitle = "Age-sex ratio of the population in Singapore seems to be skewed towards middle aged generation",
    caption = "Data Source: Singstat.com"
  )
p

Interactivity embedded

The age-sex population pyramid p created above is converted with ggplotly() from the plotly package to make it interactive, with tooltips that show the values for the different age groups and the associated male and female population.

ggplotly(p,session="knitr")
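
Because the male values are plotted as negative numbers, the default tooltip would report the mapped Y value, which is negative for males. One hedged way around this (a sketch on invented toy data, not the chart above) is to map a custom text aesthetic and ask ggplotly() to show only that. The names toy, p_toy and p_int are hypothetical.

```r
library(ggplot2)
library(plotly)

# Toy data (invented values) shaped like poppyramidfinal.
toy <- data.frame(
  AG  = rep(c("0_to_4", "05_to_09"), times = 2),
  Sex = rep(c("Males", "Females"), each = 2),
  Pop = c(100, 120, 95, 110)
)

p_toy <- ggplot(toy, aes(x = AG,
                         y = ifelse(Sex == "Males", -Pop, Pop),
                         fill = Sex,
                         # 'text' is not drawn by geom_col();
                         # ggplotly() reads it for the tooltip.
                         text = paste0(Sex, ", ", AG, ": ", Pop))) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = abs)

# Show only the custom text, so male bars report a positive population.
p_int <- ggplotly(p_toy, tooltip = "text")
p_int
```

The tooltip argument of ggplotly() selects which aesthetics appear on hover; restricting it to "text" hides the raw mapped values.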

Heatmap

Base Visualization

This visualization shows the demographic structure of Singapore's population by age cohort and planning area. The population of each combination is represented by the colour of its tile. Even though base R has a heatmap() function that could be used for this purpose, ggplot is used here with geom_tile(), which produces a heatmap. The dataframe created above, heatmapnew, is used. In the aesthetics, the age group “AG” is plotted on the X axis and the planning area “PA” on the Y axis, and the population is used as the fill for geom_tile(). The general formatting defined previously is retained, with some additional formatting to rotate the labels on the X axis and to reduce the font size on the Y axis so that all the planning areas fit. Through labs(), the title, subtitle, caption and axis labels are set.

heatmapviz<- ggplot(heatmapnew,aes(AG,PA,fill=Pop))+
  geom_tile(position = "identity",stat = "identity")+
  Formatting+
  labs(title = "Singapore Demographic structure, 2019",
    subtitle = "Distribution of Singapore's Population by Age and Planning Area",
    caption = "Data Source: Singstat.com",
    x="Age Group",
    y="Planning Area")+
  theme(axis.text.x = element_text(angle = 90))+
  theme(axis.text.y = element_text(size=5))
heatmapviz

Interactivity embedded

To make the previous visualization interactive and to embed tooltips, ggplotly() is used. Because the tiles for the planning areas are very small, the zoom in the interactive version also helps to focus on individual areas.

q<-ggplotly(heatmapviz,session="knitr")
q

Slope Chart - Changing pattern of demographic composition (economic dependency) by geographical area over time

Base Visualization

The code chunk below creates a static slope chart showing the population of the different planning areas from 2011 to 2019, faceted by the economic dependency groups. ggplot() is called on the dataframe prepared above, with Time on the X axis and Population on the Y axis, grouped by planning area. To draw the lines, a geom_line() layer is added, coloured by “PA”, with alpha and size set for transparency and line width. A geom_point() layer is added with similar aesthetics. To facet the visualization by Economic_dependency, facet_grid() is used. The default formatting created above is used for this visualization as well, with some more formatting to remove the legend and the X-axis title. Through labs(), the title, subtitle and source caption are added.

gg<-ggplot(popnew1,aes(x=Time,y=Population,group=PA))+
  geom_line(aes(color=PA,alpha=1),size=1)+
  geom_point(aes(color=PA,alpha=1),size=2)+
  facet_grid(Economic_dependency ~ .)+
  Formatting + 
  list(theme(axis.title.x = element_blank()),
       theme(legend.position = "none"))+
  labs(
    title = "Singapore's Demographic Structure, 2011-2019",
    subtitle = "Economic dependency of the population change over the years",
    caption = "Data Source: Singstat.com")
gg

Interactivity embedded

The visualization created above is made interactive, with tooltips, through ggplotly() from the plotly package, which converts a ggplot into a plotly chart.

r <- ggplotly(gg,session="knitr")
r

4. Reflections from the visualizations

Age-sex Population Pyramid

The age-sex pyramid shows that the middle-aged population is currently larger than both the young and the old population. This could be because working-age people migrate to Singapore from other countries for work. The old population is in general smaller than both the middle-aged and the young population.

Heatmap

The heatmap shows that some planning areas are densely populated whereas others are sparsely populated. For instance, planning areas like Punggol, Jurong West and Sengkang have a higher density of young to middle-aged population. Another observation is that the older age groups have lower densities.

Slope Chart

This visualization shows the population of the different planning areas from 2011 to 2019, split by the economic dependency groups. The active population is the largest, followed by the young and then the old. This ordering has held across the nine years from 2011 to 2019, but the population is ageing: the number of young people is falling while the old population is increasing.

5. Advantages of data visualization on R compared to Tableau

Visibility of Grammar of Graphics: This is an advantage of R compared to Tableau, in that each layer is visibly added one by one in ggplot. Tableau is data exploration software, so the different layers are not visibly added one after the other. Each step has to be performed manually, and a layer could be missed if one is not careful, whereas in R the layers are written out in the code, which acts as a guideline and makes it harder to forget one.
Pre-built libraries: These libraries provide functions with a preset format, so that the desired visualization can often be created with just one function. For instance, a ternary plot needs many data manipulations and calculated fields in Tableau, but in R, with the help of ggtern, it can be created in about five lines of code. The variety of visualizations that can be created in R with pre-built libraries like ggplot and plotly is far greater than in Tableau.
Flexibility: R offers more flexibility than Tableau with respect to visualizations because everything is coded. The visualizations can be customized in terms of aesthetics such as colours, sizes, labels and axes. Interactivity such as tooltips can be added automatically by plotly, and elements such as legends and tooltips show up without having to be added separately.
Embedded visualization and comments: R Markdown enables literate programming (write the code, generate the output), with all notes and comments alongside to support it. It is a very useful way of creating a report and describing the process undertaken for the visualization. These files can be knitted into html, pdf and word for circulation. This feature is not available in Tableau; if a dashboard is created in Tableau, the report describing the data visualization needs to be created separately.

DataViz Makeover 8

Oishee Bhattacharyya

9 March 2020