1. Major Data & Design Challenges

The Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 data series provided by the Department of Statistics Singapore can be downloaded from https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data as the file ‘respopagesextod2011to2019.csv’. The challenges of working with the downloaded data and visualizing it are described in the following sub-sections.

1.1. Need to Filter Data

On examining the downloaded data series in Microsoft Excel, we see that it contains data for Years 2011 to 2019 (see Figure 1 below). Since we are only interested in the latest data, for Year 2019, we need to filter the downloaded dataset so that only the Year 2019 rows are kept.

Figure 1: Downloaded Data Series

1.2. Need for More Information to Group Planning Areas into Regions

Also, as we scroll through the list, there are slightly more than 300 subzones, which are grouped into 55 planning areas. While the data is very detailed, it would be a cognitive overload for a reader to take in so much detail in a static visualization. Hence, there is a need to further group the planning areas into a smaller number of regions so that the reader can get a quick overview. However, the dataset contains no information on how the planning areas are grouped into regions, so this has to be found from other sources. Wikipedia offered a good source of information on this grouping: https://en.wikipedia.org/wiki/Planning_Areas_of_Singapore

1.3. Transform the Data

Since the ‘respopagesextod2011to2019.csv’ file could be conveniently opened and manipulated in Microsoft Excel, I used Data Filters to remove all data from before Year 2019 and re-saved the file.
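
The same filter could also be applied in code rather than in Excel. The following is a minimal sketch in base R on a small made-up data frame; the column names (PA, AG, Sex, Pop, Time) are assumptions about the header of the downloaded file and should be checked against it, and the values are invented for illustration.

```r
# Toy stand-in for the downloaded file; column names and values are assumptions.
raw <- data.frame(
  PA   = c("Ang Mo Kio", "Ang Mo Kio", "Bedok", "Bedok"),
  AG   = c("0_to_4", "0_to_4", "0_to_4", "0_to_4"),
  Sex  = c("Males", "Males", "Females", "Females"),
  Pop  = c(10, 20, 30, 40),
  Time = c(2018, 2019, 2018, 2019)
)

# Keep only the rows for Year 2019.
pop_2019 <- raw[raw$Time == 2019, ]
print(pop_2019)
```

With the real file, ‘raw’ would instead come from reading ‘respopagesextod2011to2019.csv’, for example with read.csv.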

Also, since the grouping of planning areas into regions was not included in the data series downloaded from the Department of Statistics Singapore, this information needs to be inserted into the dataset. Wikipedia provides a structured table showing how planning areas are grouped into regions (see Figure 2 below). Hence, I used Python to read the grouping from Columns 1 and 6 of this table, and then modified the ‘respopagesextod2011to2019.csv’ file to include the regions.

Figure 2: Structured Table from Wikipedia with Grouping of Planning Areas into Regions
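
The Python script used for this step is not reproduced here. As an illustration of the idea only, the sketch below attaches a region to each row by joining against a lookup table in base R. The data frames, the column name ‘PA’ and the population values are hypothetical; the three area-to-region pairs are just a small sample of the full table.

```r
# Hypothetical lookup table: planning area -> region (small sample only).
region_lookup <- data.frame(
  PA     = c("Ang Mo Kio", "Bedok", "Jurong West"),
  Region = c("North-East", "East", "West")
)

# Hypothetical population rows with made-up counts.
pop <- data.frame(
  PA  = c("Bedok", "Ang Mo Kio", "Jurong West", "Bedok"),
  Pop = c(30, 10, 50, 40)
)

# Left join: every population row is kept and gains a Region column.
# Note that merge() sorts the result by the key column by default.
pop_region <- merge(pop, region_lookup, by = "PA", all.x = TRUE)
print(pop_region)
```

A join like this only matches when the planning area names are spelled and capitalised identically in both tables, so any rows left with a missing Region are worth checking.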

I also transformed the data into a long format using Python. Finally, 2 transformed datasets were ready: ‘Long_Data2.csv’ and ‘Long_Data3.csv’ (see Figure 3 below).

Figure 3: ‘Long_Data2.csv’ and ‘Long_Data3.csv’
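
The exact layout of the two files is shown only in Figure 3. One layout that would be consistent with the row-counting bar charts in Section 2 is one row per resident. The sketch below shows, under that assumption, how aggregated counts could be expanded into such a format in base R; the ‘Pop’ column and all values are made up.

```r
# Aggregated toy data: one row per age group and gender, with a count.
agg <- data.frame(
  Age_Group = c("0_to_4", "0_to_4", "5_to_9"),
  Gender    = c("Males", "Females", "Males"),
  Pop       = c(3, 2, 4)
)

# Repeat each row index Pop times, then drop the now-redundant Pop column.
long <- agg[rep(seq_len(nrow(agg)), times = agg$Pop), c("Age_Group", "Gender")]
rownames(long) <- NULL

# Row counts per age group now equal the original population totals.
print(table(long$Age_Group))
```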

1.4. Proposed Design

1.4.1. Overview

There would be an overview visualization (see Figure 4 below) to enable the reader to appreciate the national demographics of Singapore (sorted from oldest to youngest age groups). Since the labels for the age groups are quite long, we should put these on the y-axis so that they can be clearly and neatly displayed. Since age groups are categorical in nature, and their counts are discrete, we could use a horizontal bar chart with the total count for each age group presented on each bar. Major and minor gridlines could also help the reader have a good sense of how one bar compares to another.

Figure 4: Sketch of Overview Visualization

After the reader is able to appreciate the overall national demographics, we could then present a similar horizontal bar chart (see Figure 5 below), with more information on gender. The horizontal stacked bar chart would have the stacked bars coloured differently for gender - Male and Female. The counts for each stacked bar will also be presented on the graph.

Figure 5: Sketch of Overall National Demographics by Gender

1.4.2. Diving Deeper

With an appreciation of the national demographics of Singapore by gender, readers can now dive deeper into details at the region level. I envisaged 6 horizontal stacked bar charts (see Figure 6 below), one for each region - Central, East, North, North-East, South and West - visualized together on a common x-axis scale for the count of residents. This allows the reader to compare demographics across regions. However, since 6 charts are now presented together, I would leave out the count displayed on each bar so as not to clutter the visualization. I would also leave out the breakdown of age groups by planning area, as it would be too much detail for a reader to appreciate. Visualizing data by planning area could be done in a different way, as described in Section 1.4.4 below.

Figure 6: Sketch of Demographics by Region

1.4.3. Changing Perspectives from Demographics to Population

So far, the demographics charts described above give information on age groups. We could use a different visualization to lead the reader into another perspective, this time, visualizing the total resident population by region (leaving out details on age groups). See Figure 7 below. Here, we could sort the regions in descending count of their total resident population so that the reader could appreciate which regions have larger population size.

Figure 7: Sketch of Resident Population by Region

1.4.4. Diving Deeper in Perspective of Population

Now, with the reader’s perspective changed to population by region, we can dive deeper into population by planning areas. Rather than 1 single visualization on resident population by planning areas, we could display 6 horizontal bar charts (see Figure 8 below) - one for each region. This visualization would help the reader to appreciate which regions the planning areas are grouped under and also the population sizes of the planning areas. The planning areas could be sorted by descending count of population so that the reader is able to appreciate which regions contain planning areas with larger resident populations.

Figure 8: Sketch of Proposed Design

2. Step-by-Step Description on Data Visualization Preparation

2.1. Installation and Loading of Required Libraries

The tidyverse package is installed (if it is missing) and loaded using the following lines:

#installing and loading the required libraries
packages = c('tidyverse')
for (p in packages){
  #install the package only if it cannot be loaded
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

2.2. Reading in Transformed Datasets

The transformed datasets ‘Long_Data2.csv’ and ‘Long_Data3.csv’ are read and imported using the following lines:

#importing data from transformed Long_Data2.csv file
pop_data = read_csv("data/Long_Data2.csv")


#importing data from transformed Long_Data3.csv file
pop_data_pa = read_csv("data/Long_Data3.csv")

2.3. Preparation of Overview Chart

First, the ggplot function is called with the data argument ‘data=pop_data’ and the aesthetic mapping ‘x = Age_Group’.

ggplot(data=pop_data, aes(x=Age_Group))

Next, the geom object ‘geom_bar’ is called in order to display a vertical bar chart of count for the various age groups.

ggplot(data=pop_data, aes(x=Age_Group)) +
  geom_bar()
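
‘geom_bar’ counts one unit per row, so the bar heights equal population counts only if the dataset holds one row per resident. If the data were instead kept aggregated, a weighted count could be used. The following is a minimal sketch of that alternative, assuming a hypothetical population column named ‘Pop’ and made-up values; it is not the approach used in the rest of this section.

```r
library(ggplot2)

# Aggregated toy data; the Pop column and its values are assumptions.
agg <- data.frame(
  Age_Group = c("0_to_4", "0_to_4", "5_to_9"),
  Pop       = c(3, 2, 4)
)

# weight = Pop makes the count statistic sum Pop instead of counting rows.
p <- ggplot(data = agg, aes(x = Age_Group, weight = Pop)) +
  geom_bar()

# Inspect the computed bar heights without drawing the plot.
bar_heights <- layer_data(p)$count
print(bar_heights)
```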

Since we wanted the age groups to be on the y-axis, we added a function ‘coord_flip’ to switch the vertical bar chart to a horizontal bar chart.

ggplot(data=pop_data, aes(x=Age_Group)) +
  geom_bar() +
  coord_flip()

The above chart has horizontal gridlines, which are unnecessary data ink. To remove them, yet retain the vertical major and minor gridlines, we add the theme function and blank out only the horizontal (y) major and minor gridlines.

ggplot(data=pop_data, aes(x=Age_Group)) +
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(),   panel.grid.minor.y = element_blank())

To add the count value for each horizontal bar, we add another geom object called ‘geom_text’, specifying the text to be displayed as the count value. To ensure that the value is easily seen over the grey bars, we set the color to ‘light blue’, adjust the position to be within the bars and make the text size smaller.

ggplot(data=pop_data, aes(x=Age_Group)) +
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='light blue', hjust=1.1, size = 2.8)

Finally, we want to change the count axis from scientific notation format to a more readable number format, with thousands separated by a comma. We do this by importing a library called ‘scales’, and specifying ‘labels = comma’.

library(scales)

ggplot(data=pop_data, aes(x=Age_Group)) +
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='light blue', hjust=1.1, size = 2.8) +
  scale_y_continuous(labels = comma)

The resulting final visualization is presented in Section 3.1.
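
As a quick illustration of the number format used in the last step, the ‘comma’ function from ‘scales’ can be called directly on a few values. This is only a sketch with made-up numbers; in the chart, ggplot applies the same function to the axis breaks.

```r
library(scales)

# comma() formats numbers with a comma as the thousands separator.
labels <- comma(c(1000, 250000, 1500000))
print(labels)
```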

2.4. Preparation of Singapore Demographics Breakdown by Gender Chart

We build on the code written for the overview chart by adding ‘fill=Gender’ to the aesthetic mapping in the ggplot function.

library(scales)

ggplot(data=pop_data, aes(x=Age_Group, fill=Gender)) +
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='light blue', hjust=1.1, size = 2.8) +
  scale_y_continuous(labels = comma)

However, the count values on each bar overlap and cannot be seen clearly over the coloured stacked bars. Hence, we need to adjust the arguments of the ‘geom_text’ function to re-position each count value at the middle of its stacked segment and change the text color to ‘black’ to contrast against the colours of the stacked bars.

ggplot(data=pop_data, aes(x=Age_Group, fill = Gender)) + 
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='black', position = position_stack(vjust = .5), size = 2.8) +
  scale_y_continuous(labels = comma)

The resulting final visualization is presented in Section 3.2.

2.5. Preparation of Demographic Structure of Singapore by Gender & Region Chart

Starting from the code written for the previous chart, we add the function ‘facet_wrap(~ Region)’ in order to present the previous chart by region.

ggplot(data=pop_data, aes(x=Age_Group, fill = Gender)) + 
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='black', position = position_stack(vjust = .5), size = 2.8) +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ Region)

We see that the charts are too cluttered with the count value labeled on each stacked bar, so we remove the labels by removing the ‘geom_text’ object.

ggplot(data=pop_data, aes(x=Age_Group, fill = Gender)) + 
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(),   panel.grid.minor.y = element_blank()) +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ Region)

The resulting final visualization is presented in Section 3.3.

2.6. Preparation of Resident Population by Region Chart

A new ggplot has to be called with data argument as ‘data=pop_data_pa’ and ‘aes(x=Region)’ since we want to visualize the population count for each region.

ggplot(data=pop_data_pa, aes(x= Region))

Next, we add on a geom object ‘geom_bar’ since we want to visualize the data in bar charts.

ggplot(data=pop_data_pa, aes(x= Region)) + 
  geom_bar()

We add a coord_flip to switch the vertical bar chart to a horizontal bar chart.

ggplot(data=pop_data_pa, aes(x= Region)) + 
  geom_bar() +
  coord_flip()

Again, we want to remove the horizontal gridlines and retain the vertical gridlines. We do this by using the same theme function that was used for previous charts.

ggplot(data=pop_data_pa, aes(x= Region)) + 
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(),   panel.grid.minor.y = element_blank())

We insert a ‘geom_text’ layer similar to the one used in previous charts to display the population count on each bar, this time in ‘blue’ and positioned just beyond the end of each bar.

ggplot(data=pop_data_pa, aes(x= Region)) + 
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='blue', hjust=-0.1, size = 2.8)

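To sort the regions by descending population size, we set the factor levels in the order that we want. The levels are sorted in ascending order of count (‘decreasing=FALSE’) because, after ‘coord_flip’, the first level is drawn at the bottom of the chart, so the largest region appears at the top.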

## set the levels in the order that we want
pop_data_pa <- within(pop_data_pa, 
                   Region <- factor(Region, 
                                      levels=names(sort(table(Region), 
                                                        decreasing=FALSE))))

ggplot(data=pop_data_pa, aes(x= Region)) + 
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='blue', hjust=-0.1, size = 2.8)
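
To see what the re-levelling step does on its own, the sketch below applies the same expression to a small made-up vector with one entry per resident. The region names and counts are for illustration only.

```r
# Toy data: one entry per resident (1 in East, 2 in North, 3 in West).
Region <- c("East", "West", "West", "West", "North", "North")

# table() counts entries per region; sort(..., decreasing = FALSE) puts the smallest first.
ordered_levels <- names(sort(table(Region), decreasing = FALSE))
print(ordered_levels)

# The factor keeps the same values but now carries the count-based level order.
Region_f <- factor(Region, levels = ordered_levels)
print(levels(Region_f))
```

The last level, here the region with the most entries, is the one that ends up at the top of the flipped chart.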

Finally, we added ‘scale_y_continuous(labels = comma)’ to be consistent in presenting numbers on the count axis.

## set the levels in the order that we want
pop_data_pa <- within(pop_data_pa, 
                   Region <- factor(Region, 
                                      levels=names(sort(table(Region), 
                                                        decreasing=FALSE))))

ggplot(data=pop_data_pa, aes(x= Region)) + 
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='blue', hjust=-0.1, size = 2.8) +
  scale_y_continuous(labels = comma)

The resulting final visualization is presented in Section 3.4.1.

2.7. Preparation of Resident Population by Regions and their Planning Areas Chart

We build on the code for the previous graph. To present a similar chart for the planning areas within each region, we map ‘x = Planning_Area’, order the planning areas by population in the same way as before, and add ‘facet_wrap(~ Region)’.

## set the levels in the order that we want
pop_data_pa <- within(pop_data_pa, 
                   Planning_Area <- factor(Planning_Area, 
                                      levels=names(sort(table(Planning_Area), 
                                                        decreasing=FALSE))))

ggplot(data=pop_data_pa, aes(x= Planning_Area)) + 
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='blue', hjust=0.5, size = 2.8) +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ Region)

To make it more intuitive for the reader to recognize which Planning Areas fall under which Region, we could colour them by regions.

## set the levels in the order that we want
pop_data_pa <- within(pop_data_pa, 
                   Planning_Area <- factor(Planning_Area, 
                                      levels=names(sort(table(Planning_Area), 
                                                        decreasing=FALSE))))

ggplot(data=pop_data_pa, aes(x= Planning_Area, fill = Region)) + 
  geom_bar() +
  coord_flip() +
  theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank()) +
  geom_text(stat='count', aes(label=after_stat(count)), color='blue', hjust=0.5, size = 2.8) +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ Region)

The resulting final visualization is presented in Section 3.4.2.
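
By default, ‘facet_wrap’ shares the axes across panels, so every panel lists all planning areas, including those outside its region. If each panel should list only its own planning areas, one possible variation is sketched below on made-up data: the planning area is mapped to y directly (supported from ggplot2 3.3.0, so ‘coord_flip’ is not needed) and the y scale is freed per panel. This is an optional alternative, not the code used for the chart in Section 3.4.2.

```r
library(ggplot2)

# Toy data with made-up row counts: one row per resident.
toy <- data.frame(
  Planning_Area = c("Bedok", "Bedok", "Tampines",
                    "Jurong West", "Jurong West", "Jurong West"),
  Region        = c("East", "East", "East", "West", "West", "West")
)

# Mapping Planning_Area to y gives horizontal bars directly, and
# scales = "free_y" lets each panel keep only its own planning areas.
p <- ggplot(data = toy, aes(y = Planning_Area, fill = Region)) +
  geom_bar() +
  facet_wrap(~ Region, scales = "free_y")

print(p)
```

The trade-off is that bar thickness and row spacing can then differ between panels when regions have different numbers of planning areas.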

3. Final Data Visualization with Short Descriptions

3.1. Overview of Singapore Demographic Structure in 2019

The chart below presents the overall Singapore demographics in 2019, with the ‘active’ population [25 to 64 years old] forming the majority of the resident population and a progressively shrinking ‘young’ population [0 to 24 years old].

The code is presented in Section 2.3.

3.2. Singapore Demographics Breakdown by Gender

The chart shows the Singapore demographics breakdown by gender in 2019. For the ‘old’ population [65 years old and older], females outnumber males. For the ‘young’ population, males slightly outnumber females.

The code is presented in Section 2.4.

3.3. Demographic Structure of Singapore by Gender & Region

The demographic structure of Singapore residents is further broken down into regions - Central, East, North, North-East, South and West. In the North-East region, the key difference from the overall demographic structure of Singapore is a higher proportion of children and youths aged 0 to 14 years old, suggesting that younger families reside in this region. Only a minority of Singapore residents reside in the South region.

The code is presented in Section 2.5.

3.4. Singapore Resident Population in Regions & Planning Areas

3.4.1. Resident Population by Region

The chart below presents Singapore’s 2019 resident population by region, sorted in descending order of size, with the North-East region having the largest resident population and the South region the smallest.

The code is presented in Section 2.6.

3.4.2. Resident Population by Regions and their Planning Areas

The graph below aims to show 2 aspects of information: (1) a sense of the number of planning areas in each region and (2) how each planning area compares with the others in terms of resident population.

From the previous graph, we see that the population of the Central region is almost as large as that of the North-East region. However, the Central region consists of many more planning areas (albeit with smaller populations), while the North-East region consists of fewer planning areas with larger populations.

The code is presented in Section 2.7.

Assignment 4

Tan Hui Ling

7/10/2020