Major Data and Design Challenges

Data Challenges:

  • Unlike Tableau, R does not automatically aggregate population count by PA or Sex.
    Solution: Use the aggregate() function to achieve the same outcome.
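
As a minimal sketch of this kind of aggregation, the example below sums a made-up data frame by planning area and by sex. The column names PA, Sex and Pop mirror the dataset used in this post, but the values are invented and the formula interface of aggregate() is just one of several ways to write it.

```r
# Toy data: column names mirror the dataset in this post, values are made up.
toy <- data.frame(
  PA  = c("Bedok", "Bedok", "Tampines", "Tampines"),
  Sex = c("Males", "Females", "Males", "Females"),
  Pop = c(100, 120, 90, 110)
)

# Sum Pop within each planning area.
# The formula interface keeps the column name Pop instead of the default x.
pop_by_pa <- aggregate(Pop ~ PA, data = toy, FUN = sum)

# Sum Pop within each sex.
pop_by_sex <- aggregate(Pop ~ Sex, data = toy, FUN = sum)

print(pop_by_pa)
print(pop_by_sex)
```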

Design Challenges:

  • The age groups are sorted as characters rather than from youngest to oldest, and the order cannot be changed manually as in Tableau.
    Solution: Create an additional column holding the characters before the first _ (the starting age of each AG), convert it to numeric, and reorder the graph axis by the new column.

  • facet_wrap does not work well with too many facets, as the information on the charts becomes hard to read.
    Solution: Resize the figure height and width and adjust the number of columns to minimise clutter.

  • Axis scales are hard to read when the values are large, e.g. with many "0"s. Unlike Tableau, ggplot2 does not automatically give a more readable form.
    Solution: Customise the scales with the scale_y_continuous or scale_x_continuous function, e.g. convert 10000 to 10K to make it more reader-friendly.
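
Two of the fixes above can be tried in base R before touching any chart. The sketch below uses hypothetical age-group labels in the "start_to_end" style and a helper named label_k, which is an illustrative name and not part of ggplot2. A function like this can be passed to the labels argument of scale_x_continuous or scale_y_continuous instead of a hand-built vector of labels.

```r
# Hypothetical age-group labels in the style used by the dataset.
ag <- c("10_to_14", "0_to_4", "5_to_9", "90_and_over")

# Character sorting puts "10_to_14" before "5_to_9".
# method = "radix" forces C-locale ordering so the result does not depend on the system locale.
sort(ag, method = "radix")

# Take everything before the first underscore as the starting age.
ag_start <- as.numeric(sub("_.*", "", ag))

# Reorder the labels by starting age.
ag_ordered <- ag[order(ag_start)]

# A factor with levels in numeric order keeps that order on a plot axis.
ag_factor <- factor(ag, levels = ag_ordered)

# Illustrative helper: show counts in thousands with a K suffix.
# abs() lets the same helper serve mirrored (negative) pyramid axes.
label_k <- function(x) paste0(abs(x) / 1000, "K")

label_k(c(0, 10000, 250000))
```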

Sketch of Proposed Visualisations

Step-by-Step Description

1. Import Relevant Packages & Load Dataset

Import the tidyverse package and read the original dataset into pop_data.

library(tidyverse)
pop_data <- read_csv("respopagesextod2011to2020.csv")

2. Data Cleaning

  1. Filter the data by Time for the year 2019 and assign the dataset to pop_filterdate.

  2. Use the aggregate function to get the sum of the population count by planning area and assign it to popcount_PA. Then filter popcount_PA further by excluding planning areas with a population count of zero, as they do not provide much insight.

  3. Use the aggregate function to get the sum of the population count by planning area, age group and sex, and assign it to pop_cleaned. Then keep only the planning areas with a population of more than zero, using the planning areas that remain in popcount_PA. Create an additional column, AGstart, holding the starting age of each age group to facilitate sorting of the axis later on, and convert it from character to numeric.

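# Keep only the 2019 records
pop_filterdate <- filter(pop_data, Time == 2019)

# Total population per planning area; aggregate() names the summed column x
popcount_PA <- aggregate(pop_filterdate$Pop, by = list(PA = pop_filterdate$PA), FUN = sum)
# Drop planning areas with no residents
popcount_PA <- popcount_PA %>% filter(x != 0)

# Population per planning area, age group and sex
pop_cleaned <- aggregate(pop_filterdate$Pop, by = list(PA = pop_filterdate$PA, AG = pop_filterdate$AG, Sex = pop_filterdate$Sex), FUN = sum)

# Keep populated planning areas and extract the starting age of each age group as a number
pop_cleaned <- pop_cleaned %>%
  filter(PA %in% popcount_PA$PA) %>%
  mutate(AGstart = as.numeric(sub("_.*", "", AG)))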
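
For comparison, the same cleaning steps could also be written as a single dplyr pipeline. This is only a sketch of an alternative, not the approach used in this post: the data frame toy and its values are made up, and the summed column is named x purely to match the output of aggregate().

```r
library(dplyr)

# Toy stand-in for pop_data; values are made up for illustration.
toy <- data.frame(
  PA   = c("Bedok", "Bedok", "Bedok", "Bedok", "Bedok", "Seletar", "Seletar"),
  AG   = c("0_to_4", "0_to_4", "0_to_4", "10_to_14", "10_to_14", "0_to_4", "0_to_4"),
  Sex  = c("Males", "Males", "Females", "Males", "Females", "Males", "Females"),
  Time = 2019,
  Pop  = c(30, 20, 60, 70, 80, 0, 0)
)

toy_cleaned <- toy %>%
  filter(Time == 2019) %>%
  group_by(PA) %>%
  filter(sum(Pop) > 0) %>%                      # drop planning areas with zero residents
  group_by(PA, AG, Sex) %>%                     # replaces the previous grouping
  summarise(x = sum(Pop), .groups = "drop") %>% # one row per PA, age group and sex
  mutate(AGstart = as.numeric(sub("_.*", "", AG)))

print(toy_cleaned)
```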

3. Data Visualisation

3.1 Bar Chart showing Population Count by Planning Area in 2019

The main aim of the bar chart is to show the population count in each planning area in 2019. Refer to the code below to plot the bar chart. (Note: planning areas with a population count of zero have already been omitted.)

ggplot(popcount_PA, aes(x = x, y = reorder(PA, x))) +
  geom_bar(stat = 'identity') +
  # Breaks run from 0 to 300,000 in steps of 50,000, so the labels are 0K, 50K, ..., 300K
  scale_x_continuous(breaks = seq(0, 300000, 50000), labels = paste0(0:6 * 50, "K")) +
  ggtitle("Population Count by PA in 2019", subtitle = "Bedok, Jurong West and Tampines top 3 largest planning areas") +
  labs(x = "Population Count", y = "PA", caption = "Source: Singstat Singapore Residents by Planning Area/Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020")

3.2 Bar Chart showing Population Distribution by Age Group and Planning Area in 2019

This bar chart, unlike the previous bar chart, provides additional detail in terms of the population count by age group for each planning area in 2019.

ggplot(pop_cleaned, aes(x = reorder(AG, AGstart), y = x)) +
  geom_bar(stat = 'identity') +
  scale_y_continuous(breaks = seq(0, 40000, 5000), labels = paste0(0:8 * 5, "K")) +
  labs(y = "Count", x = "Age Group", caption = "Source: Singstat Singapore Residents by Planning Area/Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020") +
  coord_flip() +
  facet_wrap(~PA, ncol = 6) +
  ggtitle("Population Distribution by PA and Age Group in 2019", subtitle = "Punggol & Sengkang top 2 in terms of proportion of residents aged four and below")

3.3 Population Pyramids by Planning Area in 2019

ggplot(pop_cleaned, aes(x = reorder(AG, AGstart), y = x, fill = Sex)) +
  geom_bar(data = subset(pop_cleaned, Sex == 'Females'), stat = 'identity') +
  geom_bar(data = subset(pop_cleaned, Sex == 'Males'), stat = 'identity', aes(y = x * (-1))) +
  scale_y_continuous(breaks = seq(-20000, 20000, 5000), labels = paste0(c(4:0 * 5, 1:4 * 5), "K")) +
  coord_flip() +
  labs(y = "Count", x = "Age Group", caption = "Source: Singstat Singapore Residents by Planning Area/Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020") +
  facet_wrap(~PA, ncol = 6) +
  ggtitle("Population Distribution in PA by Age Group and Sex in 2019")
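
The mirrored-bar idea behind the pyramid can be tried on a small data frame first. The sketch below is a simplified variant under stated assumptions: the data are made up, the column signed is an illustrative name, and male counts are negated in the data rather than in a second geom_bar layer. The label function strips the minus sign so both sides read as positive counts.

```r
library(ggplot2)

# Toy data for a single planning area; values are made up.
toy <- data.frame(
  AG  = rep(c("0_to_4", "5_to_9", "10_to_14"), times = 2),
  Sex = rep(c("Males", "Females"), each = 3),
  x   = c(5000, 6000, 7000, 4800, 5900, 7100)
)

# Starting age of each age group, used to order the axis.
toy$AGstart <- as.numeric(sub("_.*", "", toy$AG))

# Negate male counts so their bars extend to the left of zero.
toy$signed <- ifelse(toy$Sex == "Males", -toy$x, toy$x)

p <- ggplot(toy, aes(x = reorder(AG, AGstart), y = signed, fill = Sex)) +
  geom_col() +
  # Show absolute values in thousands on both sides of the axis.
  scale_y_continuous(labels = function(v) paste0(abs(v) / 1000, "K")) +
  coord_flip() +
  labs(x = "Age Group", y = "Count")

print(p)
```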

Final Visualisation

Observations: