0. Introduction

In this assignment, we are required to build a static visualisation of the population data using the data from singstat.gov.sg.

1. Major data and design challenges

We will describe the various data and the design challenges and mitigations follows:

  • The data are provided in an unpivoted form. This is a useful format to be directly imported into packages such as R for plotting graphs but not very intuitive for the human reader that wants to visually explore the raw numeric data for sense-making. In order to get started with this particular dataset, I imported the .csv into Excel and create PivotCharts to quickly get a sense what data is available before deciding and planning on what I wanted to present to the reader.

  • There is way too much raw data that is provided that can be represented a single static visual. For example, the first and most intuitive chart that comes to mind when talking about population is the population pyramid. Since the data is given across a range of years from 2011 to 2019, it is instinctive to immediately think of plotting the population pyramids across the years. However, that results in the following mess:

The above plot is way too cluttered, contains way too much information that results in a loss of focus, and makes drawing any useful insights practically impossible.

To overcome the above challenges, it is important to ask what are the key stories or message that we want to convey to the reader and extract those portion accordingly. This can then either be translated into a summarised visual that conveys the punchline, or a series of static visuals that tells a story in a coherent flow (something that singstats uses extensively, e.g. here).

The dataset contains quite an extensive list of information that is available, including:

  • area where the people live
  • age
  • sex
  • type of dwelling
  • population size
  • year

Given the above, some of the relevant questions may be:

  • What is the trend of the population growth over time?
  • What are the differences in growth rate between different age groups?
  • How does the population size across the age group differs between male and female?
  • Is there any correlation between age group and the type of dwelling? E.g., the working class may prefer to stay in condo, or the old people may stay in smaller HDB flats, etc.
  • Is there any correlation between age group and the area where they live? E.g., mature estates such as Toa Paoh may have more old people staying around those areas.

Depending on which question above we are answer, we can do some grouping of the data to make the visualisation and understanding of the data more manageable for the reader. Some of the possible strategies are:

  • Collapse the population into distinct age groups, e.g., young, active, old
  • Present only the latest info (filter out all years except for 2019)
  • Convert absolute numbers into percentages where comparisons between groups are involved (more intuitive)

We will try to craft a story out of the data that is available to us, in the style of how singstats presents data as mentioned above. There are various types of graphs that we can use to support the visualisation, such as a stacked area plot, line chart and bar graphs. We will experiment with the various plots to see which are the more useful ones. Some of the possible plots are illustrated as below:

2. Step-by-step flow to generate the visuals in R

In this section, we will present the steps used to generate the visuals that will be used to tell our story in part 3 of the assignment. The actual story will then be told in part 3 itself.

2.1 Import data and pre-processing

We start by importing the tidyverse package which is necessary for the visualisation and data processing.

packages=c('tidyverse', 'scales', 'ggpubr')

for (p in packages){
  if (!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

We import the .csv file as follows:

# read the csv file
pop_data <- read_csv("./respopagesextod2011to2019.csv")

# remove all the understores from the AG field
pop_data$AG <- str_replace_all(pop_data$AG, "_", " ")

Perform data-processing by grouping the various age groups into the 3 categories of young, active and old defined as follows:

  • Young: 0 to 24 years old
  • Active: 25 to 64 years old
  • Old: 65 years and above
agpop_mutated <- pop_data %>%
  spread(AG, Pop) %>%
  mutate(Young = rowSums(.[6:10])) %>%
  mutate(Active = rowSums(.[11:18]))  %>%
  mutate(Old = rowSums(.[19:23]))
  # mutate(TOTAL = rowSums(.[24:26]))
pop_age <- data.frame(agpop_mutated)

# drop the columns of ages that we don't need
pop_age <- pop_age[-c(6:24)]

2.2 Stacked area plot

We use the stacked area plot to illustrate the change in the population size in absolute numbers but separated into the various age groups. The code to plot the graph is given as follows:

# generate the pivoted df
pop_age_long <- 
  pop_age[-c(1:4)] %>% 
  group_by(Time) %>%
  summarise(Young = sum(Young), Active = sum(Active), Old = sum(Old)) %>%
  pivot_longer(
    cols = Young:Old,
    names_to = "AGE",
    values_to = "Count"
  )

# change the order of hte age that is appropriate for the stacked plot
pop_age_long$AGE <- factor(pop_age_long$AGE, c("Old", "Active", "Young"))

temp <- pop_age_long %>% filter(Time==2011)
pop2011 = sum(temp$Count)
temp <- pop_age_long %>% filter(Time==2019)
pop2019 = sum(temp$Count)

# plot the chart
respop_stacked <- 
  ggplot(data=pop_age_long, aes(x=Time, y=Count, fill=AGE, factor)) + 
    geom_area() +
    labs(title = "Resident population from 2011 to 2019",
         caption = "Data source: www.singstat.gov.sg",
         fill = "Age Group") +
    scale_y_continuous(name = "Population",
                       breaks = seq(0, 5e6, 1e6), 
                       labels = paste0(as.character(seq(0, 5, 1)), "M"), 
                       limits = c(0, 4.5e6) ) + 
    scale_x_continuous(name = "Year",
                       breaks = seq(2011,2019, 1)) + 
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5)) +
    annotate("text", 
             x=2011+.1, 
             y=pop2011+.15e6, 
             label=format(round(as.numeric(pop2011/1e6), 2), nsmall=2, big.mark=","),
             size=3) +
    annotate("text", 
             x=2019-.1, 
             y=pop2019+.15e6, 
             label=format(round(as.numeric(pop2019/1e6), 2), nsmall=2, big.mark=","),
             size=3)

respop_stacked

2.3 Line graph

We use the line graph to illustrate the change in the population size of the various age groups expressed as a percent of the starting year, i.e., 2011. The code to plot the graph is given as follows:

# generate the df for the plot
pop_age_line <- 
  pop_age[-c(1:4)] %>% 
  group_by(Time) %>%
  summarise(Young = sum(Young), Active = sum(Active), Old = sum(Old))

pop_age_line$Young <- pop_age_line$Young / pop_age_line$Young[1]
pop_age_line$Active <- pop_age_line$Active / pop_age_line$Active[1]
pop_age_line$Old <- pop_age_line$Old / pop_age_line$Old[1]

pop_age_line <- pop_age_line %>%
  pivot_longer(
    cols = Young:Old,
    names_to = "AGE",
    values_to = "Count"
  )

pop_age_line$AGE <- factor(pop_age_long$AGE, c("Old", "Active", "Young"))

popchange_line <-
  ggplot(data=pop_age_line, aes(x=Time, y=Count, group=AGE, colour=AGE, factor)) + 
    geom_line(size=1) +
    labs(title = "Population change of various age group from 2011 to 2019",
         subtitle = "Change expressed as percentage of 2011's population",
         caption = "Data source: www.singstat.gov.sg",
         col = "Age Group") +
    scale_y_continuous(name = "Percent change",
                       labels = percent,
                       breaks = seq(0, 2, 0.1)) + 
    scale_x_continuous(name = "Year",
                       breaks = seq(2011, 2019, 1)) + 
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))

popchange_line

### 2.4 Bar graph

We use the bar graph to illustrate the composition of the various age groups in the different districts for 2019. As there are many districts, we will restrict the graph to only the top and bottom 5 districts to make the information useful and prevent clutter. The code to plot the graph is given as follows:

# generate the df for the plot
pop_age_area <- 
  pop_age[-c(2:4)] %>% 
  filter(Time==2019) %>%
  group_by(PA) %>%
  summarise(Young = sum(Young), Active = sum(Active), Old = sum(Old)) %>%
  mutate(Total = rowSums(.[2:4])) %>%
  filter(Total > 10000)  # ignore areas with less than 10K pax

# calculate the percent of people against the population
pop_age_area$Young <- pop_age_area$Young / pop_age_area$Total
pop_age_area$Active <- pop_age_area$Active / pop_age_area$Total
pop_age_area$Old <- pop_age_area$Old / pop_age_area$Total


# define variable for top n
topn <- 5

old_plot <-
  pop_age_area %>%
    arrange(desc(Old)) %>%
    slice( c(1:topn,  (nrow(.)-topn+1):(nrow(.)))  )%>%
    ggplot(aes(x=reorder(PA, -Old), y=Old)) +
      geom_bar(stat = "identity", position = position_dodge(), fill="red") +
      labs(title = paste0("Top and bottom ", as.character(topn), " of Old people by percentage in district (2019)")) +
      scale_y_continuous(name = "% Population in district",
                         labels = percent,
                         breaks = seq(0, 2, 0.1)) + 

      theme_bw() +
      theme(plot.title = element_text(hjust = 0.5),
            axis.text.x = element_text(angle=45, hjust=1),
            axis.title.x = element_blank())

active_plot <-
  pop_age_area %>%
    arrange(desc(Active)) %>%
    slice( c(1:topn,  (nrow(.)-topn+1):(nrow(.)))  )%>%
    ggplot(aes(x=reorder(PA, -Active), y=Active)) +
      geom_bar(stat = "identity", position = position_dodge(), fill="green") +
      labs(title = paste0("Top and bottom ", as.character(topn), " of Active people by percentage in district (2019)")) +
      scale_y_continuous(name = "% Population in district",
                         labels = percent,
                         breaks = seq(0, 2, 0.1)) + 

      theme_bw() +
      theme(plot.title = element_text(hjust = 0.5),
            axis.text.x = element_text(angle=45, hjust=1),
            axis.title.x = element_blank())

young_plot <-
  pop_age_area %>%
    arrange(desc(Young)) %>%
    slice( c(1:topn,  (nrow(.)-topn+1):(nrow(.)))  )%>%
    ggplot(aes(x=reorder(PA, -Young), y=Young)) +
      geom_bar(stat = "identity", position = position_dodge(), fill="blue") +
      labs(title = paste0("Top and bottom ", as.character(topn), " of Young people by percentage in district (2019)"),
           caption = "Data source: www.singstat.gov.sg") +
      scale_y_continuous(name = "% Population in district",
                         labels = percent,
                         breaks = seq(0, 2, 0.1)) + 

      theme_bw() +
      theme(plot.title = element_text(hjust = 0.5),
            axis.text.x = element_text(angle=45, hjust=1),
            axis.title.x = element_blank())

popdis_bar <-
  ggarrange(old_plot, active_plot, young_plot,
            ncol = 1, nrow = 3)

popdis_bar

3.0 A Look at Our Resident Population from 2011 to 2019

Let’s take a look at how our population has changed from 2011 to 2019.

From the graph, we can see that the growth of our population has been fairly linear, from a population of 3.78M in 2011 to 4.01M in 2019. Interestingly, the bulk of this increase is attributed to growth in the Old group, while the sum of the Active and Young group is almost constant over the same period. We can take a deeper look into this as follows:

From the graph, it is obvious that the size of the Young and Active groups remained almost the same over the 9 year period (at -4% and +3% growth respectively). However, the Old group grew a whopping 65%, contributing to almost the whole of the population growth from 2011 to 2019.

A quick check on singstats for the relevant data shows that the data indeed corroborate with observed trend in the increase of the Old group.

Year 2011 2012 2013 2014 2015 2016 2017 2018 2019 Total
Births 36,178 38,641 35,681 37,967 37,861 36,875 35,444 35,040 35,330 329,017
Deaths 16,887 17,273 17,810 18,237 18,640 18,856 19,763 20,095 20,288 167,849
Difference 19,291 21,368 17,871 19,730 19,221 18,019 15,681 14,945 15,042 161,168
Life Expectancy 81.9 82.1 82.4 82.6 82.9 83 83.2 83.4 83.6
(Source singstats.gov.sg)

We can see that the net increase in our resident population is largely contributed by difference between the annual birth and death rates, adding up to an increase of 161K to the population across the 9 years. The life expectancy across the same period showed a corresponding increase as well.

Switching gear, we take a look at our various estates and see if there are any trends of where people prefer to stay for the various age groups. We plot the top and bottom 5 areas for each of the areas using the population data from 2019 for our analysis.

The trend for the Old age group revealed little surprise, where we expect that our mature estates in areas such as Outram, Bukit Merah, Rochor, Ang Mo Kio and Toa Payoh to have a much higher proportion of old people. Conversely, we find our relatively younger estates such as Punggol, Sengkang, Woodlands, Pasir Ris to be more heavily populated with the Young and Active group. These areas correspond to the new towns around the outskirts of the central areas of Singapore.