ISSS608 Assignment 4

0. Introduction

In this assignment, we are required to build a static visualisation of the population data using the data from singstat.gov.sg.

1. Major data and design challenges

We will describe the various data and the design challenges and mitigations follows:

The data are provided in an unpivoted form. This is a useful format to be directly imported into packages such as R for plotting graphs but not very intuitive for the human reader that wants to visually explore the raw numeric data for sense-making. In order to get started with this particular dataset, I imported the .csv into Excel and create PivotCharts to quickly get a sense what data is available before deciding and planning on what I wanted to present to the reader.
There is way too much raw data that is provided that can be represented a single static visual. For example, the first and most intuitive chart that comes to mind when talking about population is the population pyramid. Since the data is given across a range of years from 2011 to 2019, it is instinctive to immediately think of plotting the population pyramids across the years. However, that results in the following mess:

The above plot is way too cluttered, contains way too much information that results in a loss of focus, and makes drawing any useful insights practically impossible.

To overcome the above challenges, it is important to ask what are the key stories or message that we want to convey to the reader and extract those portion accordingly. This can then either be translated into a summarised visual that conveys the punchline, or a series of static visuals that tells a story in a coherent flow (something that singstats uses extensively, e.g. here).

The dataset contains quite an extensive list of information that is available, including:

area where the people live
age
sex
type of dwelling
population size
year

Given the above, some of the relevant questions may be:

What is the trend of the population growth over time?
What are the differences in growth rate between different age groups?
How does the population size across the age group differs between male and female?
Is there any correlation between age group and the type of dwelling? E.g., the working class may prefer to stay in condo, or the old people may stay in smaller HDB flats, etc.
Is there any correlation between age group and the area where they live? E.g., mature estates such as Toa Paoh may have more old people staying around those areas.

Depending on which question above we are answer, we can do some grouping of the data to make the visualisation and understanding of the data more manageable for the reader. Some of the possible strategies are:

Collapse the population into distinct age groups, e.g., young, active, old
Present only the latest info (filter out all years except for 2019)
Convert absolute numbers into percentages where comparisons between groups are involved (more intuitive)

We will try to craft a story out of the data that is available to us, in the style of how singstats presents data as mentioned above. There are various types of graphs that we can use to support the visualisation, such as a stacked area plot, line chart and bar graphs. We will experiment with the various plots to see which are the more useful ones. Some of the possible plots are illustrated as below:

2. Step-by-step flow to generate the visuals in R

In this section, we will present the steps used to generate the visuals that will be used to tell our story in part 3 of the assignment. The actual story will then be told in part 3 itself.

2.1 Import data and pre-processing

We start by importing the tidyverse package which is necessary for the visualisation and data processing.

packages=c('tidyverse', 'scales', 'ggpubr')

for (p in packages){
  if (!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

We import the .csv file as follows:

# read the csv file
pop_data <- read_csv("./respopagesextod2011to2019.csv")

# remove all the understores from the AG field
pop_data$AG <- str_replace_all(pop_data$AG, "_", " ")

Perform data-processing by grouping the various age groups into the 3 categories of young, active and old defined as follows:

Young: 0 to 24 years old
Active: 25 to 64 years old
Old: 65 years and above

agpop_mutated <- pop_data %>%
  spread(AG, Pop) %>%
  mutate(Young = rowSums(.[6:10])) %>%
  mutate(Active = rowSums(.[11:18]))  %>%
  mutate(Old = rowSums(.[19:23]))
  # mutate(TOTAL = rowSums(.[24:26]))
pop_age <- data.frame(agpop_mutated)

# drop the columns of ages that we don't need
pop_age <- pop_age[-c(6:24)]

2.2 Stacked area plot

We use the stacked area plot to illustrate the change in the population size in absolute numbers but separated into the various age groups. The code to plot the graph is given as follows:

# generate the pivoted df
pop_age_long <- 
  pop_age[-c(1:4)] %>% 
  group_by(Time) %>%
  summarise(Young = sum(Young), Active = sum(Active), Old = sum(Old)) %>%
  pivot_longer(
    cols = Young:Old,
    names_to = "AGE",
    values_to = "Count"
  )

# change the order of hte age that is appropriate for the stacked plot
pop_age_long$AGE <- factor(pop_age_long$AGE, c("Old", "Active", "Young"))

temp <- pop_age_long %>% filter(Time==2011)
pop2011 = sum(temp$Count)
temp <- pop_age_long %>% filter(Time==2019)
pop2019 = sum(temp$Count)

# plot the chart
respop_stacked <- 
  ggplot(data=pop_age_long, aes(x=Time, y=Count, fill=AGE, factor)) + 
    geom_area() +
    labs(title = "Resident population from 2011 to 2019",
         caption = "Data source: www.singstat.gov.sg",
         fill = "Age Group") +
    scale_y_continuous(name = "Population",
                       breaks = seq(0, 5e6, 1e6), 
                       labels = paste0(as.character(seq(0, 5, 1)), "M"), 
                       limits = c(0, 4.5e6) ) + 
    scale_x_continuous(name = "Year",
                       breaks = seq(2011,2019, 1)) + 
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5)) +
    annotate("text", 
             x=2011+.1, 
             y=pop2011+.15e6, 
             label=format(round(as.numeric(pop2011/1e6), 2), nsmall=2, big.mark=","),
             size=3) +
    annotate("text", 
             x=2019-.1, 
             y=pop2019+.15e6, 
             label=format(round(as.numeric(pop2019/1e6), 2), nsmall=2, big.mark=","),
             size=3)

respop_stacked

2.3 Line graph

We use the line graph to illustrate the change in the population size of the various age groups expressed as a percent of the starting year, i.e., 2011. The code to plot the graph is given as follows:

# generate the df for the plot
pop_age_line <- 
  pop_age[-c(1:4)] %>% 
  group_by(Time) %>%
  summarise(Young = sum(Young), Active = sum(Active), Old = sum(Old))

pop_age_line$Young <- pop_age_line$Young / pop_age_line$Young[1]
pop_age_line$Active <- pop_age_line$Active / pop_age_line$Active[1]
pop_age_line$Old <- pop_age_line$Old / pop_age_line$Old[1]

pop_age_line <- pop_age_line %>%
  pivot_longer(
    cols = Young:Old,
    names_to = "AGE",
    values_to = "Count"
  )

pop_age_line$AGE <- factor(pop_age_long$AGE, c("Old", "Active", "Young"))

popchange_line <-
  ggplot(data=pop_age_line, aes(x=Time, y=Count, group=AGE, colour=AGE, factor)) + 
    geom_line(size=1) +
    labs(title = "Population change of various age group from 2011 to 2019",
         subtitle = "Change expressed as percentage of 2011's population",
         caption = "Data source: www.singstat.gov.sg",
         col = "Age Group") +
    scale_y_continuous(name = "Percent change",
                       labels = percent,
                       breaks = seq(0, 2, 0.1)) + 
    scale_x_continuous(name = "Year",
                       breaks = seq(2011, 2019, 1)) + 
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5))

popchange_line

### 2.4 Bar graph

We use the bar graph to illustrate the composition of the various age groups in the different districts for 2019. As there are many districts, we will restrict the graph to only the top and bottom 5 districts to make the information useful and prevent clutter. The code to plot the graph is given as follows:

# generate the df for the plot
pop_age_area <- 
  pop_age[-c(2:4)] %>% 
  filter(Time==2019) %>%
  group_by(PA) %>%
  summarise(Young = sum(Young), Active = sum(Active), Old = sum(Old)) %>%
  mutate(Total = rowSums(.[2:4])) %>%
  filter(Total > 10000)  # ignore areas with less than 10K pax

# calculate the percent of people against the population
pop_age_area$Young <- pop_age_area$Young / pop_age_area$Total
pop_age_area$Active <- pop_age_area$Active / pop_age_area$Total
pop_age_area$Old <- pop_age_area$Old / pop_age_area$Total


# define variable for top n
topn <- 5

old_plot <-
  pop_age_area %>%
    arrange(desc(Old)) %>%
    slice( c(1:topn,  (nrow(.)-topn+1):(nrow(.)))  )%>%
    ggplot(aes(x=reorder(PA, -Old), y=Old)) +
      geom_bar(stat = "identity", position = position_dodge(), fill="red") +
      labs(title = paste0("Top and bottom ", as.character(topn), " of Old people by percentage in district (2019)")) +
      scale_y_continuous(name = "% Population in district",
                         labels = percent,
                         breaks = seq(0, 2, 0.1)) + 

      theme_bw() +
      theme(plot.title = element_text(hjust = 0.5),
            axis.text.x = element_text(angle=45, hjust=1),
            axis.title.x = element_blank())

active_plot <-
  pop_age_area %>%
    arrange(desc(Active)) %>%
    slice( c(1:topn,  (nrow(.)-topn+1):(nrow(.)))  )%>%
    ggplot(aes(x=reorder(PA, -Active), y=Active)) +
      geom_bar(stat = "identity", position = position_dodge(), fill="green") +
      labs(title = paste0("Top and bottom ", as.character(topn), " of Active people by percentage in district (2019)")) +
      scale_y_continuous(name = "% Population in district",
                         labels = percent,
                         breaks = seq(0, 2, 0.1)) + 

      theme_bw() +
      theme(plot.title = element_text(hjust = 0.5),
            axis.text.x = element_text(angle=45, hjust=1),
            axis.title.x = element_blank())

young_plot <-
  pop_age_area %>%
    arrange(desc(Young)) %>%
    slice( c(1:topn,  (nrow(.)-topn+1):(nrow(.)))  )%>%
    ggplot(aes(x=reorder(PA, -Young), y=Young)) +
      geom_bar(stat = "identity", position = position_dodge(), fill="blue") +
      labs(title = paste0("Top and bottom ", as.character(topn), " of Young people by percentage in district (2019)"),
           caption = "Data source: www.singstat.gov.sg") +
      scale_y_continuous(name = "% Population in district",
                         labels = percent,
                         breaks = seq(0, 2, 0.1)) + 

      theme_bw() +
      theme(plot.title = element_text(hjust = 0.5),
            axis.text.x = element_text(angle=45, hjust=1),
            axis.title.x = element_blank())

popdis_bar <-
  ggarrange(old_plot, active_plot, young_plot,
            ncol = 1, nrow = 3)

popdis_bar

3.0 A Look at Our Resident Population from 2011 to 2019

Let’s take a look at how our population has changed from 2011 to 2019.

From the graph, we can see that the growth of our population has been fairly linear, from a population of 3.78M in 2011 to 4.01M in 2019. Interestingly, the bulk of this increase is attributed to growth in the Old group, while the sum of the Active and Young group is almost constant over the same period. We can take a deeper look into this as follows:

From the graph, it is obvious that the size of the Young and Active groups remained almost the same over the 9 year period (at -4% and +3% growth respectively). However, the Old group grew a whopping 65%, contributing to almost the whole of the population growth from 2011 to 2019.

A quick check on singstats for the relevant data shows that the data indeed corroborate with observed trend in the increase of the Old group.

Year	2011	2012	2013	2014	2015	2016	2017	2018	2019	Total
Births	36,178	38,641	35,681	37,967	37,861	36,875	35,444	35,040	35,330	329,017
Deaths	16,887	17,273	17,810	18,237	18,640	18,856	19,763	20,095	20,288	167,849
Difference	19,291	21,368	17,871	19,730	19,221	18,019	15,681	14,945	15,042	161,168
Life Expectancy	81.9	82.1	82.4	82.6	82.9	83	83.2	83.4	83.6

(Source singstats.gov.sg)

We can see that the net increase in our resident population is largely contributed by difference between the annual birth and death rates, adding up to an increase of 161K to the population across the 9 years. The life expectancy across the same period showed a corresponding increase as well.

Switching gear, we take a look at our various estates and see if there are any trends of where people prefer to stay for the various age groups. We plot the top and bottom 5 areas for each of the areas using the population data from 2019 for our analysis.

The trend for the Old age group revealed little surprise, where we expect that our mature estates in areas such as Outram, Bukit Merah, Rochor, Ang Mo Kio and Toa Payoh to have a much higher proportion of old people. Conversely, we find our relatively younger estates such as Punggol, Sengkang, Woodlands, Pasir Ris to be more heavily populated with the Young and Active group. These areas correspond to the new towns around the outskirts of the central areas of Singapore.