In this assignment, we are required to build a static visualisation of the population data using the data from singstat.gov.sg.
We will describe the various data and the design challenges and mitigations follows:
The data are provided in an unpivoted form. This is a useful format to be directly imported into packages such as R for plotting graphs but not very intuitive for the human reader that wants to visually explore the raw numeric data for sense-making. In order to get started with this particular dataset, I imported the .csv into Excel and create PivotCharts to quickly get a sense what data is available before deciding and planning on what I wanted to present to the reader.
There is way too much raw data that is provided that can be represented a single static visual. For example, the first and most intuitive chart that comes to mind when talking about population is the population pyramid. Since the data is given across a range of years from 2011 to 2019, it is instinctive to immediately think of plotting the population pyramids across the years. However, that results in the following mess:
The above plot is way too cluttered, contains way too much information that results in a loss of focus, and makes drawing any useful insights practically impossible.
To overcome the above challenges, it is important to ask what are the key stories or message that we want to convey to the reader and extract those portion accordingly. This can then either be translated into a summarised visual that conveys the punchline, or a series of static visuals that tells a story in a coherent flow (something that singstats uses extensively, e.g. here).
The dataset contains quite an extensive list of information that is available, including:
Given the above, some of the relevant questions may be:
Depending on which question above we are answer, we can do some grouping of the data to make the visualisation and understanding of the data more manageable for the reader. Some of the possible strategies are:
We will try to craft a story out of the data that is available to us, in the style of how singstats presents data as mentioned above. There are various types of graphs that we can use to support the visualisation, such as a stacked area plot, line chart and bar graphs. We will experiment with the various plots to see which are the more useful ones. Some of the possible plots are illustrated as below:
In this section, we will present the steps used to generate the visuals that will be used to tell our story in part 3 of the assignment. The actual story will then be told in part 3 itself.
We start by importing the tidyverse package which is necessary for the visualisation and data processing.
packages=c('tidyverse', 'scales', 'ggpubr')
for (p in packages){
if (!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
We import the .csv file as follows:
# read the csv file
pop_data <- read_csv("./respopagesextod2011to2019.csv")
# remove all the understores from the AG field
pop_data$AG <- str_replace_all(pop_data$AG, "_", " ")
Perform data-processing by grouping the various age groups into the 3 categories of young, active and old defined as follows:
agpop_mutated <- pop_data %>%
spread(AG, Pop) %>%
mutate(Young = rowSums(.[6:10])) %>%
mutate(Active = rowSums(.[11:18])) %>%
mutate(Old = rowSums(.[19:23]))
# mutate(TOTAL = rowSums(.[24:26]))
pop_age <- data.frame(agpop_mutated)
# drop the columns of ages that we don't need
pop_age <- pop_age[-c(6:24)]
We use the stacked area plot to illustrate the change in the population size in absolute numbers but separated into the various age groups. The code to plot the graph is given as follows:
# generate the pivoted df
pop_age_long <-
pop_age[-c(1:4)] %>%
group_by(Time) %>%
summarise(Young = sum(Young), Active = sum(Active), Old = sum(Old)) %>%
pivot_longer(
cols = Young:Old,
names_to = "AGE",
values_to = "Count"
)
# change the order of hte age that is appropriate for the stacked plot
pop_age_long$AGE <- factor(pop_age_long$AGE, c("Old", "Active", "Young"))
temp <- pop_age_long %>% filter(Time==2011)
pop2011 = sum(temp$Count)
temp <- pop_age_long %>% filter(Time==2019)
pop2019 = sum(temp$Count)
# plot the chart
respop_stacked <-
ggplot(data=pop_age_long, aes(x=Time, y=Count, fill=AGE, factor)) +
geom_area() +
labs(title = "Resident population from 2011 to 2019",
caption = "Data source: www.singstat.gov.sg",
fill = "Age Group") +
scale_y_continuous(name = "Population",
breaks = seq(0, 5e6, 1e6),
labels = paste0(as.character(seq(0, 5, 1)), "M"),
limits = c(0, 4.5e6) ) +
scale_x_continuous(name = "Year",
breaks = seq(2011,2019, 1)) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
annotate("text",
x=2011+.1,
y=pop2011+.15e6,
label=format(round(as.numeric(pop2011/1e6), 2), nsmall=2, big.mark=","),
size=3) +
annotate("text",
x=2019-.1,
y=pop2019+.15e6,
label=format(round(as.numeric(pop2019/1e6), 2), nsmall=2, big.mark=","),
size=3)
respop_stacked
We use the line graph to illustrate the change in the population size of the various age groups expressed as a percent of the starting year, i.e., 2011. The code to plot the graph is given as follows:
# generate the df for the plot
pop_age_line <-
pop_age[-c(1:4)] %>%
group_by(Time) %>%
summarise(Young = sum(Young), Active = sum(Active), Old = sum(Old))
pop_age_line$Young <- pop_age_line$Young / pop_age_line$Young[1]
pop_age_line$Active <- pop_age_line$Active / pop_age_line$Active[1]
pop_age_line$Old <- pop_age_line$Old / pop_age_line$Old[1]
pop_age_line <- pop_age_line %>%
pivot_longer(
cols = Young:Old,
names_to = "AGE",
values_to = "Count"
)
pop_age_line$AGE <- factor(pop_age_long$AGE, c("Old", "Active", "Young"))
popchange_line <-
ggplot(data=pop_age_line, aes(x=Time, y=Count, group=AGE, colour=AGE, factor)) +
geom_line(size=1) +
labs(title = "Population change of various age group from 2011 to 2019",
subtitle = "Change expressed as percentage of 2011's population",
caption = "Data source: www.singstat.gov.sg",
col = "Age Group") +
scale_y_continuous(name = "Percent change",
labels = percent,
breaks = seq(0, 2, 0.1)) +
scale_x_continuous(name = "Year",
breaks = seq(2011, 2019, 1)) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
popchange_line
### 2.4 Bar graph
We use the bar graph to illustrate the composition of the various age groups in the different districts for 2019. As there are many districts, we will restrict the graph to only the top and bottom 5 districts to make the information useful and prevent clutter. The code to plot the graph is given as follows:
# generate the df for the plot
pop_age_area <-
pop_age[-c(2:4)] %>%
filter(Time==2019) %>%
group_by(PA) %>%
summarise(Young = sum(Young), Active = sum(Active), Old = sum(Old)) %>%
mutate(Total = rowSums(.[2:4])) %>%
filter(Total > 10000) # ignore areas with less than 10K pax
# calculate the percent of people against the population
pop_age_area$Young <- pop_age_area$Young / pop_age_area$Total
pop_age_area$Active <- pop_age_area$Active / pop_age_area$Total
pop_age_area$Old <- pop_age_area$Old / pop_age_area$Total
# define variable for top n
topn <- 5
old_plot <-
pop_age_area %>%
arrange(desc(Old)) %>%
slice( c(1:topn, (nrow(.)-topn+1):(nrow(.))) )%>%
ggplot(aes(x=reorder(PA, -Old), y=Old)) +
geom_bar(stat = "identity", position = position_dodge(), fill="red") +
labs(title = paste0("Top and bottom ", as.character(topn), " of Old people by percentage in district (2019)")) +
scale_y_continuous(name = "% Population in district",
labels = percent,
breaks = seq(0, 2, 0.1)) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle=45, hjust=1),
axis.title.x = element_blank())
active_plot <-
pop_age_area %>%
arrange(desc(Active)) %>%
slice( c(1:topn, (nrow(.)-topn+1):(nrow(.))) )%>%
ggplot(aes(x=reorder(PA, -Active), y=Active)) +
geom_bar(stat = "identity", position = position_dodge(), fill="green") +
labs(title = paste0("Top and bottom ", as.character(topn), " of Active people by percentage in district (2019)")) +
scale_y_continuous(name = "% Population in district",
labels = percent,
breaks = seq(0, 2, 0.1)) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle=45, hjust=1),
axis.title.x = element_blank())
young_plot <-
pop_age_area %>%
arrange(desc(Young)) %>%
slice( c(1:topn, (nrow(.)-topn+1):(nrow(.))) )%>%
ggplot(aes(x=reorder(PA, -Young), y=Young)) +
geom_bar(stat = "identity", position = position_dodge(), fill="blue") +
labs(title = paste0("Top and bottom ", as.character(topn), " of Young people by percentage in district (2019)"),
caption = "Data source: www.singstat.gov.sg") +
scale_y_continuous(name = "% Population in district",
labels = percent,
breaks = seq(0, 2, 0.1)) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle=45, hjust=1),
axis.title.x = element_blank())
popdis_bar <-
ggarrange(old_plot, active_plot, young_plot,
ncol = 1, nrow = 3)
popdis_bar
Let’s take a look at how our population has changed from 2011 to 2019.
From the graph, we can see that the growth of our population has been fairly linear, from a population of 3.78M in 2011 to 4.01M in 2019. Interestingly, the bulk of this increase is attributed to growth in the Old group, while the sum of the Active and Young group is almost constant over the same period. We can take a deeper look into this as follows:
From the graph, it is obvious that the size of the Young and Active groups remained almost the same over the 9 year period (at -4% and +3% growth respectively). However, the Old group grew a whopping 65%, contributing to almost the whole of the population growth from 2011 to 2019.
A quick check on singstats for the relevant data shows that the data indeed corroborate with observed trend in the increase of the Old group.
| Year | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Births | 36,178 | 38,641 | 35,681 | 37,967 | 37,861 | 36,875 | 35,444 | 35,040 | 35,330 | 329,017 |
| Deaths | 16,887 | 17,273 | 17,810 | 18,237 | 18,640 | 18,856 | 19,763 | 20,095 | 20,288 | 167,849 |
| Difference | 19,291 | 21,368 | 17,871 | 19,730 | 19,221 | 18,019 | 15,681 | 14,945 | 15,042 | 161,168 |
| Life Expectancy | 81.9 | 82.1 | 82.4 | 82.6 | 82.9 | 83 | 83.2 | 83.4 | 83.6 |
We can see that the net increase in our resident population is largely contributed by difference between the annual birth and death rates, adding up to an increase of 161K to the population across the 9 years. The life expectancy across the same period showed a corresponding increase as well.
Switching gear, we take a look at our various estates and see if there are any trends of where people prefer to stay for the various age groups. We plot the top and bottom 5 areas for each of the areas using the population data from 2019 for our analysis.
The trend for the Old age group revealed little surprise, where we expect that our mature estates in areas such as Outram, Bukit Merah, Rochor, Ang Mo Kio and Toa Payoh to have a much higher proportion of old people. Conversely, we find our relatively younger estates such as Punggol, Sengkang, Woodlands, Pasir Ris to be more heavily populated with the Young and Active group. These areas correspond to the new towns around the outskirts of the central areas of Singapore.