The dataset has been sourced from the Department of Statistics Singapore (https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data) and covers Singapore’s Resident Population by Planning Area, Subzone, Age Group, Sex and Type of Dwelling from 2011 to 2019.
1.1 Population demographics can be analyzed along several dimensions (e.g. Planning Area, Age Group, Sex, Time). Keeping the data visualization static is therefore a challenge in itself: an interactive one would let the user select one or more dimensions and see their effect on population with a few clicks. The alternative is to produce separate plots for different dimensions so that the effect of each dimension on population can be understood on its own.
1.2 Another challenge is the high cardinality of the data - there are 55 planning areas and 19 age groups. This makes it difficult to detect overall trends and make comparisons. To reduce the number of categories, external data on Region (Central, North, North East, West, East) was sought and combined with the original data using the INDEX/MATCH functions in Excel. Similarly, the age groups were aggregated into the following three categories:
0-24 Young
25-64 Economically Active
65+ Aged
This way of grouping the data will also allow for easier comparisons with the use of the facet grid feature in ggplot2.
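For readers who would rather keep the whole workflow in R, the Excel INDEX/MATCH lookup described in 1.2 could be sketched as a join. This is a minimal sketch and not the workflow used in this post: the region_lookup table and the toy population rows are hypothetical, and a real lookup would need one row for each of the 55 planning areas.

```r
library(dplyr)

# Hypothetical lookup table: one row per planning area (only a few shown)
region_lookup <- tibble(
  PA     = c("Ang Mo Kio", "Bedok", "Bishan", "Jurong West", "Woodlands"),
  Region = c("North East", "East", "Central", "West", "North")
)

# Toy stand-in for the population data (values are made up)
pop <- tibble(
  PA  = c("Bedok", "Woodlands", "Bishan", "Bedok"),
  Pop = c(100, 200, 150, 50)
)

# left_join() keeps every population row and attaches the matching Region,
# which is roughly what INDEX/MATCH does row by row in Excel
pop_with_region <- pop %>%
  left_join(region_lookup, by = "PA")

# Count of rows whose planning area was not found in the lookup table
missing_regions <- sum(is.na(pop_with_region$Region))
```

A non-zero missing_regions would point to planning areas whose names do not match between the two tables.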
1.3 The original data needs to be cleaned and transformed for each visualization - it is not in a form that can be plotted directly. Each visualization requires only certain columns, so a new dataframe with just those columns is created for each one. The data also needs to be grouped and aggregated because it is in a tall format, with one row per combination of Planning Area, Subzone, Age Group, Sex, Type of Dwelling and year. To make these changes, the R dplyr package is used: the group_by() and summarise() functions group the rows and sum Population.
The sketched visualizations display relationships between Population, Age Groups, Sex, Time, Regions and Planning Areas.
To begin with, all necessary packages, including tidyverse and CGPfunctions, are installed (if missing) and loaded into the R environment.
# Packages needed for the analysis
packages <- c('tidyverse', 'CGPfunctions')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Next, the data is imported into R using the read_csv function. As previously mentioned, additional data on Region was appended to the original dataset using Excel so no further joins are necessary.
df <- read_csv("respopagesextod2011to2019.csv")
To create the first visualization, the variable Time is converted to a character (categorical) variable using the mutate function. Only the variables presented in the visualization are selected - the dataframe is then grouped by Time and Sex, and Population values are summed.
scatter1 <- df %>%
  mutate(Time = as.character(Time)) %>%  # treat year as categorical
  select(Time, Sex, Pop) %>%
  group_by(Time, Sex) %>%
  summarise(Total_Pop = sum(Pop))
Once the appropriate dataframe is created, ggplot2 is used to create a scatter plot to show the increase in population over time. It is interesting to observe that the female population is larger than the male population and appears to be increasing at a faster rate.
ggplot(scatter1, aes(x = Time, y = Total_Pop/1000, color = Sex)) +
  geom_point(position = "jitter", size = 3) +
  labs(title = "Figure 1: Singapore Resident Population, 2011-2019",
       x = "Year", y = "Population (in 000s)",
       caption = "Source: Singapore Department of Statistics") +
  theme_classic() +
  scale_color_manual(values = c("#d23f67", "#505050")) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(hjust = 1, face = "italic")
  )
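The observation that one group is growing faster than the other is easier to verify numerically than by eye. The following is a minimal sketch on made-up totals (the toy dataframe and its numbers are hypothetical, not the SingStat figures); it computes the percentage change between the first and last year for each sex.

```r
library(dplyr)

# Toy totals by year and sex (made-up numbers for illustration only)
toy <- tibble(
  Time      = c("2011", "2011", "2019", "2019"),
  Sex       = c("Females", "Males", "Females", "Males"),
  Total_Pop = c(1000, 980, 1100, 1050)
)

# Percentage change between the first and last year, per sex
growth <- toy %>%
  arrange(Sex, Time) %>%
  group_by(Sex) %>%
  summarise(
    change_pct = 100 * (last(Total_Pop) - first(Total_Pop)) / first(Total_Pop)
  )

growth
```

The same pipeline should work on a dataframe shaped like scatter1, since it only relies on the Time, Sex and Total_Pop columns.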
The second visualization is an age-sex pyramid that shows the breakdown of Singapore’s population by Age Group and Sex in 2019. As with the first visualization, a new dataframe called pyramid is created: Time is converted to a categorical variable, the data is filtered to include only 2019, the dataframe is grouped by Age Group and Sex, and Population is summed. The age groups are then put in ascending order of age using the mutate and arrange functions.
pyramid <- df %>%
  mutate(Time = as.character(Time)) %>%
  filter(Time == "2019") %>%
  select(AG, Sex, Pop) %>%
  group_by(AG, Sex) %>%
  summarise(Total_Pop = sum(Pop)) %>%
  ungroup() %>%  # drop the remaining grouping so that AG can be modified
  mutate(AG = factor(AG, levels = c("0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24",
                                    "25_to_29", "30_to_34", "35_to_39", "40_to_44", "45_to_49",
                                    "50_to_54", "55_to_59", "60_to_64", "65_to_69", "70_to_74",
                                    "75_to_79", "80_to_84", "85_to_89", "90_and_over"))) %>%
  arrange(AG)
To construct the age-sex pyramid, two sets of bars are drawn in opposite directions. The ifelse() function is used to convert Male Population values to negative numbers, and the coord_flip() function makes the bars horizontal. The limits of the horizontal axis are set symmetrically so that they are equal for Males and Females, which makes comparisons easier. The bars are filled with color on the basis of Sex, and labels are included to show the Population values clearly. Lastly, the theme is adjusted to remove grid lines and center the title.
# Make Male values negative so the two sets of bars extend in opposite directions
pyramid$Total_Pop <- with(pyramid, ifelse(Sex == "Males", -Total_Pop/1000, Total_Pop/1000))
ggplot(pyramid, aes(x = AG, y = Total_Pop, fill = Sex)) +
  geom_bar(stat = "identity") +
  # Symmetric limits based on the largest bar on either side
  scale_y_continuous(labels = abs, limits = max(abs(pyramid$Total_Pop)) * c(-1, 1) * 1.1) +
  scale_fill_manual(values = c("#d23f67", "#505050")) +
  coord_flip() +
  labs(title = "Figure 2: Singapore's Age-Sex Pyramid, 2019",
       x = "Age Group", y = "Population (in 000s)",
       caption = "Source: Singapore Department of Statistics") +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.caption = element_text(hjust = 1, face = "italic"),
    plot.title = element_text(hjust = 0.5, size = 16)
  ) +
  # Place labels just outside the end of each bar
  geom_text(aes(label = abs(Total_Pop), hjust = ifelse(Sex == "Males", 1.1, -0.1)))
As seen above, the distribution of the Male and Female Population is generally similar across the age groups. A noteworthy observation is that in the younger age groups (0 to 24), the Male Population is greater than the Female Population. However, the Female Population exceeds the Male Population in every age group from 25 onwards.
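The point at which the balance tips from Males to Females can be read off more reliably from a ratio than from bar lengths. The following is a minimal sketch on made-up counts (toy_pyramid and its values are hypothetical); it computes Males per 100 Females for each age group and uses abs() so that it would also work after Male values have been made negative for plotting.

```r
library(dplyr)

# Toy counts by age group and sex (made-up numbers for illustration only)
toy_pyramid <- tibble(
  AG        = c("0_to_4", "0_to_4", "65_to_69", "65_to_69"),
  Sex       = c("Males", "Females", "Males", "Females"),
  Total_Pop = c(105, 100, 90, 100)
)

# Males per 100 Females in each age group;
# values above 100 mean Males outnumber Females
sex_ratio <- toy_pyramid %>%
  group_by(AG) %>%
  summarise(
    males_per_100_females =
      100 * sum(abs(Total_Pop[Sex == "Males"])) / sum(abs(Total_Pop[Sex == "Females"]))
  )

sex_ratio
```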
The third visualization introduces another dimension - Planning Area. To create this visualization, a new dataframe called bubble is created with Planning Area, Sex, and Population as the columns of interest. The data is cleaned and transformed in the same way as for the previous visualizations, using the filter, group_by and summarise functions.
bubble <- df %>%
  mutate(Time = as.character(Time)) %>%
  filter(Time == "2019") %>%
  select(PA, Sex, Pop) %>%
  group_by(PA, Sex) %>%
  summarise(Total_Pop = sum(Pop))
A bubble plot is constructed to show the relationship between the groups of a categorical variable and a continuous variable. In this case, the Male and Female Population is broken down by Planning Area. The geom_point() function is used to display the points, which are filled according to Sex. The plot is flipped to display the Planning Areas on the vertical axis. Lastly, some formatting is done to center the title and adjust the font size.
ggplot(bubble, aes(x = PA, y = Total_Pop/1000, fill = Sex)) +
  geom_point(shape = 21, size = 1.5) +
  coord_flip() +
  theme_bw() +
  labs(x = "Planning Area", y = "Population (in 000s)", fill = "Sex",
       title = "Figure 3: Singapore Resident Population by Sex and Planning Area, 2019",
       caption = "Source: Singapore Department of Statistics") +
  theme(
    plot.caption = element_text(hjust = 1, face = "italic"),
    plot.title = element_text(hjust = 0.5, size = 18)
  )
As seen from Figure 3, certain planning areas such as Woodlands, River Valley, Newton, and Kallang have almost identical Male and Female Populations. However, other planning areas such as Toa Payoh, Tampines, Queenstown, Bukit Timah, and Bedok show a disparity in their resident population - the Female Population is much larger than the Male Population.
The next visualization considers Region instead of Planning Area. A new dataframe called slope is created by selecting the Time, Population and Region columns. Only the years 2011, 2015, and 2019 are kept so that the general trend in population change can be examined; showing every year would result in a very cluttered graph.
slope <- df %>%
  mutate(Time = as.character(Time)) %>%
  select(Time, Pop, Region) %>%
  filter(Time %in% c("2011", "2015", "2019")) %>%
  group_by(Region, Time) %>%
  summarise(Pop = sum(Pop))
To plot the visualization, the CGPfunctions package is used. Its newggslopegraph function creates a slope graph, which shows how Population develops between dates. The function takes the slope dataframe, Time as the column plotted along the x-axis, Pop as the column containing the values displayed along the y-axis, and Region as the grouping column that defines each line.
newggslopegraph(slope, Time, Pop, Region) +
  labs(title = "Figure 4: Singapore Resident Population by Region, 2011-2019",
       subtitle = "",
       caption = "Source: Singapore Department of Statistics") +
  theme(
    plot.caption = element_text(hjust = 1, face = "italic"),
    plot.title = element_text(hjust = 0.5, size = 12)
  )
The above slope chart illustrates that while Resident Population has increased in the North, North East, and West Regions, it has increased at a much faster pace in the North East Region. By contrast, the East and Central Regions have seen slight decreases in population over the last 8 years.
The fifth visualization aggregates the age groups into 3 categories - Young, Economically Active and Aged. A new variable called Age_Group is created to reflect these 3 categories using the mutate function and nested ifelse() calls. The data is otherwise processed as for the previous visualizations, and a new dataframe called groups is created. Time is left numeric here because the plot fits a regression line against it and uses a continuous x-axis.
groups <- df %>%
  # Time stays numeric so that geom_smooth() and scale_x_continuous() can use it
  mutate(AG = factor(AG, levels = c("0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24",
                                    "25_to_29", "30_to_34", "35_to_39", "40_to_44", "45_to_49",
                                    "50_to_54", "55_to_59", "60_to_64", "65_to_69", "70_to_74",
                                    "75_to_79", "80_to_84", "85_to_89", "90_and_over"))) %>%
  # Collapse the 19 age groups into 3 broad categories
  mutate(Age_Group = ifelse(AG %in% c("0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24"), "Young",
                     ifelse(AG %in% c("65_to_69", "70_to_74", "75_to_79", "80_to_84", "85_to_89", "90_and_over"), "Aged", "Economically Active"))) %>%
  select(Time, Age_Group, Pop, Region) %>%
  group_by(Time, Age_Group, Region) %>%
  summarise(Total_Pop = sum(Pop)) %>%
  arrange(Age_Group)
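Nested ifelse() calls become hard to read as the number of categories grows. As an alternative, the same three-way recoding could be sketched with dplyr's case_when(). This is only an illustration on a toy dataframe (toy_ages, young and aged are hypothetical names introduced here), not the code used to produce the figures.

```r
library(dplyr)

# Age bands that make up the Young and Aged categories
young <- c("0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24")
aged  <- c("65_to_69", "70_to_74", "75_to_79", "80_to_84", "85_to_89", "90_and_over")

# Toy input with one age band from each category
toy_ages <- tibble(AG = c("0_to_4", "30_to_34", "90_and_over"))

# case_when() evaluates the conditions in order; TRUE acts as the fallback
toy_grouped <- toy_ages %>%
  mutate(Age_Group = case_when(
    AG %in% young ~ "Young",
    AG %in% aged  ~ "Aged",
    TRUE          ~ "Economically Active"
  ))

toy_grouped
```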
A scatter plot with fitted trend lines is used to show the general trend in Resident Population from 2011 to 2019. The geom_point() function displays the points and geom_smooth() overlays a line of best fit estimated by linear regression. The facet_grid() function splits the data into one panel per Region and places the panels side by side, which makes comparisons across regions easier. The x-axis breaks are limited to certain years (2012, 2014, 2016, and 2018) so that the axis does not appear cluttered. The plot grid lines have been removed so that the focus stays on the population trends. Lastly, different symbols represent the different age groups for better visibility.
ggplot(groups, aes(x = Time, y = Total_Pop/1000, color = Age_Group, shape = Age_Group)) +
  geom_point() +
  geom_smooth(method = "lm", size = 0.5) +  # linear trend per age group
  facet_grid(~ Region) +
  scale_color_manual(values = c('#999999', '#E69F00', '#56B4E9')) +
  scale_x_continuous(breaks = c(2012, 2014, 2016, 2018)) +
  theme_classic() +
  labs(title = "Figure 5: Singapore Resident Population by Age Group, 2011-2019",
       x = "Year", y = "Population (in 000s)",
       caption = "Source: Singapore Department of Statistics") +
  theme(
    plot.caption = element_text(hjust = 1, face = "italic"),
    plot.title = element_text(hjust = 0.5, size = 16)
  )
As illustrated, the Aged Population in Singapore has clearly been increasing since 2011, with the Central Region having the largest Aged Population. By contrast, the Young Population has been declining in all Regions except the North East Region. The Economically Active Population has been stable in the East and West Regions, has increased sharply in the North East Region and has declined in the Central Region.
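Figure 5 plots population counts, so a rising count does not by itself show that a group's share of the population is rising. A share could be computed as sketched below, on a toy dataframe with made-up numbers (toy_groups is hypothetical, but it has the same columns as groups).

```r
library(dplyr)

# Toy data for a single region and two years (made-up numbers)
toy_groups <- tibble(
  Time      = c(2011, 2011, 2011, 2019, 2019, 2019),
  Region    = "Central",
  Age_Group = rep(c("Aged", "Economically Active", "Young"), 2),
  Total_Pop = c(100, 600, 300, 150, 600, 250)
)

# Share of each age group within its Region and year, in percent
shares <- toy_groups %>%
  group_by(Time, Region) %>%
  mutate(share_pct = 100 * Total_Pop / sum(Total_Pop)) %>%
  ungroup() %>%
  filter(Age_Group == "Aged")

shares
```

In this toy example the Aged share rises from 10% to 15% while the regional total stays the same.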
The final visualization combines Figures 2 to 5. Figure 1 is not included as it only gives a general overview of Singapore’s Resident Population. Placing these 4 visuals together allows insights to be drawn from them as a whole.
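The post does not state which tool was used to assemble the combined figure. One possible approach, shown here purely as an assumption, is the patchwork package, which arranges ggplot objects with the | and / operators. The sketch uses placeholder panels (make_panel is a hypothetical helper) instead of the real Figures 2 to 5.

```r
library(ggplot2)
library(patchwork)  # assumption: this package is not mentioned in the post

# Placeholder data and a helper that builds a stand-in panel
toy <- data.frame(x = 1:3, y = c(2, 4, 3))
make_panel <- function(title) {
  ggplot(toy, aes(x, y)) +
    geom_point() +
    labs(title = title) +
    theme_classic()
}

p2 <- make_panel("Figure 2")
p3 <- make_panel("Figure 3")
p4 <- make_panel("Figure 4")
p5 <- make_panel("Figure 5")

# | places plots side by side, / stacks the rows on top of each other
combined <- (p2 | p3) / (p4 | p5) +
  plot_annotation(title = "Singapore Resident Population: combined view")

combined
```

With the real figures, each ggplot call would first be assigned to an object (for example p2 <- ggplot(...)) before being combined.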
From the above visuals, the following insights can be drawn.
1) The North East Region has visibly gained popularity, as shown by a resident population that has grown since 2011, driven especially by a rising economically active population. This is supported by the fact that the Singaporean government has initiated a development plan to relocate more working families to the Punggol/Sengkang area, which has resulted in an explosion of housing projects as well as investment opportunities. Furthermore, the upcoming expansion of the MRT network is expected to increase connectivity and cut journey times for residents in the North East Region, which could be another factor behind the growth of this region's resident population. As a result, the government will need to ensure it has sufficient resources and capacity to accommodate this growing population.
2) Singapore is facing an increasingly ageing population, as seen in the consistent increase in the resident population aged 65 and over across all regions of the country. In addition, the younger population aged 24 and below has for the most part been declining since 2011, likely due to low birth rates. One potential contributing factor that this data hints at is that the female population is larger than the male population.
A rapidly ageing population will call for economic and social policy reforms. The government will need to ensure the financial security of elderly people through retirement savings or an extension of employment years. It will also need to provide appropriate housing and health care for its ageing society.
To keep the population from shrinking, the fertility rate will also need to improve. The government should continue its work on enabling couples to get housing faster and more easily, defraying child-raising costs and enhancing work-life balance. It should be noted that although this analysis focuses only on Singaporean residents, foreign immigration (not in scope of this dataset) can in the long run help to moderate the impact of the ageing and low birth rates that the Singaporean population is currently facing.