The aim is to provide a better visual understanding of the demographic structure of Singapore population by age cohort and by planning area .
However, there were some data challenges to build the pyramid. I could not use the original long data format to build. I had to transform the original long data format into wide data format.
Using the dcast() function in the reshape2 package, I was able to do so. Example of wide data format:
| Sex | 0_to_4 | 5_to_9 | 10_to_14 | 15_to_19 | 20_to_24 | 25_to_29 | 30_to_34 | 35_to_39 | 40_to_44 | 45_to_49 | 50_to_54 | 55_to_59 | 60_to_64 | ||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Females | 90850 | 97040 | 102550 | 108910 | 122480 | 145960 | 153460 | 158850 | 157120 | 160230 | 152750 | 153590 | 140770 | ||||
| Males | 94730 | 101290 | 105830 | 113730 | 127040 | 142640 | 140360 | 142310 | 144130 | 151800 | 149360 | 153850 | 138490 |
For the next objective of planning area: I had a few design options of bar chart, bubble plot and ternary plot.
Using Bar Chart, while I could show Population Size across planning areas, I missed out on age composition within the planning area.
Using bubble Plot allows a comparison of Old vs. Economically Active % across population sizes (via circle size) and planning areas (via colour). However, the young will be neglected here.
I decided on Ternary Plot because it is the best among the three visualizations. It has all the benefits of the bubble plot and shows the demographic structure best among the three.
I encountered data challenges and I had to transform the data because I cannot build a ternary plot using the original data.
I transformed the columns of ages 0-19 as ‘YOUNG’, 20-64 as ‘ACTIVE’ and above 65 as ’OLD, and amalgamated planning areas into bigger geographical areas (North, North East etc.).
I also aggregated the rows of different housing types of the same area for greater clarity.
packages <- c('tidyverse', 'reshape2', 'data.table','ggtern','plotly')
for(p in packages) {
if (!require(p, character.only = T)) {
install.packages(p)
}
library(p, character.only = T)
}
Data is obtained from: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data , Singapore Residents by Planning AreaSubzone, Age Group, Sex and Type of Dwelling, June 2011-2019 CSV file; and only Year 2019 data is extracted as “2019.csv”.
#import data
data <- read_csv("2019.csv")
#structure of imported data
str(data)
## tibble [98,192 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ PA : chr [1:98192] "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" ...
## $ SZ : chr [1:98192] "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" ...
## $ AG : chr [1:98192] "0_to_4" "0_to_4" "0_to_4" "0_to_4" ...
## $ Sex: chr [1:98192] "Males" "Males" "Males" "Males" ...
## $ TOD: chr [1:98192] "HDB 1- and 2-Room Flats" "HDB 3-Room Flats" "HDB 4-Room Flats" "HDB 5-Room and Executive Flats" ...
## $ Pop: num [1:98192] 0 10 10 20 0 0 50 0 0 10 ...
## - attr(*, "spec")=
## .. cols(
## .. PA = col_character(),
## .. SZ = col_character(),
## .. AG = col_character(),
## .. Sex = col_character(),
## .. TOD = col_character(),
## .. Pop = col_double()
## .. )
# choose 3rd, 4th and 6th cols - AG, sex, Pop to build Pop Pyramid
totalpp <- data[, c(3, 4, 6)]
#i use dcast() function to transform and fill cells in by population number
total_wide <- dcast(setDT(totalpp), Sex~AG, fun = list(sum), value.var = "Pop")
# Age 5_to_9 is at wrong position
colnames(total_wide)
total_wide_new <- total_wide[, c(1, 2, 13, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20)]
#avoid scientific notation for big numbers
options(scipen = 999)
#separate by sex
pyramid <- melt(total_wide_new,id=c("Sex"))
#set males as negative to do reverse chart
pyramid$value[pyramid$Sex == "Males"] <- pyramid$value[pyramid$Sex == "Males"]*-1
#build pop pyramid, flip chart using ggplot, geom_bar, then coord_flip()
pplot <- ggplot(pyramid, aes(x = variable , y = value, fill = Sex)) +
geom_col() +
scale_y_continuous(breaks = seq(-150000, 150000, 50000),
labels = paste0(as.character(c(seq(150,0,-50),seq(50,150,50))))) +
labs(x = "Age", y = "Population (in Thousands)", title = "Population Pyramid in 2019") +
coord_flip()
pplot2 <- ggplotly(pplot, tooltip=c('variable','Sex'))
#Use spread() function to separate AG and Pop into different cohort
data_mutated <- data %>%
spread(AG, Pop)
#order of AGE 5_to_9 is wrong
colnames(data_mutated)
#change correct order
data_mutated <- data_mutated[, c(1, 2, 3, 4, 5, 14, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23)]
#derive young, active, and old using mutate()
data_mutated <- data_mutated %>%
mutate(YOUNG = rowSums(.[5:8]))%>%
mutate(ACTIVE = rowSums(.[9:17])) %>%
mutate(OLD = rowSums(.[18:23])) %>%
mutate(TOTAL = rowSums(.[5:23])) %>%
filter(TOTAL > 0)
#amalgamate planning areas into simpler geographical regions
data_mutated$AREA <- data_mutated$PA
data_mutated[c(97:178,552:561,1171:1214,1597:1647), "AREA"] = "EAST"
data_mutated[c(179:214,296:440,495:551,692:704,705:754,955:1046,1053:1077,1078:1081,1082:1099,1100:1133,1134:1141,1142:1170,1259:1353,1354:1368,1369:1419,1583:1590,1591:1594,1648:1779), "AREA"] = "CENTRAL"
data_mutated[c(215:295,441:494,562:691,848:954,1780:1781), "AREA"] = "WEST"
data_mutated[c(1047:1052,1422:1468,1595:1596,1782:1924), "AREA"] = "NORTH"
data_mutated[c(1:96,755:847,1215:1258,1420:1421,1469:1582), "AREA"] = "NORTH-EAST"
#aggregate different housing types within a planning area
data_mutated2 <- data_mutated %>%
group_by(PA,SZ,AREA) %>%
summarise(YOUNG=sum(YOUNG), ACTIVE=sum(ACTIVE),
OLD=sum(OLD), TOTAL=sum(TOTAL))
#create an interactive ternary plot using plot_ly() function
#setting function for axis formatting
axis <- function(txt) {
list(
title = txt, tickformat = ".0%", tickfont = list(size = 10))
}
ternaryAxes = list(
aaxis = axis("Young (<20 y.o.) [A]"),
baxis = axis("Active (20-64 y.o.) [B]"),
caxis = axis("Elderly (>65 y.o.) [C]")
)
#set up title
title_detail = list(size = 12, color = 'black')
#use plot_ly() function to build ternary plot
ternaryplot <- plot_ly(
data_mutated2,
a = ~YOUNG, b = ~ACTIVE, c = ~OLD,
color = ~AREA, text = ~SZ,
size = ~TOTAL*10, marker = list(
line = list(color = 'rgba(152, 0, 0, .8)',width=0.2, size=~TOTAL*10)),
type = "scatterternary",
mode = 'markers'
) %>%
layout(
ternary = ternaryAxes,
annotations=list(text="Demographic Structure \n of young, active, old \n in Singapore, 2019",xref="paper",x=0.5,
yref="paper",y=1,yshift=-30,xshift=-150,showarrow=FALSE,
font=list(size=12,color='rgb(217,83,79)'))
)
Using population pyramid we see the age cohorts of 20-64 years old are significantly longer than the young (<20 y.o.) and old (above 65 y.o.) cohorts.
Using ternary plot, we see that most of the resident population across planning areas are economically active (over 60% for most areas), while young and old are about 20% each.
Using population pyramid, we see rising population from 0 to 24 as the age increases, quite consistent from age cohorts between the age of 25-64 and declining population after 64 y.o. onwards.
using population pyramid, we can see that there are more females than males especially the age cohorts above 75 years old.
Using ternary plot, population distribution is most spread out for Central Region and East, and tighter for the other areas.