Objective

The aim is to provide a better visual understanding of the demographic structure of Singapore's population by age cohort and by planning area.

1. Design and Data Challenges

1.1 Proposed Design Viz for Age Cohort: Population Pyramid

To visualize the age cohorts, I chose a population pyramid because it shows the age distribution of the 2019 population and the split by gender in a single chart. There was not much contention over this choice.

1.2 Data Challenges for Population Pyramid

However, building the pyramid posed some data challenges. The original long data format could not be used directly, so I had to transform it into wide data format.
I did this with the dcast() function, which is provided by both the reshape2 and data.table packages. Example of the wide data format:

Sex	0_to_4	5_to_9	10_to_14	15_to_19	20_to_24	25_to_29	30_to_34	35_to_39	40_to_44	45_to_49	50_to_54	55_to_59	60_to_64
Females	90850	97040	102550	108910	122480	145960	153460	158850	157120	160230	152750	153590	140770
Males	94730	101290	105830	113730	127040	142640	140360	142310	144130	151800	149360	153850	138490
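The long-to-wide reshaping described above can be sketched on a toy table. The counts below are made up for illustration and are not the SingStat figures; the column names AG, Sex and Pop follow the dataset.

```r
library(data.table)

# Toy long-format data (made-up counts, not the real 2019 figures)
long <- data.table(
  AG  = c("0_to_4", "0_to_4", "5_to_9", "5_to_9", "0_to_4"),
  Sex = c("Females", "Males", "Females", "Males", "Males"),
  Pop = c(10, 20, 30, 40, 5)
)

# One row per Sex, one column per age group;
# rows sharing the same Sex and AG are summed
wide <- dcast(long, Sex ~ AG, value.var = "Pop", fun.aggregate = sum)
print(wide)
```

Each age group becomes its own column, which is the shape shown in the example table above.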

1.3 Design Challenges for Planning Area Viz

For the second objective, visualizing by planning area, I considered three design options: a bar chart, a bubble plot and a ternary plot.

-Bar Chart

A bar chart can show population size across planning areas, but it misses the age composition within each planning area.

-Bubble Plot

A bubble plot allows a comparison of the old vs. economically active percentages across population sizes (via circle size) and planning areas (via colour). However, the young are left out.

-Ternary Plot

I decided on the ternary plot because it has all the benefits of the bubble plot and also shows the young, so it conveys the demographic structure best among the three options.

1.4 Data Challenges for Ternary Plot

The original data could not be used to build a ternary plot directly, so I had to transform it.
I grouped the age columns into three cohorts (0-19 as ‘YOUNG’, 20-64 as ‘ACTIVE’, and 65 and above as ‘OLD’) and amalgamated planning areas into bigger geographical areas (North, North East etc.).
I also aggregated the rows for different housing types within the same area for greater clarity.
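As a minimal sketch of the cohort grouping described above, the age-group labels can be mapped to YOUNG, ACTIVE and OLD by their lower age bound. The helper name age_band() and the toy counts are assumptions made for illustration; the report itself groups columns by position.

```r
# Hypothetical helper: map labels such as "15_to_19" or "90_and_over"
# to a broad cohort using the number before the first underscore
age_band <- function(labels) {
  lower <- as.numeric(sub("_.*$", "", labels))
  # (-Inf, 19] -> YOUNG, (19, 64] -> ACTIVE, (64, Inf) -> OLD
  cut(lower, breaks = c(-Inf, 19, 64, Inf),
      labels = c("YOUNG", "ACTIVE", "OLD"))
}

# Toy data (made-up counts)
toy <- data.frame(
  AG  = c("0_to_4", "15_to_19", "20_to_24", "60_to_64", "65_to_69", "90_and_over"),
  Pop = c(10, 20, 30, 40, 50, 60)
)

# Total population per cohort
totals <- tapply(toy$Pop, age_band(toy$AG), sum)
print(totals)
```

Grouping by label rather than by column position keeps the cohort boundaries visible in the code.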

2. Step-by-step description of how the data visualization was prepared

2.1 Installing and launching the required R libraries:

tidyverse
reshape2
data.table
ggtern
plotly

packages <- c('tidyverse', 'reshape2', 'data.table', 'ggtern', 'plotly')
for (p in packages) {
  # install the package only if it is not already available
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

2.2 Import data and initial peek:

The data is obtained from https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data, from the CSV file "Singapore Residents by Planning Area/Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019". Only the Year 2019 data is extracted and saved as “2019.csv”.

#import  data
data <- read_csv("2019.csv")

#structure of imported data
str(data)

## tibble [98,192 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ PA : chr [1:98192] "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" ...
##  $ SZ : chr [1:98192] "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" ...
##  $ AG : chr [1:98192] "0_to_4" "0_to_4" "0_to_4" "0_to_4" ...
##  $ Sex: chr [1:98192] "Males" "Males" "Males" "Males" ...
##  $ TOD: chr [1:98192] "HDB 1- and 2-Room Flats" "HDB 3-Room Flats" "HDB 4-Room Flats" "HDB 5-Room and Executive Flats" ...
##  $ Pop: num [1:98192] 0 10 10 20 0 0 50 0 0 10 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   PA = col_character(),
##   ..   SZ = col_character(),
##   ..   AG = col_character(),
##   ..   Sex = col_character(),
##   ..   TOD = col_character(),
##   ..   Pop = col_double()
##   .. )

2.3 Transforming Data for Population Pyramid

# select the 3rd, 4th and 6th columns (AG, Sex, Pop) to build the population pyramid
totalpp <- data[, c(3, 4, 6)]

# use dcast() to reshape to wide format, summing Pop within each Sex and age group
total_wide <- dcast(setDT(totalpp), Sex ~ AG, fun.aggregate = sum, value.var = "Pop")

# the 5_to_9 column is in the wrong position (13th); check the names, then move it after 0_to_4
colnames(total_wide)
total_wide_new <- total_wide[, c(1, 2, 13, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20)]

#avoid scientific notation for big numbers
options(scipen = 999)
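The column reordering above relies on fixed positions, which can shift if the sort order differs between functions or locales. A possible alternative, sketched here under the assumption that every age label starts with its lower age bound, is to order the labels by that number. The helper name order_age_groups() is hypothetical.

```r
# Hypothetical helper: sort labels such as "5_to_9" or "90_and_over"
# by their leading number instead of alphabetically
order_age_groups <- function(labels) {
  lower <- as.numeric(sub("_.*$", "", labels))
  labels[order(lower)]
}

ag <- c("0_to_4", "10_to_14", "50_to_54", "5_to_9", "90_and_over")
ordered_ag <- order_age_groups(ag)
print(ordered_ag)
```

With a data.table such as total_wide, the result could then be passed to setcolorder(), for example setcolorder(total_wide, c("Sex", order_age_groups(setdiff(names(total_wide), "Sex")))).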

2.4 Creating Population Pyramid

# reshape back to long format: one row per Sex and age group
pyramid <- melt(total_wide_new, id.vars = "Sex")

# make male counts negative so that their bars extend to the left
pyramid$value[pyramid$Sex == "Males"] <- pyramid$value[pyramid$Sex == "Males"] * -1

# build the pyramid with ggplot() and geom_col(), then flip the axes with coord_flip()
pplot <- ggplot(pyramid, aes(x = variable, y = value, fill = Sex)) +
  geom_col() +
  # label both sides of the axis with positive values, in thousands
  scale_y_continuous(breaks = seq(-150000, 150000, 50000),
                     labels = as.character(abs(seq(-150, 150, 50)))) +
  labs(x = "Age", y = "Population (in Thousands)", title = "Population Pyramid in 2019") +
  coord_flip()

# convert to an interactive plotly chart
pplot2 <- ggplotly(pplot, tooltip = c('variable', 'Sex'))
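The mirroring trick used for the pyramid can be checked in isolation: male counts are negated so the bars point left, and the axis labels take the absolute value so both sides read as positive. This is a small sketch on made-up numbers, and the helper name mirror_males() is hypothetical.

```r
# Hypothetical helper: negate the counts for one sex so the bars mirror
mirror_males <- function(df) {
  is_male <- df$Sex == "Males"
  df$value[is_male] <- -df$value[is_male]
  df
}

# Toy data (made-up counts)
toy <- data.frame(
  Sex   = c("Females", "Males"),
  value = c(90000, 95000)
)
mirrored <- mirror_males(toy)

# Axis breaks run from negative to positive; labels show thousands without the sign
breaks <- seq(-150000, 150000, 50000)
labels <- as.character(abs(breaks) / 1000)

print(mirrored)
print(labels)
```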

2.5 Prepare Data for Ternary Plot

# use spread() to turn each age group (AG) into its own column of Pop values
data_mutated <- data %>%
  spread(AG, Pop)

# the 5_to_9 column is out of order
colnames(data_mutated)

# move 5_to_9 to its correct position after 0_to_4
data_mutated <- data_mutated[, c(1, 2, 3, 4, 5, 14, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23)]

#derive young, active, and old using mutate()
data_mutated <- data_mutated %>%
  mutate(YOUNG = rowSums(.[5:8]))%>%
  mutate(ACTIVE = rowSums(.[9:17]))  %>%
  mutate(OLD = rowSums(.[18:23])) %>%
  mutate(TOTAL = rowSums(.[5:23])) %>%
  filter(TOTAL > 0)

#amalgamate planning areas into simpler geographical regions
data_mutated$AREA <- data_mutated$PA
data_mutated[c(97:178,552:561,1171:1214,1597:1647), "AREA"] = "EAST"
data_mutated[c(179:214,296:440,495:551,692:704,705:754,955:1046,1053:1077,1078:1081,1082:1099,1100:1133,1134:1141,1142:1170,1259:1353,1354:1368,1369:1419,1583:1590,1591:1594,1648:1779), "AREA"] = "CENTRAL"
data_mutated[c(215:295,441:494,562:691,848:954,1780:1781), "AREA"] = "WEST"
data_mutated[c(1047:1052,1422:1468,1595:1596,1782:1924), "AREA"] = "NORTH"
data_mutated[c(1:96,755:847,1215:1258,1420:1421,1469:1582), "AREA"] = "NORTH-EAST"
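The row ranges above depend on the exact row order of the data. A possible alternative, sketched here as an assumption rather than the method used in this report, is a lookup from planning area name to region. Only two planning areas are shown for illustration; a full lookup would need one entry per planning area.

```r
# Illustrative, incomplete lookup: planning area -> region
region_lookup <- c(
  "Ang Mo Kio" = "NORTH-EAST",
  "Bedok"      = "EAST"
)

# Toy data with one planning area missing from the lookup
toy <- data.frame(PA = c("Ang Mo Kio", "Bedok", "Unmapped Area"))

# Unmatched names become NA, which makes missing mappings easy to spot
toy$AREA <- unname(region_lookup[toy$PA])
print(toy)
```

Because the mapping is keyed on names, it does not break if rows are filtered or reordered.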

#aggregate sexes and housing types within each subzone
data_mutated2 <- data_mutated %>% 
                  group_by(PA,SZ,AREA) %>%
                  summarise(YOUNG=sum(YOUNG), ACTIVE=sum(ACTIVE),
                            OLD=sum(OLD), TOTAL=sum(TOTAL))

2.6 Build Ternary Plot

#create an interactive ternary plot using plot_ly() function
#setting function for axis formatting
axis <- function(txt) {
  list(
    title = txt, tickformat = ".0%", tickfont = list(size = 10))
  }

ternaryAxes = list(
  aaxis = axis("Young (<20 y.o.) [A]"), 
  baxis = axis("Active (20-64 y.o.) [B]"), 
  caxis = axis("Elderly (65+ y.o.) [C]")
  )

#set up title
title_detail = list(size = 12, color = 'black')

#use plot_ly() function to build ternary plot
ternaryplot <- plot_ly(
  data_mutated2, 
  a = ~YOUNG,   b = ~ACTIVE,   c = ~OLD,
  color = ~AREA,   text = ~SZ,
  size = ~TOTAL*10,   marker = list(
    line = list(color = 'rgba(152, 0, 0, .8)', width = 0.2)),
  type = "scatterternary",
  mode = 'markers'
) %>%
  layout(
    ternary = ternaryAxes,
    annotations=list(text="Demographic Structure \n of young, active, old \n in Singapore, 2019",xref="paper",x=0.5,
                      yref="paper",y=1,yshift=-30,xshift=-150,showarrow=FALSE, 
                      font=list(size=12,color='rgb(217,83,79)'))
  )
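The ternary plot positions each point by the relative sizes of its three components, so the raw counts passed as a, b and c do not need to be normalized first. To make the underlying shares explicit, they can be computed as below. This is a sketch on made-up counts, and the subzone names are placeholders.

```r
# Toy data (made-up counts, placeholder subzone names)
toy <- data.frame(
  SZ     = c("Subzone A", "Subzone B"),
  YOUNG  = c(20, 5),
  ACTIVE = c(60, 10),
  OLD    = c(20, 85)
)

# Divide each row by its total so the three shares sum to 1
parts  <- as.matrix(toy[, c("YOUNG", "ACTIVE", "OLD")])
shares <- parts / rowSums(parts)
print(round(shares, 2))
```

A point near a corner of the triangle is dominated by that cohort, which is how the outliers in the final section can be read off the plot.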

3. Final Data Visualization

3.1 Population Pyramid

3.2 Ternary Plot

3.3 Useful Information from the Visualizations

1. Singapore's population in 2019 is mostly economically active.

Using the population pyramid, we see that the bars for the 20-64 age cohorts are noticeably longer than those for the young (under 20) and old (65 and above) cohorts.
Using the ternary plot, we see that the resident population in most planning areas is mainly economically active (over 60% for most areas), while the young and old make up about 20% each.

2. Spread of demographic structure

Using the population pyramid, we see that the population rises with age from 0 to 24, stays fairly consistent across the cohorts aged 25-64, and declines from age 65 onwards.
Using the population pyramid, we can also see that there are more females than males, especially in the age cohorts above 75 years old.
Using the ternary plot, the population distribution is most spread out for the Central and East regions, and tighter for the other regions.

3. Demographic structure outliers

Using the ternary plot, Loyang West has 95% of its population aged 65 and above, Changi West has the highest share of young people at 35% of its population, and Singapore Polytechnic has the highest share of economically active people at 90% of its population.

A Look at the Demographic Structure of Singapore Population in 2019

Ralph Tay

14 July 2020