Objective

The aim is to provide a better visual understanding of the demographic structure of Singapore population by age cohort and by planning area .

1. Design and Data Challenges

1.1 Proposed Design Viz for Age Cohort -Population Pyramid

  • For the objective of a great visualization for age cohort, I decided on population pyramid because it is able to show the age distribution of the population, and also gender for 2019. So there was not much of a contention here.

1.2 Data Challenges for Population Pyramid

  • However, there were some data challenges to build the pyramid. I could not use the original long data format to build. I had to transform the original long data format into wide data format.

  • Using the dcast() function in the reshape2 package, I was able to do so. Example of wide data format:

Sex 0_to_4 5_to_9 10_to_14 15_to_19 20_to_24 25_to_29 30_to_34 35_to_39 40_to_44 45_to_49 50_to_54 55_to_59 60_to_64
Females 90850 97040 102550 108910 122480 145960 153460 158850 157120 160230 152750 153590 140770
Males 94730 101290 105830 113730 127040 142640 140360 142310 144130 151800 149360 153850 138490

1.3 Design Challenges for Planning Area Viz

For the next objective of planning area: I had a few design options of bar chart, bubble plot and ternary plot.

-Bar Chart

Using Bar Chart, while I could show Population Size across planning areas, I missed out on age composition within the planning area.

-Bubble Plot

Using bubble Plot allows a comparison of Old vs. Economically Active % across population sizes (via circle size) and planning areas (via colour). However, the young will be neglected here.

-Ternary Plot

I decided on Ternary Plot because it is the best among the three visualizations. It has all the benefits of the bubble plot and shows the demographic structure best among the three.

1.4 Data Challenges for Ternary Plot

  • I encountered data challenges and I had to transform the data because I cannot build a ternary plot using the original data.

  • I transformed the columns of ages 0-19 as ‘YOUNG’, 20-64 as ‘ACTIVE’ and above 65 as ’OLD, and amalgamated planning areas into bigger geographical areas (North, North East etc.).

  • I also aggregated the rows of different housing types of the same area for greater clarity.

2. Step-by-step description on how the data visualization was prepared

2.1 Install and launching required R libraries:

  • tidyverse
  • reshape2
  • data.table
  • ggtern
  • plotly
packages <- c('tidyverse', 'reshape2', 'data.table','ggtern','plotly')
for(p in packages) {
  if (!require(p, character.only = T)) {
    install.packages(p)
  }
  library(p, character.only = T)
}

2.2 Import data and initial peek:

Data is obtained from: https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data , Singapore Residents by Planning AreaSubzone, Age Group, Sex and Type of Dwelling, June 2011-2019 CSV file; and only Year 2019 data is extracted as “2019.csv”.

#import  data
data <- read_csv("2019.csv")

#structure of imported data
str(data)
## tibble [98,192 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ PA : chr [1:98192] "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" ...
##  $ SZ : chr [1:98192] "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" ...
##  $ AG : chr [1:98192] "0_to_4" "0_to_4" "0_to_4" "0_to_4" ...
##  $ Sex: chr [1:98192] "Males" "Males" "Males" "Males" ...
##  $ TOD: chr [1:98192] "HDB 1- and 2-Room Flats" "HDB 3-Room Flats" "HDB 4-Room Flats" "HDB 5-Room and Executive Flats" ...
##  $ Pop: num [1:98192] 0 10 10 20 0 0 50 0 0 10 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   PA = col_character(),
##   ..   SZ = col_character(),
##   ..   AG = col_character(),
##   ..   Sex = col_character(),
##   ..   TOD = col_character(),
##   ..   Pop = col_double()
##   .. )

2.3 Transforming Data for Population Pyramid

# choose 3rd, 4th and 6th cols - AG, sex, Pop to build Pop Pyramid
totalpp <- data[, c(3, 4, 6)]

#i use dcast() function to transform and fill cells in by population number
total_wide <- dcast(setDT(totalpp), Sex~AG, fun = list(sum), value.var = "Pop")

# Age 5_to_9 is at wrong position
colnames(total_wide)
total_wide_new <- total_wide[, c(1, 2, 13, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20)]

#avoid scientific notation for big numbers
options(scipen = 999)

2.4 Creating Population Pyramid

#separate by sex
pyramid <- melt(total_wide_new,id=c("Sex"))

#set males as negative to do reverse chart
pyramid$value[pyramid$Sex == "Males"]  <- pyramid$value[pyramid$Sex == "Males"]*-1

#build pop pyramid, flip chart using ggplot, geom_bar, then coord_flip()
pplot <- ggplot(pyramid, aes(x = variable , y = value, fill = Sex)) +
  geom_col() +
  scale_y_continuous(breaks = seq(-150000, 150000, 50000),
                     labels =   paste0(as.character(c(seq(150,0,-50),seq(50,150,50))))) +
  labs(x = "Age", y = "Population (in Thousands)", title = "Population Pyramid in 2019") +
  coord_flip() 

pplot2 <- ggplotly(pplot, tooltip=c('variable','Sex'))

2.5 Prepare Data for Ternary Plot

#Use spread() function to separate AG and Pop into different cohort 
data_mutated <- data %>%
  spread(AG, Pop) 

#order of AGE 5_to_9 is wrong
colnames(data_mutated)

#change correct order
data_mutated <- data_mutated[, c(1, 2, 3, 4, 5, 14, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23)]

#derive young, active, and old using mutate()
data_mutated <- data_mutated %>%
  mutate(YOUNG = rowSums(.[5:8]))%>%
  mutate(ACTIVE = rowSums(.[9:17]))  %>%
  mutate(OLD = rowSums(.[18:23])) %>%
  mutate(TOTAL = rowSums(.[5:23])) %>%
  filter(TOTAL > 0)

#amalgamate planning areas into simpler geographical regions
data_mutated$AREA <- data_mutated$PA
data_mutated[c(97:178,552:561,1171:1214,1597:1647), "AREA"] = "EAST"
data_mutated[c(179:214,296:440,495:551,692:704,705:754,955:1046,1053:1077,1078:1081,1082:1099,1100:1133,1134:1141,1142:1170,1259:1353,1354:1368,1369:1419,1583:1590,1591:1594,1648:1779), "AREA"] = "CENTRAL"
data_mutated[c(215:295,441:494,562:691,848:954,1780:1781), "AREA"] = "WEST"
data_mutated[c(1047:1052,1422:1468,1595:1596,1782:1924), "AREA"] = "NORTH"
data_mutated[c(1:96,755:847,1215:1258,1420:1421,1469:1582), "AREA"] = "NORTH-EAST"

#aggregate different housing types within a planning area
data_mutated2 <- data_mutated %>% 
                  group_by(PA,SZ,AREA) %>%
                  summarise(YOUNG=sum(YOUNG), ACTIVE=sum(ACTIVE),
                            OLD=sum(OLD), TOTAL=sum(TOTAL))

2.5 Build Ternary Plot

#create an interactive ternary plot using plot_ly() function
#setting function for axis formatting
axis <- function(txt) {
  list(
    title = txt, tickformat = ".0%", tickfont = list(size = 10))
  }

ternaryAxes = list(
  aaxis = axis("Young (<20 y.o.) [A]"), 
  baxis = axis("Active (20-64 y.o.) [B]"), 
  caxis = axis("Elderly (>65 y.o.) [C]")
  )

#set up title
title_detail = list(size = 12, color = 'black')

#use plot_ly() function to build ternary plot
ternaryplot <- plot_ly(
  data_mutated2, 
  a = ~YOUNG,   b = ~ACTIVE,   c = ~OLD,
  color = ~AREA,   text = ~SZ,
  size = ~TOTAL*10,   marker = list(
    line = list(color = 'rgba(152, 0, 0, .8)',width=0.2, size=~TOTAL*10)),
  type = "scatterternary",
  mode = 'markers'
) %>%
  layout(
    ternary = ternaryAxes,
    annotations=list(text="Demographic Structure \n of young, active, old \n in Singapore, 2019",xref="paper",x=0.5,
                      yref="paper",y=1,yshift=-30,xshift=-150,showarrow=FALSE, 
                      font=list(size=12,color='rgb(217,83,79)'))
  )

3. Final Data Visualization

3.1 Population Pyramid

3.2 Ternary Plot

3.3 Useful Information from the Visualizations

1. Demographic Structure of Singapore Population in 2019 is mostly economic active.

  • Using population pyramid we see the age cohorts of 20-64 years old are significantly longer than the young (<20 y.o.) and old (above 65 y.o.) cohorts.

  • Using ternary plot, we see that most of the resident population across planning areas are economically active (over 60% for most areas), while young and old are about 20% each.

2. Spread of demographic structure

  • Using population pyramid, we see rising population from 0 to 24 as the age increases, quite consistent from age cohorts between the age of 25-64 and declining population after 64 y.o. onwards.

  • using population pyramid, we can see that there are more females than males especially the age cohorts above 75 years old.

  • Using ternary plot, population distribution is most spread out for Central Region and East, and tighter for the other areas.

3. Demographic structure outliers

  • Using ternary plot, Loyang West has 95% of its population being 64 and above, Changi West has the most young people at 35% of its population and Singapore Polytechnic has the most economically active people at 90% of its population.