IS434 Assignment 4

1.0 Overview

This static visualization aims to uncover the demographic structure of the Singapore population by age cohort (e.g., 0-4, 5-9, ……) and by planning area in 2019.

Data is taken from Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019.


1.1 Major data and design challenges faced


1.1.1 No information on regions (Data + Design)

There is no information on the regions in Singapore in the given dataset. It would be useful to analyse the planning areas by region. Moreover, it might not be intuitive to visualise datapoints by planning areas only without region context.

Solution: Import shapefile (MP14_SUBZONE_WEB_PL.shp) from data.gov.sg to supplement the region data


1.1.2 Age group Categorisation (Data)

The age distribution is presented as quinquennial age groups (i.e. in 5 years interval) which may be less useful for a broader analysis of the demographics in Singapore.

Solution: Categorise the data in broad age groups of young (0-24 years old), active (25-64 years old), old (65 years old and above). The young and economically active age groups are categorised in this way as people in Singapore usually enter the workforce after 24 years old due to mandatory tertiary education.


1.1.3 Recoding of Age Group (Data)

Age groups are not correctly sorted in ascending order due to poor naming of the age groups. This is undesirable because in order to plot the ternary plot and population pyramid, the spread function in the tidyr package would need the age groups to be correctly sorted.

Solution: Rename the age group that is sorted wrongly (i.e. ‘5_to_9’) by adding a zero in front (i.e. ’05_to_9’) so that it will be sorted after age group ‘0_to_4’ instead of before age group ‘50_to_54’.


1.1.4 Difficult to visualize data on planning areas at a glance (Design)

It would be difficult to visualize all the datapoints in a graph as there are 5 regions, 55 planning areas and 323 subzones. Too many colours may lead to cognitive overload, which is bad design.

Solution: Visualise the datapoints by colouring them according to the 5 different regions and exclude the 323 subzones for the main overall ternary plot so that it is easy for the reader to read, instead of colouring by 55 planning areas.


2.0 Proposed sketched design


3.0 Step-by-step Description


3.1 Load the relevant R packages and inspect data in RStudio

Import ggtern, sf, and tidyverse (containing ggplot2) into R. The R packages will be used to read the data and plot the visualizations. + tidyverse contains a set of packages for data manipulation and exploration + sf to encode spatial vector data + ggtern for ternary plots

packages = c('tidyverse','sf', 'ggtern')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Load the main dataset in R and inspect the data types and variables.

# load the pop data
pop_data <- read_csv("data/respopagesextod2011to2020.csv")

# visualize the pop data
#head(data.frame(pop_data))
str(pop_data)
## tibble [984,656 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ PA  : chr [1:984656] "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" ...
##  $ SZ  : chr [1:984656] "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" ...
##  $ AG  : chr [1:984656] "0_to_4" "0_to_4" "0_to_4" "0_to_4" ...
##  $ Sex : chr [1:984656] "Males" "Males" "Males" "Males" ...
##  $ TOD : chr [1:984656] "HDB 1- and 2-Room Flats" "HDB 3-Room Flats" "HDB 4-Room Flats" "HDB 5-Room and Executive Flats" ...
##  $ Pop : num [1:984656] 0 10 30 50 0 0 40 0 0 10 ...
##  $ Time: num [1:984656] 2011 2011 2011 2011 2011 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   PA = col_character(),
##   ..   SZ = col_character(),
##   ..   AG = col_character(),
##   ..   Sex = col_character(),
##   ..   TOD = col_character(),
##   ..   Pop = col_double(),
##   ..   Time = col_double()
##   .. )

Load the second dataset to create the region column.

# import region shp file
region_data = st_read("data/master-plan-2014-subzone-boundary-web-shp/MP14_SUBZONE_WEB_PL.shp")
## Reading layer `MP14_SUBZONE_WEB_PL' from data source `C:\Users\User\Documents\data\master-plan-2014-subzone-boundary-web-shp\MP14_SUBZONE_WEB_PL.shp' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS:  SVY21
# visualize the region shp file
head(data.frame(region_data))
##   OBJECTID SUBZONE_NO      SUBZONE_N SUBZONE_C CA_IND      PLN_AREA_N
## 1        1          1   MARINA SOUTH    MSSZ01      Y    MARINA SOUTH
## 2        2          1   PEARL'S HILL    OTSZ01      Y          OUTRAM
## 3        3          3      BOAT QUAY    SRSZ03      Y SINGAPORE RIVER
## 4        4          8 HENDERSON HILL    BMSZ08      N     BUKIT MERAH
## 5        5          3        REDHILL    BMSZ03      N     BUKIT MERAH
## 6        6          7 ALEXANDRA HILL    BMSZ07      N     BUKIT MERAH
##   PLN_AREA_C       REGION_N REGION_C          INC_CRC FMEL_UPD_D   X_ADDR
## 1         MS CENTRAL REGION       CR 5ED7EB253F99252E 2014-12-05 31595.84
## 2         OT CENTRAL REGION       CR 8C7149B9EB32EEFC 2014-12-05 28679.06
## 3         SR CENTRAL REGION       CR C35FEFF02B13E0E5 2014-12-05 29654.96
## 4         BM CENTRAL REGION       CR 3775D82C5DDBEFBD 2014-12-05 26782.83
## 5         BM CENTRAL REGION       CR 85D9ABEF0A40678F 2014-12-05 26201.96
## 6         BM CENTRAL REGION       CR 9D286521EF5E3B59 2014-12-05 25358.82
##     Y_ADDR SHAPE_Leng SHAPE_Area                       geometry
## 1 29220.19   5267.381  1630379.3 MULTIPOLYGON (((31495.56 30...
## 2 29782.05   3506.107   559816.2 MULTIPOLYGON (((29092.28 30...
## 3 29974.66   1740.926   160807.5 MULTIPOLYGON (((29932.33 29...
## 4 29933.77   3313.625   595428.9 MULTIPOLYGON (((27131.28 30...
## 5 30005.70   2825.594   387429.4 MULTIPOLYGON (((26451.03 30...
## 6 29991.38   4428.913  1030378.8 MULTIPOLYGON (((25899.7 297...

3.2 Merge the two datasets together

Merge pop_data with region_data on subzone. After merging, filter the merged dataset by the year 2019 and use the mutate function to recode ‘5_to_9’ to ‘05_to_9’.

# convert CAPS to lowercase so that region_data can merge with pop_data on subzone
region_data$SUBZONE_N <- str_to_title(region_data$SUBZONE_N)
region_data$REGION_N <- str_to_title(region_data$REGION_N)


overall_data <- left_join(pop_data, region_data, by = c('SZ'='SUBZONE_N')) %>%
  mutate(AG = gsub('5_to_9', '05_to_9', AG)) %>%
  filter(Time == 2019)



overall_df <- data.frame(overall_data)
head(overall_df)
##           PA                     SZ     AG   Sex
## 1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
##                                       TOD Pop Time OBJECTID SUBZONE_NO
## 1                 HDB 1- and 2-Room Flats   0 2019      249          1
## 2                        HDB 3-Room Flats  10 2019      249          1
## 3                        HDB 4-Room Flats  10 2019      249          1
## 4          HDB 5-Room and Executive Flats  20 2019      249          1
## 5 HUDC Flats (excluding those privatised)   0 2019      249          1
## 6                       Landed Properties   0 2019      249          1
##   SUBZONE_C CA_IND PLN_AREA_N PLN_AREA_C          REGION_N REGION_C
## 1    AMSZ01      N ANG MO KIO         AM North-East Region      NER
## 2    AMSZ01      N ANG MO KIO         AM North-East Region      NER
## 3    AMSZ01      N ANG MO KIO         AM North-East Region      NER
## 4    AMSZ01      N ANG MO KIO         AM North-East Region      NER
## 5    AMSZ01      N ANG MO KIO         AM North-East Region      NER
## 6    AMSZ01      N ANG MO KIO         AM North-East Region      NER
##            INC_CRC FMEL_UPD_D   X_ADDR  Y_ADDR SHAPE_Leng SHAPE_Area
## 1 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4   2642.209     316882
## 2 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4   2642.209     316882
## 3 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4   2642.209     316882
## 4 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4   2642.209     316882
## 5 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4   2642.209     316882
## 6 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4   2642.209     316882
##                         geometry
## 1 MULTIPOLYGON (((29692.8 389...
## 2 MULTIPOLYGON (((29692.8 389...
## 3 MULTIPOLYGON (((29692.8 389...
## 4 MULTIPOLYGON (((29692.8 389...
## 5 MULTIPOLYGON (((29692.8 389...
## 6 MULTIPOLYGON (((29692.8 389...

3.3 Create a dataframe with removed irrelevant columns, grouped by planning area, region, age group and time.

Summarise the population total in each group as Pop Total. This cleaned dataframe will be used to create the dataset for the ternary plots later.

# Data Frame of total population by planning area, region, age group and time
sg<-overall_data %>%
  select(PA, SZ, REGION_N, AG, Sex, TOD, Pop, Time) %>%
  group_by(PA, REGION_N, AG, Time)%>% 
  summarize(Pop_Total = sum(Pop)) 

sg_df <- data.frame(sg)
head(sg_df)
##           PA          REGION_N       AG Time Pop_Total
## 1 Ang Mo Kio North-East Region   0_to_4 2019      5420
## 2 Ang Mo Kio North-East Region  05_to_9 2019      6230
## 3 Ang Mo Kio North-East Region 10_to_14 2019      7380
## 4 Ang Mo Kio North-East Region 15_to_19 2019      7930
## 5 Ang Mo Kio North-East Region 20_to_24 2019      8920
## 6 Ang Mo Kio North-East Region 25_to_29 2019     10620

3.4 Visualisation 1: Ternary Plot

Ternary plots are a way of displaying the distribution and variability of three‐part compositional data, which are age groups, regions, and population in this case. They are useful for presenting data with three variables in a concise manner.


3.4.1 Prepare the dataset for ternary plot

Categorise the age groups into new columns of Young, Active and Old. Filter out the zero values.

# Prepare the data for ternary plot *

#Deriving the young, economy active and old measures
agpop_mutated <- sg_df %>%
  spread(AG, Pop_Total)  %>%
  mutate(Young = rowSums(.[4:8])) %>% 
  mutate(Active = rowSums(.[9:16])) %>%
  mutate(Old = rowSums(.[17:22])) %>%
  mutate(TOTAL = rowSums(.[23:25])) %>%
  filter(TOTAL > 0) %>%
  arrange(desc(TOTAL))

df_ternary <- data.frame(agpop_mutated)
head(df_ternary)
##            PA          REGION_N Time X0_to_4 X05_to_9 X10_to_14 X15_to_19
## 1       Bedok       East Region 2019   10020    11640     13300     14640
## 2 Jurong West       West Region 2019   11090    13600     15430     16680
## 3    Tampines       East Region 2019   11320    11180     12060     14000
## 4   Woodlands      North Region 2019   11610    12830     14880     18280
## 5    Sengkang North-East Region 2019   17370    17260     15360     13620
## 6     Hougang North-East Region 2019    8940     9260     10430     12160
##   X20_to_24 X25_to_29 X30_to_34 X35_to_39 X40_to_44 X45_to_49 X50_to_54
## 1     16660     19530     17940     18310     20070     21290     20870
## 2     17700     19590     18520     20680     21400     21330     20430
## 3     18570     21460     18720     17930     16650     17420     19440
## 4     21400     20220     17880     17930     18280     20890     22040
## 5     12720     14440     22460     24820     21940     19410     15540
## 6     14770     17650     16700     15280     15240     16690     17260
##   X55_to_59 X60_to_64 X65_to_69 X70_to_74 X75_to_79 X80_to_84 X85_to_89
## 1     22550     21830     18810     13660      8300      5600      3130
## 2     19370     17740     13900      8440      4670      2540      1250
## 3     21980     20930     15230      9010      4960      3290      1850
## 4     19670     15230     10180      6120      3610      2250      1150
## 5     14730     12990      9290      5910      3410      1950      1050
## 6     19410     18200     13940      8950      5630      3600      1940
##   X90_and_over Young Active   Old  TOTAL
## 1         1820 66260 162390 51320 279970
## 2          650 74500 159060 31450 265010
## 3         1020 67130 154530 35360 257020
## 4          550 79000 152140 23860 255000
## 5          640 76330 146330 22250 244910
## 6         1060 55560 136430 35120 227110

3.4.2 Plot a Ternary Plot showing the overall Singapore population by Regions

Use the ggtern package to plot the ternary plot.

#Building the static ternary plot
ternary_plot <- ggtern(data=df_ternary, aes(x=Young,y=Active, z=Old, color=REGION_N, size=TOTAL)) +
  geom_point(alpha=0.5) +
  labs(title="Overall Percentage of Young, Active and Old in SG Population, 2019") +
  labs(colour = "Region") +
  scale_size(name="Population") +
  theme_bvbg()

ternary_plot

Plot multiple ternary plots using facet wrap in ggplot2 to show the distribution of age cohorts in regions more clearly.

#Building the static ternary plot
ternary_plots <- ggtern(data=df_ternary, aes(x=Young,y=Active, z=Old, color=REGION_N, size=TOTAL)) +
  geom_point(alpha=0.5, size=3, shape=16) +
  labs(title="Percentage of Young, Active and Old By Planning Area in Regions, 2019") +
  labs(colour = "Planning Area") +
  scale_size(name="Population") +
  theme_bvbg() +
  facet_wrap(~REGION_N, scale="free")

ternary_plots + theme(
  legend.title = element_text(size = 8),
  legend.text = element_text(size = 6),
  legend.position="none"
  )

However, the legend for planning area will look quite cramped using facet_wrap. Hence, I decided to plot the individual ternary plots based on each region, coloured by planning area.


3.4.3 Muliple Ternary Plots showing the distribution in Planning Areas by Regions

Plotting for the Ternary Plot for the North region:

ternary_plots_north_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="North Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
  geom_point(alpha=0.5, size=3, shape=16) +
  labs(subtitle="North Region") +
  labs(colour = "Planning Area") +
  scale_size(name="Population") +
  theme_bvbg() 

ternary_plots_north_pa + theme(
  legend.title = element_text(size = 10),
  legend.text = element_text(size = 8),
  legend.position="right",
  plot.subtitle = element_text(hjust = 0.5)
  )

Plotting for the Ternary Plot for the Northeast region:

ternary_plots_northeast_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="North-East Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
  geom_point(alpha=0.5, size=3, shape=16) +
  labs(subtitle="Northeast Region") +
  labs(colour = "Planning Area") +
  scale_size(name="Population") +
  theme_bvbg() 

ternary_plots_northeast_pa + theme(
  legend.title = element_text(size = 10),
  legend.text = element_text(size = 8),
  legend.position="right",
  plot.subtitle = element_text(hjust = 0.5)
  )

Plotting for the Ternary Plot for the Central region:

ternary_plots_central_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="Central Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
  geom_point(alpha=0.5, size=3, shape=16) +
  labs(subtitle="Central Region") +
  labs(colour = "Planning Area") +
  scale_size(name="Population") +
  theme_bvbg() 

ternary_plots_central_pa + theme(
  legend.title = element_text(size = 8),
  legend.text = element_text(size = 6),
  legend.position="right",
  legend.margin=margin(l = 5, unit='cm'),
  plot.subtitle = element_text(hjust = 0.5)
  )

Plotting for the Ternary Plot for the East region:

ternary_plots_east_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="East Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
  geom_point(alpha=0.5, size=3, shape=16) +
  labs(subtitle="East Region") +
  labs(colour = "Planning Area") +
  scale_size(name="Population") +
  theme_bvbg() 

ternary_plots_east_pa + theme(
  legend.title = element_text(size = 10),
  legend.text = element_text(size = 8),
  legend.position="right",
  plot.subtitle = element_text(hjust = 0.5)
  )

Plotting for the Ternary Plot for the West region:

ternary_plots_west_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="West Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
  geom_point(alpha=0.5, size=3, shape=16) +
  labs(subtitle="West Region") +
  labs(colour = "Planning Area") +
  scale_size(name="Population") +
  theme_bvbg() 

ternary_plots_west_pa + theme(
  legend.title = element_text(size = 10),
  legend.text = element_text(size = 8),
  legend.position="right",
  plot.subtitle = element_text(hjust = 0.5)
  )

Combining the ternary plots together:

Ternary Plots by Planning Area in Regions, 2019


3.5 Visualisation 2: Population Pyramid

Population Pyramids are useful for studying the current population trend of a country and compare the similarities and differences between the different age groups and gender of the demographics.


3.5.1 Prepare the data for population pyramid

Assign the population values for Females with a negative value, so that they will appear on the left side of the population pyramid after running coord_flip().

data_pyramid<-overall_data %>% 
  select(PA, REGION_N, AG, Sex, TOD, Pop, Time) %>%
  group_by(PA, REGION_N, AG, Sex, Time) %>% 
  summarise(Pop_total = sum(Pop)) %>%
  mutate(Pop_Pyramid = ifelse(Sex == "Females",-Pop_total,Pop_total)) %>%
  mutate(AG = gsub('5_to_9', '05_to_9', AG)) %>%
  filter(Time == 2019)
df_pyramid <- data.frame(data_pyramid)
head(df_pyramid)
##           PA          REGION_N       AG     Sex Time Pop_total Pop_Pyramid
## 1 Ang Mo Kio North-East Region   0_to_4 Females 2019      2660       -2660
## 2 Ang Mo Kio North-East Region   0_to_4   Males 2019      2760        2760
## 3 Ang Mo Kio North-East Region 005_to_9 Females 2019      3110       -3110
## 4 Ang Mo Kio North-East Region 005_to_9   Males 2019      3120        3120
## 5 Ang Mo Kio North-East Region 10_to_14 Females 2019      3670       -3670
## 6 Ang Mo Kio North-East Region 10_to_14   Males 2019      3710        3710

3.5.2 Plot the population pyramid of Singapore

Use ggplot (geom_bar) to plot the population pyramid to see the distribution of males and females in the Singapore population by age cohorts.

pop_pyramid <- ggplot(df_pyramid, aes(x = AG, y = Pop_Pyramid, fill = Sex)) + 
  geom_bar(data=subset(df_pyramid,Sex=="Females"), stat = "identity") + 
  geom_bar(data=subset(df_pyramid,Sex=="Males"), stat = "identity") + 
  ggtitle(label= "Population Pyramid of Singapore, 2019")+
  scale_y_continuous(labels = paste0(as.character(c(seq(2, 0, -1), seq(1, 2, 1))), "m")) + 
  ylab("Population")+
  xlab("Age Group")+
  labs(fill = "Gender", subtitle=paste("Total:", sum(df_pyramid$Pop_total))) +
  scale_colour_manual(values = c("pink", "steelblue"),
                      aesthetics = c("colour", "fill")) +
  coord_flip()
  

pop_pyramid


4.0 Final Data Visualization and Insights Derived

  1. The overall ternary plot shows that majority of the planning areas have a Young group of mostly 40%, an Active group of mostly 60% and an Old group of between 40% to 60%. The Central Region has a higher percentage of Old but lower percentage of Young groups, with a few planning areas with less than 40%. The North and North-East regions have a lower percentage of Old group but higher percentage of Young group. This information is useful for future development plans such as deciding which areas to build more elder-friendly facilities (Central Region) and which areas to build more childcare, schools and family-friendly amenities (North and North-east Regions).

  1. On a closer analysis, the ternary plots in each planning area and region gives insights into how Singapore’s population is generally distributed geographically. In the northeast region, planning areas like Seng Kang and Punggol houses majority of the middle-aged population, thus suggesting that there are more younger families there. Whereas in the Central region, planning areas like Queenstown houses a much mature population. Moreover, based on the number of circles, the Central region has the most number of planning areas, implying that more people live there, whereas the East region has the least number of planning areas, implying that the least number of people live there.

  1. The population pyramid is of a “contracting” type and shows that Singapore’s population is mostly economically active, with ageing population increasing and declining birth rates as it has a narrow base. The proportion of females is greater than males for age group 70 and above, which suggests that females generally have a longer life expectancy than males. For the rest of the age groups, the proportion of females and males seems to be around the same. This information is useful for the planning of government policies, such as implementing more healthcare policies due to the ageing population.

References

https://www.statology.org/population-pyramid-in-r/