IS434 Assignment 4
This static visualization aims to uncover the demographic structure of the Singapore population by age cohort (e.g., 0-4, 5-9, ……) and by planning area in 2019.
Data is taken from Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019.
There is no information on the regions in Singapore in the given dataset. It would be useful to analyse the planning areas by region. Moreover, it might not be intuitive to visualise datapoints by planning areas only without region context.
Solution: Import shapefile (MP14_SUBZONE_WEB_PL.shp) from data.gov.sg to supplement the region data
The age distribution is presented as quinquennial age groups (i.e. in 5 years interval) which may be less useful for a broader analysis of the demographics in Singapore.
Solution: Categorise the data in broad age groups of young (0-24 years old), active (25-64 years old), old (65 years old and above). The young and economically active age groups are categorised in this way as people in Singapore usually enter the workforce after 24 years old due to mandatory tertiary education.
Age groups are not correctly sorted in ascending order due to poor naming of the age groups. This is undesirable because in order to plot the ternary plot and population pyramid, the spread function in the tidyr package would need the age groups to be correctly sorted.
Solution: Rename the age group that is sorted wrongly (i.e. ‘5_to_9’) by adding a zero in front (i.e. ’05_to_9’) so that it will be sorted after age group ‘0_to_4’ instead of before age group ‘50_to_54’.
It would be difficult to visualize all the datapoints in a graph as there are 5 regions, 55 planning areas and 323 subzones. Too many colours may lead to cognitive overload, which is bad design.
Solution: Visualise the datapoints by colouring them according to the 5 different regions and exclude the 323 subzones for the main overall ternary plot so that it is easy for the reader to read, instead of colouring by 55 planning areas.
Import ggtern, sf, and tidyverse (containing ggplot2) into R. The R packages will be used to read the data and plot the visualizations. + tidyverse contains a set of packages for data manipulation and exploration + sf to encode spatial vector data + ggtern for ternary plots
packages = c('tidyverse','sf', 'ggtern')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
Load the main dataset in R and inspect the data types and variables.
# load the pop data
pop_data <- read_csv("data/respopagesextod2011to2020.csv")
# visualize the pop data
#head(data.frame(pop_data))
str(pop_data)
## tibble [984,656 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ PA : chr [1:984656] "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" "Ang Mo Kio" ...
## $ SZ : chr [1:984656] "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" "Ang Mo Kio Town Centre" ...
## $ AG : chr [1:984656] "0_to_4" "0_to_4" "0_to_4" "0_to_4" ...
## $ Sex : chr [1:984656] "Males" "Males" "Males" "Males" ...
## $ TOD : chr [1:984656] "HDB 1- and 2-Room Flats" "HDB 3-Room Flats" "HDB 4-Room Flats" "HDB 5-Room and Executive Flats" ...
## $ Pop : num [1:984656] 0 10 30 50 0 0 40 0 0 10 ...
## $ Time: num [1:984656] 2011 2011 2011 2011 2011 ...
## - attr(*, "spec")=
## .. cols(
## .. PA = col_character(),
## .. SZ = col_character(),
## .. AG = col_character(),
## .. Sex = col_character(),
## .. TOD = col_character(),
## .. Pop = col_double(),
## .. Time = col_double()
## .. )
Load the second dataset to create the region column.
# import region shp file
region_data = st_read("data/master-plan-2014-subzone-boundary-web-shp/MP14_SUBZONE_WEB_PL.shp")
## Reading layer `MP14_SUBZONE_WEB_PL' from data source `C:\Users\User\Documents\data\master-plan-2014-subzone-boundary-web-shp\MP14_SUBZONE_WEB_PL.shp' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS: SVY21
# visualize the region shp file
head(data.frame(region_data))
## OBJECTID SUBZONE_NO SUBZONE_N SUBZONE_C CA_IND PLN_AREA_N
## 1 1 1 MARINA SOUTH MSSZ01 Y MARINA SOUTH
## 2 2 1 PEARL'S HILL OTSZ01 Y OUTRAM
## 3 3 3 BOAT QUAY SRSZ03 Y SINGAPORE RIVER
## 4 4 8 HENDERSON HILL BMSZ08 N BUKIT MERAH
## 5 5 3 REDHILL BMSZ03 N BUKIT MERAH
## 6 6 7 ALEXANDRA HILL BMSZ07 N BUKIT MERAH
## PLN_AREA_C REGION_N REGION_C INC_CRC FMEL_UPD_D X_ADDR
## 1 MS CENTRAL REGION CR 5ED7EB253F99252E 2014-12-05 31595.84
## 2 OT CENTRAL REGION CR 8C7149B9EB32EEFC 2014-12-05 28679.06
## 3 SR CENTRAL REGION CR C35FEFF02B13E0E5 2014-12-05 29654.96
## 4 BM CENTRAL REGION CR 3775D82C5DDBEFBD 2014-12-05 26782.83
## 5 BM CENTRAL REGION CR 85D9ABEF0A40678F 2014-12-05 26201.96
## 6 BM CENTRAL REGION CR 9D286521EF5E3B59 2014-12-05 25358.82
## Y_ADDR SHAPE_Leng SHAPE_Area geometry
## 1 29220.19 5267.381 1630379.3 MULTIPOLYGON (((31495.56 30...
## 2 29782.05 3506.107 559816.2 MULTIPOLYGON (((29092.28 30...
## 3 29974.66 1740.926 160807.5 MULTIPOLYGON (((29932.33 29...
## 4 29933.77 3313.625 595428.9 MULTIPOLYGON (((27131.28 30...
## 5 30005.70 2825.594 387429.4 MULTIPOLYGON (((26451.03 30...
## 6 29991.38 4428.913 1030378.8 MULTIPOLYGON (((25899.7 297...
Merge pop_data with region_data on subzone. After merging, filter the merged dataset by the year 2019 and use the mutate function to recode ‘5_to_9’ to ‘05_to_9’.
# convert CAPS to lowercase so that region_data can merge with pop_data on subzone
region_data$SUBZONE_N <- str_to_title(region_data$SUBZONE_N)
region_data$REGION_N <- str_to_title(region_data$REGION_N)
overall_data <- left_join(pop_data, region_data, by = c('SZ'='SUBZONE_N')) %>%
mutate(AG = gsub('5_to_9', '05_to_9', AG)) %>%
filter(Time == 2019)
overall_df <- data.frame(overall_data)
head(overall_df)
## PA SZ AG Sex
## 1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## TOD Pop Time OBJECTID SUBZONE_NO
## 1 HDB 1- and 2-Room Flats 0 2019 249 1
## 2 HDB 3-Room Flats 10 2019 249 1
## 3 HDB 4-Room Flats 10 2019 249 1
## 4 HDB 5-Room and Executive Flats 20 2019 249 1
## 5 HUDC Flats (excluding those privatised) 0 2019 249 1
## 6 Landed Properties 0 2019 249 1
## SUBZONE_C CA_IND PLN_AREA_N PLN_AREA_C REGION_N REGION_C
## 1 AMSZ01 N ANG MO KIO AM North-East Region NER
## 2 AMSZ01 N ANG MO KIO AM North-East Region NER
## 3 AMSZ01 N ANG MO KIO AM North-East Region NER
## 4 AMSZ01 N ANG MO KIO AM North-East Region NER
## 5 AMSZ01 N ANG MO KIO AM North-East Region NER
## 6 AMSZ01 N ANG MO KIO AM North-East Region NER
## INC_CRC FMEL_UPD_D X_ADDR Y_ADDR SHAPE_Leng SHAPE_Area
## 1 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4 2642.209 316882
## 2 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4 2642.209 316882
## 3 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4 2642.209 316882
## 4 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4 2642.209 316882
## 5 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4 2642.209 316882
## 6 B996E2754B3DA5E1 2014-12-05 29501.77 39419.4 2642.209 316882
## geometry
## 1 MULTIPOLYGON (((29692.8 389...
## 2 MULTIPOLYGON (((29692.8 389...
## 3 MULTIPOLYGON (((29692.8 389...
## 4 MULTIPOLYGON (((29692.8 389...
## 5 MULTIPOLYGON (((29692.8 389...
## 6 MULTIPOLYGON (((29692.8 389...
Summarise the population total in each group as Pop Total. This cleaned dataframe will be used to create the dataset for the ternary plots later.
# Data Frame of total population by planning area, region, age group and time
sg<-overall_data %>%
select(PA, SZ, REGION_N, AG, Sex, TOD, Pop, Time) %>%
group_by(PA, REGION_N, AG, Time)%>%
summarize(Pop_Total = sum(Pop))
sg_df <- data.frame(sg)
head(sg_df)
## PA REGION_N AG Time Pop_Total
## 1 Ang Mo Kio North-East Region 0_to_4 2019 5420
## 2 Ang Mo Kio North-East Region 05_to_9 2019 6230
## 3 Ang Mo Kio North-East Region 10_to_14 2019 7380
## 4 Ang Mo Kio North-East Region 15_to_19 2019 7930
## 5 Ang Mo Kio North-East Region 20_to_24 2019 8920
## 6 Ang Mo Kio North-East Region 25_to_29 2019 10620
Ternary plots are a way of displaying the distribution and variability of three‐part compositional data, which are age groups, regions, and population in this case. They are useful for presenting data with three variables in a concise manner.
Categorise the age groups into new columns of Young, Active and Old. Filter out the zero values.
# Prepare the data for ternary plot *
#Deriving the young, economy active and old measures
agpop_mutated <- sg_df %>%
spread(AG, Pop_Total) %>%
mutate(Young = rowSums(.[4:8])) %>%
mutate(Active = rowSums(.[9:16])) %>%
mutate(Old = rowSums(.[17:22])) %>%
mutate(TOTAL = rowSums(.[23:25])) %>%
filter(TOTAL > 0) %>%
arrange(desc(TOTAL))
df_ternary <- data.frame(agpop_mutated)
head(df_ternary)
## PA REGION_N Time X0_to_4 X05_to_9 X10_to_14 X15_to_19
## 1 Bedok East Region 2019 10020 11640 13300 14640
## 2 Jurong West West Region 2019 11090 13600 15430 16680
## 3 Tampines East Region 2019 11320 11180 12060 14000
## 4 Woodlands North Region 2019 11610 12830 14880 18280
## 5 Sengkang North-East Region 2019 17370 17260 15360 13620
## 6 Hougang North-East Region 2019 8940 9260 10430 12160
## X20_to_24 X25_to_29 X30_to_34 X35_to_39 X40_to_44 X45_to_49 X50_to_54
## 1 16660 19530 17940 18310 20070 21290 20870
## 2 17700 19590 18520 20680 21400 21330 20430
## 3 18570 21460 18720 17930 16650 17420 19440
## 4 21400 20220 17880 17930 18280 20890 22040
## 5 12720 14440 22460 24820 21940 19410 15540
## 6 14770 17650 16700 15280 15240 16690 17260
## X55_to_59 X60_to_64 X65_to_69 X70_to_74 X75_to_79 X80_to_84 X85_to_89
## 1 22550 21830 18810 13660 8300 5600 3130
## 2 19370 17740 13900 8440 4670 2540 1250
## 3 21980 20930 15230 9010 4960 3290 1850
## 4 19670 15230 10180 6120 3610 2250 1150
## 5 14730 12990 9290 5910 3410 1950 1050
## 6 19410 18200 13940 8950 5630 3600 1940
## X90_and_over Young Active Old TOTAL
## 1 1820 66260 162390 51320 279970
## 2 650 74500 159060 31450 265010
## 3 1020 67130 154530 35360 257020
## 4 550 79000 152140 23860 255000
## 5 640 76330 146330 22250 244910
## 6 1060 55560 136430 35120 227110
Use the ggtern package to plot the ternary plot.
#Building the static ternary plot
ternary_plot <- ggtern(data=df_ternary, aes(x=Young,y=Active, z=Old, color=REGION_N, size=TOTAL)) +
geom_point(alpha=0.5) +
labs(title="Overall Percentage of Young, Active and Old in SG Population, 2019") +
labs(colour = "Region") +
scale_size(name="Population") +
theme_bvbg()
ternary_plot
Plot multiple ternary plots using facet wrap in ggplot2 to show the distribution of age cohorts in regions more clearly.
#Building the static ternary plot
ternary_plots <- ggtern(data=df_ternary, aes(x=Young,y=Active, z=Old, color=REGION_N, size=TOTAL)) +
geom_point(alpha=0.5, size=3, shape=16) +
labs(title="Percentage of Young, Active and Old By Planning Area in Regions, 2019") +
labs(colour = "Planning Area") +
scale_size(name="Population") +
theme_bvbg() +
facet_wrap(~REGION_N, scale="free")
ternary_plots + theme(
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
legend.position="none"
)
However, the legend for planning area will look quite cramped using facet_wrap. Hence, I decided to plot the individual ternary plots based on each region, coloured by planning area.
Plotting for the Ternary Plot for the North region:
ternary_plots_north_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="North Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
geom_point(alpha=0.5, size=3, shape=16) +
labs(subtitle="North Region") +
labs(colour = "Planning Area") +
scale_size(name="Population") +
theme_bvbg()
ternary_plots_north_pa + theme(
legend.title = element_text(size = 10),
legend.text = element_text(size = 8),
legend.position="right",
plot.subtitle = element_text(hjust = 0.5)
)
Plotting for the Ternary Plot for the Northeast region:
ternary_plots_northeast_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="North-East Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
geom_point(alpha=0.5, size=3, shape=16) +
labs(subtitle="Northeast Region") +
labs(colour = "Planning Area") +
scale_size(name="Population") +
theme_bvbg()
ternary_plots_northeast_pa + theme(
legend.title = element_text(size = 10),
legend.text = element_text(size = 8),
legend.position="right",
plot.subtitle = element_text(hjust = 0.5)
)
Plotting for the Ternary Plot for the Central region:
ternary_plots_central_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="Central Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
geom_point(alpha=0.5, size=3, shape=16) +
labs(subtitle="Central Region") +
labs(colour = "Planning Area") +
scale_size(name="Population") +
theme_bvbg()
ternary_plots_central_pa + theme(
legend.title = element_text(size = 8),
legend.text = element_text(size = 6),
legend.position="right",
legend.margin=margin(l = 5, unit='cm'),
plot.subtitle = element_text(hjust = 0.5)
)
Plotting for the Ternary Plot for the East region:
ternary_plots_east_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="East Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
geom_point(alpha=0.5, size=3, shape=16) +
labs(subtitle="East Region") +
labs(colour = "Planning Area") +
scale_size(name="Population") +
theme_bvbg()
ternary_plots_east_pa + theme(
legend.title = element_text(size = 10),
legend.text = element_text(size = 8),
legend.position="right",
plot.subtitle = element_text(hjust = 0.5)
)
Plotting for the Ternary Plot for the West region:
ternary_plots_west_pa <- ggtern(data=df_ternary %>% filter(REGION_N=="West Region"), aes(x=Young,y=Active, z=Old, color=PA, size=TOTAL)) +
geom_point(alpha=0.5, size=3, shape=16) +
labs(subtitle="West Region") +
labs(colour = "Planning Area") +
scale_size(name="Population") +
theme_bvbg()
ternary_plots_west_pa + theme(
legend.title = element_text(size = 10),
legend.text = element_text(size = 8),
legend.position="right",
plot.subtitle = element_text(hjust = 0.5)
)
Combining the ternary plots together:
Ternary Plots by Planning Area in Regions, 2019
Population Pyramids are useful for studying the current population trend of a country and compare the similarities and differences between the different age groups and gender of the demographics.
Assign the population values for Females with a negative value, so that they will appear on the left side of the population pyramid after running coord_flip().
data_pyramid<-overall_data %>%
select(PA, REGION_N, AG, Sex, TOD, Pop, Time) %>%
group_by(PA, REGION_N, AG, Sex, Time) %>%
summarise(Pop_total = sum(Pop)) %>%
mutate(Pop_Pyramid = ifelse(Sex == "Females",-Pop_total,Pop_total)) %>%
mutate(AG = gsub('5_to_9', '05_to_9', AG)) %>%
filter(Time == 2019)
df_pyramid <- data.frame(data_pyramid)
head(df_pyramid)
## PA REGION_N AG Sex Time Pop_total Pop_Pyramid
## 1 Ang Mo Kio North-East Region 0_to_4 Females 2019 2660 -2660
## 2 Ang Mo Kio North-East Region 0_to_4 Males 2019 2760 2760
## 3 Ang Mo Kio North-East Region 005_to_9 Females 2019 3110 -3110
## 4 Ang Mo Kio North-East Region 005_to_9 Males 2019 3120 3120
## 5 Ang Mo Kio North-East Region 10_to_14 Females 2019 3670 -3670
## 6 Ang Mo Kio North-East Region 10_to_14 Males 2019 3710 3710
Use ggplot (geom_bar) to plot the population pyramid to see the distribution of males and females in the Singapore population by age cohorts.
pop_pyramid <- ggplot(df_pyramid, aes(x = AG, y = Pop_Pyramid, fill = Sex)) +
geom_bar(data=subset(df_pyramid,Sex=="Females"), stat = "identity") +
geom_bar(data=subset(df_pyramid,Sex=="Males"), stat = "identity") +
ggtitle(label= "Population Pyramid of Singapore, 2019")+
scale_y_continuous(labels = paste0(as.character(c(seq(2, 0, -1), seq(1, 2, 1))), "m")) +
ylab("Population")+
xlab("Age Group")+
labs(fill = "Gender", subtitle=paste("Total:", sum(df_pyramid$Pop_total))) +
scale_colour_manual(values = c("pink", "steelblue"),
aesthetics = c("colour", "fill")) +
coord_flip()
pop_pyramid
References