1. Major data and design challenges and sketch design

1.1 Challenge 1: Visualizing demographic structure by age cohort and by planning area

Description: There are 4 dimensions in creating this visualization, which are population, gender, age cohort and planning area. The x-axis and y-axis are 2 dimensions, representing age cohort and population, how to display the other 2 dimensions is a question to solve.

Solution: By using facet_wrap, the chart is duplicated by a third dimension. In our case, the third dimension is the planning area. To differentiate males from females, we can either create a stacked bar chart with color fills, or we can create a pyramid.

1.2 Challenge 2: Data filtering

Description: It is important to check the data quality and data distribution before usage. For example, we should exclude rows where the population is 0, null or negative. Another filter is the year of 2019 since the data ranges from 2011 to 2019 but only 2019 data is needed.

Solution: To use functions such as summary(), filter(), subset() and so on. Details listed below.

1.3 Challenge 3: X labelling

Description: When plotted, the labels are ordered alphabetically. Therefore, the age group 5_to_9 will be placed in the middel rather than right after 0_to_4. Another issue is that when labels are too long, they tend to overlap on each other.

Solution: Rename the labels and adjust the size.

1.4 Sketch design

The final visualization should be easy to see, since it is composed of n rows by m coloumns sub-plots where n*m equals . Each plot should be big enough and decorated with labels on its bars to make up for the far X-axis labels if placed far away. The color fill is by gender.

2. Step-by-step description on visualization preparation

2.1 Library and data loading and data Processing

library("dplyr")
library("tidyverse")
library(ggplot2)

2.2 Filtering by time

In this step, we need to read in the dataset first. This dataset contains the demographic statistics from 2011 to 2019, but we need to extract the 2019 data first. 2011-2018 data are not used. Inspired by the boolean mask in python, the same method may be applied to exclude data where the Time is not 2019

file <- read.csv("D:\\xuexi\\visualization\\assignment-4\\respopagesextod2011to2019.csv")
file <- file[file$Time==2019, ]  

2.3 Aggregation and table summary

The population in a PA is divided into subzone, sex and TOD. The data needs to be aggregated with the key of PA, Sex and AG.The Columns of SZ,TOD and Time can now be removed. From the summary we can know that there is no negative values.

file <- file[c("PA", "AG", "Sex", "Pop")]
file <-  file %>% group_by(PA, AG, Sex) %>%
         summarise(total_pop = sum(Pop))
summary(file)
##            PA              AG            Sex         total_pop    
##  Ang Mo Kio :  38   0_to_4  : 110   Females:1045   Min.   :    0  
##  Bedok      :  38   10_to_14: 110   Males  :1045   1st Qu.:    0  
##  Bishan     :  38   15_to_19: 110                  Median :  320  
##  Boon Lay   :  38   20_to_24: 110                  Mean   : 1930  
##  Bukit Batok:  38   25_to_29: 110                  3rd Qu.: 3220  
##  Bukit Merah:  38   30_to_34: 110                  Max.   :12960  
##  (Other)    :1862   (Other) :1430

2.4 Null value detection and 0 value removal

cat("The number of null values in this dataset is ", sum(is.na(file)))
## The number of null values in this dataset is  0
zero <-  file %>% group_by(PA) %>%
         summarise(total_pop = sum(total_pop))%>%
         filter(total_pop == 0) 
file <- subset(file, !(file$PA %in% zero$PA))

2.5 AG re-format.

To replace _to_ with ~ to make the AG cleaner. And have a final check on the file.

file$AG <- str_replace(file$AG, "_to_", "~")
file$AG <- str_replace(file$AG, "_and_over", "+")
file
## # A tibble: 1,596 x 4
## # Groups:   PA, AG [798]
##    PA         AG    Sex     total_pop
##    <fct>      <chr> <fct>       <int>
##  1 Ang Mo Kio 0~4   Females      2660
##  2 Ang Mo Kio 0~4   Males        2760
##  3 Ang Mo Kio 10~14 Females      3670
##  4 Ang Mo Kio 10~14 Males        3710
##  5 Ang Mo Kio 15~19 Females      3890
##  6 Ang Mo Kio 15~19 Males        4040
##  7 Ang Mo Kio 20~24 Females      4390
##  8 Ang Mo Kio 20~24 Males        4530
##  9 Ang Mo Kio 25~29 Females      5410
## 10 Ang Mo Kio 25~29 Males        5210
## # ... with 1,586 more rows

3. Description on ggplot

3.1 Stacked bar chart.

In this chart, create an ordered x-label as age_group first. Use geom_text to add labels and theme to erase the ticks on categorical age_groups.

age_group <- factor(file$AG, level = c('0~4', '5~9', '10~14', '15~19','20~24', '25~29','30~34', '35~39','40~44', '45~49','50~54', '55~59','60~64', '65~69','70~74', '75~79','80~84', '85~89','90+'))
ggplot(data = file, aes(x = age_group, y = total_pop, fill = Sex)) + 
  geom_col()+
  geom_text(aes(label = total_pop), position = position_stack(vjust = 1.2))+
  facet_wrap(~PA, ncol = 2, )+
  theme(axis.text.x = element_text(angle = 0, size = 11),
        axis.ticks = element_blank())

3.2 Pyramid charts.

Similar to the stacked bar chart, use age_group to re-order the age cohorts. Create 2 pairs of geom_bar and geom_text for female and male each. Due to our knowledge from the table summary, the largest population is 12960, so we can set the y-axis’s labels and increment with the upper limit of 14000. Finally, use coord_flip() to reverse the x and y axis. In the initial configuration, set fig.height = 100 and fig.width=25 to accommondate the plotting.

age_group <- c('0~4', '5~9', '10~14', '15~19','20~24', '25~29','30~34', '35~39','40~44', '45~49','50~54', '55~59','60~64', '65~69','70~74', '75~79','80~84', '85~89','90+')
file$AG <- factor(file$AG, level = age_group)

ggplot(data=file, aes(x = file$AG, total_pop)) +
  
geom_bar(aes(AG, total_pop, group=Sex,fill=Sex), subset(file, file$Sex == "Females"), stat = "identity") +
geom_text(aes(AG, total_pop, label = total_pop),subset(file, file$Sex == "Females"), size = 4, hjust = -0.1) +
  
geom_bar(aes(AG, -total_pop, group=Sex,fill=Sex), subset(file, file$Sex == "Males"), stat = "identity") +
geom_text(aes(AG, -total_pop, label = total_pop), subset(file, file$Sex == "Males"), size = 4, hjust = 1.2) +
  
scale_y_continuous(breaks = seq(-14000,14000,2000),labels = abs(seq(-14000,14000,2000)))+
coord_flip()+
facet_wrap(~PA, ncol = 2)+
theme(axis.ticks = element_blank())

4. Findings

4.1 Aging era

Genearlly speaking, a society is regarded as an aging society if the population of people of 60 years old and above takes up 7% of the total population. As we can see from the charts, in major planning area (where the population bar is visible), the aged population exceeds 7% with direct visual sense, which is especially the case in areas like Bedok, Hougang and Jurong West.

4.2 Life expectancy

On the whole, for people aged 60 years old and above, there are more women than men in each age group. Therefore, Women tend to have a longer life expectancy than men.

4.3 Birth rate

The birth rate is decreasing in most planning area. Only Punggol and Sengkang show an obvious trend of increasing young generations. The situation in areas like queenstown, Tanglin and Yishun seems more stable, but in most parts the number of young people is decreasing.