Description: There are 4 dimensions in creating this visualization, which are population, gender, age cohort and planning area. The x-axis and y-axis are 2 dimensions, representing age cohort and population, how to display the other 2 dimensions is a question to solve.
Solution: By using facet_wrap, the chart is duplicated by a third dimension. In our case, the third dimension is the planning area. To differentiate males from females, we can either create a stacked bar chart with color fills, or we can create a pyramid.
Description: It is important to check the data quality and data distribution before usage. For example, we should exclude rows where the population is 0, null or negative. Another filter is the year of 2019 since the data ranges from 2011 to 2019 but only 2019 data is needed.
Solution: To use functions such as summary(), filter(), subset() and so on. Details listed below.
Description: When plotted, the labels are ordered alphabetically. Therefore, the age group 5_to_9 will be placed in the middel rather than right after 0_to_4. Another issue is that when labels are too long, they tend to overlap on each other.
Solution: Rename the labels and adjust the size.
The final visualization should be easy to see, since it is composed of n rows by m coloumns sub-plots where n*m equals . Each plot should be big enough and decorated with labels on its bars to make up for the far X-axis labels if placed far away. The color fill is by gender.
library("dplyr")
library("tidyverse")
library(ggplot2)
In this step, we need to read in the dataset first. This dataset contains the demographic statistics from 2011 to 2019, but we need to extract the 2019 data first. 2011-2018 data are not used. Inspired by the boolean mask in python, the same method may be applied to exclude data where the Time is not 2019
file <- read.csv("D:\\xuexi\\visualization\\assignment-4\\respopagesextod2011to2019.csv")
file <- file[file$Time==2019, ]
The population in a PA is divided into subzone, sex and TOD. The data needs to be aggregated with the key of PA, Sex and AG.The Columns of SZ,TOD and Time can now be removed. From the summary we can know that there is no negative values.
file <- file[c("PA", "AG", "Sex", "Pop")]
file <- file %>% group_by(PA, AG, Sex) %>%
summarise(total_pop = sum(Pop))
summary(file)
## PA AG Sex total_pop
## Ang Mo Kio : 38 0_to_4 : 110 Females:1045 Min. : 0
## Bedok : 38 10_to_14: 110 Males :1045 1st Qu.: 0
## Bishan : 38 15_to_19: 110 Median : 320
## Boon Lay : 38 20_to_24: 110 Mean : 1930
## Bukit Batok: 38 25_to_29: 110 3rd Qu.: 3220
## Bukit Merah: 38 30_to_34: 110 Max. :12960
## (Other) :1862 (Other) :1430
cat("The number of null values in this dataset is ", sum(is.na(file)))
## The number of null values in this dataset is 0
zero <- file %>% group_by(PA) %>%
summarise(total_pop = sum(total_pop))%>%
filter(total_pop == 0)
file <- subset(file, !(file$PA %in% zero$PA))
To replace _to_ with ~ to make the AG cleaner. And have a final check on the file.
file$AG <- str_replace(file$AG, "_to_", "~")
file$AG <- str_replace(file$AG, "_and_over", "+")
file
## # A tibble: 1,596 x 4
## # Groups: PA, AG [798]
## PA AG Sex total_pop
## <fct> <chr> <fct> <int>
## 1 Ang Mo Kio 0~4 Females 2660
## 2 Ang Mo Kio 0~4 Males 2760
## 3 Ang Mo Kio 10~14 Females 3670
## 4 Ang Mo Kio 10~14 Males 3710
## 5 Ang Mo Kio 15~19 Females 3890
## 6 Ang Mo Kio 15~19 Males 4040
## 7 Ang Mo Kio 20~24 Females 4390
## 8 Ang Mo Kio 20~24 Males 4530
## 9 Ang Mo Kio 25~29 Females 5410
## 10 Ang Mo Kio 25~29 Males 5210
## # ... with 1,586 more rows
In this chart, create an ordered x-label as age_group first. Use geom_text to add labels and theme to erase the ticks on categorical age_groups.
age_group <- factor(file$AG, level = c('0~4', '5~9', '10~14', '15~19','20~24', '25~29','30~34', '35~39','40~44', '45~49','50~54', '55~59','60~64', '65~69','70~74', '75~79','80~84', '85~89','90+'))
ggplot(data = file, aes(x = age_group, y = total_pop, fill = Sex)) +
geom_col()+
geom_text(aes(label = total_pop), position = position_stack(vjust = 1.2))+
facet_wrap(~PA, ncol = 2, )+
theme(axis.text.x = element_text(angle = 0, size = 11),
axis.ticks = element_blank())
Similar to the stacked bar chart, use age_group to re-order the age cohorts. Create 2 pairs of geom_bar and geom_text for female and male each. Due to our knowledge from the table summary, the largest population is 12960, so we can set the y-axis’s labels and increment with the upper limit of 14000. Finally, use coord_flip() to reverse the x and y axis. In the initial configuration, set fig.height = 100 and fig.width=25 to accommondate the plotting.
age_group <- c('0~4', '5~9', '10~14', '15~19','20~24', '25~29','30~34', '35~39','40~44', '45~49','50~54', '55~59','60~64', '65~69','70~74', '75~79','80~84', '85~89','90+')
file$AG <- factor(file$AG, level = age_group)
ggplot(data=file, aes(x = file$AG, total_pop)) +
geom_bar(aes(AG, total_pop, group=Sex,fill=Sex), subset(file, file$Sex == "Females"), stat = "identity") +
geom_text(aes(AG, total_pop, label = total_pop),subset(file, file$Sex == "Females"), size = 4, hjust = -0.1) +
geom_bar(aes(AG, -total_pop, group=Sex,fill=Sex), subset(file, file$Sex == "Males"), stat = "identity") +
geom_text(aes(AG, -total_pop, label = total_pop), subset(file, file$Sex == "Males"), size = 4, hjust = 1.2) +
scale_y_continuous(breaks = seq(-14000,14000,2000),labels = abs(seq(-14000,14000,2000)))+
coord_flip()+
facet_wrap(~PA, ncol = 2)+
theme(axis.ticks = element_blank())
Genearlly speaking, a society is regarded as an aging society if the population of people of 60 years old and above takes up 7% of the total population. As we can see from the charts, in major planning area (where the population bar is visible), the aged population exceeds 7% with direct visual sense, which is especially the case in areas like Bedok, Hougang and Jurong West.
On the whole, for people aged 60 years old and above, there are more women than men in each age group. Therefore, Women tend to have a longer life expectancy than men.
The birth rate is decreasing in most planning area. Only Punggol and Sengkang show an obvious trend of increasing young generations. The situation in areas like queenstown, Tanglin and Yishun seems more stable, but in most parts the number of young people is decreasing.