The original data is duty, need to be cleaned and prepared for visualization.
This task requires to reveal the demographic structure of Singapore population in two dimension both age cort and planing area, so it is quite different to clearly shows information in both dimensions.
Different from Tableau, data preparing in R need write coding, so it might be more complicated than Tableau.
In order to clean the prepare the data, we need to extract the information we need from the data table and check true if there are empty values in the dataframe.
In order to display information in two dimension, we could draw a heatmap showing information for both age cohort and planning are, meanwhile we could cluster planning areas.
In order to make better data visualization, need to search for some good package to draw the heatmap plot.
Sketch
packages = c('comprehenr', 'heatmaply', 'tidyverse','dendextend')
for (p in packages){
if(!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
load and observe data
data = read.csv("../respopagesextod2011to2019.csv")
head(data)
## PA SZ AG Sex
## 1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## TOD Pop Time
## 1 HDB 1- and 2-Room Flats 0 2011
## 2 HDB 3-Room Flats 10 2011
## 3 HDB 4-Room Flats 30 2011
## 4 HDB 5-Room and Executive Flats 50 2011
## 5 HUDC Flats (excluding those privatised) 0 2011
## 6 Landed Properties 0 2011
Since we only need data of year 2019 and only columns of Planning area, Age cohort and Population, we need to extract sub-data table from the original data table.
data2019 = data[which(data$Time == 2019),]
data2019_arrage = aggregate(data2019$Pop,
by=list(PA=data2019$PA,AG=data2019$AG),FUN=sum)
colnames(data2019_arrage) = c("PA","AG","Pop")
head(data2019_arrage)
## PA AG Pop
## 1 Ang Mo Kio 0_to_4 5420
## 2 Bedok 0_to_4 10020
## 3 Bishan 0_to_4 2850
## 4 Boon Lay 0_to_4 0
## 5 Bukit Batok 0_to_4 7130
## 6 Bukit Merah 0_to_4 6100
Noticed that some planing areas have no population, check the total population of each planing area and remove areas have no population
df = aggregate(data2019_arrage$Pop, by=list(PA=data2019_arrage$PA),FUN=sum)
df = df[df$x >0, ]
PAlist = unique(df$PA)
print(length(PAlist))
## [1] 42
Now we know that there are total 42 planning area have population records in 2019, then we exclude those areas with no record in 2019.
data2019_arrage = data2019_arrage[data2019_arrage$PA %in% PAlist,]
In order to draw a heatmap, we need to re-arrange the table, making age cohort as columns and planing are as rows, filling with the population values.
nn<-reshape(data2019_arrage,timevar="AG",idvar="PA",direction="wide")
row.names(nn) <- nn$PA
nn1 <- select(nn, c(2:20))
colnames(nn1) = to_list(for (x in colnames(nn1)) str_remove(x,"Pop."))
nn1 <- nn1[, c(1,10,2:9,11:19)]
head(nn1)
## 0_to_4 5_to_9 10_to_14 15_to_19 20_to_24 25_to_29 30_to_34
## Ang Mo Kio 5420 6230 7380 7930 8920 10620 10510
## Bedok 10020 11640 13300 14640 16660 19530 17940
## Bishan 2850 3850 4430 4740 5570 7090 5430
## Bukit Batok 7130 6640 7800 8800 9850 12510 12480
## Bukit Merah 6100 6650 6640 6380 6850 9140 10550
## Bukit Panjang 6700 7230 7680 8500 9570 10560 10740
## 35_to_39 40_to_44 45_to_49 50_to_54 55_to_59 60_to_64 65_to_69
## Ang Mo Kio 10940 11760 12570 12170 13090 12810 11970
## Bedok 18310 20070 21290 20870 22550 21830 18810
## Bishan 5290 5940 6860 6510 7220 7140 5730
## Bukit Batok 10600 10690 11680 12010 12450 11590 8560
## Bukit Merah 11050 11830 11780 10790 11100 11270 10370
## Bukit Panjang 10230 9610 10610 10450 11410 9970 6910
## 70_to_74 75_to_79 80_to_84 85_to_89 90_and_over
## Ang Mo Kio 8960 6160 3840 2110 1040
## Bedok 13660 8300 5600 3130 1820
## Bishan 3880 2540 1670 970 520
## Bukit Batok 5020 2930 1820 1020 560
## Bukit Merah 8310 5990 4190 2220 1390
## Bukit Panjang 4230 2470 1560 820 450
Next we need to change the data table to data matrix to make our heatmap.
data_matrix <- data.matrix(nn1)
Using heatmaply to draw a heatmap.
heatmaply(data_matrix,
Colv=NA,
seriate = "none",
colors = Blues,
k_row = 5,
fontsize_row = 5,
fontsize_col = 5,
cellnote_size = 5,
main="Demographic Structure of Singapore Population by Age Cohort and By Planning Area, 2019",
xlab = "Age Cohort",
ylab = "Planing Area")
From the heatmap, we could observe:
Planning areas could be grouped to three groups: have high residents population, middle residents population and low residents population. For example, Puggol and Sengkang have more population among planning areas in Singapore, while few people live in Sungel Kadut and Museum in 2019.
In 2019, residents mainly at age between 25 to 65 have highest,followed by young ages.
People at different age might have different residential area. For example, Bedok has a higher proportion of residents at age above 50, while Sengkang has a higher proportion of residents at age between 30-50.
Using R it is more convenient to build a data visualization, because in R, we could reply on some packages instead of build the graph step by step in Tableau by ourselves.
While using R, we could share with others our code, and they could clearly know how we did it and could easily duplicate. However, using Tableau, we have to screenshot each step and write a report in order to let others know how we got the data visualization.
While using R, we could save our code, and easily get a new graph with a different dataset. So it is very useful and time-saving.