makeover8

major data and design challenges

The original data is duty, need to be cleaned and prepared for visualization.
This task requires to reveal the demographic structure of Singapore population in two dimension both age cort and planing area, so it is quite different to clearly shows information in both dimensions.
Different from Tableau, data preparing in R need write coding, so it might be more complicated than Tableau.

Suggestions and sketch

In order to clean the prepare the data, we need to extract the information we need from the data table and check true if there are empty values in the dataframe.
In order to display information in two dimension, we could draw a heatmap showing information for both age cohort and planning are, meanwhile we could cluster planning areas.
In order to make better data visualization, need to search for some good package to draw the heatmap plot.

Sketch alt text here

step-by-step description

1.install and library packages

packages = c('comprehenr', 'heatmaply', 'tidyverse','dendextend')
for (p in packages){
  if(!require(p,character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2.Data preparing

load and observe data

data = read.csv("../respopagesextod2011to2019.csv")
head(data)

##           PA                     SZ     AG   Sex
## 1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
## 6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males
##                                       TOD Pop Time
## 1                 HDB 1- and 2-Room Flats   0 2011
## 2                        HDB 3-Room Flats  10 2011
## 3                        HDB 4-Room Flats  30 2011
## 4          HDB 5-Room and Executive Flats  50 2011
## 5 HUDC Flats (excluding those privatised)   0 2011
## 6                       Landed Properties   0 2011

Since we only need data of year 2019 and only columns of Planning area, Age cohort and Population, we need to extract sub-data table from the original data table.

data2019 = data[which(data$Time == 2019),]
data2019_arrage = aggregate(data2019$Pop,
                            by=list(PA=data2019$PA,AG=data2019$AG),FUN=sum)
colnames(data2019_arrage) = c("PA","AG","Pop")
head(data2019_arrage)

##            PA     AG   Pop
## 1  Ang Mo Kio 0_to_4  5420
## 2       Bedok 0_to_4 10020
## 3      Bishan 0_to_4  2850
## 4    Boon Lay 0_to_4     0
## 5 Bukit Batok 0_to_4  7130
## 6 Bukit Merah 0_to_4  6100

Noticed that some planing areas have no population, check the total population of each planing area and remove areas have no population

df = aggregate(data2019_arrage$Pop, by=list(PA=data2019_arrage$PA),FUN=sum)
df = df[df$x >0, ]
PAlist = unique(df$PA)
print(length(PAlist))

## [1] 42

Now we know that there are total 42 planning area have population records in 2019, then we exclude those areas with no record in 2019.

data2019_arrage = data2019_arrage[data2019_arrage$PA %in% PAlist,]

In order to draw a heatmap, we need to re-arrange the table, making age cohort as columns and planing are as rows, filling with the population values.

nn<-reshape(data2019_arrage,timevar="AG",idvar="PA",direction="wide")
row.names(nn) <- nn$PA
nn1 <- select(nn, c(2:20))
colnames(nn1) = to_list(for (x in colnames(nn1)) str_remove(x,"Pop."))
nn1 <- nn1[, c(1,10,2:9,11:19)]
head(nn1)

##               0_to_4 5_to_9 10_to_14 15_to_19 20_to_24 25_to_29 30_to_34
## Ang Mo Kio      5420   6230     7380     7930     8920    10620    10510
## Bedok          10020  11640    13300    14640    16660    19530    17940
## Bishan          2850   3850     4430     4740     5570     7090     5430
## Bukit Batok     7130   6640     7800     8800     9850    12510    12480
## Bukit Merah     6100   6650     6640     6380     6850     9140    10550
## Bukit Panjang   6700   7230     7680     8500     9570    10560    10740
##               35_to_39 40_to_44 45_to_49 50_to_54 55_to_59 60_to_64 65_to_69
## Ang Mo Kio       10940    11760    12570    12170    13090    12810    11970
## Bedok            18310    20070    21290    20870    22550    21830    18810
## Bishan            5290     5940     6860     6510     7220     7140     5730
## Bukit Batok      10600    10690    11680    12010    12450    11590     8560
## Bukit Merah      11050    11830    11780    10790    11100    11270    10370
## Bukit Panjang    10230     9610    10610    10450    11410     9970     6910
##               70_to_74 75_to_79 80_to_84 85_to_89 90_and_over
## Ang Mo Kio        8960     6160     3840     2110        1040
## Bedok            13660     8300     5600     3130        1820
## Bishan            3880     2540     1670      970         520
## Bukit Batok       5020     2930     1820     1020         560
## Bukit Merah       8310     5990     4190     2220        1390
## Bukit Panjang     4230     2470     1560      820         450

3.Data Visualization

Next we need to change the data table to data matrix to make our heatmap.

data_matrix <- data.matrix(nn1)

Using heatmaply to draw a heatmap.

heatmaply(data_matrix,
          Colv=NA,
          seriate = "none",
          colors = Blues,
          k_row = 5,
          fontsize_row = 5,
          fontsize_col = 5,
          cellnote_size = 5,
          main="Demographic Structure of Singapore Population by Age Cohort and By Planning Area, 2019",
          xlab = "Age Cohort",
          ylab = "Planing Area")

From the heatmap, we could observe:

Planning areas could be grouped to three groups: have high residents population, middle residents population and low residents population. For example, Puggol and Sengkang have more population among planning areas in Singapore, while few people live in Sungel Kadut and Museum in 2019.
In 2019, residents mainly at age between 25 to 65 have highest,followed by young ages.
People at different age might have different residential area. For example, Bedok has a higher proportion of residents at age above 50, while Sengkang has a higher proportion of residents at age between 30-50.

Advantages of R compare to Tableau

Using R it is more convenient to build a data visualization, because in R, we could reply on some packages instead of build the graph step by step in Tableau by ourselves.
While using R, we could share with others our code, and they could clearly know how we did it and could easily duplicate. However, using Tableau, we have to screenshot each step and write a report in order to let others know how we got the data visualization.
While using R, we could save our code, and easily get a new graph with a different dataset. So it is very useful and time-saving.