1 library preparation
library(ggtern)
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggtern':
## method from
## grid.draw.ggplot ggplot2
## plot.ggplot ggplot2
## print.ggplot ggplot2
## --
## Remember to cite, run citation(package = 'ggtern') for further info.
## --
##
## Attaching package: 'ggtern'
## The following objects are masked from 'package:ggplot2':
##
## aes, annotate, ggplot, ggplot_build, ggplot_gtable, ggplotGrob,
## ggsave, layer_data, theme_bw, theme_classic, theme_dark,
## theme_gray, theme_light, theme_linedraw, theme_minimal, theme_void
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(corrplot)
## corrplot 0.84 loaded
library(ggpubr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.1 ✓ stringr 1.4.0
## ✓ purrr 0.3.4 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x ggtern::aes() masks ggplot2::aes()
## x ggtern::annotate() masks ggplot2::annotate()
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x ggtern::ggsave() masks ggplot2::ggsave()
## x dplyr::lag() masks stats::lag()
## x ggtern::theme_bw() masks ggplot2::theme_bw()
## x ggtern::theme_classic() masks ggplot2::theme_classic()
## x ggtern::theme_dark() masks ggplot2::theme_dark()
## x ggtern::theme_gray() masks ggplot2::theme_gray()
## x ggtern::theme_light() masks ggplot2::theme_light()
## x ggtern::theme_linedraw() masks ggplot2::theme_linedraw()
## x ggtern::theme_minimal() masks ggplot2::theme_minimal()
## x ggtern::theme_void() masks ggplot2::theme_void()
library(corrplot)
library(ggpubr)
2 data import
popdata<-read_csv("data/respopagesextod2011to2020.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PA = col_character(),
## SZ = col_character(),
## AG = col_character(),
## Sex = col_character(),
## TOD = col_character(),
## Pop = col_double(),
## Time = col_double()
## )
pop_data = filter(popdata,Time == 2019)
3 data Preprocessing
library(knitr)
kable(head(pop_data),caption = "Processed Data" )
PA | SZ | AG | Sex | TOD | Pop | Time |
---|---|---|---|---|---|---|
Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 1- and 2-Room Flats | 0 | 2019 |
Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 3-Room Flats | 10 | 2019 |
Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 4-Room Flats | 10 | 2019 |
Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HDB 5-Room and Executive Flats | 20 | 2019 |
Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | HUDC Flats (excluding those privatised) | 0 | 2019 |
Ang Mo Kio | Ang Mo Kio Town Centre | 0_to_4 | Males | Landed Properties | 0 | 2019 |
4 Visualization 4.1 age data visualization Use spread to convert row value to column to categorize ages into YOUNG, ACTIVE, and OLD after finishing spread, use the mutate() of dplyr to derive the main measures we just generated
agpop_mutated <- pop_data %>%
spread(AG, Pop)%>%
mutate(YOUNG = rowSums(.[,c('0_to_4','5_to_9','10_to_14','15_to_19')]))%>%#based on row sums for Yound, active, and old
mutate(ACTIVE = rowSums(.[,c("20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59")])) %>%
mutate(OLD = rowSums(.[,c("60_to_64","65_to_69","70_to_74","75_to_79","80_to_84","85_to_89","90_and_over")])) %>%
mutate(TOTAL = rowSums(.[25:27])) %>%
filter(TOTAL > 0)
genetate interactive ternary plot 1. use resuable function to format axis 2.initiate visualization by applying ploy_ly()
axis<-function(txt) {
list(
title = txt, tickformat = "%", tickfont = list(size = 10)
)
}
ternaryAxes<-list(
aaxis = axis("Active"),
baxis = axis("Young"),
caxis = axis("Old")
)
#intiating visualization
plot_ly(
agpop_mutated,
a = ~ACTIVE,
b = ~YOUNG,
c = ~OLD,
#text = ~YOUNG,
type = "scatterternary",
color = I("red")
)%>%
layout(
ternary = ternaryAxes
)
## No scatterternary mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Generate heatmap to compare age distribution of each PA
#install package "gplots" and load the package
#install.packages("gplots")
library("gplots")
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
##
## transpose
## The following objects are masked from 'package:dplyr':
##
## between, first, last
#select age columns to be visualized
heat<- agpop_mutated %>%
select(c(1,6:28))
#convert to
df1<- data.frame(heat)
setDT(df1)
heatdf<- df1[,lapply(.SD,sum),by=PA, .SDcols = 2:24]
### PA with Population=0 in 2019
heatdf_0<- heatdf %>%
filter(TOTAL==0)
heatdf_0
## Empty data.table (0 rows and 24 cols): PA,X0_to_4,X10_to_14,X15_to_19,X20_to_24,X25_to_29...
### PA with Population in 2019
heatdf<- heatdf %>%
filter(TOTAL>0)
heatdf
## PA X0_to_4 X10_to_14 X15_to_19 X20_to_24 X25_to_29
## 1: Ang Mo Kio 5420 7380 7930 8920 10620
## 2: Bedok 10020 13300 14640 16660 19530
## 3: Bishan 2850 4430 4740 5570 7090
## 4: Bukit Batok 7130 7800 8800 9850 12510
## 5: Bukit Merah 6100 6640 6380 6850 9140
## 6: Bukit Panjang 6700 7680 8500 9570 10560
## 7: Bukit Timah 3160 5000 4810 4690 4710
## 8: Changi 130 140 110 80 100
## 9: Choa Chu Kang 8980 10410 12770 15980 16220
## 10: Clementi 3960 4610 4550 4700 5430
## 11: Downtown Core 80 50 70 110 170
## 12: Geylang 3960 4910 5200 5990 7980
## 13: Hougang 8940 10430 12160 14770 17650
## 14: Jurong East 2630 3630 4330 4920 5790
## 15: Jurong West 11090 15430 16680 17700 19590
## 16: Kallang 4030 4230 4430 4700 6140
## 17: Lim Chu Kang 0 0 10 10 0
## 18: Mandai 80 130 140 170 160
## 19: Marine Parade 1860 2450 2400 2410 2490
## 20: Museum 20 30 10 0 30
## 21: Newton 410 470 350 370 400
## 22: Novena 2140 2550 2300 2440 2900
## 23: Orchard 50 50 30 30 30
## 24: Outram 670 740 600 680 1010
## 25: Pasir Ris 5930 7930 10090 13170 12730
## 26: Punggol 16880 10850 8210 6410 8880
## 27: Queenstown 4200 4240 4140 4790 5900
## 28: River Valley 490 550 510 440 550
## 29: Rochor 350 500 530 590 750
## 30: Seletar 10 20 20 20 0
## 31: Sembawang 5600 6250 6240 5570 7200
## 32: Sengkang 17370 15360 13620 12720 14440
## 33: Serangoon 3520 5320 6390 7770 9400
## 34: Singapore River 220 160 110 80 110
## 35: Southern Islands 120 110 90 90 70
## 36: Sungei Kadut 20 20 40 50 60
## 37: Tampines 11320 12060 14000 18570 21460
## 38: Tanglin 1110 1330 1100 990 1050
## 39: Toa Payoh 4270 5270 5650 6390 8070
## 40: Western Water Catchment 40 30 30 140 40
## 41: Woodlands 11610 14880 18280 21400 20220
## 42: Yishun 12110 11010 11650 13160 17420
## PA X0_to_4 X10_to_14 X15_to_19 X20_to_24 X25_to_29
## X30_to_34 X35_to_39 X40_to_44 X45_to_49 X5_to_9 X50_to_54 X55_to_59
## 1: 10510 10940 11760 12570 6230 12170 13090
## 2: 17940 18310 20070 21290 11640 20870 22550
## 3: 5430 5290 5940 6860 3850 6510 7220
## 4: 12480 10600 10690 11680 6640 12010 12450
## 5: 10550 11050 11830 11780 6650 10790 11100
## 6: 10740 10230 9610 10610 7230 10450 11410
## 7: 4060 4700 6010 6890 4820 5930 5340
## 8: 140 160 160 170 130 140 100
## 9: 14040 12960 12690 14610 9150 16710 15680
## 10: 6080 6990 7280 7460 4310 6640 6540
## 11: 270 310 260 270 90 200 160
## 12: 7910 7790 8060 8220 4320 8150 8900
## 13: 16700 15280 15240 16690 9260 17260 19410
## 14: 5460 5210 5380 5920 3010 5910 6360
## 15: 18520 20680 21400 21330 13600 20430 19370
## 16: 7040 7450 7870 8230 4190 7430 7770
## 17: 0 0 0 0 0 10 20
## 18: 130 130 140 170 90 170 180
## 19: 2430 3170 3800 4130 2430 3510 3350
## 20: 40 50 60 50 30 40 40
## 21: 490 590 770 820 550 660 570
## 22: 3200 3760 4280 4400 2900 3610 3370
## 23: 50 90 100 90 40 70 70
## 24: 1080 1370 1620 1420 880 1310 1450
## 25: 8930 9230 10090 11160 6670 13850 13490
## 26: 18640 21770 16680 12950 14730 8870 7560
## 27: 6860 7340 7370 7540 4240 6930 6810
## 28: 630 900 1040 1100 600 840 660
## 29: 890 970 1120 1050 450 940 960
## 30: 0 10 20 50 20 30 20
## 31: 8010 8140 8420 8340 5690 6960 6310
## 32: 22460 24820 21940 19410 17260 15540 14730
## 33: 7630 6740 7220 8390 4350 8640 9690
## 34: 180 340 410 360 200 210 180
## 35: 80 160 190 250 110 190 180
## 36: 40 40 40 50 40 50 50
## 37: 18720 17930 16650 17420 11180 19440 21980
## 38: 1220 1650 2100 2090 1580 1740 1460
## 39: 7670 8070 8810 9500 5030 8940 9340
## 40: 40 70 70 40 40 40 60
## 41: 17880 17930 18280 20890 12830 22040 19670
## 42: 18650 17940 15780 15780 11270 15880 17790
## X30_to_34 X35_to_39 X40_to_44 X45_to_49 X5_to_9 X50_to_54 X55_to_59
## X60_to_64 X65_to_69 X70_to_74 X75_to_79 X80_to_84 X85_to_89 X90_and_over
## 1: 12810 11970 8960 6160 3840 2110 1040
## 2: 21830 18810 13660 8300 5600 3130 1820
## 3: 7140 5730 3880 2540 1670 970 520
## 4: 11590 8560 5020 2930 1820 1020 560
## 5: 11270 10370 8310 5990 4190 2220 1390
## 6: 9970 6910 4230 2470 1560 820 450
## 7: 4890 4260 3320 2180 1560 890 500
## 8: 100 70 30 30 0 0 0
## 9: 12070 8160 4630 2780 1800 980 480
## 10: 6250 6460 5070 3130 1970 960 520
## 11: 140 110 100 30 20 10 50
## 12: 8790 7070 5130 3510 2380 1390 860
## 13: 18200 13940 8950 5630 3600 1940 1060
## 14: 6630 5850 3760 2100 1280 690 370
## 15: 17740 13900 8440 4670 2540 1250 650
## 16: 7690 6870 5260 3760 2550 1380 920
## 17: 10 0 10 0 0 0 0
## 18: 120 90 70 40 30 20 0
## 19: 2990 2630 2290 1640 1290 720 460
## 20: 20 10 0 0 0 0 0
## 21: 450 390 330 170 100 70 40
## 22: 3090 2680 2090 1530 1060 670 420
## 23: 50 70 40 30 10 0 0
## 24: 1520 1430 1180 810 620 330 330
## 25: 9640 6350 3790 2280 1530 860 490
## 26: 6160 5110 3410 1880 1110 540 280
## 27: 6270 5720 4980 3750 2950 1590 850
## 28: 560 480 380 190 130 70 60
## 29: 1010 950 700 470 320 160 150
## 30: 20 20 0 0 0 0 0
## 31: 5010 3420 1990 1360 840 490 230
## 32: 12990 9290 5910 3410 1950 1050 640
## 33: 10080 8130 5480 3340 2350 1320 730
## 34: 130 80 80 40 20 10 20
## 35: 110 70 40 20 0 0 0
## 36: 50 40 40 30 30 10 0
## 37: 20930 15230 9010 4960 3290 1850 1020
## 38: 1210 1100 830 500 320 190 140
## 39: 9160 7790 6130 4750 3360 1830 1030
## 40: 30 10 0 0 0 0 0
## 41: 15230 10180 6120 3610 2250 1150 550
## 42: 15310 11510 6990 4120 2520 1230 690
## X60_to_64 X65_to_69 X70_to_74 X75_to_79 X80_to_84 X85_to_89 X90_and_over
## YOUNG ACTIVE OLD TOTAL
## 1: 26960 90580 46890 164430
## 2: 49600 157220 73150 279970
## 3: 15870 49910 22450 88230
## 4: 30370 92270 31500 154140
## 5: 25770 83090 43740 152600
## 6: 30110 83180 26410 139700
## 7: 17790 42330 17600 77720
## 8: 510 1050 230 1790
## 9: 41310 118890 30900 191100
## 10: 17430 51120 24360 92910
## 11: 290 1750 460 2500
## 12: 18390 63000 29130 110520
## 13: 40790 133000 53320 227110
## 14: 13600 44950 20680 79230
## 15: 56800 159020 49190 265010
## 16: 16880 56630 28430 101940
## 17: 10 40 20 70
## 18: 440 1250 370 2060
## 19: 9140 25290 12020 46450
## 20: 90 310 30 430
## 21: 1780 4670 1550 8000
## 22: 9890 27960 11540 49390
## 23: 170 530 200 900
## 24: 2890 9940 6220 19050
## 25: 30620 92650 24940 148210
## 26: 50670 101760 18490 170920
## 27: 16820 53540 26110 96470
## 28: 2150 6160 1870 10180
## 29: 1830 7270 3760 12860
## 30: 70 150 40 260
## 31: 23780 58950 13340 96070
## 32: 63610 146060 35240 244910
## 33: 19580 65480 31430 116490
## 34: 690 1870 380 2940
## 35: 430 1210 240 1880
## 36: 120 380 200 700
## 37: 48560 152170 56290 257020
## 38: 5120 12300 4290 21710
## 39: 20220 66790 34050 121060
## 40: 140 500 40 680
## 41: 57600 158310 39090 255000
## 42: 46040 132400 42370 220810
## YOUNG ACTIVE OLD TOTAL
row.names(heatdf) <- heatdf$PA
heatdf1 <- select(heatdf, c(2:20))
heatdf1_matrix <- data.matrix(heatdf1)
heatmap.2(x=heatdf1_matrix,cexRow=0.5,cexCol =0.6)
4.3 Population data visualization import data.table to generate the dataframe which includes the PA population data we want to visualize later - specify which column to be kept and sort the table ascendingly to match PA with bars.
library(data.table)
df <- data.frame(agpop_mutated)
setDT(df)
PAdf<-df[, lapply(.SD, sum), by = PA, .SDcols = 28]
PAdf<-PAdf[order(PAdf$TOTAL),]
PAdf
## PA TOTAL
## 1: Lim Chu Kang 70
## 2: Seletar 260
## 3: Museum 430
## 4: Western Water Catchment 680
## 5: Sungei Kadut 700
## 6: Orchard 900
## 7: Changi 1790
## 8: Southern Islands 1880
## 9: Mandai 2060
## 10: Downtown Core 2500
## 11: Singapore River 2940
## 12: Newton 8000
## 13: River Valley 10180
## 14: Rochor 12860
## 15: Outram 19050
## 16: Tanglin 21710
## 17: Marine Parade 46450
## 18: Novena 49390
## 19: Bukit Timah 77720
## 20: Jurong East 79230
## 21: Bishan 88230
## 22: Clementi 92910
## 23: Sembawang 96070
## 24: Queenstown 96470
## 25: Kallang 101940
## 26: Geylang 110520
## 27: Serangoon 116490
## 28: Toa Payoh 121060
## 29: Bukit Panjang 139700
## 30: Pasir Ris 148210
## 31: Bukit Merah 152600
## 32: Bukit Batok 154140
## 33: Ang Mo Kio 164430
## 34: Punggol 170920
## 35: Choa Chu Kang 191100
## 36: Yishun 220810
## 37: Hougang 227110
## 38: Sengkang 244910
## 39: Woodlands 255000
## 40: Tampines 257020
## 41: Jurong West 265010
## 42: Bedok 279970
## PA TOTAL
generate bar chart of population in each PA, add in color by specify the value for “fill” as well as “color”.
library(stringr)
PAdf$newx = str_wrap(PAdf$PA,width = 100)
PAdf$newx
## [1] "Lim Chu Kang" "Seletar"
## [3] "Museum" "Western Water Catchment"
## [5] "Sungei Kadut" "Orchard"
## [7] "Changi" "Southern Islands"
## [9] "Mandai" "Downtown Core"
## [11] "Singapore River" "Newton"
## [13] "River Valley" "Rochor"
## [15] "Outram" "Tanglin"
## [17] "Marine Parade" "Novena"
## [19] "Bukit Timah" "Jurong East"
## [21] "Bishan" "Clementi"
## [23] "Sembawang" "Queenstown"
## [25] "Kallang" "Geylang"
## [27] "Serangoon" "Toa Payoh"
## [29] "Bukit Panjang" "Pasir Ris"
## [31] "Bukit Merah" "Bukit Batok"
## [33] "Ang Mo Kio" "Punggol"
## [35] "Choa Chu Kang" "Yishun"
## [37] "Hougang" "Sengkang"
## [39] "Woodlands" "Tampines"
## [41] "Jurong West" "Bedok"
ggplot(data = PAdf,aes(x=reorder(newx,-TOTAL),y=TOTAL))+
geom_bar(stat="identity",color="blue",fill="white")+
coord_flip()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
4.4 Gender data visualization Use spread and mutate() to generate the data to be visualized here is to generate gender ddta for each PA, so we select PA, Females, Males, and total
sexpop_mutated <- pop_data %>%
spread(Sex, Pop)%>%
select(PA,Females,Males,)%>%
mutate(TOTAL = rowSums(.[2:3]))%>%
filter(Females>0 |Males>0)
sexpop_mutated
## # A tibble: 15,658 x 4
## PA Females Males TOTAL
## <chr> <dbl> <dbl> <dbl>
## 1 Ang Mo Kio 40 50 90
## 2 Ang Mo Kio 10 10 20
## 3 Ang Mo Kio 10 10 20
## 4 Ang Mo Kio 20 20 40
## 5 Ang Mo Kio 80 70 150
## 6 Ang Mo Kio 10 10 20
## 7 Ang Mo Kio 30 20 50
## 8 Ang Mo Kio 60 50 110
## 9 Ang Mo Kio 80 90 170
## 10 Ang Mo Kio 10 10 20
## # … with 15,648 more rows
convert the mutated dota to dataframe to merge data of same PA easier
sexdf <- data.frame(sexpop_mutated)
setDT(sexdf)
sexdf<-sexdf[, lapply(.SD, sum), by = PA, .SDcols = 2:4]
sexdf
## PA Females Males TOTAL
## 1: Ang Mo Kio 85770 78660 164430
## 2: Bedok 144160 135810 279970
## 3: Bishan 45500 42730 88230
## 4: Bukit Batok 78600 75540 154140
## 5: Bukit Merah 79230 73370 152600
## 6: Bukit Panjang 70840 68860 139700
## 7: Bukit Timah 41710 36010 77720
## 8: Changi 910 880 1790
## 9: Choa Chu Kang 95860 95240 191100
## 10: Clementi 48840 44070 92910
## 11: Downtown Core 1300 1200 2500
## 12: Geylang 55870 54650 110520
## 13: Hougang 115760 111350 227110
## 14: Jurong East 39960 39270 79230
## 15: Jurong West 131920 133090 265010
## 16: Kallang 51490 50450 101940
## 17: Lim Chu Kang 40 30 70
## 18: Mandai 1070 990 2060
## 19: Marine Parade 24620 21830 46450
## 20: Museum 240 190 430
## 21: Newton 4140 3860 8000
## 22: Novena 25440 23950 49390
## 23: Orchard 480 420 900
## 24: Outram 9560 9490 19050
## 25: Pasir Ris 75070 73140 148210
## 26: Punggol 86650 84270 170920
## 27: Queenstown 51040 45430 96470
## 28: River Valley 5490 4690 10180
## 29: Rochor 6420 6440 12860
## 30: Seletar 130 130 260
## 31: Sembawang 48720 47350 96070
## 32: Sengkang 124530 120380 244910
## 33: Serangoon 60450 56040 116490
## 34: Singapore River 1550 1390 2940
## 35: Southern Islands 990 890 1880
## 36: Sungei Kadut 340 360 700
## 37: Tampines 131490 125530 257020
## 38: Tanglin 11690 10020 21710
## 39: Toa Payoh 63610 57450 121060
## 40: Western Water Catchment 370 310 680
## 41: Woodlands 127060 127940 255000
## 42: Yishun 111810 109000 220810
## PA Females Males TOTAL
visualization by Parallel Coordinates Plot to evaluate if the balance of the geneders in each PA
library(parcoords)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
parcoords(
sexdf[, c(1,4,2,3)]
)
generate extenstion plot to compare female and male desitribution of each PA
library(ggExtra)
base<-ggplot(data = sexdf, aes(x=sexdf$Females,
y=sexdf$Males))+
geom_point(stat = "identity")+
geom_vline(xintercept = 5000,linetype = 'longdash')+
geom_hline(yintercept = 5000, linetype = 'longdash')+
coord_cartesian(xlim = c(0,100000),
ylim = c(0,100000))
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
ggMarginal(base,type="histogram",color = "blue",fill= "light blue")
Problems for data processing: the scope of data is too large and the data is not grouped well. For instance, to compare the age distribution of each place, the categorization need to be generated. To overcome this, I used mutate and spread. Spread would be helpful to sum population of each PA at a range of ages easier.
The challenge for visualization is the scope of data is too large. Due to the scope of data(too many places), it is hard to visualize the name of each PA in the graph. Actually this issue has not been solved well, to make it easier to understand, i decided to output the dataframe use to generate the visualization to help readers better gain the essential information like which PA has least population, etc.
Challenge solved by designing scatch: since the labels of bars cannot be loaded, i designed another chart which is heatmap to display the distribution of population against each age group the challenge of displaying the gender distribution of each PA is that due to the highly imbalance of population distribution, it is hard to do an overall comparison. I wanted to draft a geaph which can disply the ratio of female between male, as well as bars that shows population fo each geneder.To achieve this target, i sketched the extention plot which can exactly match my target.