Rpub link: https://rpubs.com/TY2018/618920
In this take-home exercise, you are tasked to segment Singapore at the planning subzone level into homogeneous socioeconomic areas by combining geodemographic data extracted from Singapore Department of Statistics and urban functions extracted from the geospatial data provided.
From the Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 provide by Singapore Department of Statistics, you are required the extract the following indicators for 2019: * Economy active population (i.e. age 25-64) * Young group (i.e. age 0-24) * Aged group (i.e 65 and above) * Population density * HDB1-2RM dwellers * HDB3-4RM dwellers * HDB5RM-Ec dweller * Condominium and apartment dwellers * Landed property dwellers
From the geospatial data provided, you are required to extract the following urban functions: * Government including embassy * Business * Industry * Shopping * Financial * Upmarket residential area
For the purpose of this study, five geospatial data sets are provided, they are: Govt_Embassy, Private residential, Shopping, Business and Financial. These data sets are in ESRI shapefile format. You are also required to download the following data from the relevant government portals: * Singapore Residents by Planning AreaSubzone, Age Group, Sex and Type of Dwelling, June 2011-2019, and * URA Master Plan 2014 Planning Subzone boundary
packages = c('rgdal', 'spdep', 'ClustGeo', 'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
## Loading required package: rgdal
## Loading required package: sp
## rgdal: version: 1.4-8, (SVN revision 845)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
## Path to GDAL shared files: C:/Users/TYZ/Documents/R/win-library/3.6/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: C:/Users/TYZ/Documents/R/win-library/3.6/rgdal/proj
## Linking to sp version: 1.4-1
## Loading required package: spdep
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
## Loading required package: sf
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
## Loading required package: ClustGeo
## Loading required package: tmap
## Loading required package: ggpubr
## Loading required package: ggplot2
## Loading required package: cluster
## Loading required package: heatmaply
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: viridis
## Loading required package: viridisLite
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
##
## ======================
## Welcome to heatmaply version 1.1.0
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================
## Loading required package: corrplot
## corrplot 0.84 loaded
## Loading required package: psych
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Loading required package: tidyverse
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x psych::%+%() masks ggplot2::%+%()
## x psych::alpha() masks ggplot2::alpha()
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
Industry and business is all within “Business” layer
business_layer <- st_read(dsn = "data/geospatial", layer = "Business")
## Reading layer `Business' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## CRS: 4326
Examine the geospatial data:
head(business_layer, 10)
## Simple feature collection with 10 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.7354 ymin: 1.2972 xmax: 103.8858 ymax: 1.34032
## CRS: 4326
## POI_ID SEQ_NUM FAC_TYPE POI_NAME
## 1 1101180209 1 5000 JOHN CHEN
## 2 1101180210 1 5000 TROPICAL INDUSTRIAL BUILDING
## 3 1101180211 1 5000 LIAN CHEONG INDUSTRIAL BUILDING
## 4 1101180212 1 5000 MALAYSIA GARMENT MANUFACTURERS
## 5 1101180213 1 5000 UNIGOLD
## 6 1192316144 1 5000 NUS UNIVERSITY HALL
## 7 1144317654 1 5000 SUITES AT BUKIT TIMAH
## 8 1103507488 1 5000 TIONG HUAT
## 9 1001052867 1 5000 LEE CHOON GUAN TIMBER MERCHANT
## 10 1001052868 1 5000 WEIGHT BRIDGE SERVICE
## ST_NAME geometry
## 1 LITTLE RD POINT (103.8856 1.33841)
## 2 LITTLE RD POINT (103.8852 1.33832)
## 3 LITTLE RD POINT (103.8852 1.33834)
## 4 <NA> POINT (103.8855 1.33821)
## 5 LITTLE RD POINT (103.8858 1.33844)
## 6 LOWER KENT RIDGE RD POINT (103.7777 1.2972)
## 7 JALAN JURONG KECHIL POINT (103.7738 1.34032)
## 8 KALLANG PUDDING RD POINT (103.879 1.32773)
## 9 PENJURU RD POINT (103.7354 1.3184)
## 10 PENJURU RD POINT (103.7361 1.31927)
Note that for business, FAC_TYPE = 5000, for industry, FAC_TYPE = 9991.
Seperating industry and business:
industry <- business_layer %>%
filter(FAC_TYPE == 9991)
business <- business_layer %>%
filter(FAC_TYPE == 5000)
Financial:
financial <- st_read(dsn = "data/geospatial", layer = "Financial")
## Reading layer `Financial' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3320 features and 29 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6256 ymin: 1.24392 xmax: 103.9998 ymax: 1.46247
## CRS: 4326
Gove/Embassy:
govt_embassy <- st_read(dsn = "data/geospatial", layer = "Govt_Embassy")
## Reading layer `Govt_Embassy' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 443 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6282 ymin: 1.24911 xmax: 103.9884 ymax: 1.45765
## CRS: 4326
Private residential:
private_residential <- st_read(dsn = "data/geospatial", layer = "Private residential")
## Reading layer `Private residential' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3604 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6295 ymin: 1.23943 xmax: 103.9749 ymax: 1.45379
## CRS: 4326
Shopping:
shopping <- st_read(dsn = "data/geospatial", layer = "Shopping")
## Reading layer `Shopping' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 511 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.679 ymin: 1.24779 xmax: 103.9644 ymax: 1.4535
## CRS: 4326
Subzone data:
mpsz <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")
## Reading layer `MP14_SUBZONE_WEB_PL' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## proj4string: +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs
resident <- read_csv ("data/aspatial/popdata.csv")
## Parsed with column specification:
## cols(
## planning_area = col_character(),
## subzone = col_character(),
## age_group = col_character(),
## sex = col_character(),
## type_of_dwelling = col_character(),
## resident_count = col_double(),
## year = col_double()
## )
st_crs(industry)
## Coordinate Reference System:
## User input: 4326
## wkt:
## GEOGCS["GCS_WGS_1984",
## DATUM["WGS_1984",
## SPHEROID["WGS_84",6378137.0,298.257223563]],
## PRIMEM["Greenwich",0.0],
## UNIT["Degree",0.0174532925199433],
## AUTHORITY["EPSG","4326"]]
st_crs(business)
## Coordinate Reference System:
## User input: 4326
## wkt:
## GEOGCS["GCS_WGS_1984",
## DATUM["WGS_1984",
## SPHEROID["WGS_84",6378137.0,298.257223563]],
## PRIMEM["Greenwich",0.0],
## UNIT["Degree",0.0174532925199433],
## AUTHORITY["EPSG","4326"]]
st_crs(financial)
## Coordinate Reference System:
## User input: 4326
## wkt:
## GEOGCS["GCS_WGS_1984",
## DATUM["WGS_1984",
## SPHEROID["WGS_84",6378137.0,298.257223563]],
## PRIMEM["Greenwich",0.0],
## UNIT["Degree",0.0174532925199433],
## AUTHORITY["EPSG","4326"]]
st_crs(govt_embassy)
## Coordinate Reference System:
## User input: 4326
## wkt:
## GEOGCS["GCS_WGS_1984",
## DATUM["WGS_1984",
## SPHEROID["WGS_84",6378137.0,298.257223563]],
## PRIMEM["Greenwich",0.0],
## UNIT["Degree",0.0174532925199433],
## AUTHORITY["EPSG","4326"]]
st_crs(private_residential)
## Coordinate Reference System:
## User input: 4326
## wkt:
## GEOGCS["GCS_WGS_1984",
## DATUM["WGS_1984",
## SPHEROID["WGS_84",6378137.0,298.257223563]],
## PRIMEM["Greenwich",0.0],
## UNIT["Degree",0.0174532925199433],
## AUTHORITY["EPSG","4326"]]
st_crs(shopping)
## Coordinate Reference System:
## User input: 4326
## wkt:
## GEOGCS["GCS_WGS_1984",
## DATUM["WGS_1984",
## SPHEROID["WGS_84",6378137.0,298.257223563]],
## PRIMEM["Greenwich",0.0],
## UNIT["Degree",0.0174532925199433],
## AUTHORITY["EPSG","4326"]]
st_crs(mpsz)
## Coordinate Reference System:
## No user input
## wkt:
## PROJCS["SVY21",
## GEOGCS["SVY21[WGS84]",
## DATUM["WGS_1984",
## SPHEROID["WGS_84",6378137.0,298.257223563]],
## PRIMEM["Greenwich",0.0],
## UNIT["Degree",0.0174532925199433],
## AUTHORITY["EPSG","4326"]],
## PROJECTION["Transverse_Mercator"],
## PARAMETER["False_Easting",28001.642],
## PARAMETER["False_Northing",38744.572],
## PARAMETER["Central_Meridian",103.8333333333333],
## PARAMETER["Scale_Factor",1.0],
## PARAMETER["Latitude_Of_Origin",1.366666666666667],
## UNIT["Meter",1.0]]
mpsz_sg <- st_set_crs(mpsz, 'EPSG:3414')
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
## that
st_crs(mpsz_sg)
## Coordinate Reference System:
## User input: EPSG:3414
## wkt:
## PROJCS["SVY21 / Singapore TM",
## GEOGCS["SVY21",
## DATUM["SVY21",
## SPHEROID["WGS 84",6378137,298.257223563,
## AUTHORITY["EPSG","7030"]],
## AUTHORITY["EPSG","6757"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4757"]],
## PROJECTION["Transverse_Mercator"],
## PARAMETER["latitude_of_origin",1.366666666666667],
## PARAMETER["central_meridian",103.8333333333333],
## PARAMETER["scale_factor",1],
## PARAMETER["false_easting",28001.642],
## PARAMETER["false_northing",38744.572],
## UNIT["metre",1,
## AUTHORITY["EPSG","9001"]],
## AUTHORITY["EPSG","3414"]]
business_sg <- st_transform(business,3414)
industry_sg <- st_transform(industry,3414)
financial_sg <- st_transform(financial,3414)
govt_embassy_sg <- st_transform(govt_embassy,3414)
private_sg <- st_transform(private_residential,3414)
shopping_sg <- st_transform(shopping,3414)
st_crs(business_sg)
## Coordinate Reference System:
## User input: EPSG:3414
## wkt:
## PROJCS["SVY21 / Singapore TM",
## GEOGCS["SVY21",
## DATUM["SVY21",
## SPHEROID["WGS 84",6378137,298.257223563,
## AUTHORITY["EPSG","7030"]],
## AUTHORITY["EPSG","6757"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4757"]],
## PROJECTION["Transverse_Mercator"],
## PARAMETER["latitude_of_origin",1.366666666666667],
## PARAMETER["central_meridian",103.8333333333333],
## PARAMETER["scale_factor",1],
## PARAMETER["false_easting",28001.642],
## PARAMETER["false_northing",38744.572],
## UNIT["metre",1,
## AUTHORITY["EPSG","9001"]],
## AUTHORITY["EPSG","3414"]]
Examining the data:
summary(resident)
## planning_area subzone age_group sex
## Length:883728 Length:883728 Length:883728 Length:883728
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## type_of_dwelling resident_count year
## Length:883728 Min. : 0.00 Min. :2011
## Class :character 1st Qu.: 0.00 1st Qu.:2013
## Mode :character Median : 0.00 Median :2015
## Mean : 39.83 Mean :2015
## 3rd Qu.: 10.00 3rd Qu.:2017
## Max. :2860.00 Max. :2019
There are no missing values, therefore proceed.
resident_2019 <- resident %>%
filter(year == 2019)
resident_age <- resident_2019 %>%
spread(age_group, resident_count) %>%
mutate(YOUNG = `0_to_4`+`5_to_9`+`10_to_14`+
`15_to_19`+`20_to_24`) %>%
mutate(`ECONOMY ACTIVE` = rowSums(.[10:14]) + rowSums(.[16:18])) %>%
mutate(`AGED`=rowSums(.[19:24])) %>%
mutate_at(.vars = vars(subzone), .funs = funs(toupper)) %>%
select(`subzone`, `YOUNG`, `ECONOMY ACTIVE`, `AGED`) %>%
group_by(`subzone`) %>%
summarise_at(c("YOUNG", "ECONOMY ACTIVE", "AGED"), sum)
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
resident_tod <- resident_2019 %>%
spread(type_of_dwelling, resident_count) %>%
rename(`HDB1-2RM dwellers` = `HDB 1- and 2-Room Flats`) %>%
rename(`HDB5RM-Ec dwellers` = `HDB 5-Room and Executive Flats`) %>%
rename(`Condominium and apartment dwellers` = `Condominiums and Other Apartments`) %>%
rename(`Landed property dwellers` = `Landed Properties`) %>%
mutate(`HDB3-4RM dwellers` = `HDB 3-Room Flats` + `HDB 4-Room Flats`) %>%
mutate_at(.vars = vars(subzone), .funs = funs(toupper)) %>%
select(`subzone`, `HDB1-2RM dwellers`, `HDB3-4RM dwellers`, `HDB5RM-Ec dwellers`, `Condominium and apartment dwellers`, `Landed property dwellers`) %>%
group_by(`subzone`) %>%
summarise_at(c("HDB1-2RM dwellers", "HDB3-4RM dwellers", "HDB5RM-Ec dwellers", "Condominium and apartment dwellers", "Landed property dwellers"), sum)
resident_total <- resident_2019 %>%
mutate_at(.vars = vars(subzone), .funs = funs(toupper)) %>%
group_by(`subzone`) %>%
summarise(TOTAL = sum(resident_count))
Joining resident data with spatial data:
resident_mpsz <- full_join(resident_total, mpsz_sg, by = c("subzone" = "SUBZONE_N"))
## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector
Create population density:
resident_mpsz$density<-resident_mpsz$TOTAL/resident_mpsz$SHAPE_Area
Examining the results:
head(resident_mpsz)
## # A tibble: 6 x 18
## subzone TOTAL OBJECTID SUBZONE_NO SUBZONE_C CA_IND PLN_AREA_N PLN_AREA_C
## <chr> <dbl> <int> <int> <fct> <fct> <fct> <fct>
## 1 ADMIRA~ 14110 296 5 SBSZ05 N SEMBAWANG SB
## 2 AIRPOR~ 0 179 4 PLSZ04 N PAYA LEBAR PL
## 3 ALEXAN~ 13780 6 7 BMSZ07 N BUKIT MER~ BM
## 4 ALEXAN~ 2120 12 6 BMSZ06 N BUKIT MER~ BM
## 5 ALJUNI~ 40190 163 4 GLSZ04 N GEYLANG GL
## 6 ANAK B~ 22250 211 1 BTSZ01 N BUKIT TIM~ BT
## # ... with 10 more variables: REGION_N <fct>, REGION_C <fct>, INC_CRC <fct>,
## # FMEL_UPD_D <date>, X_ADDR <dbl>, Y_ADDR <dbl>, SHAPE_Leng <dbl>,
## # SHAPE_Area <dbl>, geometry <MULTIPOLYGON [m]>, density <dbl>
resident_age_tod <- left_join(resident_age, resident_tod)
## Joining, by = "subzone"
resident_mpsz_all <- full_join(resident_age_tod, resident_mpsz, by = c("subzone" = "subzone"))
Examine the result:
summary(resident_mpsz_all)
## subzone YOUNG ECONOMY ACTIVE AGED
## Length:323 Min. : 0 Min. : 0 Min. : 0
## Class :character 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Mode :character Median : 1170 Median : 2840 Median : 730
## Mean : 3296 Mean : 7386 Mean : 1806
## 3rd Qu.: 4365 3rd Qu.:10285 3rd Qu.: 3000
## Max. :34260 Max. :79780 Max. :18860
##
## HDB1-2RM dwellers HDB3-4RM dwellers HDB5RM-Ec dwellers
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 0 Median : 0
## Mean : 542 Mean : 5953 Mean : 3297
## 3rd Qu.: 605 3rd Qu.: 9705 3rd Qu.: 3660
## Max. :4700 Max. :75000 Max. :47960
##
## Condominium and apartment dwellers Landed property dwellers TOTAL
## Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## Median : 230 Median : 0.0 Median : 4890
## Mean : 1827 Mean : 773.9 Mean : 12487
## 3rd Qu.: 2835 3rd Qu.: 400.0 3rd Qu.: 17065
## Max. :16770 Max. :18820.0 Max. :132900
##
## OBJECTID SUBZONE_NO SUBZONE_C CA_IND PLN_AREA_N
## Min. : 1.0 Min. : 1.000 AMSZ01 : 1 N:274 BUKIT MERAH : 17
## 1st Qu.: 81.5 1st Qu.: 2.000 AMSZ02 : 1 Y: 49 QUEENSTOWN : 15
## Median :162.0 Median : 4.000 AMSZ03 : 1 ANG MO KIO : 12
## Mean :162.0 Mean : 4.625 AMSZ04 : 1 DOWNTOWN CORE: 12
## 3rd Qu.:242.5 3rd Qu.: 6.500 AMSZ05 : 1 TOA PAYOH : 12
## Max. :323.0 Max. :17.000 AMSZ06 : 1 HOUGANG : 10
## (Other):317 (Other) :245
## PLN_AREA_C REGION_N REGION_C INC_CRC
## BM : 17 CENTRAL REGION :134 CR :134 00F5E30B5C9B7AD8: 1
## QT : 15 EAST REGION : 30 ER : 30 013B509B8EDF15BE: 1
## AM : 12 NORTH-EAST REGION: 48 NER: 48 01A4287FB060A0A6: 1
## DT : 12 NORTH REGION : 41 NR : 41 029BD940F4455194: 1
## TP : 12 WEST REGION : 70 WR : 70 0524461C92F35D94: 1
## HG : 10 05FD555397CBEE7A: 1
## (Other):245 (Other) :317
## FMEL_UPD_D X_ADDR Y_ADDR SHAPE_Leng
## Min. :2014-12-05 Min. : 5093 Min. :19579 Min. : 871.5
## 1st Qu.:2014-12-05 1st Qu.:21864 1st Qu.:31776 1st Qu.: 3709.6
## Median :2014-12-05 Median :28465 Median :35113 Median : 5211.9
## Mean :2014-12-05 Mean :27257 Mean :36106 Mean : 6524.4
## 3rd Qu.:2014-12-05 3rd Qu.:31674 3rd Qu.:39869 3rd Qu.: 6942.6
## Max. :2014-12-05 Max. :50425 Max. :49553 Max. :68083.9
##
## SHAPE_Area geometry density
## Min. : 39438 MULTIPOLYGON :323 Min. :0.00000
## 1st Qu.: 628261 epsg:3414 : 0 1st Qu.:0.00000
## Median : 1229894 +proj=tmer...: 0 Median :0.00591
## Mean : 2420882 Mean :0.01074
## 3rd Qu.: 2106483 3rd Qu.:0.01993
## Max. :69748299 Max. :0.04606
##
resident_mpsz_all <- resident_mpsz_all %>%
select(`subzone`, `YOUNG`, `ECONOMY ACTIVE`, `AGED`, `HDB1-2RM dwellers`, `HDB3-4RM dwellers`, `HDB5RM-Ec dwellers`, `Condominium and apartment dwellers`, `Landed property dwellers`, `TOTAL`, `SHAPE_Area`, `density`, `geometry`)
mpsz_business <- st_join(mpsz_sg, business_sg, join = st_contains)
Count the data points by subzone:
mpsz_business_grouped <- mpsz_business %>%
group_by(SUBZONE_N)%>%
summarise(Business = sum(!is.na(POI_ID)))
Get rid of geometry data:
mpsz_business_grouped <- mpsz_business_grouped %>% st_set_geometry(NULL)
Join the business counted data with the previous population and dwelling dataframe:
mpsz_all <- full_join(resident_mpsz_all, mpsz_business_grouped, by = c("subzone" = "SUBZONE_N"))
## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector
Similarly, I proceed with all other urban function types: 2. Industry:
mpsz_industry <- st_join(mpsz_sg, industry_sg, join = st_contains)
mpsz_industry_grouped <- mpsz_industry %>%
group_by(SUBZONE_N)%>%
summarise(Industry = sum(!is.na(POI_ID)))
mpsz_industry_grouped <- mpsz_industry_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_industry_grouped, by = c("subzone" = "SUBZONE_N"))
## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector
mpsz_private <- st_join(mpsz_sg, private_sg, join = st_contains)
mpsz_private_grouped <- mpsz_private %>%
group_by(SUBZONE_N)%>%
summarise(Private = sum(!is.na(POI_ID)))
mpsz_private_grouped <- mpsz_private_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_private_grouped, by = c("subzone" = "SUBZONE_N"))
## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector
mpsz_shopping <- st_join(mpsz_sg, shopping_sg, join = st_contains)
mpsz_shopping_grouped <- mpsz_shopping %>%
group_by(SUBZONE_N)%>%
summarise(Shopping = sum(!is.na(POI_ID)))
mpsz_shopping_grouped <- mpsz_shopping_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_shopping_grouped, by = c("subzone" = "SUBZONE_N"))
## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector
mpsz_financial <- st_join(mpsz_sg, financial_sg, join = st_contains)
mpsz_financial_grouped <- mpsz_financial %>%
group_by(SUBZONE_N)%>%
summarise(Financial = sum(!is.na(POI_ID)))
mpsz_financial_grouped <- mpsz_financial_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_financial_grouped, by = c("subzone" = "SUBZONE_N"))
## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector
6: Gov/Embassy:
mpsz_govt_embassy <- st_join(mpsz_sg, govt_embassy_sg, join = st_contains)
mpsz_govt_embassy_grouped <- mpsz_govt_embassy %>%
group_by(SUBZONE_N)%>%
summarise(Govt_Embassy = sum(!is.na(POI_ID)))
mpsz_govt_embassy_grouped <- mpsz_govt_embassy_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_govt_embassy_grouped, by = c("subzone" = "SUBZONE_N"))
## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector
Get rid of potential NA values:
mpsz_all[is.na(mpsz_all)] <- 0
Examine the result table:
summary(mpsz_all)
## subzone YOUNG ECONOMY ACTIVE AGED
## Length:323 Min. : 0 Min. : 0 Min. : 0
## Class :character 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Mode :character Median : 1170 Median : 2840 Median : 730
## Mean : 3296 Mean : 7386 Mean : 1806
## 3rd Qu.: 4365 3rd Qu.:10285 3rd Qu.: 3000
## Max. :34260 Max. :79780 Max. :18860
## HDB1-2RM dwellers HDB3-4RM dwellers HDB5RM-Ec dwellers
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 0 Median : 0
## Mean : 542 Mean : 5953 Mean : 3297
## 3rd Qu.: 605 3rd Qu.: 9705 3rd Qu.: 3660
## Max. :4700 Max. :75000 Max. :47960
## Condominium and apartment dwellers Landed property dwellers TOTAL
## Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## Median : 230 Median : 0.0 Median : 4890
## Mean : 1827 Mean : 773.9 Mean : 12487
## 3rd Qu.: 2835 3rd Qu.: 400.0 3rd Qu.: 17065
## Max. :16770 Max. :18820.0 Max. :132900
## SHAPE_Area density geometry Business
## Min. : 39438 Min. :0.00000 MULTIPOLYGON :323 Min. : 0.00
## 1st Qu.: 628261 1st Qu.:0.00000 epsg:3414 : 0 1st Qu.: 0.00
## Median : 1229894 Median :0.00591 +proj=tmer...: 0 Median : 2.00
## Mean : 2420882 Mean :0.01074 Mean : 19.94
## 3rd Qu.: 2106483 3rd Qu.:0.01993 3rd Qu.: 14.00
## Max. :69748299 Max. :0.04606 Max. :308.00
## Industry Private Shopping Financial
## Min. :0.0000 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 1.00
## Median :0.0000 Median : 4.00 Median : 0.000 Median : 5.00
## Mean :0.3406 Mean : 11.16 Mean : 1.582 Mean : 10.28
## 3rd Qu.:0.0000 3rd Qu.: 11.00 3rd Qu.: 1.000 3rd Qu.: 13.00
## Max. :8.0000 Max. :217.00 Max. :31.000 Max. :134.00
## Govt_Embassy
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 1.372
## 3rd Qu.: 1.000
## Max. :19.000
Covert data table into sf object for mapping purpose:
mpsz_all_sf <- st_as_sf(mpsz_all, sf_column_name = "geometry")
Deriving choropleth maps using qtm() function:
total_qtm <- qtm(mpsz_all_sf, fill = "TOTAL")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
business_qtm <- qtm(mpsz_all_sf, fill = "Business")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
industry_qtm <- qtm(mpsz_all_sf, fill = "Industry")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
shopping_qtm <- qtm(mpsz_all_sf, fill = "Shopping")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
govt_embassy_qtm <- qtm(mpsz_all_sf, fill = "Govt_Embassy")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
private_qtm <- qtm(mpsz_all_sf, fill = "Private")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
financial_qtm <- qtm(mpsz_all_sf, fill = "Financial")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
young_qtm <- qtm(mpsz_all_sf, fill = "YOUNG")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
aged_qtm <- qtm(mpsz_all_sf, fill = "AGED")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
active_qtm <- qtm(mpsz_all_sf, fill = "ECONOMY ACTIVE")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
hdb12_qtm <- qtm(mpsz_all_sf, fill = "HDB1-2RM dwellers")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
hdb34_qtm <- qtm(mpsz_all_sf, fill = "HDB3-4RM dwellers")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
hdb5_qtm <- qtm(mpsz_all_sf, fill = "HDB5RM-Ec dwellers")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
condo_qtm <- qtm(mpsz_all_sf, fill = "Condominium and apartment dwellers")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
landed_qtm <- qtm(mpsz_all_sf, fill = "Landed property dwellers")
## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid
Plotting the choropleth maps: By functions:
tmap_arrange(business_qtm, industry_qtm, shopping_qtm, govt_embassy_qtm, private_qtm, financial_qtm)
By age:
tmap_arrange(total_qtm, young_qtm, aged_qtm, active_qtm)
By dwelling:
tmap_arrange(hdb12_qtm, hdb34_qtm, hdb5_qtm, condo_qtm, landed_qtm)
hist_b <- ggplot(data=mpsz_all, aes(x=`Business`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_i <- ggplot(data=mpsz_all, aes(x=`Industry`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_p <- ggplot(data=mpsz_all, aes(x=`Private`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_s <- ggplot(data=mpsz_all, aes(x=`Shopping`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_g <- ggplot(data=mpsz_all, aes(x=`Govt_Embassy`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_f <- ggplot(data=mpsz_all, aes(x=`Financial`)) +
geom_histogram(bins=20, color="black", fill="light blue")
Plotting:
ggarrange(hist_b, hist_i, hist_p, hist_s, hist_g, hist_f,
ncol = 3,
nrow = 3)
hist_young <- ggplot(data=mpsz_all, aes(x=`YOUNG`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_active <- ggplot(data=mpsz_all, aes(x=`ECONOMY ACTIVE`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_aged <- ggplot(data=mpsz_all, aes(x=`AGED`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_total <- ggplot(data=mpsz_all, aes(x=`TOTAL`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_hdb12 <- ggplot(data=mpsz_all, aes(x=`HDB1-2RM dwellers`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_hdb34 <- ggplot(data=mpsz_all, aes(x=`HDB3-4RM dwellers`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_hdb5 <- ggplot(data=mpsz_all, aes(x=`HDB5RM-Ec dwellers`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_condo <- ggplot(data=mpsz_all, aes(x=`Condominium and apartment dwellers`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_landed <- ggplot(data=mpsz_all, aes(x=`Landed property dwellers`)) +
geom_histogram(bins=20, color="black", fill="light blue")
ggarrange(hist_total, hist_young, hist_active, hist_aged, hist_hdb12, hist_hdb34, hist_hdb5, hist_condo, hist_landed,
ncol = 3,
nrow = 3)
However, the population distribution is affected by the total population in an area, therefore I derive the weighted population index based on total population
mpsz_derived <- mpsz_all %>%
mutate(`young_pr` = `YOUNG`/`TOTAL`*1000) %>%
mutate(`economy_active_pr` = `ECONOMY ACTIVE`/`TOTAL`*1000) %>%
mutate(`aged_pr` = `AGED`/`TOTAL`*1000) %>%
mutate(`hdb12_pr` = `HDB1-2RM dwellers`/`TOTAL`*1000) %>%
mutate(`hdb34_pr` = `HDB3-4RM dwellers`/`TOTAL`*1000) %>%
mutate(`hdb5_pr` = `HDB5RM-Ec dwellers`/`TOTAL`*1000) %>%
mutate(`condo_pr` = `Condominium and apartment dwellers`/`TOTAL`*1000) %>%
mutate(`landed_pr` = `Landed property dwellers`/`TOTAL`*1000) %>%
select(`subzone`, `young_pr`, `economy_active_pr`, `aged_pr`, `hdb12_pr`, `hdb34_pr`, `hdb5_pr`, `condo_pr`, `landed_pr`, `Business`, `Industry`, `Govt_Embassy`, `Shopping`, `Financial`, `Private`, `TOTAL`,`density`)
Replace NA values with 0:
mpsz_derived[is.na(mpsz_derived)] <- 0
hist_young <- ggplot(data=mpsz_derived,
aes(x=`young_pr`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_active <- ggplot(data=mpsz_derived, aes(x=`economy_active_pr`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_aged <- ggplot(data=mpsz_derived, aes(x=`aged_pr`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_total <- ggplot(data=mpsz_derived, aes(x=`TOTAL`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_hdb12 <- ggplot(data=mpsz_derived, aes(x=`hdb12_pr`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_hdb34 <- ggplot(data=mpsz_derived, aes(x=`hdb34_pr`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_hdb5 <- ggplot(data=mpsz_derived, aes(x=`hdb5_pr`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_condo <- ggplot(data=mpsz_derived, aes(x=`condo_pr`)) +
geom_histogram(bins=20, color="black", fill="light blue")
hist_landed <- ggplot(data=mpsz_derived, aes(x=`landed_pr`)) +
geom_histogram(bins=20, color="black", fill="light blue")
ggarrange(hist_total, hist_young, hist_active, hist_aged, hist_hdb12, hist_hdb34, hist_hdb5, hist_condo, hist_landed,
ncol = 3,
nrow = 3)
Convert to data.frame for analysis:
mpsz_derived_df <- as.data.frame(mpsz_derived)
cluster_vars.cor = cor(mpsz_derived_df[,2:17])
corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black")
Observation: Young and economy active have a very high correlation. Financial and shopping have a high correlation. HDB with 3-4 rooms and HDB with 5+ rooms have a high correlation with population density, in fact all dwelling types have a positive correlation with population and population density. Total population also have a high correlation with population density. As young and economy are highly correlated, I will proceed to drop young in my following analysis.
cluster_vars <- mpsz_derived_df %>%
select("subzone", "economy_active_pr", "aged_pr", "hdb12_pr", "hdb34_pr", "hdb5_pr", "condo_pr", "landed_pr", "density", "Business", "Industry", "Shopping", "Govt_Embassy", "Private", "Financial")
head(cluster_vars,5)
## subzone economy_active_pr aged_pr hdb12_pr hdb34_pr hdb5_pr
## 1 ADMIRALTY 593.9050 96.38554 80.79376 478.3841 440.8221
## 2 AIRPORT ROAD 0.0000 0.00000 0.00000 0.0000 0.0000
## 3 ALEXANDRA HILL 548.6212 240.20319 283.74456 458.6357 226.4151
## 4 ALEXANDRA NORTH 669.8113 56.60377 0.00000 0.0000 0.0000
## 5 ALJUNIED 605.3745 186.86240 52.74944 448.3702 159.2436
## condo_pr landed_pr density Business Industry Shopping Govt_Embassy
## 1 0.00 0.00000 0.011183778 0 0 1 0
## 2 0.00 0.00000 0.000000000 0 0 0 0
## 3 0.00 0.00000 0.013373723 39 1 1 7
## 4 1000.00 0.00000 0.007218093 3 0 0 0
## 5 296.84 11.44563 0.013580604 62 1 2 2
## Private Financial
## 1 5 2
## 2 0 0
## 3 4 15
## 4 3 0
## 5 140 42
Change the rows by subzone name instead of row number by using the code chunk below:
row.names(cluster_vars) <- cluster_vars$"subzone"
head(cluster_vars,5)
## subzone economy_active_pr aged_pr hdb12_pr hdb34_pr
## ADMIRALTY ADMIRALTY 593.9050 96.38554 80.79376 478.3841
## AIRPORT ROAD AIRPORT ROAD 0.0000 0.00000 0.00000 0.0000
## ALEXANDRA HILL ALEXANDRA HILL 548.6212 240.20319 283.74456 458.6357
## ALEXANDRA NORTH ALEXANDRA NORTH 669.8113 56.60377 0.00000 0.0000
## ALJUNIED ALJUNIED 605.3745 186.86240 52.74944 448.3702
## hdb5_pr condo_pr landed_pr density Business Industry
## ADMIRALTY 440.8221 0.00 0.00000 0.011183778 0 0
## AIRPORT ROAD 0.0000 0.00 0.00000 0.000000000 0 0
## ALEXANDRA HILL 226.4151 0.00 0.00000 0.013373723 39 1
## ALEXANDRA NORTH 0.0000 1000.00 0.00000 0.007218093 3 0
## ALJUNIED 159.2436 296.84 11.44563 0.013580604 62 1
## Shopping Govt_Embassy Private Financial
## ADMIRALTY 1 0 5 2
## AIRPORT ROAD 0 0 0 0
## ALEXANDRA HILL 1 7 4 15
## ALEXANDRA NORTH 0 0 3 0
## ALJUNIED 2 2 140 42
Now, we will delete the redundant column by using the code chunk below.
subzone_cluster <- select(cluster_vars, c(2:15))
head(subzone_cluster, 5)
## economy_active_pr aged_pr hdb12_pr hdb34_pr hdb5_pr
## ADMIRALTY 593.9050 96.38554 80.79376 478.3841 440.8221
## AIRPORT ROAD 0.0000 0.00000 0.00000 0.0000 0.0000
## ALEXANDRA HILL 548.6212 240.20319 283.74456 458.6357 226.4151
## ALEXANDRA NORTH 669.8113 56.60377 0.00000 0.0000 0.0000
## ALJUNIED 605.3745 186.86240 52.74944 448.3702 159.2436
## condo_pr landed_pr density Business Industry Shopping
## ADMIRALTY 0.00 0.00000 0.011183778 0 0 1
## AIRPORT ROAD 0.00 0.00000 0.000000000 0 0 0
## ALEXANDRA HILL 0.00 0.00000 0.013373723 39 1 1
## ALEXANDRA NORTH 1000.00 0.00000 0.007218093 3 0 0
## ALJUNIED 296.84 11.44563 0.013580604 62 1 2
## Govt_Embassy Private Financial
## ADMIRALTY 0 5 2
## AIRPORT ROAD 0 0 0
## ALEXANDRA HILL 7 4 15
## ALEXANDRA NORTH 0 3 0
## ALJUNIED 2 140 42
subzone_cluster.std <- normalize(subzone_cluster)
summary(subzone_cluster.std)
## economy_active_pr aged_pr hdb12_pr hdb34_pr
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.6392 Median :0.1308 Median :0.00000 Median :0.0000
## Mean :0.4798 Mean :0.1196 Mean :0.04134 Mean :0.2620
## 3rd Qu.:0.6657 3rd Qu.:0.1923 3rd Qu.:0.04233 3rd Qu.:0.5316
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## hdb5_pr condo_pr landed_pr density
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.04583 Median :0.00000 Median :0.1283
## Mean :0.1377 Mean :0.21118 Mean :0.09062 Mean :0.2331
## 3rd Qu.:0.2477 3rd Qu.:0.29509 3rd Qu.:0.03887 3rd Qu.:0.4327
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## Business Industry Shopping Govt_Embassy
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.006494 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.064734 Mean :0.04257 Mean :0.05103 Mean :0.07219
## 3rd Qu.:0.045455 3rd Qu.:0.00000 3rd Qu.:0.03226 3rd Qu.:0.05263
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## Private Financial
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.007463
## Median :0.01843 Median :0.037313
## Mean :0.05142 Mean :0.076706
## 3rd Qu.:0.05069 3rd Qu.:0.097015
## Max. :1.00000 Max. :1.000000
active <- ggplot(data=subzone_cluster.std,
aes(x= `economy_active_pr`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
aged <- ggplot(data=subzone_cluster.std,
aes(x= `aged_pr`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hdb12 <- ggplot(data=subzone_cluster.std,
aes(x= `hdb12_pr`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hdb34 <- ggplot(data=subzone_cluster.std,
aes(x= `hdb34_pr`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hdb5 <- ggplot(data=subzone_cluster.std,
aes(x= `hdb5_pr`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
condo <- ggplot(data=subzone_cluster.std,
aes(x= `condo_pr`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
landed <- ggplot(data=subzone_cluster.std,
aes(x= `landed_pr`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
density <- ggplot(data=subzone_cluster.std,
aes(x= `density`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
ggarrange(active, aged, hdb12, hdb34, hdb5, condo, landed, density,
ncol = 3,
nrow = 3)
From the histograms, I can see that apart from the 0 values, most distribution are somewhat normally distributed, with some exceptions though like condo distribution.
proxmat <- dist(subzone_cluster.std, method = 'euclidean')
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(subzone_cluster.std, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.8804043 0.8269970 0.9145901 0.9815686
We can see that “ward” has the strongest clustering structure among the four methods assessed. Hence, in the subsequent analysis, only Ward’s method will be used.
hclust_ward <- hclust(proxmat, method = 'ward.D')
Plotting the clustering plot:
plot(hclust_ward, cex = 0.4)
plot(hclust_ward, cex = 0.4)
rect.hclust(hclust_ward, k =5, border = 2:5)
subzone_cluster_matrix <- data.matrix(subzone_cluster.std)
Below I have used normalize method to get a clearer view of the cluster patterns:
heatmaply(normalize(subzone_cluster_matrix),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 5,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="Geographic Segmentation of Singapore Subzones by Various Indicators",
xlab = "Indicators",
ylab = "Subzones"
)
Creating the sf object for use:
mpsz_derived_sf <- left_join(mpsz_all_sf, mpsz_derived)
## Joining, by = c("subzone", "TOTAL", "density", "Business", "Industry", "Private", "Shopping", "Financial", "Govt_Embassy")
Set number of clusters to 5:
groups <- as.factor(cutree(hclust_ward, k=5))
Deriving the cluster for mapping:
mpsz_cluster <- cbind(mpsz_derived_sf, as.matrix(groups)) %>%
rename(`CLUSTER`=`as.matrix.groups.`)
Mapping clusters using qtm:
qtm(mpsz_cluster, "CLUSTER")
## Warning: The shape mpsz_cluster is invalid. See sf::st_is_valid
Observation: Cluster 2 are mainly subzones with minimal residents. Cluster 1 and 3 are subzones with large number of hdb residents. Cluster 4 are more populated with condo and landed property residents. The quality of the clustering is actually not very fragmented and quite satisfying.
derived_sp <- as_Spatial(mpsz_derived_sf)
derived.nb <- poly2nb(derived_sp)
summary(derived.nb)
## Neighbour list object:
## Number of regions: 323
## Number of nonzero links: 1934
## Percentage nonzero weights: 1.853751
## Average number of links: 5.987616
## 5 regions with no links:
## 175 208 227 255 258
## Link number distribution:
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 14 17
## 5 2 6 10 26 77 87 51 34 16 3 3 1 1 1
## 2 least connected regions:
## 42 109 with 1 link
## 1 most connected region:
## 40 with 17 links
sub_nb <- subset(derived.nb, subset=card(derived.nb) > 0)
plot(derived_sp, border=grey(.5))
plot(derived.nb, coordinates(derived_sp), col="blue", add=TRUE)
Compute the cost of each edge. It is the distance between nodes.
lcosts <- nbcosts(sub_nb, subzone_cluster.std)
derived.w <- nb2listw(sub_nb, lcosts, style="B")
summary(derived.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 318
## Number of nonzero links: 1934
## Percentage nonzero weights: 1.912503
## Average number of links: 6.081761
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 14 17
## 2 6 10 26 77 87 51 34 16 3 3 1 1 1
## 2 least connected regions:
## 42 109 with 1 link
## 1 most connected region:
## 40 with 17 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 318 101124 1736.912 3768.324 43913.02
derived_mst <- mstree(derived.w)
plot(derived_sp, border=gray(.5))
plot.mst(derived_mst, coordinates(derived_sp),
col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)
clust5 <- skater(derived_mst[,1:2], subzone_cluster.std, 4)
plot(derived_sp, border=gray(.5))
plot(clust5, coordinates(derived_sp), cex.lab=.7,
groups.colors=c("red","green","blue", "brown", "pink"), cex.circles=0.005, add=TRUE)
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
groups_mat <- as.matrix(clust5$groups)
mpsz_spatialcluster <- cbind(mpsz_cluster[1:318,], as.factor(groups_mat)) %>%
rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(mpsz_spatialcluster, "SP_CLUSTER")
## Warning: The shape mpsz_spatialcluster is invalid. See sf::st_is_valid
Observation: Compared to non spatial constrained clusterings, I do not think the SKATER approach is not optimal in this case. Although clusters 3, 4, 5 are more centralized, too many subzones are under cluster 1, which makes the map not very intuitive.