1 Setting up
2 Data import and wrangling
3 Exploratory Data Analysis(EDA)
4 Hierarchy Cluster Analysis
5 Spatially Constrained Clustering - SKATER approach

1 Setting up

Rpub link: https://rpubs.com/TY2018/618920

1.1 The task

In this take-home exercise, you are tasked to segment Singapore at the planning subzone level into homogeneous socioeconomic areas by combining geodemographic data extracted from Singapore Department of Statistics and urban functions extracted from the geospatial data provided.

From the Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 provide by Singapore Department of Statistics, you are required the extract the following indicators for 2019: * Economy active population (i.e. age 25-64) * Young group (i.e. age 0-24) * Aged group (i.e 65 and above) * Population density * HDB1-2RM dwellers * HDB3-4RM dwellers * HDB5RM-Ec dweller * Condominium and apartment dwellers * Landed property dwellers

From the geospatial data provided, you are required to extract the following urban functions: * Government including embassy * Business * Industry * Shopping * Financial * Upmarket residential area

1.2 Data

For the purpose of this study, five geospatial data sets are provided, they are: Govt_Embassy, Private residential, Shopping, Business and Financial. These data sets are in ESRI shapefile format. You are also required to download the following data from the relevant government portals: * Singapore Residents by Planning AreaSubzone, Age Group, Sex and Type of Dwelling, June 2011-2019, and * URA Master Plan 2014 Planning Subzone boundary

1.3 Installing and loading R packages

packages = c('rgdal', 'spdep', 'ClustGeo',  'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p,character.only = T)
}

## Loading required package: rgdal

## Loading required package: sp

## rgdal: version: 1.4-8, (SVN revision 845)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: C:/Users/TYZ/Documents/R/win-library/3.6/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/TYZ/Documents/R/win-library/3.6/rgdal/proj
##  Linking to sp version: 1.4-1

## Loading required package: spdep

## Loading required package: spData

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

## Loading required package: sf

## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3

## Loading required package: ClustGeo

## Loading required package: tmap

## Loading required package: ggpubr

## Loading required package: ggplot2

## Loading required package: cluster

## Loading required package: heatmaply

## Loading required package: plotly

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

## Loading required package: viridis

## Loading required package: viridisLite

## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus

## 
## ======================
## Welcome to heatmaply version 1.1.0
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================

## Loading required package: corrplot

## corrplot 0.84 loaded

## Loading required package: psych

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## Loading required package: tidyverse

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x psych::%+%()    masks ggplot2::%+%()
## x psych::alpha()  masks ggplot2::alpha()
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag()    masks stats::lag()

2 Data import and wrangling

2.1 Importing and seperating business geospatial data

Industry and business is all within “Business” layer

business_layer <- st_read(dsn = "data/geospatial", layer = "Business")

## Reading layer `Business' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## CRS:            4326

Examine the geospatial data:

head(business_layer, 10)

## Simple feature collection with 10 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.7354 ymin: 1.2972 xmax: 103.8858 ymax: 1.34032
## CRS:            4326
##        POI_ID SEQ_NUM FAC_TYPE                        POI_NAME
## 1  1101180209       1     5000                       JOHN CHEN
## 2  1101180210       1     5000    TROPICAL INDUSTRIAL BUILDING
## 3  1101180211       1     5000 LIAN CHEONG INDUSTRIAL BUILDING
## 4  1101180212       1     5000  MALAYSIA GARMENT MANUFACTURERS
## 5  1101180213       1     5000                         UNIGOLD
## 6  1192316144       1     5000             NUS UNIVERSITY HALL
## 7  1144317654       1     5000           SUITES AT BUKIT TIMAH
## 8  1103507488       1     5000                      TIONG HUAT
## 9  1001052867       1     5000  LEE CHOON GUAN TIMBER MERCHANT
## 10 1001052868       1     5000           WEIGHT BRIDGE SERVICE
##                ST_NAME                 geometry
## 1            LITTLE RD POINT (103.8856 1.33841)
## 2            LITTLE RD POINT (103.8852 1.33832)
## 3            LITTLE RD POINT (103.8852 1.33834)
## 4                 <NA> POINT (103.8855 1.33821)
## 5            LITTLE RD POINT (103.8858 1.33844)
## 6  LOWER KENT RIDGE RD  POINT (103.7777 1.2972)
## 7  JALAN JURONG KECHIL POINT (103.7738 1.34032)
## 8   KALLANG PUDDING RD  POINT (103.879 1.32773)
## 9           PENJURU RD  POINT (103.7354 1.3184)
## 10          PENJURU RD POINT (103.7361 1.31927)

Note that for business, FAC_TYPE = 5000, for industry, FAC_TYPE = 9991.

Seperating industry and business:

industry <- business_layer %>%
  filter(FAC_TYPE == 9991)
business <- business_layer %>%
  filter(FAC_TYPE == 5000)

2.2 Import other geospatial layers

Financial:

financial <- st_read(dsn = "data/geospatial", layer = "Financial")

## Reading layer `Financial' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3320 features and 29 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6256 ymin: 1.24392 xmax: 103.9998 ymax: 1.46247
## CRS:            4326

Gove/Embassy:

govt_embassy <- st_read(dsn = "data/geospatial", layer = "Govt_Embassy")

## Reading layer `Govt_Embassy' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 443 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6282 ymin: 1.24911 xmax: 103.9884 ymax: 1.45765
## CRS:            4326

Private residential:

private_residential <- st_read(dsn = "data/geospatial", layer = "Private residential")

## Reading layer `Private residential' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3604 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6295 ymin: 1.23943 xmax: 103.9749 ymax: 1.45379
## CRS:            4326

Shopping:

shopping <- st_read(dsn = "data/geospatial", layer = "Shopping")

## Reading layer `Shopping' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 511 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.679 ymin: 1.24779 xmax: 103.9644 ymax: 1.4535
## CRS:            4326

Subzone data:

mpsz <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")

## Reading layer `MP14_SUBZONE_WEB_PL' from data source `D:\SMU\IS415\Take-home ex3\Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## proj4string:    +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs

resident <- read_csv ("data/aspatial/popdata.csv")

## Parsed with column specification:
## cols(
##   planning_area = col_character(),
##   subzone = col_character(),
##   age_group = col_character(),
##   sex = col_character(),
##   type_of_dwelling = col_character(),
##   resident_count = col_double(),
##   year = col_double()
## )

2.3 Checking CRS

st_crs(industry)

## Coordinate Reference System:
##   User input: 4326 
##   wkt:
## GEOGCS["GCS_WGS_1984",
##     DATUM["WGS_1984",
##         SPHEROID["WGS_84",6378137.0,298.257223563]],
##     PRIMEM["Greenwich",0.0],
##     UNIT["Degree",0.0174532925199433],
##     AUTHORITY["EPSG","4326"]]

st_crs(business)

## Coordinate Reference System:
##   User input: 4326 
##   wkt:
## GEOGCS["GCS_WGS_1984",
##     DATUM["WGS_1984",
##         SPHEROID["WGS_84",6378137.0,298.257223563]],
##     PRIMEM["Greenwich",0.0],
##     UNIT["Degree",0.0174532925199433],
##     AUTHORITY["EPSG","4326"]]

st_crs(financial)

## Coordinate Reference System:
##   User input: 4326 
##   wkt:
## GEOGCS["GCS_WGS_1984",
##     DATUM["WGS_1984",
##         SPHEROID["WGS_84",6378137.0,298.257223563]],
##     PRIMEM["Greenwich",0.0],
##     UNIT["Degree",0.0174532925199433],
##     AUTHORITY["EPSG","4326"]]

st_crs(govt_embassy)

## Coordinate Reference System:
##   User input: 4326 
##   wkt:
## GEOGCS["GCS_WGS_1984",
##     DATUM["WGS_1984",
##         SPHEROID["WGS_84",6378137.0,298.257223563]],
##     PRIMEM["Greenwich",0.0],
##     UNIT["Degree",0.0174532925199433],
##     AUTHORITY["EPSG","4326"]]

st_crs(private_residential)

## Coordinate Reference System:
##   User input: 4326 
##   wkt:
## GEOGCS["GCS_WGS_1984",
##     DATUM["WGS_1984",
##         SPHEROID["WGS_84",6378137.0,298.257223563]],
##     PRIMEM["Greenwich",0.0],
##     UNIT["Degree",0.0174532925199433],
##     AUTHORITY["EPSG","4326"]]

st_crs(shopping)

## Coordinate Reference System:
##   User input: 4326 
##   wkt:
## GEOGCS["GCS_WGS_1984",
##     DATUM["WGS_1984",
##         SPHEROID["WGS_84",6378137.0,298.257223563]],
##     PRIMEM["Greenwich",0.0],
##     UNIT["Degree",0.0174532925199433],
##     AUTHORITY["EPSG","4326"]]

st_crs(mpsz)

## Coordinate Reference System:
##   No user input
##   wkt:
## PROJCS["SVY21",
##     GEOGCS["SVY21[WGS84]",
##         DATUM["WGS_1984",
##             SPHEROID["WGS_84",6378137.0,298.257223563]],
##         PRIMEM["Greenwich",0.0],
##         UNIT["Degree",0.0174532925199433],
##         AUTHORITY["EPSG","4326"]],
##     PROJECTION["Transverse_Mercator"],
##     PARAMETER["False_Easting",28001.642],
##     PARAMETER["False_Northing",38744.572],
##     PARAMETER["Central_Meridian",103.8333333333333],
##     PARAMETER["Scale_Factor",1.0],
##     PARAMETER["Latitude_Of_Origin",1.366666666666667],
##     UNIT["Meter",1.0]]

2.4 Setting crs to EPSG:3414

mpsz_sg <- st_set_crs(mpsz, 'EPSG:3414')

## Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
## that

st_crs(mpsz_sg)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCS["SVY21 / Singapore TM",
##     GEOGCS["SVY21",
##         DATUM["SVY21",
##             SPHEROID["WGS 84",6378137,298.257223563,
##                 AUTHORITY["EPSG","7030"]],
##             AUTHORITY["EPSG","6757"]],
##         PRIMEM["Greenwich",0,
##             AUTHORITY["EPSG","8901"]],
##         UNIT["degree",0.0174532925199433,
##             AUTHORITY["EPSG","9122"]],
##         AUTHORITY["EPSG","4757"]],
##     PROJECTION["Transverse_Mercator"],
##     PARAMETER["latitude_of_origin",1.366666666666667],
##     PARAMETER["central_meridian",103.8333333333333],
##     PARAMETER["scale_factor",1],
##     PARAMETER["false_easting",28001.642],
##     PARAMETER["false_northing",38744.572],
##     UNIT["metre",1,
##         AUTHORITY["EPSG","9001"]],
##     AUTHORITY["EPSG","3414"]]

business_sg <- st_transform(business,3414)
industry_sg <- st_transform(industry,3414)
financial_sg <- st_transform(financial,3414)
govt_embassy_sg <- st_transform(govt_embassy,3414)
private_sg <- st_transform(private_residential,3414)
shopping_sg <- st_transform(shopping,3414)

st_crs(business_sg)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCS["SVY21 / Singapore TM",
##     GEOGCS["SVY21",
##         DATUM["SVY21",
##             SPHEROID["WGS 84",6378137,298.257223563,
##                 AUTHORITY["EPSG","7030"]],
##             AUTHORITY["EPSG","6757"]],
##         PRIMEM["Greenwich",0,
##             AUTHORITY["EPSG","8901"]],
##         UNIT["degree",0.0174532925199433,
##             AUTHORITY["EPSG","9122"]],
##         AUTHORITY["EPSG","4757"]],
##     PROJECTION["Transverse_Mercator"],
##     PARAMETER["latitude_of_origin",1.366666666666667],
##     PARAMETER["central_meridian",103.8333333333333],
##     PARAMETER["scale_factor",1],
##     PARAMETER["false_easting",28001.642],
##     PARAMETER["false_northing",38744.572],
##     UNIT["metre",1,
##         AUTHORITY["EPSG","9001"]],
##     AUTHORITY["EPSG","3414"]]

2.5 Extract indicators

Examining the data:

summary(resident)

##  planning_area        subzone           age_group             sex           
##  Length:883728      Length:883728      Length:883728      Length:883728     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  type_of_dwelling   resident_count         year     
##  Length:883728      Min.   :   0.00   Min.   :2011  
##  Class :character   1st Qu.:   0.00   1st Qu.:2013  
##  Mode  :character   Median :   0.00   Median :2015  
##                     Mean   :  39.83   Mean   :2015  
##                     3rd Qu.:  10.00   3rd Qu.:2017  
##                     Max.   :2860.00   Max.   :2019

There are no missing values, therefore proceed.

2.5.1 Filter year = 2019

resident_2019 <- resident %>%
  filter(year == 2019)

2.5.2 Extract different age groups into new columns

resident_age <- resident_2019 %>%
  spread(age_group, resident_count) %>%
  mutate(YOUNG = `0_to_4`+`5_to_9`+`10_to_14`+
  `15_to_19`+`20_to_24`) %>%
  mutate(`ECONOMY ACTIVE` = rowSums(.[10:14]) + rowSums(.[16:18])) %>%
  mutate(`AGED`=rowSums(.[19:24])) %>%
  mutate_at(.vars = vars(subzone), .funs = funs(toupper)) %>%
  select(`subzone`, `YOUNG`, `ECONOMY ACTIVE`, `AGED`) %>%
  group_by(`subzone`) %>%
  summarise_at(c("YOUNG", "ECONOMY ACTIVE", "AGED"), sum)

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

2.5.3 Extracting different type of dwellings

resident_tod <- resident_2019 %>%
  spread(type_of_dwelling, resident_count) %>%
  rename(`HDB1-2RM dwellers` = `HDB 1- and 2-Room Flats`) %>%
  rename(`HDB5RM-Ec dwellers` = `HDB 5-Room and Executive Flats`) %>%
  rename(`Condominium and apartment dwellers` = `Condominiums and Other Apartments`) %>%
  rename(`Landed property dwellers` = `Landed Properties`) %>%
  mutate(`HDB3-4RM dwellers` = `HDB 3-Room Flats` + `HDB 4-Room Flats`) %>%
  mutate_at(.vars = vars(subzone), .funs = funs(toupper)) %>%
  select(`subzone`, `HDB1-2RM dwellers`, `HDB3-4RM dwellers`, `HDB5RM-Ec dwellers`, `Condominium and apartment dwellers`, `Landed property dwellers`) %>%
  group_by(`subzone`) %>%
  summarise_at(c("HDB1-2RM dwellers",              "HDB3-4RM dwellers", "HDB5RM-Ec dwellers", "Condominium and apartment dwellers", "Landed property dwellers"), sum)

2.5.4 Getting total population for each subzones

resident_total <- resident_2019 %>%
  mutate_at(.vars = vars(subzone), .funs = funs(toupper)) %>%
  group_by(`subzone`) %>%
  summarise(TOTAL = sum(resident_count))

2.5.5 Calculating population density

Joining resident data with spatial data:

resident_mpsz <- full_join(resident_total, mpsz_sg, by = c("subzone" = "SUBZONE_N"))

## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector

Create population density:

resident_mpsz$density<-resident_mpsz$TOTAL/resident_mpsz$SHAPE_Area

Examining the results:

head(resident_mpsz)

## # A tibble: 6 x 18
##   subzone TOTAL OBJECTID SUBZONE_NO SUBZONE_C CA_IND PLN_AREA_N PLN_AREA_C
##   <chr>   <dbl>    <int>      <int> <fct>     <fct>  <fct>      <fct>     
## 1 ADMIRA~ 14110      296          5 SBSZ05    N      SEMBAWANG  SB        
## 2 AIRPOR~     0      179          4 PLSZ04    N      PAYA LEBAR PL        
## 3 ALEXAN~ 13780        6          7 BMSZ07    N      BUKIT MER~ BM        
## 4 ALEXAN~  2120       12          6 BMSZ06    N      BUKIT MER~ BM        
## 5 ALJUNI~ 40190      163          4 GLSZ04    N      GEYLANG    GL        
## 6 ANAK B~ 22250      211          1 BTSZ01    N      BUKIT TIM~ BT        
## # ... with 10 more variables: REGION_N <fct>, REGION_C <fct>, INC_CRC <fct>,
## #   FMEL_UPD_D <date>, X_ADDR <dbl>, Y_ADDR <dbl>, SHAPE_Leng <dbl>,
## #   SHAPE_Area <dbl>, geometry <MULTIPOLYGON [m]>, density <dbl>

2.5.6 Combining the indicator data tables with subzone spatial data (resident_mpsz, resident_tod, resident_age)

resident_age_tod <- left_join(resident_age, resident_tod)

## Joining, by = "subzone"

resident_mpsz_all <- full_join(resident_age_tod, resident_mpsz, by = c("subzone" = "subzone"))

Examine the result:

summary(resident_mpsz_all)

##    subzone              YOUNG       ECONOMY ACTIVE       AGED      
##  Length:323         Min.   :    0   Min.   :    0   Min.   :    0  
##  Class :character   1st Qu.:    0   1st Qu.:    0   1st Qu.:    0  
##  Mode  :character   Median : 1170   Median : 2840   Median :  730  
##                     Mean   : 3296   Mean   : 7386   Mean   : 1806  
##                     3rd Qu.: 4365   3rd Qu.:10285   3rd Qu.: 3000  
##                     Max.   :34260   Max.   :79780   Max.   :18860  
##                                                                    
##  HDB1-2RM dwellers HDB3-4RM dwellers HDB5RM-Ec dwellers
##  Min.   :   0      Min.   :    0     Min.   :    0     
##  1st Qu.:   0      1st Qu.:    0     1st Qu.:    0     
##  Median :   0      Median :    0     Median :    0     
##  Mean   : 542      Mean   : 5953     Mean   : 3297     
##  3rd Qu.: 605      3rd Qu.: 9705     3rd Qu.: 3660     
##  Max.   :4700      Max.   :75000     Max.   :47960     
##                                                        
##  Condominium and apartment dwellers Landed property dwellers     TOTAL       
##  Min.   :    0                      Min.   :    0.0          Min.   :     0  
##  1st Qu.:    0                      1st Qu.:    0.0          1st Qu.:     0  
##  Median :  230                      Median :    0.0          Median :  4890  
##  Mean   : 1827                      Mean   :  773.9          Mean   : 12487  
##  3rd Qu.: 2835                      3rd Qu.:  400.0          3rd Qu.: 17065  
##  Max.   :16770                      Max.   :18820.0          Max.   :132900  
##                                                                              
##     OBJECTID       SUBZONE_NO       SUBZONE_C   CA_IND          PLN_AREA_N 
##  Min.   :  1.0   Min.   : 1.000   AMSZ01 :  1   N:274   BUKIT MERAH  : 17  
##  1st Qu.: 81.5   1st Qu.: 2.000   AMSZ02 :  1   Y: 49   QUEENSTOWN   : 15  
##  Median :162.0   Median : 4.000   AMSZ03 :  1           ANG MO KIO   : 12  
##  Mean   :162.0   Mean   : 4.625   AMSZ04 :  1           DOWNTOWN CORE: 12  
##  3rd Qu.:242.5   3rd Qu.: 6.500   AMSZ05 :  1           TOA PAYOH    : 12  
##  Max.   :323.0   Max.   :17.000   AMSZ06 :  1           HOUGANG      : 10  
##                                   (Other):317           (Other)      :245  
##    PLN_AREA_C               REGION_N   REGION_C              INC_CRC   
##  BM     : 17   CENTRAL REGION   :134   CR :134   00F5E30B5C9B7AD8:  1  
##  QT     : 15   EAST REGION      : 30   ER : 30   013B509B8EDF15BE:  1  
##  AM     : 12   NORTH-EAST REGION: 48   NER: 48   01A4287FB060A0A6:  1  
##  DT     : 12   NORTH REGION     : 41   NR : 41   029BD940F4455194:  1  
##  TP     : 12   WEST REGION      : 70   WR : 70   0524461C92F35D94:  1  
##  HG     : 10                                     05FD555397CBEE7A:  1  
##  (Other):245                                     (Other)         :317  
##    FMEL_UPD_D             X_ADDR          Y_ADDR        SHAPE_Leng     
##  Min.   :2014-12-05   Min.   : 5093   Min.   :19579   Min.   :  871.5  
##  1st Qu.:2014-12-05   1st Qu.:21864   1st Qu.:31776   1st Qu.: 3709.6  
##  Median :2014-12-05   Median :28465   Median :35113   Median : 5211.9  
##  Mean   :2014-12-05   Mean   :27257   Mean   :36106   Mean   : 6524.4  
##  3rd Qu.:2014-12-05   3rd Qu.:31674   3rd Qu.:39869   3rd Qu.: 6942.6  
##  Max.   :2014-12-05   Max.   :50425   Max.   :49553   Max.   :68083.9  
##                                                                        
##    SHAPE_Area                geometry      density       
##  Min.   :   39438   MULTIPOLYGON :323   Min.   :0.00000  
##  1st Qu.:  628261   epsg:3414    :  0   1st Qu.:0.00000  
##  Median : 1229894   +proj=tmer...:  0   Median :0.00591  
##  Mean   : 2420882                       Mean   :0.01074  
##  3rd Qu.: 2106483                       3rd Qu.:0.01993  
##  Max.   :69748299                       Max.   :0.04606  
##

2.5.7 Get rid of irrelevant columns:

resident_mpsz_all <- resident_mpsz_all %>%
  select(`subzone`, `YOUNG`, `ECONOMY ACTIVE`, `AGED`, `HDB1-2RM dwellers`, `HDB3-4RM dwellers`, `HDB5RM-Ec dwellers`, `Condominium and apartment dwellers`, `Landed property dwellers`, `TOTAL`, `SHAPE_Area`, `density`, `geometry`)

2.6 Joining geospatial data and urban functions

2.6.1 Joining urban functions data with subzones spatial data and count the data by subzone.

Business: Joining the urban function data with subzone data:

mpsz_business <- st_join(mpsz_sg, business_sg, join = st_contains)

Count the data points by subzone:

mpsz_business_grouped <- mpsz_business %>%
  group_by(SUBZONE_N)%>%
  summarise(Business = sum(!is.na(POI_ID)))

Get rid of geometry data:

mpsz_business_grouped <- mpsz_business_grouped %>% st_set_geometry(NULL)

Join the business counted data with the previous population and dwelling dataframe:

mpsz_all <- full_join(resident_mpsz_all, mpsz_business_grouped, by = c("subzone" = "SUBZONE_N"))

## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector

Similarly, I proceed with all other urban function types: 2. Industry:

mpsz_industry <- st_join(mpsz_sg, industry_sg, join = st_contains)
mpsz_industry_grouped <- mpsz_industry %>%
  group_by(SUBZONE_N)%>%
  summarise(Industry = sum(!is.na(POI_ID)))
mpsz_industry_grouped <- mpsz_industry_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_industry_grouped, by = c("subzone" = "SUBZONE_N"))

## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector

Private residential:

mpsz_private <- st_join(mpsz_sg, private_sg, join = st_contains)
mpsz_private_grouped <- mpsz_private %>%
  group_by(SUBZONE_N)%>%
  summarise(Private = sum(!is.na(POI_ID)))
mpsz_private_grouped <- mpsz_private_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_private_grouped, by = c("subzone" = "SUBZONE_N"))

## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector

Shopping:

mpsz_shopping <- st_join(mpsz_sg, shopping_sg, join = st_contains)
mpsz_shopping_grouped <- mpsz_shopping %>%
  group_by(SUBZONE_N)%>%
  summarise(Shopping = sum(!is.na(POI_ID)))
mpsz_shopping_grouped <- mpsz_shopping_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_shopping_grouped, by = c("subzone" = "SUBZONE_N"))

## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector

Financial:

mpsz_financial <- st_join(mpsz_sg, financial_sg, join = st_contains)
mpsz_financial_grouped <- mpsz_financial %>%
  group_by(SUBZONE_N)%>%
  summarise(Financial = sum(!is.na(POI_ID)))
mpsz_financial_grouped <- mpsz_financial_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_financial_grouped, by = c("subzone" = "SUBZONE_N"))

## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector

6: Gov/Embassy:

mpsz_govt_embassy <- st_join(mpsz_sg, govt_embassy_sg, join = st_contains)
mpsz_govt_embassy_grouped <- mpsz_govt_embassy %>%
  group_by(SUBZONE_N)%>%
  summarise(Govt_Embassy = sum(!is.na(POI_ID)))
mpsz_govt_embassy_grouped <- mpsz_govt_embassy_grouped %>% st_set_geometry(NULL)
mpsz_all <- full_join(mpsz_all, mpsz_govt_embassy_grouped, by = c("subzone" = "SUBZONE_N"))

## Warning: Column `subzone`/`SUBZONE_N` joining character vector and factor,
## coercing into character vector

Get rid of potential NA values:

mpsz_all[is.na(mpsz_all)] <- 0

Examine the result table:

summary(mpsz_all)

##    subzone              YOUNG       ECONOMY ACTIVE       AGED      
##  Length:323         Min.   :    0   Min.   :    0   Min.   :    0  
##  Class :character   1st Qu.:    0   1st Qu.:    0   1st Qu.:    0  
##  Mode  :character   Median : 1170   Median : 2840   Median :  730  
##                     Mean   : 3296   Mean   : 7386   Mean   : 1806  
##                     3rd Qu.: 4365   3rd Qu.:10285   3rd Qu.: 3000  
##                     Max.   :34260   Max.   :79780   Max.   :18860  
##  HDB1-2RM dwellers HDB3-4RM dwellers HDB5RM-Ec dwellers
##  Min.   :   0      Min.   :    0     Min.   :    0     
##  1st Qu.:   0      1st Qu.:    0     1st Qu.:    0     
##  Median :   0      Median :    0     Median :    0     
##  Mean   : 542      Mean   : 5953     Mean   : 3297     
##  3rd Qu.: 605      3rd Qu.: 9705     3rd Qu.: 3660     
##  Max.   :4700      Max.   :75000     Max.   :47960     
##  Condominium and apartment dwellers Landed property dwellers     TOTAL       
##  Min.   :    0                      Min.   :    0.0          Min.   :     0  
##  1st Qu.:    0                      1st Qu.:    0.0          1st Qu.:     0  
##  Median :  230                      Median :    0.0          Median :  4890  
##  Mean   : 1827                      Mean   :  773.9          Mean   : 12487  
##  3rd Qu.: 2835                      3rd Qu.:  400.0          3rd Qu.: 17065  
##  Max.   :16770                      Max.   :18820.0          Max.   :132900  
##    SHAPE_Area          density                 geometry      Business     
##  Min.   :   39438   Min.   :0.00000   MULTIPOLYGON :323   Min.   :  0.00  
##  1st Qu.:  628261   1st Qu.:0.00000   epsg:3414    :  0   1st Qu.:  0.00  
##  Median : 1229894   Median :0.00591   +proj=tmer...:  0   Median :  2.00  
##  Mean   : 2420882   Mean   :0.01074                       Mean   : 19.94  
##  3rd Qu.: 2106483   3rd Qu.:0.01993                       3rd Qu.: 14.00  
##  Max.   :69748299   Max.   :0.04606                       Max.   :308.00  
##     Industry         Private          Shopping        Financial     
##  Min.   :0.0000   Min.   :  0.00   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:  1.00  
##  Median :0.0000   Median :  4.00   Median : 0.000   Median :  5.00  
##  Mean   :0.3406   Mean   : 11.16   Mean   : 1.582   Mean   : 10.28  
##  3rd Qu.:0.0000   3rd Qu.: 11.00   3rd Qu.: 1.000   3rd Qu.: 13.00  
##  Max.   :8.0000   Max.   :217.00   Max.   :31.000   Max.   :134.00  
##   Govt_Embassy   
##  Min.   : 0.000  
##  1st Qu.: 0.000  
##  Median : 0.000  
##  Mean   : 1.372  
##  3rd Qu.: 1.000  
##  Max.   :19.000

3 Exploratory Data Analysis(EDA)

3.1 An overview using Choropleth Maps

Covert data table into sf object for mapping purpose:

mpsz_all_sf <- st_as_sf(mpsz_all, sf_column_name = "geometry")

Deriving choropleth maps using qtm() function:

total_qtm <- qtm(mpsz_all_sf, fill = "TOTAL")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

business_qtm <- qtm(mpsz_all_sf, fill = "Business")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

industry_qtm <- qtm(mpsz_all_sf, fill = "Industry")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

shopping_qtm <- qtm(mpsz_all_sf, fill = "Shopping")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

govt_embassy_qtm <- qtm(mpsz_all_sf, fill = "Govt_Embassy")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

private_qtm <- qtm(mpsz_all_sf, fill = "Private")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

financial_qtm <- qtm(mpsz_all_sf, fill = "Financial")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

young_qtm <- qtm(mpsz_all_sf, fill = "YOUNG")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

aged_qtm <- qtm(mpsz_all_sf, fill = "AGED")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

active_qtm <- qtm(mpsz_all_sf, fill = "ECONOMY ACTIVE")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

hdb12_qtm <- qtm(mpsz_all_sf, fill = "HDB1-2RM dwellers")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

hdb34_qtm <- qtm(mpsz_all_sf, fill = "HDB3-4RM dwellers")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

hdb5_qtm <- qtm(mpsz_all_sf, fill = "HDB5RM-Ec dwellers")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

condo_qtm <- qtm(mpsz_all_sf, fill = "Condominium and apartment dwellers")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

landed_qtm <- qtm(mpsz_all_sf, fill = "Landed property dwellers")

## Warning: The shape mpsz_all_sf is invalid. See sf::st_is_valid

Plotting the choropleth maps: By functions:

tmap_arrange(business_qtm, industry_qtm, shopping_qtm, govt_embassy_qtm, private_qtm, financial_qtm)

By age:

tmap_arrange(total_qtm, young_qtm, aged_qtm, active_qtm)

By dwelling:

tmap_arrange(hdb12_qtm, hdb34_qtm, hdb5_qtm, condo_qtm, landed_qtm)

3.2 EDA using statistical graphics

3.2.1 Histograms of urban functions data:

hist_b <- ggplot(data=mpsz_all, aes(x=`Business`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
hist_i <- ggplot(data=mpsz_all, aes(x=`Industry`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
hist_p <- ggplot(data=mpsz_all, aes(x=`Private`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
hist_s <- ggplot(data=mpsz_all, aes(x=`Shopping`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
hist_g <- ggplot(data=mpsz_all, aes(x=`Govt_Embassy`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
hist_f <- ggplot(data=mpsz_all, aes(x=`Financial`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

Plotting:

ggarrange(hist_b, hist_i, hist_p, hist_s, hist_g, hist_f,
          ncol = 3,
          nrow = 3)

3.2.2 Histograms of population data:

hist_young <- ggplot(data=mpsz_all, aes(x=`YOUNG`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_active <- ggplot(data=mpsz_all, aes(x=`ECONOMY ACTIVE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_aged <- ggplot(data=mpsz_all, aes(x=`AGED`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_total <- ggplot(data=mpsz_all, aes(x=`TOTAL`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_hdb12 <- ggplot(data=mpsz_all, aes(x=`HDB1-2RM dwellers`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_hdb34 <- ggplot(data=mpsz_all, aes(x=`HDB3-4RM dwellers`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_hdb5 <- ggplot(data=mpsz_all, aes(x=`HDB5RM-Ec dwellers`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_condo <- ggplot(data=mpsz_all, aes(x=`Condominium and apartment dwellers`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_landed <- ggplot(data=mpsz_all, aes(x=`Landed property dwellers`)) +
  geom_histogram(bins=20, color="black", fill="light blue")


ggarrange(hist_total, hist_young, hist_active, hist_aged, hist_hdb12, hist_hdb34, hist_hdb5, hist_condo, hist_landed,
          ncol = 3,
          nrow = 3)

However, the population distribution is affected by the total population in an area, therefore I derive the weighted population index based on total population

3.2.2.1 Deriving the penetration rate of population(weighted population composition)

mpsz_derived <- mpsz_all %>%
  mutate(`young_pr` = `YOUNG`/`TOTAL`*1000) %>%
  mutate(`economy_active_pr` = `ECONOMY ACTIVE`/`TOTAL`*1000) %>%
  mutate(`aged_pr` = `AGED`/`TOTAL`*1000) %>%
  mutate(`hdb12_pr` = `HDB1-2RM dwellers`/`TOTAL`*1000) %>%
  mutate(`hdb34_pr` = `HDB3-4RM dwellers`/`TOTAL`*1000) %>%
  mutate(`hdb5_pr` = `HDB5RM-Ec dwellers`/`TOTAL`*1000) %>%
  mutate(`condo_pr` = `Condominium and apartment dwellers`/`TOTAL`*1000) %>%
  mutate(`landed_pr` = `Landed property dwellers`/`TOTAL`*1000) %>%
  select(`subzone`, `young_pr`, `economy_active_pr`, `aged_pr`, `hdb12_pr`, `hdb34_pr`, `hdb5_pr`, `condo_pr`, `landed_pr`, `Business`, `Industry`, `Govt_Embassy`, `Shopping`, `Financial`, `Private`, `TOTAL`,`density`)

Replace NA values with 0:

mpsz_derived[is.na(mpsz_derived)] <- 0

3.2.3 Plot the derived variables histogram:

hist_young <- ggplot(data=mpsz_derived, 
aes(x=`young_pr`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_active <- ggplot(data=mpsz_derived, aes(x=`economy_active_pr`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_aged <- ggplot(data=mpsz_derived, aes(x=`aged_pr`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_total <- ggplot(data=mpsz_derived, aes(x=`TOTAL`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_hdb12 <- ggplot(data=mpsz_derived, aes(x=`hdb12_pr`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_hdb34 <- ggplot(data=mpsz_derived, aes(x=`hdb34_pr`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_hdb5 <- ggplot(data=mpsz_derived, aes(x=`hdb5_pr`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_condo <- ggplot(data=mpsz_derived, aes(x=`condo_pr`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

hist_landed <- ggplot(data=mpsz_derived, aes(x=`landed_pr`)) +
  geom_histogram(bins=20, color="black", fill="light blue")


ggarrange(hist_total, hist_young, hist_active, hist_aged, hist_hdb12, hist_hdb34, hist_hdb5, hist_condo, hist_landed,
          ncol = 3,
          nrow = 3)

3.3 Correlation Analysis

Convert to data.frame for analysis:

mpsz_derived_df <- as.data.frame(mpsz_derived)

cluster_vars.cor = cor(mpsz_derived_df[,2:17])
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Observation: Young and economy active have a very high correlation. Financial and shopping have a high correlation. HDB with 3-4 rooms and HDB with 5+ rooms have a high correlation with population density, in fact all dwelling types have a positive correlation with population and population density. Total population also have a high correlation with population density. As young and economy are highly correlated, I will proceed to drop young in my following analysis.

4 Hierarchy Cluster Analysis

4.1 Extrating clustering variables

cluster_vars <- mpsz_derived_df %>%
  select("subzone", "economy_active_pr", "aged_pr", "hdb12_pr", "hdb34_pr", "hdb5_pr", "condo_pr", "landed_pr", "density", "Business", "Industry", "Shopping", "Govt_Embassy", "Private", "Financial")
head(cluster_vars,5)

##           subzone economy_active_pr   aged_pr  hdb12_pr hdb34_pr  hdb5_pr
## 1       ADMIRALTY          593.9050  96.38554  80.79376 478.3841 440.8221
## 2    AIRPORT ROAD            0.0000   0.00000   0.00000   0.0000   0.0000
## 3  ALEXANDRA HILL          548.6212 240.20319 283.74456 458.6357 226.4151
## 4 ALEXANDRA NORTH          669.8113  56.60377   0.00000   0.0000   0.0000
## 5        ALJUNIED          605.3745 186.86240  52.74944 448.3702 159.2436
##   condo_pr landed_pr     density Business Industry Shopping Govt_Embassy
## 1     0.00   0.00000 0.011183778        0        0        1            0
## 2     0.00   0.00000 0.000000000        0        0        0            0
## 3     0.00   0.00000 0.013373723       39        1        1            7
## 4  1000.00   0.00000 0.007218093        3        0        0            0
## 5   296.84  11.44563 0.013580604       62        1        2            2
##   Private Financial
## 1       5         2
## 2       0         0
## 3       4        15
## 4       3         0
## 5     140        42

Change the rows by subzone name instead of row number by using the code chunk below:

row.names(cluster_vars) <- cluster_vars$"subzone"
head(cluster_vars,5)

##                         subzone economy_active_pr   aged_pr  hdb12_pr hdb34_pr
## ADMIRALTY             ADMIRALTY          593.9050  96.38554  80.79376 478.3841
## AIRPORT ROAD       AIRPORT ROAD            0.0000   0.00000   0.00000   0.0000
## ALEXANDRA HILL   ALEXANDRA HILL          548.6212 240.20319 283.74456 458.6357
## ALEXANDRA NORTH ALEXANDRA NORTH          669.8113  56.60377   0.00000   0.0000
## ALJUNIED               ALJUNIED          605.3745 186.86240  52.74944 448.3702
##                  hdb5_pr condo_pr landed_pr     density Business Industry
## ADMIRALTY       440.8221     0.00   0.00000 0.011183778        0        0
## AIRPORT ROAD      0.0000     0.00   0.00000 0.000000000        0        0
## ALEXANDRA HILL  226.4151     0.00   0.00000 0.013373723       39        1
## ALEXANDRA NORTH   0.0000  1000.00   0.00000 0.007218093        3        0
## ALJUNIED        159.2436   296.84  11.44563 0.013580604       62        1
##                 Shopping Govt_Embassy Private Financial
## ADMIRALTY              1            0       5         2
## AIRPORT ROAD           0            0       0         0
## ALEXANDRA HILL         1            7       4        15
## ALEXANDRA NORTH        0            0       3         0
## ALJUNIED               2            2     140        42

Now, we will delete the redundant column by using the code chunk below.

subzone_cluster <- select(cluster_vars, c(2:15))
head(subzone_cluster, 5)

##                 economy_active_pr   aged_pr  hdb12_pr hdb34_pr  hdb5_pr
## ADMIRALTY                593.9050  96.38554  80.79376 478.3841 440.8221
## AIRPORT ROAD               0.0000   0.00000   0.00000   0.0000   0.0000
## ALEXANDRA HILL           548.6212 240.20319 283.74456 458.6357 226.4151
## ALEXANDRA NORTH          669.8113  56.60377   0.00000   0.0000   0.0000
## ALJUNIED                 605.3745 186.86240  52.74944 448.3702 159.2436
##                 condo_pr landed_pr     density Business Industry Shopping
## ADMIRALTY           0.00   0.00000 0.011183778        0        0        1
## AIRPORT ROAD        0.00   0.00000 0.000000000        0        0        0
## ALEXANDRA HILL      0.00   0.00000 0.013373723       39        1        1
## ALEXANDRA NORTH  1000.00   0.00000 0.007218093        3        0        0
## ALJUNIED          296.84  11.44563 0.013580604       62        1        2
##                 Govt_Embassy Private Financial
## ADMIRALTY                  0       5         2
## AIRPORT ROAD               0       0         0
## ALEXANDRA HILL             7       4        15
## ALEXANDRA NORTH            0       3         0
## ALJUNIED                   2     140        42

4.2 Data Standarzation(Using Min-max)

subzone_cluster.std <- normalize(subzone_cluster)
summary(subzone_cluster.std)

##  economy_active_pr    aged_pr          hdb12_pr          hdb34_pr     
##  Min.   :0.0000    Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000    1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.6392    Median :0.1308   Median :0.00000   Median :0.0000  
##  Mean   :0.4798    Mean   :0.1196   Mean   :0.04134   Mean   :0.2620  
##  3rd Qu.:0.6657    3rd Qu.:0.1923   3rd Qu.:0.04233   3rd Qu.:0.5316  
##  Max.   :1.0000    Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##     hdb5_pr          condo_pr         landed_pr          density      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.04583   Median :0.00000   Median :0.1283  
##  Mean   :0.1377   Mean   :0.21118   Mean   :0.09062   Mean   :0.2331  
##  3rd Qu.:0.2477   3rd Qu.:0.29509   3rd Qu.:0.03887   3rd Qu.:0.4327  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##     Business           Industry          Shopping        Govt_Embassy    
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.006494   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.064734   Mean   :0.04257   Mean   :0.05103   Mean   :0.07219  
##  3rd Qu.:0.045455   3rd Qu.:0.00000   3rd Qu.:0.03226   3rd Qu.:0.05263  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##     Private          Financial       
##  Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.007463  
##  Median :0.01843   Median :0.037313  
##  Mean   :0.05142   Mean   :0.076706  
##  3rd Qu.:0.05069   3rd Qu.:0.097015  
##  Max.   :1.00000   Max.   :1.000000

4.3 Visualizing the standardised clustering variables

active <- ggplot(data=subzone_cluster.std, 
             aes(x= `economy_active_pr`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

aged <- ggplot(data=subzone_cluster.std, 
             aes(x= `aged_pr`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hdb12 <- ggplot(data=subzone_cluster.std, 
             aes(x= `hdb12_pr`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hdb34 <- ggplot(data=subzone_cluster.std, 
             aes(x= `hdb34_pr`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hdb5 <- ggplot(data=subzone_cluster.std, 
             aes(x= `hdb5_pr`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

condo <- ggplot(data=subzone_cluster.std, 
             aes(x= `condo_pr`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

landed <- ggplot(data=subzone_cluster.std, 
             aes(x= `landed_pr`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

density <- ggplot(data=subzone_cluster.std, 
             aes(x= `density`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ggarrange(active, aged, hdb12, hdb34, hdb5, condo, landed, density, 
          ncol = 3, 
          nrow = 3)

From the histograms, I can see that apart from the 0 values, most distribution are somewhat normally distributed, with some exceptions though like condo distribution.

4.4 Computing proximity matrix

proxmat <- dist(subzone_cluster.std, method = 'euclidean')

4.5 Selecting the optimal clustering algorithm

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(subzone_cluster.std, method = x)$ac
}

map_dbl(m, ac)

##   average    single  complete      ward 
## 0.8804043 0.8269970 0.9145901 0.9815686

We can see that “ward” has the strongest clustering structure among the four methods assessed. Hence, in the subsequent analysis, only Ward’s method will be used.

4.6 Computing hierarchical clustering

hclust_ward <- hclust(proxmat, method = 'ward.D')

Plotting the clustering plot:

plot(hclust_ward, cex = 0.4)

4.7 Interpreting the dendrograms

plot(hclust_ward, cex = 0.4)
rect.hclust(hclust_ward, k =5, border = 2:5)

4.8 Visually-driven hierarchical clustering analysis

4.8.1 Transforming the data frame into a matrix

subzone_cluster_matrix <- data.matrix(subzone_cluster.std)

4.8.2 Plotting interactive cluster heatmap using heatmaply()

Below I have used normalize method to get a clearer view of the cluster patterns:

heatmaply(normalize(subzone_cluster_matrix),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 5,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of Singapore Subzones by Various Indicators",
          xlab = "Indicators",
          ylab = "Subzones"
          )

4.9 Mapping the clusters formed

Creating the sf object for use:

mpsz_derived_sf <- left_join(mpsz_all_sf, mpsz_derived)

## Joining, by = c("subzone", "TOTAL", "density", "Business", "Industry", "Private", "Shopping", "Financial", "Govt_Embassy")

Set number of clusters to 5:

groups <- as.factor(cutree(hclust_ward, k=5))

Deriving the cluster for mapping:

mpsz_cluster <- cbind(mpsz_derived_sf, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)

Mapping clusters using qtm:

qtm(mpsz_cluster, "CLUSTER")

## Warning: The shape mpsz_cluster is invalid. See sf::st_is_valid

Observation: Cluster 2 are mainly subzones with minimal residents. Cluster 1 and 3 are subzones with large number of hdb residents. Cluster 4 are more populated with condo and landed property residents. The quality of the clustering is actually not very fragmented and quite satisfying.

5 Spatially Constrained Clustering - SKATER approach

5.1 Converting sf into SpatialPolygonsDataFrame

derived_sp <- as_Spatial(mpsz_derived_sf)

5.2 Computing Neighbour List

derived.nb <- poly2nb(derived_sp)
summary(derived.nb)

## Neighbour list object:
## Number of regions: 323 
## Number of nonzero links: 1934 
## Percentage nonzero weights: 1.853751 
## Average number of links: 5.987616 
## 5 regions with no links:
## 175 208 227 255 258
## Link number distribution:
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  5  2  6 10 26 77 87 51 34 16  3  3  1  1  1 
## 2 least connected regions:
## 42 109 with 1 link
## 1 most connected region:
## 40 with 17 links

5.3 Getting rid of no-neighbour nodes for later use

sub_nb <- subset(derived.nb, subset=card(derived.nb) > 0)

5.4 Plotting Nearest Neighbours

plot(derived_sp, border=grey(.5))
plot(derived.nb, coordinates(derived_sp), col="blue", add=TRUE)

5.5 Computing minimum spanning tree

5.5.1 Calculating edge costs

Compute the cost of each edge. It is the distance between nodes.

lcosts <- nbcosts(sub_nb, subzone_cluster.std)
derived.w <- nb2listw(sub_nb, lcosts, style="B")
summary(derived.w)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 318 
## Number of nonzero links: 1934 
## Percentage nonzero weights: 1.912503 
## Average number of links: 6.081761 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  2  6 10 26 77 87 51 34 16  3  3  1  1  1 
## 2 least connected regions:
## 42 109 with 1 link
## 1 most connected region:
## 40 with 17 links
## 
## Weights style: B 
## Weights constants summary:
##     n     nn       S0       S1       S2
## B 318 101124 1736.912 3768.324 43913.02

5.5.2 Creating the minimum spanning tree

derived_mst <- mstree(derived.w)

5.5.3 Plotting result

plot(derived_sp, border=gray(.5))
plot.mst(derived_mst, coordinates(derived_sp), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)

5.6 Computing spatially constrained clusters using SKATER method

clust5 <- skater(derived_mst[,1:2], subzone_cluster.std, 4)

5.6.1 Plotting the pruned tree that shows the spatially constrained five clusters.

plot(derived_sp, border=gray(.5))
plot(clust5, coordinates(derived_sp), cex.lab=.7,
     groups.colors=c("red","green","blue", "brown", "pink"), cex.circles=0.005, add=TRUE)

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

5.7 Visualising the spatially constrained clusters in choropleth map

groups_mat <- as.matrix(clust5$groups)
mpsz_spatialcluster <- cbind(mpsz_cluster[1:318,], as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(mpsz_spatialcluster, "SP_CLUSTER")

## Warning: The shape mpsz_spatialcluster is invalid. See sf::st_is_valid

Observation: Compared to non spatial constrained clusterings, I do not think the SKATER approach is not optimal in this case. Although clusters 3, 4, 5 are more centralized, too many subzones are under cluster 1, which makes the map not very intuitive.

Take-home_Ex03

Zhao Tianyun

5/24/2020