Students are tasked with segmenting Singapore at the planning subzone level into homogeneous socioeconomic areas by combining geodemographic data from the Singapore Department of Statistics with urban functions extracted from the geospatial data provided.
The task involves 3 parts:
Geospatial and Aspatial Data Wrangling
Performing clustering based on the informational space of the data
Forming homogeneous clusters by using geographically weighted segmentation
The R packages needed for this exercise are as follows:
packages = c('rgdal', 'spdep', 'ClustGeo', 'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse', "clValid", "reshape2")
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
set.seed(1234) #For consistency
For the purpose of this analysis, we will need to introduce demographic data to form some of our indicators. We can retrieve them from government sources and map them to the planning subzones further on.
Source: https://data.gov.sg/dataset/singapore-residents-by-subzone-and-type-of-dwelling-2011-2019
SG_Residents <- read_csv("data/aspatial/planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv")
summary(SG_Residents)
## planning_area subzone age_group sex
## Length:883728 Length:883728 Length:883728 Length:883728
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## type_of_dwelling resident_count year
## Length:883728 Min. : 0.00 Min. :2011
## Class :character 1st Qu.: 0.00 1st Qu.:2013
## Mode :character Median : 0.00 Median :2015
## Mean : 39.83 Mean :2015
## 3rd Qu.: 10.00 3rd Qu.:2017
## Max. :2860.00 Max. :2019
We can see that our data has no missing values, which means no clean-up is required. The data spans multiple years and includes some variables we do not need, so we will have to segment and aggregate it before it is useful to us.
SG_Residents_2019 <- SG_Residents %>%
filter(year == 2019)
Now we have only the 2019 data which is relevant to us.
The indicators specified in the assignment document are as follows:
The population density will be defined as the total resident count divided by the area of the polygon (m^2) which will be derived further on.
Demographic_data <- SG_Residents_2019 %>%
  spread(age_group, resident_count) %>%
  mutate_at(vars(planning_area, subzone), toupper) %>% # funs() is deprecated; pass the function directly
  group_by(planning_area, subzone) %>%
  mutate(YOUNG = `0_to_4` + `5_to_9` + `10_to_14` + `15_to_19` + `20_to_24`) %>%
  mutate(ACTIVE = `25_to_29` + `30_to_34` + `35_to_39` + `40_to_44` + `45_to_49` + `50_to_54` + `55_to_59` + `60_to_64`) %>%
  mutate(AGED = `65_to_69` + `70_to_74` + `75_to_79` + `80_to_84` + `85_to_89` + `90_and_over`) %>%
  summarise(YOUNG = sum(YOUNG), ACTIVE = sum(ACTIVE), AGED = sum(AGED), TOTAL = sum(YOUNG, ACTIVE, AGED))
The assignment document also specifies the following housing indicators:
We will define “dwellers” as the resident count for each property type in each subzone.
Housing_data <- SG_Residents_2019 %>%
  spread(type_of_dwelling, resident_count) %>%
  mutate_at(vars(planning_area, subzone), toupper) %>%
  group_by(planning_area, subzone) %>%
  summarise(HDB1_2RM = sum(`HDB 1- and 2-Room Flats`),
            HDB3RM = sum(`HDB 3-Room Flats`),
            HDB4RM = sum(`HDB 4-Room Flats`),
            HDB5RM_EF = sum(`HDB 5-Room and Executive Flats`),
            CONDO_APART = sum(`Condominiums and Other Apartments`),
            LANDED = sum(`Landed Properties`))
As “Others” and “HUDC Apartments” are not among the indicators of interest, they are excluded from this study, since we cannot determine which category they fall under.
summary(Demographic_data)
## planning_area subzone YOUNG ACTIVE
## Length:323 Length:323 Min. : 0 Min. : 0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0
## Mode :character Mode :character Median : 1170 Median : 2840
## Mean : 3296 Mean : 7386
## 3rd Qu.: 4365 3rd Qu.:10285
## Max. :34260 Max. :79780
## AGED TOTAL
## Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0
## Median : 730 Median : 4890
## Mean : 1806 Mean : 12487
## 3rd Qu.: 3000 3rd Qu.: 17065
## Max. :18860 Max. :132900
No missing values
summary(Housing_data)
## planning_area subzone HDB1_2RM HDB3RM
## Length:323 Length:323 Min. : 0 Min. : 0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0
## Mode :character Mode :character Median : 0 Median : 0
## Mean : 542 Mean : 1803
## 3rd Qu.: 605 3rd Qu.: 2090
## Max. :4700 Max. :31350
## HDB4RM HDB5RM_EF CONDO_APART LANDED
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0
## Median : 0 Median : 0 Median : 230 Median : 0.0
## Mean : 4150 Mean : 3297 Mean : 1827 Mean : 773.9
## 3rd Qu.: 5590 3rd Qu.: 3660 3rd Qu.: 2835 3rd Qu.: 400.0
## Max. :51400 Max. :47960 Max. :16770 Max. :18820.0
No missing values
Dropping the planning_area column makes the merge easier later on and removes the need for additional clean-up.
Housing_data <- within(Housing_data, rm(planning_area)) #drops planning_area Column
Demographic_data <- within(Demographic_data, rm(planning_area)) # Drops planning_area Column
In this case, the data has been provided to us for this assignment. The source and date of retrieval are not specified but we will assume they are all within 2019.
From the geospatial data provided, you are required to extract the following urban functions:
* Government including embassy
* Business
* Industry
* Shopping
* Financial
* Upmarket residential area
Based on an inspection of the data provided, the Industry dataset appears to be missing. We will need to extract it from the Business dataset below.
Businesses_all <- st_read(dsn="data/geospatial/urban_function", layer="Business")
## Reading layer `Business' from data source `D:\GSA\Take_Home_EX03\data\geospatial\urban_function' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## geographic CRS: WGS 84
summary(Businesses_all)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1.000 Min. :5000
## 1st Qu.:9.967e+08 1st Qu.:1.000 1st Qu.:5000
## Median :1.097e+09 Median :1.000 Median :5000
## Mean :9.933e+08 Mean :1.021 Mean :5084
## 3rd Qu.:1.108e+09 3rd Qu.:1.000 3rd Qu.:5000
## Max. :1.204e+09 Max. :3.000 Max. :9991
##
## POI_NAME ST_NAME
## CAMBRIDGE INDUSTRIAL TRUST: 8 TAGORE LN : 83
## DHL : 6 JOO KOON CIR : 80
## NATIONAL OILWELL VARCO : 6 GUL CIR : 62
## ST MICROELECTRONICS : 6 KAKI BUKIT PL : 53
## CWT : 5 KAKI BUKIT IND TER: 52
## HALLIBURTON : 5 (Other) :5941
## (Other) :6514 NA's : 279
## geometry
## POINT :6550
## epsg:4326 : 0
## +proj=long...: 0
##
##
##
##
We can see that FAC_TYPE takes two distinct values, indicating that we need to split the Industry indicator out of Businesses_all, which we will do below.
By inspecting the names of the companies, we are able to determine which FAC_TYPE corresponds to Industry and which to Businesses. It should be noted, however, that there are cases in which an “Industrial building” is still labelled with FAC_TYPE 5000, which may be a data quality issue.
Industry <- Businesses_all %>%
filter(FAC_TYPE==9991)
Businesses <- Businesses_all %>%
filter(FAC_TYPE==5000)
Financial <- st_read(dsn="data/geospatial/urban_function", layer="Financial")
Embassies <- st_read(dsn="data/geospatial/urban_function", layer="Govt_Embassy")
Residential <- st_read(dsn="data/geospatial/urban_function", layer="Private residential")
Shopping <- st_read(dsn="data/geospatial/urban_function", layer="Shopping")
mpsz <- st_read(dsn="data/geospatial/subzone", layer="MP14_SUBZONE_WEB_PL")
You should inspect all of them for differences in FAC_TYPE. For the purpose of our analysis, however, we will treat the remaining layers as not requiring separation, since they already correspond to our required indicators.
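As a quick, illustrative sanity check (assuming each remaining layer also carries a FAC_TYPE column like Businesses_all does, which we have not verified), we could list the distinct codes per layer:
# Hedged sketch: list the distinct FAC_TYPE codes in each remaining layer.
# unique() simply returns NULL for any layer that lacks the column.
sapply(list(Financial = Financial, Embassies = Embassies,
            Residential = Residential, Shopping = Shopping),
       function(layer) unique(layer$FAC_TYPE))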
# Retrieves a quick breakdown of the number of NA rows and invalid polygons/points
Validity_NA_Check <- function(target_st) {
  validity <- st_is_valid(target_st)
  NA_rows <- target_st[rowSums(is.na(target_st)) != 0, ]
  Invalid_rows <- which(validity == FALSE)
  print(paste("For:", deparse(substitute(target_st))))
  print(paste("Number of Invalid polygons/points is:", length(Invalid_rows)))
  print(paste("Number of NA rows is:", nrow(NA_rows))) # nrow(), not length(): length() on a data frame returns the column count
}
# Retrieves the exact polygon which is invalid
get_invalid <- function(target_st) {
validity <- st_is_valid(target_st)
Invalid_rows <- which(validity==FALSE)
return(Invalid_rows)
}
# Retrieves the exact rows which contain NA values for you to check the columns
get_NA_rows <- function(target_st) {
NA_rows <- target_st[rowSums(is.na(target_st))!=0,]
return(NA_rows)
}
# A cleaning function that replaces NA with "Missing" so that calculations can still be done.
## This function is a little unnecessary, as we will not be using the data attached to the geospatial points.
replace_NA_with_unknown <- function(x){
  x %>%
    mutate_if(is.factor, ~ factor(replace(as.character(.), is.na(.), "Missing")))
}
Validity_NA_Check(Businesses)
## [1] "For: Businesses"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Industry)
## [1] "For: Industry"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Financial)
## [1] "For: Financial"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 30"
Validity_NA_Check(Embassies)
## [1] "For: Embassies"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Residential)
## [1] "For: Residential"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Shopping)
## [1] "For: Shopping"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(mpsz)
## [1] "For: mpsz"
## [1] "Number of Invalid polygons/points is: 9"
## [1] "Number of NA rows is: 16"
There are several NA values within the dataframes. For the purposes of our analysis, we do not strictly need to replace them, as we only care about the number of points within each polygon. However, for hygiene's sake, we will attempt to clear the NA values.
# Retrieve Rows which are NA so we can inspect which columns are missing
get_NA_rows(Businesses)
get_NA_rows(Industry)
get_NA_rows(Financial)
get_NA_rows(Embassies)
get_NA_rows(Residential)
get_NA_rows(Shopping)
get_NA_rows(mpsz)
get_invalid(mpsz) #Finds which polygons are invalid
We will hide the output for simplicity of reporting.
mpsz <- st_make_valid(mpsz)
We use the st_make_valid() function from the sf package to convert the invalid polygons into valid ones.
Businesses <- replace_NA_with_unknown(Businesses)
Industry <- replace_NA_with_unknown(Industry)
Financial <- replace_NA_with_unknown(Financial)
Embassies <- replace_NA_with_unknown(Embassies)
Residential <- replace_NA_with_unknown(Residential)
Shopping <- replace_NA_with_unknown(Shopping)
mpsz <- replace_NA_with_unknown(mpsz)
One final check, and our data should be clean.
Validity_NA_Check(Businesses)
## [1] "For: Businesses"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Industry)
## [1] "For: Industry"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Financial)
## [1] "For: Financial"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 30"
Validity_NA_Check(Embassies)
## [1] "For: Embassies"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Residential)
## [1] "For: Residential"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Shopping)
## [1] "For: Shopping"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(mpsz)
## [1] "For: mpsz"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 16"
The invalid geometries have been repaired, and the remaining attribute NAs do not affect our analysis, since we only count points within polygons.
In order to properly map our data together, we will need to standardize the CRS to be in line with our Singapore study area (SVY21, EPSG:3414).
mpsz <- st_transform(mpsz, crs= 3414)
Businesses <- st_transform(Businesses, crs= 3414)
Industry <- st_transform(Industry, crs= 3414)
Financial <- st_transform(Financial, crs= 3414)
Embassies <- st_transform(Embassies, crs= 3414)
Residential <- st_transform(Residential, crs= 3414)
Shopping <- st_transform(Shopping, crs= 3414)
This will allow us to see what units of measurement we are dealing with, as well as validate that the st_transform() calls above have worked. We first convert the sf objects to Spatial* dataframes for inspection.
mpsz.spdf <- as_Spatial(mpsz)
Businesses.spdf <- as_Spatial(Businesses)
Industry.spdf <- as_Spatial(Industry)
Financial.spdf <- as_Spatial(Financial)
Embassies.spdf <- as_Spatial(Embassies)
Residential.spdf <- as_Spatial(Residential)
Shopping.spdf <- as_Spatial(Shopping)
Using proj4string() to identify the units of measurement and to validate that the conversions match:
proj4string(mpsz.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Businesses.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Industry.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Financial.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Embassies.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Residential.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Shopping.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
Our data is now all aligned, so we can proceed to plot the datasets to inspect them.
plot(mpsz.spdf, main = "Singapore Urban Functions")
plot(st_geometry(Businesses), col = "blue", add = TRUE) # st_geometry() so only the point geometries are overlaid
plot(st_geometry(Industry), col = "green", add = TRUE)
plot(st_geometry(Financial), col = "yellow", add = TRUE)
plot(st_geometry(Embassies), col = "purple", add = TRUE)
plot(st_geometry(Residential), col = "orange", add = TRUE)
plot(st_geometry(Shopping), col = "red", add = TRUE)
Based on the map above, we can see that our datasets are properly plotted, although some data points fall within the water catchment areas and the airport. This will be dealt with further down.
Now that we have prepared our dataframes, we will need to combine them into a single dataframe to perform our analysis.
For this analysis, we are interested in how many points of each urban function fall within each subzone, so we simply need a count of the points inside each polygon. We will therefore use the st_intersects() function from the sf package and take the lengths() of its output, which gives the number of points falling within each polygon.
mpsz$Businesses <- lengths(st_intersects(mpsz, Businesses))
mpsz$Industry <- lengths(st_intersects(mpsz, Industry))
mpsz$Financial <- lengths(st_intersects(mpsz, Financial))
mpsz$Embassies <- lengths(st_intersects(mpsz, Embassies))
mpsz$Residential <- lengths(st_intersects(mpsz, Residential))
mpsz$Shopping <- lengths(st_intersects(mpsz, Shopping))
Here, we will be using the left_join() function from the dplyr package that will help us to easily format our final super dataframe with all our desired indicators.
SG_Superframe <- left_join(mpsz , Demographic_data, by = c("SUBZONE_N" = "subzone"))
SG_Superframe$Density <- (SG_Superframe$TOTAL / SG_Superframe$SHAPE_Area)
SG_Superframe <- left_join(SG_Superframe, Housing_data, c("SUBZONE_N" = "subzone"))
SG_Superframe <- within(SG_Superframe, rm(SUBZONE_NO, SUBZONE_C, CA_IND, PLN_AREA_C, REGION_C, INC_CRC, FMEL_UPD_D))
For this study, we will need to concentrate on mainland Singapore in order to properly perform our cluster analysis. This is primarily because we need the nearest neighbor weights further down the line, when conducting our geospatially weighted cluster analysis.
SG_Superframe <-SG_Superframe[!SG_Superframe$SUBZONE_N=="CENTRAL WATER CATCHMENT",] #Some odd datapoints in there
SG_Superframe <-SG_Superframe[!SG_Superframe$PLN_AREA_N=="WESTERN ISLANDS",] # Removing external islands
SG_Superframe <-SG_Superframe[!SG_Superframe$PLN_AREA_N=="SOUTHERN ISLANDS",] # Removing external islands
SG_Superframe <-SG_Superframe[!SG_Superframe$PLN_AREA_N=="NORTH-EASTERN ISLANDS",] # Removing external islands
SG_Superframe <-SG_Superframe[!SG_Superframe$SUBZONE_N=="PULAU SELETAR",] # Removing external islands
SG_Superframe <-SG_Superframe[!SG_Superframe$SUBZONE_N=="WESTERN WATER CATCHMENT",] #Some odd datapoints in there
SG_Superframe <-SG_Superframe[!SG_Superframe$SUBZONE_N=="TUAS VIEW EXTENSION",] # Not built yet
# Removing polygons with no data on all variables.
SG_Superframe <- SG_Superframe[rowSums(SG_Superframe[, 9:25] %>%
  st_set_geometry(NULL)) != 0, ] # rowSums(), not sum(): we test each row, not the whole table
tm_shape(SG_Superframe)+
tm_fill(col= "SHAPE_Leng",
style="jenks",
title = "Singapore Study Areas")+
tm_borders(alpha=0.7)
Indicator_vars <- SG_Superframe %>%
st_set_geometry(NULL) %>% # dropping the geometry column because we are using non-geospatial hierarchical cluster analysis
select("ACTIVE", "YOUNG", "AGED", "Density", "HDB1_2RM", "HDB3RM", "HDB4RM", "HDB5RM_EF", "CONDO_APART", "LANDED", "Businesses", "Industry", "Financial", "Embassies", "Residential", "Shopping")
summary(Indicator_vars)
## ACTIVE YOUNG AGED Density
## Min. : 0 Min. : 0 Min. : 0 Min. :0.00000
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.:0.00000
## Median : 3420 Median : 1400 Median : 780 Median :0.00611
## Mean : 7617 Mean : 3398 Mean : 1863 Mean :0.01108
## 3rd Qu.:11320 3rd Qu.: 4630 3rd Qu.: 3080 3rd Qu.:0.02061
## Max. :79780 Max. :34260 Max. :18860 Max. :0.04606
## HDB1_2RM HDB3RM HDB4RM HDB5RM_EF
## Min. : 0.0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 0.0 Median : 0 Median : 0 Median : 0
## Mean : 559.3 Mean : 1860 Mean : 4283 Mean : 3403
## 3rd Qu.: 770.0 3rd Qu.: 2130 3rd Qu.: 5950 3rd Qu.: 3830
## Max. :4700.0 Max. :31350 Max. :51400 Max. :47960
## CONDO_APART LANDED Businesses Industry
## Min. : 0 Min. : 0.0 Min. : 0.00 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:0.0000
## Median : 280 Median : 0.0 Median : 2.00 Median :0.0000
## Mean : 1881 Mean : 796.5 Mean : 20.25 Mean :0.3514
## 3rd Qu.: 2950 3rd Qu.: 440.0 3rd Qu.: 14.00 3rd Qu.:0.0000
## Max. :16770 Max. :18820.0 Max. :308.00 Max. :8.0000
## Financial Embassies Residential Shopping
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 0.00
## Median : 5.00 Median : 0.000 Median : 4.00 Median : 0.00
## Mean : 10.46 Mean : 1.406 Mean : 11.45 Mean : 1.61
## 3rd Qu.: 13.00 3rd Qu.: 1.000 3rd Qu.: 11.00 3rd Qu.: 1.00
## Max. :134.00 Max. :19.000 Max. :217.00 Max. :31.00
Based on the summary, we can already see that the value ranges of the variables differ greatly from one another. This is no surprise, as density is a derived metric while the others are raw counts. Nonetheless, we will need to examine their distributions in order to determine the appropriate standardization technique.
Indicator_vars %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Based on the histograms displayed above, it is clear that the data is extremely skewed towards low values, so we cannot assume a normal distribution. Given the nature of the data, we should examine it further by removing the zeros, which may be contributing to the skew, from the histograms.
no_zeros <- Indicator_vars %>%
  mutate_all(~ replace(., . == 0, NaN))
no_zeros %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Now we can see our data a little better. However, the distributions are still heavily skewed towards low values, with long upper tails, so min-max normalization is the appropriate standardization to apply.
For this analysis, we will be using the normalize() function from the heatmaply package.
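Under the hood, normalize() rescales each numeric column onto [0, 1]; a minimal hand-rolled equivalent for a single vector (shown purely for illustration) would be:
# Min-max rescaling: x' = (x - min) / (max - min)
minmax <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
minmax(c(0, 5, 10)) # 0.0 0.5 1.0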
Indicator_vars.norm <- normalize(Indicator_vars)
summary(Indicator_vars.norm)
## ACTIVE YOUNG AGED Density
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.04287 Median :0.04086 Median :0.04136 Median :0.1327
## Mean :0.09547 Mean :0.09919 Mean :0.09879 Mean :0.2405
## 3rd Qu.:0.14189 3rd Qu.:0.13514 3rd Qu.:0.16331 3rd Qu.:0.4474
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## HDB1_2RM HDB3RM HDB4RM HDB5RM_EF
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.1190 Mean :0.05934 Mean :0.08333 Mean :0.07095
## 3rd Qu.:0.1638 3rd Qu.:0.06794 3rd Qu.:0.11576 3rd Qu.:0.07986
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## CONDO_APART LANDED Businesses Industry
## Min. :0.0000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.0167 Median :0.00000 Median :0.006494 Median :0.00000
## Mean :0.1122 Mean :0.04232 Mean :0.065744 Mean :0.04393
## 3rd Qu.:0.1759 3rd Qu.:0.02338 3rd Qu.:0.045455 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## Financial Embassies Residential Shopping
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.007463 1st Qu.:0.00000 1st Qu.:0.004608 1st Qu.:0.00000
## Median :0.037313 Median :0.00000 Median :0.018433 Median :0.00000
## Mean :0.078036 Mean :0.07399 Mean :0.052753 Mean :0.05194
## 3rd Qu.:0.097015 3rd Qu.:0.05263 3rd Qu.:0.050691 3rd Qu.:0.03226
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.00000
Indicator_vars.norm %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Now we can see a little more clearly that some distributions differ from one another. For example, Density and HDB1_2RM are much more spread out than Shopping and Residential.
Indicator_vars.norm.cor = cor(Indicator_vars.norm) # base R function to compute the correlation matrix
corrplot.mixed(Indicator_vars.norm.cor, # "mixed" renders the lower and upper triangles differently
               lower = "ellipse", # glyph type for the lower triangle
               upper = "number",  # representation for the upper triangle
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")
Unfortunately, a number of variables have high collinearity with one another, which would distort the distance computations underlying our clustering. We must therefore decide whether these indicators are reliable as-is or whether they must be further derived in order to form better clusters.
Many of the population variables (both the demographic and the housing data) are highly correlated with one another, most likely because they all scale with the total resident count in each subzone. We should therefore use ratios instead of raw counts in order to properly capture the proportions of young, economically active and aged residents.
SG_Derivedframe <- SG_Superframe %>%
  # the ifelse() guards avoid 0/0 = NaN when an unpopulated subzone has TOTAL == 0
  mutate(YOUNG_RATIO = ifelse((YOUNG == 0), 0, (YOUNG/TOTAL))) %>%
mutate(ACTIVE_RATIO = ifelse((ACTIVE == 0), 0, (ACTIVE/TOTAL))) %>%
mutate(AGED_RATIO = ifelse((AGED == 0), 0, (AGED/TOTAL)))%>%
mutate(HDB1_2RM_RATIO = ifelse((HDB1_2RM == 0), 0, (HDB1_2RM/TOTAL)))%>%
mutate(HDB3RM_RATIO = ifelse((HDB3RM == 0), 0, (HDB3RM/TOTAL)))%>%
mutate(HDB4RM_RATIO = ifelse((HDB4RM == 0), 0, (HDB4RM/TOTAL)))%>%
mutate(HDB5RM_EF_RATIO = ifelse((HDB5RM_EF == 0), 0, (HDB5RM_EF/TOTAL)))%>%
mutate(CONDO_APART_RATIO = ifelse((CONDO_APART == 0), 0, (CONDO_APART/TOTAL)))%>%
mutate(LANDED_RATIO = ifelse((LANDED == 0), 0, (LANDED/TOTAL)))
Indicator_vars_derived <- SG_Derivedframe %>%
st_set_geometry(NULL) %>% # dropping the geometry column because we are using non-geospatial hierarchical cluster analysis
select("ACTIVE_RATIO", "YOUNG_RATIO", "AGED_RATIO", "Density", "HDB1_2RM_RATIO", "HDB3RM_RATIO", "HDB4RM_RATIO", "HDB5RM_EF_RATIO", "CONDO_APART_RATIO", "LANDED_RATIO", "Businesses", "Industry", "Financial", "Embassies", "Residential", "Shopping")
Indicator_vars_derived %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Using the derived ratios, the distributions now differ from one another far more, which should give us better indicators for properly clustering these areas.
Indicator_vars_derived.norm <- normalize(Indicator_vars_derived)
summary(Indicator_vars_derived.norm)
## ACTIVE_RATIO YOUNG_RATIO AGED_RATIO Density
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.6406 Median :0.4048 Median :0.1362 Median :0.1327
## Mean :0.4908 Mean :0.3272 Mean :0.1231 Mean :0.2405
## 3rd Qu.:0.6667 3rd Qu.:0.4880 3rd Qu.:0.1962 3rd Qu.:0.4474
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## HDB1_2RM_RATIO HDB3RM_RATIO HDB4RM_RATIO HDB5RM_EF_RATIO
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.04266 Mean :0.1008 Mean :0.2176 Mean :0.1421
## 3rd Qu.:0.04368 3rd Qu.:0.1666 3rd Qu.:0.4499 3rd Qu.:0.2530
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## CONDO_APART_RATIO LANDED_RATIO Businesses Industry
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.05067 Median :0.00000 Median :0.006494 Median :0.00000
## Mean :0.21585 Mean :0.09239 Mean :0.065744 Mean :0.04393
## 3rd Qu.:0.29954 3rd Qu.:0.05000 3rd Qu.:0.045455 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## Financial Embassies Residential Shopping
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.007463 1st Qu.:0.00000 1st Qu.:0.004608 1st Qu.:0.00000
## Median :0.037313 Median :0.00000 Median :0.018433 Median :0.00000
## Mean :0.078036 Mean :0.07399 Mean :0.052753 Mean :0.05194
## 3rd Qu.:0.097015 3rd Qu.:0.05263 3rd Qu.:0.050691 3rd Qu.:0.03226
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.00000
Indicator_vars_derived.norm.cor = cor(Indicator_vars_derived.norm) # base R function to compute the correlation matrix
corrplot.mixed(Indicator_vars_derived.norm.cor, # "mixed" renders the lower and upper triangles differently
               lower = "ellipse", # glyph type for the lower triangle
               upper = "number",  # representation for the upper triangle
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")
Based on the results above, we now have a much clearer picture of the relationships between the indicators after controlling for the total resident count in each subzone. However, some variables are still highly correlated with one another.
There are multiple other tests such as VIFs and F-Tests that we can conduct to get a clearer picture. Further discussion can be found here: https://www.r-bloggers.com/dealing-with-the-problem-of-multicollinearity-in-r/
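As an illustrative sketch (not part of the original pipeline), variance inflation factors can be computed in base R by regressing each indicator on all the others, using VIF_j = 1 / (1 - R^2_j):
# Sketch: base-R VIFs; values above roughly 5-10 are commonly read as problematic.
vifs <- sapply(names(Indicator_vars_derived.norm), function(v) {
  fit <- lm(reformulate(setdiff(names(Indicator_vars_derived.norm), v), response = v),
            data = Indicator_vars_derived.norm)
  1 / (1 - summary(fit)$r.squared)
})
round(sort(vifs, decreasing = TRUE), 2)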
For the purpose of our analysis, we will set a threshold to determine which indicators to drop. Existing literature tells us that correlations above 0.75 may lead to issues in multiple-regression analysis. Further discussion can be found at https://www.researchgate.net/post/How_can_I_avoid_multicollinearity and in “Multivariate Data Analysis” (Hair et al., 2006).
With 0.75 as the threshold, we will drop YOUNG_RATIO as an indicator, since it is highly correlated with ACTIVE_RATIO.
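As a quick sanity check before dropping variables (an illustrative snippet, not from the original analysis), we can list every indicator pair whose absolute correlation exceeds the threshold:
# Indicator pairs with |correlation| > 0.75, taken from the upper triangle only.
high_cor <- which(abs(Indicator_vars_derived.norm.cor) > 0.75 &
                    upper.tri(Indicator_vars_derived.norm.cor), arr.ind = TRUE)
data.frame(var1 = rownames(Indicator_vars_derived.norm.cor)[high_cor[, 1]],
           var2 = colnames(Indicator_vars_derived.norm.cor)[high_cor[, 2]],
           corr = round(Indicator_vars_derived.norm.cor[high_cor], 3))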
Final_Indicators <- SG_Derivedframe %>%
st_set_geometry(NULL) %>% # dropping the geometry column because we are using non-geospatial hierarchical cluster analysis
select("ACTIVE_RATIO", "AGED_RATIO", "Density", "HDB1_2RM_RATIO", "HDB3RM_RATIO", "HDB4RM_RATIO", "HDB5RM_EF_RATIO", "CONDO_APART_RATIO", "LANDED_RATIO", "Businesses", "Industry", "Financial", "Embassies", "Residential", "Shopping")
Final_Indicators.norm <- normalize(Final_Indicators)
summary(Final_Indicators.norm)
## ACTIVE_RATIO AGED_RATIO Density HDB1_2RM_RATIO
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.6406 Median :0.1362 Median :0.1327 Median :0.00000
## Mean :0.4908 Mean :0.1231 Mean :0.2405 Mean :0.04266
## 3rd Qu.:0.6667 3rd Qu.:0.1962 3rd Qu.:0.4474 3rd Qu.:0.04368
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## HDB3RM_RATIO HDB4RM_RATIO HDB5RM_EF_RATIO CONDO_APART_RATIO
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.05067
## Mean :0.1008 Mean :0.2176 Mean :0.1421 Mean :0.21585
## 3rd Qu.:0.1666 3rd Qu.:0.4499 3rd Qu.:0.2530 3rd Qu.:0.29954
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## LANDED_RATIO Businesses Industry Financial
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.007463
## Median :0.00000 Median :0.006494 Median :0.00000 Median :0.037313
## Mean :0.09239 Mean :0.065744 Mean :0.04393 Mean :0.078036
## 3rd Qu.:0.05000 3rd Qu.:0.045455 3rd Qu.:0.00000 3rd Qu.:0.097015
## Max. :1.00000 Max. :1.000000 Max. :1.00000 Max. :1.000000
## Embassies Residential Shopping
## Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.004608 1st Qu.:0.00000
## Median :0.00000 Median :0.018433 Median :0.00000
## Mean :0.07399 Mean :0.052753 Mean :0.05194
## 3rd Qu.:0.05263 3rd Qu.:0.050691 3rd Qu.:0.03226
## Max. :1.00000 Max. :1.000000 Max. :1.00000
Final_Indicators.norm.cor = cor(Final_Indicators.norm) # base R function to compute the correlation matrix
corrplot.mixed(Final_Indicators.norm.cor, # "mixed" renders the lower and upper triangles differently
               lower = "ellipse", # glyph type for the lower triangle
               upper = "number",  # representation for the upper triangle
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")
With this, our indicators are no longer too heavily correlated: every pairwise correlation coefficient now falls below 0.75.
Clustering_indicators <- SG_Derivedframe %>%
st_set_geometry(NULL) %>% # dropping the geometry column because we are using non-geospatial hierarchical cluster analysis
select("SUBZONE_N", "ACTIVE_RATIO", "AGED_RATIO", "Density", "HDB1_2RM_RATIO", "HDB3RM_RATIO", "HDB4RM_RATIO", "HDB5RM_EF_RATIO", "CONDO_APART_RATIO", "LANDED_RATIO", "Businesses", "Industry", "Financial", "Embassies", "Residential", "Shopping")
Mapping_sf <- left_join(SG_Superframe, Clustering_indicators, by=c("SUBZONE_N"="SUBZONE_N"))
tm_shape(Mapping_sf) +
tm_polygons(c("ACTIVE_RATIO", "AGED_RATIO", "Density.y", "HDB1_2RM_RATIO", "HDB3RM_RATIO", "HDB4RM_RATIO", "HDB5RM_EF_RATIO", "CONDO_APART_RATIO", "LANDED_RATIO", "Businesses.y", "Industry.y", "Financial.y", "Embassies.y", "Residential.y", "Shopping.y"),
style="jenks") +
tm_facets(sync = TRUE, ncol = 3) +
tm_legend(legend.position = c("right", "bottom"))+
tm_layout(outer.margins=0, asp=0)
Taking a quick snapshot of all our indicators across the subzones, we can see that the geographic distribution of high and low values varies considerably between indicators.
row.names(Clustering_indicators) <- Clustering_indicators$SUBZONE_N
Cluster_names <- select(Clustering_indicators, c(2:16))
Clustering_indicators.norm <- normalize(Clustering_indicators)
summary(Clustering_indicators.norm)
## SUBZONE_N ACTIVE_RATIO AGED_RATIO Density
## Length:313 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :0.6406 Median :0.1362 Median :0.1327
## Mean :0.4908 Mean :0.1231 Mean :0.2405
## 3rd Qu.:0.6667 3rd Qu.:0.1962 3rd Qu.:0.4474
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## HDB1_2RM_RATIO HDB3RM_RATIO HDB4RM_RATIO HDB5RM_EF_RATIO
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.04266 Mean :0.1008 Mean :0.2176 Mean :0.1421
## 3rd Qu.:0.04368 3rd Qu.:0.1666 3rd Qu.:0.4499 3rd Qu.:0.2530
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## CONDO_APART_RATIO LANDED_RATIO Businesses Industry
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.05067 Median :0.00000 Median :0.006494 Median :0.00000
## Mean :0.21585 Mean :0.09239 Mean :0.065744 Mean :0.04393
## 3rd Qu.:0.29954 3rd Qu.:0.05000 3rd Qu.:0.045455 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## Financial Embassies Residential Shopping
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.007463 1st Qu.:0.00000 1st Qu.:0.004608 1st Qu.:0.00000
## Median :0.037313 Median :0.00000 Median :0.018433 Median :0.00000
## Mean :0.078036 Mean :0.07399 Mean :0.052753 Mean :0.05194
## 3rd Qu.:0.097015 3rd Qu.:0.05263 3rd Qu.:0.050691 3rd Qu.:0.03226
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.00000
Cluster_names.norm <- normalize(Cluster_names)
summary(Cluster_names.norm)
## ACTIVE_RATIO AGED_RATIO Density HDB1_2RM_RATIO
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.6406 Median :0.1362 Median :0.1327 Median :0.00000
## Mean :0.4908 Mean :0.1231 Mean :0.2405 Mean :0.04266
## 3rd Qu.:0.6667 3rd Qu.:0.1962 3rd Qu.:0.4474 3rd Qu.:0.04368
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## HDB3RM_RATIO HDB4RM_RATIO HDB5RM_EF_RATIO CONDO_APART_RATIO
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.05067
## Mean :0.1008 Mean :0.2176 Mean :0.1421 Mean :0.21585
## 3rd Qu.:0.1666 3rd Qu.:0.4499 3rd Qu.:0.2530 3rd Qu.:0.29954
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## LANDED_RATIO Businesses Industry Financial
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.007463
## Median :0.00000 Median :0.006494 Median :0.00000 Median :0.037313
## Mean :0.09239 Mean :0.065744 Mean :0.04393 Mean :0.078036
## 3rd Qu.:0.05000 3rd Qu.:0.045455 3rd Qu.:0.00000 3rd Qu.:0.097015
## Max. :1.00000 Max. :1.000000 Max. :1.00000 Max. :1.000000
## Embassies Residential Shopping
## Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.004608 1st Qu.:0.00000
## Median :0.00000 Median :0.018433 Median :0.00000
## Mean :0.07399 Mean :0.052753 Mean :0.05194
## 3rd Qu.:0.05263 3rd Qu.:0.050691 3rd Qu.:0.03226
## Max. :1.00000 Max. :1.000000 Max. :1.00000
Cluster_names.mat <- data.matrix(Cluster_names.norm)
proxmat <- dist(Cluster_names.mat, method = 'euclidean')
In order to determine which clustering method is most appropriate, we apply each method to our clustering data and compare the agglomerative coefficients (values closer to 1 indicate a stronger clustering structure).
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")
Apply_agnes <- function(x) {
agnes(Cluster_names.norm, method = x)$ac
}
map_dbl(m, Apply_agnes)
## average single complete ward
## 0.8669997 0.8074952 0.9045193 0.9783950
Since we want a coefficient as close to 1 as possible, we will use the “ward” method.
hclust_ward <- hclust(proxmat, method = 'ward.D')
hclust_ward
##
## Call:
## hclust(d = proxmat, method = "ward.D")
##
## Cluster method : ward.D
## Distance : euclidean
## Number of objects: 313
While often seen as an arbitrary task, we need to determine the number of clusters objectively in order to properly meet the objective of performing Social Area Analysis.
Based on existing literature, there are different methods to do this. Usually they involve either some measure of dis/similarity (distance) between clusters or to adapt statistical rules or tests to determine the right number of clusters.
One method (although it tends to underestimate the actual number of clusters) is to look at the within-cluster similarity at each stage. It was introduced by R. L. Thorndike in 1953 (Psychometrika, 18[4], 267-276), and the resulting “Thorndike” method is widely used in social science.
Source: https://hlab.stanford.edu/brian/number_of_clusters_.html
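For completeness, a minimal sketch of such a within-cluster-similarity check (illustrative only; we rely on the Dunn's Index below instead) might look like this:
# Mean within-cluster pairwise distance for a range of k; look for an "elbow".
wcd <- sapply(2:15, function(k) {
  grp <- cutree(hclust_ward, k = k)
  mean(sapply(unique(grp), function(g) {
    pts <- Cluster_names.mat[grp == g, , drop = FALSE]
    if (nrow(pts) < 2) 0 else mean(dist(pts))
  }))
})
plot(2:15, wcd, type = "b", xlab = "Number of clusters", ylab = "Mean within-cluster distance")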
Another method uses the Dunn’s Index: the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter, where the diameter of a cluster is the distance between its two furthermost points. To obtain well-separated, compact clusters, you should aim for a higher Dunn’s Index.
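Formally, for clusters C_1, ..., C_K the index can be written as
$$D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{1 \le m \le K} \operatorname{diam}(C_m)}$$
where d(C_i, C_j) is the smallest distance between a point of C_i and a point of C_j, and diam(C_m) is the largest distance between two points of C_m.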
As we have readily available functions to compute the Dunn’s Index, we will use that method instead.
Source: https://www.datacamp.com/community/tutorials/hierarchical-clustering-R
Documentation: https://www.rdocumentation.org/packages/clValid/versions/0.6-7/topics/dunn
Get_Dunn <- function(x){
  groups <- cutree(hclust_ward, k = x)
  # Only the indicator columns are passed as Data: including the cluster labels
  # themselves would distort the distance computations behind the index.
  dunn(clusters = groups, Data = Final_Indicators.norm, method = "euclidean")
}
Cluster_numbers <- 2:30
Dunn_index <- sapply(Cluster_numbers, Get_Dunn) # sapply() returns a plain numeric vector for plotting
plot(Cluster_numbers, Dunn_index)
Based on our results, there appears to be an upward jump in the Dunn’s Index every 5-6 added clusters. For meaningful interpretation, the number of clusters should be kept small. Given the plot, the optimal number of clusters for our case appears to be 6, which yields a reasonably high Dunn’s Index without over-segmenting the data.
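As an illustrative cross-check of this visual reading (not part of the original pipeline), we can print the index values for the small-k range:
# Inspect Dunn's Index for k = 2..10; interpretability still matters, so this is
# a guide rather than a hard rule.
small_k <- Cluster_numbers <= 10
data.frame(k = Cluster_numbers[small_k], dunn = unlist(Dunn_index)[small_k])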
plot(hclust_ward, cex=0.6)
rect.hclust(hclust_ward, k= 6, border = 2:8)
heatmaply(normalize(Cluster_names.mat),
          Colv = NA,
          dist_method = "euclidean", # distance measure used between rows
          hclust_method = "ward.D",  # agglomeration method, matching our earlier choice
          seriate = "OLO", # optimal leaf ordering; chosen by trial and error for the most distinguishable layout
          colors = Blues,
          k_row = 6,
          margins = c(NA, 200, 60, NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main = "Segmentation of Singapore Subzones by indicators",
          xlab = "Indicators",
          ylab = "Subzones"
          )
Merging_sf <- Mapping_sf %>%
select(SUBZONE_N, geometry)
Merging_sf <- left_join(Merging_sf,Clustering_indicators.norm, by="SUBZONE_N")
Final_clusters <- as.factor(cutree(hclust_ward, k=6))
Mapping_cluster_sf <- cbind(Merging_sf, as.matrix(Final_clusters)) %>%
rename(`CLUSTER` = `as.matrix.Final_clusters.`)
tmap_mode("view")
Indicator_map <- tm_shape(Mapping_cluster_sf)+
tm_fill(col="CLUSTER",
title=" Indicator Clusters")+
tm_borders(alpha=0.5)
Indicator_map
The choropleth map above reveals that the clusters are very fragmented. This is one of the major limitations of using a non-spatial clustering algorithm such as hierarchical cluster analysis. That said, we can clearly see that Cluster 1 indicates areas with very little population; these zones include the ports, the airport and the lower-population industrial areas.
We need to factor in geospatial factors in order to achieve homogeneous clusters. We will use the SKATER approach in order to achieve this.
nzmean <- function(x) {
zvals <- x==0
if (all(zvals)) 0 else mean(x[!zvals])
}
We exclude zeros when computing the averages so that empty subzones do not drag the cluster means down; a quick illustration follows.
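A quick illustration of nzmean()'s behaviour:
nzmean(c(0, 2, 4)) # returns 3 rather than 2, since the zero is excluded
nzmean(c(0, 0))    # returns 0 instead of NaN for all-zero vectors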
Cluster_summary <- Mapping_cluster_sf %>%
st_set_geometry(NULL) %>%
group_by(CLUSTER) %>%
summarise(ACTIVE_RATIO=nzmean(ACTIVE_RATIO), AGED_RATIO=nzmean(AGED_RATIO), DENSITY=nzmean(Density), HDB1_2RM_RATIO=nzmean(HDB1_2RM_RATIO), HDB3RM_RATIO=nzmean(HDB3RM_RATIO), HDB4RM_RATIO=nzmean(HDB4RM_RATIO),HDB5RM_EF_RATIO=nzmean(HDB5RM_EF_RATIO), CONDO_APART_RATIO=nzmean(CONDO_APART_RATIO), LANDED_RATIO=nzmean(LANDED_RATIO), BUSINESSES=nzmean(Businesses), INDUSTRY=nzmean(Industry), FINANCIAL=nzmean(Financial), EMBASSIES=nzmean(Embassies),RESIDENTIAL=nzmean(Residential),SHOPPING=nzmean(Shopping))
Cluster_summary.molten <- melt(Cluster_summary, na.rm=TRUE)
Cluster_summary_plot <- ggplot(Cluster_summary.molten, aes(variable, value, fill = variable)) +
facet_wrap(~ CLUSTER) +
geom_bar(stat="identity", show_guide=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Cluster_summary_plot
As we can see, Cluster 1 consists entirely of non-residential areas with no population at all. This is useful, because these areas are still relevant to our analysis as commercial or industrial zones.
Interestingly, Cluster 6 scores highest on the service indicators in areas with very few HDB buildings but a high number of condominiums. Clusters 3 and 6 also show two different groups of condominium dwellers: Cluster 3 leans towards residential zones, whereas Cluster 6 sits near the services in the city centre.
Cluster 4 appears to capture the high-density residential zones. Referencing the map, these tend to be smaller polygons with large numbers of HDB 4-room and 5-room blocks, and apartments.
All clusters have a significantly high ratio of economically active population except for one, which has a majority aged population. This could simply be due to the low total population counts in these areas, which makes the cause hard to pin down.
Additionally, the high CONDO_APART_RATIO and LANDED_RATIO values may be causing some issues with the clustering, as they are extreme compared with the other indicators.
In this section, we will derive spatially constrained clusters by using the SKATER method.
Merging_sp <- as_Spatial(Merging_sf)
Merging.nb <- poly2nb(Merging_sp)
summary(Merging.nb)
## Neighbour list object:
## Number of regions: 313
## Number of nonzero links: 1862
## Percentage nonzero weights: 1.900601
## Average number of links: 5.948882
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11
## 2 4 12 32 79 80 50 32 16 3 3
## 2 least connected regions:
## 228 289 with 1 link
## 3 most connected regions:
## 43 259 294 with 11 links
plot(Merging_sp, border=grey(.5), main= "Nearest Neighbor Connections")
plot(Merging.nb, coordinates(Merging_sp), col="blue", add=TRUE)
Next, nbcosts() of the spdep package is used to compute the cost of each edge, i.e. the distance between the nodes it connects. The function computes this distance using a data frame with a vector of observations for each node.
The code chunk below is used to compute the cost of each edge.
lcosts <- nbcosts(Merging.nb, Cluster_names.norm)
For each observation, this gives the pairwise dissimilarity between its values on the 15 indicator variables and the values of its neighbouring observations (taken from the neighbour list). Essentially, this is the notion of a generalized weight for a spatial weights matrix.
Next, we will incorporate these costs into a weights object, in the same way that we would for inverse-distance weights. In other words, we convert the neighbour list into a list-weights object, specifying the just-computed lcosts as the weights.
In order to achieve this, nb2listw() of spdep package is used as shown in the code chunk below.
GWeighted.w <- nb2listw(Merging.nb, lcosts, style="B") # B is for Binary weights
summary(GWeighted.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 313
## Number of nonzero links: 1862
## Percentage nonzero weights: 1.900601
## Average number of links: 5.948882
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11
## 2 4 12 32 79 80 50 32 16 3 3
## 2 least connected regions:
## 228 289 with 1 link
## 3 most connected regions:
## 43 259 294 with 11 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 313 97969 1529.634 3119.553 36023.12
GWeighted.mst <- mstree(GWeighted.w)
class(GWeighted.mst)
## [1] "mst" "matrix"
Checking the class of the matrix to make sure the conversion was correct.
A minimum spanning tree over n nodes always has n - 1 edges, so with 313 polygons we expect 312 rows in the output below.
dim(GWeighted.mst)
## [1] 312 3
plot(Merging_sp, border=gray(.5), main="Weighted Nearest Neighbor Connections")
plot.mst(GWeighted.mst, coordinates(Merging_sp),
col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)
Here we can see the final connections after incorporating our indicators in the information space.
Cluster6 <- skater(GWeighted.mst[,1:2], Cluster_names.norm, 5) #we want 6 clusters based on our earlier calculations
The skater() function takes three mandatory arguments:
- the first two columns of the MST matrix (i.e. not the cost),
- the data matrix (to update the costs as units are being grouped), and
- the number of cuts, which is set to one less than the number of clusters; the value specified is the number of cuts in the graph, not the number of clusters.
groups_mat <- as.matrix(Cluster6$groups)
Mapping_spatialcluster_sf <- cbind(Mapping_cluster_sf, as.factor(groups_mat)) %>%
rename(SP_CLUSTER=as.factor.groups_mat.) #this is the default output field name. it was a bit hard to read so we renamed it.
Spatial_cluster_map <- tm_shape(Mapping_spatialcluster_sf)+
tm_fill(col="SP_CLUSTER",
title="Spatially Weighted Clusters")+
tm_borders(alpha=0.5)
Spatial_cluster_map
Spatial_Cluster_summary <- Mapping_spatialcluster_sf %>%
st_set_geometry(NULL) %>%
group_by(SP_CLUSTER) %>%
summarise(ACTIVE_RATIO=nzmean(ACTIVE_RATIO), AGED_RATIO=nzmean(AGED_RATIO), DENSITY=nzmean(Density), HDB1_2RM_RATIO=nzmean(HDB1_2RM_RATIO), HDB3RM_RATIO=nzmean(HDB3RM_RATIO), HDB4RM_RATIO=nzmean(HDB4RM_RATIO),HDB5RM_EF_RATIO=nzmean(HDB5RM_EF_RATIO), CONDO_APART_RATIO=nzmean(CONDO_APART_RATIO), LANDED_RATIO=nzmean(LANDED_RATIO), BUSINESSES=nzmean(Businesses), INDUSTRY=nzmean(Industry), FINANCIAL=nzmean(Financial), EMBASSIES=nzmean(Embassies),RESIDENTIAL=nzmean(Residential),SHOPPING=nzmean(Shopping))
Spatial_Cluster_summary.molten <- melt(Spatial_Cluster_summary, na.rm=TRUE)
Spatial_cluster_Summary_plot <- ggplot(Spatial_Cluster_summary.molten, aes(variable, value, fill = variable)) +
facet_wrap(~ SP_CLUSTER) +
geom_bar(stat="identity", show_guide=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Spatial_cluster_Summary_plot
Based on the information presented, the geographical clustering has produced a general residential-population area represented by Cluster 1, although it also appears to include the airport, which may simply be a consequence of the nearest-neighbour (contiguity) constraint.
Cluster 2 appears to be a more business- and industry-oriented area together with some HDB residential parts. We can characterise this group as industrial workers.
Cluster 3 is defined by a high condominium and apartment ratio. It is likely a more affluent residential zone within the shopping and commercial areas in town.
Cluster 4 appears to consist of industrial port areas as well as embassies, financial and shopping urban functions, and is also defined by a complete lack of residents. This zone is likely the purely commercial and industrial part of the downtown area.
Cluster 5 appears to be defined by its extremely high ratio of landed-property dwellers. This zone appears to be an affluent pocket in the north-western part of Singapore.
Cluster 6 appears to consist primarily of condominiums and landed properties with no HDB dwellers. It could be considered a more affluent pocket in the north-eastern part of Singapore.
Overall, the SKATER approach seems to have over-merged some of the zones. In general, Singapore’s social groups appear to be spread out across the country in smaller pockets rather than geographically segmented from one another.
Comparing the two maps, we can see very little consistency between the clusterings. In fact, Cluster 1 in the indicator clustering has been split across Clusters 2, 4 and 5 in the spatially weighted clustering, as the cross-tabulation sketch below illustrates. This could underscore an issue with relying on the binary style in our nb2listw() call, which we could examine further. Additionally, more clusters could be added to produce more sharply defined social areas.
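One illustrative way to quantify this disagreement (not part of the original pipeline) is to cross-tabulate the two sets of labels; counts spread across a row indicate an indicator cluster that the spatial constraint has split apart:
# Rows: information-space clusters; columns: spatially weighted clusters.
table(Indicator = Mapping_spatialcluster_sf$CLUSTER,
      Spatial = Mapping_spatialcluster_sf$SP_CLUSTER)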
For this section, we will examine the effect of adding more clusters on defining the social areas, starting with a re-check of the Dunn’s Index plot.
plot(Cluster_numbers,Dunn_index)
As we can see based on the Dunn’s Index, the next jump appears to be around 13 clusters.
thirteen_clusters <- as.factor(cutree(hclust_ward, k=13))
thirteen_cluster_sf <- cbind(Merging_sf, as.matrix(thirteen_clusters)) %>%
rename(`CLUSTER` = `as.matrix.thirteen_clusters.`)
tmap_mode("view")
thirteen_clusters_map <- tm_shape(thirteen_cluster_sf)+
tm_fill(col="CLUSTER",
title=" New Clusters")+
tm_borders(alpha=0.5)
thirteen_clusters_map
Thirteen_Cluster_summary <- thirteen_cluster_sf %>%
st_set_geometry(NULL) %>%
group_by(CLUSTER) %>%
summarise(ACTIVE_RATIO=nzmean(ACTIVE_RATIO), AGED_RATIO=nzmean(AGED_RATIO), DENSITY=nzmean(Density), HDB1_2RM_RATIO=nzmean(HDB1_2RM_RATIO), HDB3RM_RATIO=nzmean(HDB3RM_RATIO), HDB4RM_RATIO=nzmean(HDB4RM_RATIO),HDB5RM_EF_RATIO=nzmean(HDB5RM_EF_RATIO), CONDO_APART_RATIO=nzmean(CONDO_APART_RATIO), LANDED_RATIO=nzmean(LANDED_RATIO), BUSINESSES=nzmean(Businesses), INDUSTRY=nzmean(Industry), FINANCIAL=nzmean(Financial), EMBASSIES=nzmean(Embassies),RESIDENTIAL=nzmean(Residential),SHOPPING=nzmean(Shopping))
Thirteen_Cluster_summary.molten <- melt(Thirteen_Cluster_summary, na.rm=TRUE)
Thirteen_Cluster_summary_plot <- ggplot(Thirteen_Cluster_summary.molten, aes(variable, value, fill = variable)) +
facet_wrap(~ CLUSTER) +
geom_bar(stat="identity", show_guide=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Thirteen_Cluster_summary_plot
Unfortunately, adding extra clusters does not seem to make the social areas more defined in the information space alone. However, we can run a quick SKATER pass to see whether they make more sense when geographically weighted.
Cluster13 <- skater(GWeighted.mst[,1:2], Cluster_names.norm, 12) #we want 13 clusters based on our earlier calculations
groups_mat13 <- as.matrix(Cluster13$groups)
Mapping_spatialcluster13_sf <- cbind(Mapping_cluster_sf, as.factor(groups_mat13)) %>%
rename(SP_CLUSTER=as.factor.groups_mat13.) #this is the default output field name. it was a bit hard to read so we renamed it.
Spatial_cluster13_map <- tm_shape(Mapping_spatialcluster13_sf)+
tm_fill(col="SP_CLUSTER",
title="Spatially Weighted Clusters")+
tm_borders(alpha=0.5)
Spatial_cluster13_map
Spatial_Cluster13_summary <- Mapping_spatialcluster13_sf %>%
st_set_geometry(NULL) %>%
group_by(SP_CLUSTER) %>%
summarise(ACTIVE_RATIO=nzmean(ACTIVE_RATIO), AGED_RATIO=nzmean(AGED_RATIO), DENSITY=nzmean(Density), HDB1_2RM_RATIO=nzmean(HDB1_2RM_RATIO), HDB3RM_RATIO=nzmean(HDB3RM_RATIO), HDB4RM_RATIO=nzmean(HDB4RM_RATIO),HDB5RM_EF_RATIO=nzmean(HDB5RM_EF_RATIO), CONDO_APART_RATIO=nzmean(CONDO_APART_RATIO), LANDED_RATIO=nzmean(LANDED_RATIO), BUSINESSES=nzmean(Businesses), INDUSTRY=nzmean(Industry), FINANCIAL=nzmean(Financial), EMBASSIES=nzmean(Embassies),RESIDENTIAL=nzmean(Residential),SHOPPING=nzmean(Shopping))
Spatial_Cluster13_summary.molten <- melt(Spatial_Cluster13_summary, na.rm=TRUE)
Spatial_cluster13_Summary_plot <- ggplot(Spatial_Cluster13_summary.molten, aes(variable, value, fill = variable)) +
facet_wrap(~ SP_CLUSTER) +
geom_bar(stat="identity", show_guide=FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Spatial_cluster13_Summary_plot
Now we can see slightly better-defined social areas. Clusters 1, 8 and 9 appear to be high-density residential zones with slight variations in the services available.
Clusters 3, 5, 6 and 10 appear to be slightly more affluent low-density residential zones with slight differences in housing type in specific areas.
Clusters 4 and 11 appear to be commercial zones with virtually no residents in the area.
Clusters 2, 7, 12 and 13 appear to be mixes of residential and commercial zones in differing proportions, most likely residences close to work areas. The major difference is whether they lie inside or outside the city zones, which affects the number of surrounding services.
Overall, using more clusters defines the social areas slightly better, but the same demographic profiles keep re-forming as separate pockets. This supports the idea that Singapore spreads its demographic groups across several parts of the island, with no single segment of the island characterizing a specific social group.