1 Task Overview

Students are tasked with segmenting Singapore at the planning subzone level into homogeneous socioeconomic areas by combining geodemographic data from the Singapore Department of Statistics with urban functions extracted from the geospatial data provided.

The task involves 3 parts:

  1. Geospatial and Aspatial Data Wrangling

  2. Performing clustering based on the informational space of the data

  3. Forming homogeneous clusters by using geographically weighted segmentation

1.1 Loading in Packages

The R packages needed for this exercise are as follows:

  • Spatial data handling
    • sf, spdep and rgdal (loading rgdal also loads sp, whose proj4string() we use for projection inspection)
  • Geospatial and hierarchical cluster analysis
    • ClustGeo, heatmaply, cluster
    • clValid (used for calculating Dunn’s Index, which helps determine the best number of clusters)
  • Attribute data handling
    • tidyverse (which includes readr, ggplot2 and dplyr)
    • reshape2 (for easy conversion of dataframes to long format)
  • Correlation analysis
    • corrplot
  • Choropleth mapping
    • tmap
packages = c('rgdal', 'spdep', 'ClustGeo',  'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse', "clValid", "reshape2")
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p,character.only = T)
}
set.seed(1234) #For consistency

2 Geospatial and Aspatial Data Wrangling

2.1 Aspatial Data Wrangling

For the purpose of this analysis, we will need to introduce demographic data to form some of our indicators. We can retrieve them from government sources and map them to the planning subzones further on.

Source: https://data.gov.sg/dataset/singapore-residents-by-subzone-and-type-of-dwelling-2011-2019

2.1.1 Loading in Aspatial Data

SG_Residents <- read_csv("data/aspatial/planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv")

2.1.2 Analyzing input dataset

summary(SG_Residents)
##  planning_area        subzone           age_group             sex           
##  Length:883728      Length:883728      Length:883728      Length:883728     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  type_of_dwelling   resident_count         year     
##  Length:883728      Min.   :   0.00   Min.   :2011  
##  Class :character   1st Qu.:   0.00   1st Qu.:2013  
##  Mode  :character   Median :   0.00   Median :2015  
##                     Mean   :  39.83   Mean   :2015  
##                     3rd Qu.:  10.00   3rd Qu.:2017  
##                     Max.   :2860.00   Max.   :2019

We can see that our data has no missing values, which means that no clean-up is required. However, the data spans multiple years and includes some variables we do not need, so we will filter and aggregate it before use.

2.1.3 Filtering dataset by 2019

SG_Residents_2019 <- SG_Residents %>%
  filter(year == 2019)

Now we have only the 2019 data which is relevant to us.

2.1.4 Creating Demographic Indicators

The indicators specified in the assignment document are as follows:

  • Economically active population (i.e. age 25-64)
  • Young group (i.e. age 0-24)
  • Aged group (i.e. age 65 and above)
  • Population density

The population density will be defined as the total resident count divided by the area of the polygon (m^2), which will be derived further on.

Demographic_data <- SG_Residents_2019 %>%
  spread(age_group, resident_count) %>%
  mutate_at(.vars = vars(planning_area, subzone), .funs = toupper) %>% # uppercase to match the subzone layer later
  group_by(planning_area, subzone) %>%
  mutate(YOUNG=`0_to_4`+`5_to_9`+`10_to_14`+`15_to_19`+`20_to_24`) %>%
  mutate(ACTIVE=`25_to_29`+`30_to_34`+`35_to_39`+`40_to_44`+`45_to_49`+`50_to_54`+`55_to_59`+`60_to_64`) %>%
  mutate(AGED=`65_to_69`+`70_to_74`+`75_to_79`+`80_to_84`+`85_to_89`+`90_and_over`) %>%
  summarise(YOUNG=sum(YOUNG), ACTIVE=sum(ACTIVE), AGED=sum(AGED), TOTAL=sum(YOUNG, ACTIVE, AGED))

2.1.5 Creating Housing Indicators

The indicators specified in the assignment document are as follows:

  • HDB1-2RM dwellers
  • HDB3-4RM dwellers
  • HDB5RM-Exec dwellers
  • Condominium and apartment dwellers
  • Landed property dwellers

We will define “dwellers” as the resident count for each property type in each subzone.

Housing_data <- SG_Residents_2019 %>%
  spread(type_of_dwelling, resident_count) %>%
  mutate_at(.vars = vars(planning_area, subzone), .funs = toupper) %>%
  group_by(planning_area, subzone) %>%
  summarise(HDB1_2RM = sum(`HDB 1- and 2-Room Flats`),
            HDB3RM = sum(`HDB 3-Room Flats`),
            HDB4RM = sum(`HDB 4-Room Flats`),
            HDB5RM_EF = sum(`HDB 5-Room and Executive Flats`),
            CONDO_APART = sum(`Condominiums and Other Apartments`),
            LANDED = sum(`Landed Properties`))

As “Others” and “HUDC Apartments” are not among the indicators of interest, and we are unable to determine which category they would fall under, they will be excluded for the purpose of this study.

2.1.6 Inspecting Demographic Indicators

summary(Demographic_data)
##  planning_area        subzone              YOUNG           ACTIVE     
##  Length:323         Length:323         Min.   :    0   Min.   :    0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:    0  
##  Mode  :character   Mode  :character   Median : 1170   Median : 2840  
##                                        Mean   : 3296   Mean   : 7386  
##                                        3rd Qu.: 4365   3rd Qu.:10285  
##                                        Max.   :34260   Max.   :79780  
##       AGED           TOTAL       
##  Min.   :    0   Min.   :     0  
##  1st Qu.:    0   1st Qu.:     0  
##  Median :  730   Median :  4890  
##  Mean   : 1806   Mean   : 12487  
##  3rd Qu.: 3000   3rd Qu.: 17065  
##  Max.   :18860   Max.   :132900

No missing values were found.

2.1.7 Inspecting Housing Indicators

summary(Housing_data)
##  planning_area        subzone             HDB1_2RM        HDB3RM     
##  Length:323         Length:323         Min.   :   0   Min.   :    0  
##  Class :character   Class :character   1st Qu.:   0   1st Qu.:    0  
##  Mode  :character   Mode  :character   Median :   0   Median :    0  
##                                        Mean   : 542   Mean   : 1803  
##                                        3rd Qu.: 605   3rd Qu.: 2090  
##                                        Max.   :4700   Max.   :31350  
##      HDB4RM        HDB5RM_EF      CONDO_APART        LANDED       
##  Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0.0  
##  1st Qu.:    0   1st Qu.:    0   1st Qu.:    0   1st Qu.:    0.0  
##  Median :    0   Median :    0   Median :  230   Median :    0.0  
##  Mean   : 4150   Mean   : 3297   Mean   : 1827   Mean   :  773.9  
##  3rd Qu.: 5590   3rd Qu.: 3660   3rd Qu.: 2835   3rd Qu.:  400.0  
##  Max.   :51400   Max.   :47960   Max.   :16770   Max.   :18820.0

No missing values were found.

2.1.8 Removing unneeded planning area column

This makes the merge later on easier and removes the need for additional clean-up.

Housing_data <- within(Housing_data, rm(planning_area)) #drops planning_area Column
Demographic_data <- within(Demographic_data, rm(planning_area)) # Drops planning_area Column

2.2 Geospatial Data Wrangling

In this case, the data has been provided to us for this assignment. The source and date of retrieval are not specified but we will assume they are all within 2019.

From the geospatial data provided, you are required to extract the following urban functions:

  • Government including embassy
  • Business
  • Industry
  • Shopping
  • Financial
  • Upmarket residential area

Based on inspection of the data provided, there is no separate Industry dataset. We will need to extract it from the Business dataset below.

2.2.1 Reading in Business Geospatial Data as Simple Feature Dataframes

Businesses_all <- st_read(dsn="data/geospatial/urban_function", layer="Business")
## Reading layer `Business' from data source `D:\GSA\Take_Home_EX03\data\geospatial\urban_function' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## geographic CRS: WGS 84
summary(Businesses_all)
##      POI_ID             SEQ_NUM         FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1.000   Min.   :5000  
##  1st Qu.:9.967e+08   1st Qu.:1.000   1st Qu.:5000  
##  Median :1.097e+09   Median :1.000   Median :5000  
##  Mean   :9.933e+08   Mean   :1.021   Mean   :5084  
##  3rd Qu.:1.108e+09   3rd Qu.:1.000   3rd Qu.:5000  
##  Max.   :1.204e+09   Max.   :3.000   Max.   :9991  
##                                                    
##                        POI_NAME                  ST_NAME    
##  CAMBRIDGE INDUSTRIAL TRUST:   8   TAGORE LN         :  83  
##  DHL                       :   6   JOO KOON CIR      :  80  
##  NATIONAL OILWELL VARCO    :   6   GUL CIR           :  62  
##  ST MICROELECTRONICS       :   6   KAKI BUKIT PL     :  53  
##  CWT                       :   5   KAKI BUKIT IND TER:  52  
##  HALLIBURTON               :   5   (Other)           :5941  
##  (Other)                   :6514   NA's              : 279  
##           geometry   
##  POINT        :6550  
##  epsg:4326    :   0  
##  +proj=long...:   0  
##                      
##                      
##                      
## 

We can see that FAC_TYPE takes two different values (5000 and 9991), indicating that we need to split the Industry indicator out of Businesses_all, which we do below.
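
To verify this before filtering, we can tabulate the codes directly (a quick sketch; the counts will come from the data itself):

table(Businesses_all$FAC_TYPE) # expected: only the codes 5000 and 9991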

2.2.2 Separating Industry Businesses from Other businesses

By inspecting the names of the companies, we are able to determine which FAC_TYPE corresponds to Industry and which to Business. However, it should be noted that there are cases where an industrial building is still labelled as FAC_TYPE 5000, which may be a data quality issue.

Industry <- Businesses_all %>% 
  filter(FAC_TYPE==9991)

Businesses <- Businesses_all %>%
  filter(FAC_TYPE==5000)

2.2.3 Loading in the rest of the Geospatial Data provided

Financial <- st_read(dsn="data/geospatial/urban_function", layer="Financial")
Embassies <- st_read(dsn="data/geospatial/urban_function", layer="Govt_Embassy")
Residential <- st_read(dsn="data/geospatial/urban_function", layer="Private residential")
Shopping <- st_read(dsn="data/geospatial/urban_function", layer="Shopping")
mpsz <- st_read(dsn="data/geospatial/subzone", layer="MP14_SUBZONE_WEB_PL")

You should inspect all of them for differences in FAC_TYPE. For the purpose of our analysis, however, we will assume the remaining layers do not require separation, as they map directly to our required indicators.

2.2.4 Creating testing functions for validity and NA values

# Retrieves a quick breakdown of the number of NA rows and invalid polygons/points
Validity_NA_Check <- function(target_st) {
  validity <- st_is_valid(target_st)
  NA_rows <- target_st[rowSums(is.na(target_st))!=0,]
  Invalid_rows <- which(validity==FALSE)
  print(paste("For:", deparse(substitute(target_st))))
  print(paste("Number of Invalid polygons/points is:", length(Invalid_rows)))
  print(paste("Number of NA rows is:", nrow(NA_rows))) # nrow(), not length(): length() on a data frame returns the number of columns
}

# Retrieves the exact polygon which is invalid
get_invalid <- function(target_st) {
  validity <- st_is_valid(target_st)
  Invalid_rows <- which(validity==FALSE)
  return(Invalid_rows)
}

# Retrieves the exact rows which contain NA values for you to check the columns
get_NA_rows <- function(target_st) {
  NA_rows <- target_st[rowSums(is.na(target_st))!=0,]
  return(NA_rows)
}

# A cleaning function that replaces NA with "Missing" so that calculations can still be done.
## This function is a little unnecessary here, as we will not be using the attribute data attached to the geospatial points.
replace_NA_with_unknown <- function(x){
  x %>%
    mutate_if(is.factor, ~ factor(replace(as.character(.), is.na(.), "Missing")))
}

2.2.5 Testing Validity and NA checks on Geospatial Data

Validity_NA_Check(Businesses)
## [1] "For: Businesses"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Industry)
## [1] "For: Industry"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Financial)
## [1] "For: Financial"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 30"
Validity_NA_Check(Embassies)
## [1] "For: Embassies"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Residential)
## [1] "For: Residential"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Shopping)
## [1] "For: Shopping"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(mpsz)
## [1] "For: mpsz"
## [1] "Number of Invalid polygons/points is: 9"
## [1] "Number of NA rows is: 16"

There are several NA values within the dataframes. For the purpose of our analysis, we do not strictly need to replace them, as we only care about the number of points within each polygon. However, for hygiene’s sake, we will attempt to clear the NA values.

2.2.6 Inspecting invalid data and NA

# Retrieve Rows which are NA so we can inspect which columns are missing
get_NA_rows(Businesses)
get_NA_rows(Industry)
get_NA_rows(Financial)
get_NA_rows(Embassies)
get_NA_rows(Residential)
get_NA_rows(Shopping)
get_NA_rows(mpsz)

get_invalid(mpsz) #Finds which polygons are invalid

We will hide the output for simplicity of reporting.

2.2.7 Cleaning Geospatial Data

mpsz <- st_make_valid(mpsz) 

We will use the st_make_valid() function from the sf package to convert the invalid polygons to valid ones.

Businesses <- replace_NA_with_unknown(Businesses)
Industry <- replace_NA_with_unknown(Industry)
Financial <- replace_NA_with_unknown(Financial)
Embassies <- replace_NA_with_unknown(Embassies)
Residential <- replace_NA_with_unknown(Residential)
Shopping <- replace_NA_with_unknown(Shopping)
mpsz <- replace_NA_with_unknown(mpsz)

One final check and our data should be clear.

Validity_NA_Check(Businesses)
## [1] "For: Businesses"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Industry)
## [1] "For: Industry"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Financial)
## [1] "For: Financial"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 30"
Validity_NA_Check(Embassies)
## [1] "For: Embassies"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Residential)
## [1] "For: Residential"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(Shopping)
## [1] "For: Shopping"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 6"
Validity_NA_Check(mpsz)
## [1] "For: mpsz"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 16"

The factor NA values have been replaced with “Missing” and the invalid geometries repaired.

2.2.8 Setting CRS for our projections

In order to properly map our data together, we need to standardize the CRS to match our Singapore study area. We will transform all layers to SVY21 (EPSG:3414), Singapore’s projected coordinate system.

mpsz <- st_transform(mpsz, crs= 3414)
Businesses <- st_transform(Businesses, crs= 3414)
Industry <- st_transform(Industry, crs= 3414)
Financial <- st_transform(Financial, crs= 3414)
Embassies <- st_transform(Embassies, crs= 3414)
Residential <- st_transform(Residential, crs= 3414)
Shopping <- st_transform(Shopping, crs= 3414)

2.2.9 Inspecting the Projection String

This will allow us to see what units of measurement we are dealing with, as well as validate that the st_transform() calls above have worked.
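
As an aside, the same check can be done natively in sf without converting to sp (a one-line sketch):

st_crs(mpsz) # should report EPSG 3414 (SVY21) after the transform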

Converting them to Spatial*DataFrame objects for inspection:

mpsz.spdf <- as_Spatial(mpsz)
Businesses.spdf <- as_Spatial(Businesses)
Industry.spdf <- as_Spatial(Industry)
Financial.spdf <- as_Spatial(Financial)
Embassies.spdf <- as_Spatial(Embassies)
Residential.spdf <- as_Spatial(Residential)
Shopping.spdf <- as_Spatial(Shopping)

Using proj4string() to identify the units of measurement and validate that the conversion matches:

proj4string(mpsz.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Businesses.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Industry.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Financial.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Embassies.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Residential.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
proj4string(Shopping.spdf)
## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

Our data is now all aligned, with units in metres. We can proceed to plot the datasets to inspect them.

2.2.10 Doing a simple plot to inspect data

plot(mpsz.spdf, main= "Singapore Urban Functions")
# st_geometry() plots only the point geometries; plotting a full sf object would draw one panel per attribute
plot(st_geometry(Businesses), col="blue", add=TRUE)
plot(st_geometry(Industry), col="green", add=TRUE)
plot(st_geometry(Financial), col="yellow", add=TRUE)
plot(st_geometry(Embassies), col="purple", add=TRUE)
plot(st_geometry(Residential), col="orange", add=TRUE)
plot(st_geometry(Shopping), col="red", add=TRUE)

Based on the map above, we can see that our datasets are properly plotted, although some data points fall within the water catchment areas and the airport. This will be dealt with further down.

2.3 Merging Datasets

Now that we have prepared our dataframes, we need to combine them into a single dataframe to perform our analysis.

2.3.1 Merging Geospatial Data-points with Geospatial Polygons

For this analysis, we are interested in how many points of each urban function fall within each subzone, so we simply need a count of the points falling within each polygon. We will use the st_intersects() function from the sf package and take the lengths() of its output to count how many points fall within each polygon.

mpsz$Businesses <- lengths(st_intersects(mpsz, Businesses))
mpsz$Industry <- lengths(st_intersects(mpsz, Industry))
mpsz$Financial <- lengths(st_intersects(mpsz, Financial))
mpsz$Embassies <- lengths(st_intersects(mpsz, Embassies))
mpsz$Residential <- lengths(st_intersects(mpsz, Residential))
mpsz$Shopping <- lengths(st_intersects(mpsz, Shopping))

2.3.2 Merging Geospatial and Aspatial Simple Feature Dataframes together into a super frame

Here, we will be using the left_join() function from the dplyr package, which makes it easy to assemble our final super dataframe with all our desired indicators.

SG_Superframe <- left_join(mpsz , Demographic_data, by = c("SUBZONE_N" = "subzone"))
SG_Superframe$Density <- (SG_Superframe$TOTAL / SG_Superframe$SHAPE_Area) 
SG_Superframe <- left_join(SG_Superframe, Housing_data, c("SUBZONE_N" = "subzone"))

2.3.3 Cleaning up the SG_Superframe

SG_Superframe <- within(SG_Superframe, rm(SUBZONE_NO, SUBZONE_C, CA_IND, PLN_AREA_C, REGION_C, INC_CRC, FMEL_UPD_D))

2.4 Removing Non-Essential Polygons from study

2.4.1 Selecting polygons to be removed

For this study, we will concentrate on mainland Singapore in order to properly perform our cluster analysis. This is primarily because we need contiguity-based neighbor weights further down the line when conducting our geographically weighted cluster analysis, and the outlying islands have no contiguous neighbors.

SG_Superframe <-SG_Superframe[!SG_Superframe$SUBZONE_N=="CENTRAL WATER CATCHMENT",] #Some odd datapoints in there
SG_Superframe <-SG_Superframe[!SG_Superframe$PLN_AREA_N=="WESTERN ISLANDS",] # Removing external islands
SG_Superframe <-SG_Superframe[!SG_Superframe$PLN_AREA_N=="SOUTHERN ISLANDS",] # Removing external islands
SG_Superframe <-SG_Superframe[!SG_Superframe$PLN_AREA_N=="NORTH-EASTERN ISLANDS",] # Removing external islands
SG_Superframe <-SG_Superframe[!SG_Superframe$SUBZONE_N=="PULAU SELETAR",] # Removing external islands
SG_Superframe <-SG_Superframe[!SG_Superframe$SUBZONE_N=="WESTERN WATER CATCHMENT",] #Some odd datapoints in there
SG_Superframe <-SG_Superframe[!SG_Superframe$SUBZONE_N=="TUAS VIEW EXTENSION",] # Not built yet

# Removing polygons with zero on all indicator variables.
# rowSums() gives a per-polygon total; a plain sum() would collapse everything to a single number and filter nothing.
SG_Superframe <- SG_Superframe[rowSums(SG_Superframe[9:25] %>%
   st_set_geometry(NULL), na.rm = TRUE) != 0,]

2.4.2 Checking on final polygons

tm_shape(SG_Superframe)+
  tm_fill(col= "SHAPE_Leng",
          style="jenks",
          title = "Singapore Study Areas")+
  tm_borders(alpha=0.7)

3 Exploratory Data Analysis (EDA)

3.1 EDA using statistical graphics

3.1.1 Separating Indicators from Simple Feature DataFrame

Indicator_vars <- SG_Superframe %>%
  st_set_geometry(NULL) %>% # Dropping the geometry column because this is non-spatial hierarchical cluster analysis
  select("ACTIVE", "YOUNG", "AGED", "Density", "HDB1_2RM", "HDB3RM", "HDB4RM", "HDB5RM_EF", "CONDO_APART", "LANDED", "Businesses", "Industry", "Financial", "Embassies", "Residential", "Shopping")

3.1.2 Summary of Indicator Variables

summary(Indicator_vars)
##      ACTIVE          YOUNG            AGED          Density       
##  Min.   :    0   Min.   :    0   Min.   :    0   Min.   :0.00000  
##  1st Qu.:    0   1st Qu.:    0   1st Qu.:    0   1st Qu.:0.00000  
##  Median : 3420   Median : 1400   Median :  780   Median :0.00611  
##  Mean   : 7617   Mean   : 3398   Mean   : 1863   Mean   :0.01108  
##  3rd Qu.:11320   3rd Qu.: 4630   3rd Qu.: 3080   3rd Qu.:0.02061  
##  Max.   :79780   Max.   :34260   Max.   :18860   Max.   :0.04606  
##     HDB1_2RM          HDB3RM          HDB4RM        HDB5RM_EF    
##  Min.   :   0.0   Min.   :    0   Min.   :    0   Min.   :    0  
##  1st Qu.:   0.0   1st Qu.:    0   1st Qu.:    0   1st Qu.:    0  
##  Median :   0.0   Median :    0   Median :    0   Median :    0  
##  Mean   : 559.3   Mean   : 1860   Mean   : 4283   Mean   : 3403  
##  3rd Qu.: 770.0   3rd Qu.: 2130   3rd Qu.: 5950   3rd Qu.: 3830  
##  Max.   :4700.0   Max.   :31350   Max.   :51400   Max.   :47960  
##   CONDO_APART        LANDED          Businesses        Industry     
##  Min.   :    0   Min.   :    0.0   Min.   :  0.00   Min.   :0.0000  
##  1st Qu.:    0   1st Qu.:    0.0   1st Qu.:  0.00   1st Qu.:0.0000  
##  Median :  280   Median :    0.0   Median :  2.00   Median :0.0000  
##  Mean   : 1881   Mean   :  796.5   Mean   : 20.25   Mean   :0.3514  
##  3rd Qu.: 2950   3rd Qu.:  440.0   3rd Qu.: 14.00   3rd Qu.:0.0000  
##  Max.   :16770   Max.   :18820.0   Max.   :308.00   Max.   :8.0000  
##    Financial        Embassies       Residential        Shopping    
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.:  1.00   1st Qu.: 0.000   1st Qu.:  1.00   1st Qu.: 0.00  
##  Median :  5.00   Median : 0.000   Median :  4.00   Median : 0.00  
##  Mean   : 10.46   Mean   : 1.406   Mean   : 11.45   Mean   : 1.61  
##  3rd Qu.: 13.00   3rd Qu.: 1.000   3rd Qu.: 11.00   3rd Qu.: 1.00  
##  Max.   :134.00   Max.   :19.000   Max.   :217.00   Max.   :31.00

Based on the summary, we can already see that the data ranges of the variables differ greatly from one another, which is no surprise as Density is a derived metric while the others are raw counts. Nonetheless, we will need to examine their distributions in order to determine the appropriate standardization technique.

3.1.3 Plotting Histograms of indicators (raw)

Indicator_vars %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

Based on the histograms displayed above, it’s clear that the data is heavily skewed towards the lower values, so we cannot assume a normal distribution. However, many of the zeros may simply be subzones without that feature at all, so we will examine the data further by removing the zeros, which may be contributing to the skew in the graphs.

3.1.4 Plotting Histograms of indicators (no zeros)

no_zeros <- Indicator_vars %>%
   mutate_all(~ replace(., . == 0, NA)) # recode zeros as NA so they drop out of the histograms

no_zeros %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

Now we can see our data a little better. However, the distributions are still heavily skewed towards low values, so a Min-Max standardization is an appropriate choice.

3.1.5 Standardization using Min-Max Method

For this analysis, we will be using the normalize() function from the heatmaply package.
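
For reference, normalize() performs column-wise min-max scaling to the [0, 1] range. A minimal base R equivalent (a sketch, assuming the standard min-max formula) would be:

# Rescale a numeric vector to [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max(c(0, 5, 20)) # returns 0.00 0.25 1.00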

Indicator_vars.norm <- normalize(Indicator_vars)
summary(Indicator_vars.norm)
##      ACTIVE            YOUNG              AGED            Density      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.04287   Median :0.04086   Median :0.04136   Median :0.1327  
##  Mean   :0.09547   Mean   :0.09919   Mean   :0.09879   Mean   :0.2405  
##  3rd Qu.:0.14189   3rd Qu.:0.13514   3rd Qu.:0.16331   3rd Qu.:0.4474  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##     HDB1_2RM          HDB3RM            HDB4RM          HDB5RM_EF      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.1190   Mean   :0.05934   Mean   :0.08333   Mean   :0.07095  
##  3rd Qu.:0.1638   3rd Qu.:0.06794   3rd Qu.:0.11576   3rd Qu.:0.07986  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##   CONDO_APART         LANDED          Businesses          Industry      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.0167   Median :0.00000   Median :0.006494   Median :0.00000  
##  Mean   :0.1122   Mean   :0.04232   Mean   :0.065744   Mean   :0.04393  
##  3rd Qu.:0.1759   3rd Qu.:0.02338   3rd Qu.:0.045455   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##    Financial          Embassies        Residential          Shopping      
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.007463   1st Qu.:0.00000   1st Qu.:0.004608   1st Qu.:0.00000  
##  Median :0.037313   Median :0.00000   Median :0.018433   Median :0.00000  
##  Mean   :0.078036   Mean   :0.07399   Mean   :0.052753   Mean   :0.05194  
##  3rd Qu.:0.097015   3rd Qu.:0.05263   3rd Qu.:0.050691   3rd Qu.:0.03226  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000

3.1.6 Plotting Normalized data

Indicator_vars.norm %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

Now we can see a little more clearly how some distributions differ from one another. For example, Density and HDB1_2RM are much more spread out than Shopping and Residential.

3.1.7 Analysis of Correlations

Indicator_vars.norm.cor = cor(Indicator_vars.norm) # base R function to compute the correlation matrix
corrplot.mixed(Indicator_vars.norm.cor, # mixed plot: lower and upper triangles use different representations
               lower = "ellipse", # glyph type for the lower triangle
               upper = "number", # show coefficients in the upper triangle
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Unfortunately, a number of variables have high collinearity with one another, which would affect both regression and clustering. We must decide whether these indicators are reliable as they are, or whether they must be further derived in order to form better clusters.

Based on the data presented, many of the population variables (both demographic and housing) are highly correlated with one another, probably because they all derive from the total resident count in each subzone. As such, we should use ratios rather than raw counts in order to properly capture the proportions of Young, Economically Active and Aged residents.

3.2 Building derived variables

3.2.1 Building new dataframe with derived values

SG_Derivedframe <- SG_Superframe %>%
  mutate(YOUNG_RATIO = ifelse((YOUNG == 0), 0, (YOUNG/TOTAL))) %>%
  mutate(ACTIVE_RATIO = ifelse((ACTIVE == 0), 0, (ACTIVE/TOTAL))) %>%
  mutate(AGED_RATIO = ifelse((AGED == 0), 0, (AGED/TOTAL)))%>%
  mutate(HDB1_2RM_RATIO = ifelse((HDB1_2RM == 0), 0, (HDB1_2RM/TOTAL)))%>%
  mutate(HDB3RM_RATIO = ifelse((HDB3RM == 0), 0, (HDB3RM/TOTAL)))%>%
  mutate(HDB4RM_RATIO = ifelse((HDB4RM == 0), 0, (HDB4RM/TOTAL)))%>%
  mutate(HDB5RM_EF_RATIO = ifelse((HDB5RM_EF == 0), 0, (HDB5RM_EF/TOTAL)))%>%
  mutate(CONDO_APART_RATIO = ifelse((CONDO_APART == 0), 0, (CONDO_APART/TOTAL)))%>%
  mutate(LANDED_RATIO = ifelse((LANDED == 0), 0, (LANDED/TOTAL)))

3.2.2 Separating Indicators from Simple Feature DataFrame

Indicator_vars_derived <- SG_Derivedframe %>%
  st_set_geometry(NULL) %>% # Dropping the geometry column because this is non-spatial hierarchical cluster analysis
  select("ACTIVE_RATIO", "YOUNG_RATIO", "AGED_RATIO", "Density", "HDB1_2RM_RATIO", "HDB3RM_RATIO", "HDB4RM_RATIO", "HDB5RM_EF_RATIO", "CONDO_APART_RATIO", "LANDED_RATIO", "Businesses", "Industry", "Financial", "Embassies", "Residential", "Shopping")

3.2.3 Plotting Derived Indicators

Indicator_vars_derived %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

With derived ratios, the distributions now differ more visibly across indicators, which should give us better features for properly clustering these areas.

3.2.4 Normalizing Derived Indicators

Indicator_vars_derived.norm <- normalize(Indicator_vars_derived)
summary(Indicator_vars_derived.norm)
##   ACTIVE_RATIO     YOUNG_RATIO       AGED_RATIO        Density      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.6406   Median :0.4048   Median :0.1362   Median :0.1327  
##  Mean   :0.4908   Mean   :0.3272   Mean   :0.1231   Mean   :0.2405  
##  3rd Qu.:0.6667   3rd Qu.:0.4880   3rd Qu.:0.1962   3rd Qu.:0.4474  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  HDB1_2RM_RATIO     HDB3RM_RATIO     HDB4RM_RATIO    HDB5RM_EF_RATIO 
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.04266   Mean   :0.1008   Mean   :0.2176   Mean   :0.1421  
##  3rd Qu.:0.04368   3rd Qu.:0.1666   3rd Qu.:0.4499   3rd Qu.:0.2530  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  CONDO_APART_RATIO  LANDED_RATIO       Businesses          Industry      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.05067   Median :0.00000   Median :0.006494   Median :0.00000  
##  Mean   :0.21585   Mean   :0.09239   Mean   :0.065744   Mean   :0.04393  
##  3rd Qu.:0.29954   3rd Qu.:0.05000   3rd Qu.:0.045455   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##    Financial          Embassies        Residential          Shopping      
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.007463   1st Qu.:0.00000   1st Qu.:0.004608   1st Qu.:0.00000  
##  Median :0.037313   Median :0.00000   Median :0.018433   Median :0.00000  
##  Mean   :0.078036   Mean   :0.07399   Mean   :0.052753   Mean   :0.05194  
##  3rd Qu.:0.097015   3rd Qu.:0.05263   3rd Qu.:0.050691   3rd Qu.:0.03226  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000

3.2.5 Analysis of Correlations for Derived Indicators

Indicator_vars_derived.norm.cor = cor(Indicator_vars_derived.norm) # base R function to compute the correlation matrix
corrplot.mixed(Indicator_vars_derived.norm.cor, # mixed plot: lower and upper triangles use different representations
               lower = "ellipse", # glyph type for the lower triangle
               upper = "number", # show coefficients in the upper triangle
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Based on the results above, we now have a much clearer picture of the relationships between the indicators after controlling for total resident count in each subzone. However, some variables are still highly correlated with one another.

There are multiple other tests, such as VIFs and F-tests, that we could conduct to get a clearer picture. Further discussion can be found here: https://www.r-bloggers.com/dealing-with-the-problem-of-multicollinearity-in-r/

For the purpose of our analysis, we will set a threshold to determine which indicators to drop. Existing literature tells us that correlations above 0.75 may lead to issues when running multiple regression analysis; see the discussion at https://www.researchgate.net/post/How_can_I_avoid_multicollinearity and in “Multivariate Data Analysis” (Hair et al., 2006).

With 0.75 as the threshold, we will drop YOUNG_RATIO as an indicator, as it has a high correlation with ACTIVE_RATIO.
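
As a sanity check, the offending pairs can be listed programmatically from the correlation matrix computed above (a base R sketch):

# Indices of indicator pairs whose absolute correlation exceeds the threshold
high_cor <- which(abs(Indicator_vars_derived.norm.cor) > 0.75 &
                    upper.tri(Indicator_vars_derived.norm.cor),
                  arr.ind = TRUE)
# Each row is one offending pair of indicators
data.frame(var1 = rownames(Indicator_vars_derived.norm.cor)[high_cor[, 1]],
           var2 = colnames(Indicator_vars_derived.norm.cor)[high_cor[, 2]])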

3.2.6 Creating Final Indicators

Final_Indicators <- SG_Derivedframe %>%
  st_set_geometry(NULL) %>% # Dropping the geometry column because this is non-spatial hierarchical cluster analysis
  select("ACTIVE_RATIO", "AGED_RATIO", "Density", "HDB1_2RM_RATIO", "HDB3RM_RATIO", "HDB4RM_RATIO", "HDB5RM_EF_RATIO", "CONDO_APART_RATIO", "LANDED_RATIO", "Businesses", "Industry", "Financial", "Embassies", "Residential", "Shopping")

3.2.7 Normalizing Final Indicators

Final_Indicators.norm <- normalize(Final_Indicators)
summary(Final_Indicators.norm)
##   ACTIVE_RATIO      AGED_RATIO        Density       HDB1_2RM_RATIO   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.6406   Median :0.1362   Median :0.1327   Median :0.00000  
##  Mean   :0.4908   Mean   :0.1231   Mean   :0.2405   Mean   :0.04266  
##  3rd Qu.:0.6667   3rd Qu.:0.1962   3rd Qu.:0.4474   3rd Qu.:0.04368  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##   HDB3RM_RATIO     HDB4RM_RATIO    HDB5RM_EF_RATIO  CONDO_APART_RATIO
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.05067  
##  Mean   :0.1008   Mean   :0.2176   Mean   :0.1421   Mean   :0.21585  
##  3rd Qu.:0.1666   3rd Qu.:0.4499   3rd Qu.:0.2530   3rd Qu.:0.29954  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##   LANDED_RATIO       Businesses          Industry         Financial       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.007463  
##  Median :0.00000   Median :0.006494   Median :0.00000   Median :0.037313  
##  Mean   :0.09239   Mean   :0.065744   Mean   :0.04393   Mean   :0.078036  
##  3rd Qu.:0.05000   3rd Qu.:0.045455   3rd Qu.:0.00000   3rd Qu.:0.097015  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000   Max.   :1.000000  
##    Embassies        Residential          Shopping      
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.004608   1st Qu.:0.00000  
##  Median :0.00000   Median :0.018433   Median :0.00000  
##  Mean   :0.07399   Mean   :0.052753   Mean   :0.05194  
##  3rd Qu.:0.05263   3rd Qu.:0.050691   3rd Qu.:0.03226  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000

3.2.8 Analysis of Correlations for Final Indicators

Final_Indicators.norm.cor = cor(Final_Indicators.norm) # base R function to compute the correlation matrix
corrplot.mixed(Final_Indicators.norm.cor, # mixed plot: lower and upper triangles use different representations
               lower = "ellipse", # glyph type for the lower triangle
               upper = "number", # show coefficients in the upper triangle
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

With this, we can see that our indicators are no longer too heavily correlated; all pairwise correlation coefficients now fall below the 0.75 threshold.

3.3 Loading in the final indicators

Clustering_indicators <- SG_Derivedframe %>%
  st_set_geometry(NULL) %>% # Dropping the geometry column because this is non-spatial hierarchical cluster analysis
  select("SUBZONE_N", "ACTIVE_RATIO", "AGED_RATIO", "Density", "HDB1_2RM_RATIO", "HDB3RM_RATIO", "HDB4RM_RATIO", "HDB5RM_EF_RATIO", "CONDO_APART_RATIO", "LANDED_RATIO", "Businesses", "Industry", "Financial", "Embassies", "Residential", "Shopping")

3.4 Visualizing Final Indicators using Choropleth mapping

Mapping_sf <- left_join(SG_Superframe, Clustering_indicators, by=c("SUBZONE_N"="SUBZONE_N"))
tm_shape(Mapping_sf) +
    tm_polygons(c("ACTIVE_RATIO", "AGED_RATIO", "Density.y", "HDB1_2RM_RATIO", "HDB3RM_RATIO", "HDB4RM_RATIO", "HDB5RM_EF_RATIO", "CONDO_APART_RATIO", "LANDED_RATIO", "Businesses.y", "Industry.y", "Financial.y", "Embassies.y", "Residential.y", "Shopping.y"),
                style="jenks") +
    tm_facets(sync = TRUE, ncol = 3) +
  tm_legend(legend.position = c("right", "bottom"))+
  tm_layout(outer.margins=0, asp=0)

Taking a quick snapshot of all our indicators across the subzones, we can see that the geographic distribution of high and low values varies considerably between indicators.

4 Hierarchical Cluster Analysis

4.1 Preparing the Data for Dendrogram

4.1.1 Renaming rows to subzones

row.names(Clustering_indicators) <- Clustering_indicators$SUBZONE_N

Cluster_names <- select(Clustering_indicators, c(2:16))

4.1.2 Data Standardization using normalization

Clustering_indicators.norm <- normalize(Clustering_indicators)
summary(Clustering_indicators.norm)
##   SUBZONE_N          ACTIVE_RATIO      AGED_RATIO        Density      
##  Length:313         Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.6406   Median :0.1362   Median :0.1327  
##                     Mean   :0.4908   Mean   :0.1231   Mean   :0.2405  
##                     3rd Qu.:0.6667   3rd Qu.:0.1962   3rd Qu.:0.4474  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  HDB1_2RM_RATIO     HDB3RM_RATIO     HDB4RM_RATIO    HDB5RM_EF_RATIO 
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.04266   Mean   :0.1008   Mean   :0.2176   Mean   :0.1421  
##  3rd Qu.:0.04368   3rd Qu.:0.1666   3rd Qu.:0.4499   3rd Qu.:0.2530  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  CONDO_APART_RATIO  LANDED_RATIO       Businesses          Industry      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.05067   Median :0.00000   Median :0.006494   Median :0.00000  
##  Mean   :0.21585   Mean   :0.09239   Mean   :0.065744   Mean   :0.04393  
##  3rd Qu.:0.29954   3rd Qu.:0.05000   3rd Qu.:0.045455   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##    Financial          Embassies        Residential          Shopping      
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.007463   1st Qu.:0.00000   1st Qu.:0.004608   1st Qu.:0.00000  
##  Median :0.037313   Median :0.00000   Median :0.018433   Median :0.00000  
##  Mean   :0.078036   Mean   :0.07399   Mean   :0.052753   Mean   :0.05194  
##  3rd Qu.:0.097015   3rd Qu.:0.05263   3rd Qu.:0.050691   3rd Qu.:0.03226  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000
Cluster_names.norm <- normalize(Cluster_names)
summary(Cluster_names.norm)
##   ACTIVE_RATIO      AGED_RATIO        Density       HDB1_2RM_RATIO   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.6406   Median :0.1362   Median :0.1327   Median :0.00000  
##  Mean   :0.4908   Mean   :0.1231   Mean   :0.2405   Mean   :0.04266  
##  3rd Qu.:0.6667   3rd Qu.:0.1962   3rd Qu.:0.4474   3rd Qu.:0.04368  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##   HDB3RM_RATIO     HDB4RM_RATIO    HDB5RM_EF_RATIO  CONDO_APART_RATIO
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.05067  
##  Mean   :0.1008   Mean   :0.2176   Mean   :0.1421   Mean   :0.21585  
##  3rd Qu.:0.1666   3rd Qu.:0.4499   3rd Qu.:0.2530   3rd Qu.:0.29954  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##   LANDED_RATIO       Businesses          Industry         Financial       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.007463  
##  Median :0.00000   Median :0.006494   Median :0.00000   Median :0.037313  
##  Mean   :0.09239   Mean   :0.065744   Mean   :0.04393   Mean   :0.078036  
##  3rd Qu.:0.05000   3rd Qu.:0.045455   3rd Qu.:0.00000   3rd Qu.:0.097015  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000   Max.   :1.000000  
##    Embassies        Residential          Shopping      
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.004608   1st Qu.:0.00000  
##  Median :0.00000   Median :0.018433   Median :0.00000  
##  Mean   :0.07399   Mean   :0.052753   Mean   :0.05194  
##  3rd Qu.:0.05263   3rd Qu.:0.050691   3rd Qu.:0.03226  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000

4.1.3 Transforming dataframe into matrix

Cluster_names.mat <- data.matrix(Cluster_names.norm)

4.1.4 Computing proximity matrix

proxmat <- dist(Cluster_names.mat, method = 'euclidean')

4.2 Computing hierarchical clustering

4.2.1 Determining the best clustering method

In order to determine which clustering method is most appropriate, we apply the different methods to our indicator data and compare their agglomerative coefficients.

m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")

Apply_agnes <- function(x) {
  agnes(Cluster_names.norm, method = x)$ac
}

map_dbl(m, Apply_agnes)
##   average    single  complete      ward 
## 0.8669997 0.8074952 0.9045193 0.9783950

As we want a coefficient as close to 1 as possible, we will use the “ward” method.

4.2.2 Creating clusters using ward.D method

hclust_ward <- hclust(proxmat, method = 'ward.D')

hclust_ward
## 
## Call:
## hclust(d = proxmat, method = "ward.D")
## 
## Cluster method   : ward.D 
## Distance         : euclidean 
## Number of objects: 313

4.3 Determining appropriate number of clusters

While often seen as an arbitrary task, we need to determine the number of clusters objectively in order to properly meet the objective of performing Social Area Analysis.

Based on existing literature, there are different methods to do this. They usually involve either some measure of dis/similarity (distance) between clusters, or adapting statistical rules or tests to determine the right number of clusters.

One method (although it tends to underestimate the actual number of clusters) is to look at the within-cluster similarity at each stage. This was introduced in 1953 by R. L. Thorndike (Psychometrika, 18(4), 267-276), resulting in the “Thorndike” (elbow) method that is widely used in social science.

Source: https://hlab.stanford.edu/brian/number_of_clusters_.html
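
Although we will not pursue it here, the elbow idea can be sketched with the objects defined above: compute the total within-cluster sum of squares for a range of k and look for the “knee” in the curve (a rough sketch, not the exact Thorndike procedure):

# Total within-cluster sum of squares for each candidate k
wss <- sapply(2:15, function(k) {
  groups <- cutree(hclust_ward, k = k)
  sum(sapply(unique(groups), function(g) {
    sub <- Cluster_names.mat[groups == g, , drop = FALSE]
    sum(scale(sub, scale = FALSE)^2) # squared deviations from the cluster centroid
  }))
})
plot(2:15, wss, type = "b", xlab = "Number of clusters (k)", ylab = "Within-cluster SS")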

Another method uses the Dunn’s Index: the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter, where the diameter of a cluster is the distance between its two furthest points. Well-separated, compact clusters yield a higher Dunn’s Index.

As we have readily available functions to compute the Dunn’s Index, we will use that method instead.

Source: https://www.datacamp.com/community/tutorials/hierarchical-clustering-R

Documentation: https://www.rdocumentation.org/packages/clValid/versions/0.6-7/topics/dunn
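
For intuition, the index can also be computed from first principles on our proximity matrix (a hedged sketch, not the clValid implementation; it assumes every cluster has at least two members):

dunn_sketch <- function(d, clusters) {
  d <- as.matrix(d)
  same <- outer(clusters, clusters, "==") # TRUE where both points share a cluster
  max_diam <- max(d[same & upper.tri(d)]) # largest intra-cluster distance
  min_sep  <- min(d[!same])               # smallest inter-cluster distance
  min_sep / max_diam
}
dunn_sketch(proxmat, cutree(hclust_ward, k = 6)) # comparable to clValid::dunn()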

4.3.1 Creating grouping function for testing Dunn’s Index

Get_Dunn <- function(x){
  groups <- cutree(hclust_ward, k=x)
  # Pass only the indicator columns as Data; binding the cluster labels into
  # the data would distort the distance computation.
  dunn_int <- dunn(clusters = groups, Data = Final_Indicators.norm, method = "euclidean")
  return(dunn_int)
}

4.3.2 Applying Dunn’s Index test to the number of clusters and plotting it on a graph

Cluster_numbers <- 2:30
Dunn_index <- sapply(Cluster_numbers, Get_Dunn) # sapply returns a numeric vector, which plot() accepts directly

plot(Cluster_numbers,Dunn_index)

Based on our results, there appears to be an upward jump in the Dunn’s Index every 5-6 added clusters. For meaningful interpretation, the number of clusters should be kept small. Given the graph, the optimal number of clusters for our case appears to be 6, which provides a high enough Dunn’s Index without overly segmenting the data.

4.4 Plotting Dendrogram

4.4.1 A simple cluster Dendrogram

plot(hclust_ward, cex=0.6)
rect.hclust(hclust_ward, k= 6, border = 2:8)

4.4.2 A Dendrogram showing correlation between variables and subzones

heatmaply(normalize(Cluster_names.mat),
          Colv=NA,
          dist_method = "euclidean", # distance metric for the dendrogram
          hclust_method = "ward.D", # agglomeration method
          seriate = "OLO", # Optimal Leaf Ordering; chosen by trial and error for the most distinguishable layout
          colors = Blues,
          k_row = 6,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Segmentation of Singapore Subzones by indicators",
          xlab = "Indicators",
          ylab = "Subzones"
          )

4.5 Mapping clusters

4.5.1 Creating a Cluster Simple Feature DataFrame

Merging_sf <- Mapping_sf %>%
  select(SUBZONE_N, geometry)

Merging_sf <- left_join(Merging_sf,Clustering_indicators.norm, by="SUBZONE_N")

Final_clusters <- as.factor(cutree(hclust_ward, k=6))
Mapping_cluster_sf <- cbind(Merging_sf, as.matrix(Final_clusters)) %>%
    rename(`CLUSTER` = `as.matrix.Final_clusters.`)

4.5.2 Plotting clusters on the map

tmap_mode("view")

Indicator_map <- tm_shape(Mapping_cluster_sf)+
  tm_fill(col="CLUSTER",
          title=" Indicator Clusters")+
  tm_borders(alpha=0.5)

Indicator_map

The choropleth map above reveals that the clusters are very fragmented. This is one of the major limitations of using a non-spatial clustering algorithm such as hierarchical cluster analysis. That said, we can clearly see that Cluster 1 indicates areas with very little population; these zones include the ports, the airport and industrial zones with low populations.

We need to factor in geospatial factors in order to achieve homogeneous clusters. We will use the SKATER approach in order to achieve this.

4.6 Interpreting clusters

4.6.1 Creating function for calculating mean without zeros

nzmean <- function(x) {
    zvals <- x==0
    if (all(zvals)) 0 else mean(x[!zvals])
}

We exclude zeros from the average so that subzones lacking a given feature entirely do not drag the cluster means down.
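
For example:

nzmean(c(0, 2, 4)) # returns 3, whereas mean() would return 2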

4.6.2 Creating cluster summary for our reference

Cluster_summary <- Mapping_cluster_sf %>%
  st_set_geometry(NULL) %>%
  group_by(CLUSTER) %>% 
  summarise(ACTIVE_RATIO=nzmean(ACTIVE_RATIO), AGED_RATIO=nzmean(AGED_RATIO),
            DENSITY=nzmean(Density), HDB1_2RM_RATIO=nzmean(HDB1_2RM_RATIO),
            HDB3RM_RATIO=nzmean(HDB3RM_RATIO), HDB4RM_RATIO=nzmean(HDB4RM_RATIO),
            HDB5RM_EF_RATIO=nzmean(HDB5RM_EF_RATIO), CONDO_APART_RATIO=nzmean(CONDO_APART_RATIO),
            LANDED_RATIO=nzmean(LANDED_RATIO), BUSINESSES=nzmean(Businesses),
            INDUSTRY=nzmean(Industry), FINANCIAL=nzmean(Financial),
            EMBASSIES=nzmean(Embassies), RESIDENTIAL=nzmean(Residential),
            SHOPPING=nzmean(Shopping))

4.6.3 Plotting out Cluster summaries

Cluster_summary.molten <- melt(Cluster_summary, na.rm=TRUE)

Cluster_summary_plot <- ggplot(Cluster_summary.molten, aes(variable, value, fill = variable)) + 
               facet_wrap(~ CLUSTER) +
               geom_bar(stat="identity", show_guide=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Cluster_summary_plot

As we can see, Cluster 1 consists entirely of non-residential areas with no population at all. This is fine, because these areas remain relevant as commercial or industrial zones for the purpose of our analysis.

Interestingly, Cluster 6 shows services being highest in areas with very few HDB buildings but a high number of condominiums. Additionally, Clusters 3 and 6 show two different groups of condominium dwellers: Cluster 3 lies in more residential zones, whereas Cluster 6 sits near services in the city center.

Cluster 4 appears to capture the high-density residential zones. Referencing the map, these tend to be smaller polygons with larger numbers of HDB 4-Room and 5-Room blocks and apartments.

All of the clusters have a significantly high ratio of economically active population except one, which has a majority aged population. This could simply be due to low total population counts in those areas, which makes it hard to determine why this is the case.

Additionally, the extremely high CONDO_APART_RATIO and LANDED_RATIO values may be causing some issues with the clustering, as they are very large in comparison to their other counterparts.

5 Spatially Constrained Clustering - SKATER approach

In this section, we will derive spatially constrained clusters using the SKATER method.

5.1 Data Preparation

5.1.1 Converting Simple Feature DataFrame to Spatial Polygon DataFrame

Merging_sp <- as_Spatial(Merging_sf)

5.1.2 Computing Nearest Neighbors

Merging.nb <- poly2nb(Merging_sp)
summary(Merging.nb)
## Neighbour list object:
## Number of regions: 313 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.900601 
## Average number of links: 5.948882 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 
##  2  4 12 32 79 80 50 32 16  3  3 
## 2 least connected regions:
## 228 289 with 1 link
## 3 most connected regions:
## 43 259 294 with 11 links

5.1.3 Plotting connections

plot(Merging_sp, border=grey(.5), main= "Nearest Neighbor Connections")
plot(Merging.nb, coordinates(Merging_sp), col="blue", add=TRUE)

5.1.4 Computing lcosts using the nbcosts() function

Next, nbcosts() from the spdep package is used to compute the cost of each edge, i.e. the distance between the nodes it connects. The function computes this distance using a data.frame with a vector of observations at each node.

The code chunk below computes the cost of each edge.

lcosts <- nbcosts(Merging.nb, Cluster_names.norm)

For each observation, this gives the pairwise dissimilarity between its values on the 15 indicator variables and the values of each neighboring observation (from the neighbor list). This is essentially a generalized weight for a spatial weights matrix.
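
Each element of lcosts holds the costs of the edges from one polygon to each of its neighbours; a quick inspection (sketch):

str(lcosts[1:3]) # one numeric vector per region, one cost per neighbour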

5.1.5 Creating Geospatial Distance Weights

Next, we will incorporate these costs into a weights object, in the same way we would when calculating inverse-distance weights. In other words, we convert the neighbor list to a list-weights object, specifying the just-computed lcosts as the weights.

In order to achieve this, the nb2listw() function of the spdep package is used, as shown in the code chunk below.

GWeighted.w <- nb2listw(Merging.nb, lcosts, style="B") # B is for Binary weights
summary(GWeighted.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 313 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.900601 
## Average number of links: 5.948882 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 
##  2  4 12 32 79 80 50 32 16  3  3 
## 2 least connected regions:
## 228 289 with 1 link
## 3 most connected regions:
## 43 259 294 with 11 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn       S0       S1       S2
## B 313 97969 1529.634 3119.553 36023.12

5.1.6 Computing the minimum spanning tree with weighted connections

GWeighted.mst <- mstree(GWeighted.w)
class(GWeighted.mst)
## [1] "mst"    "matrix"

Checking the class of the output to make sure the conversion was correct.

5.1.7 Checking dimensions

In this case, the output should be 312 rows (i.e. n − 1 edges), as we have 313 polygons.

dim(GWeighted.mst)
## [1] 312   3

5.1.8 Plotting final connections

plot(Merging_sp, border=gray(.5), main="Weighted Nearest Neighbor Connections")
plot.mst(GWeighted.mst, coordinates(Merging_sp), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)

Here we can see the final connections after incorporating our indicators in the information space.

5.2 Computing spatially constrained clusters using SKATER method

5.2.1 Compute the spatially constrained cluster using skater() of spdep package.

Cluster6 <- skater(GWeighted.mst[,1:2], Cluster_names.norm, 5) #we want 6 clusters based on our earlier calculations

The skater() function takes three mandatory arguments:

  • the first two columns of the MST matrix (i.e. not the cost),
  • the data matrix (to update the costs as units are being grouped), and
  • the number of cuts, which is set to one less than the number of clusters: the value specified is the number of cuts in the graph, not the number of clusters.
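
A quick sanity check (sketch) confirms that five cuts indeed produce six groups:

length(unique(Cluster6$groups)) # expected: 6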

5.2.2 Appending clusters to Sf frame

groups_mat <- as.matrix(Cluster6$groups)
Mapping_spatialcluster_sf <- cbind(Mapping_cluster_sf, as.factor(groups_mat)) %>%
  rename(SP_CLUSTER=as.factor.groups_mat.) #this is the default output field name. it was a bit hard to read so we renamed it. 

5.2.3 Visualizing final clusters using tmap

Spatial_cluster_map <- tm_shape(Mapping_spatialcluster_sf)+
  tm_fill(col="SP_CLUSTER",
          title="Spatially Weighted Clusters")+
  tm_borders(alpha=0.5)
Spatial_cluster_map

5.2.4 Analyzing Spatial clusters

Spatial_Cluster_summary <- Mapping_spatialcluster_sf %>%
  st_set_geometry(NULL) %>%
  group_by(SP_CLUSTER) %>% 
  summarise(ACTIVE_RATIO=nzmean(ACTIVE_RATIO), AGED_RATIO=nzmean(AGED_RATIO),
            DENSITY=nzmean(Density), HDB1_2RM_RATIO=nzmean(HDB1_2RM_RATIO),
            HDB3RM_RATIO=nzmean(HDB3RM_RATIO), HDB4RM_RATIO=nzmean(HDB4RM_RATIO),
            HDB5RM_EF_RATIO=nzmean(HDB5RM_EF_RATIO), CONDO_APART_RATIO=nzmean(CONDO_APART_RATIO),
            LANDED_RATIO=nzmean(LANDED_RATIO), BUSINESSES=nzmean(Businesses),
            INDUSTRY=nzmean(Industry), FINANCIAL=nzmean(Financial),
            EMBASSIES=nzmean(Embassies), RESIDENTIAL=nzmean(Residential),
            SHOPPING=nzmean(Shopping))

5.2.5 Plotting out Spatial cluster summaries

Spatial_Cluster_summary.molten <- melt(Spatial_Cluster_summary, na.rm=TRUE)

Spatial_cluster_Summary_plot <- ggplot(Spatial_Cluster_summary.molten, aes(variable, value, fill = variable)) + 
               facet_wrap(~ SP_CLUSTER) +
               geom_bar(stat="identity", show_guide=FALSE) +
              theme(axis.text.x = element_text(angle = 90, hjust = 1))

Spatial_cluster_Summary_plot

5.2.6 Interpretation of results.

Based on the information presented, the geographical clustering has produced a general residential population area represented by Cluster 1, although it also includes the airport, which may simply be a consequence of the nearest-neighbor contiguity constraint.

Cluster 2 appears to be a more business- and industry-oriented area together with some HDB residential parts. We can characterize this group as industrial workers.

Cluster 3 is defined by a high condominium and Apartment ratio. It is likely a more affluent residential zone within the shopping and commercial areas in town.

Cluster 4 appears to consist of both industrial port areas as well as embassy, financial and shopping urban functions, and is also defined by the absence of any residential population. This zone is likely a purely commercial and industrial part of the downtown area.

Cluster 5 appears to be defined by its extremely high ratio of Landed property ownership. This zone appears to be an affluent pocket in the north-western part of Singapore.

Cluster 6 appears to consist of primarily condominiums and landed properties with no HDB ownership. It could be considered a more affluent pocket in the north-eastern part of Singapore.

Overall, the SKATER approach seems to have over-merged some of the zones. It appears that, in general, Singapore’s social groups are well spread out across the country in smaller pockets rather than geographically segregated from one another.

5.2.7 Comparison of clustering methods

Comparing the two, we can see very little consistency between the clusterings. In fact, Cluster 1 in the indicator-based clustering has been split across Clusters 2, 4 and 5 in the spatially weighted clustering. This could point to an issue with relying on the binary style in our nb2listw() call, which we could examine further. Additionally, more clusters could be added to provide more defined social areas as part of our analysis.
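
One way to quantify the disagreement is a simple cross-tabulation of the two assignments (sketch):

# Rows: informational-space clusters; columns: spatially constrained clusters
table(Mapping_spatialcluster_sf$CLUSTER, Mapping_spatialcluster_sf$SP_CLUSTER)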

6 (BONUS) Extended Investigation

For this section, we will examine the effect of adding more clusters when defining the social areas. We begin by re-checking the Dunn’s Index plot.

plot(Cluster_numbers,Dunn_index)

As we can see based on the Dunn’s Index, the next jump appears to be around 13 clusters.

6.0.1 Forming new clusters

thirteen_clusters <- as.factor(cutree(hclust_ward, k=13))
thirteen_cluster_sf <- cbind(Merging_sf, as.matrix(thirteen_clusters)) %>%
    rename(`CLUSTER` = `as.matrix.thirteen_clusters.`)

6.0.2 Plotting clusters on the map

tmap_mode("view")

thirteen_clusters_map <- tm_shape(thirteen_cluster_sf)+
  tm_fill(col="CLUSTER",
          title=" New Clusters")+
  tm_borders(alpha=0.5)

thirteen_clusters_map

6.0.3 Creating cluster summary for our new clusters

Thirteen_Cluster_summary <- thirteen_cluster_sf %>%
  st_set_geometry(NULL) %>%
  group_by(CLUSTER) %>% 
  summarise(ACTIVE_RATIO=nzmean(ACTIVE_RATIO), AGED_RATIO=nzmean(AGED_RATIO),
            DENSITY=nzmean(Density), HDB1_2RM_RATIO=nzmean(HDB1_2RM_RATIO),
            HDB3RM_RATIO=nzmean(HDB3RM_RATIO), HDB4RM_RATIO=nzmean(HDB4RM_RATIO),
            HDB5RM_EF_RATIO=nzmean(HDB5RM_EF_RATIO), CONDO_APART_RATIO=nzmean(CONDO_APART_RATIO),
            LANDED_RATIO=nzmean(LANDED_RATIO), BUSINESSES=nzmean(Businesses),
            INDUSTRY=nzmean(Industry), FINANCIAL=nzmean(Financial),
            EMBASSIES=nzmean(Embassies), RESIDENTIAL=nzmean(Residential),
            SHOPPING=nzmean(Shopping))

6.0.4 Plotting out new Cluster summaries

Thirteen_Cluster_summary.molten <- melt(Thirteen_Cluster_summary, na.rm=TRUE)

Thirteen_Cluster_summary_plot <- ggplot(Thirteen_Cluster_summary.molten, aes(variable, value, fill = variable)) + 
               facet_wrap(~ CLUSTER) +
               geom_bar(stat="identity", show_guide=FALSE) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Thirteen_Cluster_summary_plot

Unfortunately, adding extra clusters does not seem to make the social areas more defined in the information space. However, we can run a quick SKATER pass to see if they make more sense when geographically weighted.

6.0.5 Compute the spatially constrained cluster using skater() of spdep package.

Cluster13 <- skater(GWeighted.mst[,1:2], Cluster_names.norm, 12) #we want 13 clusters based on our earlier calculations

6.0.6 Appending clusters to Sf frame

groups_mat13 <- as.matrix(Cluster13$groups)
Mapping_spatialcluster13_sf <- cbind(Mapping_cluster_sf, as.factor(groups_mat13)) %>%
  rename(SP_CLUSTER=as.factor.groups_mat13.) #this is the default output field name. it was a bit hard to read so we renamed it. 

6.0.7 Visualizing final clusters using tmap

Spatial_cluster13_map <- tm_shape(Mapping_spatialcluster13_sf)+
  tm_fill(col="SP_CLUSTER",
          title="Spatially Weighted Clusters")+
  tm_borders(alpha=0.5)
Spatial_cluster13_map

6.0.8 Analyzing Spatial clusters

Spatial_Cluster13_summary <- Mapping_spatialcluster13_sf %>%
  st_set_geometry(NULL) %>%
  group_by(SP_CLUSTER) %>% 
  summarise(ACTIVE_RATIO=nzmean(ACTIVE_RATIO), AGED_RATIO=nzmean(AGED_RATIO),
            DENSITY=nzmean(Density), HDB1_2RM_RATIO=nzmean(HDB1_2RM_RATIO),
            HDB3RM_RATIO=nzmean(HDB3RM_RATIO), HDB4RM_RATIO=nzmean(HDB4RM_RATIO),
            HDB5RM_EF_RATIO=nzmean(HDB5RM_EF_RATIO), CONDO_APART_RATIO=nzmean(CONDO_APART_RATIO),
            LANDED_RATIO=nzmean(LANDED_RATIO), BUSINESSES=nzmean(Businesses),
            INDUSTRY=nzmean(Industry), FINANCIAL=nzmean(Financial),
            EMBASSIES=nzmean(Embassies), RESIDENTIAL=nzmean(Residential),
            SHOPPING=nzmean(Shopping))

6.0.9 Plotting out Spatial cluster summaries

Spatial_Cluster13_summary.molten <- melt(Spatial_Cluster13_summary, na.rm=TRUE)

Spatial_cluster13_Summary_plot <- ggplot(Spatial_Cluster13_summary.molten, aes(variable, value, fill = variable)) + 
               facet_wrap(~ SP_CLUSTER) +
               geom_bar(stat="identity", show_guide=FALSE) +
              theme(axis.text.x = element_text(angle = 90, hjust = 1))

Spatial_cluster13_Summary_plot

Now we can see slightly more defined social areas. Clusters 1, 8 and 9 appear to be high-density residential zones with slight variations in the services available.

Clusters 3, 5, 6 and 10 appear to be slightly more affluent low-density residential zones, with slight differences in housing type in specific areas.

Clusters 4 and 11 appear to be commercial zones with virtually no residential population.

Clusters 2, 7, 12 and 13 appear to be mixes of residential and commercial zones in different proportions, most likely residences close to work areas. The major difference is whether they lie inside or outside the city zones, which affects the number of services around them.

Overall, using more clusters defines the social areas slightly better. However, the groups still form their own small pockets, which supports the idea that Singapore spreads its demographic groups across several different parts of the island, with no single segment of the island characterizing a specific social group.