Social area analysis (SAA), which is a generic name for a number of methods employing census and other data to classify small areas into similar socioeconomic groups, is an approach which quantifies data in a useful fashion and has important applications in regional development, urban planning, medical and health services research. The potentialities of the approach for Liveable City planning and Smart City preparedness, however, have yet to be explored.
In this exercise, we will segment Singapore at the planning subzone level into homogeneous socioeconomic areas by combining geodemographic data extracted from Singapore Department of Statistics and urban functions extracted from the geospatial data provided.
The following data sets will be used for this study:
5 Geospatial Data Sets: Govt_Embassy, Private residential, Shopping, Business and Financial in ESRI shapefile format.
Singapore Residents by Planning AreaSubzone, Age Group, Sex and Type of Dwelling, June 2011-2019 in CSV format taken from Singstat
URA Master Plan 2014 Planning Subzone boundary in ESRI shapefile format taken from URA.
.
packages = c('rgdal', 'spdep', 'ClustGeo', 'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse', 'raster', 'formattable')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The csv file containing Singapore residents’ data will be imported using read_csv(). For the purpose of the analysis, we will only be using data from 2019. The subzone names are changed to capital letters for ease of data handling further on.
population <- read_csv("data/aspatial/respopagesextod2011to2019.csv") %>%
filter(Time==2019) %>%
mutate_at(.vars = vars(PA, SZ), .funs = funs(toupper))
## Parsed with column specification:
## cols(
## PA = col_character(),
## SZ = col_character(),
## AG = col_character(),
## Sex = col_character(),
## TOD = col_character(),
## Pop = col_double(),
## Time = col_double()
## )
Checking for Na/Invalid Values:
population[rowSums(is.na(population)) != 0,]
## # A tibble: 0 x 7
## # ... with 7 variables: PA <chr>, SZ <chr>, AG <chr>, Sex <chr>, TOD <chr>,
## # Pop <dbl>, Time <dbl>
The data set is valid.
Importing Government Embassies, Business, Shopping, Financial and Residential data using st_read():
government <- st_read(dsn = "data/geospatial", layer = "Govt_Embassy")
## Reading layer `Govt_Embassy' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 443 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6282 ymin: 1.24911 xmax: 103.9884 ymax: 1.45765
## CRS: 4326
business <- st_read(dsn = "data/geospatial", layer = "Business")
## Reading layer `Business' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## CRS: 4326
shopping <- st_read(dsn = "data/geospatial", layer = "Shopping")
## Reading layer `Shopping' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 511 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.679 ymin: 1.24779 xmax: 103.9644 ymax: 1.4535
## CRS: 4326
financial <- st_read(dsn = "data/geospatial", layer = "Financial")
## Reading layer `Financial' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3320 features and 29 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6256 ymin: 1.24392 xmax: 103.9998 ymax: 1.46247
## CRS: 4326
residential <- st_read(dsn = "data/geospatial", layer = "Private residential")
## Reading layer `Private residential' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3604 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6295 ymin: 1.23943 xmax: 103.9749 ymax: 1.45379
## CRS: 4326
Importing Planning Area subzones data using st_read():
subzones <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")
## Reading layer `MP14_SUBZONE_WEB_PL' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## proj4string: +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs
Checking subzone data set for NA/Invalid values:
test <- st_is_valid(subzones)
length(which(test==FALSE))
## [1] 9
There are 9 subzone polygons that are invalid as observed from the st_is_valid test. Hence, we need to make these polygons valid.
Making polygons valid:
subzones <- st_make_valid(subzones)
test <- st_is_valid(subzones)
length(which(test==FALSE))
## [1] 0
The 9 subzones are now valid.
Checking CRS of Geospatial Data:
crs(government)
## [1] "+proj=longlat +datum=WGS84 +no_defs "
crs(subzones)
## [1] "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs "
The CRS of the Geospatial data and the subzone data is not the same hence we will transform the crs of the geospatial data to match the crs of the subzones.
Transforming CRS:
government_crs <- st_transform(government, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")
business_crs <- st_transform(business, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")
shopping_crs <- st_transform(shopping, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")
financial_crs <- st_transform(financial, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")
residential_crs <- st_transform(residential, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")
Rechecking to confirm if the CRS are the same:
crs(government_crs)
## [1] "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs "
crs(subzones)
## [1] "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs "
The crs projections are now the same.
Extracting Economy Active Population:
i_economy_active <- population %>%
filter(AG == "25_to_29" | AG == "30_to_34" | AG == "35_to_39" | AG == "40_to_44" | AG == "45_to_49" | AG == "50_to_54" | AG == "55_to_59" | AG == "60_to_64") %>%
group_by(SZ = SZ) %>%
summarise(economy_active = sum(Pop)) %>%
ungroup()
Extracting Young Group:
i_young <- population %>%
filter(AG == "0_to_4" | AG == "5_to_9" | AG == "10_to_14" | AG == "15_to_19" | AG == "20_to_24" ) %>%
group_by(SZ = SZ) %>%
summarise(young = sum(Pop)) %>%
ungroup()
Extracting Aged Group:
i_aged <- population %>%
filter(AG == "65_to_69" | AG == "70_to_74" | AG == "75_to_79" | AG == "80_to_84" | AG == "85_to_89" | AG == "90_and_over") %>%
group_by(SZ = SZ) %>%
summarise(aged = sum(Pop)) %>%
ungroup()
Calculating Population Density :
popsum_by_sz <- population %>%
group_by(SZ) %>%
summarise(pop_sum = sum(Pop))
sg_popsum <- left_join(popsum_by_sz, subzones, by = c("SZ" = "SUBZONE_N"))
## Warning: Column `SZ`/`SUBZONE_N` joining character vector and factor, coercing
## into character vector
density <- sg_popsum %>%
group_by(SZ) %>%
transmute(density = pop_sum/(SHAPE_Area/1000000)) %>%
ungroup()
Getting HDB1-2RM Dwellers:
i_hdb_1to2 <- population %>%
filter(TOD == "HDB 1- and 2-Room Flats") %>%
group_by(SZ = SZ) %>%
summarise(hdb_1to2 = sum(Pop)) %>%
ungroup()
Getting HDB3-4RM Dwellers:
i_hdb_3to4 <- population %>%
filter(TOD == "HDB 3-Room Flats" | TOD == "HDB 4-Room Flats") %>%
group_by(SZ = SZ) %>%
summarise(hdb_3to4 = sum(Pop)) %>%
ungroup()
Getting HDB5RM-Ec Dwellers:
HUDC flats that are not privatised are considered to be executive condominiums according to https://www.propertyguru.com.sg/property-guides/executive-condo-vs-private-condo-10371 hence they will be included for this indicator.
i_hdb_5toec <- population %>%
filter(TOD == "HDB 5-Room and Executive Flats" | TOD == "HUDC Flats (excluding those privatised)") %>%
group_by(SZ = SZ) %>%
summarise(hdb_5toec = sum(Pop)) %>%
ungroup()
Getting Condominium & Apartment Dwellers:
i_condo <- population %>%
filter(TOD == "Condominiums and Other Apartments") %>%
group_by(SZ = SZ) %>%
summarise(condo = sum(Pop)) %>%
ungroup()
Getting Landed Property Dwellers:
i_landed <- population %>%
filter(TOD == "Landed Properties") %>%
group_by(SZ = SZ) %>%
summarise(landed = sum(Pop)) %>%
ungroup()
sg_government <- st_join(government_crs, subzones)
sg_business <- st_join(business_crs, subzones)
sg_shopping <- st_join(shopping_crs, subzones)
sg_financial <- st_join(financial_crs, subzones)
sg_residential <- st_join(residential_crs, subzones)
Extracting Government Buildings:
sg_government <- st_set_geometry(sg_government,NULL)
u_government <- sg_government %>%
group_by(SUBZONE_N = SUBZONE_N) %>%
summarise(government = length(POI_NAME))
Extracting Businesses:
Upon examining the dataset, it can be seen that the facility type for businesses is 5000. Hence, we will filter out businesses with FAC_TYPE equal to 5000.
sg_business <- st_set_geometry(sg_business,NULL)
u_business <- sg_business %>%
filter(FAC_TYPE == 5000) %>%
group_by(SUBZONE_N = SUBZONE_N) %>%
summarise(business = length(POI_NAME))
Extracting Industries:
Upon examining the dataset, it can be seen that the facility type for industries is 9991. Hence, we will filter out industries with FAC_TYPE equal to 9991.
u_industry <- sg_business %>%
filter(FAC_TYPE == 9991) %>%
group_by(SUBZONE_N = SUBZONE_N) %>%
summarise(industry = length(POI_NAME))
Extracting Shopping Establishments:
sg_shopping <- st_set_geometry(sg_shopping,NULL)
u_shopping <- sg_shopping %>%
group_by(SUBZONE_N = SUBZONE_N) %>%
summarise(shopping = length(POI_NAME))
Extracting Financial Establishments:
sg_financial <- st_set_geometry(sg_financial,NULL)
u_financial <- sg_financial %>%
group_by(SUBZONE_N = SUBZONE_N) %>%
summarise(financial = length(POI_NAME))
Extracting Upmarket Residential Areas:
sg_residential <- st_set_geometry(sg_residential,NULL)
u_residential <- sg_residential %>%
group_by(SUBZONE_N = SUBZONE_N) %>%
summarise(upmarket_residential = length(POI_NAME))
Combining the 6 urban functions using the common variable ‘SUBZONE_N’:
combine1 <- left_join(u_financial, u_business, by = c('SUBZONE_N' = 'SUBZONE_N'))
combine2 <- left_join(combine1, u_industry, by = c('SUBZONE_N'='SUBZONE_N'))
combine3 <- left_join(combine2, u_shopping, by = c('SUBZONE_N'='SUBZONE_N'))
combine4 <- left_join(combine3, u_residential, by = c('SUBZONE_N'='SUBZONE_N'))
urbanfunctions <- left_join(combine4, u_government, by = c('SUBZONE_N' = 'SUBZONE_N'))
Plotting histograms to display the distribution of the social demographic indicators as well as the urban functions for subzones:
hist_density <- ggplot(data=sd_uf,
aes(x= `density`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_young <- ggplot(data=sd_uf,
aes(x= `young`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_economy_active<- ggplot(data=sd_uf,
aes(x= `economy_active`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_aged<- ggplot(data=sd_uf,
aes(x= `aged`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_hdb1to2<- ggplot(data=sd_uf,
aes(x= `hdb_1to2`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_hdb3to4<- ggplot(data=sd_uf,
aes(x= `hdb_3to4`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_hdb5toec<- ggplot(data=sd_uf,
aes(x= `hdb_5toec`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_condo<- ggplot(data=sd_uf,
aes(x= `condo`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_landed<- ggplot(data=sd_uf,
aes(x= `landed`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_government<- ggplot(data=sd_uf,
aes(x= `government`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_business<- ggplot(data=sd_uf,
aes(x= `business`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_industry<- ggplot(data=sd_uf,
aes(x= `industry`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_shopping<- ggplot(data=sd_uf,
aes(x= `shopping`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_financial<- ggplot(data=sd_uf,
aes(x= `financial`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
hist_upmarket_residential<- ggplot(data=sd_uf,
aes(x= `upmarket_residential`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
ggarrange(hist_density, hist_young, hist_economy_active, hist_aged, hist_hdb1to2, hist_hdb3to4, hist_hdb5toec, hist_condo, hist_landed, hist_government, hist_business, hist_industry, hist_shopping, hist_financial, hist_upmarket_residential,
ncol = 3,
nrow = 5)
From the histograms above, it is observed that the data is right skewed. The data will need to be standardised subsequently.
Plotting correlation plot:
cluster_vars.cor = cor(sd_uf[,2:16])
corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black")
From the correlation plot, we can see that the 3 age groups (young, economy active and aged) and hdb housing (3-4 rooms and 5-ec) are highly correlated to one another. This suggests that only one out of these 5 variables will be required for cluster analysis. Due to the strong postive correlation, we will only be used economy active age group.
Excluding unwanted variables:
Amongst the 5 highly correlated variables, we will only include economy active people.
variables <- sd_uf %>%
dplyr::select("SZ", "density", "economy_active", "hdb_1to2", "condo", "landed", "government", "business", "industry", "shopping", "financial", "upmarket_residential")
Changing the rows to subzone names instead of row numbers:
row.names(variables) <- variables$SZ
## Warning: Setting row names on a tibble is deprecated.
Removing the subzone names column:
as_data_frame(variables)
## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 323 x 12
## SZ density economy_active hdb_1to2 condo landed government business
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 ADMI~ 11184. 8380 1140 0 0 0 0
## 2 AIRP~ 0 0 0 0 0 0 0
## 3 ALEX~ 13374. 7560 3910 0 0 7 39
## 4 ALEX~ 7218. 1420 0 2120 0 0 0
## 5 ALJU~ 13581. 24330 2120 11930 460 2 62
## 6 ANAK~ 8037. 12490 160 9020 7600 1 13
## 7 ANCH~ 31092. 27740 1680 5000 0 0 0
## 8 ANG ~ 15432. 2790 0 1930 0 1 1
## 9 ANSON 0 0 0 0 0 3 15
## 10 BALE~ 17004. 19320 1790 9610 570 6 23
## # ... with 313 more rows, and 4 more variables: industry <int>, shopping <int>,
## # financial <int>, upmarket_residential <int>
cluster_variables <- dplyr::select(variables, c(2:12))
summary(cluster_variables)
## density economy_active hdb_1to2 condo
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 5910 Median : 2840 Median : 0 Median : 230
## Mean :10737 Mean : 7386 Mean : 542 Mean : 1827
## 3rd Qu.:19928 3rd Qu.:10285 3rd Qu.: 605 3rd Qu.: 2835
## Max. :46058 Max. :79780 Max. :4700 Max. :16770
## landed government business industry
## Min. : 0.0 Min. : 0.000 Min. : 0.00 Min. :0.0000
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:0.0000
## Median : 0.0 Median : 0.000 Median : 1.00 Median :0.0000
## Mean : 773.9 Mean : 1.294 Mean : 12.43 Mean :0.2632
## 3rd Qu.: 400.0 3rd Qu.: 1.000 3rd Qu.: 8.00 3rd Qu.:0.0000
## Max. :18820.0 Max. :19.000 Max. :254.00 Max. :8.0000
## shopping financial upmarket_residential
## Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 0.00
## Median : 0.000 Median : 5.00 Median : 3.00
## Mean : 1.573 Mean : 10.28 Mean : 10.54
## 3rd Qu.: 1.000 3rd Qu.: 13.00 3rd Qu.: 10.50
## Max. :31.000 Max. :134.00 Max. :217.00
The values range of the variables are very different. In order to prevent the cluster analysis from being biased, we will standardise the variables before proceeding with cluster analysis.
We will use min-max standardisation as the variables are not normally distributed as observed from the EDA previously.
cluster_variables.std <- heatmaply::normalize(cluster_variables)
summary(cluster_variables.std)
## density economy_active hdb_1to2 condo
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.1283 Median :0.03560 Median :0.0000 Median :0.01371
## Mean :0.2331 Mean :0.09258 Mean :0.1153 Mean :0.10894
## 3rd Qu.:0.4327 3rd Qu.:0.12892 3rd Qu.:0.1287 3rd Qu.:0.16905
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## landed government business industry
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.003937 Median :0.00000
## Mean :0.04112 Mean :0.06811 Mean :0.048951 Mean :0.03289
## 3rd Qu.:0.02125 3rd Qu.:0.05263 3rd Qu.:0.031496 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## shopping financial upmarket_residential
## Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.007463 1st Qu.:0.00000
## Median :0.00000 Median :0.037313 Median :0.01382
## Mean :0.05073 Mean :0.076706 Mean :0.04859
## 3rd Qu.:0.03226 3rd Qu.:0.097015 3rd Qu.:0.04839
## Max. :1.00000 Max. :1.000000 Max. :1.00000
Comparing distribution of variables before and after standardisation:
We will use density as an example to do the comparison.
r <- ggplot(data= sd_uf,
aes(x= `density`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
df_cluster_variables.std <- as.data.frame(cluster_variables.std)
s <- ggplot(data=df_cluster_variables.std,
aes(x=`density`)) +
geom_histogram(bins=20,
color="black",
fill="light blue") +
ggtitle("Min-Max Standardisation")
ggarrange(r, s,
ncol = 2,
nrow = 1)
This shows us that the overall distribution of population density has not been altered significantly hence the distribution for the rest of the variables should not differ greatly too.
Calculating proximity matrix using default euclidean method:
proximity_matrix <- dist(cluster_variables.std, method = 'euclidean')
Getting aggomerative coefficient of all the the hierarchical clustering algorithms by using the agnes() function of the cluster package.
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")
ac <- function(x) {
agnes(cluster_variables.std, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.9002887 0.8142469 0.9236774 0.9769280
From the output, we can observe that the Ward’s method provides the strongest clustering structure as it has the highest agglomerative coefficient amongst all the methods assessed. From https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html we note that the method “ward” in agnes() corresponds to “ward.D2” in hclust. Hence, the Ward’s method will be used to analyse the variables.
hcluster_wards <- hclust(proximity_matrix, method = 'ward.D2')
Plotting clustering dendogram:
plot(hcluster_wards, cex = 0.6)
Determining number of clusters using K elbow method:
mydata <- cluster_variables.std
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 1:11) wss[i] <- sum(kmeans(mydata,
centers=i)$withinss)
plot(1:11, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
As seen from the plot, 2 and 5 have elbows. 5 clusters will be most suitable for the analysis.
Plotting the dendogram with highlighted clusters:
plot(hcluster_wards, cex = 0.6)
rect.hclust(hcluster_wards, k = 5, border = 2:5)
As we move up the tree, observations that are similar to one another are combined into branches. The higher the height of the fusion, the less similar the observations are.
Converting the dataframe into a matrix to plot a heatmap displaying the clustering:
matrix_cluster_variables <- data.matrix(cluster_variables.std)
heatmaply(matrix_cluster_variables,
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D2",
seriate = "OLO",
colors = Blues,
k_row = 5,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="Geographic Segmentation of Shan State by ICT indicators",
xlab = "Cluster Variables",
ylab = "Subzones of Singapore"
)
We will be retaining 5 clusters to be mapped upon observation of the dendogram.
groups <- as.factor(cutree(hcluster_wards, k = 5))
Plotting choropleth map with the clusters:
subzone_clusters <- cbind(subzones, as.matrix(groups)) %>% rename(`CLUSTER` = `as.matrix.groups.`)
qtm(subzone_clusters, "CLUSTER")
From the chloropleth, it is revealed that the clusters are very fragmented. The different clusters are scattered across Singapore. This signifies the major constrain faced when the hierarchical cluster analysis method is used.
The SKATER function requires the spatial data to be in ‘sp’ format.
subzones_sp <- as_Spatial(subzones)
Computing neighbours list will from the polygon list:
subzones.nb <- poly2nb(subzones_sp)
summary(subzones.nb)
## Neighbour list object:
## Number of regions: 323
## Number of nonzero links: 1934
## Percentage nonzero weights: 1.853751
## Average number of links: 5.987616
## 5 regions with no links:
## 17 18 19 295 302
## Link number distribution:
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 14 17
## 5 2 6 10 26 77 87 51 34 16 3 3 1 1 1
## 2 least connected regions:
## 16 234 with 1 link
## 1 most connected region:
## 313 with 17 links
Plotting neighbours list:
plot(subzones_sp, border=grey(0.5))
plot(subzones.nb, coordinates(subzones_sp), col = "red", add = TRUE)
There are some clusters without neighbours.
There are 5 regions with no links hence we will remove them as we will not be able to calculate edge costs without any neighbours.
sub_nb_subzones <- subset(subzones.nb, subset = card(subzones.nb) > 0)
summary(sub_nb_subzones)
## Neighbour list object:
## Number of regions: 318
## Number of nonzero links: 1934
## Percentage nonzero weights: 1.912503
## Average number of links: 6.081761
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 14 17
## 2 6 10 26 77 87 51 34 16 3 3 1 1 1
## 2 least connected regions:
## 16 234 with 1 link
## 1 most connected region:
## 313 with 17 links
The subzones without neighbours have been removed.
The cost of each edge is computed as it is the distance between the nodes.
lcosts <- nbcosts(sub_nb_subzones, cluster_variables.std)
Incorporating costs as weights objects:
subzones.w <- nb2listw(sub_nb_subzones, lcosts, style = "B")
summary(subzones.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 318
## Number of nonzero links: 1934
## Percentage nonzero weights: 1.912503
## Average number of links: 6.081761
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 14 17
## 2 6 10 26 77 87 51 34 16 3 3 1 1 1
## 2 least connected regions:
## 16 234 with 1 link
## 1 most connected region:
## 313 with 17 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 318 101124 1266.783 2190.305 25762.91
The minimum spanning tree is computing by using the mstree() function from spdep package:
subzones.mst <- mstree(subzones.w)
Checking the class and dimension of the MST computed:
class(subzones.mst)
## [1] "mst" "matrix"
dim(subzones.mst)
## [1] 317 3
It is a matrix and the dimension is 318.
Plotting the MST to view the observation numbers:
plot(subzones_sp, border=gray(0.5))
plot.mst(subzones.mst, coordinates(subzones_sp), col = "red", cex.lab = 0.7, cex.circles = 0.005, add = TRUE)
Computing Clusters:
clusters <- skater(subzones.mst[,1:2], cluster_variables.std, 4)
Plotting pruned tree that shows the clusters:
plot(subzones_sp, border = gray(0.5))
plot(clusters, coordinates(subzones_sp), cex.lab = 0.7, groups.colors = c("red", "blue", "green", "brown", "pink"), cex.circles = 0.005, add = TRUE)
Converting the clusters into a matrix:
groups_matrix <- as.matrix(clusters$groups)
Creating function to include subzones with no neighbours:
insert_row <- function(existingDF, newrow, r) {
existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
existingDF[r,] <- newrow
existingDF
}
Implementing function to add in rows for the subzones without neighbours that were removed previously:
The value added for each of this subzone will be 0 to indicate that they have no neighbours.
The subzones that were removed were (17,18,19,295 and 302) hence their rows will be appended respectively.
existingDF <- as.data.frame(groups_matrix)
newrow <- 0
r <- 17
existingDF <- insert_row(existingDF, newrow , r)
newrow <- 0
r <- 18
existingDF <- insert_row(existingDF, newrow , r)
newrow <- 0
r <- 19
existingDF <- insert_row(existingDF, newrow , r)
newrow <- 0
r <- 295
existingDF <- insert_row(existingDF, newrow , r)
newrow <- 0
r <- 302
existingDF <- insert_row(existingDF, newrow , r)
groups_matrix <- as.matrix(existingDF)
Computing Mean Values of Variables for Clusters
dataset <- cbind(cluster_variables.std, as.factor(groups_matrix)) %>%
rename(`SP_CLUSTER` = `as.factor(groups_matrix)`)
mean_values <- dataset %>%
group_by(SP_CLUSTER) %>%
summarise_all(funs(mean))
mean_vals <- data.frame(t(mean_values[-1]))
colnames(mean_vals) <- mean_values[, 1]
names(mean_vals)[1] <- "Cluster 0"
names(mean_vals)[2] <- "Cluster 1"
names(mean_vals)[3] <- "Cluster 2"
names(mean_vals)[4] <- "Cluster 3"
names(mean_vals)[5] <- "Cluster 4"
names(mean_vals)[6] <- "Cluster 5"
mean_vals <- mean_vals %>%
dplyr::select(`Cluster 1`, `Cluster 2`, `Cluster 3`, `Cluster 4`, `Cluster 5`)
Viewing as a table:
| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | |
|---|---|---|---|---|---|
| density | 0.22332473 | 0.13495013 | 0.29323263 | 0.26552265 | 0.25944264 |
| economy_active | 0.08786714 | 0.10049511 | 0.11238861 | 0.05569483 | 0.13948358 |
| hdb_1to2 | 0.10570247 | 0.07500000 | 0.14727273 | 0.37943262 | 0.17787234 |
| condo | 0.10921035 | 0.07871199 | 0.11968342 | 0.02862254 | 0.12176506 |
| landed | 0.04520129 | 0.01554198 | 0.03267317 | 0.00000000 | 0.02412327 |
| government | 0.07138291 | 0.01973684 | 0.05454545 | 0.35087719 | 0.02105263 |
| business | 0.05017693 | 0.08661417 | 0.03493200 | 0.12992126 | 0.06771654 |
| industry | 0.02935223 | 0.14062500 | 0.02954545 | 0.12500000 | 0.05000000 |
| shopping | 0.05472117 | 0.03629032 | 0.04105572 | 0.02150538 | 0.01290323 |
| financial | 0.08042782 | 0.04570896 | 0.06309362 | 0.19402985 | 0.04328358 |
| upmarket_residential | 0.05362040 | 0.01152074 | 0.03544198 | 0.01689708 | 0.02396313 |
Visualising the clusters obtained through SKATER method:
subzones_sf_clusters <- cbind(subzone_clusters, as.factor(groups_matrix)) %>%
rename(`SP_CLUSTER` = `as.factor.groups_matrix.`)
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(subzones_sf_clusters)+
tm_borders(alpha = 0.5)+
tm_fill(col = "SP_CLUSTER", alpha = 0.9, id = "SUBZONE_N")
Cluster 0 occupies the green regions of the map (e.g North-Eastern Islands, Sudong, Semakau, etc). These are areas with no neighbours and least social activity is found here.
Cluster 1 occupies the majority of Singapore, covering most areas except Northern Singapore. Cluster 1 has the highest proportion of landed properties, government buildings such as embassies, shopping establishments such as malls and upmarket residential areas. Cluster 1 has the least number of industial areas. This shows us that government buildings and amenties such as shopping malls are widely spread around Singapore for ease of accessibility.
Cluster 2 is located closer to the North of Singapore. Cluster 2 is dominated by industrial areas and the least population density. It also has the lowest proportion of government areas and housing such as 1-2 bedroom hdb or upmarket residential areas. It can be inferred that Cluster 2 represents an industrial area with less people residing.
Cluster 3 is located towards the Northern most part of Singapore and surrounds Cluster 2. Cluster 3 has the highest population density and the lowest proportion of business areas. This is a densely populated residential area with many housing estates.
Cluster 4 occupies a small region around Boat Quay, Clarke Quay and Raffles Place. It has the highest proportion of business and financial buildings. It has the least number of condominium and landed properties. Cluster 4 represents the Central Business District of Singapore that is dominated by businesses and banks hence there are less residential areas here.
Cluster 5 lies towards the North-East region of Singapore at Punggol. It has the highest proportion of economy active people staying there and the lowest proportion of financial buildings. The reason why many economy active people are staying there may be due to Punggol/ Sengkang being developing residential towns with many new HDBs being built there. Many economy active people will choose to buy a new flat at the Punggol region.
It can be inferred that the East, Central, South and West regions of Singapore are more prominent social areas as most of the urban functions are found in these regions.