1 Introduction

Brazil is one of the largest country in terms of land size and population. Despite having a high GDP, it is not spread equally among the municipalities and result is a huge income gap. São Paulo Macrometropolis is one of the most densely populated region and hence we will be segmenting São Paulo Macrometropolis at the municipality level into homogeneous industry specialization clusters.

For more information about São Paulo Macrometropolis, please take a look at : https://en.wikipedia.org/wiki/S%C3%A3o_Paulo_Macrometropolis

2 Installing the necessary packages

First we have to install all the necesary packages to conduct the segmentation. Do note that we will be use geobr library to retrieve the geospatial data of brazil municipality.

The rest of the library are use to:

Process the data
Performing correlation analysis
Performing clustering
Plotting of chart

packages = c('rgdal', 'spdep', 'tmap', 'sf', 'ggpubr', 'cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych', 'tidyverse', 'geobr','stringr','dbscan')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p,character.only = T)
}

3 Data Wrangling

The data that is being use for this study is a combined dataset of publicly available information on Brazil municipality. The combined dataset can be found on https://www.kaggle.com/crisparada/brazilian-cities?select=Data_Dictionary.csv

The feature we will be focusing on will mainly come from Brazilian Institute of Geography and Statistics (IBGE)

3.1 Aspatial Data

Although the file is save as a csv, the delimiter used in this file is a semi colon. Hence, we need to use read_delim and set the delimiter to “;” in order to read the data file properly.

brazil_data <- read_delim("data/aspatial/BRAZIL_CITIES.csv", delim = ";")

Data dictionary is the meta data for the Brazil_cities data. It contain information such as what does each columns refers to.

data_dict <- read_delim("data/aspatial/Data_Dictionary.csv", delim = ";")

As we will be performing segmentation based on the number of companies across industries in each municipality, we will be looking at all the features that contains “COMP”. To understand what does each of them refers to we will look at the meta data for each of the feature.

print(data_dict %>%
  filter(grepl('COMP',FIELD)))

## # A tibble: 22 x 6
##    FIELD  DESCRIPTION                      REFERENCE UNIT  SOURCE          X6   
##    <chr>  <chr>                            <chr>     <chr> <chr>           <chr>
##  1 COMP_~ Total number of companies        2016      -     https://sidra.~ <NA> 
##  2 COMP_A Number of Companies: Agricultur~ 2016      -     https://sidra.~ <NA> 
##  3 COMP_B Number of Companies: Extractive~ 2016      -     https://sidra.~ <NA> 
##  4 COMP_C Number of Companies: Industries~ 2016      -     https://sidra.~ <NA> 
##  5 COMP_D Number of Companies: Electricit~ 2016      -     https://sidra.~ <NA> 
##  6 COMP_E Number of Companies: Water, sew~ 2016      -     https://sidra.~ <NA> 
##  7 COMP_F Number of Companies: Constructi~ 2016      -     https://sidra.~ <NA> 
##  8 COMP_G Number of Companies: Trade; rep~ 2016      -     https://sidra.~ <NA> 
##  9 COMP_H Number of Companies: Transport,~ 2016      -     https://sidra.~ <NA> 
## 10 COMP_I Number of Companies: Accommodat~ 2016      -     https://sidra.~ <NA> 
## # ... with 12 more rows

3.2 Geospatial Data

3.2.1 Retrieving the Spatial Data

We will be retrieving the municipality and metropolitan boundary using geobr library functions.

read_municipality will download and return us a spatial dataframe containing all the municipality across Brazil
read_metro_area will download and return us a spatial dataframe containing all the metro area across Brazil

Although the metro_area includes data at municipality level,there are some munipality that is not included in the Sau Paulo metro area.

muni <- read_municipality(year=2016)
metro <- read_metro_area(year=2016)

3.2.2 Filter Area of Interest

Filter the metro boundary such that it only consists of Sau Paulo (SP) Metropolitan city and select the relevant fields needed for analysis.

metro_state_clean <- metro %>%
  filter(abbrev_state =="SP") %>%
  select(name_muni, geom, name_metro)

Display the all metro name that is part of Sau Paulo Metropolitan city.

unique(metro_state_clean$name_metro)

## [1] Aglomeração Urbana de Jundiaí                  
## [2] Aglomeração Urbana de Piracicaba-AU- Piracicaba
## [3] RM Baixada Santista                            
## [4] RM Campinas                                    
## [5] RM de Sorocaba                                 
## [6] RM do Vale do Paraíba e Litoral Norte          
## [7] RM São Paulo                                   
## 77 Levels: Aglomeração Urbana de Jundiaí ...

As Regional Unit of Bragança Paulista city is not part of the metropolitan city, we need to select the municipality manually from the municipality boundaries. The municipality data do not provide us with the metro name so we will assign all the metro name as Regional Unit of Bragança Paulista city it since we know that all of these municipalities belong to it.

bp_city <- muni %>% 
  filter(abbrev_state =="SP" & name_muni %in% c("Atibaia","Bom Jesus Dos Perdões",
                                                "Bragança Paulista","Joanópolis",
                                                "Nazaré Paulista", "Pedra Bela",
                                                "Pinhalzinho","Piracaia","Tuiuti",
                                                "Vargem"))%>%
  mutate("name_metro" = "Regional Unit of Bragança Paulista city") %>%
  select(name_muni, geom, name_metro)

3.2.3 Combining the Spatial Data Frame

With all the fields selected from the 2 layers, we can easily combine the data using rbind. This will add the municipalities that is not found in the metro layer into 1 Spatial Data Frame. We will also convert name_metro from a factor to a character type to make plotting easier.

all_muni <- rbind(metro_state_clean,bp_city)%>% 
  mutate(name_metro = as.character(name_metro))

3.3 Joining Aspatial and Spatial Data

As there are many features that are not used, we will be removing these features before joining the data. We will select the city and all the features that contain “COMP”. We will be exclude COMP_T which is domestic services as it is 0 across all municipality and it would not help us out with analysis.

brazil_data_clean <- brazil_data %>%
  select (CITY,contains(c("COMP")),-"COMP_T")

Next we will perform a join of the spatial data with the clean aspatial data.

brazil_geo <- left_join(all_muni,brazil_data_clean, by = c("name_muni" = "CITY"))

3.4 Calculations

We will be computing the Location Quotients for each municipality. The formula is as follows: (number of companies from industry X in municipality)/(total number of companies in municipality)/ ((number of companies from industry X in study area)/ (total number of companies in study area))

The code chunk below replace all the missing value with 0 and calculate the location quotient for each industry.

brazil_lq <- brazil_geo %>%
  mutate_all(~replace(., is.na(.), 0))%>%
  mutate_at(vars(contains("COMP")),funs(LQ = (./COMP_TOT)/(sum(.)/sum(COMP_TOT))))

4 EDA

4.1 Boundaries of Metropolitan City

The code below create a simple map that distinguish each metro city on the map. This can help us to identify the metropolitan boundaries in Sau Paulo. From this map we can see that some of the metropolitan city is a lot larger compared to others

tm_shape(brazil_lq)+
  tm_fill("name_metro")+
  tm_borders(lwd = 1)

## Some legend labels were too wide. These labels have been resized to 0.63. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.

This is the distribution of each of the company after applying normalisation to each of the location quotients. We can see that even after apply min-max normalisation, not all the variable follows the normal distribution.

ggplot(data= brazil_lq %>% st_set_geometry(NULL)%>%
  as.data.frame() %>%
  select(contains("LQ")) %>%
  normalize()%>%
  gather("Industry","Num_Company",contains("LQ")), 
             aes_string(x = "Num_Company"))+
  facet_wrap("Industry") +
  geom_histogram(bins=10, 
                 color="black", 
                 fill="light blue")

## Warning: Removed 206 rows containing non-finite values (stat_bin).

5 Choropleth Maps

We will be preparing chloropleth map to show the distribution of spatial specialization by industry type.

5.1 Convert Spatial Data Frame to SP Object

brazil_sp <- as_Spatial(brazil_lq)

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition

## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition

Set the tmap mode to plot mode

tmap_mode("plot")

## tmap mode set to plotting

5.2 Creating a Function for Each Tmap

As there are a total of 21 industry, it will be a hassel to repeat the codes 21 times. Hence, we will be creating a function to help us to generate the Location Quotient chloropleth map based on each industry.

plot_lq_tmap <- function(lq) {
  lq_tmap <- tm_shape(brazil_sp)+
    tm_borders(alpha=0.8)+
    tm_fill(lq)+
    tm_layout(main.title = paste("Location Quotients for Industry ",str_sub(lq,-4,-4)),
            main.title.position = "center",
            legend.show =  TRUE,
            legend.text.size = 0.6,
            frame = TRUE)
  tmap_arrange(lq_tmap)
}

Get a list of all the features by using colnames. We convert the sp object to data frame and select all the Location Quotient columns. After that we apply colnames to the resulted dataframe to get a list of the features that we need.

industry_list <- colnames(brazil_sp %>%
                            as.data.frame() %>%
                            select (contains(c("LQ")),-"COMP_TOT_LQ"))

Perform a loop to create a tmap object for each industry and append to a list.

tmap_list <- list()
for(i in 1:length(industry_list)) 
  {
   tmap_list <- c(tmap_list, assign(paste("LQ", i, sep = ""), plot_lq_tmap(industry_list[[i]])))
}

Walk is a function that help us to iterate through each element in a list. In order to make an animated plot, we will be iterating through the list of tmap object created in the previous code chunk. Together with the r code chunk setting, the choropleth map will play in a loop like an animation.

walk(tmap_list, print)

From the choropleth map, we can see that the distribution of each industry across Sau Paulo. Different industry have different kind of distribution. For example Industry A which is Agriculture, livestock, forestry, fishing and aquaculture are mainly near the right side of the study area. This concentration could be explain by their location which is near to the sea. This would make their main source of income to be fishing.

As for Industry U, it is evenly distributed across the municipality. Industry U is Human health and social services which is an essential service. As we can see from the legend it varies from between 0 to 2 which its variance is a lot lesser compared to industry which varies from 0 to 70. Although there are municipality with lesser healthcare companies, most of them are evenly spread which suggest that healthcare in Sau Paulo is well taken care of.

Likewise for industry P which is education, it is being well spread across the municipality. With each of the choropleth map we can see that in term of social development, it is somewhat equally distributed across each of the municipality. However, since we do not evaluate it based on the population size, there could still be a chance where certain municipality have greater access to these facilities compared to others. This is evident as municipality which a larger land size do not have a greater concentration of these services.

In general, most of the industries is somewhat evenly spread across the Sau Paulo Macrometropolitan.

6 Correlation Analysis

6.1 Preparing Data for Correlation Analysis

In this code chunk we remove the COMP_TOT_LQ as the location quotient will always be 1 for all municipality

brazil_lq_df <- brazil_lq %>%
  as.data.frame() %>%
  select(contains("LQ"))%>%
  select(-"COMP_TOT_LQ")%>%
  mutate_all(~replace(., is.na(.), 0))

The code chunk below uses cor calculate the correlation matrix for all the industry. None of the number exceed 0.7 which means that we do not need to remove any variable as none of the variable will lead to multi-colinearity.

cluster_vars.cor = cor(brazil_lq_df)
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

The features with the highest correlation is industry K (Financial, insurance and related services activities) and industry M (Professional, scientific and technical activities) with a score of 0.6. These 2 industries doesn’t seems to be related but both requires educated people which might explain their high correlation

While industry A (Agriculture, livestock, forestry, fishing and aquaculture) and industry G (Trade; repair of motor vehicles and motorcycles) is the most negatively correlated with a score of -0.56. This could be because the farmers and fisherman do not earn as much as the professional staff that is why they could not afford motor vehicles which leads to negative correlation.

However, none of them are strongly correlated such that it will lead to multi co-linearity.

7 Hierarchy Cluster Analysis

Clustering help us to identify municipality which have similar behaviors and group them together. For example, grouping the well educated municipality as a group and not so well educated as another.

7.1 Preparing clustering variables

Fortunately, we do not have any features which are highly correlated hence there is no need to remove any features. However, we need to change row name to municipality name as we could not have a dataframe with categorical variable to perform clustering. In order to identify each records with respect to their municipality we need to make the row names as the municipality name.

row.names(brazil_lq_df) <- make.unique(brazil_lq$name_muni)
head(brazil_lq_df)

##                      COMP_A_LQ COMP_B_LQ COMP_C_LQ COMP_D_LQ COMP_E_LQ
## Cabreúva             1.6779333  3.608341  2.456897 0.0000000 2.9005319
## Campo Limpo Paulista 1.0347479  0.000000  1.827087 0.0000000 0.8570854
## Itupeva              1.7042262  3.242012  2.240169 1.5792344 2.6060617
## Jarinu               2.8015282  4.778127  1.319434 0.0000000 1.2802847
## Jundiaí              0.9356752  1.285533  0.910678 0.3415654 1.3151909
## Louveira             2.1416336  0.000000  1.708842 0.0000000 2.5043617
##                      COMP_F_LQ COMP_G_LQ COMP_H_LQ COMP_I_LQ COMP_J_LQ
## Cabreúva             1.0260814  1.088566 2.3387028  1.004653 0.2175585
## Campo Limpo Paulista 1.2801750  1.181146 0.9890531  1.045489 0.6107257
## Itupeva              1.0812533  1.015435 1.2980329  1.168658 0.3583642
## Jarinu               1.0316255  1.145486 1.4774134  1.368911 0.6481997
## Jundiaí              0.9132823  1.021213 1.1577283  1.086507 0.7609971
## Louveira             1.1484326  1.233127 1.9636940  1.282290 0.3287257
##                      COMP_K_LQ COMP_L_LQ COMP_M_LQ COMP_N_LQ COMP_O_LQ
## Cabreúva             0.7497962 0.7173763 0.3862925 0.4599131 2.8497140
## Campo Limpo Paulista 0.2492540 0.6733460 0.6392205 0.4800370 2.5262072
## Itupeva              0.4678297 1.6176846 0.6093094 0.6664862 3.4138705
## Jarinu               0.2068484 0.4470316 0.5967785 0.7465310 3.7735614
## Jundiaí              0.6273455 1.2519203 1.0776323 0.9148801 0.8306669
## Louveira             0.2157950 0.8744375 0.4335898 0.4163100 4.9209695
##                      COMP_P_LQ COMP_Q_LQ COMP_R_LQ COMP_S_LQ COMP_U_LQ
## Cabreúva             0.5308189 0.8918234  0.305231 0.8119685         0
## Campo Limpo Paulista 2.1959421 0.7002292  0.676451 0.7197918         0
## Itupeva              1.0775056 0.3205132  1.005558 0.8997600         0
## Jarinu               1.4448600 0.5061185  1.111505 0.4838401         0
## Jundiaí              1.1193982 1.3237265  1.136867 0.9401525         0
## Louveira             0.9930198 0.4180073  0.856508 0.7536454         0

8 Data Standardisation

8.1 Min-Max standardisation

As we can see in the choropleth map, the variance of each industry is different. With features of different scales, the clustering results may be more affected by the features with greater scales as the clustering was computed by measuring distance between the features. Hence, we will apply min max standardisation using normalise function. Note that the default method for normalise function is min max standardise.

brazil_lq_df.std <- normalize(brazil_lq_df)
summary(brazil_lq_df.std)

##    COMP_A_LQ          COMP_B_LQ         COMP_C_LQ        COMP_D_LQ      
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.005384   1st Qu.:0.00000   1st Qu.:0.1787   1st Qu.:0.00000  
##  Median :0.027261   Median :0.01412   Median :0.2839   Median :0.00000  
##  Mean   :0.107719   Mean   :0.06464   Mean   :0.3165   Mean   :0.01851  
##  3rd Qu.:0.114746   3rd Qu.:0.05951   3rd Qu.:0.4210   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##    COMP_E_LQ         COMP_F_LQ        COMP_G_LQ        COMP_H_LQ     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.01955   1st Qu.:0.3177   1st Qu.:0.4831   1st Qu.:0.1747  
##  Median :0.08226   Median :0.4509   Median :0.5383   Median :0.2505  
##  Mean   :0.10537   Mean   :0.4402   Mean   :0.5356   Mean   :0.2963  
##  3rd Qu.:0.14351   3rd Qu.:0.5528   3rd Qu.:0.6025   3rd Qu.:0.3615  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    COMP_I_LQ        COMP_J_LQ         COMP_K_LQ         COMP_L_LQ     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.1911   1st Qu.:0.02256   1st Qu.:0.07436   1st Qu.:0.1240  
##  Median :0.2322   Median :0.03811   Median :0.14918   Median :0.2936  
##  Mean   :0.2606   Mean   :0.05217   Mean   :0.16285   Mean   :0.2971  
##  3rd Qu.:0.2885   3rd Qu.:0.05514   3rd Qu.:0.21736   3rd Qu.:0.4280  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##    COMP_M_LQ        COMP_N_LQ        COMP_O_LQ          COMP_P_LQ     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.1335   1st Qu.:0.1086   1st Qu.:0.005648   1st Qu.:0.1594  
##  Median :0.1964   Median :0.1914   Median :0.014244   Median :0.2251  
##  Mean   :0.2096   Mean   :0.2108   Mean   :0.050124   Mean   :0.2316  
##  3rd Qu.:0.2689   3rd Qu.:0.2695   3rd Qu.:0.038530   3rd Qu.:0.3029  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000000   Max.   :1.0000  
##    COMP_Q_LQ        COMP_R_LQ        COMP_S_LQ        COMP_U_LQ       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000000  
##  1st Qu.:0.1935   1st Qu.:0.1590   1st Qu.:0.1269   1st Qu.:0.000000  
##  Median :0.3113   Median :0.2176   Median :0.1604   Median :0.000000  
##  Mean   :0.3256   Mean   :0.2078   Mean   :0.1711   Mean   :0.009808  
##  3rd Qu.:0.4222   3rd Qu.:0.2694   3rd Qu.:0.1989   3rd Qu.:0.000000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000000

With min max standardization all the values fall between the range of 0 and 1. This will ensure a fair clustering across all features.

8.2 Computing proximity matrix

As mentioned previously, clustering take account into the distance of each municipal across each feature. The code chunk below help us to calculate the euclidean distance and return us a proximity matrix based on each municipal.

proxmat <- dist(brazil_lq_df, method = 'euclidean')

8.3 Computing hierarchical clustering

We will be using ward method where the objective function of this method is to reduce the intra-cluster variance. In simpler term, we want to make sure that the difference between each municipality within a cluster group is being minimise. So each cluster will contain municipality with the most similar behavior.

hclust_ward <- hclust(proxmat, method = 'ward.D')

Next we will plot a dengrogram based on the ward cluster that is perform in the previous code chunk. As there are many municipality, it is difficult to use eye power to determine the best number of cluster.

plot(hclust_ward, cex = 0.6)

8.4 Selecting the optimal clustering algorithm

Before we proceed to select the number cluster, we want to select the most optimal clustering method. The code chunk below iterate through each of the methods and return us the agglomerative coefficient using the agnes function. The closer the coefficient is to 1 the better the clustering method.

methods <- c( "average","single" , "complete","ward")
names(methods) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(brazil_lq_df, method = x)$ac
}

map_dbl(methods,ac)

##   average    single  complete      ward 
## 0.9587330 0.9305396 0.9725822 0.9873336

From the results, we can see that ward clustering method is the most optimal among the rest.

8.5 Selecting the optimal number of cluster

we will be using the gap statistic method to determine the optimal number of cluster. The gap statistic compares the total within intra-cluster variation across different number of cluster. The larger the value, the further away the cluster structure is from random uniform distribution.

set.seed(12345)
gap_stat <- clusGap(brazil_lq_df, FUN = hcut, nstart = 25, K.max = 20, B = 50)
# Print the result
print(gap_stat, method = "firstmax")

## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = brazil_lq_df, FUNcluster = hcut, K.max = 20, B = 50,     nstart = 25)
## B=50 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 1
##           logW   E.logW      gap     SE.sim
##  [1,] 7.218456 8.462920 1.244463 0.02499099
##  [2,] 6.946855 8.125598 1.178743 0.03485009
##  [3,] 6.693381 7.981988 1.288607 0.01957957
##  [4,] 6.624222 7.905851 1.281629 0.01901593
##  [5,] 6.488012 7.841814 1.353802 0.02020927
##  [6,] 6.418956 7.785693 1.366736 0.02071964
##  [7,] 6.376828 7.734372 1.357544 0.01960608
##  [8,] 6.316574 7.688499 1.371925 0.01912219
##  [9,] 6.173703 7.646229 1.472525 0.01996544
## [10,] 6.134732 7.607543 1.472811 0.01941677
## [11,] 6.058396 7.572282 1.513886 0.01908242
## [12,] 6.022341 7.538742 1.516401 0.01893998
## [13,] 5.987075 7.506980 1.519906 0.01887916
## [14,] 5.938941 7.477479 1.538539 0.01881704
## [15,] 5.889481 7.449358 1.559877 0.01864375
## [16,] 5.860065 7.422465 1.562400 0.01821216
## [17,] 5.821592 7.397136 1.575544 0.01828022
## [18,] 5.766986 7.373224 1.606238 0.01771952
## [19,] 5.734121 7.349939 1.615817 0.01715475
## [20,] 5.683892 7.327298 1.643406 0.01709163

It can be difficult to visualize based on the number. We will be plotting the gap statistics in the code chunk below.

fviz_gap_stat(gap_stat)

As we can see the gap statistic kept on increase it means that the more cluster there is better the cluster structure is. However as the number of cluster increase, it will eventually reach the state where each municipality belongs to one cluster and the clustering will no longer be useful.

Hence, we pick 9 as the optimal number of cluster as after the sharp increase from 8 to 9 clusters the increment in gap statistics doesn’t increase significantly as the number of cluster increase.

8.6 Plot Cluster on Dendrogram

Now we will plot the cluster in the Dendrogram. Each of the color boxes represent the a cluster group.

plot(hclust_ward, cex = 0.6)
rect.hclust(hclust_ward, k = 9, border = 1:9)

We can see that the cluster size are not evenly spread out. The fourth and fifth cluster (cyan and red) in the middle consist of majority of the municipality. The rest of the cluster are relatively small.

8.7 Visually-driven hierarchical clustering analysis

With the dendrogram above there aren’t much insight we can gather. Hence, we will be building a highly interactive cluster heatmap. This will allow us to visualise how are the cluster determine based on the distribution of the location quotient of each industry

8.7.1 Transforming the data frame into a matrix

First we will need to convert the dataframe in to a matrix.

brazil_lq_mat <- data.matrix(brazil_lq_df)

8.7.2 Plotting Interactive Cluster Heatmap

As there are a number of feature that is used in the clustering it is difficult to see which variable is responsible for the particular cluster group. With the heatmap, we can see that for certain cluster, they have certain unique distribution of industry for each municipality.

heatmaply(normalize(brazil_lq_mat),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 9,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of Sau Paulo by Industry LQ",
          xlab = "Industry LQ",
          ylab = "Sau Paulo Municipal"
          )

One example would be the second last cluster (browish orange) where they have the highest concentration of Industry F, construction companies, while the first cluster have the highest concentration of Industry A (Agriculture, livestock, forestry, fishing and aquaculture). The can help us to determine the characteristics of each cluster. As each of the cluster have a unique concentration of a particular industry as shown in the heatmap, we will proceed with 9 clusters for the other analysis.

9 Mapping the clusters formed

We will use cuttree to create a 9 cluster model and save it as a list of factors.

groups <- as.factor(cutree(hclust_ward, k=9))

Next we will convert the group list into a matrix and merge it with our spatial data frame so we can plot it using tmap. Then we rename the matrix group as CLUSTER.

brazil_lq_cluster<- cbind(brazil_lq, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)

Now we will quickly plot the municipalities with respect to their cluster group on the map.

qtm(brazil_lq_cluster, "CLUSTER")

As you can see from the map, the cluster are dispersed all over the Sau Paulo Macrometropolis. This map doesn’t bring much insights to us but we can see that for cluster 8 there are still very close to each other. If you could remember the choropleth map for Industry A, those were the area that are concentrated with Agriculture, livestock, forestry, fishing and aquaculture companies. This may be a sign that industry A is consistent with the clustering methods both numerically as well as spatially. With that being said, the rest of the industry are still very fragmented.

10 Spatially Constrained Clustering

With the analysis stated above, there is a need to identify the distribution of industry on the spatial dimension. Hence we will make use of spatially constrained cluster SKATER approach to see how are the municipality clustered based on both the industry distribution and their location on the map.

10.1 Computing Neighbour List

First we will compute the neighbour list from the polygon list. If you notice, one of the polygon “Ilhabela” is disconnected by water source from the rest of Sau Paulo Macrometropolis. Hence, if we were to compute the neighbour list using contiguity base Ilhabela will not have any neighbour. Hence, we include a snap of 0.02 so that every polygon will at least 1 neighbour.

brazil.nb <- poly2nb(brazil_sp,snap = 0.02)
summary(brazil.nb)

## Neighbour list object:
## Number of regions: 186 
## Number of nonzero links: 1140 
## Percentage nonzero weights: 3.295179 
## Average number of links: 6.129032 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 15 29 
##  2  9  7 29 39 36 22 17  7  7  4  4  1  1  1 
## 2 least connected regions:
## 9 39 with 1 link
## 1 most connected region:
## 171 with 29 links

With that, we will plot municipality boundaries and their neighbour which is represented by the links from the centroid of each municipality.

plot(brazil_sp, border=grey(.5))
plot(brazil.nb, coordinates(brazil_sp), col="blue", add=TRUE)

10.2 Computing minimum spanning tree

10.2.1 Calculating edge costs

Now we will compute the cost of each edge based on the distribution of the different industry.

lcosts <- nbcosts(brazil.nb, brazil_lq_df)

Next, we will incorporate the costs into a weights object. In other words, we convert the neighbour list to a list weights object by specifying the just computed lcosts as the weights.

brazil.w <- nb2listw(brazil.nb, lcosts, style="B")
summary(brazil.w)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 186 
## Number of nonzero links: 1140 
## Percentage nonzero weights: 3.295179 
## Average number of links: 6.129032 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 15 29 
##  2  9  7 29 39 36 22 17  7  7  4  4  1  1  1 
## 2 least connected regions:
## 9 39 with 1 link
## 1 most connected region:
## 171 with 29 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn       S0      S1       S2
## B 186 34596 30754.69 6396520 68711399

10.2.2 Computing minimum spanning tree

Minimum Spanning Tree is the graph nodes with the minimum number of link on the map and make sure that all municipality are interconnected.

brazil.mst <- mstree(brazil.w)

The code chunk below display the first few values of the minimum spanning tree matrix where 1 and 2 refers to the municiaplity index and 3 refers to their distance.

head(brazil.mst)

##      [,1] [,2]      [,3]
## [1,]  179  106 19.310313
## [2,]  106  178  5.793052
## [3,]  178  183  8.667057
## [4,]  183  177  6.947653
## [5,]  177  175  2.411848
## [6,]  175  154  2.547543

With that we will plot the minimum spanning tree to see how much link is being reduced when there is only one link connecting to each node.

plot(brazil_sp, border=gray(.5))
plot.mst(brazil.mst, coordinates(brazil_sp), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)

10.3 Computing spatially constrained clusters using SKATER method

Now we will make use of the minimum spanning tree together with the location quotient to perform SKATER clustering. we will be using euclidean distance and 9 clusters. Note that it is 8 as the parameter input is 1 less than the number of clusters we are using.

clust9 <- skater(brazil.mst[,1:2], brazil_lq_df, method = "euclidean", 8)

Next, we want to extract the cluster labels from the skater object created by performing SKATER clustering.

ccs9 <- clust9$groups

Next, we visualize the distribution of municipality across the cluster group. However, we can see that most of the cluster group is relatively sparse as compared to the hierarchical clustering and most od the municipalities belongs to cluster 1.

table(ccs9)

## ccs9
##   1   2   3   4   5   6   7   8   9 
## 169   1   1   7   1   1   1   4   1

we will plot the pruned tree that shows the 9 clusters across the municiaplity.

plot(clust9, coordinates(brazil_sp), cex.lab=.7, cex.circles=0.005)
plot(brazil_sp, border=gray(.5), add=TRUE)

10.4 Visualising Cluster on Choropleth Map

The code chunk below plot the derived cluster based on SKATER clustering onto a choropleth map.

groups_mat <- as.matrix(clust9$groups)
brazil_sf_spatialcluster <- cbind(brazil_lq_cluster, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(brazil_sf_spatialcluster, "SP_CLUSTER")

Lastly, we place the hierarchial clustering and SKATER clustering choropleth map side by side for easier comparison.

hclust.map <- qtm(brazil_lq_cluster,
                  "CLUSTER") + 
  tm_borders(alpha = 0.5) 

shclust.map <- qtm(brazil_sf_spatialcluster,
                   "SP_CLUSTER") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(hclust.map, shclust.map,
             asp=NA, ncol=2)

## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

As you can see from the 2 choropleth map when we take the spatial as a constraints, it seems that 9 clusters seems to be not performing too well. There are quite a number of cluster with only 1 municipality in it. There are only 3 main spatial cluster in the skater cluster map which does not really provide us with much insights. Cluster 4 is the cluster with Agriculture, livestock, forestry, fishing and aquaculture companies which yield a consistent result between the hirearchical clustering and spatially constraint clustering. Similarly for cluster 8 where the municipalities have the most Transport, storage and mail companies. Cluster 1 where most of the municipalities belongs might represent that most of Sau Paulo municipalities have similar characteristics.

Tapirai is a cluster by itself in spatially constraint map but it is being group together with another municipality, São Pedro in the hireachical clustering. According to the choropleth map, Tapirai has the highest concentration in other service activities but São Pedro have many Trade; repair of motor vehicles and motorcycles companies. These 2 municipality are also quite far apart. With SKATER clustering, we are able to separate the 2 as different cluster.

11 Additional

11.1 Optimal Cluster with DBSCAN and Silhouette Score

As the cluster form above does not seem as meaningful we will be evaluating the optimal number of cluster. In this evaluation method we will be using dbscan as the clustering algorithm to determine the best number of cluster based on silhouette score. It is known that the higher the sihouette score the better the more “similar” municipalities within a cluster is.

fviz_nbclust(brazil_lq_df, dbscan, method = "silhouette", k.max = 30) + theme_minimal() + ggtitle("The Silhouette Plot")

With that being said, 17 clusters seems to be the most optimal number of cluster. we will try to apply skater cluster once again. I will not be repeating the explanation since it is the same.

clust17 <- skater(brazil.mst[,1:2], brazil_lq_df, method = "euclidean", 16)

ccs17 <- clust17$groups

tmap_mode("plot")

## tmap mode set to plotting

groups_mat <- as.matrix(clust17$groups)
brazil_sf_spatialcluster <- cbind(brazil_lq_cluster, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(brazil_sf_spatialcluster, "SP_CLUSTER")

With the new optimal number of cluster we can see that there is 5 main spatial cluster. 4,8,11,12,17. Cluster 4 is the cluster with Agriculture, livestock, forestry, fishing and aquaculture companies. Cluster 8 is the cluster with most Transport, storage and mail companies. As we refer to the choropleth map of each industry location quotient, the cluster does not fit into any of the trends. This means that they may take account where the cluster may represent that they detect pattern that might require a more in depth analysis. For example, cluster 11 is a large cluster with relatively little Arts, culture, sport and recreation but when it comes to other industries they tend to similar to the rest of the majority.

Cluster 5 is a single municipality, Tapirai, by itself. According to the choropleth map, it has the highest concentration in other service activities. This is the same as the skater plot that was done previously.

Take Home Ex 3

Fabian

11/5/2020