1 Introduction

In an increasingly globalised world, businesses often have to expand beyond local waters to remain competitive and relevant. In Singapore, Enterprise Singapore, a statutory board under the Ministry of Trade and Industry, assists small and medium-sized enterprises (SMEs) in internationalising by expanding overseas.

One country that Enterprise Singapore often highly recommends is Brazil. As an emerging market with plenty of growth upside, Brazil can serve as a greenfield for Singapore's SMEs to grow and capitalise on their goods and services. However, as a developing economy with a large land area, its spatial development is still uneven; companies will therefore require a good understanding of the competitiveness of local industries and their geographical specialisations.

1.1 Study Area

We will be focusing on the São Paulo Macrometropolis, a Brazilian megalopolis. As one of the most populous urban agglomerations in the world, this area provides a good opportunity to delineate industry clusters through geographic segmentation.

We will be analysing the variable data based on the 2016 administrative boundaries.

1.2 Aim of Analysis

Through this analysis we aim to:

  • Compute location quotients by industry type and examine the spatial distribution of industry specialisation
  • Delineate and map industry specialisation clusters using the hierarchical clustering and spatially constrained clustering methods

1.3 Data

We will be utilising the following data sources:

  • BRAZIL_CITIES.csv: a data file of 81 columns and 5,573 rows, each row representing one municipality. The data was extracted from Kaggle.
  • Municipality and metropolis boundary files downloaded through the geobr package

2 Install and load packages

The following code chunk installs (if necessary) and loads the relevant packages to be utilised in this study.

# packages required for this study
packages = c('geobr', 'rgdal', 'sf', 'spdep', 'tmap', 'tidyverse', 'plotly',
             'tmaptools', 'ggpubr', 'cluster', 'factoextra', 'NbClust',
             'heatmaply', 'corrplot', 'psych')

# install any missing packages, then load each one
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3 Data Import and Preparation

3.1 Importing Aspatial data

3.1.1 Import Data

We will first import the variable data for each city in Brazil. The data is in a CSV file delimited by semicolons (;). The following code chunk imports the data accordingly.

brazil <- read.csv("data/aspatial/BRAZIL_CITIES.csv",sep=";", fileEncoding = "UTF-8")

3.1.2 Preview Data

The data provides a comprehensive list of variables that relate to cities in Brazil. The data description for each variable can be found here.

head(brazil)

Now that we have the attribute data, we can rename some of the columns to make the data more understandable. We will also convert any NA values to 0 and filter the data down to the state of São Paulo (STATE == "SP").

brazil_new <- brazil %>%
  rename("IND_AGRICULTURE"="COMP_A",
         "IND_EXTRACTIVE"="COMP_B",
         "IND_TRANSFORMATION"="COMP_C",
         "IND_ELECTRIC_GAS"="COMP_D",
         "IND_WASTE_SEWAGE"="COMP_E",
         "IND_CONSTRUCTION"="COMP_F",
         "IND_VEHICLE_TRADE_REPAIR"="COMP_G",
         "IND_TRANSPORT_STORAGE_MAIL"="COMP_H",
         "IND_ACCOMD_FOOD"="COMP_I",
         "IND_INFO_COMM"="COMP_J",
         "IND_FINANCE_SVCS"="COMP_K",
         "IND_REAL_ESTATE"="COMP_L",
         "IND_PRO_SCIENTIC_TECHNICAL"="COMP_M",
         "IND_ADMINISTRATIVE"="COMP_N",
         "IND_PUBLIC_ADMIN_DEFENSE_SS"="COMP_O",
         "IND_EDUCATION"="COMP_P",
         "IND_HEALTH_SS"="COMP_Q",
         "IND_ART_SPORT_RECRE"="COMP_R",
         "IND_OTHER_SVCS"="COMP_S",
         "IND_DOMESTIC_SVCS"="COMP_T",
         "IND_INTERNL_INSTITUTIONS"="COMP_U")%>%
  mutate_all(~replace(., is.na(.), 0))%>%
  filter(STATE == "SP")

3.2 Importing Geospatial data

We will be utilising the geobr package to access the spatial datasets for the megalopolis, the São Paulo Macrometropolis. The code chunk below downloads the municipalities of the state of São Paulo, where the São Paulo Macrometropolis is located, as well as all the metropolitan areas in Brazil.

muni <- read_municipality(code_muni = "SP",year=2016)
metro <- read_metro_area(year=2016)

3.3 Extracting study area

Now we will need to extract our study area for use in our further analysis. Based on the Wikipedia page for the São Paulo Macrometropolis, the megalopolis has the following divisions:

   Region                                                    Population
1  Metropolitan Region of São Paulo                          21,860,000
2  Metropolitan Region of Campinas                            3,300,000
3  Metropolitan Region of Vale do Paraíba e Litoral Norte     2,550,000
4  Metropolitan Region of Sorocaba                            2,150,000
5  Metropolitan Region of Baixada Santista                    1,880,000
6  Piracicaba Urban Agglomeration                             1,500,000
7  Jundiaí Urban Agglomeration                                  825,000
8  Regional Unit of Bragança Paulista city                      480,000

As such, let's first filter out the metropolitan regions and urban agglomerations relevant to our study area from the metro dataset.

filter_metro <- metro%>%
  filter(name_metro %in% c("RM São Paulo","RM Campinas","RM do Vale do Paraíba e Litoral Norte","RM de Sorocaba","RM Baixada Santista","Aglomeração Urbana de Piracicaba-AU- Piracicaba","Aglomeração Urbana de Jundiaí"))%>%
  select(code_muni,name_muni,name_metro,geom)

Now let's filter out the remaining municipalities belonging to the Regional Unit of Bragança Paulista city. The following areas are in the Regional Unit, and the code chunk after the list filters them out:

  • Atibaia
  • Bom Jesus Dos Perdões
  • Bragança Paulista
  • Joanópolis
  • Nazaré Paulista
  • Pedra Bela
  • Pinhalzinho
  • Piracaia
  • Tuiuti
  • Vargem

filter_muni <- muni %>%
  filter(name_muni %in% c('Atibaia','Bom Jesus Dos Perdões','Bragança Paulista','Joanópolis','Nazaré Paulista',  'Pedra Bela','Pinhalzinho','Piracaia','Tuiuti','Vargem'))%>%
  mutate (name_metro ="Regional Unit of Bragança Paulista city")%>%
  select(code_muni,name_muni,name_metro,geom)

Now that we have all our required data, we can combine both spatial datasets into one.

map<-rbind(filter_metro, filter_muni)

We can additionally join the attribute data from brazil_new into the spatial layer. As part of data preparation, we will also rename some cities and check for and remove duplicates.

# change name of Santa Bárbara D’oeste to match geospatial dataset
brazil_new$CITY[brazil_new$CITY == "Santa Bárbara D'Oeste"] <- "Santa Bárbara D'oeste"
#join with spatial data
sp_map <- left_join(map, brazil_new, 
                              by = c("name_muni" = "CITY"))
sp_map <-sp_map[!duplicated(sp_map$name_muni),] 

We can do a quick visualisation of our study area to confirm that everything is correct.

qtm(sp_map, fill="name_metro")
## Some legend labels were too wide. These labels have been resized to 0.63. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.

3.4 Deriving the location quotients for industries

To better understand how concentrated a particular industry is in a region compared to the nation, we can calculate what are known as location quotients (LQs). They enable us to quantify how "concentrated" an industry is in a region relative to a larger benchmark area.

We can calculate the location quotients by utilising the following formula:

\[\frac{(\frac{number\ of\ industry\ \alpha \ in\ municipality\ \iota}{sum\ of\ all\ industry\ in\ municipality\ \iota})}{(\frac{number\ of\ industry\ \alpha\ in\ Greater\ Sao\ Paulo}{sum\ of\ all\ industry\ in\ Greater\ Sao\ Paulo})} \]
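
As a quick worked example with toy numbers (hypothetical, not drawn from the dataset): a municipality where 20 of its 100 establishments are in agriculture, inside a benchmark region where 500 of 10,000 establishments are in agriculture, has an LQ of (20/100)/(500/10,000) = 4, i.e. agriculture is four times as concentrated there as in the benchmark region. An LQ above 1 therefore signals local specialisation.

# toy numbers (hypothetical) illustrating the LQ computation
muni_ind    <- 20      # industry-alpha establishments in municipality i
muni_total  <- 100     # all establishments in municipality i
bench_ind   <- 500     # industry-alpha establishments in the benchmark region
bench_total <- 10000   # all establishments in the benchmark region
(muni_ind / muni_total) / (bench_ind / bench_total)  # LQ = 4

The following code chunk computes the LQ for every industry, using the totals from brazil_new (all municipalities in the state of São Paulo) as the benchmark.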

splq_map<- sp_map %>%
  mutate(`IND_AGRICULTURE_LQ` = (IND_AGRICULTURE/COMP_TOT)/(sum(brazil_new$IND_AGRICULTURE)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_EXTRACTIVE_LQ` = (IND_EXTRACTIVE/COMP_TOT)/(sum(brazil_new$IND_EXTRACTIVE)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_TRANSFORMATION_LQ` = (IND_TRANSFORMATION/COMP_TOT)/(sum(brazil_new$IND_TRANSFORMATION)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_ELECTRIC_GAS_LQ` = (IND_ELECTRIC_GAS/COMP_TOT)/(sum(brazil_new$IND_ELECTRIC_GAS)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_WASTE_SEWAGE_LQ` = (IND_WASTE_SEWAGE/COMP_TOT)/(sum(brazil_new$IND_WASTE_SEWAGE)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_CONSTRUCTION_LQ` = (IND_CONSTRUCTION/COMP_TOT)/(sum(brazil_new$IND_CONSTRUCTION)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_VEHICLE_TRADE_REPAIR_LQ` = (IND_VEHICLE_TRADE_REPAIR/COMP_TOT)/(sum(brazil_new$IND_VEHICLE_TRADE_REPAIR)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_TRANSPORT_STORAGE_MAIL_LQ` = (IND_TRANSPORT_STORAGE_MAIL/COMP_TOT)/(sum(brazil_new$IND_TRANSPORT_STORAGE_MAIL)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_ACCOMD_FOOD_LQ` = (IND_ACCOMD_FOOD/COMP_TOT)/(sum(brazil_new$IND_ACCOMD_FOOD)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_INFO_COMM_LQ` = (IND_INFO_COMM/COMP_TOT)/(sum(brazil_new$IND_INFO_COMM)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_FINANCE_SVCS_LQ` = (IND_FINANCE_SVCS/COMP_TOT)/(sum(brazil_new$IND_FINANCE_SVCS)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_REAL_ESTATE_LQ` = (IND_REAL_ESTATE/COMP_TOT)/(sum(brazil_new$IND_REAL_ESTATE)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_PRO_SCIENTIC_TECHNICAL_LQ` = (IND_PRO_SCIENTIC_TECHNICAL/COMP_TOT)/(sum(brazil_new$IND_PRO_SCIENTIC_TECHNICAL)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_ADMINISTRATIVE_LQ` = (IND_ADMINISTRATIVE/COMP_TOT)/(sum(brazil_new$IND_ADMINISTRATIVE)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_PUBLIC_ADMIN_DEFENSE_SS_LQ` = (IND_PUBLIC_ADMIN_DEFENSE_SS/COMP_TOT)/(sum(brazil_new$IND_PUBLIC_ADMIN_DEFENSE_SS)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_EDUCATION_LQ` = (IND_EDUCATION/COMP_TOT)/(sum(brazil_new$IND_EDUCATION)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_HEALTH_SS_LQ` = (IND_HEALTH_SS/COMP_TOT)/(sum(brazil_new$IND_HEALTH_SS)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_ART_SPORT_RECRE_LQ` = (IND_ART_SPORT_RECRE/COMP_TOT)/(sum(brazil_new$IND_ART_SPORT_RECRE)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_OTHER_SVCS_LQ` = (IND_OTHER_SVCS/COMP_TOT)/(sum(brazil_new$IND_OTHER_SVCS)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_DOMESTIC_SVCS_LQ` = (IND_DOMESTIC_SVCS/COMP_TOT)/(sum(brazil_new$IND_DOMESTIC_SVCS)/sum(brazil_new$COMP_TOT))) %>%
  mutate(`IND_INTERNL_INSTITUTIONS_LQ` = (IND_INTERNL_INSTITUTIONS/COMP_TOT)/(sum(brazil_new$IND_INTERNL_INSTITUTIONS)/sum(brazil_new$COMP_TOT))) %>%
  select(name_muni,name_metro,starts_with(c("IND")),contains(c("LQ")))
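
As an aside, the 21 near-identical mutate() calls above can be collapsed into a single across() call. A behaviour-preserving sketch, assuming dplyr 1.0 or later (for across() and cur_column()); splq_map2 should match splq_map:

# compute all 21 LQ columns at once; cur_column() looks up the matching
# industry totals in brazil_new for the denominator
splq_map2 <- sp_map %>%
  mutate(across(
    starts_with("IND_"),
    ~ (.x / COMP_TOT) /
      (sum(brazil_new[[cur_column()]]) / sum(brazil_new$COMP_TOT)),
    .names = "{.col}_LQ"
  )) %>%
  select(name_muni, name_metro, starts_with("IND"), contains("LQ"))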

We have successfully computed the location quotient for each industry by municipality. Let's dive in deeper.

4 Exploratory Data Analysis

To better understand the distribution of these industry location quotient variables, we can utilise histograms to identify the overall distribution of the data values.

The following code chunk generates a histogram for each location quotient variable.

head(splq_map)
# extract the LQ column names, dropping the sticky geometry column first
names <- splq_map %>%
  st_drop_geometry() %>%
  select(contains("LQ")) %>%
  colnames()
b <- 1
for (name in names){
  nameplot <- paste("ggplot", b, sep = "_")
  # aes_string() captures the column name at creation time, so it is safe in a loop
  assign(nameplot,
         ggplot(data = splq_map, aes_string(x = name)) +
           geom_histogram(bins = 10, color = "black", fill = "light blue"))
  b <- b + 1
}

Let's visualise these histograms together.
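
One way to view them together is ggarrange() from ggpubr (loaded earlier); a minimal sketch, assuming the loop above produced the objects ggplot_1 to ggplot_21 (one per location quotient variable):

# collect the histogram objects created by the loop and arrange them in a grid
plot_list <- mget(paste("ggplot", 1:21, sep = "_"))
ggarrange(plotlist = plot_list, ncol = 4, nrow = 6)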

We can observe that most of the location quotients are right-skewed, indicating that across the municipalities there is generally a low concentration of those industries. Industries of note are construction, transformation, transport, health & social services, and vehicle trade and repair: these are more normally distributed, indicating that they occur at a range of concentrations across the municipalities and are more common across the region.

Additionally, from the distribution of IND_DOMESTIC_SVCS_LQ we can see that the Greater São Paulo region possibly does not have domestic services as an industry. It might be useful to remove this industry from further analysis.

4.1 Choropleth map of location quotient by industry type

To get a quick view of the location quotient for each industry by municipality, we can create choropleth maps.

The following code chunk does this.

# extract the LQ column names, excluding domestic services and the geometry column
names <- splq_map %>%
  st_drop_geometry() %>%
  select(contains("LQ"), -"IND_DOMESTIC_SVCS_LQ") %>%
  colnames()
b <- 1
for (name in names){
  nameplot <- paste("tmap", b, sep = "_")
  assign(nameplot,
         tm_shape(splq_map) +
           tm_fill(col = name,
                   style = "jenks",
                   title = name) +
           tm_borders(alpha = 0.5))
  b <- b + 1
}

From the choropleth maps we can observe that, in general, certain municipalities have a higher concentration of certain industries. This is indicated by the darker red colour.

Interesting industries to note are agriculture, electricity and gas, extractive industries, and public administration. These have deep concentrations in certain regions. This makes sense, as industries like agriculture, extraction, or electricity and gas may require natural features such as fertile ground or an abundance of natural resources. The regions where they are more concentrated could be areas that offer the natural features these industries demand.

Another interesting industry is education, which is spread almost evenly across the regions. As education is an important and highly demanded industry, it is expected to be present across all the municipalities.

While the choropleth maps reveal useful information about the various industry concentrations, with so many maps to look at it is hard to compare them easily. Identifying clusters may therefore help us better understand the spatial distributions.

5 Delineating Industry Specialisation Clusters

We can utilise clustering as a tool to delineate industry specialisation; we will be utilising the hierarchical clustering method and the spatially constrained clustering method to do so. The clustering variables will be the location quotient scores of each municipality.

5.1 Correlation Analysis

Before we perform cluster analysis, it is important to ensure that the clustering variables are not highly correlated. The following code creates the required correlation plot.

cluster_init <- as.data.frame(splq_map) %>%
  select(name_muni, contains("LQ"), -"IND_DOMESTIC_SVCS_LQ")
cluster_vars.cor = cor(cluster_init[, 2:21])

corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Based on the correlation plot, some variables show slight correlation. This is expected, as industries do not operate in silos: they interact with other industries to procure the goods and services they require to function. As the correlation scores are not very high, we can continue to utilise all the industries in our clustering.

5.2 Hierarchical Cluster Analysis

The first cluster analysis that we can perform is hierarchical cluster analysis.

5.2.1 Extracting clustering variables

As a first step, let's extract the clustering variables that we will utilise in our cluster analysis.

cluster_vars <- splq_map %>%
  st_set_geometry(NULL) %>%
  select(name_muni,contains("LQ"),-"IND_DOMESTIC_SVCS_LQ")
head(cluster_vars,10)

We will remove IND_DOMESTIC_SVCS_LQ as it is an industry that is not present in the São Paulo Macrometropolis.

Next, we will set the row names to the municipality names by using the code chunk below.

row.names(cluster_vars) <- cluster_vars$"name_muni"
sp_ind <- select(cluster_vars, c(2:21))
head(sp_ind, 10)

5.2.2 Data Standardization

Where possible we prefer not to standardise the variables; however, the value ranges between the variables are rather large, as can be seen in the histograms in the earlier section. We can also view the ranges in the summary table below.

summary(sp_ind)
##  IND_AGRICULTURE_LQ IND_EXTRACTIVE_LQ IND_TRANSFORMATION_LQ IND_ELECTRIC_GAS_LQ
##  Min.   : 0.0000    Min.   : 0.0000   Min.   :0.0000        Min.   : 0.0000    
##  1st Qu.: 0.1047    1st Qu.: 0.0000   1st Qu.:0.6943        1st Qu.: 0.0000    
##  Median : 0.4726    Median : 0.9814   Median :1.0751        Median : 0.0000    
##  Mean   : 1.9071    Mean   : 4.2335   Mean   :1.2054        Mean   : 0.8009    
##  3rd Qu.: 1.9648    3rd Qu.: 3.7706   3rd Qu.:1.5925        3rd Qu.: 0.0000    
##  Max.   :17.0036    Max.   :68.4546   Max.   :3.5695        Max.   :56.7820    
##  IND_WASTE_SEWAGE_LQ IND_CONSTRUCTION_LQ IND_VEHICLE_TRADE_REPAIR_LQ
##  Min.   : 0.0000     Min.   :0.0000      Min.   :0.2343             
##  1st Qu.: 0.6293     1st Qu.:0.7076      1st Qu.:0.9898             
##  Median : 1.1945     Median :0.9709      Median :1.1008             
##  Mean   : 1.5273     Mean   :0.9690      Mean   :1.0910             
##  3rd Qu.: 2.0076     3rd Qu.:1.1931      3rd Qu.:1.2283             
##  Max.   :13.9357     Max.   :2.1413      Max.   :2.0490             
##  IND_TRANSPORT_STORAGE_MAIL_LQ IND_ACCOMD_FOOD_LQ IND_INFO_COMM_LQ
##  Min.   :0.0000                Min.   :0.2877     Min.   :0.0000  
##  1st Qu.:0.6803                1st Qu.:0.9135     1st Qu.:0.2660  
##  Median :0.9424                Median :1.0574     Median :0.4090  
##  Mean   :1.1104                Mean   :1.2051     Mean   :0.5529  
##  3rd Qu.:1.3518                3rd Qu.:1.3014     3rd Qu.:0.5840  
##  Max.   :3.7699                Max.   :4.4768     Max.   :9.9770  
##  IND_FINANCE_SVCS_LQ IND_REAL_ESTATE_LQ IND_PRO_SCIENTIC_TECHNICAL_LQ
##  Min.   :0.0000      Min.   :0.0000     Min.   :0.0000               
##  1st Qu.:0.2517      1st Qu.:0.3168     1st Qu.:0.3995               
##  Median :0.4011      Median :0.6513     Median :0.5649               
##  Mean   :0.4502      Mean   :0.6601     Mean   :0.6158               
##  3rd Qu.:0.5896      3rd Qu.:0.8931     3rd Qu.:0.7663               
##  Max.   :2.6342      Max.   :2.0834     Max.   :2.8560               
##  IND_ADMINISTRATIVE_LQ IND_PUBLIC_ADMIN_DEFENSE_SS_LQ IND_EDUCATION_LQ
##  Min.   :0.0000        Min.   : 0.1450                Min.   :0.0000  
##  1st Qu.:0.3849        1st Qu.: 0.6861                1st Qu.:0.7463  
##  Median :0.6123        Median : 1.6120                Median :1.0712  
##  Mean   :0.6825        Mean   : 3.7157                Mean   :1.0589  
##  3rd Qu.:0.8617        3rd Qu.: 3.8706                3rd Qu.:1.4115  
##  Max.   :3.1221        Max.   :69.9574                Max.   :2.3454  
##  IND_HEALTH_SS_LQ IND_ART_SPORT_RECRE_LQ IND_OTHER_SVCS_LQ
##  Min.   :0.0000   Min.   :0.0000         Min.   :0.1286   
##  1st Qu.:0.3823   1st Qu.:0.7413         1st Qu.:0.7382   
##  Median :0.5849   Median :0.9772         Median :0.9395   
##  Mean   :0.6237   Mean   :0.9275         Mean   :0.9523   
##  3rd Qu.:0.7962   3rd Qu.:1.1875         3rd Qu.:1.1388   
##  Max.   :1.8649   Max.   :1.9842         Max.   :2.5004   
##  IND_INTERNL_INSTITUTIONS_LQ
##  Min.   :0.00000            
##  1st Qu.:0.00000            
##  Median :0.00000            
##  Mean   :0.06269            
##  3rd Qu.:0.00000            
##  Max.   :5.97916

As such, we can utilise min-max standardisation to standardise the variables.
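
Min-max standardisation rescales each variable x to the [0, 1] interval:

\[x' = \frac{x - \min(x)}{\max(x) - \min(x)}\]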

# normalize() from the heatmaply package applies column-wise min-max scaling
sp_ind.std <- normalize(sp_ind)
summary(sp_ind.std)
##  IND_AGRICULTURE_LQ IND_EXTRACTIVE_LQ IND_TRANSFORMATION_LQ IND_ELECTRIC_GAS_LQ
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.0000        Min.   :0.0000     
##  1st Qu.:0.006157   1st Qu.:0.00000   1st Qu.:0.1945        1st Qu.:0.0000     
##  Median :0.027794   Median :0.01434   Median :0.3012        Median :0.0000     
##  Mean   :0.112160   Mean   :0.06184   Mean   :0.3377        Mean   :0.0141     
##  3rd Qu.:0.115554   3rd Qu.:0.05508   3rd Qu.:0.4461        3rd Qu.:0.0000     
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.0000        Max.   :1.0000     
##  IND_WASTE_SEWAGE_LQ IND_CONSTRUCTION_LQ IND_VEHICLE_TRADE_REPAIR_LQ
##  Min.   :0.00000     Min.   :0.0000      Min.   :0.0000             
##  1st Qu.:0.04515     1st Qu.:0.3305      1st Qu.:0.4163             
##  Median :0.08571     Median :0.4534      Median :0.4775             
##  Mean   :0.10959     Mean   :0.4525      Mean   :0.4721             
##  3rd Qu.:0.14406     3rd Qu.:0.5572      3rd Qu.:0.5477             
##  Max.   :1.00000     Max.   :1.0000      Max.   :1.0000             
##  IND_TRANSPORT_STORAGE_MAIL_LQ IND_ACCOMD_FOOD_LQ IND_INFO_COMM_LQ 
##  Min.   :0.0000                Min.   :0.0000     Min.   :0.00000  
##  1st Qu.:0.1805                1st Qu.:0.1494     1st Qu.:0.02666  
##  Median :0.2500                Median :0.1837     Median :0.04099  
##  Mean   :0.2945                Mean   :0.2190     Mean   :0.05542  
##  3rd Qu.:0.3586                3rd Qu.:0.2420     3rd Qu.:0.05853  
##  Max.   :1.0000                Max.   :1.0000     Max.   :1.00000  
##  IND_FINANCE_SVCS_LQ IND_REAL_ESTATE_LQ IND_PRO_SCIENTIC_TECHNICAL_LQ
##  Min.   :0.00000     Min.   :0.0000     Min.   :0.0000               
##  1st Qu.:0.09555     1st Qu.:0.1520     1st Qu.:0.1399               
##  Median :0.15225     Median :0.3126     Median :0.1978               
##  Mean   :0.17089     Mean   :0.3168     Mean   :0.2156               
##  3rd Qu.:0.22384     3rd Qu.:0.4287     3rd Qu.:0.2683               
##  Max.   :1.00000     Max.   :1.0000     Max.   :1.0000               
##  IND_ADMINISTRATIVE_LQ IND_PUBLIC_ADMIN_DEFENSE_SS_LQ IND_EDUCATION_LQ
##  Min.   :0.0000        Min.   :0.000000               Min.   :0.0000  
##  1st Qu.:0.1233        1st Qu.:0.007751               1st Qu.:0.3182  
##  Median :0.1961        Median :0.021014               Median :0.4567  
##  Mean   :0.2186        Mean   :0.051148               Mean   :0.4515  
##  3rd Qu.:0.2760        3rd Qu.:0.053365               3rd Qu.:0.6018  
##  Max.   :1.0000        Max.   :1.000000               Max.   :1.0000  
##  IND_HEALTH_SS_LQ IND_ART_SPORT_RECRE_LQ IND_OTHER_SVCS_LQ
##  Min.   :0.0000   Min.   :0.0000         Min.   :0.0000   
##  1st Qu.:0.2050   1st Qu.:0.3736         1st Qu.:0.2570   
##  Median :0.3136   Median :0.4925         Median :0.3419   
##  Mean   :0.3344   Mean   :0.4674         Mean   :0.3473   
##  3rd Qu.:0.4269   3rd Qu.:0.5985         3rd Qu.:0.4259   
##  Max.   :1.0000   Max.   :1.0000         Max.   :1.0000   
##  IND_INTERNL_INSTITUTIONS_LQ
##  Min.   :0.00000            
##  1st Qu.:0.00000            
##  Median :0.00000            
##  Mean   :0.01048            
##  3rd Qu.:0.00000            
##  Max.   :1.00000

Now all the values range between 0 and 1, so we can cluster them at the same magnitude without any one variable impacting the clustering results more than the others.

5.2.3 Computing Proximity Matrix

For clustering, we need a distance measure between observations; we will use the euclidean distance. The following code chunk creates the proximity matrix.
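
For two municipalities with standardised LQ profiles x and y over the 20 retained industries, the euclidean distance is:

\[d(x, y) = \sqrt{\sum_{k=1}^{20}(x_k - y_k)^2}\]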

proxmat <- dist(sp_ind.std, method = 'euclidean')

5.2.4 Finding the Optimal Clustering Algorithm

Before starting on our clustering, it is useful to identify which algorithm provides the strongest clustering structure. The agnes() function of the cluster package reports the agglomerative coefficient, with values closer to 1 suggesting a stronger structure.

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(sp_ind.std, method = x)$ac
}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.7262692 0.6818087 0.7890747 0.9168273

With reference to the output above, we can see that Ward’s method provides the strongest clustering structure among the four methods assessed. Hence, in the subsequent analysis, only Ward’s method will be used.

5.2.5 Identifying Optimal Clusters

As part of cluster analysis we also need to identify the optimal number of clusters. To do so we will be utilising the gap statistic method. We set k.max = 20, as that is the number of industries we have, and any more clusters than that may be hard to interpret.

set.seed(12345)
gap_stat <- clusGap(sp_ind.std, FUN = hcut, nstart = 25, K.max = 20, B = 50)
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = sp_ind.std, FUNcluster = hcut, K.max = 20, B = 50,     nstart = 25)
## B=50 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 20
##           logW   E.logW       gap      SE.sim
##  [1,] 3.701891 4.381207 0.6793160 0.009195348
##  [2,] 3.598698 4.329566 0.7308684 0.008585474
##  [3,] 3.547557 4.300043 0.7524856 0.007879301
##  [4,] 3.504016 4.274828 0.7708123 0.008095224
##  [5,] 3.463682 4.252821 0.7891383 0.008552328
##  [6,] 3.430153 4.232715 0.8025622 0.008898478
##  [7,] 3.396369 4.214180 0.8178113 0.008935278
##  [8,] 3.371944 4.196644 0.8247006 0.008973340
##  [9,] 3.343766 4.180138 0.8363721 0.009026804
## [10,] 3.307810 4.164305 0.8564952 0.009049633
## [11,] 3.280957 4.149013 0.8680560 0.009033655
## [12,] 3.261941 4.134320 0.8723792 0.009064817
## [13,] 3.242174 4.120022 0.8778472 0.009004439
## [14,] 3.223538 4.106127 0.8825887 0.009127173
## [15,] 3.200506 4.092515 0.8920096 0.009238617
## [16,] 3.179226 4.079172 0.8999456 0.009251334
## [17,] 3.162137 4.066062 0.9039247 0.009254161
## [18,] 3.140525 4.053194 0.9126682 0.009257948
## [19,] 3.123803 4.040421 0.9166174 0.009297646
## [20,] 3.105960 4.027778 0.9218181 0.009295561

We can visualise the gap statistic results with the following code chunk.

fviz_gap_stat(gap_stat)

The 18 clusters recommended by the gap statistic is too high a number for meaningful clusters. As such we will use 10 as the number of clusters, since there is a sharp increase in the gap statistic from 9 clusters to 10.

Now that we have the various parameters prepared, we can run the clustering and visualise the results.

hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.6)
rect.hclust(hclust_ward, k = 10, border = 2:5)

We can see in the dendrogram that with 10 clusters some clusters are uneven, containing very few municipalities; it might be better to drop to 9 clusters.

5.2.6 Visualising the Cluster Heatmap

With this interactive heatmap we can visualise which variables contribute more to which cluster. It will assist us in understanding the unique characteristics of each cluster.

sp_ind.std_mat <- data.matrix(sp_ind.std)
heatmaply(normalize(sp_ind.std_mat),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 9,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of Greater Sao Paulo by Industry LQ",
          xlab = "Industry Location Quotients",
          ylab = "Municipalities of Greater Sao Paulo"
          )

As each cluster seems to have a unique mix of industries, we can continue with 9 clusters for our analysis.

5.2.7 Mapping the clusters

Based on our observation of the dendrogram, we can map the results of a 9-cluster model.

groups <- as.factor(cutree(hclust_ward, k=9))
splq_sf_cluster <- cbind(splq_map, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)


qtm(splq_sf_cluster, "CLUSTER")

We can see that the clusters are dispersed all over the São Paulo Macrometropolis, and we cannot really derive any straightforward insights. It is interesting to note the strip of pink along the coastline of the megalopolis. The majority industries there are sports and recreation and administration, which makes sense as people may go there for recreational activities and water sports.

5.3 Spatially Constrained Clustering

The hierarchical clusters above ignore geography. We can therefore utilise a method known as the SKATER (Spatial 'K'luster Analysis by Tree Edge Removal) approach to delineate clusters that are spatially contiguous.

5.3.1 Converting into SpatialPolygonsDataFrame

The first step is to convert our spatial data into a SpatialPolygonsDataFrame.

splq_sp <- as_Spatial(splq_map)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition

5.3.2 Computing Neighbour List

We will then need to compute a neighbour list. To ensure that each polygon has at least one neighbour, we will include snap = 0.02; that way even regions separated by a body of water will have neighbours.

splq.nb <- poly2nb(splq_sp,snap=0.02)
summary(splq.nb)
## Neighbour list object:
## Number of regions: 174 
## Number of nonzero links: 984 
## Percentage nonzero weights: 3.250099 
## Average number of links: 5.655172 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 12 13 27 
##  4  8 13 35 34 31 18 13  7  6  3  1  1 
## 4 least connected regions:
## 8 9 36 100 with 1 link
## 1 most connected region:
## 161 with 27 links
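
As a further sanity check, we can confirm that no municipality was left without a neighbour. A small sketch using card() from spdep, which returns the number of links for each region:

# regions with zero links would be isolated; we expect none
sum(card(splq.nb) == 0)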

5.3.3 Calculating edge costs

We now need to compute the cost of each edge: the dissimilarity between the standardised LQ profiles of the two municipalities it connects.

lcosts <- nbcosts(splq.nb, sp_ind.std)

Next, we will incorporate these costs into a weights object, as shown in the code chunk below.

splq.w <- nb2listw(splq.nb, lcosts, style="B")
summary(splq.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 174 
## Number of nonzero links: 984 
## Percentage nonzero weights: 3.250099 
## Average number of links: 5.655172 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 12 13 27 
##  4  8 13 35 34 31 18 13  7  6  3  1  1 
## 4 least connected regions:
## 8 9 36 100 with 1 link
## 1 most connected region:
## 161 with 27 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn       S0       S1      S2
## B 174 30276 771.8576 1409.952 18347.3

5.3.4 Computing minimum spanning tree

The following code computes the minimum spanning tree.

splq.mst <- mstree(splq.w)
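
To inspect the tree, we can overlay it on the municipality boundaries. A sketch using spdep's plot method for mst objects, with polygon centroids from coordinates():

# plot the municipality polygons, then draw the minimum spanning tree on top
plot(splq_sp, border = gray(0.5))
plot.mst(splq.mst, coordinates(splq_sp),
         col = "blue", cex.circles = 0.005, add = TRUE)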

5.3.5 Computing spatially constrained clusters using SKATER method

Now we will make use of the minimum spanning tree together with the location quotients to perform SKATER clustering. As decided earlier, we will continue to utilise the euclidean method and 9 clusters (the number of cuts is one less than the number of clusters, hence 8).

clust9 <- skater(splq.mst[,1:2], sp_ind.std, method = "euclidean", 8)
# extract the cluster labels from the skater object 
ccs9 <- clust9$groups
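
Before mapping, a quick tabulation shows how many municipalities fall into each spatially constrained cluster:

# number of municipalities per SKATER cluster
table(ccs9)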

5.3.6 Mapping the clusters

Now that we have computed the spatially constrained clusters, we can utilise a choropleth map to visualise them.

groups_mat <- as.matrix(ccs9)
splq_sf_spatialcluster <- cbind(splq_sf_cluster, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(splq_sf_spatialcluster, "SP_CLUSTER")
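
Since splq_sf_spatialcluster carries both sets of cluster labels, we can also place the hierarchical and the spatially constrained results side by side. A sketch using tmap_arrange():

# compare hierarchical (CLUSTER) and SKATER (SP_CLUSTER) results side by side
hclust.map <- qtm(splq_sf_spatialcluster, fill = "CLUSTER")
shclust.map <- qtm(splq_sf_spatialcluster, fill = "SP_CLUSTER")
tmap_arrange(hclust.map, shclust.map, ncol = 2, asp = NA)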

From the map we can see that clusters 1, 3 and 5 are the largest, as they occupy the most land area. Let's take a look at the profiles of these clusters.

spatialcluster.summary <- splq_sf_spatialcluster %>%
  gather("Industry", "LQ", contains("LQ"))
ggplot(spatialcluster.summary, aes(x = Industry, y = LQ, fill = Industry)) +
  facet_wrap(~ SP_CLUSTER) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

From this plot we can see that cluster 3 has a large presence of all industries, indicating that this region is probably more developed than the others, with the bulk of economic activities centred here. In cluster 1, a large portion is extractive industries, which could mean these regions are where more resource extraction occurs. Cluster 1 has other industries as well; the main economic driver may be the extractive industries, with the rest helping to support them.

Finally, cluster 5 also has large numbers of extractive industries and public administration, defence and social services industries, indicating that this too is an important resource-mining region since it has both these industries (defence to protect the extraction industry, maybe?). However, the extractive industry is not as prevalent here as in cluster 1; it could be that this cluster is still growing.

As such, new businesses coming in can observe where they fit among the industries and go to the region that is most relevant to them. For example, since the characteristics of cluster 5 seem similar to those of cluster 1, peripheral industries may want to move into cluster 5 as they have in cluster 1, and capitalise on the money coming from its extractive industries.

6 Conclusion

Through this analysis we were able to identify and map the location quotients of each industry by municipality, and we then utilised them to delineate industry specialisation clusters. We found that hierarchical clustering is not as spatially useful as spatially constrained clustering. We were also able to identify the characteristics of certain clusters and provide suggestions on how entrepreneurs can use this information for their business expansion.

7 References and Credits

Codes in this document have been adapted from the following useful guide:

A work by Rajiv Abraham Xavier