In an increasingly globalised world, businesses often have to expand beyond local waters to remain competitive and relevant. In Singapore, Enterprise Singapore, a statutory board under the Ministry of Trade and Industry, assists small and medium-sized enterprises (SMEs) in internationalising by expanding overseas.
One country frequently recommended by Enterprise Singapore is Brazil. As an emerging market with plenty of growth upside, Brazil can serve as a greenfield for Singapore's SMEs to grow and capitalise on demand for their goods and services. However, as a developing economy with a large land area, Brazil's spatial development is still uneven; companies will therefore require a good understanding of the competitiveness of local industries and their geographical specialisations.
We will be focusing on the São Paulo Macrometropolis, a Brazilian megalopolis. As one of the most populous urban agglomerations in the world, this area provides a good opportunity to delineate industry clusters through geographic segmentation.
We will also be analysing the attribute data based on the 2016 municipal boundaries.
Through this analysis we aim to:

- compute and map the location quotient (LQ) of each industry for every municipality in the São Paulo Macrometropolis;
- delineate industry specialisation clusters using hierarchical and spatially constrained clustering; and
- derive insights that businesses can use when deciding where in the region to expand.
The following code chunk loads the relevant packages to be utilised in this study.
packages = c('geobr','rgdal', 'sf', 'spdep', 'tmap', 'tidyverse','plotly','tmaptools', 'ggpubr', 'cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych')
# Install any missing packages, then load each one
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}
We will first import the variable data for each city in Brazil. The data is in a CSV file delimited by semicolons (;). The following code chunk imports the data accordingly.
brazil <- read.csv("data/aspatial/BRAZIL_CITIES.csv",sep=";", fileEncoding = "UTF-8")
The data provides a comprehensive list of variables relating to cities in Brazil. The full data description for each variable can be found here
head(brazil)
Now that we have the attribute data, we can rename some of the columns to make the data more understandable. We will additionally convert any NA values to 0 and keep only the cities in the state of São Paulo (SP), which will serve as the reference region for the location quotients computed later.
brazil_new <- brazil %>%
rename("IND_AGRICULTURE"="COMP_A",
"IND_EXTRACTIVE"="COMP_B",
"IND_TRANSFORMATION"="COMP_C",
"IND_ELECTRIC_GAS"="COMP_D",
"IND_WASTE_SEWAGE"="COMP_E",
"IND_CONSTRUCTION"="COMP_F",
"IND_VEHICLE_TRADE_REPAIR"="COMP_G",
"IND_TRANSPORT_STORAGE_MAIL"="COMP_H",
"IND_ACCOMD_FOOD"="COMP_I",
"IND_INFO_COMM"="COMP_J",
"IND_FINANCE_SVCS"="COMP_K",
"IND_REAL_ESTATE"="COMP_L",
"IND_PRO_SCIENTIC_TECHNICAL"="COMP_M",
"IND_ADMINISTRATIVE"="COMP_N",
"IND_PUBLIC_ADMIN_DEFENSE_SS"="COMP_O",
"IND_EDUCATION"="COMP_P",
"IND_HEALTH_SS"="COMP_Q",
"IND_ART_SPORT_RECRE"="COMP_R",
"IND_OTHER_SVCS"="COMP_S",
"IND_DOMESTIC_SVCS"="COMP_T",
"IND_INTERNL_INSTITUTIONS"="COMP_U")%>%
mutate_all(~replace(., is.na(.), 0))%>%
filter(STATE == "SP")
We will be utilising the geobr package to access the spatial data sets for the megalopolis, the São Paulo Macrometropolis. The code chunk below downloads the municipalities of the state of São Paulo, where the São Paulo Macrometropolis is located, as well as all the metro areas in Brazil.
muni <- read_municipality(code_muni = "SP",year=2016)
metro <- read_metro_area(year=2016)
Now we will need to extract our study area for further analysis. Based on the Wikipedia page for the São Paulo Macrometropolis, the megalopolis has the following divisions:
| # | Region | Population |
|---|---|---|
| 1 | Metropolitan Region of São Paulo | 21,860,000 |
| 2 | Metropolitan Region of Campinas | 3,300,000 |
| 3 | Metropolitan Region of Vale do Paraíba e Litoral Norte | 2,550,000 |
| 4 | Metropolitan Region of Sorocaba | 2,150,000 |
| 5 | Metropolitan Region of Baixada Santista | 1,880,000 |
| 6 | Piracicaba Urban Agglomeration | 1,500,000 |
| 7 | Jundiaí Urban Agglomeration | 825,000 |
| 8 | Regional Unit of Bragança Paulista city | 480,000 |
As such, let's first filter the relevant metropolitan regions and agglomerations of our study area from the metro data set.
filter_metro <- metro%>%
filter(name_metro %in% c("RM São Paulo","RM Campinas","RM do Vale do Paraíba e Litoral Norte","RM de Sorocaba","RM Baixada Santista","Aglomeração Urbana de Piracicaba-AU- Piracicaba","Aglomeração Urbana de Jundiaí"))%>%
select(code_muni,name_muni,name_metro,geom)
Now let's filter the remaining municipalities belonging to the Regional Unit of Bragança Paulista city. The following areas are in the Regional Unit:
filter_muni <- muni %>%
filter(name_muni %in% c('Atibaia','Bom Jesus Dos Perdões','Bragança Paulista','Joanópolis','Nazaré Paulista', 'Pedra Bela','Pinhalzinho','Piracaia','Tuiuti','Vargem'))%>%
mutate (name_metro ="Regional Unit of Bragança Paulista city")%>%
select(code_muni,name_muni,name_metro,geom)
Now that we have all our required data, we can combine both spatial datasets into one.
map<-rbind(filter_metro, filter_muni)
We can additionally join the attribute data from brazil_new into the spatial layer. As part of data preparation, we will also rename some cities, and check for and remove duplicates.
# change name of Santa Bárbara D’oeste to match geospatial dataset
brazil_new$CITY[brazil_new$CITY == "Santa Bárbara D'Oeste"] <- "Santa Bárbara D'oeste"
#join with spatial data
sp_map <- left_join(map, brazil_new,
by = c("name_muni" = "CITY"))
sp_map <-sp_map[!duplicated(sp_map$name_muni),]
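As a quick sanity check on the join, we can count the municipalities that failed to match a record in brazil_new; this small check is an addition to the original workflow (unmatched rows would carry NA in COMP_TOT after the left join).
# Number of municipalities with no matching attribute record
sum(is.na(sp_map$COMP_TOT))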
We can do a quick visualisation of our study area to confirm that everything is correct.
qtm(sp_map, fill="name_metro")
## Some legend labels were too wide. These labels have been resized to 0.63. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
To better understand how concentrated a particular industry is in a region compared to a larger reference area, we can calculate what are known as location quotients (LQs). They enable us to quantify how "concentrated" an industry is in a region relative to the larger geographic area.
We can calculate the location quotients by utilising the following formula. Note that, matching the code below, the reference region is the state of São Paulo (the extent of brazil_new).
\[\frac{\left(\frac{number\ of\ companies\ in\ industry\ \alpha\ in\ municipality\ i}{all\ companies\ in\ municipality\ i}\right)}{\left(\frac{number\ of\ companies\ in\ industry\ \alpha\ in\ Sao\ Paulo\ state}{all\ companies\ in\ Sao\ Paulo\ state}\right)} \]
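For example, if 30% of a municipality's companies are in agriculture while agriculture accounts for only 10% of companies across the reference region, the agriculture LQ is 0.30 / 0.10 = 3: agriculture is three times more concentrated in that municipality than in the region as a whole. An LQ above 1 therefore indicates a local specialisation.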
# Compute the location quotient of every industry in one pass:
# numerator   = the industry's share of companies within the municipality
# denominator = the industry's share of companies across São Paulo state (brazil_new)
splq_map <- sp_map %>%
  mutate(across(starts_with("IND_"),
                ~ (.x / COMP_TOT) /
                  (sum(brazil_new[[cur_column()]]) / sum(brazil_new$COMP_TOT)),
                .names = "{.col}_LQ")) %>%
  select(name_muni, name_metro, starts_with("IND"), contains("LQ"))
We have successfully computed the location quotient for each industry by municipality. Let's dive deeper.
To better understand the distribution of these industry location quotient variables, we can utilise histograms to identify the overall distribution of the data values.
The following code chunk loops through the location quotient variables and generates a histogram for each.
head(splq_map)
# Get the column names of all location quotient variables
names <- colnames(splq_map %>%
  select(contains("LQ")))
b = 1
# Generate one histogram per LQ variable, stored as ggplot_1, ggplot_2, ...
for (name in names){
  nameplot = paste("ggplot", b, sep="_")
  assign(nameplot, ggplot(data=splq_map, aes_string(x=name)) +
    geom_histogram(bins=10, color="black", fill="light blue"))
  b = 1+b
}
Let's visualise these histograms together.
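One possible way to arrange them, sketched below, is ggarrange() from the ggpubr package loaded earlier; it assumes the ggplot_1, ggplot_2, ... objects created by the loop above.
# Collect the generated histograms into a list and lay them out in a grid
plot_list <- mget(paste("ggplot", 1:(b - 1), sep = "_"))
ggarrange(plotlist = plot_list, ncol = 4, nrow = 6)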
We can observe that most of the location quotients are right skewed, indicating that across most municipalities there is a low concentration of those industries. Industries of note are construction, transformation, transport, health & social services, and vehicle trade and repair: these are more normally distributed, indicating that they occur at a range of concentrations across the municipalities and are also more common across the region.
Additionally, from the distribution of IND_DOMESTIC_SVCS_LQ we can see that the region possibly does not have domestic services as an industry. It might be useful to remove this industry from further analysis.
To have a quick view of the location quotient for each industry by municipality we can create choropleth maps.
The following code chunk does this.
# Get the LQ column names, excluding the domestic services industry
names <- colnames(splq_map %>%
  select(contains("LQ"), -"IND_DOMESTIC_SVCS_LQ"))
b = 1
# Generate one choropleth map per LQ variable, stored as tmap_1, tmap_2, ...
for (name in names){
  nameplot = paste("tmap", b, sep="_")
  assign(nameplot,
    tm_shape(splq_map) +
      tm_fill(col = name,
              style = "jenks",
              title = name) +
      tm_borders(alpha = 0.5))
  b = 1+b
}
From the choropleth maps we can observe that certain industries are strongly concentrated in particular municipalities, indicated by the darker red colour.
Interesting industries to note are agriculture, electricity and gas, extractive industries, and public administration, which show deep concentrations in certain regions. This makes sense, as industries like agriculture, extraction, or electricity and gas may require natural features such as fertile ground or an abundance of natural resources. The regions where they are more concentrated could be areas that possess the natural features these industries demand.
Another interesting industry is education, which is spread almost evenly across the regions. As education is an important and highly demanded service, it is expected to be present across all municipalities.
While the choropleth maps reveal useful information about the various industry concentrations, there are so many different maps to look at that they are hard to digest and compare. Identifying clusters might therefore be useful for better understanding the spatial distributions.
We can utilise clustering as a tool to delineate industry specialisation: we will apply both a hierarchical clustering method and a spatially constrained clustering method. The variables for clustering will be the location quotient scores of each municipality.
Before we perform cluster analysis, it is important to ensure that the cluster variables are not highly correlated. The following code creates the required correlation plot.
cluster_init <- as.data.frame(splq_map)%>%
select(name_muni,contains("LQ"),-"IND_DOMESTIC_SVCS_LQ")
cluster_vars.cor = cor(cluster_init[,2:21])
corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black")
Based on the correlation plot, some variables have a slight correlation. This is expected, as industries do not operate in silos; they interact with other industries to procure the goods and services they require. As the correlation scores are not very high, we can continue to utilise all these industries in our clustering.
The first cluster analysis that we can perform is hierarchical cluster analysis.
As a first step, let's extract the clustering variables to be used in our cluster analysis.
cluster_vars <- splq_map %>%
st_set_geometry(NULL) %>%
select(name_muni,contains("LQ"),-"IND_DOMESTIC_SVCS_LQ")
head(cluster_vars,10)
We removed IND_DOMESTIC_SVCS_LQ above as it is an industry that is not present in the São Paulo Macrometropolis.
Next, we will set the row names to the municipality names and drop the name column, using the code chunk below.
row.names(cluster_vars) <- cluster_vars$"name_muni"
sp_ind <- select(cluster_vars, c(2:21))
head(sp_ind, 10)
Where possible we would prefer not to standardise the variables; however, the value ranges between the variables are rather large, as can be seen in the histograms in the earlier section. We can also view the ranges in the summary table below.
summary(sp_ind)
## IND_AGRICULTURE_LQ IND_EXTRACTIVE_LQ IND_TRANSFORMATION_LQ IND_ELECTRIC_GAS_LQ
## Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. : 0.0000
## 1st Qu.: 0.1047 1st Qu.: 0.0000 1st Qu.:0.6943 1st Qu.: 0.0000
## Median : 0.4726 Median : 0.9814 Median :1.0751 Median : 0.0000
## Mean : 1.9071 Mean : 4.2335 Mean :1.2054 Mean : 0.8009
## 3rd Qu.: 1.9648 3rd Qu.: 3.7706 3rd Qu.:1.5925 3rd Qu.: 0.0000
## Max. :17.0036 Max. :68.4546 Max. :3.5695 Max. :56.7820
## IND_WASTE_SEWAGE_LQ IND_CONSTRUCTION_LQ IND_VEHICLE_TRADE_REPAIR_LQ
## Min. : 0.0000 Min. :0.0000 Min. :0.2343
## 1st Qu.: 0.6293 1st Qu.:0.7076 1st Qu.:0.9898
## Median : 1.1945 Median :0.9709 Median :1.1008
## Mean : 1.5273 Mean :0.9690 Mean :1.0910
## 3rd Qu.: 2.0076 3rd Qu.:1.1931 3rd Qu.:1.2283
## Max. :13.9357 Max. :2.1413 Max. :2.0490
## IND_TRANSPORT_STORAGE_MAIL_LQ IND_ACCOMD_FOOD_LQ IND_INFO_COMM_LQ
## Min. :0.0000 Min. :0.2877 Min. :0.0000
## 1st Qu.:0.6803 1st Qu.:0.9135 1st Qu.:0.2660
## Median :0.9424 Median :1.0574 Median :0.4090
## Mean :1.1104 Mean :1.2051 Mean :0.5529
## 3rd Qu.:1.3518 3rd Qu.:1.3014 3rd Qu.:0.5840
## Max. :3.7699 Max. :4.4768 Max. :9.9770
## IND_FINANCE_SVCS_LQ IND_REAL_ESTATE_LQ IND_PRO_SCIENTIC_TECHNICAL_LQ
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2517 1st Qu.:0.3168 1st Qu.:0.3995
## Median :0.4011 Median :0.6513 Median :0.5649
## Mean :0.4502 Mean :0.6601 Mean :0.6158
## 3rd Qu.:0.5896 3rd Qu.:0.8931 3rd Qu.:0.7663
## Max. :2.6342 Max. :2.0834 Max. :2.8560
## IND_ADMINISTRATIVE_LQ IND_PUBLIC_ADMIN_DEFENSE_SS_LQ IND_EDUCATION_LQ
## Min. :0.0000 Min. : 0.1450 Min. :0.0000
## 1st Qu.:0.3849 1st Qu.: 0.6861 1st Qu.:0.7463
## Median :0.6123 Median : 1.6120 Median :1.0712
## Mean :0.6825 Mean : 3.7157 Mean :1.0589
## 3rd Qu.:0.8617 3rd Qu.: 3.8706 3rd Qu.:1.4115
## Max. :3.1221 Max. :69.9574 Max. :2.3454
## IND_HEALTH_SS_LQ IND_ART_SPORT_RECRE_LQ IND_OTHER_SVCS_LQ
## Min. :0.0000 Min. :0.0000 Min. :0.1286
## 1st Qu.:0.3823 1st Qu.:0.7413 1st Qu.:0.7382
## Median :0.5849 Median :0.9772 Median :0.9395
## Mean :0.6237 Mean :0.9275 Mean :0.9523
## 3rd Qu.:0.7962 3rd Qu.:1.1875 3rd Qu.:1.1388
## Max. :1.8649 Max. :1.9842 Max. :2.5004
## IND_INTERNL_INSTITUTIONS_LQ
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.06269
## 3rd Qu.:0.00000
## Max. :5.97916
As such we can utilise a min-max standardisation to standardise the variables.
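For reference, min-max standardisation rescales each variable to the [0, 1] interval; the normalize() function from heatmaply, used below, applies this column-wise.
\[x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]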
sp_ind.std <- normalize(sp_ind)
summary(sp_ind.std)
## IND_AGRICULTURE_LQ IND_EXTRACTIVE_LQ IND_TRANSFORMATION_LQ IND_ELECTRIC_GAS_LQ
## Min. :0.000000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.006157 1st Qu.:0.00000 1st Qu.:0.1945 1st Qu.:0.0000
## Median :0.027794 Median :0.01434 Median :0.3012 Median :0.0000
## Mean :0.112160 Mean :0.06184 Mean :0.3377 Mean :0.0141
## 3rd Qu.:0.115554 3rd Qu.:0.05508 3rd Qu.:0.4461 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## IND_WASTE_SEWAGE_LQ IND_CONSTRUCTION_LQ IND_VEHICLE_TRADE_REPAIR_LQ
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.04515 1st Qu.:0.3305 1st Qu.:0.4163
## Median :0.08571 Median :0.4534 Median :0.4775
## Mean :0.10959 Mean :0.4525 Mean :0.4721
## 3rd Qu.:0.14406 3rd Qu.:0.5572 3rd Qu.:0.5477
## Max. :1.00000 Max. :1.0000 Max. :1.0000
## IND_TRANSPORT_STORAGE_MAIL_LQ IND_ACCOMD_FOOD_LQ IND_INFO_COMM_LQ
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1805 1st Qu.:0.1494 1st Qu.:0.02666
## Median :0.2500 Median :0.1837 Median :0.04099
## Mean :0.2945 Mean :0.2190 Mean :0.05542
## 3rd Qu.:0.3586 3rd Qu.:0.2420 3rd Qu.:0.05853
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## IND_FINANCE_SVCS_LQ IND_REAL_ESTATE_LQ IND_PRO_SCIENTIC_TECHNICAL_LQ
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.09555 1st Qu.:0.1520 1st Qu.:0.1399
## Median :0.15225 Median :0.3126 Median :0.1978
## Mean :0.17089 Mean :0.3168 Mean :0.2156
## 3rd Qu.:0.22384 3rd Qu.:0.4287 3rd Qu.:0.2683
## Max. :1.00000 Max. :1.0000 Max. :1.0000
## IND_ADMINISTRATIVE_LQ IND_PUBLIC_ADMIN_DEFENSE_SS_LQ IND_EDUCATION_LQ
## Min. :0.0000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.1233 1st Qu.:0.007751 1st Qu.:0.3182
## Median :0.1961 Median :0.021014 Median :0.4567
## Mean :0.2186 Mean :0.051148 Mean :0.4515
## 3rd Qu.:0.2760 3rd Qu.:0.053365 3rd Qu.:0.6018
## Max. :1.0000 Max. :1.000000 Max. :1.0000
## IND_HEALTH_SS_LQ IND_ART_SPORT_RECRE_LQ IND_OTHER_SVCS_LQ
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2050 1st Qu.:0.3736 1st Qu.:0.2570
## Median :0.3136 Median :0.4925 Median :0.3419
## Mean :0.3344 Mean :0.4674 Mean :0.3473
## 3rd Qu.:0.4269 3rd Qu.:0.5985 3rd Qu.:0.4259
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## IND_INTERNL_INSTITUTIONS_LQ
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.01048
## 3rd Qu.:0.00000
## Max. :1.00000
Now we can see that all the values range between 0 and 1; as such, we can cluster with all variables at the same magnitude, without any one variable dominating the clustering results.
For clustering we need a distance measure between observations. The following code chunk creates a proximity matrix using the Euclidean method.
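For reference, the Euclidean distance between two municipalities with standardised LQ profiles x and y across p industries is:
\[d(x,y) = \sqrt{\sum_{k=1}^{p}(x_{k} - y_{k})^{2}} \]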
proxmat <- dist(sp_ind.std, method = 'euclidean')
Before starting our clustering, it is useful to identify which linkage algorithm provides the strongest clustering structure. We can compare the agglomerative coefficients of several methods using agnes(); values closer to 1 indicate a stronger structure.
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(sp_ind.std, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.7262692 0.6818087 0.7890747 0.9168273
With reference to the output above, we can see that Ward’s method provides the strongest clustering structure among the four methods assessed. Hence, in the subsequent analysis, only Ward’s method will be used.
As part of the cluster analysis we will also need to identify the optimal number of clusters. To do so we will utilise the gap statistic method. We will set k.max = 20, as that is the number of industries we have, and any more clusters than that may be hard to interpret.
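For reference, the gap statistic compares the log of the pooled within-cluster dispersion W_k with its expectation under a reference null distribution, and favours values of k where the gap is large:
\[Gap_{n}(k) = E_{n}^{*}\{\log(W_{k})\} - \log(W_{k}) \]
This corresponds to the logW, E.logW and gap columns in the output below.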
set.seed(12345)
gap_stat <- clusGap(sp_ind.std, FUN = hcut, nstart = 25, K.max = 20, B = 50)
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = sp_ind.std, FUNcluster = hcut, K.max = 20, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 20
## logW E.logW gap SE.sim
## [1,] 3.701891 4.381207 0.6793160 0.009195348
## [2,] 3.598698 4.329566 0.7308684 0.008585474
## [3,] 3.547557 4.300043 0.7524856 0.007879301
## [4,] 3.504016 4.274828 0.7708123 0.008095224
## [5,] 3.463682 4.252821 0.7891383 0.008552328
## [6,] 3.430153 4.232715 0.8025622 0.008898478
## [7,] 3.396369 4.214180 0.8178113 0.008935278
## [8,] 3.371944 4.196644 0.8247006 0.008973340
## [9,] 3.343766 4.180138 0.8363721 0.009026804
## [10,] 3.307810 4.164305 0.8564952 0.009049633
## [11,] 3.280957 4.149013 0.8680560 0.009033655
## [12,] 3.261941 4.134320 0.8723792 0.009064817
## [13,] 3.242174 4.120022 0.8778472 0.009004439
## [14,] 3.223538 4.106127 0.8825887 0.009127173
## [15,] 3.200506 4.092515 0.8920096 0.009238617
## [16,] 3.179226 4.079172 0.8999456 0.009251334
## [17,] 3.162137 4.066062 0.9039247 0.009254161
## [18,] 3.140525 4.053194 0.9126682 0.009257948
## [19,] 3.123803 4.040421 0.9166174 0.009297646
## [20,] 3.105960 4.027778 0.9218181 0.009295561
We can visualise the plot with the following code chunk.
fviz_gap_stat(gap_stat)
The 18 clusters suggested by the gap statistic plot is too high a number to yield meaningful clusters. As such we will utilise 10 as the number of clusters, since there is a sharp increase in the gap statistic from 9 to 10 clusters.
Now that we have the various parameters prepared, we can run the clustering and visualise the results.
hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.6)
rect.hclust(hclust_ward, k = 10, border = 2:5)
We can see in the dendrogram that with 10 clusters some clusters are uneven, containing few municipalities; it might be better to drop to 9 clusters.
With an interactive heatmap we can visualise which variables contribute more to which cluster. It will assist us in understanding the unique characteristics of each cluster.
sp_ind.std_mat <- data.matrix(sp_ind.std)
heatmaply(normalize(sp_ind.std_mat),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 9,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="Geographic Segmentation of Greater Sao Paulo by Industry LQ",
xlab = "Industry Location Quotients",
ylab = "Municipalities of Greater Sao Paulo"
)
As each cluster seems to have a unique mix of industries, we can continue with 9 clusters for our analysis.
Based on our observation of the dendrogram, we can map the results of a 9-cluster model.
groups <- as.factor(cutree(hclust_ward, k=9))
splq_sf_cluster <- cbind(splq_map, as.matrix(groups)) %>%
rename(`CLUSTER`=`as.matrix.groups.`)
qtm(splq_sf_cluster, "CLUSTER")
We can see that the clusters are dispersed all over the São Paulo Macrometropolis, and we cannot derive any straightforward insights. It is interesting to note the strip of pink along the coastline of the megalopolis. The majority industries here are sports and recreation and administration. This makes sense, as people may go there for recreation activities and water sports.
We can instead utilise a method known as the SKATER (Spatial 'K'luster Analysis by Tree Edge Removal) approach, which builds a minimum spanning tree from the contiguity graph of the municipalities and partitions it by pruning edges, so that the resulting clusters are spatially contiguous.
The first step is to convert our spatial data into a SpatialPolygonsDataFrame.
splq_sp <- as_Spatial(splq_map)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition
We will then need to compute a neighbours list. To ensure that each polygon has at least one neighbour, we will include snap=0.02; that way, even regions separated by a body of water will have neighbours.
splq.nb <- poly2nb(splq_sp,snap=0.02)
summary(splq.nb)
## Neighbour list object:
## Number of regions: 174
## Number of nonzero links: 984
## Percentage nonzero weights: 3.250099
## Average number of links: 5.655172
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 12 13 27
## 4 8 13 35 34 31 18 13 7 6 3 1 1
## 4 least connected regions:
## 8 9 36 100 with 1 link
## 1 most connected region:
## 161 with 27 links
We now need to compute the cost of each edge, which is the distance between its two nodes in the attribute space of the standardised location quotients.
lcosts <- nbcosts(splq.nb, sp_ind.std)
Next, we will incorporate these costs into a weights object, as shown in the code chunk below.
splq.w <- nb2listw(splq.nb, lcosts, style="B")
summary(splq.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 174
## Number of nonzero links: 984
## Percentage nonzero weights: 3.250099
## Average number of links: 5.655172
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 12 13 27
## 4 8 13 35 34 31 18 13 7 6 3 1 1
## 4 least connected regions:
## 8 9 36 100 with 1 link
## 1 most connected region:
## 161 with 27 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 174 30276 771.8576 1409.952 18347.3
The following code computes the minimum spanning tree.
splq.mst <- mstree(splq.w)
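Optionally, we can visualise the tree over the municipality boundaries; the chunk below is a sketch using spdep's plot method for mst objects, not part of the original workflow.
# Draw the municipality boundaries, then overlay the MST edges at the polygon centroids
plot(splq_sp, border = gray(0.5))
plot(splq.mst, coordinates(splq_sp), col = "blue", cex.lab = 0.6, add = TRUE)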
Now we will make use of the minimum spanning tree, together with the location quotients, to perform SKATER clustering. As decided earlier, we will continue to utilise the Euclidean method and 9 clusters (the cuts parameter is one less, so 8).
clust9 <- skater(splq.mst[,1:2], sp_ind.std, method = "euclidean", 8)
# extract the cluster labels from the skater object
ccs9 <- clust9$groups
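As a quick check, we can count how many municipalities fall into each spatially constrained cluster.
# Municipality counts per SKATER cluster
table(ccs9)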
Now that we have computed the spatially constrained clusters, we can utilise a choropleth map to visualise them.
groups_mat <- as.matrix(ccs9)
splq_sf_spatialcluster <- cbind(splq_sf_cluster, as.factor(groups_mat)) %>%
rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(splq_sf_spatialcluster, "SP_CLUSTER")
From the map we can see that clusters 1, 3 and 5 are the largest, occupying the most land area. Let's take a look at the profiles of these clusters.
spatialcluster.summary <- splq_sf_spatialcluster %>%
  gather("Industry", "LQ", contains("LQ"))
# Total LQ by industry, faceted by spatially constrained cluster
ggplot(spatialcluster.summary, aes(x = Industry, y = LQ, fill = Industry)) +
  facet_wrap(~ SP_CLUSTER) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
From this plot we can see that cluster 3 has a large presence of all industries, indicating that this region is probably more developed than the others, with the bulk of economic activity centred here. In cluster 1, a large portion is extractive industries, which could mean these regions are where most resource extraction occurs. Cluster 1 also hosts other industries; it could be that the main economic driver is the extractive industry and the rest help to support it.
Finally, cluster 5 also has a large presence of extractive industries as well as public administration, defence and social services, indicating that this too is an important resource mining region (defence to protect the extractive industry, perhaps?). However, the extractive industry is not as prevalent as in cluster 1; it could be that this cluster is still growing.
As such, new businesses entering the market can observe where their industry fits and go to the region most relevant to them. For example, since the characteristics of cluster 5 seem similar to those of cluster 1, peripheral industries may want to move in, as they did in cluster 1, and capitalise on the money coming from the extractive industries in cluster 5.
Through this analysis we were able to identify and map the location quotient of each industry by municipality, and we then utilised them to delineate industry specialisation clusters. We found that hierarchical clustering is not as spatially useful as spatially constrained clustering. We were also able to identify the characteristics of certain industry clusters and provide suggestions on how entrepreneurs can use this information for their business expansion.
The analysis in this project was enabled by, and the code adapted from, the following useful guide:
A work by Rajiv Abraham Xavier