Brazil is the world's fifth-largest country by area and the sixth most populous. The World Bank classifies Brazil as an upper-middle-income, developing economy, and it holds the largest share of global wealth in Latin America. It is considered an advanced emerging economy, with the ninth-largest GDP in the world by nominal measures and the eighth-largest by PPP. Behind these impressive figures, however, Brazil's spatial development is highly unequal, as shown in the figure below. In 2016, the GDP per capita of the poorest municipality was R$3,190.6, while that of the richest municipality was R$314,638. Half of the municipalities had a GDP per capita below R$16,000, while the top 25% of municipalities had R$26,155 and above.
In recent years, the Singapore government has been encouraging local enterprises to venture into international markets, and one of the highly recommended countries is Brazil:
However, given Brazil's highly unequal spatial development, any company interested in investing in Brazil must have a good understanding of the geographical specialisation and competitiveness of industries in the country.
This analysis segments the São Paulo Macrometropolis at the municipality level into homogeneous industry-specialisation clusters, using data provided by the Brazilian Institute of Geography and Statistics (IBGE).
The code chunk below checks whether the R packages in the packages list have been installed; if not, it installs them. It then loads the packages into R.
packages = c('rgdal', 'spdep', 'tmap', 'sf', 'ggpubr', 'cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych', 'tidyverse', 'geobr', 'readr', 'expss', 'DT')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
From Kaggle:
From R package geobr:
brazil = read_delim("../data/aspatial/BRAZIL_CITIES.csv", delim = ";")
data_dict = read_delim("../data/aspatial/Data_Dictionary.csv", delim = ";")
muni_sp <- read_municipality(code_muni= "SP",year=2016)
metro <- read_metro_area(year=2016)
Select the required columns and keep only the rows whose FIELD value starts with "COMP".
data_dict_sub <- data_dict %>%
select(FIELD, DESCRIPTION) %>%
filter(str_detect(FIELD, "^COMP"))
Show the table.
datatable(data_dict_sub, options = list(
searching = FALSE,
pageLength = 5
))
Check the validity of the geospatial data.
count_if(FALSE, st_is_valid(muni_sp))
## [1] 1
count_if(FALSE, st_is_valid(metro))
## [1] 0
Make the geospatial data valid, then check its validity again.
muni_sp <- st_make_valid(muni_sp)
count_if(FALSE, st_is_valid(muni_sp))
## [1] 0
Extract the São Paulo area by filtering for abbrev_state equal to "SP" and selecting the required columns.
metro_sp <- metro %>%
filter(abbrev_state == "SP") %>%
dplyr::select(code_muni, name_muni, code_state, abbrev_state, name_metro, geom)
Plot the map to check the regions of the São Paulo Macrometropolis, which comprises eight regions. However, the Regional Unit of Bragança Paulista is missing from the metro dataset and therefore from the map below.
tm_shape(metro_sp) +
tm_fill(c("name_metro"))
Filter the municipalities belonging to the Regional Unit of Bragança Paulista into a new object, muni_sp_bp, and add the name_metro column with the value "Regional Unit of Bragança Paulista city". The municipalities in this regional unit are listed in the code below:
muni_sp_bp <- muni_sp %>%
filter(name_muni %in%
c('Atibaia',
'Bom Jesus Dos Perdões',
'Bragança Paulista',
'Joanópolis',
'Nazaré Paulista',
'Pedra Bela',
'Pinhalzinho',
'Piracaia',
'Tuiuti',
'Vargem'))%>%
mutate(name_metro ="Regional Unit of Bragança Paulista city")
tm_shape(muni_sp_bp) +
tm_fill(c("name_muni"))
Combine the two objects into the full study area.
study_area <- rbind(metro_sp, muni_sp_bp)
tm_shape(study_area) +
tm_fill(c("name_metro"))
Extract the city and industry-related information from the brazil object.
brazil_sp <- brazil %>%
filter(STATE == "SP") %>%
select (CITY,contains(c("COMP")))
Join the data by city name and plot the map to check the result.
study_area_sp <- study_area %>% left_join(brazil_sp, by = c("name_muni" = "CITY"))
tm_shape(study_area_sp) +
tm_fill(c("COMP_TOT"))
After plotting the map, we can see that one value is missing. The reason is that the city's name differs between the two datasets, so the join fails for that city. Change the city name to match the name used in the geospatial data.
brazil_sp$CITY[brazil_sp$CITY == "Santa Bárbara D'Oeste"] <- "Santa Bárbara D'oeste"
Join the data again.
study_area_sp <- study_area %>% left_join(brazil_sp, by = c("name_muni" = "CITY"))
tm_shape(study_area_sp) +
tm_fill(c("COMP_TOT"))
Formula

$$LQ_{i,a} = \frac{C_{i,a} / C_i}{C_a / C}$$

where $C_{i,a}$ is the number of industry-$a$ companies in municipality $i$, $C_i$ is the total number of companies in municipality $i$, $C_a$ is the number of industry-$a$ companies in the study area (the São Paulo Macrometropolis), and $C$ is the total number of companies in the study area.
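As a purely hypothetical illustration: if a municipality had 50 industry-A companies out of 500 companies in total, while the study area had 2,000 industry-A companies out of 40,000 in total, then

$$LQ = \frac{50/500}{2000/40000} = \frac{0.10}{0.05} = 2,$$

so the municipality would be twice as specialised in industry A as the study area as a whole. An LQ above 1 indicates relative specialisation; an LQ below 1 indicates under-representation.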
Replace NA values with 0.
study_area_sp[is.na(study_area_sp)] <- 0
Apply the formula to compute the location quotient for each industry and assign the results to new columns.
study_area_sp_lq2016 <- study_area_sp %>%
mutate(ind_A = (`COMP_A`/`COMP_TOT`) / (sum(study_area_sp$COMP_A)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_B = (`COMP_B`/`COMP_TOT`) / (sum(study_area_sp$COMP_B)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_C = (`COMP_C`/`COMP_TOT`) / (sum(study_area_sp$COMP_C)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_D = (`COMP_D`/`COMP_TOT`) / (sum(study_area_sp$COMP_D)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_E = (`COMP_E`/`COMP_TOT`) / (sum(study_area_sp$COMP_E)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_F = (`COMP_F`/`COMP_TOT`) / (sum(study_area_sp$COMP_F)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_G = (`COMP_G`/`COMP_TOT`) / (sum(study_area_sp$COMP_G)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_H = (`COMP_H`/`COMP_TOT`) / (sum(study_area_sp$COMP_H)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_I = (`COMP_I`/`COMP_TOT`) / (sum(study_area_sp$COMP_I)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_J = (`COMP_J`/`COMP_TOT`) / (sum(study_area_sp$COMP_J)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_K = (`COMP_K`/`COMP_TOT`) / (sum(study_area_sp$COMP_K)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_L = (`COMP_L`/`COMP_TOT`) / (sum(study_area_sp$COMP_L)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_M = (`COMP_M`/`COMP_TOT`) / (sum(study_area_sp$COMP_M)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_N = (`COMP_N`/`COMP_TOT`) / (sum(study_area_sp$COMP_N)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_O = (`COMP_O`/`COMP_TOT`) / (sum(study_area_sp$COMP_O)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_P = (`COMP_P`/`COMP_TOT`) / (sum(study_area_sp$COMP_P)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_Q = (`COMP_Q`/`COMP_TOT`) / (sum(study_area_sp$COMP_Q)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_R = (`COMP_R`/`COMP_TOT`) / (sum(study_area_sp$COMP_R)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_S = (`COMP_S`/`COMP_TOT`) / (sum(study_area_sp$COMP_S)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_T = (`COMP_T`/`COMP_TOT`) / (sum(study_area_sp$COMP_T)/sum(study_area_sp$COMP_TOT))) %>%
mutate(ind_U = (`COMP_U`/`COMP_TOT`) / (sum(study_area_sp$COMP_U)/sum(study_area_sp$COMP_TOT))) %>%
select(name_muni, ind_A, ind_B, ind_C, ind_D, ind_E, ind_F, ind_G, ind_H, ind_I, ind_J, ind_K, ind_L, ind_M, ind_N, ind_O, ind_P, ind_Q, ind_R, ind_S, ind_U) # ind_T is not carried forward
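For reference, the repetitive chain above can be written more compactly with dplyr's across(). The sketch below is an untested equivalent that assumes the COMP_A to COMP_U columns are all present in study_area_sp; the object name study_area_sp_lq2016_alt is illustrative only.
# Sketch: compute all location quotients with across() instead of 21 separate mutate() calls
lq_letters <- LETTERS[1:21]  # industry sections A ... U
study_area_sp_lq2016_alt <- study_area_sp %>%
  mutate(across(all_of(paste0("COMP_", lq_letters)),
                ~ (.x / COMP_TOT) / (sum(.x) / sum(COMP_TOT)),
                .names = "ind_{substr(.col, 6, 6)}")) %>%
  select(name_muni, all_of(paste0("ind_", setdiff(lq_letters, "T"))))  # ind_T dropped, as above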
Create a vector of the column names to use for plotting, then create a choropleth map for each column and store the maps in chor_list.
chor_list <- list()
column_n <- c("ind_A","ind_B","ind_C","ind_D","ind_E","ind_F","ind_G","ind_H","ind_I","ind_J","ind_K","ind_L","ind_M","ind_N","ind_O","ind_P","ind_Q","ind_R","ind_S","ind_U")
for (name in column_n) {
new_choro <- qtm(study_area_sp_lq2016, fill = name, fill.palette = "Blues")
chor_list[[name]] <- new_choro
}
Plot the choropleth maps.
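The 20 maps stored in chor_list can then be laid out in a grid. Below is a minimal sketch using tmap_arrange(); the number of columns is an assumption and the original layout may differ.
# Arrange the stored choropleth maps in a grid for comparison
tmap_arrange(chor_list, ncol = 4)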
From the choropleth maps above, we can identify the industries that are well distributed across the São Paulo Macrometropolis.
Convert the sf object to a data frame, replace the NA values with 0, and keep only the LQ columns; then compute the correlation matrix of the 20 variables.
study_area_sp_c <- as.data.frame(study_area_sp_lq2016)%>%
mutate_all(~replace(., is.na(.), 0)) %>%
select(name_muni,contains("ind"))
study_area_sp_cor <- cor(study_area_sp_c[2:21])
Plot the correlation matrix.
corrplot.mixed(study_area_sp_cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black",
number.cex= 0.6,
tl.offset = 0.4)
The correlation plot above shows that none of the variables are highly correlated: the highest correlation, between industry K and industry M, is only 0.61. Therefore, none of the variables need to be removed before the cluster analysis.
# Use the municipality names as row names (made unique to avoid duplicates)
row.names(study_area_sp_c) <- make.unique(study_area_sp_c$name_muni)
# Keep only the 20 LQ columns for clustering
cls <- select(study_area_sp_c, c(2:21))
# Min-max (normalize) and z-score (scale) standardised versions of the data
cls.std <- normalize(cls)
cls.z <- scale(cls)
# Compute the euclidean distance matrix and run hierarchical clustering with Ward's method
proxmat <- dist(cls, method = 'euclidean')
hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.3)
Select the optimal hierarchical clustering algorithm by comparing agglomerative coefficients.
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(cls, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.9563973 0.9510675 0.9583889 0.9729391
Ward's method gives the highest agglomerative coefficient (about 0.97), so it is used for the hierarchical clustering. Next, use the clusGap() function to calculate a goodness-of-clustering measure, the "gap" statistic, and apply the firstSEmax method to determine the optimal number of clusters.
set.seed(12345)
gap_stat <- clusGap(cls, FUN = hcut, K.max = 10, B = 50)
print(gap_stat, method = "firstSEmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = cls, FUNcluster = hcut, K.max = 10, B = 50)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstSEmax', SE.factor=1): 3
## logW E.logW gap SE.sim
## [1,] 6.833378 8.114682 1.281303 0.01794967
## [2,] 6.563896 7.937530 1.373634 0.02067457
## [3,] 6.424661 7.839737 1.415076 0.01878824
## [4,] 6.341358 7.753330 1.411972 0.01951777
## [5,] 6.292025 7.678384 1.386359 0.01878100
## [6,] 6.125757 7.616993 1.491236 0.01788969
## [7,] 6.072069 7.563659 1.491590 0.01855217
## [8,] 5.993981 7.515791 1.521810 0.01823686
## [9,] 5.901331 7.474077 1.572746 0.01793577
## [10,] 5.849331 7.434625 1.585294 0.01704356
Visualize the gap_stat object.
fviz_gap_stat(gap_stat)
The gap statistic plot shows the statistic by number of clusters (k), with standard errors drawn as vertical segments and the optimal value of k marked by a vertical dashed blue line. According to the graph above, k = 3 is the optimal number of clusters for the data.
Transform the dataframe into a matrix.
cls_mat <- data.matrix(cls)
Plot an interactive cluster heatmap using heatmaply().
heatmaply(normalize(cls_mat),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 3,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="Geographic Segmentation of São Paulo by Industry Indicators",
xlab = "Industry Indicators",
ylab = "Municipalities in São Paulo"
)
Use cutree() from base R to derive a 3-cluster model.
clusters <- as.factor(cutree(hclust_ward, k=3))
Convert the clusters object into a matrix and append it to the study_area_sp_lq2016 object to produce an output simple feature object called sp_cluster.
sp_cluster <- cbind(study_area_sp_lq2016, as.matrix(clusters)) %>%
  rename(`CLUSTER`=`as.matrix.clusters.`)
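As an optional check (not part of the original workflow), the cluster sizes can be tabulated before mapping:
# Count the municipalities assigned to each of the 3 hierarchical clusters
table(sp_cluster$CLUSTER)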
Plot the map.
qtm(sp_cluster, "CLUSTER")
The choropleth map above shows that the clusters are fragmented. This is one of the main limitations of non-spatial clustering algorithms such as hierarchical cluster analysis. The next section uses a spatially constrained clustering method to overcome this limitation.
Convert study_area_sp_lq2016 into a SpatialPolygonsDataFrame.
study_area_sp_lq2016_sp<- as_Spatial(study_area_sp_lq2016)
Set snap to 0.02 because one region is not linked to any other region; increasing the snap tolerance ensures that every region has at least one neighbour.
brazil_sp_nb <- poly2nb(study_area_sp_lq2016_sp, snap =0.02)
summary(brazil_sp_nb)
## Neighbour list object:
## Number of regions: 174
## Number of nonzero links: 984
## Percentage nonzero weights: 3.250099
## Average number of links: 5.655172
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 12 13 27
## 4 8 13 35 34 31 18 13 7 6 3 1 1
## 4 least connected regions:
## 8 9 36 100 with 1 link
## 1 most connected region:
## 161 with 27 links
The average number of links is about 6. Four regions have only 1 link, while region 161 has 27 links, the highest number.
Plot the boundary of the map, then add the links computed above using the coordinates from the study_area_sp_lq2016_sp object.
plot(study_area_sp_lq2016_sp, border=grey(.5), main = "São Paulo Neighbours")
plot(brazil_sp_nb, coordinates(study_area_sp_lq2016_sp), col="blue", add=TRUE)
Use the nbcosts() function from the spdep package to calculate edge costs. Edge costs represent the distance between the nodes.
lcosts <- nbcosts(brazil_sp_nb, cls)
Use the nb2listw() function from the spdep package to incorporate these costs into a spatial weights object, converting the neighbour list into a list-weights object with the just-computed lcosts as the weights. The style argument is set to "B" to ensure the cost values are not row-standardised.
sp.w <- nb2listw(brazil_sp_nb, lcosts, style="B")
summary(sp.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 174
## Number of nonzero links: 984
## Percentage nonzero weights: 3.250099
## Average number of links: 5.655172
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 12 13 27
## 4 8 13 35 34 31 18 13 7 6 3 1 1
## 4 least connected regions:
## 8 9 36 100 with 1 link
## 1 most connected region:
## 161 with 27 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 174 30276 14920.08 1064908 9237949
Use the mstree() function from the spdep package to compute the minimum spanning tree.
sp.mst <- mstree(sp.w)
Plot the boundary of the map, then add the minimum spanning tree computed above using the coordinates from the study_area_sp_lq2016_sp object.
plot(study_area_sp_lq2016_sp, border=gray(.5))
plot.mst(sp.mst, coordinates(study_area_sp_lq2016_sp),
col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)
Use the skater() function from the spdep package to compute spatially constrained clusters. The skater() function takes the first two columns of the minimum spanning tree (the edge list), the data (here the LQ values in cls), the distance method, and the number of cuts.
Note: the number of cuts is set to [number of clusters - 1]; the value specified is not the number of clusters, but the number of cuts in the graph, one less than the number of clusters.
clust7 <- skater(sp.mst[,1:2], cls, method = "euclidean", 2)
str(clust7)
## List of 8
## $ groups : num [1:174] 1 1 1 1 1 1 1 1 1 1 ...
## $ edges.groups:List of 3
## ..$ :List of 3
## .. ..$ node: num [1:166] 122 92 112 64 85 105 152 11 76 25 ...
## .. ..$ edge: num [1:165, 1:3] 112 64 85 11 76 25 152 122 57 92 ...
## .. ..$ ssw : num 2105
## ..$ :List of 3
## .. ..$ node: num [1:7] 97 118 120 88 103 107 90
## .. ..$ edge: num [1:6, 1:3] 118 97 97 120 118 120 88 118 120 103 ...
## .. ..$ ssw : num 154
## ..$ :List of 3
## .. ..$ node: num 89
## .. ..$ edge: num[0 , 1:3]
## .. ..$ ssw : num 0
## $ not.prune : NULL
## $ candidates : int [1:3] 1 2 3
## $ ssto : num 2664
## $ ssw : num [1:3] 2664 2414 2259
## $ crit : num [1:2] 1 Inf
## $ vec.crit : num [1:174] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "class")= chr "skater"
# Append the skater() cluster assignments to the sf object for mapping
groups_mat <- as.matrix(clust7$groups)
sp_sf_spatialcluster <- cbind(sp_cluster, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(sp_sf_spatialcluster, "SP_CLUSTER")
With spatially constrained clustering, the clusters generated are no longer fragmented: neighbouring municipalities are grouped together, producing more homogeneous, contiguous clusters.
hclust.map <- qtm(sp_cluster,
"CLUSTER") +
tm_layout(title = "Hierarchical Clustering")
shclust.map <- qtm(sp_sf_spatialcluster,
"SP_CLUSTER") +
tm_layout(title = "Spatially Constrained Clustering")
tmap_arrange(hclust.map, shclust.map,
asp=0, ncol=2)
The map produced by the spatially constrained clustering method shows the clusters clearly, and the municipalities within the same cluster have similar industry specialisations.