1 Background

Brazil is the world’s fifth-largest country by area and the sixth most populous. Brazil is classified as an upper-middle income economy by the World Bank and a developing country, with the largest share of global wealth in Latin America. It is considered an advanced emerging economy. It has the ninth largest GDP in the world by nominal, and eighth by PPP measures. Behind all this impressive figures, the spatial development of Brazil is highly unequal as shown in the figure below. In 2016, the GDP per capita of the poorest municipality is R$3,190.6. On the other hand, the GDP per capita of the richest municipality is R$314,638. Half of the municipalities have GDP per capita less than R$16,000 and the top 25% municipalities earn R$26,155 and above.

In recent year, Singapore government has been encouraging local enterprises to venture into international market and one of the highly recommended country is Brazil:

However, in view of the spatial development of Brazil that is highly unequal, any company interested to invest in Brazil must have a good understanding of the geographical specialisation and competitiveness of industry in the country.

2 Objectives

Analyze based on segment São Paulo Macrometropolis at the municipality level into homogeneous industry specialization clusters by using data provided by Brazilian Institute of Geography and Statistics (IBGE).

  • Using choropleth maps to show the distribution of spatial specialisation by industry type, 2016 at municipality level.
  • Delineating industry specialisation clusters by using hierarchical clustering method and display their distribution by using appropriate thematic mapping technique.
  • Delineating industry specialisation clusters by using spatially constrained clustering method and display their distribution by using appropriate thematic mapping technique.

3 Input

3.1 Loading libraries

The code chunk below will check if the R packages in the packaging list have been installed. if not, install the library. After the installation, it will also load the R packages in R.

packages = c('rgdal', 'spdep', 'tmap', 'sf', 'ggpubr', 'cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych', 'tidyverse', 'geobr', 'readr', 'expss', 'DT')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

3.2 Reading data

From Kaggle:

  • BRAZIL_CITIES.csv. This data file consists of 81 columns and 5573 rows. Each row representing one municipality.
  • Data_Dictionary.csv. This file provides meta data of each columns in BRAZIL_CITIES.csv.

From R package geobr:

  • 2016 municipality boundary
  • 2016 metropolitan boundary
brazil = read_delim("../data/aspatial/BRAZIL_CITIES.csv", delim = ";") 
data_dict = read_delim("../data/aspatial/Data_Dictionary.csv", delim = ";") 
muni_sp <- read_municipality(code_muni= "SP",year=2016)
metro <- read_metro_area(year=2016)

Select necessary rows for reading.

data_dict_sub <- data_dict %>%
  select(FIELD, DESCRIPTION) %>%
  filter(str_detect(FIELD, "^COMP"))

Show the table.

datatable(data_dict_sub, options = list(
  searching = FALSE,
  pageLength = 5
))

3.3 Data Cleaning

Checking validity of geospatial data.

count_if(FALSE, st_is_valid(muni_sp))
## [1] 1
count_if(FALSE, st_is_valid(metro))
## [1] 0

Making the geospatial data valid. And check again its validtiy.

muni_sp <- st_make_valid(muni_sp)
count_if(FALSE, st_is_valid(muni_sp))
## [1] 0

3.4 Data Wrangling

3.4.1 Geospatial Data

Extract the São Paulo area by filtering the abbrev_state equals to “SP” and select required columns.

metro_sp <- metro %>%
              filter(abbrev_state == "SP") %>%
              dplyr::select(code_muni, name_muni, code_state, abbrev_state, name_metro, geom)

Plot the map to check all regions from São Paulo Macrometropolis. There are 8 regions from São Paulo Macrometropolis:

  1. Metropolitan Region of São Paulo
  2. Metropolitan Region of Campinas
  3. Metropolitan Region of Vale do Paraíba e Litoral Norte
  4. Metropolitan Region of Sorocaba
  5. Metropolitan Region of Baixada Santista
  6. Piracicaba Urban Agglomeration
  7. Jundiaí Urban Agglomeration
  8. Regional Unit of Bragança Paulista city

However, the region Regional Unit of Bragança Paulista city is missing from the map below.

tm_shape(metro_sp) +
  tm_fill(c("name_metro"))

Filter the municipalities in the region Regional Unit of Bragança Paulista city to new object muni_sp_bp and add the column name Regional Unit of Bragança Paulista city. Municipalities in Regional Unit of Bragança Paulista city:

  • Atibaia
  • Bom Jesus Dos Perdões
  • Bragança Paulista
  • Joanópolis
  • Nazaré Paulista
  • Pedra Bela
  • Pinhalzinho
  • Piracaia
  • Tuiuti
  • Vargem
muni_sp_bp <- muni_sp %>%
  filter(name_muni %in% 
           c('Atibaia',
            'Bom Jesus Dos Perdões',
            'Bragança Paulista',
            'Joanópolis',
            'Nazaré Paulista',  
            'Pedra Bela',
            'Pinhalzinho',
            'Piracaia',
            'Tuiuti',
            'Vargem'))%>%
  mutate(name_metro ="Regional Unit of Bragança Paulista city")

tm_shape(muni_sp_bp) +
  tm_fill(c("name_muni"))

Combine the study area.

study_area <- rbind(metro_sp, muni_sp_bp)
tm_shape(study_area) +
  tm_fill(c("name_metro"))

3.4.2 Aspatial Data

Extract the city and industry related information from the brazil object.

brazil_sp <- brazil %>%
  filter(STATE == "SP") %>%
  select (CITY,contains(c("COMP")))

3.4.3 Join aspatial data with geospatial data

Join the data by the state and the city names. And plot the map to check.

study_area_sp <- study_area %>% left_join(brazil_sp, by = c("name_muni" = "CITY"))
tm_shape(study_area_sp) +
  tm_fill(c("COMP_TOT"))

After plotting the map, we can see there is a value missing. The reason is the naming of the city from two datasets are different, which caused the value missing.

Change the city name same as the name from geospatial data.

brazil_sp$CITY[brazil_sp$CITY == "Santa Bárbara D'Oeste"] <- "Santa Bárbara D'oeste"

Join the data again.

study_area_sp <- study_area %>% left_join(brazil_sp, by = c("name_muni" = "CITY"))
tm_shape(study_area_sp) +
  tm_fill(c("COMP_TOT"))

3.4.4 Calculating Location Quotients (LQ)

Formula

(number of industry a in municipality i / sum of all industry in municipality i)/(number of industry a in Greater Sao Paulo/sum of all industry in Greater Sao Paulo)

Repleced NA values by 0.

study_area_sp[is.na(study_area_sp)] <- 0

Apply the formula to do calculation and assign the result to a new column.

study_area_sp_lq2016 <- study_area_sp %>%
  mutate(ind_A = (`COMP_A`/`COMP_TOT`) / (sum(study_area_sp$COMP_A)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_B = (`COMP_B`/`COMP_TOT`) / (sum(study_area_sp$COMP_B)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_C = (`COMP_C`/`COMP_TOT`) / (sum(study_area_sp$COMP_C)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_D = (`COMP_D`/`COMP_TOT`) / (sum(study_area_sp$COMP_D)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_E = (`COMP_E`/`COMP_TOT`) / (sum(study_area_sp$COMP_E)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_F = (`COMP_F`/`COMP_TOT`) / (sum(study_area_sp$COMP_F)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_G = (`COMP_G`/`COMP_TOT`) / (sum(study_area_sp$COMP_G)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_H = (`COMP_H`/`COMP_TOT`) / (sum(study_area_sp$COMP_H)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_I = (`COMP_I`/`COMP_TOT`) / (sum(study_area_sp$COMP_I)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_J = (`COMP_J`/`COMP_TOT`) / (sum(study_area_sp$COMP_J)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_K = (`COMP_K`/`COMP_TOT`) / (sum(study_area_sp$COMP_K)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_L = (`COMP_L`/`COMP_TOT`) / (sum(study_area_sp$COMP_L)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_M = (`COMP_M`/`COMP_TOT`) / (sum(study_area_sp$COMP_M)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_N = (`COMP_N`/`COMP_TOT`) / (sum(study_area_sp$COMP_N)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_O = (`COMP_O`/`COMP_TOT`) / (sum(study_area_sp$COMP_O)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_P = (`COMP_P`/`COMP_TOT`) / (sum(study_area_sp$COMP_P)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_Q = (`COMP_Q`/`COMP_TOT`) / (sum(study_area_sp$COMP_Q)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_R = (`COMP_R`/`COMP_TOT`) / (sum(study_area_sp$COMP_R)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_S = (`COMP_S`/`COMP_TOT`) / (sum(study_area_sp$COMP_S)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_T = (`COMP_T`/`COMP_TOT`) / (sum(study_area_sp$COMP_T)/sum(study_area_sp$COMP_TOT))) %>%
  mutate(ind_U = (`COMP_U`/`COMP_TOT`) / (sum(study_area_sp$COMP_U)/sum(study_area_sp$COMP_TOT))) %>%
  select(name_muni, ind_A,ind_B,ind_C,ind_D,ind_E,ind_F,ind_G,ind_H,ind_I,ind_J,ind_K,ind_L,ind_M,ind_N,ind_O,ind_P,ind_Q,ind_R,ind_S,ind_U,)

4 Analysis

4.1 Choropleth maps to show the distribution of spatial specialisation by industry type, 2016 at municipality level

Create a list contains the columns’ name we want to use for plotting the map. And create choropleth maps and store them into chor_list.

chor_list <- list()
column_n <- c("ind_A","ind_B","ind_C","ind_D","ind_E","ind_F","ind_G","ind_H","ind_I","ind_J","ind_K","ind_L","ind_M","ind_N","ind_O","ind_P","ind_Q","ind_R","ind_S","ind_U")
for (name in column_n) {
  new_choro <- qtm(study_area_sp_lq2016, fill = name, fill.palette = "Blues")
  chor_list[[name]] <- new_choro
}

Plot the choropleth maps.

From the choropleth maps above, we can find the industries below are well distributed in the São Paulo Macrometropolis:

  • Industry F: Construction
  • Industry L: Accommodation and food
  • Industry Q: Human health and social services
  • Industry R: Arts, culture, sport and recreation

4.1.1 Correlation Analysis

Replace NA values with 0.

study_area_sp_c <- as.data.frame(study_area_sp_lq2016)%>% 
    mutate_all(~replace(., is.na(.), 0)) %>%
    select(name_muni,contains("ind"))
study_area_sp_cor <- cor(study_area_sp_c[2:21])

Plot the correlation plot.

corrplot.mixed(study_area_sp_cor,
   lower = "ellipse", 
   upper = "number",
   tl.pos = "lt",
   diag = "l",
   tl.col = "black",
   number.cex= 0.6,
   tl.offset = 0.4)

The correlation plot above shows that there are highly correlated variables. The highest correlation is between industry K and industry M with value 0.61. Therefore, none of the variable need be removed from the cluster analysis.

4.2 Hierarchical clustering

  • To make the name_muni as the row name for plottinh the hierarchy cluster analysis and also to extract the clustering variables from the study_area_sp_c
  • Z-score standardization
  • Standardise values by using normalize() function from package heatmaply for min-max standardization
  • Compute the proximity matrix
row.names(study_area_sp_c) <- make.unique(study_area_sp_c$name_muni)
cls <- select(study_area_sp_c, c(2:21))
cls.std<- normalize(cls)
cls.z <- scale(cls)
proxmat <- dist(cls, method = 'euclidean')

4.2.1 Compute hierarchical clustering

hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.3)

Select optimal clustering algorithm.

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
  agnes(cls, method = x)$ac
}
map_dbl(m, ac)
##   average    single  complete      ward 
## 0.9563973 0.9510675 0.9583889 0.9729391

4.2.2 Determine optimal number of clusters

Use clusGap() function to calculates a goodness of clustering measure, the “gap” statistic. And we are using firstSEmax method to determine the optimal number of clusters.

set.seed(12345)
gap_stat <- clusGap(cls, FUN = hcut, K.max = 10, B = 50)
print(gap_stat, method = "firstSEmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = cls, FUNcluster = hcut, K.max = 10, B = 50)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstSEmax', SE.factor=1): 3
##           logW   E.logW      gap     SE.sim
##  [1,] 6.833378 8.114682 1.281303 0.01794967
##  [2,] 6.563896 7.937530 1.373634 0.02067457
##  [3,] 6.424661 7.839737 1.415076 0.01878824
##  [4,] 6.341358 7.753330 1.411972 0.01951777
##  [5,] 6.292025 7.678384 1.386359 0.01878100
##  [6,] 6.125757 7.616993 1.491236 0.01788969
##  [7,] 6.072069 7.563659 1.491590 0.01855217
##  [8,] 5.993981 7.515791 1.521810 0.01823686
##  [9,] 5.901331 7.474077 1.572746 0.01793577
## [10,] 5.849331 7.434625 1.585294 0.01704356

Visualize the gap_stat object.

fviz_gap_stat(gap_stat)

The gap stats plot shows the statistics by number of clusters (k) with standard errors drawn with vertical segments and the optimal value of k marked with a vertical dashed blue line. According to the graph above k = 3 is the optimal number of clusters in the data.

4.2.3 Visually-driven hierarchical clustering analysis

Transform the dataframe into a matrix.

cls_mat <- data.matrix(cls)

Plot interactive cluster heatmap using heatmaply().

heatmaply(normalize(cls_mat),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 3,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of São Paulo by Industry Indicators",
          xlab = "Industry Indicators",
          ylab = "Municipalities in São Paulo"
          )

4.2.4 Mapping the clusters formed

Use cutree() of R Base to derive a 18-cluster model.

clusters <- as.factor(cutree(hclust_ward, k=3))
  • Convert clusters object into a matrix.
  • Use cbind() function to append cluster matrix onto study_area_sp_lq2016 object to produce an output simple feature object called sp_clusters.
  • Rename as.matrix.groups field to CLUSTER.
sp_cluster <- cbind(study_area_sp_lq2016, as.matrix(clusters)) %>%
  rename(`CLUSTER`=`as.matrix.clusters.`)

Plot the map.

qtm(sp_cluster, "CLUSTER")

The choropleth map above shows the clusters are fragmented. The is one of the main limitation when non-spatial clustering algorithm such as hierarchical cluster analysis method is used. And the next section will be using spatially constrained clustering method to overcome this limitation.

4.3 Spatially constrained clustering

4.3.1 Converting into SpatialPolygonsDataFrame

Convert study_area_sp_lq2016 into SpatialPolygonsDataFrame

study_area_sp_lq2016_sp<- as_Spatial(study_area_sp_lq2016)

4.3.2 Computing Neighbour List

Set the snap to 0.02 since there is 1 region is not linked to any of the region. Thus, to make it at least 1 neighbour snap will be set to 0.02

brazil_sp_nb <- poly2nb(study_area_sp_lq2016_sp, snap =0.02)
summary(brazil_sp_nb)
## Neighbour list object:
## Number of regions: 174 
## Number of nonzero links: 984 
## Percentage nonzero weights: 3.250099 
## Average number of links: 5.655172 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 12 13 27 
##  4  8 13 35 34 31 18 13  7  6  3  1  1 
## 4 least connected regions:
## 8 9 36 100 with 1 link
## 1 most connected region:
## 161 with 27 links

The average number of links is around 6. And 4 regions with 1 link. Region 161 has 27 links which is the highest number of links.

Plot the boundary of the map. And add the links above using the coordinates from the study_area_sp_lq2016_sp object.

plot(study_area_sp_lq2016_sp, border=grey(.5), main = "São Paulo Neighbours")
plot(brazil_sp_nb, coordinates(study_area_sp_lq2016_sp), col="blue", add=TRUE)

4.3.3 Minimum Spanning Tree

4.3.3.1 Calculate edge costs

Use nbcosts() function from spdep package to calculate edge costs. Edges costs represent the distance between the nodes.

lcosts <- nbcosts(brazil_sp_nb, cls)

Use nb2listw() function from spdep package to incorporate these costs into a weights object by converting the neighbour list to a list weights object by specifying the just computed lcosts as the weights. And style parameter as B to ensure the cost values are not row-standardised.

sp.w <- nb2listw(brazil_sp_nb, lcosts, style="B")
summary(sp.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 174 
## Number of nonzero links: 984 
## Percentage nonzero weights: 3.250099 
## Average number of links: 5.655172 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 12 13 27 
##  4  8 13 35 34 31 18 13  7  6  3  1  1 
## 4 least connected regions:
## 8 9 36 100 with 1 link
## 1 most connected region:
## 161 with 27 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn       S0      S1      S2
## B 174 30276 14920.08 1064908 9237949

4.3.4 Compute minimum spanning tree

Use mstree() function from spdep package to calculate minimum spanning tree. And plot the

sp.mst <- mstree(sp.w)

Plot the boundary of the map. And add the calculation results above using the coordinates from the study_area_sp_lq2016_sp object.

plot(study_area_sp_lq2016_sp, border=gray(.5))
plot.mst(sp.mst, coordinates(study_area_sp_lq2016_sp), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)

4.3.5 Compute spatially constrained clusters using SKATER method

Use skater() function from spdep package to compute spatially constrained clusters. Variables required by skater() function:

  • The first two columns of the MST matrix
  • The data matrix (to update the costs as units are being grouped)
  • The number of cuts = 2

Note: Set to [cluster number - 1]. So, the value specified is not the number of clusters, but the number of cuts in the graph, one less than the number of clusters.]

clust7 <- skater(sp.mst[,1:2], cls, method = "euclidean", 2)
str(clust7)
## List of 8
##  $ groups      : num [1:174] 1 1 1 1 1 1 1 1 1 1 ...
##  $ edges.groups:List of 3
##   ..$ :List of 3
##   .. ..$ node: num [1:166] 122 92 112 64 85 105 152 11 76 25 ...
##   .. ..$ edge: num [1:165, 1:3] 112 64 85 11 76 25 152 122 57 92 ...
##   .. ..$ ssw : num 2105
##   ..$ :List of 3
##   .. ..$ node: num [1:7] 97 118 120 88 103 107 90
##   .. ..$ edge: num [1:6, 1:3] 118 97 97 120 118 120 88 118 120 103 ...
##   .. ..$ ssw : num 154
##   ..$ :List of 3
##   .. ..$ node: num 89
##   .. ..$ edge: num[0 , 1:3] 
##   .. ..$ ssw : num 0
##  $ not.prune   : NULL
##  $ candidates  : int [1:3] 1 2 3
##  $ ssto        : num 2664
##  $ ssw         : num [1:3] 2664 2414 2259
##  $ crit        : num [1:2] 1 Inf
##  $ vec.crit    : num [1:174] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "class")= chr "skater"

4.3.6 Visualize clusters in chrolopleth map

groups_mat <- as.matrix(clust7$groups)
sp_sf_spatialcluster <- cbind(sp_cluster, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(sp_sf_spatialcluster, "SP_CLUSTER")

By using spatially constrained clustering, the clusters generated are not fragmented and grouping together, producing more homogeneity.

4.4 Hierarchical Clustering VS Spatially Constrained Clustering

hclust.map <- qtm(sp_cluster,
                  "CLUSTER") + 
  tm_layout(title = "Hierarchical Clustering")

shclust.map <- qtm(sp_sf_spatialcluster,
                   "SP_CLUSTER") + 
  tm_layout(title = "Spatially Constrained Clustering")

tmap_arrange(hclust.map, shclust.map,
             asp=0, ncol=2)

From the spatially constrained clustering method, we can see the clusters clearly from the map above. And the cities in the same clusters shows their industry specialisation are similar.