1 Introduction


This analysis will segment São Paulo Macrometropolis at the municipality level into homogeneous industry specialisation clusters. Such geographic segmentation will be important for businesses investing in Brazil, to gain a better understanding of the geographical specialisation and competitiveness of industry in the country.

The study area in this analysis will be the São Paulo Macrometropolis, consisting of the following divisions:

  • Metropolitan Region of São Paulo
  • Metropolitan Region of Campinas
  • Metropolitan Region of Vale do Paraíba e Litoral Norte
  • Metropolitan Region of Sorocaba
  • Metropolitan Region of Baixada Santista
  • Piracicaba Urban Agglomeration
  • Jundiaí Urban Agglomeration
  • Regional Unit of Bragança Paulista city


Source: Wikipedia


2 Data


The following datasets will be utilised in this analysis.


Data Description Format
Brazil Cities Data on municipalities in Brazil, including information on population, economic indexes and industry. From kaggle CSV
Brazil Municipalities 2016 municipality boundary in Brazil, retrieved using geobr package Spatial file
Brazil Metropolitan Areas 2018 (latest) metropolitan boundary in Brazil, retrieved using geobr package Spatial file


3 Install packages


The following R packages will be used in the analysis:

  • tidyverse: data import and manipulation
  • sf: spatial data manipulation
  • tmap: spatial mapping
  • geobr: to download official spatial data sets of Brazil
packages = c('tidyverse', 'tmap', 'sf', 'geobr', 'spdep', 'cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych', 'plotly')
for (p in packages) {
  if (!require(p, character.only = T)) {
    install.packages(p)
  }
  library(p, character.only = T)
}


4 Data import


4.1 Brazil cities


Import

cities <- read_delim('data/aspatial/BRAZIL_CITIES.csv', delim = ';')
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   CITY = col_character(),
##   STATE = col_character(),
##   AREA = col_number(),
##   REGIAO_TUR = col_character(),
##   CATEGORIA_TUR = col_character(),
##   RURAL_URBAN = col_character(),
##   GVA_MAIN = col_character()
## )
## i Use `spec()` for the full column specifications.


View content

glimpse(cities)
## Rows: 5,573
## Columns: 81
## $ CITY                     <chr> "Abadia De Goiás", "Abadia Dos Dourados", ...
## $ STATE                    <chr> "GO", "MG", "GO", "MG", "PA", "CE", "BA", ...
## $ CAPITAL                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ IBGE_RES_POP             <dbl> 6876, 6704, 15757, 22690, 141100, 10496, 8...
## $ IBGE_RES_POP_BRAS        <dbl> 6876, 6704, 15609, 22690, 141040, 10496, 8...
## $ IBGE_RES_POP_ESTR        <dbl> 0, 0, 148, 0, 60, 0, 0, 0, 0, 0, 0, 16, 17...
## $ IBGE_DU                  <dbl> 2137, 2328, 4655, 7694, 31061, 2791, 2572,...
## $ IBGE_DU_URBAN            <dbl> 1546, 1481, 3233, 6667, 19057, 1251, 1193,...
## $ IBGE_DU_RURAL            <dbl> 591, 847, 1422, 1027, 12004, 1540, 1379, 1...
## $ IBGE_POP                 <dbl> 5300, 4154, 10656, 18464, 82956, 4538, 372...
## $ IBGE_1                   <dbl> 69, 38, 139, 176, 1354, 98, 37, 167, 69, 1...
## $ `IBGE_1-4`               <dbl> 318, 207, 650, 856, 5567, 323, 156, 733, 3...
## $ `IBGE_5-9`               <dbl> 438, 260, 894, 1233, 7618, 421, 263, 978, ...
## $ `IBGE_10-14`             <dbl> 517, 351, 1087, 1539, 8905, 483, 277, 927,...
## $ `IBGE_15-59`             <dbl> 3542, 2709, 6896, 11979, 53516, 2631, 2319...
## $ `IBGE_60+`               <dbl> 416, 589, 990, 2681, 5996, 582, 673, 803, ...
## $ IBGE_PLANTED_AREA        <dbl> 319, 4479, 10307, 1862, 25200, 2598, 895, ...
## $ `IBGE_CROP_PRODUCTION_$` <dbl> 1843, 18017, 33085, 7502, 700872, 5234, 39...
## $ `IDHM Ranking 2010`      <dbl> 1689, 2207, 2202, 1994, 3530, 3522, 4086, ...
## $ IDHM                     <dbl> 0.708, 0.690, 0.690, 0.698, 0.628, 0.628, ...
## $ IDHM_Renda               <dbl> 0.687, 0.693, 0.671, 0.720, 0.579, 0.540, ...
## $ IDHM_Longevidade         <dbl> 0.830, 0.839, 0.841, 0.848, 0.798, 0.748, ...
## $ IDHM_Educacao            <dbl> 0.622, 0.563, 0.579, 0.556, 0.537, 0.612, ...
## $ LONG                     <dbl> -49.44055, -47.39683, -48.71881, -45.44619...
## $ LAT                      <dbl> -16.758812, -18.487565, -16.182672, -19.15...
## $ ALT                      <dbl> 893.60, 753.12, 1017.55, 644.74, 10.12, 40...
## $ PAY_TV                   <dbl> 360, 77, 227, 1230, 3389, 29, 952, 51, 55,...
## $ FIXED_PHONES             <dbl> 842, 296, 720, 1716, 1218, 34, 335, 222, 3...
## $ AREA                     <dbl> 147.26, 881.06, 1045.13, 1817.07, 1610.65,...
## $ REGIAO_TUR               <chr> NA, "Caminhos Do Cerrado", "Região Turísti...
## $ CATEGORIA_TUR            <chr> NA, "D", "C", "D", "D", NA, "D", NA, NA, "...
## $ ESTIMATED_POP            <dbl> 8583, 6972, 19614, 23223, 156292, 11663, 8...
## $ RURAL_URBAN              <chr> "Urbano", "Rural Adjacente", "Rural Adjace...
## $ GVA_AGROPEC              <dbl> 6.20, 50524.57, 42.84, 113824.60, 140463.7...
## $ GVA_INDUSTRY             <dbl> 27991.25, 25917.70, 16728.30, 31002.62, 58...
## $ GVA_SERVICES             <dbl> 74750.32, 62689.23, 138198.58, 172.33, 468...
## $ GVA_PUBLIC               <dbl> 36915.04, 28083.79, 63396.20, 86081.41, 48...
## $ ` GVA_TOTAL `            <dbl> 145857.60, 167215.28, 261161.91, 403241.27...
## $ TAXES                    <dbl> 20554.20, 12873.50, 26822.58, 26994.09, 95...
## $ GDP                      <dbl> 166.41, 180.09, 287984.49, 430235.36, 1249...
## $ POP_GDP                  <dbl> 8053, 7037, 18427, 23574, 151934, 11483, 9...
## $ GDP_CAPITA               <dbl> 20664.57, 25591.70, 15628.40, 18250.42, 82...
## $ GVA_MAIN                 <chr> "Demais serviços", "Demais serviços", "Dem...
## $ MUN_EXPENDIT             <dbl> 28227691, 17909274, 37513019, NA, NA, NA, ...
## $ COMP_TOT                 <dbl> 284, 476, 288, 621, 931, 86, 191, 87, 285,...
## $ COMP_A                   <dbl> 5, 6, 5, 18, 4, 1, 6, 2, 5, 2, 0, 8, 3, 1,...
## $ COMP_B                   <dbl> 1, 6, 9, 1, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, ...
## $ COMP_C                   <dbl> 56, 30, 26, 40, 43, 4, 8, 3, 20, 4, 9, 40,...
## $ COMP_D                   <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
## $ COMP_E                   <dbl> 2, 2, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 2, ...
## $ COMP_F                   <dbl> 29, 34, 7, 20, 27, 6, 4, 0, 10, 2, 0, 25, ...
## $ COMP_G                   <dbl> 110, 190, 117, 303, 500, 48, 97, 71, 133, ...
## $ COMP_H                   <dbl> 26, 70, 12, 62, 16, 2, 5, 0, 18, 8, 1, 67,...
## $ COMP_I                   <dbl> 4, 28, 57, 30, 31, 10, 5, 1, 14, 3, 0, 25,...
## $ COMP_J                   <dbl> 5, 11, 2, 9, 6, 2, 3, 1, 8, 1, 1, 9, 5, 14...
## $ COMP_K                   <dbl> 0, 0, 1, 6, 1, 0, 1, 0, 0, 1, 0, 4, 3, 3, ...
## $ COMP_L                   <dbl> 2, 4, 0, 4, 1, 0, 0, 0, 4, 0, 0, 7, 4, 4, ...
## $ COMP_M                   <dbl> 10, 15, 7, 28, 22, 2, 5, 0, 11, 4, 2, 26, ...
## $ COMP_N                   <dbl> 12, 29, 15, 27, 16, 3, 5, 1, 26, 0, 1, 16,...
## $ COMP_O                   <dbl> 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 6, 2, 4, 2, ...
## $ COMP_P                   <dbl> 6, 9, 11, 15, 155, 0, 8, 0, 8, 1, 6, 14, 1...
## $ COMP_Q                   <dbl> 6, 14, 5, 19, 33, 2, 1, 2, 9, 3, 0, 13, 22...
## $ COMP_R                   <dbl> 1, 6, 1, 9, 15, 0, 2, 0, 4, 0, 0, 4, 6, 6,...
## $ COMP_S                   <dbl> 5, 19, 8, 27, 56, 4, 38, 4, 12, 3, 4, 23, ...
## $ COMP_T                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COMP_U                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ HOTELS                   <dbl> NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, ...
## $ BEDS                     <dbl> NA, NA, 34, NA, NA, NA, 24, NA, NA, NA, NA...
## $ Pr_Agencies              <dbl> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0,...
## $ Pu_Agencies              <dbl> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1,...
## $ Pr_Bank                  <dbl> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0,...
## $ Pu_Bank                  <dbl> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1,...
## $ Pr_Assets                <dbl> NA, NA, 33724584, 44974716, 76181384, NA, ...
## $ Pu_Assets                <dbl> NA, NA, 67091904, 371922572, 800078483, NA...
## $ Cars                     <dbl> 2158, 2227, 2838, 6928, 5277, 553, 896, 61...
## $ Motorcycles              <dbl> 1246, 1142, 1426, 2953, 25661, 1674, 696, ...
## $ Wheeled_tractor          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, ...
## $ UBER                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ MAC                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `WAL-MART`               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ POST_OFFICES             <dbl> 1, 1, 3, 4, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, ...


4.2 Brazil municipalities


Import

muni <- read_municipality(year=2016, showProgress = FALSE)


View content

glimpse(muni)
## Rows: 5,572
## Columns: 5
## $ code_muni    <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100064, ...
## $ name_muni    <chr> "Alta Floresta D'oeste", "Ariquemes", "Cabixi", "Cacoa...
## $ code_state   <chr> "11", "11", "11", "11", "11", "11", "11", "11", "11", ...
## $ abbrev_state <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", ...
## $ geom         <MULTIPOLYGON [°]> MULTIPOLYGON (((-62.19465 -..., MULTIPOLY...

Spatial plot

qtm(muni)


4.3 Brazil metropolitan areas


Import

metro <- read_metro_area(year = 2016, showProgress = FALSE)


View content

glimpse(metro)
## Rows: 1,331
## Columns: 10
## $ code_muni        <dbl> 2700300, 2701506, 2702009, 2702355, 2702603, 27029...
## $ name_muni        <chr> "Arapiraca", "Campo Grande", "Coité Do Nóia", "Cra...
## $ code_state       <chr> "27", "27", "27", "27", "27", "27", "27", "27", "2...
## $ abbrev_state     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A...
## $ name_metro       <chr> "RM Agreste", "RM Agreste", "RM Agreste", "RM Agre...
## $ type             <chr> "RM", "RM", "RM", "RM", "RM", "RM", "RM", "RM", "R...
## $ subdivision      <chr> "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO T...
## $ legislation      <chr> "Lei Complementar 27", "Lei Complementar 27", "Lei...
## $ legislation_date <chr> "30.11.2009", "30.11.2009", "30.11.2009", "30.11.2...
## $ geom             <MULTIPOLYGON [°]> MULTIPOLYGON (((-36.66005 -..., MULTI...


Spatial plot

qtm(metro)


5 Data extraction


The municipalities for the targeted study area, São Paulo Macrometropolis, will be extracted from the data sets in this section.


5.1 Brazil metropolitan areas

The following divisions of São Paulo Macrometropolis will be extracted from the Brazil metropolitan areas data set.

  • Metropolitan Region of São Paulo
  • Metropolitan Region of Campinas
  • Metropolitan Region of Vale do Paraíba e Litoral Norte
  • Metropolitan Region of Sorocaba
  • Metropolitan Region of Baixada Santista
  • Piracicaba Urban Agglomeration
  • Jundiaí Urban Agglomeration


Extract data

# Names for target regions of name_metro field
metro_name <- c('RM São Paulo', 'RM Campinas', 'RM do Vale do Paraíba e Litoral Norte', 'RM de Sorocaba', 'RM Baixada Santista', 'Aglomeração Urbana de Piracicaba-AU- Piracicaba', 'Aglomeração Urbana de Jundiaí')

metro_study <- metro %>%
  filter(name_metro %in% metro_name) %>%
  # Add a column region, labelling the corresponding regions
  mutate(region = case_when(
    name_metro == 'RM São Paulo' ~ 'Metropolitan Region of São Paulo',
    name_metro == 'RM Campinas' ~ 'Metropolitan Region of Campinas',
    name_metro == 'RM do Vale do Paraíba e Litoral Norte' ~ 'Metropolitan Region of Vale do Paraíba e Litoral Norte',
    name_metro == 'RM de Sorocaba' ~ 'Metropolitan Region of Sorocaba',
    name_metro == 'RM Baixada Santista' ~ 'Metropolitan Region of Baixada Santista',
    name_metro == 'Aglomeração Urbana de Piracicaba-AU- Piracicaba' ~ 'Piracicaba Urban Agglomeration',
    name_metro == 'Aglomeração Urbana de Jundiaí' ~ 'Jundiaí Urban Agglomeration'
  ))


View content

glimpse(metro_study)
## Rows: 164
## Columns: 11
## $ code_muni        <dbl> 3508405, 3509601, 3524006, 3525201, 3525904, 35273...
## $ name_muni        <chr> "Cabreúva", "Campo Limpo Paulista", "Itupeva", "Ja...
## $ code_state       <chr> "35", "35", "35", "35", "35", "35", "35", "35", "3...
## $ abbrev_state     <chr> "SP", "SP", "SP", "SP", "SP", "SP", "SP", "SP", "S...
## $ name_metro       <chr> "Aglomeração Urbana de Jundiaí", "Aglomeração Urba...
## $ type             <chr> "AGLO", "AGLO", "AGLO", "AGLO", "AGLO", "AGLO", "A...
## $ subdivision      <chr> "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO T...
## $ legislation      <chr> "Lei Complementar 1.146", "Lei Complementar 1.146"...
## $ legislation_date <chr> "24.08.2011", "24.08.2011", "24.08.2011", "24.08.2...
## $ geom             <MULTIPOLYGON [°]> MULTIPOLYGON (((-47.09692 -..., MULTI...
## $ region           <chr> "Jundiaí Urban Agglomeration", "Jundiaí Urban Aggl...


Spatial plot

qtm(metro_study)


5.2 Brazil municipalities


The Regional Unit of Bragança Paulista city division of São Paulo Macrometropolis will be extracted from the Brazil municipalities data set. The Regional Unit of Bragança Paulista city division consists of the following municipalities with the corresponding municipality codes, which will be extracted.

No. Municipality Code
1 Atibaia 3504107
2 Bragança Paulista 3507605
3 Joanópolis 3525508
4 Nazaré Paulista 3532405
5 Pedra Bela 3536802
6 Pinhalzinho 3538204
7 Piracaia 3538600
8 Tuiuti 3554953
9 Vargem 3556354
10 Bom Jesus Dos Perdões 3507100


Extract data

# Codes for target municipalities of code_muni field
muni_codes <- c('3504107', '3507605', '3525508', '3532405', '3536802', '3538204', '3538600', '3554953', '3556354', '3507100')

muni_study <- muni %>%
  filter(code_muni %in% muni_codes) %>%
  # Add a column region, labelling the corresponding regions
  mutate(region = 'Regional Unit of Bragança Paulista city')


View content

glimpse(muni_study)
## Rows: 10
## Columns: 6
## $ code_muni    <dbl> 3504107, 3507100, 3507605, 3525508, 3532405, 3536802, ...
## $ name_muni    <chr> "Atibaia", "Bom Jesus Dos Perdões", "Bragança Paulista...
## $ code_state   <chr> "35", "35", "35", "35", "35", "35", "35", "35", "35", ...
## $ abbrev_state <chr> "SP", "SP", "SP", "SP", "SP", "SP", "SP", "SP", "SP", ...
## $ geom         <MULTIPOLYGON [°]> MULTIPOLYGON (((-46.66146 -..., MULTIPOLY...
## $ region       <chr> "Regional Unit of Bragança Paulista city", "Regional U...


Spatial plot

qtm(muni_study)


5.3 Combine study area


The target areas that were extracted above will be combined to form the São Paulo Macrometropolis study area.

# Add missing columns from metro_study to muni_study, such that data frames can be joined
miss_col <- c('name_metro', 'type', 'subdivision', 'legislation', 'legislation_date')
muni_study <- cbind(muni_study,
                    setNames(lapply(miss_col, function(x) x=NA),
                             miss_col))

# Combine extracted data sets
macrometro <- rbind(metro_study, muni_study)

glimpse(macrometro)
## Rows: 174
## Columns: 11
## $ code_muni        <dbl> 3508405, 3509601, 3524006, 3525201, 3525904, 35273...
## $ name_muni        <chr> "Cabreúva", "Campo Limpo Paulista", "Itupeva", "Ja...
## $ code_state       <chr> "35", "35", "35", "35", "35", "35", "35", "35", "3...
## $ abbrev_state     <chr> "SP", "SP", "SP", "SP", "SP", "SP", "SP", "SP", "S...
## $ name_metro       <chr> "Aglomeração Urbana de Jundiaí", "Aglomeração Urba...
## $ type             <chr> "AGLO", "AGLO", "AGLO", "AGLO", "AGLO", "AGLO", "A...
## $ subdivision      <chr> "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO T...
## $ legislation      <chr> "Lei Complementar 1.146", "Lei Complementar 1.146"...
## $ legislation_date <chr> "24.08.2011", "24.08.2011", "24.08.2011", "24.08.2...
## $ region           <chr> "Jundiaí Urban Agglomeration", "Jundiaí Urban Aggl...
## $ geom             <MULTIPOLYGON [°]> MULTIPOLYGON (((-47.09692 -..., MULTI...


  • A total of 174 municipalities is extracted for São Paulo Macrometropolis, which is correct.


6 Data wrangling


Data will be pre-processed and prepared for the analysis.


6.1 Handle invalid geometry


Ensure that spatial data to be used for analysis has no invalid geometries.

length(which(st_is_valid(macrometro) == FALSE))
## [1] 0


  • There are no invalid geometries in the spatial data for São Paulo Macrometropolis study area.


6.2 Check CRS


The coordinate reference system (CRS) for the spatial data set will be checked and defined accordingly if necessary.

st_crs(macrometro)
## Coordinate Reference System:
##   User input: SIRGAS 2000 
##   wkt:
## GEOGCRS["SIRGAS 2000",
##     DATUM["Sistema de Referencia Geocentrico para las AmericaS 2000",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["Latin America - SIRGAS 2000 by country"],
##         BBOX[-59.87,-122.19,32.72,-25.28]],
##     ID["EPSG",4674]]


  • The spatial data set is already projected in the correct coordinate system, EPSG:4674, which is used for both onshore and offshore Brazil.


6.3 Join attribute data


6.3.1 Join data

The Brazil cities attribute data will be joined with the spatial data set.

brazil <- inner_join(cities, macrometro %>% as.data.frame(), by = c('CITY' = 'name_muni')) %>%
  # Remove municipalities not in São Paulo
  filter(STATE == 'SP')


6.3.2 Data sanity check

nrow(brazil)
## [1] 173


  • It can be noted that upon the joining of the data sets and appropriately filtering to only the São Paulo state, the joined data set contains only 173 municipalities (rows). In fact, there are supposed to be 174 municipalities in the São Paulo Macrometropolis, as per the extracted spatial data set.
  • The code chunk below finds out the missing municipality due to the data join.
# Check difference
setdiff(macrometro$name_muni, brazil$CITY)
## [1] "Santa Bárbara D'oeste"


  • It can be observed that the municipality, Santa Bárbara D’oeste, is present in the spatial data but missing from the joined data set, which could be due to the difference in text capitalisation between the cities attribute data and the spatial data.
  • The code chunk below conducts a case-insensitive search to check if this municipality exists in the Brazil cities attribute data set and returns the corresponding index if found.
# Case-insensitive search
missing_index <- match("santa bárbara d'oeste", tolower(cities$CITY))
missing_index
## [1] 4352
cities[missing_index,]$CITY
## [1] "Santa Bárbara D'Oeste"


  • Upon data checking, the Santa Bárbara D’oeste municipality does exist in the cities data set, but was not included during the data join due to differences in text capitalisation.
  • Hence, this missing row will be manually added to the joined data frame.


Retrieve missing row

missing_row <- subset(macrometro, name_muni == "Santa Bárbara D'oeste") %>%
  select(-name_muni)

missing_row <- cbind(cities[missing_index,], missing_row)


Add missing row to joined data frame

brazil <- rbind(brazil, missing_row)
brazil <- st_as_sf(brazil, sf_column_name = 'geom')

nrow(brazil)
## [1] 174


  • After adding the missing row, the joined data set now has 174 municipalities for São Paulo Macrometropolis, which is correct.


6.3.3 Select relevant fields


As there are 91 fields in the newly joined data set, only fields relevant to the analysis will be selected and extracted. Fields will also be renamed for easier usage of data set in analysis.

# Extract relevant fields only
brazil <- brazil %>%
  select(CITY, IBGE_RES_POP, contains("COMP"), region, geom)

# Rename field names
brazil <- setNames(brazil, c('Municipality',
                             'Population',
                             'Total Municipality Companies',
                             'Agriculture',
                             'Extractive Industries',
                             'Industries of Transformation',
                             'Electricity and Gas',
                             'Water',
                             'Construction',
                             'Trade and Repair of Motor Vehicles',
                             'Transport',
                             'Accommodation and Food',
                             'Information and Communication',
                             'Financial',
                             'Real Estate Activities',
                             'Professional',
                             'Admin Activities and Complementary Services',
                             'Public Administration',
                             'Education',
                             'Human Health and Social Services',
                             'Arts',
                             'Other Service Activities',
                             'Domestic Services',
                             'International and Extraterritorial Institutions',
                             'region',
                             'geom'))


6.3.4 Handle missing values


Ensure that there are no missing values, since it will impact calculations as well as clustering in the later analysis.

length(which(is.na(brazil)))
## [1] 0


  • There are no missing values in the data set.


6.3.5 Final data set


The following is the final data set that will be utilised in this analysis, which contains municipality level information about the number of companies for different industries.

glimpse(brazil)
## Rows: 174
## Columns: 26
## $ Municipality                                      <chr> "Águas De São Ped...
## $ Population                                        <dbl> 2707, 4884, 16839...
## $ `Total Municipality Companies`                    <dbl> 202, 137, 271, 10...
## $ Agriculture                                       <dbl> 2, 61, 2, 22, 36,...
## $ `Extractive Industries`                           <dbl> 1, 0, 0, 1, 3, 3,...
## $ `Industries of Transformation`                    <dbl> 4, 14, 18, 1463, ...
## $ `Electricity and Gas`                             <dbl> 0, 0, 0, 0, 0, 0,...
## $ Water                                             <dbl> 1, 0, 0, 36, 0, 2...
## $ Construction                                      <dbl> 6, 4, 16, 534, 6,...
## $ `Trade and Repair of Motor Vehicles`              <dbl> 99, 31, 118, 3996...
## $ Transport                                         <dbl> 3, 3, 16, 337, 26...
## $ `Accommodation and Food`                          <dbl> 40, 8, 31, 692, 2...
## $ `Information and Communication`                   <dbl> 5, 2, 2, 270, 3, ...
## $ Financial                                         <dbl> 2, 0, 2, 195, 0, ...
## $ `Real Estate Activities`                          <dbl> 1, 3, 1, 352, 4, ...
## $ Professional                                      <dbl> 9, 0, 7, 666, 6, ...
## $ `Admin Activities and Complementary Services`     <dbl> 15, 2, 13, 1079, ...
## $ `Public Administration`                           <dbl> 2, 2, 2, 5, 2, 2,...
## $ Education                                         <dbl> 3, 2, 17, 306, 0,...
## $ `Human Health and Social Services`                <dbl> 3, 0, 3, 306, 2, ...
## $ Arts                                              <dbl> 2, 0, 5, 146, 0, ...
## $ `Other Service Activities`                        <dbl> 4, 5, 18, 380, 3,...
## $ `Domestic Services`                               <dbl> 0, 0, 0, 0, 0, 0,...
## $ `International and Extraterritorial Institutions` <dbl> 0, 0, 0, 0, 0, 0,...
## $ region                                            <chr> "Piracicaba Urban...
## $ geom                                              <MULTIPOLYGON [°]> MULT...


7 Study area


The study area for this analysis, São Paulo Macrometropolis, is plotted below.

tm_shape(brazil) +
  tm_fill('region',
          palette = c('mistyrose3', 'slategray3', 'seashell2', 'azure3', 'bisque3', 'thistle3', 'cornsilk2', 'lightcyan3'),
          title = 'Region') +
  tm_borders(lwd = 0.1, alpha = 1) +
  tm_layout(main.title = 'São Paulo Macrometropolis',
            main.title.position = 'center',
            main.title.size = 1,
            legend.outside = TRUE,
            legend.outside.position = 'bottom',
            frame = FALSE)


  • There are a total of 174 municipalities in São Paulo Macrometropolis.


8 Analysing Industry Spatial Specialisation


In this section, the distribution of spatial specialisation of São Paulo Macrometropolis (abbrev. SPM) will be studied by industry type, at the municipality level. To do this, the industry location quotient (LQ) for each industry type will be computed and visualised spatially through choropleth mapping.


Industry LQ quantifies how ‘concentrated’ an industry is in a region compared to a larger geographic area, such as the state or nation (reference). It can reveal industries that make a particular region ‘unique’ in comparison to the average of the entire geographical area.


The industry LQ for each industry will be computed for each municipality in São Paulo Macrometropolis. The value of the industry LQ can then be interpreted as the relative concentration of the particular industry in the particular municipality, compared to the rest of São Paulo Macrometropolis.

The industry LQ of industry \(z\) for a given municipality is given by,


\[Industry\ LQ = \frac{No.\ of\ industry\ z\ companies\ in\ municipality\ /\ Total\ no.\ of\ companies\ in\ municipality}{No.\ of\ industry\ z\ companies\ in\ SPM\ /\ Total\ no.\ of\ companies\ in\ SPM}\]


8.1 Calculate industry LQ


The industry LQ of each industry for every municipality in São Paulo Macrometropolis will be computed.



Compute totals for São Paulo Macrometropolis

  • Number of companies for each industry in São Paulo Macrometropolis
  • Total number of companies across all industries in São Paulo Macrometropolis
# List of fields for industries
industry_list <- c('Agriculture', 'Extractive Industries', 'Industries of Transformation', 'Electricity and Gas', 'Water', 'Construction', 'Trade and Repair of Motor Vehicles', 'Transport', 'Accommodation and Food', 'Information and Communication', 'Financial', 'Real Estate Activities', 'Professional', 'Admin Activities and Complementary Services', 'Public Administration', 'Education', 'Human Health and Social Services', 'Arts', 'Other Service Activities', 'Domestic Services', 'International and Extraterritorial Institutions')

# Compute total number of companies in SPM
total_companies_SPM <- sum(brazil$`Total Municipality Companies`)

# List to store number of companies for each industry
num_industry_list <- vector(mode = "list", length = length(industry_list))

# Compute total number of companies for each industry in SPM, and store into num_industry_list
for (i in 1:length(industry_list)) {
  num_companies <- sum(brazil[[industry_list[[i]]]])
  
  num_industry_list[[i]] <- num_companies
}

print(paste0('Total number of companies across all industries in São Paulo Macrometropolis is ', total_companies_SPM))
## [1] "Total number of companies across all industries in São Paulo Macrometropolis is 1125844"


Compute industry LQ

  • Tidy data to a longer format, such that for each municipality, there is a row for every industry. This will enable easier calculation of the industry LQ.
  • Add a column containing the value for the total number of companies in São Paulo Macrometropolis for the particular industry.
  • Compute industry LQ values.
brazil_industry <- brazil %>%
  as.data.frame() %>%
  # Tidy data to a longer format for easier graphing later on
  pivot_longer(cols = 4:24,
               names_to = 'Industry',
               values_to = 'Municipality Industry Companies') %>%
  # Add column for total number of companies for industry in SPM
  mutate(`SPM Industry Companies` = case_when(
    `Industry` == industry_list[[1]] ~ num_industry_list[[1]],
    `Industry` == industry_list[[2]] ~ num_industry_list[[2]],
    `Industry` == industry_list[[3]] ~ num_industry_list[[3]],
    `Industry` == industry_list[[4]] ~ num_industry_list[[4]],
    `Industry` == industry_list[[5]] ~ num_industry_list[[5]],
    `Industry` == industry_list[[6]] ~ num_industry_list[[6]],
    `Industry` == industry_list[[7]] ~ num_industry_list[[7]],
    `Industry` == industry_list[[8]] ~ num_industry_list[[8]],
    `Industry` == industry_list[[9]] ~ num_industry_list[[9]],
    `Industry` == industry_list[[10]] ~ num_industry_list[[10]],
    `Industry` == industry_list[[11]] ~ num_industry_list[[11]],
    `Industry` == industry_list[[12]] ~ num_industry_list[[12]],
    `Industry` == industry_list[[13]] ~ num_industry_list[[13]],
    `Industry` == industry_list[[14]] ~ num_industry_list[[14]],
    `Industry` == industry_list[[15]] ~ num_industry_list[[15]],
    `Industry` == industry_list[[16]] ~ num_industry_list[[16]],
    `Industry` == industry_list[[17]] ~ num_industry_list[[17]],
    `Industry` == industry_list[[18]] ~ num_industry_list[[18]],
    `Industry` == industry_list[[19]] ~ num_industry_list[[19]],
    `Industry` == industry_list[[20]] ~ num_industry_list[[20]],
    `Industry` == industry_list[[21]] ~ num_industry_list[[21]]
  )) %>%
  # Compute LQ values
  mutate(`LQ` = ifelse(
    # Handle division by 0 error
    (`SPM Industry Companies` / total_companies_SPM) != 0,
    (`Municipality Industry Companies` / `Total Municipality Companies`) /
      (`SPM Industry Companies` / total_companies_SPM),
    0)) %>%
  st_as_sf(sf_column_name = 'geom')

glimpse(brazil_industry)
## Rows: 3,654
## Columns: 9
## $ Municipality                      <chr> "Águas De São Pedro", "Águas De S...
## $ Population                        <dbl> 2707, 2707, 2707, 2707, 2707, 270...
## $ `Total Municipality Companies`    <dbl> 202, 202, 202, 202, 202, 202, 202...
## $ region                            <chr> "Piracicaba Urban Agglomeration",...
## $ geom                              <MULTIPOLYGON [°]> MULTIPOLYGON (((-47....
## $ Industry                          <chr> "Agriculture", "Extractive Indust...
## $ `Municipality Industry Companies` <dbl> 2, 1, 4, 0, 1, 6, 99, 3, 40, 5, 2...
## $ `SPM Industry Companies`          <dbl> 14492, 581, 78136, 399, 2203, 557...
## $ LQ                                <dbl> 0.7691809, 9.5929176, 0.2853223, ...


8.2 Exploratory data analysis




Identify industries that are not present in São Paulo Macrometropolis

# Calculate total number of companies in SPM for each industry
spm_companies <- brazil_industry  %>%
  as.data.frame() %>%
  group_by(`Industry`) %>%
  summarise(`Total Companies in Industry` = sum(`Municipality Industry Companies`))

# Subset industries that have no companies in SPM
subset(spm_companies, `Total Companies in Industry` == 0)
## # A tibble: 1 x 2
##   Industry          `Total Companies in Industry`
##   <chr>                                     <dbl>
## 1 Domestic Services                             0


  • It has been identified that there are São Paulo Macrometropolis do not have companies related to the Domestic Services industry.
  • Hence, subsequent analysis will not include the Domestic Services industry. It will also not be appropriate to calculate and map LQ for Domestic Services, since this industry does not exist in São Paulo Macrometropolis.


  • Remove data for Domestic Services from the data set.
brazil_industry <- brazil_industry %>%
  filter(`Industry` != 'Domestic Services')




Top 10 industries with highest number of companies in São Paulo Macrometropolis

ggplot(top_n(spm_companies, 10, `Total Companies in Industry`),
       aes(x = `Total Companies in Industry`,
           y = reorder(`Industry`, `Total Companies in Industry`),
           label = `Total Companies in Industry`)) +
  geom_col(fill='darksalmon') +
  labs(title='Top 10 industries in São Paulo Macrometropolis',
       x='Number of Companies',
       y='Industry') +
  geom_label(colour = 'gray23', size = 3) +
  scale_x_continuous(labels = scales::comma) +
  coord_cartesian(xlim = c(0, 380000)) +
  theme_minimal() +
  theme(plot.title.position = 'plot',
        plot.margin = margin(b = 10, t = 10),
        axis.title.x = element_text(margin = margin(t = 10)))


  • The booming industry in São Paulo Macrometropolis is Trade and Repair of Motor Vehicles and Motorcycles, with a total of 365,543 companies.
  • With a total of 1,125,844 companies across all industries in São Paulo Macrometropolis, Trade and Repair of Motor Vehicles industry makes up 32.5% of all in São Paulo Macrometropolis.
  • This is followed by the Administrative Activities and Complementary Services industry, which has 142,717 companies.
  • All other industries in São Paulo Macrometropolis have less than 100,000 companies.




Top 10 municipalities with most number of companies

ggplot(top_n(brazil, 10, `Total Municipality Companies`),
       aes(x = `Total Municipality Companies`,
           y = reorder(`Municipality`, `Total Municipality Companies`),
           label = `Total Municipality Companies`)) +
  geom_col(fill='darksalmon') +
  labs(title='Top 10 municipalities with most number of companies in São Paulo Macrometropolis',
       x='Number of Companies',
       y='Municipalities') +
  geom_label(colour = 'gray23', size = 3, nudge_x = 35000) +
  scale_x_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(plot.title.position = 'plot',
        plot.margin = margin(b = 10, t = 10),
        axis.title.x = element_text(margin = margin(t = 10)),
        axis.title.y = element_text(margin = margin(r = 15)))


  • The São Paulo municipality has the most number of companies, with as much as 530,446 companies in the region.
  • Comparatively, all other municipalities in São Paulo Macrometropolis have less than 50,000 companies. The municipality with the second highest number of companies, Campinas, only has 45,789 companies.




Summary statistics

summary(brazil_industry$LQ)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.3267   0.7644   1.6864   1.2156 135.1432


  • It can be observed that industry LQ ranges from 0 to 135.1. This means that there are certain municipalities which have industries that are up to 135.1 times more concentrated (more companies) in the São Paulo Macrometropolis region than average.
  • The mean industry LQ in the São Paulo Macrometropolis region is approximately 1.7. This means that on average, an industry in a municipality can be about 1.7 times more concentrated in the São Paulo Macrometropolis region than average.




Distribution of industry LQ by industry type

eda_lq_dist <- ggplot(brazil_industry, aes(x = `LQ`)) +
  geom_histogram(fill = "darksalmon") +
  facet_wrap(~`Industry`, ncol = 3, nrow = 7, scales = 'free') +
  scale_x_continuous(guide = guide_axis(check.overlap = TRUE)) +
  scale_y_continuous(guide = guide_axis(check.overlap = TRUE)) +
  theme(panel.spacing.x = unit(4, "mm"),
        strip.text.x = element_text(size = 8))

eda_lq_dist


Some observations can be made for different industry types:


  • Accommodation and Food, Arts, Construction, Education, Industries of Transformation, Other Service Activities, Transport
    • The distribution of LQ seem to follow a normal distribution across municipalities, with a mean that is centred around an LQ of 1.
    • This implies that for each of these industries, most municipalities have an industry concentration that matches or is close to the average concentration in São Paulo Macrometropolis for that particular industry.


  • Administrative Activities and Complementary Services, Financial, Human Health and Social Services, Information and Communication, Professional, Real Estate Activities
    • LQ seems to follow a normal distribution.
    • Although majority of municipalities have an LQ of less than 1.
    • This means that for these industries, most municipalities have an industry concentration that is lower than the average industry concentration in São Paulo Macrometropolis.
    • This also implies that there are a few municipalities with a higher than average industry concentration (indicated on the charts with LQ>1), suggesting specialisation in the industry.


  • On the other hand, for the Trade and Repair of Motor Vehicles industry, which is the most prevalent industry in the São Paulo Macrometropolis as observed previously:
    • The distribution of LQ seems to follow a normal distribution with a mean centred around a value larger than 1.
    • This means that for this industry, most municipalities have a higher proportion of companies compared to the average proportion of companies in São Paulo Macrometropolis for Trade and Repair of Motor Vehicles. This is consistent with the prevalence of the industry in São Paulo Macrometropolis.


  • Agriculture, Extractive Industries, Electricity and Gas, International and Other Extraterritorial Institutions and Public Administration
    • Distribution of LQ is clearly right-skewed.
    • This implies that there are few select municipalities with much higher concentration of these industries as compared to the rest of São Paulo Macrometropolis, suggesting geographical specialisation in the industry.
    • This is especially true in Public Administration and Extractive Industries, where there are certain municipalities with very large LQ values that can go up to (and even beyond) 100. This suggests high specialisation of the industries in the particular municipalities, as there can be industry concentration of up to 100 times the regional average of São Paulo Macrometropolis.




Box plot of LQ by industry type for outlier detection

ggplot(brazil_industry, aes(x = `LQ`)) +
  geom_boxplot() +
  facet_wrap(~`Industry`, ncol = 3, nrow = 7, scales = 'free') +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())


It can be observed that there are outliers for LQ, particularly in the following industries:

  • Transport
  • Extractive Industries
  • Electricity and Gas
  • Agriculture
  • Accommodation and Food
  • Information and Communication
  • Public Administration
  • International and Extraterritorial Institutions (there are clearly three municipalities that are outliers with industry specialisation)

This indicates further signs of specialisation for these industries at the municipality level.



Exploratory data analysis reveals signs of geographical specialisation for certain industries. To better understand the spatial distribution of specialisation in industries and identify municipalities with industry specialisation, the next section utilises choropleth mapping techniques for industry LQ.



8.3 Choropleth mapping


Choropleth maps will be created to spatially visualise the industry LQ for each industry type in São Paulo Macrometropolis. This will reveal spatial patterns of industry specialisation at the municipality level.

  • Notes:
    • Data classes will not be fixed across the maps for different industry types, as the range of LQ values vary greatly for different industries.
    • ‘Pretty’ classification is utilised, to create regular intervals with rounded numbers, for easier interpretation.
tm_shape(brazil_industry) +
  tm_polygons('LQ', style = 'pretty') +
  tm_facets(by = 'Industry', free.scales.fill = TRUE, scale.factor = 3) +
  tm_layout(legend.position = c('right', 'bottom'))


Based on the series of choropleth maps, the following insights can be drawn for the different industries:

  • In general, geographic concentration of industries can be observed, with some municipalities having a relatively higher concentration of the industry type as compared to the rest of the São Paulo Macrometropolis region. This is indicated by the areas having a darker red colour. Different industry types are more concentrated in municipalities at different parts of the region. Certain industry types have more specialisation than others.

  • The Trade and Repair of Motor Vehicles industry is rather prevalent across municipalities in São Paulo Macrometropolis. This is indicated by majority of municipalities having an industry LQ of 1 to 1.5, implying that for Trade and Repair of Motor Vehicles, industry concentration for majority of municipalities is close to the average industry concentration across the region. This is consistent with previous findings where the Trade and Repair of Motor Vehicles industry is a booming industry in São Paulo Macrometropolis. A few municipalities at the south of the metropolitan region of Sorocaba, and east of metropolitan region of São Paulo and Vale do Paraíba e Litoral Norte (indicated in darker red colour), have relatively higher concentration of industry (up to 2.5 times) than the rest of the region.

  • For Agriculture, Electricity and Gas, Extractive Industries and Public Administration, there is very obvious concentration (that can be up to 135 times more concentrated than the regional average) and specialisation of industries in the region, indicated by the large range of LQ values, with only a few municipalities having very large industry LQ values relative to the rest of the region.

    • The agricultural industry is concentrated at municipalities located west of the metropolitan region of Vale do Paraíba e Litoral Norte, as well as some municipalities north of the Regional Unit of Bragança Paulista city (indicated by darker red areas).
    • For electricity and gas industry, one municipality in the metropolitan region of Vale do Paraíba e Litoral Norte, Arapeí (dark red area), has the highest concentration of industry compared to the rest of the region. This indicates that Arapeí specialises in the electricity and gas industry within the entire São Paulo Macrometropolis.
    • Similarly for extractive industries, only a few municipalities across the region have relatively higher concentration of the industry than the average concentration in the region. This is indicated by the darker red areas. One municipality in the metropolitan region of Vale do Paraíba e Litoral Norte, Lavrinhas, have very high concentration of the industry compared to the rest of the region (indicated by dark red area), indicating specialisation of extractive industries in Lavrinhas.
    • The public administration industry is most concentrated in Areias (dark red area), which is located in the metropolitan region of Vale do Paraíba e Litoral Norte. Hence there is specialisation of public administration at Areias. Its neighbouring municipalities also have relatively higher industry concentration compared to the rest of the region.
  • Administrative Activities and Complementary Services seem to be most concentrated in municipalities at the Metropolitan Region of Baixada Santista. Some municipalities in the metropolitan regions of Capinas, São Paulo and Vale do Paraíba e Litoral Norte also have relatively high concentration of this industry as compared to the average concentration in the region.

  • The Information and Communication industry has relatively higher concencention at a few municipalities north of the metropolitan region of São Paulo, in particular, Pirapora do Bom Jesus.

  • Municipalities in the northwest region of São Paulo Macrometropolis has a higher concentration of Industries of Transformation compared to the rest of São Paulo Macrometropolis.

  • There are a few municipalities with higher concentration of Transport industry (dark red areas) compared to the rest of the region, which are scattered across São Paulo Macrometropolis.

  • Specialisation can also be observed for International and Extraterritorial Institutions, where only two municipalities in the metropolitan region of São Paulo (darker red areas) have a higher concentration of the industry compared to the average concentration in the region.

  • Real Estate Activities and Professional industry seem to have higher concentration of the industry at municipalities located in the centre of São Paulo Macrometropolis, and concentration diffuses radially as municipalities move further away from the centre.



Choropleth mapping of LQ values reveal that there is indeed geographical specialisation of industries at São Paulo Macrometropolis. In particular, the industries: Agriculture, Electricity and Gas, Extractive Industries and Public Administration see the greatest specialisation at certain municipalities.



  • The series of choropleth maps reveal useful information about the geographical concentration of each industry type, and allows for comparisons to be made spatially.
  • However, with 20 different industry types, there are a lot of maps and it is difficult to identify geographical clusters with industry specialisation in São Paulo Macrometropolis. Therefore, the next section aims to identify industry specialisation clusters through clustering methods.


9 Delineating Industry Specialisation Clusters


In this section, industry specialisation clusters in São Paulo Macrometropolis will be identified and segmented geographically at the municipality level, based on industry LQ values. The clusters will be delineated through agglomerative hierarchical clustering (AGNES).


9.1 Data preparation


To perform the cluster analysis, data will first be prepared, and clustering variables will be extracted.

  • Each municipality (observation) will be a row, and the LQ values for each industry (variable) will be columns.
  • The segmentation task delineates industry specialisation clusters, hence the industry LQ values for each industry type will be utilised as input variables for clustering.
  • Missing values have already been checked in preliminary data wrangling, and it has been verified that there are no missing values.


brazil_lq <- as.data.frame(brazil_industry) %>%
  select(`Municipality`, `Industry`, `LQ`) %>%
  pivot_wider(names_from = 'Industry',
              values_from = 'LQ') %>%
  # Name rows by municipality
  column_to_rownames(var = 'Municipality')

head(brazil_lq, 5)
##                    Agriculture Extractive Industries
## Águas De São Pedro   0.7691809              9.592918
## Alambari            34.5906848              0.000000
## Alumínio             0.5733378              0.000000
## Americana            0.1584573              0.179656
## Analândia           17.2638391             35.884618
##                    Industries of Transformation Electricity and Gas    Water
## Águas De São Pedro                    0.2853223                   0 2.529952
## Alambari                              1.4724295                   0 0.000000
## Alumínio                              0.9570404                   0 0.000000
## Americana                             1.9543887                   0 1.705712
## Analândia                             0.7115444                   0 0.000000
##                    Construction Trade and Repair of Motor Vehicles Transport
## Águas De São Pedro    0.5999446                          1.5094668 0.3390748
## Alambari              0.5897265                          0.6969167 0.4999497
## Alumínio              1.1925098                          1.3410731 1.3479578
## Americana             0.9999818                          1.1410504 0.7133375
## Analândia             0.7480790                          0.7414628 3.6642402
##                    Accommodation and Food Information and Communication
## Águas De São Pedro              3.0584894                     0.4772964
## Alambari                        0.9019195                     0.2815003
## Alumínio                        1.7668137                     0.1423083
## Americana                       0.9909324                     0.4826950
## Analândia                       1.9068360                     0.3570884
##                    Financial Real Estate Activities Professional
## Águas De São Pedro 0.3292271              0.2219451    0.6100723
## Alambari           0.0000000              0.9817425    0.0000000
## Alumínio           0.2454017              0.1654351    0.3536869
## Americana          0.6011614              1.4631174    0.8454813
## Analândia          0.0000000              1.1069854    0.5071383
##                    Admin Activities and Complementary Services
## Águas De São Pedro                                   0.5857906
## Alambari                                             0.1151627
## Alumínio                                             0.3784222
## Americana                                            0.7891572
## Analândia                                            0.1947814
##                    Public Administration Education
## Águas De São Pedro            15.7221020 0.4654397
## Alambari                      23.1814934 0.4575125
## Alumínio                      11.7190576 1.9659531
## Americana                      0.7361081 0.8891080
## Analândia                     19.6041025 0.0000000
##                    Human Health and Social Services      Arts
## Águas De São Pedro                        0.4025340 0.8029801
## Alambari                                  0.0000000 0.0000000
## Alumínio                                  0.3000438 1.4963283
## Americana                                 0.7689421 1.0977883
## Analândia                                 0.3346167 0.0000000
##                    Other Service Activities
## Águas De São Pedro                0.4276605
## Alambari                          0.7882064
## Alumínio                          1.4344774
## Americana                         0.7608756
## Analândia                         0.3999417
##                    International and Extraterritorial Institutions
## Águas De São Pedro                                               0
## Alambari                                                         0
## Alumínio                                                         0
## Americana                                                        0
## Analândia                                                        0


9.2 Exploratory data analysis


9.2.1 Univariate EDA


  • Detect if variables (industry LQ) have very different ranges across the the different industry types, as hierarchical clustering is very sensitive to large data range differences for variables.
  • This is done by plotting the distribution of the cluster variables (plotted previously).



  • The small multiple histogram plots reveal data range differences between the industry LQ values of different industry types.
    • Some LQ values are single-digit, while others can go up to the hundreds.
  • Hence, data standardisation of variables will have to be applied for the cluster analysis.
    • As there is skew to the distribution of some variables, it cannot be assumed that all variables come from some normal distribution. Thus, the z-score standardisation technique will not be suitable in this use case.
    • Therefore, min-max standardisation will be applied to the variables.


9.2.2 Correlation analysis


  • Bivariate EDA will be conducted, to ensure that cluster input variables (industry LQ) are not highly correlated.
  • This is to avoid the curse of multi-collinearity, which will impact the clustering results.
corrplot.mixed(cor(brazil_lq),
               lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black",
               tl.cex = 0.5,
               number.cex = 0.7)


  • From the correlation plot, it can be observed that there are no input variables that are highly correlated. All variables have a correlation coefficient of less than 0.85.
  • Thus, all input variables of industry LQ values will be retained and utilised for the cluster analysis.


9.3 Hierarchical clustering


Agglomerative hierarchical clustering (AGNES) will be performed, to produce a set of nested clusters organised as a hierarchical tree. The resulting tree will be plotted as a dendrogram, and a selected number of clusters will be visualised geospatially.


9.3.1 Data standardisation


  • As noted previously, the range of values for the cluster input variables are different.
  • Hence, variable standardisation will be performed to avoid cluster analysis results from being biased to input variables with larger values.
  • Input variables will be standardised using the min-max variable standardisation technique.
  • Using this method, the values of the standandardised input variables will range between 0 to 1.
    • A standardised industry LQ value that is closer to 1 indicates a higher concentration of the particular industry in the municipality, as compared to the rest of the São Paulo Macrometropolis region.
brazil_lq_std <- normalize(brazil_lq)

summary(brazil_lq_std)
##   Agriculture       Extractive Industries Industries of Transformation
##  Min.   :0.000000   Min.   :0.00000       Min.   :0.0000              
##  1st Qu.:0.006157   1st Qu.:0.00000       1st Qu.:0.1945              
##  Median :0.027794   Median :0.01434       Median :0.3012              
##  Mean   :0.112160   Mean   :0.06184       Mean   :0.3377              
##  3rd Qu.:0.115554   3rd Qu.:0.05508       3rd Qu.:0.4461              
##  Max.   :1.000000   Max.   :1.00000       Max.   :1.0000              
##  Electricity and Gas     Water          Construction   
##  Min.   :0.0000      Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000      1st Qu.:0.04515   1st Qu.:0.3305  
##  Median :0.0000      Median :0.08571   Median :0.4534  
##  Mean   :0.0141      Mean   :0.10959   Mean   :0.4525  
##  3rd Qu.:0.0000      3rd Qu.:0.14406   3rd Qu.:0.5572  
##  Max.   :1.0000      Max.   :1.00000   Max.   :1.0000  
##  Trade and Repair of Motor Vehicles   Transport      Accommodation and Food
##  Min.   :0.0000                     Min.   :0.0000   Min.   :0.0000        
##  1st Qu.:0.4163                     1st Qu.:0.1805   1st Qu.:0.1494        
##  Median :0.4775                     Median :0.2500   Median :0.1837        
##  Mean   :0.4721                     Mean   :0.2945   Mean   :0.2190        
##  3rd Qu.:0.5477                     3rd Qu.:0.3586   3rd Qu.:0.2420        
##  Max.   :1.0000                     Max.   :1.0000   Max.   :1.0000        
##  Information and Communication   Financial       Real Estate Activities
##  Min.   :0.00000               Min.   :0.00000   Min.   :0.0000        
##  1st Qu.:0.02666               1st Qu.:0.09555   1st Qu.:0.1520        
##  Median :0.04099               Median :0.15225   Median :0.3126        
##  Mean   :0.05542               Mean   :0.17089   Mean   :0.3168        
##  3rd Qu.:0.05853               3rd Qu.:0.22384   3rd Qu.:0.4287        
##  Max.   :1.00000               Max.   :1.00000   Max.   :1.0000        
##   Professional    Admin Activities and Complementary Services
##  Min.   :0.0000   Min.   :0.0000                             
##  1st Qu.:0.1399   1st Qu.:0.1233                             
##  Median :0.1978   Median :0.1961                             
##  Mean   :0.2156   Mean   :0.2186                             
##  3rd Qu.:0.2683   3rd Qu.:0.2760                             
##  Max.   :1.0000   Max.   :1.0000                             
##  Public Administration   Education      Human Health and Social Services
##  Min.   :0.000000      Min.   :0.0000   Min.   :0.0000                  
##  1st Qu.:0.007751      1st Qu.:0.3182   1st Qu.:0.2050                  
##  Median :0.021014      Median :0.4567   Median :0.3136                  
##  Mean   :0.051148      Mean   :0.4515   Mean   :0.3344                  
##  3rd Qu.:0.053365      3rd Qu.:0.6018   3rd Qu.:0.4269                  
##  Max.   :1.000000      Max.   :1.0000   Max.   :1.0000                  
##       Arts        Other Service Activities
##  Min.   :0.0000   Min.   :0.0000          
##  1st Qu.:0.3736   1st Qu.:0.2570          
##  Median :0.4925   Median :0.3419          
##  Mean   :0.4674   Mean   :0.3473          
##  3rd Qu.:0.5985   3rd Qu.:0.4259          
##  Max.   :1.0000   Max.   :1.0000          
##  International and Extraterritorial Institutions
##  Min.   :0.00000                                
##  1st Qu.:0.00000                                
##  Median :0.00000                                
##  Mean   :0.01048                                
##  3rd Qu.:0.00000                                
##  Max.   :1.00000


9.3.2 Compute proximity matrix


  • The proximity matrix is a measure of similarity/dissimilarity, which will be calculated and utilised for hierarchical clustering.
  • A larger value indicates higher dissimilarity between variables.
  • Euclidean distance (crow-flying distance) will be used as the distance measure to calculate the proximity matrix.
prox_matrix <- dist(brazil_lq_std, method = 'euclidean')


9.3.3 Select optimal clustering algorithm


  • There are various agglomerative hierarchical clustering algorithms that can be utilised.
  • To determine the optimal method, agglomerative coefficients will be computed for each hierarchical clustering algorithm, to measure the strength of the clustering structure.
  • Values closer to 1 suggest stronger clustering structure.
  • The computation of agglomerative coefficients will enable the identification of the algorithm resulting in a stronger clustering structure.
# Compute agglomerative coefficients of all hierarchical clustering algorithms

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(brazil_lq_std, method = x)$ac
}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.7262692 0.6818087 0.7890747 0.9168273


  • The results above reveal that Ward’s minimum variance method provides the strongest clustering structure (agglomerative coefficient = 0.92) among the four methods (average linkage, single linkage, complete linkage, Ward’s minimum variance) assessed.
  • Hence, Ward’s minimum variance method will be the selected algorithm to perform hierarchical clustering.
    • Ward’s minimum variance method has an advantage over other algorithms as it accounts for distortion of data when there is highly skewed data, through its calculation which minimises the total within-cluster variance.


9.3.4 Perform hierarchical clustering

hclust_ward <- hclust(prox_matrix, method = 'ward.D')


Plot the dendrogram (tree) that is constructed by the hierarchical clustering process

plot(as.dendrogram(hclust_ward),
     ylab = "Height",
     nodePar = list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.4, col = 1),
     # Remove labels
     leaflab = "none")


Interactive dendrogram

  • Selecting a node selects all the labels (i.e. leafs) under that node.
  • Zoom in to view the names of municipalities belonging to a cluster.
dendrogram <- as.dendrogram(hclust_ward)
plot_dendro(dendrogram) %>%
  hide_legend() %>% 
  highlight(persistent = FALSE, dynamic = TRUE)


9.3.5 Select optimal number of clusters


  • To determine the optimal number of clusters to be utilised for the analysis, the gap statistic method will be used for local optimisation.
  • The gap statistic compares the total intra-cluster variation for different values of k, with its expectation under a null reference distribution of the data.
  • A larger gap statistic means that the clustering structure is far away from the random uniform distribution of points.
  • Therefore, the estimate of the optimal number of clusters will be the value that yields the largest gap statistic.
set.seed(12345)
gap_stat <- clusGap(brazil_lq_std, FUN = hcut, nstart = 25, K.max = 20, B = 50)
fviz_gap_stat(gap_stat)


  • The gap statistic graph suggests that 18 clusters should be selected.
  • However, selecting such a large number of clusters may segment São Paulo Macrometropolis into too many groups, which runs the risk of sacrificing the legibility and interpretability of the different clusters.
  • Hence, upon closer examination, 7 clusters will be selected for the analysis, which gives a relatively large gap statistic.


Plot dendrogram with clusters highlighted for k=7

plot(as.dendrogram(hclust_ward),
     ylab = "Height",
     nodePar = list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.4, col = 1),
     # Remove labels
     leaflab = "none")
rect.hclust(hclust_ward, k = 7, border = 2:5)


9.3.6 Cluster heatmap


  • To better interpret the clustering results and enable a better understanding of the properties of each cluster, a cluster heatmap will be constructed.
heatmaply(data.matrix(brazil_lq_std),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 7,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 6,
          main="Geographic Segmentation of São Paulo Macrometropolis by Industry LQ",
          xlab = "Industry LQ",
          ylab = "Municipalities"
          ) %>%
  layout(width = 800,
         height = 600)









Based on the cluster heatmap, the industry specialisation clusters can be interpreted as such:


Cluster Colour Description
1 Light Purple Municipalities in this group specialise in International and Extraterritorial Institutions, Financial, Real Estate Activities, Professional industries or Human Health and Social Services
2 Dark Purple Municipalities specialise in Accomodation and Food or Administrative Activities and Complementary Services. These municipalities also have relatively high concentration of the Arts industry
3 Turquoise Municipalities in this group specialise in the Construction or Education industry
4 Green Municipalities in this group specialise in Real Estate Activities or Industries of Transformation
5 Olive Green Municipalities in this group specialise in Agriculture, while having relatively lower concentration for all other industries
6 Brown Municipalities in this group specialise in Water, Transport, Education, Industries of Transformation or Other Service Activities. There are also several municipalities which do not have the presence of Human Health and Social Services and the Arts industries.
7 Pink Municipalities in this group have industry specialisation in either Extractive Industries, Electricity and Gas, Information and Communication, Trade and Repair of Motor Vehicles, Accomodation and Food, or Public Administration or Arts


9.4 Visualise clusters spatially


  • The clusters obtained from hierarchical clustering will be visualised spatially.
# Derive cluster model with 7 clusters
groups <- as.factor(cutree(hclust_ward, k=7))

# Append cluster list object to sf data frame for visualisation
brazil_hcluster <- cbind(brazil, as.matrix(groups)) %>%
  rename(Cluster = `as.matrix.groups.`) %>%
  select(`Municipality`, `region`, `Cluster`)


tmap_mode('view')

tm_shape(brazil_hcluster) +
  tm_fill(col = 'Cluster',
          alpha = 0.9,
          id = 'Municipality',
          # customise pop-up tooltip information
          popup.vars = c('Cluster :' = 'Cluster',
                         'Region :' = 'region')) +
  tm_borders(lwd = 0.5,
             alpha= 0.8) +
  tm_view(set.bounds = TRUE,
          set.zoom.limits = c(7.5, 11))


Based on the spatial mapping of the hierarchical clustering results, several insights can be gleaned.

  • Generally, municipalities with similar industry specialisations in São Paulo Macrometropolis tend to be located geographically close to each other in small groups of a few municipalities. These small groups can be separated geographically.

  • Municipalities at the coast of São Paulo Macrometropolis belong the same cluster (cluster 7: green areas on the map), indicating that they specialise in similar industries. The geographical location of these municipalities could have enabled them to specialise in similar niche industries.

  • Cluster 4 (red areas on the map) is mainly located in the north-west region of São Paulo Macrometropolis. These municipalities specialise in similar industries.

  • Cluster 2 (yellow areas on map) is mainly located at the Metropolitan Region of Vale do Paraíba e Litoral Norte, with one municipality at the north of the Regional Unit of Bragança Paulista city, and the rest in the Metropolitan Region of Sorocaba.

  • Cluster 5 (blue areas on map) is mainly located at the Piracicaba Urban Agglomeration.



Industry specialisation clusters in São Paulo Macrometropolis have been geographically segmented through hierarchical clustering.



  • A limitation of hierarchical clustering for geographically referenced attributes is that spatially fragmented regions are obtained. The next section explores spatial constrained clustering method to obtain spatially homogeneous regions.


10 Delineating Industry Specialisation Social Areas


In this section, industry specialisation social areas in São Paulo Macrometropolis will be identified and segmented geographically at the municipality level, based on industry LQ values. The clusters will be delineated through spatially constrained clustering, using the SKATER algorithm (Spatial Kluster Analysis by Tree Edge Removal).

  • Unlike hierarchical clustering, the SKATER algorithm forms spatially constrained clusters, that are based on not just the attributes but also geographical location.
  • Therefore, spatially constrained clustering will enable the identification of homogeneous areas where municipalities with the same industry specialisations are able to interact and exchange knowledge and resources, through shared geographical boundaries.
  • To obtain spatially constrained clusters, the SKATER algorithm first constructs a minimum spanning tree from the region’s adjacency graph. It then prunes the tree to achieve maximum internal homogeneity.


10.1 Compute neighbour list


  • An adjacency neighbour list will first be computed using Queen contiguity, for the construction of a minimum spanning tree.
brazil_sp <- as_Spatial(brazil)
nb_list <- poly2nb(brazil_sp, queen=TRUE, snap=0.02)

summary(nb_list)
## Neighbour list object:
## Number of regions: 174 
## Number of nonzero links: 984 
## Percentage nonzero weights: 3.250099 
## Average number of links: 5.655172 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 12 13 27 
##  4  8 13 35 34 31 18 13  7  6  3  1  1 
## 4 least connected regions:
## 1 5 62 108 with 1 link
## 1 most connected region:
## 150 with 27 links


  • It can be observed that there is one most connected municipality in São Paulo Macrometropolis, that is connected to 27 different municipalities.
  • On average, each municipality is connected to about 6 other municipalities.
  • The 4 least connected municipalities are connected to only 1 municipality.


Plot neighbour list

plot(brazil_sp, border=grey(.4))
plot(nb_list, coordinates(brazil_sp), col="red", add=TRUE)


10.2 Compute minimum spanning tree


10.2.1 Calculate edge costs


  • Before the minimum spanning tree can be computed, the cost of each edge in the adjacency graph (i.e. distance between the neighbouring nodes based on dissimilarity between the values of the clustering variable(s)) is computed.
  • For each observation, the pairwise dissimilarity between values of all the clustering variables for itself and its neighbouring observations (from the neighbour list) is calculated.
  • The neighbour list is then converted into a weights object, by specifying the computed costs as weights.
costs <- nbcosts(nb_list, brazil_lq_std)
wt_list <- nb2listw(nb_list, costs, style='B')

summary(wt_list)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 174 
## Number of nonzero links: 984 
## Percentage nonzero weights: 3.250099 
## Average number of links: 5.655172 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 12 13 27 
##  4  8 13 35 34 31 18 13  7  6  3  1  1 
## 4 least connected regions:
## 1 5 62 108 with 1 link
## 1 most connected region:
## 150 with 27 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn       S0       S1      S2
## B 174 30276 771.8576 1409.952 18347.3


10.2.2 Compute minimum spanning tree

brazil_mst <- mstree(wt_list)


Plot minimum spanning tree

plot(brazil_sp, border=gray(.4))
plot.mst(brazil_mst, coordinates(brazil_sp), col="red", cex.lab=0.7, cex.circles=0.005, add=TRUE)


  • With the minimum spanning tree, the initial neighbour list is simplified to just one edge connecting each of the nodes, while passing through all the nodes.


10.3 Compute spatially constrained clusters


  • Upon construction of the minimum spanning tree, spatially constrained clusters will be computed using 7 clusters.
spatial_clusters <- skater(brazil_mst[,1:2], brazil_lq_std, method = "euclidean", 6)


View number of observations (municipalities) in each cluster

table(spatial_clusters$groups)
## 
##  1  2  3  4  5  6  7 
## 79  7 43  7 32  5  1


  • The largest spatially constrained cluster has 79 municipalities, while the smallest cluster has only one municipality.


Plot the pruned tree showing 7 clusters

plot(brazil_sp, border=gray(.4))
plot(spatial_clusters, coordinates(brazil_sp), cex.lab=.7,
     groups.colors=c("red", "green", "blue", "brown", "pink", "orange", "purple"), cex.circles=0.005, add=TRUE)


10.4 Visualise clusters spatially

  • The clusters computed using the SKATER algorithm will be visualised spatially.
groups_mat <- as.matrix(spatial_clusters$groups)

brazil_spatialcluster <- cbind(brazil_hcluster, as.factor(groups_mat)) %>%
  rename(`Spatial Cluster`=`as.factor.groups_mat.`)

tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(brazil_spatialcluster) +
  tm_fill(col = 'Spatial Cluster',
          alpha = 0.9,
          id = 'Municipality',
          # customise pop-up tooltip information
          popup.vars = c('Cluster :' = 'Spatial Cluster',
                         'Region :' = 'region'),
          title = 'Cluster') +
  tm_borders(lwd = 0.5,
             alpha= 0.8) +
  tm_view(set.bounds = TRUE,
          set.zoom.limits = c(7.5, 11))


Several insights can be obtained from the spatially constrained clusters:

  • Industry specialisation social areas in São Paulo Macrometropolis are geographically segmented in a ‘columnar’ way. The municipalities in the same cluster have similar industry specialisations, and are able to interact through their shared contiguous boundaries.

  • Municipalities at the coastal area form one cluster (cluster 4: red areas).

  • The municipality, Tuiuti, is a cluster by itself. This suggests that it is specialised in an industry that is not similar to those of its neighbouring municipalities.

  • There is an industry specialisation social area within the Metropolitan Region of São Paulo (cluster 6: red areas).

  • The Piracicaba Urban Agglomeration and the Metropolitan Region of Sorocaba have municipalities belonging in the same industry specialisation social area (cluster 3: purple areas).

  • There is also an industry specialisation social area (cluster 2: yellow areas) within the Metropolitan Region of Vale do Paraíba e Litoral Norte.

  • Cluster 5 (blue areas) is an industry specialisation social area that connects municipalities in the Metropolitan Region of Vale do Paraíba e Litoral Norte, Regional Unit of Bragança Paulista city and Metropolitan Region of São Paulo.

  • Cluster 1 (green areas) is the largest industry specialisation social area that connects municipalities in Metropolitan Region of Campinas, Regional Unit of Bragança Paulista city, Metropolitan Region of São Paulo, Metropolitan Region of Sorocaba, and Metropolitan Region of Baixada Santista.