This analysis will segment São Paulo Macrometropolis at the municipality level into homogeneous industry specialisation clusters. Such geographic segmentation will be important for businesses investing in Brazil, to gain a better understanding of the geographical specialisation and competitiveness of industry in the country.
The study area in this analysis will be the São Paulo Macrometropolis, consisting of the following divisions:
Source: Wikipedia
The following datasets will be utilised in this analysis.
| Data | Description | Format |
|---|---|---|
| Brazil Cities | Data on municipalities in Brazil, including information on population, economic indexes and industry. From kaggle | CSV |
| Brazil Municipalities | 2016 municipality boundary in Brazil, retrieved using geobr package | Spatial file |
| Brazil Metropolitan Areas | 2018 (latest) metropolitan boundary in Brazil, retrieved using geobr package | Spatial file |
The following R packages will be used in the analysis:
tidyverse: data import and manipulationsf: spatial data manipulationtmap: spatial mappinggeobr: to download official spatial data sets of Brazilpackages = c('tidyverse', 'tmap', 'sf', 'geobr', 'spdep', 'cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych', 'plotly')
for (p in packages) {
if (!require(p, character.only = T)) {
install.packages(p)
}
library(p, character.only = T)
}
cities <- read_delim('data/aspatial/BRAZIL_CITIES.csv', delim = ';')
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## CITY = col_character(),
## STATE = col_character(),
## AREA = col_number(),
## REGIAO_TUR = col_character(),
## CATEGORIA_TUR = col_character(),
## RURAL_URBAN = col_character(),
## GVA_MAIN = col_character()
## )
## i Use `spec()` for the full column specifications.
glimpse(cities)
## Rows: 5,573
## Columns: 81
## $ CITY <chr> "Abadia De Goiás", "Abadia Dos Dourados", ...
## $ STATE <chr> "GO", "MG", "GO", "MG", "PA", "CE", "BA", ...
## $ CAPITAL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ IBGE_RES_POP <dbl> 6876, 6704, 15757, 22690, 141100, 10496, 8...
## $ IBGE_RES_POP_BRAS <dbl> 6876, 6704, 15609, 22690, 141040, 10496, 8...
## $ IBGE_RES_POP_ESTR <dbl> 0, 0, 148, 0, 60, 0, 0, 0, 0, 0, 0, 16, 17...
## $ IBGE_DU <dbl> 2137, 2328, 4655, 7694, 31061, 2791, 2572,...
## $ IBGE_DU_URBAN <dbl> 1546, 1481, 3233, 6667, 19057, 1251, 1193,...
## $ IBGE_DU_RURAL <dbl> 591, 847, 1422, 1027, 12004, 1540, 1379, 1...
## $ IBGE_POP <dbl> 5300, 4154, 10656, 18464, 82956, 4538, 372...
## $ IBGE_1 <dbl> 69, 38, 139, 176, 1354, 98, 37, 167, 69, 1...
## $ `IBGE_1-4` <dbl> 318, 207, 650, 856, 5567, 323, 156, 733, 3...
## $ `IBGE_5-9` <dbl> 438, 260, 894, 1233, 7618, 421, 263, 978, ...
## $ `IBGE_10-14` <dbl> 517, 351, 1087, 1539, 8905, 483, 277, 927,...
## $ `IBGE_15-59` <dbl> 3542, 2709, 6896, 11979, 53516, 2631, 2319...
## $ `IBGE_60+` <dbl> 416, 589, 990, 2681, 5996, 582, 673, 803, ...
## $ IBGE_PLANTED_AREA <dbl> 319, 4479, 10307, 1862, 25200, 2598, 895, ...
## $ `IBGE_CROP_PRODUCTION_$` <dbl> 1843, 18017, 33085, 7502, 700872, 5234, 39...
## $ `IDHM Ranking 2010` <dbl> 1689, 2207, 2202, 1994, 3530, 3522, 4086, ...
## $ IDHM <dbl> 0.708, 0.690, 0.690, 0.698, 0.628, 0.628, ...
## $ IDHM_Renda <dbl> 0.687, 0.693, 0.671, 0.720, 0.579, 0.540, ...
## $ IDHM_Longevidade <dbl> 0.830, 0.839, 0.841, 0.848, 0.798, 0.748, ...
## $ IDHM_Educacao <dbl> 0.622, 0.563, 0.579, 0.556, 0.537, 0.612, ...
## $ LONG <dbl> -49.44055, -47.39683, -48.71881, -45.44619...
## $ LAT <dbl> -16.758812, -18.487565, -16.182672, -19.15...
## $ ALT <dbl> 893.60, 753.12, 1017.55, 644.74, 10.12, 40...
## $ PAY_TV <dbl> 360, 77, 227, 1230, 3389, 29, 952, 51, 55,...
## $ FIXED_PHONES <dbl> 842, 296, 720, 1716, 1218, 34, 335, 222, 3...
## $ AREA <dbl> 147.26, 881.06, 1045.13, 1817.07, 1610.65,...
## $ REGIAO_TUR <chr> NA, "Caminhos Do Cerrado", "Região Turísti...
## $ CATEGORIA_TUR <chr> NA, "D", "C", "D", "D", NA, "D", NA, NA, "...
## $ ESTIMATED_POP <dbl> 8583, 6972, 19614, 23223, 156292, 11663, 8...
## $ RURAL_URBAN <chr> "Urbano", "Rural Adjacente", "Rural Adjace...
## $ GVA_AGROPEC <dbl> 6.20, 50524.57, 42.84, 113824.60, 140463.7...
## $ GVA_INDUSTRY <dbl> 27991.25, 25917.70, 16728.30, 31002.62, 58...
## $ GVA_SERVICES <dbl> 74750.32, 62689.23, 138198.58, 172.33, 468...
## $ GVA_PUBLIC <dbl> 36915.04, 28083.79, 63396.20, 86081.41, 48...
## $ ` GVA_TOTAL ` <dbl> 145857.60, 167215.28, 261161.91, 403241.27...
## $ TAXES <dbl> 20554.20, 12873.50, 26822.58, 26994.09, 95...
## $ GDP <dbl> 166.41, 180.09, 287984.49, 430235.36, 1249...
## $ POP_GDP <dbl> 8053, 7037, 18427, 23574, 151934, 11483, 9...
## $ GDP_CAPITA <dbl> 20664.57, 25591.70, 15628.40, 18250.42, 82...
## $ GVA_MAIN <chr> "Demais serviços", "Demais serviços", "Dem...
## $ MUN_EXPENDIT <dbl> 28227691, 17909274, 37513019, NA, NA, NA, ...
## $ COMP_TOT <dbl> 284, 476, 288, 621, 931, 86, 191, 87, 285,...
## $ COMP_A <dbl> 5, 6, 5, 18, 4, 1, 6, 2, 5, 2, 0, 8, 3, 1,...
## $ COMP_B <dbl> 1, 6, 9, 1, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, ...
## $ COMP_C <dbl> 56, 30, 26, 40, 43, 4, 8, 3, 20, 4, 9, 40,...
## $ COMP_D <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
## $ COMP_E <dbl> 2, 2, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 2, ...
## $ COMP_F <dbl> 29, 34, 7, 20, 27, 6, 4, 0, 10, 2, 0, 25, ...
## $ COMP_G <dbl> 110, 190, 117, 303, 500, 48, 97, 71, 133, ...
## $ COMP_H <dbl> 26, 70, 12, 62, 16, 2, 5, 0, 18, 8, 1, 67,...
## $ COMP_I <dbl> 4, 28, 57, 30, 31, 10, 5, 1, 14, 3, 0, 25,...
## $ COMP_J <dbl> 5, 11, 2, 9, 6, 2, 3, 1, 8, 1, 1, 9, 5, 14...
## $ COMP_K <dbl> 0, 0, 1, 6, 1, 0, 1, 0, 0, 1, 0, 4, 3, 3, ...
## $ COMP_L <dbl> 2, 4, 0, 4, 1, 0, 0, 0, 4, 0, 0, 7, 4, 4, ...
## $ COMP_M <dbl> 10, 15, 7, 28, 22, 2, 5, 0, 11, 4, 2, 26, ...
## $ COMP_N <dbl> 12, 29, 15, 27, 16, 3, 5, 1, 26, 0, 1, 16,...
## $ COMP_O <dbl> 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 6, 2, 4, 2, ...
## $ COMP_P <dbl> 6, 9, 11, 15, 155, 0, 8, 0, 8, 1, 6, 14, 1...
## $ COMP_Q <dbl> 6, 14, 5, 19, 33, 2, 1, 2, 9, 3, 0, 13, 22...
## $ COMP_R <dbl> 1, 6, 1, 9, 15, 0, 2, 0, 4, 0, 0, 4, 6, 6,...
## $ COMP_S <dbl> 5, 19, 8, 27, 56, 4, 38, 4, 12, 3, 4, 23, ...
## $ COMP_T <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COMP_U <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ HOTELS <dbl> NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, ...
## $ BEDS <dbl> NA, NA, 34, NA, NA, NA, 24, NA, NA, NA, NA...
## $ Pr_Agencies <dbl> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0,...
## $ Pu_Agencies <dbl> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1,...
## $ Pr_Bank <dbl> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0,...
## $ Pu_Bank <dbl> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1,...
## $ Pr_Assets <dbl> NA, NA, 33724584, 44974716, 76181384, NA, ...
## $ Pu_Assets <dbl> NA, NA, 67091904, 371922572, 800078483, NA...
## $ Cars <dbl> 2158, 2227, 2838, 6928, 5277, 553, 896, 61...
## $ Motorcycles <dbl> 1246, 1142, 1426, 2953, 25661, 1674, 696, ...
## $ Wheeled_tractor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, ...
## $ UBER <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ MAC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `WAL-MART` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ POST_OFFICES <dbl> 1, 1, 3, 4, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, ...
muni <- read_municipality(year=2016, showProgress = FALSE)
glimpse(muni)
## Rows: 5,572
## Columns: 5
## $ code_muni <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100064, ...
## $ name_muni <chr> "Alta Floresta D'oeste", "Ariquemes", "Cabixi", "Cacoa...
## $ code_state <chr> "11", "11", "11", "11", "11", "11", "11", "11", "11", ...
## $ abbrev_state <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", ...
## $ geom <MULTIPOLYGON [°]> MULTIPOLYGON (((-62.19465 -..., MULTIPOLY...
qtm(muni)
metro <- read_metro_area(year = 2016, showProgress = FALSE)
glimpse(metro)
## Rows: 1,331
## Columns: 10
## $ code_muni <dbl> 2700300, 2701506, 2702009, 2702355, 2702603, 27029...
## $ name_muni <chr> "Arapiraca", "Campo Grande", "Coité Do Nóia", "Cra...
## $ code_state <chr> "27", "27", "27", "27", "27", "27", "27", "27", "2...
## $ abbrev_state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A...
## $ name_metro <chr> "RM Agreste", "RM Agreste", "RM Agreste", "RM Agre...
## $ type <chr> "RM", "RM", "RM", "RM", "RM", "RM", "RM", "RM", "R...
## $ subdivision <chr> "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO T...
## $ legislation <chr> "Lei Complementar 27", "Lei Complementar 27", "Lei...
## $ legislation_date <chr> "30.11.2009", "30.11.2009", "30.11.2009", "30.11.2...
## $ geom <MULTIPOLYGON [°]> MULTIPOLYGON (((-36.66005 -..., MULTI...
qtm(metro)
The municipalities for the targeted study area, São Paulo Macrometropolis, will be extracted from the data sets in this section.
The following divisions of São Paulo Macrometropolis will be extracted from the Brazil metropolitan areas data set.
# Names for target regions of name_metro field
metro_name <- c('RM São Paulo', 'RM Campinas', 'RM do Vale do Paraíba e Litoral Norte', 'RM de Sorocaba', 'RM Baixada Santista', 'Aglomeração Urbana de Piracicaba-AU- Piracicaba', 'Aglomeração Urbana de Jundiaí')
metro_study <- metro %>%
filter(name_metro %in% metro_name) %>%
# Add a column region, labelling the corresponding regions
mutate(region = case_when(
name_metro == 'RM São Paulo' ~ 'Metropolitan Region of São Paulo',
name_metro == 'RM Campinas' ~ 'Metropolitan Region of Campinas',
name_metro == 'RM do Vale do Paraíba e Litoral Norte' ~ 'Metropolitan Region of Vale do Paraíba e Litoral Norte',
name_metro == 'RM de Sorocaba' ~ 'Metropolitan Region of Sorocaba',
name_metro == 'RM Baixada Santista' ~ 'Metropolitan Region of Baixada Santista',
name_metro == 'Aglomeração Urbana de Piracicaba-AU- Piracicaba' ~ 'Piracicaba Urban Agglomeration',
name_metro == 'Aglomeração Urbana de Jundiaí' ~ 'Jundiaí Urban Agglomeration'
))
glimpse(metro_study)
## Rows: 164
## Columns: 11
## $ code_muni <dbl> 3508405, 3509601, 3524006, 3525201, 3525904, 35273...
## $ name_muni <chr> "Cabreúva", "Campo Limpo Paulista", "Itupeva", "Ja...
## $ code_state <chr> "35", "35", "35", "35", "35", "35", "35", "35", "3...
## $ abbrev_state <chr> "SP", "SP", "SP", "SP", "SP", "SP", "SP", "SP", "S...
## $ name_metro <chr> "Aglomeração Urbana de Jundiaí", "Aglomeração Urba...
## $ type <chr> "AGLO", "AGLO", "AGLO", "AGLO", "AGLO", "AGLO", "A...
## $ subdivision <chr> "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO T...
## $ legislation <chr> "Lei Complementar 1.146", "Lei Complementar 1.146"...
## $ legislation_date <chr> "24.08.2011", "24.08.2011", "24.08.2011", "24.08.2...
## $ geom <MULTIPOLYGON [°]> MULTIPOLYGON (((-47.09692 -..., MULTI...
## $ region <chr> "Jundiaí Urban Agglomeration", "Jundiaí Urban Aggl...
qtm(metro_study)
The Regional Unit of Bragança Paulista city division of São Paulo Macrometropolis will be extracted from the Brazil municipalities data set. The Regional Unit of Bragança Paulista city division consists of the following municipalities with the corresponding municipality codes, which will be extracted.
| No. | Municipality | Code |
|---|---|---|
| 1 | Atibaia | 3504107 |
| 2 | Bragança Paulista | 3507605 |
| 3 | Joanópolis | 3525508 |
| 4 | Nazaré Paulista | 3532405 |
| 5 | Pedra Bela | 3536802 |
| 6 | Pinhalzinho | 3538204 |
| 7 | Piracaia | 3538600 |
| 8 | Tuiuti | 3554953 |
| 9 | Vargem | 3556354 |
| 10 | Bom Jesus Dos Perdões | 3507100 |
# Codes for target municipalities of code_muni field
muni_codes <- c('3504107', '3507605', '3525508', '3532405', '3536802', '3538204', '3538600', '3554953', '3556354', '3507100')
muni_study <- muni %>%
filter(code_muni %in% muni_codes) %>%
# Add a column region, labelling the corresponding regions
mutate(region = 'Regional Unit of Bragança Paulista city')
glimpse(muni_study)
## Rows: 10
## Columns: 6
## $ code_muni <dbl> 3504107, 3507100, 3507605, 3525508, 3532405, 3536802, ...
## $ name_muni <chr> "Atibaia", "Bom Jesus Dos Perdões", "Bragança Paulista...
## $ code_state <chr> "35", "35", "35", "35", "35", "35", "35", "35", "35", ...
## $ abbrev_state <chr> "SP", "SP", "SP", "SP", "SP", "SP", "SP", "SP", "SP", ...
## $ geom <MULTIPOLYGON [°]> MULTIPOLYGON (((-46.66146 -..., MULTIPOLY...
## $ region <chr> "Regional Unit of Bragança Paulista city", "Regional U...
qtm(muni_study)
The target areas that were extracted above will be combined to form the São Paulo Macrometropolis study area.
# Add missing columns from metro_study to muni_study, such that data frames can be joined
miss_col <- c('name_metro', 'type', 'subdivision', 'legislation', 'legislation_date')
muni_study <- cbind(muni_study,
setNames(lapply(miss_col, function(x) x=NA),
miss_col))
# Combine extracted data sets
macrometro <- rbind(metro_study, muni_study)
glimpse(macrometro)
## Rows: 174
## Columns: 11
## $ code_muni <dbl> 3508405, 3509601, 3524006, 3525201, 3525904, 35273...
## $ name_muni <chr> "Cabreúva", "Campo Limpo Paulista", "Itupeva", "Ja...
## $ code_state <chr> "35", "35", "35", "35", "35", "35", "35", "35", "3...
## $ abbrev_state <chr> "SP", "SP", "SP", "SP", "SP", "SP", "SP", "SP", "S...
## $ name_metro <chr> "Aglomeração Urbana de Jundiaí", "Aglomeração Urba...
## $ type <chr> "AGLO", "AGLO", "AGLO", "AGLO", "AGLO", "AGLO", "A...
## $ subdivision <chr> "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO TEM", "NÃO T...
## $ legislation <chr> "Lei Complementar 1.146", "Lei Complementar 1.146"...
## $ legislation_date <chr> "24.08.2011", "24.08.2011", "24.08.2011", "24.08.2...
## $ region <chr> "Jundiaí Urban Agglomeration", "Jundiaí Urban Aggl...
## $ geom <MULTIPOLYGON [°]> MULTIPOLYGON (((-47.09692 -..., MULTI...
Data will be pre-processed and prepared for the analysis.
Ensure that spatial data to be used for analysis has no invalid geometries.
length(which(st_is_valid(macrometro) == FALSE))
## [1] 0
The coordinate reference system (CRS) for the spatial data set will be checked and defined accordingly if necessary.
st_crs(macrometro)
## Coordinate Reference System:
## User input: SIRGAS 2000
## wkt:
## GEOGCRS["SIRGAS 2000",
## DATUM["Sistema de Referencia Geocentrico para las AmericaS 2000",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["unknown"],
## AREA["Latin America - SIRGAS 2000 by country"],
## BBOX[-59.87,-122.19,32.72,-25.28]],
## ID["EPSG",4674]]
The Brazil cities attribute data will be joined with the spatial data set.
brazil <- inner_join(cities, macrometro %>% as.data.frame(), by = c('CITY' = 'name_muni')) %>%
# Remove municipalities not in São Paulo
filter(STATE == 'SP')
nrow(brazil)
## [1] 173
# Check difference
setdiff(macrometro$name_muni, brazil$CITY)
## [1] "Santa Bárbara D'oeste"
# Case-insensitive search
missing_index <- match("santa bárbara d'oeste", tolower(cities$CITY))
missing_index
## [1] 4352
cities[missing_index,]$CITY
## [1] "Santa Bárbara D'Oeste"
Retrieve missing row
missing_row <- subset(macrometro, name_muni == "Santa Bárbara D'oeste") %>%
select(-name_muni)
missing_row <- cbind(cities[missing_index,], missing_row)
Add missing row to joined data frame
brazil <- rbind(brazil, missing_row)
brazil <- st_as_sf(brazil, sf_column_name = 'geom')
nrow(brazil)
## [1] 174
As there are 91 fields in the newly joined data set, only fields relevant to the analysis will be selected and extracted. Fields will also be renamed for easier usage of data set in analysis.
# Extract relevant fields only
brazil <- brazil %>%
select(CITY, IBGE_RES_POP, contains("COMP"), region, geom)
# Rename field names
brazil <- setNames(brazil, c('Municipality',
'Population',
'Total Municipality Companies',
'Agriculture',
'Extractive Industries',
'Industries of Transformation',
'Electricity and Gas',
'Water',
'Construction',
'Trade and Repair of Motor Vehicles',
'Transport',
'Accommodation and Food',
'Information and Communication',
'Financial',
'Real Estate Activities',
'Professional',
'Admin Activities and Complementary Services',
'Public Administration',
'Education',
'Human Health and Social Services',
'Arts',
'Other Service Activities',
'Domestic Services',
'International and Extraterritorial Institutions',
'region',
'geom'))
Ensure that there are no missing values, since it will impact calculations as well as clustering in the later analysis.
length(which(is.na(brazil)))
## [1] 0
The following is the final data set that will be utilised in this analysis, which contains municipality level information about the number of companies for different industries.
glimpse(brazil)
## Rows: 174
## Columns: 26
## $ Municipality <chr> "Águas De São Ped...
## $ Population <dbl> 2707, 4884, 16839...
## $ `Total Municipality Companies` <dbl> 202, 137, 271, 10...
## $ Agriculture <dbl> 2, 61, 2, 22, 36,...
## $ `Extractive Industries` <dbl> 1, 0, 0, 1, 3, 3,...
## $ `Industries of Transformation` <dbl> 4, 14, 18, 1463, ...
## $ `Electricity and Gas` <dbl> 0, 0, 0, 0, 0, 0,...
## $ Water <dbl> 1, 0, 0, 36, 0, 2...
## $ Construction <dbl> 6, 4, 16, 534, 6,...
## $ `Trade and Repair of Motor Vehicles` <dbl> 99, 31, 118, 3996...
## $ Transport <dbl> 3, 3, 16, 337, 26...
## $ `Accommodation and Food` <dbl> 40, 8, 31, 692, 2...
## $ `Information and Communication` <dbl> 5, 2, 2, 270, 3, ...
## $ Financial <dbl> 2, 0, 2, 195, 0, ...
## $ `Real Estate Activities` <dbl> 1, 3, 1, 352, 4, ...
## $ Professional <dbl> 9, 0, 7, 666, 6, ...
## $ `Admin Activities and Complementary Services` <dbl> 15, 2, 13, 1079, ...
## $ `Public Administration` <dbl> 2, 2, 2, 5, 2, 2,...
## $ Education <dbl> 3, 2, 17, 306, 0,...
## $ `Human Health and Social Services` <dbl> 3, 0, 3, 306, 2, ...
## $ Arts <dbl> 2, 0, 5, 146, 0, ...
## $ `Other Service Activities` <dbl> 4, 5, 18, 380, 3,...
## $ `Domestic Services` <dbl> 0, 0, 0, 0, 0, 0,...
## $ `International and Extraterritorial Institutions` <dbl> 0, 0, 0, 0, 0, 0,...
## $ region <chr> "Piracicaba Urban...
## $ geom <MULTIPOLYGON [°]> MULT...
The study area for this analysis, São Paulo Macrometropolis, is plotted below.
tm_shape(brazil) +
tm_fill('region',
palette = c('mistyrose3', 'slategray3', 'seashell2', 'azure3', 'bisque3', 'thistle3', 'cornsilk2', 'lightcyan3'),
title = 'Region') +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(main.title = 'São Paulo Macrometropolis',
main.title.position = 'center',
main.title.size = 1,
legend.outside = TRUE,
legend.outside.position = 'bottom',
frame = FALSE)
In this section, the distribution of spatial specialisation of São Paulo Macrometropolis (abbrev. SPM) will be studied by industry type, at the municipality level. To do this, the industry location quotient (LQ) for each industry type will be computed and visualised spatially through choropleth mapping.
Industry LQ quantifies how ‘concentrated’ an industry is in a region compared to a larger geographic area, such as the state or nation (reference). It can reveal industries that make a particular region ‘unique’ in comparison to the average of the entire geographical area.
The industry LQ for each industry will be computed for each municipality in São Paulo Macrometropolis. The value of the industry LQ can then be interpreted as the relative concentration of the particular industry in the particular municipality, compared to the rest of São Paulo Macrometropolis.
The industry LQ of industry \(z\) for a given municipality is given by,
\[Industry\ LQ = \frac{No.\ of\ industry\ z\ companies\ in\ municipality\ /\ Total\ no.\ of\ companies\ in\ municipality}{No.\ of\ industry\ z\ companies\ in\ SPM\ /\ Total\ no.\ of\ companies\ in\ SPM}\]
The industry LQ of each industry for every municipality in São Paulo Macrometropolis will be computed.
Compute totals for São Paulo Macrometropolis
# List of fields for industries
industry_list <- c('Agriculture', 'Extractive Industries', 'Industries of Transformation', 'Electricity and Gas', 'Water', 'Construction', 'Trade and Repair of Motor Vehicles', 'Transport', 'Accommodation and Food', 'Information and Communication', 'Financial', 'Real Estate Activities', 'Professional', 'Admin Activities and Complementary Services', 'Public Administration', 'Education', 'Human Health and Social Services', 'Arts', 'Other Service Activities', 'Domestic Services', 'International and Extraterritorial Institutions')
# Compute total number of companies in SPM
total_companies_SPM <- sum(brazil$`Total Municipality Companies`)
# List to store number of companies for each industry
num_industry_list <- vector(mode = "list", length = length(industry_list))
# Compute total number of companies for each industry in SPM, and store into num_industry_list
for (i in 1:length(industry_list)) {
num_companies <- sum(brazil[[industry_list[[i]]]])
num_industry_list[[i]] <- num_companies
}
print(paste0('Total number of companies across all industries in São Paulo Macrometropolis is ', total_companies_SPM))
## [1] "Total number of companies across all industries in São Paulo Macrometropolis is 1125844"
Compute industry LQ
brazil_industry <- brazil %>%
as.data.frame() %>%
# Tidy data to a longer format for easier graphing later on
pivot_longer(cols = 4:24,
names_to = 'Industry',
values_to = 'Municipality Industry Companies') %>%
# Add column for total number of companies for industry in SPM
mutate(`SPM Industry Companies` = case_when(
`Industry` == industry_list[[1]] ~ num_industry_list[[1]],
`Industry` == industry_list[[2]] ~ num_industry_list[[2]],
`Industry` == industry_list[[3]] ~ num_industry_list[[3]],
`Industry` == industry_list[[4]] ~ num_industry_list[[4]],
`Industry` == industry_list[[5]] ~ num_industry_list[[5]],
`Industry` == industry_list[[6]] ~ num_industry_list[[6]],
`Industry` == industry_list[[7]] ~ num_industry_list[[7]],
`Industry` == industry_list[[8]] ~ num_industry_list[[8]],
`Industry` == industry_list[[9]] ~ num_industry_list[[9]],
`Industry` == industry_list[[10]] ~ num_industry_list[[10]],
`Industry` == industry_list[[11]] ~ num_industry_list[[11]],
`Industry` == industry_list[[12]] ~ num_industry_list[[12]],
`Industry` == industry_list[[13]] ~ num_industry_list[[13]],
`Industry` == industry_list[[14]] ~ num_industry_list[[14]],
`Industry` == industry_list[[15]] ~ num_industry_list[[15]],
`Industry` == industry_list[[16]] ~ num_industry_list[[16]],
`Industry` == industry_list[[17]] ~ num_industry_list[[17]],
`Industry` == industry_list[[18]] ~ num_industry_list[[18]],
`Industry` == industry_list[[19]] ~ num_industry_list[[19]],
`Industry` == industry_list[[20]] ~ num_industry_list[[20]],
`Industry` == industry_list[[21]] ~ num_industry_list[[21]]
)) %>%
# Compute LQ values
mutate(`LQ` = ifelse(
# Handle division by 0 error
(`SPM Industry Companies` / total_companies_SPM) != 0,
(`Municipality Industry Companies` / `Total Municipality Companies`) /
(`SPM Industry Companies` / total_companies_SPM),
0)) %>%
st_as_sf(sf_column_name = 'geom')
glimpse(brazil_industry)
## Rows: 3,654
## Columns: 9
## $ Municipality <chr> "Águas De São Pedro", "Águas De S...
## $ Population <dbl> 2707, 2707, 2707, 2707, 2707, 270...
## $ `Total Municipality Companies` <dbl> 202, 202, 202, 202, 202, 202, 202...
## $ region <chr> "Piracicaba Urban Agglomeration",...
## $ geom <MULTIPOLYGON [°]> MULTIPOLYGON (((-47....
## $ Industry <chr> "Agriculture", "Extractive Indust...
## $ `Municipality Industry Companies` <dbl> 2, 1, 4, 0, 1, 6, 99, 3, 40, 5, 2...
## $ `SPM Industry Companies` <dbl> 14492, 581, 78136, 399, 2203, 557...
## $ LQ <dbl> 0.7691809, 9.5929176, 0.2853223, ...
Identify industries that are not present in São Paulo Macrometropolis
# Calculate total number of companies in SPM for each industry
spm_companies <- brazil_industry %>%
as.data.frame() %>%
group_by(`Industry`) %>%
summarise(`Total Companies in Industry` = sum(`Municipality Industry Companies`))
# Subset industries that have no companies in SPM
subset(spm_companies, `Total Companies in Industry` == 0)
## # A tibble: 1 x 2
## Industry `Total Companies in Industry`
## <chr> <dbl>
## 1 Domestic Services 0
brazil_industry <- brazil_industry %>%
filter(`Industry` != 'Domestic Services')
Top 10 industries with highest number of companies in São Paulo Macrometropolis
ggplot(top_n(spm_companies, 10, `Total Companies in Industry`),
aes(x = `Total Companies in Industry`,
y = reorder(`Industry`, `Total Companies in Industry`),
label = `Total Companies in Industry`)) +
geom_col(fill='darksalmon') +
labs(title='Top 10 industries in São Paulo Macrometropolis',
x='Number of Companies',
y='Industry') +
geom_label(colour = 'gray23', size = 3) +
scale_x_continuous(labels = scales::comma) +
coord_cartesian(xlim = c(0, 380000)) +
theme_minimal() +
theme(plot.title.position = 'plot',
plot.margin = margin(b = 10, t = 10),
axis.title.x = element_text(margin = margin(t = 10)))
Top 10 municipalities with most number of companies
ggplot(top_n(brazil, 10, `Total Municipality Companies`),
aes(x = `Total Municipality Companies`,
y = reorder(`Municipality`, `Total Municipality Companies`),
label = `Total Municipality Companies`)) +
geom_col(fill='darksalmon') +
labs(title='Top 10 municipalities with most number of companies in São Paulo Macrometropolis',
x='Number of Companies',
y='Municipalities') +
geom_label(colour = 'gray23', size = 3, nudge_x = 35000) +
scale_x_continuous(labels = scales::comma) +
theme_minimal() +
theme(plot.title.position = 'plot',
plot.margin = margin(b = 10, t = 10),
axis.title.x = element_text(margin = margin(t = 10)),
axis.title.y = element_text(margin = margin(r = 15)))
Summary statistics
summary(brazil_industry$LQ)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3267 0.7644 1.6864 1.2156 135.1432
Distribution of industry LQ by industry type
eda_lq_dist <- ggplot(brazil_industry, aes(x = `LQ`)) +
geom_histogram(fill = "darksalmon") +
facet_wrap(~`Industry`, ncol = 3, nrow = 7, scales = 'free') +
scale_x_continuous(guide = guide_axis(check.overlap = TRUE)) +
scale_y_continuous(guide = guide_axis(check.overlap = TRUE)) +
theme(panel.spacing.x = unit(4, "mm"),
strip.text.x = element_text(size = 8))
eda_lq_dist
Some observations can be made for different industry types:
Box plot of LQ by industry type for outlier detection
ggplot(brazil_industry, aes(x = `LQ`)) +
geom_boxplot() +
facet_wrap(~`Industry`, ncol = 3, nrow = 7, scales = 'free') +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
It can be observed that there are outliers for LQ, particularly in the following industries:
This indicates further signs of specialisation for these industries at the municipality level.
Exploratory data analysis reveals signs of geographical specialisation for certain industries. To better understand the spatial distribution of specialisation in industries and identify municipalities with industry specialisation, the next section utilises choropleth mapping techniques for industry LQ.
Choropleth maps will be created to spatially visualise the industry LQ for each industry type in São Paulo Macrometropolis. This will reveal spatial patterns of industry specialisation at the municipality level.
tm_shape(brazil_industry) +
tm_polygons('LQ', style = 'pretty') +
tm_facets(by = 'Industry', free.scales.fill = TRUE, scale.factor = 3) +
tm_layout(legend.position = c('right', 'bottom'))
Based on the series of choropleth maps, the following insights can be drawn for the different industries:
In general, geographic concentration of industries can be observed, with some municipalities having a relatively higher concentration of the industry type as compared to the rest of the São Paulo Macrometropolis region. This is indicated by the areas having a darker red colour. Different industry types are more concentrated in municipalities at different parts of the region. Certain industry types have more specialisation than others.
The Trade and Repair of Motor Vehicles industry is rather prevalent across municipalities in São Paulo Macrometropolis. This is indicated by majority of municipalities having an industry LQ of 1 to 1.5, implying that for Trade and Repair of Motor Vehicles, industry concentration for majority of municipalities is close to the average industry concentration across the region. This is consistent with previous findings where the Trade and Repair of Motor Vehicles industry is a booming industry in São Paulo Macrometropolis. A few municipalities at the south of the metropolitan region of Sorocaba, and east of metropolitan region of São Paulo and Vale do Paraíba e Litoral Norte (indicated in darker red colour), have relatively higher concentration of industry (up to 2.5 times) than the rest of the region.
For Agriculture, Electricity and Gas, Extractive Industries and Public Administration, there is very obvious concentration (that can be up to 135 times more concentrated than the regional average) and specialisation of industries in the region, indicated by the large range of LQ values, with only a few municipalities having very large industry LQ values relative to the rest of the region.
Administrative Activities and Complementary Services seem to be most concentrated in municipalities at the Metropolitan Region of Baixada Santista. Some municipalities in the metropolitan regions of Capinas, São Paulo and Vale do Paraíba e Litoral Norte also have relatively high concentration of this industry as compared to the average concentration in the region.
The Information and Communication industry has relatively higher concencention at a few municipalities north of the metropolitan region of São Paulo, in particular, Pirapora do Bom Jesus.
Municipalities in the northwest region of São Paulo Macrometropolis has a higher concentration of Industries of Transformation compared to the rest of São Paulo Macrometropolis.
There are a few municipalities with higher concentration of Transport industry (dark red areas) compared to the rest of the region, which are scattered across São Paulo Macrometropolis.
Specialisation can also be observed for International and Extraterritorial Institutions, where only two municipalities in the metropolitan region of São Paulo (darker red areas) have a higher concentration of the industry compared to the average concentration in the region.
Real Estate Activities and Professional industry seem to have higher concentration of the industry at municipalities located in the centre of São Paulo Macrometropolis, and concentration diffuses radially as municipalities move further away from the centre.
Choropleth mapping of LQ values reveal that there is indeed geographical specialisation of industries at São Paulo Macrometropolis. In particular, the industries: Agriculture, Electricity and Gas, Extractive Industries and Public Administration see the greatest specialisation at certain municipalities.
In this section, industry specialisation clusters in São Paulo Macrometropolis will be identified and segmented geographically at the municipality level, based on industry LQ values. The clusters will be delineated through agglomerative hierarchical clustering (AGNES).
To perform the cluster analysis, data will first be prepared, and clustering variables will be extracted.
brazil_lq <- as.data.frame(brazil_industry) %>%
select(`Municipality`, `Industry`, `LQ`) %>%
pivot_wider(names_from = 'Industry',
values_from = 'LQ') %>%
# Name rows by municipality
column_to_rownames(var = 'Municipality')
head(brazil_lq, 5)
## Agriculture Extractive Industries
## Águas De São Pedro 0.7691809 9.592918
## Alambari 34.5906848 0.000000
## Alumínio 0.5733378 0.000000
## Americana 0.1584573 0.179656
## Analândia 17.2638391 35.884618
## Industries of Transformation Electricity and Gas Water
## Águas De São Pedro 0.2853223 0 2.529952
## Alambari 1.4724295 0 0.000000
## Alumínio 0.9570404 0 0.000000
## Americana 1.9543887 0 1.705712
## Analândia 0.7115444 0 0.000000
## Construction Trade and Repair of Motor Vehicles Transport
## Águas De São Pedro 0.5999446 1.5094668 0.3390748
## Alambari 0.5897265 0.6969167 0.4999497
## Alumínio 1.1925098 1.3410731 1.3479578
## Americana 0.9999818 1.1410504 0.7133375
## Analândia 0.7480790 0.7414628 3.6642402
## Accommodation and Food Information and Communication
## Águas De São Pedro 3.0584894 0.4772964
## Alambari 0.9019195 0.2815003
## Alumínio 1.7668137 0.1423083
## Americana 0.9909324 0.4826950
## Analândia 1.9068360 0.3570884
## Financial Real Estate Activities Professional
## Águas De São Pedro 0.3292271 0.2219451 0.6100723
## Alambari 0.0000000 0.9817425 0.0000000
## Alumínio 0.2454017 0.1654351 0.3536869
## Americana 0.6011614 1.4631174 0.8454813
## Analândia 0.0000000 1.1069854 0.5071383
## Admin Activities and Complementary Services
## Águas De São Pedro 0.5857906
## Alambari 0.1151627
## Alumínio 0.3784222
## Americana 0.7891572
## Analândia 0.1947814
## Public Administration Education
## Águas De São Pedro 15.7221020 0.4654397
## Alambari 23.1814934 0.4575125
## Alumínio 11.7190576 1.9659531
## Americana 0.7361081 0.8891080
## Analândia 19.6041025 0.0000000
## Human Health and Social Services Arts
## Águas De São Pedro 0.4025340 0.8029801
## Alambari 0.0000000 0.0000000
## Alumínio 0.3000438 1.4963283
## Americana 0.7689421 1.0977883
## Analândia 0.3346167 0.0000000
## Other Service Activities
## Águas De São Pedro 0.4276605
## Alambari 0.7882064
## Alumínio 1.4344774
## Americana 0.7608756
## Analândia 0.3999417
## International and Extraterritorial Institutions
## Águas De São Pedro 0
## Alambari 0
## Alumínio 0
## Americana 0
## Analândia 0
corrplot.mixed(cor(brazil_lq),
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black",
tl.cex = 0.5,
number.cex = 0.7)
Agglomerative hierarchical clustering (AGNES) will be performed, to produce a set of nested clusters organised as a hierarchical tree. The resulting tree will be plotted as a dendrogram, and a selected number of clusters will be visualised geospatially.
brazil_lq_std <- normalize(brazil_lq)
summary(brazil_lq_std)
## Agriculture Extractive Industries Industries of Transformation
## Min. :0.000000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.006157 1st Qu.:0.00000 1st Qu.:0.1945
## Median :0.027794 Median :0.01434 Median :0.3012
## Mean :0.112160 Mean :0.06184 Mean :0.3377
## 3rd Qu.:0.115554 3rd Qu.:0.05508 3rd Qu.:0.4461
## Max. :1.000000 Max. :1.00000 Max. :1.0000
## Electricity and Gas Water Construction
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.04515 1st Qu.:0.3305
## Median :0.0000 Median :0.08571 Median :0.4534
## Mean :0.0141 Mean :0.10959 Mean :0.4525
## 3rd Qu.:0.0000 3rd Qu.:0.14406 3rd Qu.:0.5572
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## Trade and Repair of Motor Vehicles Transport Accommodation and Food
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.4163 1st Qu.:0.1805 1st Qu.:0.1494
## Median :0.4775 Median :0.2500 Median :0.1837
## Mean :0.4721 Mean :0.2945 Mean :0.2190
## 3rd Qu.:0.5477 3rd Qu.:0.3586 3rd Qu.:0.2420
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## Information and Communication Financial Real Estate Activities
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.02666 1st Qu.:0.09555 1st Qu.:0.1520
## Median :0.04099 Median :0.15225 Median :0.3126
## Mean :0.05542 Mean :0.17089 Mean :0.3168
## 3rd Qu.:0.05853 3rd Qu.:0.22384 3rd Qu.:0.4287
## Max. :1.00000 Max. :1.00000 Max. :1.0000
## Professional Admin Activities and Complementary Services
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1399 1st Qu.:0.1233
## Median :0.1978 Median :0.1961
## Mean :0.2156 Mean :0.2186
## 3rd Qu.:0.2683 3rd Qu.:0.2760
## Max. :1.0000 Max. :1.0000
## Public Administration Education Human Health and Social Services
## Min. :0.000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.007751 1st Qu.:0.3182 1st Qu.:0.2050
## Median :0.021014 Median :0.4567 Median :0.3136
## Mean :0.051148 Mean :0.4515 Mean :0.3344
## 3rd Qu.:0.053365 3rd Qu.:0.6018 3rd Qu.:0.4269
## Max. :1.000000 Max. :1.0000 Max. :1.0000
## Arts Other Service Activities
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3736 1st Qu.:0.2570
## Median :0.4925 Median :0.3419
## Mean :0.4674 Mean :0.3473
## 3rd Qu.:0.5985 3rd Qu.:0.4259
## Max. :1.0000 Max. :1.0000
## International and Extraterritorial Institutions
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.01048
## 3rd Qu.:0.00000
## Max. :1.00000
prox_matrix <- dist(brazil_lq_std, method = 'euclidean')
# Compute agglomerative coefficients of all hierarchical clustering algorithms
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(brazil_lq_std, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.7262692 0.6818087 0.7890747 0.9168273
hclust_ward <- hclust(prox_matrix, method = 'ward.D')
Plot the dendrogram (tree) that is constructed by the hierarchical clustering process
plot(as.dendrogram(hclust_ward),
ylab = "Height",
nodePar = list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.4, col = 1),
# Remove labels
leaflab = "none")
Interactive dendrogram
dendrogram <- as.dendrogram(hclust_ward)
plot_dendro(dendrogram) %>%
hide_legend() %>%
highlight(persistent = FALSE, dynamic = TRUE)
set.seed(12345)
gap_stat <- clusGap(brazil_lq_std, FUN = hcut, nstart = 25, K.max = 20, B = 50)
fviz_gap_stat(gap_stat)
Plot dendrogram with clusters highlighted for k=7
plot(as.dendrogram(hclust_ward),
ylab = "Height",
nodePar = list(lab.cex = 0.6, pch = c(NA, 19), cex = 0.4, col = 1),
# Remove labels
leaflab = "none")
rect.hclust(hclust_ward, k = 7, border = 2:5)
heatmaply(data.matrix(brazil_lq_std),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 7,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 6,
main="Geographic Segmentation of São Paulo Macrometropolis by Industry LQ",
xlab = "Industry LQ",
ylab = "Municipalities"
) %>%
layout(width = 800,
height = 600)
Based on the cluster heatmap, the industry specialisation clusters can be interpreted as such:
| Cluster | Colour | Description |
|---|---|---|
| 1 | Light Purple | Municipalities in this group specialise in International and Extraterritorial Institutions, Financial, Real Estate Activities, Professional industries or Human Health and Social Services |
| 2 | Dark Purple | Municipalities specialise in Accomodation and Food or Administrative Activities and Complementary Services. These municipalities also have relatively high concentration of the Arts industry |
| 3 | Turquoise | Municipalities in this group specialise in the Construction or Education industry |
| 4 | Green | Municipalities in this group specialise in Real Estate Activities or Industries of Transformation |
| 5 | Olive Green | Municipalities in this group specialise in Agriculture, while having relatively lower concentration for all other industries |
| 6 | Brown | Municipalities in this group specialise in Water, Transport, Education, Industries of Transformation or Other Service Activities. There are also several municipalities which do not have the presence of Human Health and Social Services and the Arts industries. |
| 7 | Pink | Municipalities in this group have industry specialisation in either Extractive Industries, Electricity and Gas, Information and Communication, Trade and Repair of Motor Vehicles, Accomodation and Food, or Public Administration or Arts |
# Derive cluster model with 7 clusters
groups <- as.factor(cutree(hclust_ward, k=7))
# Append cluster list object to sf data frame for visualisation
brazil_hcluster <- cbind(brazil, as.matrix(groups)) %>%
rename(Cluster = `as.matrix.groups.`) %>%
select(`Municipality`, `region`, `Cluster`)
tmap_mode('view')
tm_shape(brazil_hcluster) +
tm_fill(col = 'Cluster',
alpha = 0.9,
id = 'Municipality',
# customise pop-up tooltip information
popup.vars = c('Cluster :' = 'Cluster',
'Region :' = 'region')) +
tm_borders(lwd = 0.5,
alpha= 0.8) +
tm_view(set.bounds = TRUE,
set.zoom.limits = c(7.5, 11))
Based on the spatial mapping of the hierarchical clustering results, several insights can be gleaned.
Generally, municipalities with similar industry specialisations in São Paulo Macrometropolis tend to be located geographically close to each other in small groups of a few municipalities. These small groups can be separated geographically.
Municipalities at the coast of São Paulo Macrometropolis belong the same cluster (cluster 7: green areas on the map), indicating that they specialise in similar industries. The geographical location of these municipalities could have enabled them to specialise in similar niche industries.
Cluster 4 (red areas on the map) is mainly located in the north-west region of São Paulo Macrometropolis. These municipalities specialise in similar industries.
Cluster 2 (yellow areas on map) is mainly located at the Metropolitan Region of Vale do Paraíba e Litoral Norte, with one municipality at the north of the Regional Unit of Bragança Paulista city, and the rest in the Metropolitan Region of Sorocaba.
Cluster 5 (blue areas on map) is mainly located at the Piracicaba Urban Agglomeration.
Industry specialisation clusters in São Paulo Macrometropolis have been geographically segmented through hierarchical clustering.