1 Overview

1.1 Background Context

Brazil is the world's fifth-largest country by area and the sixth most populous. Brazil is classified as an upper-middle income economy by the World Bank and a developing country, with the largest share of global wealth in Latin America. It is considered an advanced emerging economy. It has the ninth largest GDP in the world by nominal, and eighth by PPP measures. Behind all this impressive figures,the spatial development of Brazil is highly unequal asshown in the figure on the right. In 2016, the GDP per capita of the poorest municipality is R$3,190.6. On the other hand, the GDP per capita of the richest municipality is R$314,638. Half of the municipalities have GDP per capita less than R$16,000 and the top 25% municipalities earn R$26,155 and above.

In recent year, our government has been encouraging local enterprises to venture into international market and one of the highly recommended country is Brazil (refer to https://www.enterprisesg.gov.sg/overseas-markets/north-latin-america/brazil/market-profile, https://www.enterprisesg.gov.sg/blog/latin-america-tech-landscape, and https://www.enterprisesg.gov.sg/blog/where-to-begin-in-latin-america). However, in view of the spatial development of Brazil that is highly unequal, any company interested to invest in Brazil must have a good understanding of the geographical specialisation and competitiveness of industry in the country.

1.2 Objectives

In this take-home exercise, it is required to segment São Paulo Macrometropolis at the municipality level into homogeneous industry specialization clusters by using data provided by Brazilian Institute of Geography and Statistics (IBGE) (https://www.ibge.gov.br/en/homeeng.html).

The specific tasks of the analysis are as follows:

  • Preparing a series of choropleth maps showing the distribution of spatial specialisation by industry type, 2016 at municipality level.

  • Delineating industry specialisation clusters by using hierarchical clustering method and display their distribution by using appropriate thematic mapping technique.

  • Delineating industry specialisation clusters by using spatially constrained clustering method and display their distribution by using appropriate thematic mapping technique

2 Import packages & Libraries

packages = c('rgdal', 'spdep', 'tmap', 'sf', 'ggpubr', 'cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych', 'tidyverse','geobr')
for (p in packages) {
  if(!require(p, character.only = T)){
  install.packages(p)
  }
  library(p,character.only = T)
}

3 Data Import

3.1 Aspatial Data


The csv files will be import using the read.csv function, using ; as a separator.

cities_data <- read.csv("data/aspatial/BRAZIL_CITIES.csv", sep = ";", encoding = "UTF-8") # add in encoding = UTF-8 to account for the portguese names that are in a different script as english
dict <- read.csv("data/aspatial/Data_Dictionary.csv", sep = ";", encoding = "UTF-8")

Take a brief look at the cities dataframe

glimpse(cities_data)
## Rows: 5,573
## Columns: 81
## $ CITY                   <chr> "Abadia De Goiás", "Abadia Dos Dourados", "A...
## $ STATE                  <chr> "GO", "MG", "GO", "MG", "PA", "CE", "BA", "B...
## $ CAPITAL                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ IBGE_RES_POP           <int> 6876, 6704, 15757, 22690, 141100, 10496, 831...
## $ IBGE_RES_POP_BRAS      <int> 6876, 6704, 15609, 22690, 141040, 10496, 831...
## $ IBGE_RES_POP_ESTR      <int> 0, 0, 148, 0, 60, 0, 0, 0, 0, 0, 0, 16, 17, ...
## $ IBGE_DU                <int> 2137, 2328, 4655, 7694, 31061, 2791, 2572, 4...
## $ IBGE_DU_URBAN          <int> 1546, 1481, 3233, 6667, 19057, 1251, 1193, 2...
## $ IBGE_DU_RURAL          <int> 591, 847, 1422, 1027, 12004, 1540, 1379, 195...
## $ IBGE_POP               <int> 5300, 4154, 10656, 18464, 82956, 4538, 3725,...
## $ IBGE_1                 <int> 69, 38, 139, 176, 1354, 98, 37, 167, 69, 12,...
## $ IBGE_1.4               <int> 318, 207, 650, 856, 5567, 323, 156, 733, 302...
## $ IBGE_5.9               <int> 438, 260, 894, 1233, 7618, 421, 263, 978, 37...
## $ IBGE_10.14             <int> 517, 351, 1087, 1539, 8905, 483, 277, 927, 4...
## $ IBGE_15.59             <int> 3542, 2709, 6896, 11979, 53516, 2631, 2319, ...
## $ IBGE_60.               <int> 416, 589, 990, 2681, 5996, 582, 673, 803, 81...
## $ IBGE_PLANTED_AREA      <int> 319, 4479, 10307, 1862, 25200, 2598, 895, 20...
## $ IBGE_CROP_PRODUCTION_. <int> 1843, 18017, 33085, 7502, 700872, 5234, 3999...
## $ IDHM.Ranking.2010      <int> 1689, 2207, 2202, 1994, 3530, 3522, 4086, 47...
## $ IDHM                   <dbl> 0.708, 0.690, 0.690, 0.698, 0.628, 0.628, 0....
## $ IDHM_Renda             <dbl> 0.687, 0.693, 0.671, 0.720, 0.579, 0.540, 0....
## $ IDHM_Longevidade       <dbl> 0.830, 0.839, 0.841, 0.848, 0.798, 0.748, 0....
## $ IDHM_Educacao          <dbl> 0.622, 0.563, 0.579, 0.556, 0.537, 0.612, 0....
## $ LONG                   <dbl> -49.44055, -47.39683, -48.71881, -45.44619, ...
## $ LAT                    <dbl> -16.758812, -18.487565, -16.182672, -19.1558...
## $ ALT                    <dbl> 893.60, 753.12, 1017.55, 644.74, 10.12, 403....
## $ PAY_TV                 <int> 360, 77, 227, 1230, 3389, 29, 952, 51, 55, 1...
## $ FIXED_PHONES           <int> 842, 296, 720, 1716, 1218, 34, 335, 222, 392...
## $ AREA                   <chr> "147.26", "881.06", "1,045.13", "1,817.07", ...
## $ REGIAO_TUR             <chr> "", "Caminhos Do Cerrado", "Região Turística...
## $ CATEGORIA_TUR          <chr> "", "D", "C", "D", "D", "", "D", "", "", "D"...
## $ ESTIMATED_POP          <int> 8583, 6972, 19614, 23223, 156292, 11663, 876...
## $ RURAL_URBAN            <chr> "Urbano", "Rural Adjacente", "Rural Adjacent...
## $ GVA_AGROPEC            <dbl> 6.20, 50524.57, 42.84, 113824.60, 140463.72,...
## $ GVA_INDUSTRY           <dbl> 27991.25, 25917.70, 16728.30, 31002.62, 5861...
## $ GVA_SERVICES           <dbl> 74750.32, 62689.23, 138198.58, 172.33, 46812...
## $ GVA_PUBLIC             <dbl> 36915.04, 28083.79, 63396.20, 86081.41, 4868...
## $ GVA_TOTAL              <dbl> 145857.60, 167215.28, 261161.91, 403241.27, ...
## $ TAXES                  <dbl> 20554.20, 12873.50, 26822.58, 26994.09, 9518...
## $ GDP                    <dbl> 166.41, 180.09, 287984.49, 430235.36, 124925...
## $ POP_GDP                <int> 8053, 7037, 18427, 23574, 151934, 11483, 921...
## $ GDP_CAPITA             <dbl> 20664.57, 25591.70, 15628.40, 18250.42, 8222...
## $ GVA_MAIN               <chr> "Demais serviços", "Demais serviços", "Demai...
## $ MUN_EXPENDIT           <dbl> 28227691, 17909274, 37513019, NA, NA, NA, NA...
## $ COMP_TOT               <int> 284, 476, 288, 621, 931, 86, 191, 87, 285, 6...
## $ COMP_A                 <int> 5, 6, 5, 18, 4, 1, 6, 2, 5, 2, 0, 8, 3, 1, 4...
## $ COMP_B                 <int> 1, 6, 9, 1, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0,...
## $ COMP_C                 <int> 56, 30, 26, 40, 43, 4, 8, 3, 20, 4, 9, 40, 3...
## $ COMP_D                 <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ COMP_E                 <int> 2, 2, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0,...
## $ COMP_F                 <int> 29, 34, 7, 20, 27, 6, 4, 0, 10, 2, 0, 25, 12...
## $ COMP_G                 <int> 110, 190, 117, 303, 500, 48, 97, 71, 133, 35...
## $ COMP_H                 <int> 26, 70, 12, 62, 16, 2, 5, 0, 18, 8, 1, 67, 3...
## $ COMP_I                 <int> 4, 28, 57, 30, 31, 10, 5, 1, 14, 3, 0, 25, 1...
## $ COMP_J                 <int> 5, 11, 2, 9, 6, 2, 3, 1, 8, 1, 1, 9, 5, 14, ...
## $ COMP_K                 <int> 0, 0, 1, 6, 1, 0, 1, 0, 0, 1, 0, 4, 3, 3, 0,...
## $ COMP_L                 <int> 2, 4, 0, 4, 1, 0, 0, 0, 4, 0, 0, 7, 4, 4, 0,...
## $ COMP_M                 <int> 10, 15, 7, 28, 22, 2, 5, 0, 11, 4, 2, 26, 17...
## $ COMP_N                 <int> 12, 29, 15, 27, 16, 3, 5, 1, 26, 0, 1, 16, 1...
## $ COMP_O                 <int> 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 6, 2, 4, 2, 4,...
## $ COMP_P                 <int> 6, 9, 11, 15, 155, 0, 8, 0, 8, 1, 6, 14, 11,...
## $ COMP_Q                 <int> 6, 14, 5, 19, 33, 2, 1, 2, 9, 3, 0, 13, 22, ...
## $ COMP_R                 <int> 1, 6, 1, 9, 15, 0, 2, 0, 4, 0, 0, 4, 6, 6, 0...
## $ COMP_S                 <int> 5, 19, 8, 27, 56, 4, 38, 4, 12, 3, 4, 23, 38...
## $ COMP_T                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COMP_U                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HOTELS                 <int> NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA...
## $ BEDS                   <int> NA, NA, 34, NA, NA, NA, 24, NA, NA, NA, NA, ...
## $ Pr_Agencies            <int> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0, 2...
## $ Pu_Agencies            <int> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1, 3...
## $ Pr_Bank                <int> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0, 2...
## $ Pu_Bank                <int> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1, 3...
## $ Pr_Assets              <dbl> NA, NA, 33724584, 44974716, 76181384, NA, NA...
## $ Pu_Assets              <dbl> NA, NA, 67091904, 371922572, 800078483, NA, ...
## $ Cars                   <int> 2158, 2227, 2838, 6928, 5277, 553, 896, 613,...
## $ Motorcycles            <int> 1246, 1142, 1426, 2953, 25661, 1674, 696, 15...
## $ Wheeled_tractor        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0,...
## $ UBER                   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ MAC                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ WAL.MART               <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ POST_OFFICES           <int> 1, 1, 3, 4, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,...

Extract out Geospatial Data for Sao Paulo Macrometropolis

unique(cities_data$STATE)
##  [1] "GO" "MG" "PA" "CE" "BA" "PR" "SC" "PE" "TO" "MA" "RN" "PI" "RS" "MT" "AC"
## [16] "SP" "ES" "AL" "PB" "MS" "RO" "RR" "AM" "AP" "SE" "RJ" "DF"

We can see that there are 27 state codes for each of Brazil's 27 states. We also notice that there is one for "SP" - which stands for Sao Paulo, the metropolitian region we are interested in.

Next, let us extract out the GIS data for the region that we are interested in.

sp_data <- cities_data %>% filter(STATE == "SP")
head(sp_data) # To get a look at the dataframe 

3.2 Geospatial Data


The required geospatial data will be downloaded using the geobr package. I will be using the read_municipality function to download the municipality level data and using read_metro_area to retrieve the metropolitian regions.

3.2.1 Sao Paulo Municipality Boundary Data

municipalities <-  read_municipality(year = 2016, simplified = TRUE, showProgress = FALSE) # add simplified = TRUE for faster download and plotting
## Using year 2016
## Loading data for the whole country. This might take a few minutes.

Extract out Geospatial Data for Sao Paulo Municipalities

unique(municipalities$abbrev_state)
##  [1] "RO" "AC" "AM" "RR" "PA" "AP" "TO" "MA" "PI" "CE" "RN" "PB" "PE" "AL" "SE"
## [16] "BA" "MG" "ES" "RJ" "SP" "PR" "SC" "RS" "MS" "MT" "GO" "DF"

Next, let us extract out the GIS data for the region that we are interested in.

saopaulo_muni <- municipalities %>% filter(abbrev_state == "SP")

Do a plot to examine whether the subsetted region is correct

plot(saopaulo_muni$geom)


Cross referencing to the given wikipedia source; this geographical boundary that was subsetted out is correct.

3.2.2 Sao Paulo Metropolitian Boundary Data

metro <- read_metro_area(year = 2016, simplified = TRUE, showProgress = FALSE)
## Using year 2016
saupaulo_metro <- metro %>% filter(abbrev_state == "SP")
plot(saupaulo_metro$geom)


The above metropolitian boundaries matches closely with that of the wikipedia page, with the exception of the Braganca Paulista city. Thus, we need to extract it from the municipality level data that we downloaded in the previous section.

3.2.3 Extract Municipalities from Braganca Paulista

From Wikipedia, we can retrieve the names of the municipalities enclosed in Braganca Paulista.

bp_names <- c("Atibaia",
              "Bom Jesus dos Perdões",
              "Bragança Paulista",
              "Itatiba",
              "Jarinu",
              "Joanópolis",
              "Morungaba",
              "Nazaré Paulista",
              "Piracaia",
              "Tuiuti",
              "Vargem"
              )

Next, we use it to subset out the relevant rows in the Sao Paulo municipality GIS data that we have.

bp_gis <- saopaulo_muni %>% filter(tolower(name_muni) %in% tolower(bp_names))
bp_gis


The extracted data has 11 features - which matches the length of the bp_names vector.

3.2.4 Merge the Sao Paulo Metropolitian Boundary with the Municipalities from Braganca Paulista

This will give us the boundaries for our required study area.

study_area <- bind_rows(saupaulo_metro, bp_gis)

Check for duplicates

duplicated(study_area$name_muni)
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [169]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE

The binding has introduced duplicates into our dataframe. Let us examine why.

dup_rows <- study_area[duplicated(study_area$name_muni),]
dup_names <- dup_rows$name_muni
study_area %>% filter(name_muni %in% dup_names)

This is likely due to the reshuffling of municipalities around, as we can see in the above data that 3 municipalities may have been shifted to the Braganca Paulista region. This is confirmed based on this source.

Drop duplicated columns

study_area <- study_area[!duplicated(study_area$name_muni),]

Check for duplicates & Do a plot to check if the boundaries are correct

duplicated(study_area$name_muni)
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE
plot(study_area$geom)

3.3 Join GeoSpatial & Aspatial Data together

We need to join the geospatial data and aspatial data together. However, the lack of the unique identifier municipality code in the aspatial data (which is present in the saopaulo GIS data) presents some challenges. Let us try the following approaches and see which one allows us to perform the join as intended.

3.3.1 Try inner_join

From previous inspection, this can be done by joining on municipality name and state code.

sp_inner <- study_area %>% inner_join(sp_data, by = c("abbrev_state" = "STATE", "name_muni" = "CITY"))

Evaluate whether the join works as expected

plot(sp_inner$geom)

print(nrow(study_area)-nrow(sp_inner))
## [1] 1


We can see above that 1 municipality is lost as a result of the inner join.

3.3.2 Examine which municipality was lost

gis_names <- study_area$name_muni
aspat_names <- sp_data$CITY
setdiff(gis_names, aspat_names)
## [1] "Santa Bárbara D'oeste"

Looking into the sp_data dataframe, we can see that this is because it is stored there as Santa Bárbara D'Oeste.

Let us rename the municipality and perform the join again.

sp_data <- within(sp_data, CITY[CITY =="Santa Bárbara D'Oeste"] <- "Santa Bárbara D'oeste")

Perform Inner Join

sp_fin <- study_area %>% inner_join(sp_data, by = c("abbrev_state" = "STATE", "name_muni" = "CITY"))
print(nrow(study_area)-nrow(sp_fin))
## [1] 0


The data is now prepared and ready for analysis.

4 Data Transformation

4.1 Select out relevant columns

selected <- c("name_muni",
              "IBGE_RES_POP", 
              "AREA",
              "GVA_AGROPEC",
              "GVA_INDUSTRY",
              "GVA_SERVICES",
              "GVA_PUBLIC",
              "COMP_A",
              "COMP_B",
              "COMP_C",
              "COMP_D",
              "COMP_E",
              "COMP_F",
              "COMP_G",
              "COMP_H",
              "COMP_I",
              "COMP_J",
              "COMP_K",
              "COMP_L",
              "COMP_M",
              "COMP_N",
              "COMP_O",
              "COMP_P",
              "COMP_Q",
              "COMP_R",
              "COMP_S",
              "COMP_T",
              "COMP_U",
              "geom"
              )
  • Leaving out Domestic Units since it is correlated to population; also leaving out planned population
  • Leaving out agricultural production and planted area as it is correlated to other agricultural metrics
  • Leaving out HDI related indices as it in itself is calculated by weighting the other variables that we have in the dataset
  • Leaving out Pay-TV, Phones, Lat-Long, Altitude, Tourism, Urban/Rural, Taxes, Expenditure as it is not as related to this study
  • Leaving out GDP and GDP/Capita as we are using GVA
  • Leaving out main GVA as we are keeping the broken down GVA by industry in the dataset.
  • Leaving out hotels, private banks, public banks, assets, uber, macdonalds, post offices as they would be accounted for in the variabes from COMP_A to COMP_U (which we will aggregate and rename later for easier reference)

Select needed columns

sp_fin_filtered <- sp_fin[,selected]

Rename columns for easier reference

new_names <- c("NAME_MUNI",
              "POPULATION", 
              "AREA",
              "GVA_AGROPEC",
              "GVA_INDUSTRY",
              "GVA_SERVICES",
              "GVA_PUBLIC",
              "COMP_AGRO",
              "COMP_EXTRACTIVE",
              "COMP_TRANSFORMATION",
              "COMP_GAS_ELECT",
              "COMP_WATER_SEWAGE",
              "COMP_CONSTRUCTION",
              "COMP_VEHICLE_REPAIR",
              "COMP_TRANSPORT_MAIL",
              "COMP_ACCOM_FOOD", 
              "COMP_INFOCOMM",
              "COMP_FINANCE",
              "COMP_REALESTATE",
              "COMP_SCIENCE_TECHNICAL",
              "COMP_ADMIN",
              "COMP_DEFENSE_SS",
              "COMP_EDUCATION",
              "COMP_HEALTH_SOCIAL_SERVICE",
              "COMP_ART_SPORT_RECRE",
              "COMP_OTHER_SERVICES",
              "COMP_DOMESTIC_SERVICES",
              "COMP_INTERNATIONAL_INSTI",
              "geom"
              )

colnames(sp_fin_filtered) <- new_names

4.2 Variable Aggregation

We will group the Variables from COMP_AGROPEC to COMP_INTERNATIONAL_INSTI into 4 categories - AGROPEC, INDUSTRY, SERVICES & PUBLIC; for consistency with the GVA Categories that we have. I used the definitions from this page to guide me through assigning the sectors to the main 4 categories.

sp_transformed_v1 <- sp_fin_filtered %>% 
  mutate(`COMP_AGROPEC` = `COMP_AGRO` + `COMP_EXTRACTIVE`) %>%
  mutate(`COMP_INDUSTRY` = `COMP_GAS_ELECT` + `COMP_CONSTRUCTION` + `COMP_TRANSFORMATION`) %>%
  mutate(`COMP_SERVICES` =  + `COMP_WATER_SEWAGE` + `COMP_VEHICLE_REPAIR` + `COMP_TRANSPORT_MAIL` + `COMP_INFOCOMM` + `COMP_FINANCE` + `COMP_REALESTATE` + `COMP_SCIENCE_TECHNICAL` + `COMP_ADMIN`+`COMP_ACCOM_FOOD` +`COMP_OTHER_SERVICES` + `COMP_DOMESTIC_SERVICES` + `COMP_INTERNATIONAL_INSTI`) %>%
  mutate(`COMP_PUBLIC` = `COMP_DEFENSE_SS` + `COMP_EDUCATION` + `COMP_HEALTH_SOCIAL_SERVICE` + `COMP_ART_SPORT_RECRE`) %>%
  select(NAME_MUNI, POPULATION, AREA, GVA_AGROPEC, GVA_INDUSTRY, GVA_SERVICES, GVA_PUBLIC, COMP_AGROPEC, COMP_INDUSTRY, COMP_SERVICES, COMP_PUBLIC, geom)

4.3 Variable Transformation

Next, we will transform the GVA variables by dividing with population (to account for productivity and industrial competitiveness), and transform the COMP variables by dividing by municipality area (to account for the fact that larger municipalities have more land & vice versa).

sp_transformed <- sp_transformed_v1 %>%
  mutate(`GVA_AGROPEC_PC` = GVA_AGROPEC*1000/POPULATION) %>%
  mutate(`GVA_INDUSTRY_PC` = GVA_INDUSTRY*1000/POPULATION) %>%
  mutate(`GVA_SERVICES_PC` = GVA_SERVICES*1000/POPULATION) %>%
  mutate(`GVA_PUBLIC_PC` = GVA_PUBLIC*1000/POPULATION) %>%
  mutate(AREA = gsub(",","",AREA)) %>% #important step since some values in the dataframe (with 4 digits); are separated by commas
  mutate(AREA = as.numeric(AREA)) %>%
  mutate(`COMP_AGROPEC_BA` = COMP_AGROPEC/AREA) %>% 
  mutate(`COMP_INDUSTRY_BA` = COMP_INDUSTRY/AREA) %>% 
  mutate(`COMP_SERVICES_BA` = COMP_SERVICES/AREA) %>% 
  mutate(`COMP_PUBLIC_BA` = COMP_PUBLIC/AREA) %>%
  select(NAME_MUNI, GVA_AGROPEC_PC, GVA_INDUSTRY_PC, GVA_SERVICES_PC, GVA_PUBLIC_PC, COMP_AGROPEC_BA, COMP_INDUSTRY_BA, COMP_SERVICES_BA, COMP_PUBLIC_BA, geom)

The data is now transformed. We will scale the variables later for the clustering section - but before that, let us get an understanding of the spatial and aspatial distribution of our 8 variables.

5 Exploratory Data Analysis - Thematic Mapping of Industrial Variables

5.1 Statistical Graphics

In this portion, we will be visualising our transformed variables using histograms. This allows us to get a rough understanding of the variable distribution.

gva_agp <- ggplot(data=sp_transformed, 
             aes(x= `GVA_AGROPEC_PC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

comp_agp <- ggplot(data=sp_transformed, 
             aes(x= `COMP_AGROPEC_BA`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gva_industry <- ggplot(data=sp_transformed, 
             aes(x= `GVA_INDUSTRY_PC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

comp_industry <- ggplot(data=sp_transformed, 
             aes(x= `COMP_INDUSTRY_BA`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gva_service <- ggplot(data=sp_transformed, 
             aes(x= `GVA_SERVICES_PC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

comp_service <- ggplot(data=sp_transformed, 
             aes(x= `COMP_SERVICES_BA`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gva_public <- ggplot(data=sp_transformed, 
             aes(x= `GVA_PUBLIC_PC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

comp_public <- ggplot(data=sp_transformed, 
             aes(x= `COMP_PUBLIC_BA`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

Visualise the Variable Distributions together

ggarrange(gva_agp, comp_agp, gva_industry, comp_industry, gva_service, comp_service, gva_public,comp_public,
          ncol = 2,
          nrow = 4)

  • From the above, we can see that GVA Per Capita is right skewed for AGROPEC, INDUSTRY & SERVICES, while it has no significant skew for the PUBLIC Industries
  • For the Companies per km2, we can see that it is generally right skewed across all the 4 major industrial groups.
  • In terms of magnitude, we can see that by GVA Per capita, the services industry generates the most revenue, followed by the industrial sector, agropecurary sector, and the public sector
  • In terms of magnitude for companies per km2, we can see that the services industry generally leads this, followed by the public industry, industrial, and lastly agropecurary.

5.2 Chloropleth Mapping

In this portion, we will be visualising our transformed variables using both chloropleths and box maps. Using boxmaps in addition to chloropleths are important as chloropleths can be easily distorted by the scale and breaks used. We are using chloropleths still as they allow us to see the difference in magnitude by the variable across the different Sao Paulo municipalities.

5.2.1 Create Box Map / Tmap Functions

boxbreaks <- function(v,mult=1.5) { 
  qv <- unname(quantile(v,na.rm = TRUE))
  iqr <- qv[4] - qv[2]  
  upfence <- qv[4] + mult * iqr  
  lofence <- qv[2] - mult * iqr
  
  # initialize break points vector  
  bb <- vector(mode="numeric",length=7)
  # logic for lower and upper fences
  if (lofence < qv[1]) {  
    # no lower outliers    
    bb[1] <- lofence    
    bb[2] <- floor(qv[1])  
  } else {    
    bb[2] <- lofence    
    bb[1] <- qv[1]
  }
  
  if(upfence > qv[5]) {
  # no upper outliers   
    bb[7] <- upfence   
    bb[6] <- ceiling(qv[5]) 
  } else {   
    bb[6] <- upfence  
    bb[7] <- qv[5]
  } 
  bb[3:5] <- qv[2:4]
  return(bb)
}

get.var <- function(vname,df) { 
  v <- df[vname] %>% st_set_geometry(NULL)
  v <- unname(v[,1])
  return(v)
}

boxmap <- function(vnam,df,titlesize,legx, legy, legtitle=NA,mtitle="Box Map",mult=1.5){  
  var  <- get.var(vnam,df)
  bb <- boxbreaks(var)
  tm_shape(df) +    
    tm_fill(vnam,
            title=legtitle,
            breaks=bb,
            palette="-RdBu", 
            labels = c("lower outlier", "< 25%", "25% - 50%", "50% - 75%","> 75%", "upper outlier")
            ) +
    tm_fill(vnam, 
            df %>% 
              filter(is.na(vnam)), 
            labels = c("NA")
            ) +
  tm_borders(alpha =0.2) +
  tm_layout(main.title = paste("Box Map of ",toString(vnam)," Numbers by Municipality"),
            main.title.size = titlesize,
            main.title.position =  "center",
            frame = FALSE,
            legend.height = legx,
            legend.width = legy
            )
}

bm_qtm <- function(vnam,df,titlesize,legx, legy) {
  tm <- qtm(df, vnam, main.title = paste("Chloropleth of ",toString(vnam)," Numbers by Municipality"), frame = FALSE, main.title.size = titlesize, main.title.position = "center")
  bm <- boxmap(vnam, df, titlesize, legx,legy)
  tmap_arrange(tm,bm, nrow = 2)
}

5.2.2 Agropecurary Sector

bm_qtm("GVA_AGROPEC_PC", sp_transformed, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

  • In terms of productivity (proxied by GVA Per Capita), we can see that only a few municipalities that are shaded darker in the chloropleth map. This narrative changes when we use the box map, where we can see that agricultural productivity is the highest in the southwestern region; coupled with some in the northen region.
bm_qtm("COMP_AGROPEC_BA", sp_transformed, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

  • When we look at the distribution of companies per km2, we can see that density of agropercurary companies is higher in the northen, eastern and western regions.
  • While the areas with higher density of agropercurary companies tend to have higher productivity, this is not the case for some of the municipalities. Examples would include the cluster of red municipalities in the box map for GVA_AGROPEC_PC in the south-western region - which are mostly lesser than 50th percentile if we look at boxmap for COMP_AGROPEC_BA. There are also pockets or red hues in the north-western region for GVA_AGROPEC_PC, which are matched by bluer hues in the boxmap for COMP_AGROPEC_BA.
  • The aforementioned regions may be useful for companies looking to venture in this industry - as they can exploit the high productivity (higher GVA Per Capita) and lower competition in the area (lower density).

5.2.3 Industrial Sector

bm_qtm("GVA_INDUSTRY_PC", sp_transformed, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

  • We can see large swaths in the central of the study area with a higher GVA_INDUSTRY_PC using the boxmap.
  • An anomaly that we could see is the lone municipality that is at the coast of the Sao Paulo region.
bm_qtm("COMP_INDUSTRY_BA", sp_transformed, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

  • We can see that there seems to be a clustering of industrial companies in the southern region of the Sao Paulo state. This area, while only above average in terms of GVA_INDUSTRY_PC, has the highest density of industrial companies.
  • The converse relationship holds in the north-western region, which is generally above average in terms of industry company density but fares much higher in terms of GVA per capita. This is a potential investment area that can be looked further into.

5.2.4 Services Sector

bm_qtm("GVA_SERVICES_PC", sp_transformed, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

  • The distribution here seem to resemble that of the GVA_INDUSTRY_PC boxmap. However, those municipalities which are upper outliers for industrial gva per capita seem to be only slightly above average in terms of services gva per capita & vice versa.
bm_qtm("COMP_SERVICES_BA", sp_transformed, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

  • This distribution of company density for the services industry seems to resemble the boxmap for COMP_INDUSTRY_BA.
  • Thus, the Southern municipalities seem to be a area with high density of service and industrial related companies - which corresponds to the fact that this region is the capital region for the State.

5.2.5 Public Sector

bm_qtm("GVA_PUBLIC_PC", sp_transformed, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

  • For Public industries GVA Per capita, we can see a much even distribution of the values across the municipalities. This is consistent with thte conclusions that we got from plotting the histograms in the previous section for the GVA_PUBLIC_PC variable.
bm_qtm("COMP_PUBLIC_BA", sp_transformed, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

  • In terms of company density, we can once again see that the southern region may also present clusters of public companies as well. This could be because the region in red corresponds to the capital area - which is usually where all the public/governmental companies are located.

5.3 Correlation Analysis

Next, we will examine how the variables are correlated to one another using a correlation plot. This is also done before clustering as we want to ensure that the cluster variables are not highly correlated.

sp.cor = cor(as.data.frame(sp_transformed[2:9])[,1:8])
corrplot.mixed(sp.cor,
         lower = "ellipse", 
         upper = "number",
         tl.pos = "lt",
         diag = "l",
         tl.col = "black",
         tl.cex = 0.75
         )

  • We can see that COMP_PUBLIC_BA is highly correlated with COMP_INDUSTRY_BA and COMP_SERVICES_BA. We will remove COMP_PUBLIC_BA when we do the clustering analysis.
  • We also see that COMP_INDUSTRY_BA and COMP_SERVICES_BA are highly correlated. This suggests only one should be kept for the cluster analysis.

6 Hierarchial Clustering Analysis

6.1 Extract Clustering Variables

cluster_df <- sp_transformed %>%
  st_set_geometry(NULL) %>%
  select("NAME_MUNI", "GVA_AGROPEC_PC","GVA_INDUSTRY_PC", "GVA_SERVICES_PC", "GVA_PUBLIC_PC", "COMP_AGROPEC_BA" ,"COMP_SERVICES_BA")

head(cluster_df,10)

We removed COMP_INDUSTRY_BA and COMP_PUBLIC_BA due to their high correlation with COMP_SERVICES_BA.

Next, we rename the rows by municipality name

row.names(cluster_df) <- cluster_df$"NAME_MUNI"
head(cluster_df,10)

Afterwards, we drop the NAME_MUNI column

cluster_df <- select(cluster_df, c(2:7))
head(cluster_df,10)

6.2 Standardise Data

This step is important as the variables that we have selected for are different in terms of units and thus range. We will use Min-Max standardisation as we cannot assume that the variables come from a normal distribution.

cluster_df.std <- normalize(cluster_df)
summary(cluster_df.std)
##  GVA_AGROPEC_PC      GVA_INDUSTRY_PC    GVA_SERVICES_PC   GVA_PUBLIC_PC   
##  Min.   :0.0000000   Min.   :0.000000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0002927   1st Qu.:0.006709   1st Qu.:0.04940   1st Qu.:0.3210  
##  Median :0.0102740   Median :0.026104   Median :0.08954   Median :0.3561  
##  Mean   :0.0401360   Mean   :0.052581   Mean   :0.12748   Mean   :0.3334  
##  3rd Qu.:0.0419550   3rd Qu.:0.060139   3rd Qu.:0.15174   3rd Qu.:0.4039  
##  Max.   :1.0000000   Max.   :1.000000   Max.   :1.00000   Max.   :1.0000  
##  COMP_AGROPEC_BA   COMP_SERVICES_BA  
##  Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.02892   1st Qu.:0.002910  
##  Median :0.05792   Median :0.008356  
##  Mean   :0.10246   Mean   :0.041472  
##  3rd Qu.:0.11148   3rd Qu.:0.028325  
##  Max.   :1.00000   Max.   :1.000000

6.3 Compute Proximity Matrix

This chunk computes the euclidean distance between any 2 municipalities using the input variables that we have.

proxmat <- dist(cluster_df.std, method = "euclidean")

6.4 Select Optimal Method for Hierarchial Clustering

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(cluster_df.std, method = x)$ac
}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.9294905 0.9095148 0.9413205 0.9610303

From the above, we can see the Ward's method provides the strongest clustering structure. Let us use this method for hierarchial clustering of the municipalities.

6.5 Hierarchial Clustering

hclust_ward <- hclust(proxmat, method = 'ward.D')

Let us visualise the tree

plot(hclust_ward, cex = 0.3)

6.6 Select Optimal Number of Clusters

We will use the gap statistic method which compares total intra-cluster variation for different number of clusters used; with comparison to the expected values under null reference data distriubtion. We want to select the cluster number that maximises the gap statistic.

set.seed(12345)
gap_stat <- clusGap(cluster_df.std, FUN = hcut, nstart = 20, K.max = 10, B =50)
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = cluster_df.std, FUNcluster = hcut, K.max = 10, B = 50,     nstart = 20)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 1
##           logW   E.logW      gap     SE.sim
##  [1,] 2.765606 3.813811 1.048205 0.01634477
##  [2,] 2.645646 3.666258 1.020612 0.01835091
##  [3,] 2.451175 3.597884 1.146709 0.01560244
##  [4,] 2.351573 3.539409 1.187836 0.01549761
##  [5,] 2.299680 3.489840 1.190160 0.01489049
##  [6,] 2.248246 3.447668 1.199422 0.01426494
##  [7,] 2.204067 3.409547 1.205480 0.01378066
##  [8,] 2.165894 3.374018 1.208124 0.01388462
##  [9,] 2.076070 3.341144 1.265074 0.01431569
## [10,] 2.011010 3.310141 1.299131 0.01425596

Visualise it using a plot

fviz_gap_stat(gap_stat)

From the above plot, we can see that the Gap Statistic seems to be increasing as k increases. While theoratically it means we can pick a large number for k, this is not a good practice as it will result in lesser than meaningful clusters. Using this source, we will instead pick the first cluster number i where the gradient from cluster i-1 to i is steeper than the gradient from i to i+1. This means we pick k=3 for this context.

6.7 Interpret Dendograms

plot(hclust_ward, cex = 0.4)
rect.hclust(hclust_ward, k = 3, border = 2:5)

6.8 Visually-driven Hierarchical Clustering Analysis

6.8.1 Transform df to matrix

cluster_df_mat <- data.matrix(cluster_df.std)

6.8.2 Plot interactive Cluster Heat Map

heatmaply(cluster_df_mat,
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 3,
          margins = c(NA,200,60,NA),
          fontsize_row = 2,
          fontsize_col = 5,
          main="Geographic Segmentation of Sao Paulo by Industrial indicators",
          xlab = "Industrial Indicators",
          ylab = "Municipalities of Sao Paulo State"
          )

6.9 Map Clusters

groups <- as.factor(cutree(hclust_ward, k=3))

Map back to original simple feature data frame

sp_transformed_clust <- cbind(sp_transformed, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)

Visualise it on a chloropleth

qtm(sp_transformed_clust, "CLUSTER")


The above shows us pretty fragmented clusters - difficult for further analysis.

6.10 Further Analysing Clusters

For this section, we will be looking at the outputs from the hierarchial clustering to understand some differences that are present between the 3 generated clusters. This will be done by plotting them on bar charts.

Create the bar charts

agrp_pc <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`GVA_AGROPEC_PC`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
indust_pc <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`GVA_INDUSTRY_PC`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
services_pc <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`GVA_SERVICES_PC`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
public_pc <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`GVA_PUBLIC_PC`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
comp_agro_ba <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`COMP_AGROPEC_BA`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
comp_serv_ba <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`COMP_SERVICES_BA`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")

Arrange them together

ggarrange(agrp_pc, indust_pc, services_pc, public_pc, comp_agro_ba, comp_serv_ba, nrow = 2, ncol =3)

  • From the above, we can see that those municipalites in Cluster 1 are characterised as those with the highest GVA Per Capita for Industrial, Services and Public Sector. They also have the highest density of AGROPEC and Services companies per km2.
  • For municipalities in Cluster 2, they are characterised as having the highest GVA Per Capita for Agropecuary sector. They also have the second highest GVA Per Capita for Public industries, and relatively low mean metrics for the other 4 variables.
  • For municipalities in Cluster 3, the striking point about them is that they have very very low GVA Per Capita for Public sector. The other 5 variables tend to be low for them.

7 Spatially Constrained Clustering

To remedy the fragmented clusters of the hierachial clustering method, we will try the spatially constrained clustering method using the SKATER approach.

7.1 Convert into SpatialPolygonsDataFrame

sp_transformed_sp <- as_Spatial(sp_transformed)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition

Change CRS to EPSG 4618. Source

sp_transformed_sp <- spTransform(sp_transformed_sp, CRS("+init=epsg:4618"))
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum South_American_Datum_1969 in CRS definition

7.2 Compute Neighbour List

sp.nb <- poly2nb(sp_transformed_sp)
summary(sp.nb)
## Neighbour list object:
## Number of regions: 172 
## Number of nonzero links: 872 
## Percentage nonzero weights: 2.947539 
## Average number of links: 5.069767 
## 1 region with no links:
## 100
## Link number distribution:
## 
##  0  1  2  3  4  5  6  7  8  9 10 22 
##  1  4 14 21 35 31 29 17  9  5  5  1 
## 4 least connected regions:
## 8 9 29 36 with 1 link
## 1 most connected region:
## 161 with 22 links

7.2.1 Visualise the neighbours list

plot(sp_transformed_sp, border=grey(.5))
plot(sp.nb, coordinates(sp_transformed_sp), col="blue", add=TRUE)


We can see the municipality that has the most number of neighbours is the large one in the South Region. In addition, one municipality (in the south east) is not connected to any other municipalities, due to the fact that it is probably an island off the coast of the main land.

7.2.2 Artificially add in neighbour for the island municipality

We first find out which municipality is the island

cnt <- card(sp.nb)
island_ind <- match(0, cnt)
print(island_ind)
## [1] 100

The island municipality's index is 100 in sp.nb.

Now let us add in a neighbour for this municipality using the knearneigh function of the spdep package.

nearest <- knearneigh(coordinates(sp_transformed_sp))$nn
sp.nb[[island_ind]] <- nearest[island_ind]

# We need to add this island to the neighbours of its connected municipality as well
sp.nb[[nearest[island_ind]]] <- c(as.integer(100), as.integer(sp.nb[[nearest[island_ind]]]))

7.2.3 Visualise Final Neighbours List

plot(sp_transformed_sp, border=grey(.5))
plot(sp.nb, coordinates(sp_transformed_sp), col="blue", add=TRUE)

The island municipality now has a neighbour, and we can proceed to compute the edge costs.

7.3 Compute Minimum Spanning Tree

7.3.1 Find edge values (distance)

lcosts <- nbcosts(sp.nb, cluster_df.std)
sp.w <- nb2listw(sp.nb, lcosts, style="B")
summary(sp.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 172 
## Number of nonzero links: 874 
## Percentage nonzero weights: 2.9543 
## Average number of links: 5.081395 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 22 
##  5 14 21 35 30 30 17  9  5  5  1 
## 5 least connected regions:
## 8 9 29 36 100 with 1 link
## 1 most connected region:
## 161 with 22 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn       S0       S1       S2
## B 172 29584 293.1677 304.1517 3442.263

7.3.2 Compute MST

sp.mst <- mstree(sp.w)

Check for dimensions of MST

dim(sp.mst)
## [1] 171   3

This has 171 edges, which is equal to 172-1. Hence the mst is correct.

Plot MST

plot(sp_transformed_sp, border=gray(.5))
plot.mst(sp.mst, coordinates(sp_transformed_sp), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)

7.4 Compute Spatially Constrained Clusters using SKATER Approach

clust3 <- skater(sp.mst[,1:2], cluster_df.std, method = "euclidean", 2)

Check cluster assignment

ccs3 <- clust3$groups
ccs3
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 2 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1
table(ccs3)
## ccs3
##   1   2   3 
## 165   6   1

Visualise Clusters

plot(sp_transformed_sp, border=gray(.5))
plot(clust3, coordinates(sp_transformed_sp), cex.lab=.4,
     groups.colors=c("red","green","blue"), cex.circles=0.005, add=TRUE)
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

7.5 Visualise Clusters using Chloropleth Map

groups_mat <- as.matrix(clust3$groups)
sp_transformed_sp_spatialclust <- cbind(sp_transformed_clust, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(sp_transformed_sp_spatialclust, "SP_CLUSTER")

  • We can see that the spatially constrained method gives us more homogenous clusters. The municipalities in cluster 2 also correspond to the areas with the highest density of industrial and services companies as seen in our previous EDA.

Let us compare the Spatially Constrained vs Hierarchial Clustering Methods.

hclust.map <- qtm(sp_transformed_clust,
                  "CLUSTER", legend.outside = T) + 
  tm_borders(alpha = 0.5) 

spclust.map <- qtm(sp_transformed_sp_spatialclust,
                   "SP_CLUSTER", legend.outside = T) + 
  tm_borders(alpha = 0.5) 

tmap_arrange(hclust.map, spclust.map,
             asp=NA, nrow=2)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

8 Limitations

  • One limitation that is significant is the fact that the breakdown of the number of companies to the 4 broad categories (Agropercuary, industrial, service, public) is done based on my understanding of the definitions. However, there could be some companies encompassed within a category (e.g. Art_Culture_Sport), that may belong to different categories.
  • The clusters formed by the spatially constrained method seems to be rather skewed - one cluster is extremely large in number while the others are relatively small. This limits the analysis potential.