TakeHomeEx03-PengSzuHua
Background
The government has been encouraging local enterprises to venture into international market and one of the highly recommended country is Brazil.However, in view of the spatial development of Brazil that is highly unequal, any company interested to invest in Brazil must have a good understanding of the geographical specialization and competitiveness of industry in the country.
The aim of the analysis is to segment São Paulo Macrometropolis at the municipality level into homogeneous industry specialization clusters by using data provided by Brazilian Institute of Geography and Statistics.
Packages Installation
Install the following packages for this study.
packages = c('rgdal', 'maptools', 'raster','spatstat', 'tmap','sf', 'dplyr', 'tidyverse','ggplot2','spdep','tidyselect','gganimate','cowplot','RColorBrewer','plotly','sp','FRK','animation','ggpubr','geobr','cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}## Loading required package: rgdal
## Loading required package: sp
## rgdal: version: 1.5-17, (SVN revision 1070)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.0.4, released 2020/01/28
## Path to GDAL shared files: C:/Users/sihua/OneDrive/Documents/R/win-library/4.0/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ runtime: Rel. 6.3.1, February 10th, 2020, [PJ_VERSION: 631]
## Path to PROJ shared files: C:/Users/sihua/OneDrive/Documents/R/win-library/4.0/rgdal/proj
## Linking to sp version:1.4-4
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
## Loading required package: maptools
## Checking rgeos availability: TRUE
## Loading required package: raster
## Loading required package: spatstat
## Loading required package: spatstat.data
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:raster':
##
## getData
## Loading required package: rpart
##
## spatstat 1.64-1 (nickname: 'Help you I can, yes!')
## For an introduction to spatstat, type 'beginner'
##
## Note: spatstat version 1.64-1 is out of date by more than 6 months; we recommend upgrading to the latest version.
##
## Attaching package: 'spatstat'
## The following objects are masked from 'package:raster':
##
## area, rotate, shift
## Loading required package: tmap
## Loading required package: sf
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
##
## collapse
## The following objects are masked from 'package:raster':
##
## intersect, select, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tidyverse
## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v stringr 1.4.0
## v tidyr 1.1.2 v forcats 0.5.0
## v readr 1.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::collapse() masks nlme::collapse()
## x tidyr::extract() masks raster::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks raster::select()
## Loading required package: spdep
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
## Loading required package: tidyselect
## Loading required package: gganimate
##
## Attaching package: 'gganimate'
## The following object is masked from 'package:raster':
##
## animate
## Loading required package: cowplot
## Loading required package: RColorBrewer
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:raster':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: FRK
##
## Attaching package: 'FRK'
## The following object is masked from 'package:raster':
##
## distance
## Loading required package: animation
## Loading required package: ggpubr
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:cowplot':
##
## get_legend
## The following objects are masked from 'package:spatstat':
##
## border, rotate
## The following object is masked from 'package:raster':
##
## rotate
## Loading required package: geobr
## Warning: package 'geobr' was built under R version 4.0.3
## Loading required package: cluster
##
## Attaching package: 'cluster'
## The following object is masked from 'package:spatstat':
##
## volume
## Loading required package: factoextra
## Warning: package 'factoextra' was built under R version 4.0.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Loading required package: NbClust
## Warning: package 'NbClust' was built under R version 4.0.3
## Loading required package: heatmaply
## Loading required package: viridis
## Loading required package: viridisLite
##
## ======================
## Welcome to heatmaply version 1.1.1
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================
## Loading required package: corrplot
## corrplot 0.84 loaded
## Loading required package: psych
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following objects are masked from 'package:spatstat':
##
## reflect, rescale
Data Preparation
This section is to prepare the data in order to perform analysis.
Data Sources
This study will be using the following datasets:
- BRAZIL_CITIES.csv
- Data_Dictionary.csv
Data Import
we will be using read_csv() function to read in the BRAZIL_CITIES.csv and Data_Dictionary.csv datasets.
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## CITY = col_character(),
## STATE = col_character(),
## AREA = col_number(),
## REGIAO_TUR = col_character(),
## CATEGORIA_TUR = col_character(),
## RURAL_URBAN = col_character(),
## GVA_MAIN = col_character()
## )
## i Use `spec()` for the full column specifications.
It also required for us to extract 2016 municipality boundary file and the metropolitan boundary file by using geobr () function
## Using year 2016
## Loading data for the whole country. This might take a few minutes.
##
|
| | 0%
|
|=== | 4%
|
|===== | 7%
|
|======== | 11%
|
|========== | 15%
|
|============= | 19%
|
|================ | 22%
|
|================== | 26%
|
|===================== | 30%
|
|======================= | 33%
|
|========================== | 37%
|
|============================= | 41%
|
|=============================== | 44%
|
|================================== | 48%
|
|==================================== | 52%
|
|======================================= | 56%
|
|========================================= | 59%
|
|============================================ | 63%
|
|=============================================== | 67%
|
|================================================= | 70%
|
|==================================================== | 74%
|
|====================================================== | 78%
|
|========================================================= | 81%
|
|============================================================ | 85%
|
|============================================================== | 89%
|
|================================================================= | 93%
|
|=================================================================== | 96%
|
|======================================================================| 100%
## Using year 2016
##
|
| | 0%
|
|======================================================================| 100%
##
Downloading: 1.6 kB
Downloading: 1.6 kB
Downloading: 10 kB
Downloading: 10 kB
Downloading: 10 kB
Downloading: 10 kB
Downloading: 26 kB
Downloading: 26 kB
Downloading: 34 kB
Downloading: 34 kB
Downloading: 51 kB
Downloading: 51 kB
Downloading: 67 kB
Downloading: 67 kB
Downloading: 83 kB
Downloading: 83 kB
Downloading: 91 kB
Downloading: 91 kB
Downloading: 99 kB
Downloading: 99 kB
Downloading: 120 kB
Downloading: 120 kB
Downloading: 130 kB
Downloading: 130 kB
Downloading: 150 kB
Downloading: 150 kB
Downloading: 160 kB
Downloading: 160 kB
Downloading: 160 kB
Downloading: 160 kB
Downloading: 180 kB
Downloading: 180 kB
Downloading: 200 kB
Downloading: 200 kB
Downloading: 210 kB
Downloading: 210 kB
Downloading: 230 kB
Downloading: 230 kB
Downloading: 230 kB
Downloading: 230 kB
Downloading: 240 kB
Downloading: 240 kB
Downloading: 260 kB
Downloading: 260 kB
Downloading: 280 kB
Downloading: 280 kB
Downloading: 290 kB
Downloading: 290 kB
Downloading: 300 kB
Downloading: 300 kB
Downloading: 310 kB
Downloading: 310 kB
Downloading: 330 kB
Downloading: 330 kB
Downloading: 340 kB
Downloading: 340 kB
Downloading: 360 kB
Downloading: 360 kB
Downloading: 370 kB
Downloading: 370 kB
Downloading: 380 kB
Downloading: 380 kB
Downloading: 390 kB
Downloading: 390 kB
Downloading: 410 kB
Downloading: 410 kB
Downloading: 420 kB
Downloading: 420 kB
Downloading: 440 kB
Downloading: 440 kB
Downloading: 460 kB
Downloading: 460 kB
Downloading: 470 kB
Downloading: 470 kB
Downloading: 470 kB
Downloading: 470 kB
Downloading: 480 kB
Downloading: 480 kB
Downloading: 500 kB
Downloading: 500 kB
Downloading: 510 kB
Downloading: 510 kB
Downloading: 530 kB
Downloading: 530 kB
Downloading: 540 kB
Downloading: 540 kB
Downloading: 560 kB
Downloading: 560 kB
Downloading: 570 kB
Downloading: 570 kB
Downloading: 570 kB
Downloading: 570 kB
Downloading: 580 kB
Downloading: 580 kB
Downloading: 600 kB
Downloading: 600 kB
Downloading: 620 kB
Downloading: 620 kB
Downloading: 630 kB
Downloading: 630 kB
Downloading: 650 kB
Downloading: 650 kB
Downloading: 660 kB
Downloading: 660 kB
Downloading: 660 kB
Downloading: 660 kB
Downloading: 660 kB
Downloading: 660 kB
Downloading: 680 kB
Downloading: 680 kB
Downloading: 700 kB
Downloading: 700 kB
Downloading: 710 kB
Downloading: 710 kB
Downloading: 730 kB
Downloading: 730 kB
Downloading: 750 kB
Downloading: 750 kB
Downloading: 760 kB
Downloading: 760 kB
Downloading: 780 kB
Downloading: 780 kB
Downloading: 790 kB
Downloading: 790 kB
Downloading: 810 kB
Downloading: 810 kB
Downloading: 830 kB
Downloading: 830 kB
Downloading: 840 kB
Downloading: 840 kB
Downloading: 850 kB
Downloading: 850 kB
Downloading: 850 kB
Downloading: 850 kB
Downloading: 850 kB
Downloading: 850 kB
Downloading: 870 kB
Downloading: 870 kB
Downloading: 880 kB
Downloading: 880 kB
Downloading: 900 kB
Downloading: 900 kB
Downloading: 920 kB
Downloading: 920 kB
Downloading: 930 kB
Downloading: 930 kB
Downloading: 940 kB
Downloading: 940 kB
Downloading: 940 kB
Downloading: 940 kB
Downloading: 940 kB
Downloading: 940 kB
Downloading: 940 kB
Downloading: 940 kB
Downloading: 960 kB
Downloading: 960 kB
Downloading: 970 kB
Downloading: 970 kB
Downloading: 990 kB
Downloading: 990 kB
Downloading: 1 MB
Downloading: 1 MB
Downloading: 1 MB
Downloading: 1 MB
Downloading: 1 MB
Downloading: 1 MB
Downloading: 1 MB
Downloading: 1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.1 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.2 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.3 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.4 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.5 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.6 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.7 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.8 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 1.9 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.1 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.2 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.3 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.4 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.5 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.6 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.7 MB
Downloading: 2.8 MB
Downloading: 2.8 MB
Downloading: 2.8 MB
Downloading: 2.8 MB
Data Wrangling
As in the metropolitan boundary file, there are missing area on the Regional Unit of Bragança Paulista city.To retrieve these area, it can be found in the municipality boundary file. The code below is to select the abbrev_state as SP (sao paulo) and the areas that is in the Regional Unit of Bragança Paulista city.
Regional Unit of Bragança Paulista city contains:
- Atibaia
- Bom Jesus Dos Perdões
- Bragança Paulista
- Joanópolis
- Nazaré Paulista
- Pedra Bela
- Pinhalzinho
- Piracaia
- Tuiuti
- Vargem
And also, for later rbind purpose name_metro column is created to indicate the above 10 areas are belongs to Regional Unit of Bragança Paulista city.
muni_state <- muni %>%
filter(abbrev_state =="SP" & name_muni %in% c('Atibaia',
'Bom Jesus Dos Perdões',
'Bragança Paulista',
'Joanópolis',
'Nazaré Paulista',
'Pedra Bela',
'Pinhalzinho',
'Piracaia',
'Tuiuti',
'Vargem'))%>%
mutate (name_metro ="Regional Unit of Bragança Paulista city")To extract the Sao Paulo area by filter the abbrev_state as “SP” and select reqired columns.
metro_state <- metro %>%
filter(abbrev_state =="SP")%>%
select(code_muni,name_muni,code_state,abbrev_state,name_metro,geom)To combind the metropolitan and municipality to get the São Paulo Macrometropolis area.
For the analysis purpose, it is required to extract the city and industry related information from the brazil cities file.
To combind the list of location in São Paulo Macrometropolis with the industry related data by using left join function [ join by the city name which is name_muni from the mapz and city column in the brazil_cities dataframe].
To compute the Location Quotients (LQ) for each industry type by using the below formula
- (Number of industry a in municipality [i] / Sum of all industry in municipality [i])/(Number of industry a in Greater Sao Paulo/Sum of all industry in Greater Sao Paulo)
To compute for 20 indsutry, funs function will be used and to compute LQ also need to ensure there is no NA values. Hence, in the code chunk below, NA values is repleced by 0 instead.
mapz_final <- mapz_join %>%
mutate_all(~replace(., is.na(.), 0))%>%
mutate_at(vars(contains("COMP")),funs(INDUSTRYLQ = (. / COMP_TOT) / (sum(.)/ sum(COMP_TOT))))%>%
select (name_muni,name_metro,contains(c("COMP")),contains(c("INDUSTRYLQ")))## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
The code below is the transform the data into SP format in order to plot the map later.
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition
Exploratory Data Analysis (EDA)
EDA statistical graphics
The code below is to plot the distribution of the variables by using Histogram as it is easy for visualization and to identify the overall distribution of the data values.
ggplot(data= mapz_final %>%
st_set_geometry(NULL)%>%
as.data.frame() %>%
select(contains("LQ"),-"COMP_TOT_INDUSTRYLQ") %>%
normalize()%>%
gather("Industry","Num_Company",contains("LQ")),
aes_string(x = "Num_Company"))+
facet_wrap("Industry") +
geom_histogram(bins=10,
color="black",
fill="light blue")## Warning in min(x[ss_no_NA]): no non-missing arguments to min; returning Inf
## Warning in max(x[ss_no_NA]): no non-missing arguments to max; returning -Inf
## Warning in min(x[ss_no_NA]): no non-missing arguments to min; returning Inf
## Warning: Removed 206 rows containing non-finite values (stat_bin).
From the histograms above we can see that most of the variable does not form a normal distribution and most of them are right skew.
EDA using choropleth Map
The EDA will be using the choropleth map to show Location Quotients by Industry Type.
The code below is to create a function to generate the choropleth map on the location quotients by industry type. The input variable for this function is the indutry type by location quotients.
GenerateDistributionIndustry <- function(industry)
{
industrydistributionmap <- tm_shape(mapz_final_sp)+
tm_borders(alpha=0.5)+
tm_layout(main.title=industry,main.title.position = "center",legend.text.size = 0.6)+
tm_fill(industry)
tmap_arrange(industrydistributionmap)
}The code below is to store the LQ related columns into a list so that I can leverage on the For Loop to loop through all the 20 industry types and call the GenerateDistributionIndustry() functions to create the choropleth map for the individual industry.
Note: it start at column 2 to skip “CITY” column and i-1 to advoid the “geom” column in the mapz_final.
industrylist <- colnames(mapz_final %>%
select (contains(c("INDUSTRYLQ")),-"COMP_TOT_INDUSTRYLQ",-"geom"))
LQlist<-list()
for(i in 2:length(industrylist))
{
LQlist<- c(LQlist,(assign(paste("industry", i-1, sep = ""), GenerateDistributionIndustry(industrylist[[i-1]]))))
}For visualizing purpose, the code below is to put all the 20 choropleth maps into animated form.
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output
From the above EDA, it shows that
The below 3 indsutry are well distributed across the São Paulo Macrometropolis area:
- Industry G - Trade; repair of motor vehicles and motorcycles
- Industry F - Constructionevenly
- Indsutry Q - Human health and social services
And also from the map, it also shows that Industry T- Domestic services does not exist in the São Paulo Macrometropolis. Hence, the following analysis we can exclude Domestic services industry.
EDA Correlation Analysis
Before perform cluster analysis, it is important to ensure that the cluster variables are not highly correlated.
NOTE :The following code is to exclude Domestic services industry before the correlation analysis.
mapz_cluster <- as.data.frame(mapz_final)%>%
mutate_all(~replace(., is.na(.), 0)) %>%
select(name_muni,contains("INDUSTRYLQ"), -"COMP_T_INDUSTRYLQ",-"COMP_TOT_INDUSTRYLQ")The code chunk below will be using corrplot.mixed() function of corrplot package to visualise and analyse the correlation of the input variables.
cluster_vars.cor = cor(mapz_cluster[,2:21])
corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black") From the above correlation plot above shows that there are no variables are highly correlated. Thus, this suggest that none of the variable need to be removed from the cluster analysis.
Hierarchy Cluster Analysis
Extrating clustering variables
The below code chunk is the make the name_muni as the row name in order to plot the hierarchy cluster analysis and also to extract the clustering variables from the maps_cluster.
Data Standardisation
To avoid the cluster analysis result is baised to clustering variables with large values, it is useful to standardise the input variables before performing cluster analysis.
In standardisation there are 2 methods, 1) min-max and 2) Z-score. It is important to note that Z-score standardisation method should only be used if the analysis assume all variables come from some normal distribution.
In the EDA statistical graphics we can see that most of the variables are not normal distribution. Therefore Min-Max standardisation method will be used instead of z-score method.
Min-Max standardisation
To standardise, normalize() of heatmaply package is used to stadardisation the clustering variables by using Min-Max method.
The summary() is then used to display the summary statistics of the standardised clustering variables.
## COMP_A_INDUSTRYLQ COMP_B_INDUSTRYLQ COMP_C_INDUSTRYLQ COMP_D_INDUSTRYLQ
## Min. :0.000000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.005384 1st Qu.:0.00000 1st Qu.:0.1787 1st Qu.:0.00000
## Median :0.027261 Median :0.01412 Median :0.2839 Median :0.00000
## Mean :0.107719 Mean :0.06464 Mean :0.3165 Mean :0.01851
## 3rd Qu.:0.114746 3rd Qu.:0.05951 3rd Qu.:0.4210 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## COMP_E_INDUSTRYLQ COMP_F_INDUSTRYLQ COMP_G_INDUSTRYLQ COMP_H_INDUSTRYLQ
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.01955 1st Qu.:0.3177 1st Qu.:0.4831 1st Qu.:0.1747
## Median :0.08226 Median :0.4509 Median :0.5383 Median :0.2505
## Mean :0.10537 Mean :0.4402 Mean :0.5356 Mean :0.2963
## 3rd Qu.:0.14351 3rd Qu.:0.5528 3rd Qu.:0.6025 3rd Qu.:0.3615
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## COMP_I_INDUSTRYLQ COMP_J_INDUSTRYLQ COMP_K_INDUSTRYLQ COMP_L_INDUSTRYLQ
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1911 1st Qu.:0.02256 1st Qu.:0.07436 1st Qu.:0.1240
## Median :0.2322 Median :0.03811 Median :0.14918 Median :0.2936
## Mean :0.2606 Mean :0.05217 Mean :0.16285 Mean :0.2971
## 3rd Qu.:0.2885 3rd Qu.:0.05514 3rd Qu.:0.21736 3rd Qu.:0.4280
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## COMP_M_INDUSTRYLQ COMP_N_INDUSTRYLQ COMP_O_INDUSTRYLQ COMP_P_INDUSTRYLQ
## Min. :0.0000 Min. :0.0000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.1335 1st Qu.:0.1086 1st Qu.:0.005648 1st Qu.:0.1594
## Median :0.1964 Median :0.1914 Median :0.014244 Median :0.2251
## Mean :0.2096 Mean :0.2108 Mean :0.050124 Mean :0.2316
## 3rd Qu.:0.2689 3rd Qu.:0.2695 3rd Qu.:0.038530 3rd Qu.:0.3029
## Max. :1.0000 Max. :1.0000 Max. :1.000000 Max. :1.0000
## COMP_Q_INDUSTRYLQ COMP_R_INDUSTRYLQ COMP_S_INDUSTRYLQ COMP_U_INDUSTRYLQ
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000000
## 1st Qu.:0.1935 1st Qu.:0.1590 1st Qu.:0.1269 1st Qu.:0.000000
## Median :0.3113 Median :0.2176 Median :0.1604 Median :0.000000
## Mean :0.3256 Mean :0.2078 Mean :0.1711 Mean :0.009808
## 3rd Qu.:0.4222 3rd Qu.:0.2694 3rd Qu.:0.1989 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000000
Note : After Min-max standardised the clustering variables the values range of the variable are between 0 to 1.
Visualising the standardised clustering variables
Computing proximity matrix
To compute the proximity matrix by using dist() of R. dist() supports six distance proximity calculations, they are: euclidean, maximum, manhattan, canberra, binary and minkowski. The default is euclidean proximity matrix.
The code chunk below is used to compute the proximity matrix using euclidean method.
Computing hierarchical clustering
To compute the hierarchical clustering , the hclust() of R stats will be used. hclust() employed agglomeration method to compute the cluster. 8 clustering algorithms are supported, they are: ward.D, ward.D2, single, complete, average(UPGMA), mcquitty(WPGMA), median(WPGMC) and centroid(UPGMC).
To performs hierarchical cluster for this analysis ward.D method will be used. The hierarchical clustering output is stored in an object of class hclust which describes the tree produced by the clustering process.
Selecting the optimal clustering algorithm
To deal with the challenges on selecting the optimal method of clustering, agnes() function of cluster package will be used. The agnes() function will provide the agglomerative coefficient, which measures the amount of clustering structure found.
The code chunk below will be used to compute the agglomerative coefficients of all hierarchical clustering algorithms.
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(LQ_comp, method = x)$ac
}
map_dbl(m, ac)## average single complete ward
## 0.9587330 0.9305396 0.9725822 0.9873336
The values closer to 1 suggest strong clustering structure. Hence, in this analysis the highest value is ward method. Thus, the following analysis only ward method will be used.
Determining Optimal Clusters
To deal with the challenges on selecting the optimal number of clustering,The gap statistic will be used. As it compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data. The estimate of the optimal clusters will be value that maximize the gap statistic.
To compute the gap statistic, clusGap() of cluster package will be used and visualise the plot by using fviz_gap_stat() of factoextra package.
set.seed(12345)
gap_stat <- clusGap(LQ_comp, FUN = hcut, nstart = 25, K.max = 20, B = 50)
# Print the result
print(gap_stat, method = "firstmax")## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = LQ_comp, FUNcluster = hcut, K.max = 20, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 1
## logW E.logW gap SE.sim
## [1,] 7.218456 8.462920 1.244463 0.02499099
## [2,] 6.946855 8.125598 1.178743 0.03485009
## [3,] 6.693381 7.981988 1.288607 0.01957957
## [4,] 6.624222 7.905851 1.281629 0.01901593
## [5,] 6.488012 7.841814 1.353802 0.02020927
## [6,] 6.418956 7.785693 1.366736 0.02071964
## [7,] 6.376828 7.734372 1.357544 0.01960608
## [8,] 6.316574 7.688499 1.371925 0.01912219
## [9,] 6.173703 7.646229 1.472525 0.01996544
## [10,] 6.134732 7.607543 1.472811 0.01941677
## [11,] 6.058396 7.572282 1.513886 0.01908242
## [12,] 6.022341 7.538742 1.516401 0.01893998
## [13,] 5.987075 7.506980 1.519906 0.01887916
## [14,] 5.938941 7.477479 1.538539 0.01881704
## [15,] 5.889481 7.449358 1.559877 0.01864375
## [16,] 5.860065 7.422465 1.562400 0.01821216
## [17,] 5.821592 7.397136 1.575544 0.01828022
## [18,] 5.766986 7.373224 1.606238 0.01771952
## [19,] 5.734121 7.349939 1.615817 0.01715475
## [20,] 5.683892 7.327298 1.643406 0.01709163
From the above clusGap chart, it shows that after k = 10 the gap stats increase consistanctly. However, the higher the number of k is not good for the clustering analysis as it will be too heterogenous. A good cluster should be homogeneous hence, 9 or 10 will be a good number of k for this analysis.
Visually-driven hierarchical clustering analysis
Transforming the data frame into a matrix
The code chunk below will be used to transform LQ_comp data frame into a data matri to make the heatmap
Interactive cluster heatmap
To build an interactive cluster heatmap the heatmaply() function is used and cluster will be set to 9.
heatmaply(normalize(LQ_comp_mat),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 9,
margins = c(NA,150,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="Geographic Segmentation of industry specialisation by location quotients",
xlab = "Location Quotients",
ylab = "CITY"
)Mapping the clusters formed
The code below will be using cutree() of R Base to derive a 9-cluster model.
To form the join in three steps:
The groups list object will be converted into a matrix
cbind() is used to append groups matrix onto mapz_final to produce an output simple feature object called mapz_cluster
Rename of dplyr package is used to rename as.matrix.groups field as CLUSTER.
mapz_final_cluster <- cbind(mapz_final, as.matrix(groups)) %>%
rename(`CLUSTER`=`as.matrix.groups.`)To show the choropleth map on the cluster formed, qtm() of tmap package is used.
The choropleth map above reveals the clusters are very fragmented. The is one of the major limitation when non-spatial clustering algorithm such as hierarchical cluster analysis method is used. Hence, in the next section will be using spatially constrained clustering method to overcome this limitation.
Spatially Constrained Clustering - SKATER approach
Converting into SpatialPolygonsDataFrame
The code chunk below uses as_Spatial() of sf package to convert mapz_final into a SpatialPolygonDataFrame called mapz_final_sp.
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition
Computing Neighbour List
To create neighbour list, poly2nd() of spdep package will be used to compute the neighbours list from polygon list.
[Note:snap is set to 0.02 as there is 1 region is not linked to any of the region. Thus, to make it at least 1 neighbour snap will be set to 0.02.]
## Neighbour list object:
## Number of regions: 186
## Number of nonzero links: 1140
## Percentage nonzero weights: 3.295179
## Average number of links: 6.129032
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 15 29
## 2 9 7 29 39 36 22 17 7 7 4 4 1 1 1
## 2 least connected regions:
## 9 39 with 1 link
## 1 most connected region:
## 171 with 29 links
From the above result, it shows that the number of average link is 6 and the most connected region is 171.
The first plot command below gives the boundaries. This is followed by the plot of the neighbor list object, with coordinates applied to the original SpatialPolygonDataFrame to extract the centroids of the polygons. These are used as the nodes for the graph representation. We also set the color to blue and specify add=TRUE to plot the network on top of the boundaries.
plot(mapz_final_sp, border=grey(.5),main="São Paulo Macrometropolis")
plot(mapz.nb, coordinates(mapz_final_sp), col="blue", add=TRUE)Minimum Spanning Tree
Calculating edge costs
To calculste the edge costs, nbcosts() of spdep package is used. It is the distance between it nodes. This function compute this distance using a data.frame with observations vector in each node. it also shows the notion of a generalised weight for a spatial weights matrix
To incorporate these costs into a weights object by converting the neighbour list to a list weights object by specifying the just computed lcosts as the weights. To do this , nb2listw() of spdep package is used and style parameter as B to ensure the cost values are not row-standardised.
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 186
## Number of nonzero links: 1140
## Percentage nonzero weights: 3.295179
## Average number of links: 6.129032
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 15 29
## 2 9 7 29 39 36 22 17 7 7 4 4 1 1 1
## 2 least connected regions:
## 9 39 with 1 link
## 1 most connected region:
## 171 with 29 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 186 34596 30754.69 6396520 68711399
Computing minimum spanning tree
To compute The minimum spanning tree the mstree() of spdep package is used.
To plot method for the MST by showing the observation numbers of the nodes in addition to the edge as the following code.
plot(mapz_final_sp, border=gray(.5))
plot.mst(mapz.mst, coordinates(mapz_final_sp),
col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)Computing spatially constrained clusters using SKATER method
To compute spatially constrained cluster, the skater() of spdep package will be used.
The skater() takes three mandatory arguments:
The first two columns of the MST matrix (i.e. not the cost):mapz.mst[,1:2]
The data matrix (to update the costs as units are being grouped) :LQ_comp
The number of cuts = 8. [Note: It is set to one less than the number of clusters. So, the value specified is not the number of clusters, but the number of cuts in the graph, one less than the number of clusters.]
The result of the skater() is an object of class skater.
## List of 8
## $ groups : num [1:186] 1 1 1 1 1 1 1 1 8 1 ...
## $ edges.groups:List of 9
## ..$ :List of 3
## .. ..$ node: num [1:169] 31 119 69 60 11 91 81 45 58 27 ...
## .. ..$ edge: num [1:168, 1:3] 119 69 11 91 81 60 27 45 130 23 ...
## .. ..$ ssw : num 2141
## ..$ :List of 3
## .. ..$ node: num 167
## .. ..$ edge: num[0 , 1:3]
## .. ..$ ssw : num 0
## ..$ :List of 3
## .. ..$ node: num 164
## .. ..$ edge: num[0 , 1:3]
## .. ..$ ssw : num 0
## ..$ :List of 3
## .. ..$ node: num [1:7] 104 125 127 95 110 114 97
## .. ..$ edge: num [1:6, 1:3] 125 104 104 127 125 127 95 125 127 110 ...
## .. ..$ ssw : num 150
## ..$ :List of 3
## .. ..$ node: num 89
## .. ..$ edge: num[0 , 1:3]
## .. ..$ ssw : num 0
## ..$ :List of 3
## .. ..$ node: num 96
## .. ..$ edge: num[0 , 1:3]
## .. ..$ ssw : num 0
## ..$ :List of 3
## .. ..$ node: num 44
## .. ..$ edge: num[0 , 1:3]
## .. ..$ ssw : num 0
## ..$ :List of 3
## .. ..$ node: num [1:4] 26 15 9 17
## .. ..$ edge: num [1:3, 1:3] 26 26 15 15 17 ...
## .. ..$ ssw : num 39.4
## ..$ :List of 3
## .. ..$ node: num 33
## .. ..$ edge: num[0 , 1:3]
## .. ..$ ssw : num 0
## $ not.prune : NULL
## $ candidates : int [1:9] 1 2 3 4 5 6 7 8 9
## $ ssto : num 3907
## $ ssw : num [1:9] 3907 3591 3299 3067 2867 ...
## $ crit : num [1:2] 1 Inf
## $ vec.crit : num [1:186] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "class")= chr "skater"
## [1] 1 1 1 1 1 1 1 1 8 1 1 1 1 1 8 1 8 1 1 1 1 1 1 1 1 8 1 1 1 1 1 1 9 1 1 1 1
## [38] 1 1 1 1 1 1 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 1 1 1 1 4 6 4 1 1 1 1 1 1 4 1 1 1 1 1 4 1
## [112] 1 1 4 1 1 1 1 1 1 1 1 1 1 4 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1
## ccs9
## 1 2 3 4 5 6 7 8 9
## 169 1 1 7 1 1 1 4 1
To plot the pruned tree that shows the 9 clusters on top of the region.
plot(clust9, coordinates(mapz_final_sp), cex.lab=.7,
groups.colors=c("blue","green3","red", "brown", "pink","yellow","gray","cyan"), cex.circles=0.005)
plot(mapz_final_sp, border=gray(.5), add=TRUE)Visualising the clusters in choropleth map
The code chunk below is used to plot the newly derived clusters by using SKATER method.
groups_mat <- as.matrix(clust9$groups)
mapz_final_cluster_spatialcluster <- cbind(mapz_final_cluster, as.factor(groups_mat)) %>%
rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(mapz_final_cluster_spatialcluster, "SP_CLUSTER") We can now see that it is grouping contiguous objects that are similar into new aggregate areal unit.
Comparison : Hierarchical Clustering Vs Spatially Constrained Hierarchical Clustering Maps
For comparison purpose, it will be better to place both the hierarchical clustering and spatially constrained hierarchical clustering maps next to each other.
hclust.map <- qtm(mapz_final_cluster,
"CLUSTER") +
tm_borders(alpha = 0.5)
shclust.map <- qtm(mapz_final_cluster_spatialcluster,
"SP_CLUSTER") +
tm_borders(alpha = 0.5)
tmap_arrange(hclust.map, shclust.map,
asp=NA, ncol=2)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
From the comparison map, it shows that the derived clusters by using SKATER method are not fragmented compared to the Hierarchical Clustering and it also shows that cluster 1 is the biggest cluster in the São Paulo Macrometropolis.This also means that the distribution on the type of the industry are similar across the São Paulo Macrometropolis since cluster 1 cover most part of the São Paulo Macrometropolis.