Background

The government has been encouraging local enterprises to venture into international market and one of the highly recommended country is Brazil.However, in view of the spatial development of Brazil that is highly unequal, any company interested to invest in Brazil must have a good understanding of the geographical specialization and competitiveness of industry in the country.

The aim of the analysis is to segment São Paulo Macrometropolis at the municipality level into homogeneous industry specialization clusters by using data provided by Brazilian Institute of Geography and Statistics.

Packages Installation

Install the following packages for this study.

packages = c('rgdal', 'maptools', 'raster','spatstat', 'tmap','sf', 'dplyr', 'tidyverse','ggplot2','spdep','tidyselect','gganimate','cowplot','RColorBrewer','plotly','sp','FRK','animation','ggpubr','geobr','cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych')

for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)

}

## Loading required package: rgdal

## Loading required package: sp

## rgdal: version: 1.5-17, (SVN revision 1070)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.0.4, released 2020/01/28
## Path to GDAL shared files: C:/Users/sihua/OneDrive/Documents/R/win-library/4.0/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 6.3.1, February 10th, 2020, [PJ_VERSION: 631]
## Path to PROJ shared files: C:/Users/sihua/OneDrive/Documents/R/win-library/4.0/rgdal/proj
## Linking to sp version:1.4-4
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.

## Loading required package: maptools

## Checking rgeos availability: TRUE

## Loading required package: raster

## Loading required package: spatstat

## Loading required package: spatstat.data

## Loading required package: nlme

## 
## Attaching package: 'nlme'

## The following object is masked from 'package:raster':
## 
##     getData

## Loading required package: rpart

## 
## spatstat 1.64-1       (nickname: 'Help you I can, yes!') 
## For an introduction to spatstat, type 'beginner'

## 
## Note: spatstat version 1.64-1 is out of date by more than 6 months; we recommend upgrading to the latest version.

## 
## Attaching package: 'spatstat'

## The following objects are masked from 'package:raster':
## 
##     area, rotate, shift

## Loading required package: tmap

## Loading required package: sf

## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:nlme':
## 
##     collapse

## The following objects are masked from 'package:raster':
## 
##     intersect, select, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: tidyverse

## Registered S3 method overwritten by 'cli':
##   method     from    
##   print.boxx spatstat

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.0
## v readr   1.4.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::collapse() masks nlme::collapse()
## x tidyr::extract()  masks raster::extract()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x dplyr::select()   masks raster::select()

## Loading required package: spdep

## Loading required package: spData

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

## Loading required package: tidyselect

## Loading required package: gganimate

## 
## Attaching package: 'gganimate'

## The following object is masked from 'package:raster':
## 
##     animate

## Loading required package: cowplot

## Loading required package: RColorBrewer

## Loading required package: plotly

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:raster':
## 
##     select

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

## Loading required package: FRK

## 
## Attaching package: 'FRK'

## The following object is masked from 'package:raster':
## 
##     distance

## Loading required package: animation

## Loading required package: ggpubr

## 
## Attaching package: 'ggpubr'

## The following object is masked from 'package:cowplot':
## 
##     get_legend

## The following objects are masked from 'package:spatstat':
## 
##     border, rotate

## The following object is masked from 'package:raster':
## 
##     rotate

## Loading required package: geobr

## Warning: package 'geobr' was built under R version 4.0.3

## Loading required package: cluster

## 
## Attaching package: 'cluster'

## The following object is masked from 'package:spatstat':
## 
##     volume

## Loading required package: factoextra

## Warning: package 'factoextra' was built under R version 4.0.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

## Loading required package: NbClust

## Warning: package 'NbClust' was built under R version 4.0.3

## Loading required package: heatmaply

## Loading required package: viridis

## Loading required package: viridisLite

## 
## ======================
## Welcome to heatmaply version 1.1.1
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================

## Loading required package: corrplot

## corrplot 0.84 loaded

## Loading required package: psych

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## The following objects are masked from 'package:spatstat':
## 
##     reflect, rescale

Data Preparation

This section is to prepare the data in order to perform analysis.

Data Sources

This study will be using the following datasets:

BRAZIL_CITIES.csv
Data_Dictionary.csv

Data Import

we will be using read_csv() function to read in the BRAZIL_CITIES.csv and Data_Dictionary.csv datasets.

brazil_data <- read_delim("data/aspatial/BRAZIL_CITIES.csv", delim = ";")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   CITY = col_character(),
##   STATE = col_character(),
##   AREA = col_number(),
##   REGIAO_TUR = col_character(),
##   CATEGORIA_TUR = col_character(),
##   RURAL_URBAN = col_character(),
##   GVA_MAIN = col_character()
## )
## i Use `spec()` for the full column specifications.

Data_Dictionary <- read.csv2("data/aspatial/Data_Dictionary.csv")

It also required for us to extract 2016 municipality boundary file and the metropolitan boundary file by using geobr () function

muni<- read_municipality(year=2016)

## Using year 2016

## Loading data for the whole country. This might take a few minutes.

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%

metro <- read_metro_area(year=2016)

## Using year 2016

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
## 
Downloading: 1.6 kB     
Downloading: 1.6 kB     
Downloading: 10 kB     
Downloading: 10 kB     
Downloading: 10 kB     
Downloading: 10 kB     
Downloading: 26 kB     
Downloading: 26 kB     
Downloading: 34 kB     
Downloading: 34 kB     
Downloading: 51 kB     
Downloading: 51 kB     
Downloading: 67 kB     
Downloading: 67 kB     
Downloading: 83 kB     
Downloading: 83 kB     
Downloading: 91 kB     
Downloading: 91 kB     
Downloading: 99 kB     
Downloading: 99 kB     
Downloading: 120 kB     
Downloading: 120 kB     
Downloading: 130 kB     
Downloading: 130 kB     
Downloading: 150 kB     
Downloading: 150 kB     
Downloading: 160 kB     
Downloading: 160 kB     
Downloading: 160 kB     
Downloading: 160 kB     
Downloading: 180 kB     
Downloading: 180 kB     
Downloading: 200 kB     
Downloading: 200 kB     
Downloading: 210 kB     
Downloading: 210 kB     
Downloading: 230 kB     
Downloading: 230 kB     
Downloading: 230 kB     
Downloading: 230 kB     
Downloading: 240 kB     
Downloading: 240 kB     
Downloading: 260 kB     
Downloading: 260 kB     
Downloading: 280 kB     
Downloading: 280 kB     
Downloading: 290 kB     
Downloading: 290 kB     
Downloading: 300 kB     
Downloading: 300 kB     
Downloading: 310 kB     
Downloading: 310 kB     
Downloading: 330 kB     
Downloading: 330 kB     
Downloading: 340 kB     
Downloading: 340 kB     
Downloading: 360 kB     
Downloading: 360 kB     
Downloading: 370 kB     
Downloading: 370 kB     
Downloading: 380 kB     
Downloading: 380 kB     
Downloading: 390 kB     
Downloading: 390 kB     
Downloading: 410 kB     
Downloading: 410 kB     
Downloading: 420 kB     
Downloading: 420 kB     
Downloading: 440 kB     
Downloading: 440 kB     
Downloading: 460 kB     
Downloading: 460 kB     
Downloading: 470 kB     
Downloading: 470 kB     
Downloading: 470 kB     
Downloading: 470 kB     
Downloading: 480 kB     
Downloading: 480 kB     
Downloading: 500 kB     
Downloading: 500 kB     
Downloading: 510 kB     
Downloading: 510 kB     
Downloading: 530 kB     
Downloading: 530 kB     
Downloading: 540 kB     
Downloading: 540 kB     
Downloading: 560 kB     
Downloading: 560 kB     
Downloading: 570 kB     
Downloading: 570 kB     
Downloading: 570 kB     
Downloading: 570 kB     
Downloading: 580 kB     
Downloading: 580 kB     
Downloading: 600 kB     
Downloading: 600 kB     
Downloading: 620 kB     
Downloading: 620 kB     
Downloading: 630 kB     
Downloading: 630 kB     
Downloading: 650 kB     
Downloading: 650 kB     
Downloading: 660 kB     
Downloading: 660 kB     
Downloading: 660 kB     
Downloading: 660 kB     
Downloading: 660 kB     
Downloading: 660 kB     
Downloading: 680 kB     
Downloading: 680 kB     
Downloading: 700 kB     
Downloading: 700 kB     
Downloading: 710 kB     
Downloading: 710 kB     
Downloading: 730 kB     
Downloading: 730 kB     
Downloading: 750 kB     
Downloading: 750 kB     
Downloading: 760 kB     
Downloading: 760 kB     
Downloading: 780 kB     
Downloading: 780 kB     
Downloading: 790 kB     
Downloading: 790 kB     
Downloading: 810 kB     
Downloading: 810 kB     
Downloading: 830 kB     
Downloading: 830 kB     
Downloading: 840 kB     
Downloading: 840 kB     
Downloading: 850 kB     
Downloading: 850 kB     
Downloading: 850 kB     
Downloading: 850 kB     
Downloading: 850 kB     
Downloading: 850 kB     
Downloading: 870 kB     
Downloading: 870 kB     
Downloading: 880 kB     
Downloading: 880 kB     
Downloading: 900 kB     
Downloading: 900 kB     
Downloading: 920 kB     
Downloading: 920 kB     
Downloading: 930 kB     
Downloading: 930 kB     
Downloading: 940 kB     
Downloading: 940 kB     
Downloading: 940 kB     
Downloading: 940 kB     
Downloading: 940 kB     
Downloading: 940 kB     
Downloading: 940 kB     
Downloading: 940 kB     
Downloading: 960 kB     
Downloading: 960 kB     
Downloading: 970 kB     
Downloading: 970 kB     
Downloading: 990 kB     
Downloading: 990 kB     
Downloading: 1 MB     
Downloading: 1 MB     
Downloading: 1 MB     
Downloading: 1 MB     
Downloading: 1 MB     
Downloading: 1 MB     
Downloading: 1 MB     
Downloading: 1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.1 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.2 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.3 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.4 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.5 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.6 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.7 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.8 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 1.9 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.1 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.2 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.3 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.4 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.5 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.6 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.7 MB     
Downloading: 2.8 MB     
Downloading: 2.8 MB     
Downloading: 2.8 MB     
Downloading: 2.8 MB

Data Wrangling

As in the metropolitan boundary file, there are missing area on the Regional Unit of Bragança Paulista city.To retrieve these area, it can be found in the municipality boundary file. The code below is to select the abbrev_state as SP (sao paulo) and the areas that is in the Regional Unit of Bragança Paulista city.

Regional Unit of Bragança Paulista city contains:

Atibaia
Bom Jesus Dos Perdões
Bragança Paulista
Joanópolis
Nazaré Paulista
Pedra Bela
Pinhalzinho
Piracaia
Tuiuti
Vargem

And also, for later rbind purpose name_metro column is created to indicate the above 10 areas are belongs to Regional Unit of Bragança Paulista city.

muni_state <- muni %>%
  filter(abbrev_state =="SP" & name_muni %in% c('Atibaia',
'Bom Jesus Dos Perdões',
'Bragança Paulista',
'Joanópolis',
'Nazaré Paulista',  
'Pedra Bela',
'Pinhalzinho',
'Piracaia',
'Tuiuti',
'Vargem'))%>%
mutate (name_metro ="Regional Unit of Bragança Paulista city")

To extract the Sao Paulo area by filter the abbrev_state as “SP” and select reqired columns.

metro_state <- metro %>%
  filter(abbrev_state =="SP")%>%
  select(code_muni,name_muni,code_state,abbrev_state,name_metro,geom)

To combind the metropolitan and municipality to get the São Paulo Macrometropolis area.

mapz<-rbind(metro_state, muni_state)

For the analysis purpose, it is required to extract the city and industry related information from the brazil cities file.

BRAZIL_CITIES_geo <- brazil_data %>%
  select (CITY,contains(c("COMP")))

To combind the list of location in São Paulo Macrometropolis with the industry related data by using left join function [ join by the city name which is name_muni from the mapz and city column in the brazil_cities dataframe].

mapz_join<-left_join(mapz,BRAZIL_CITIES_geo ,by = c("name_muni" = "CITY"))

To compute the Location Quotients (LQ) for each industry type by using the below formula

(Number of industry a in municipality [i] / Sum of all industry in municipality [i])/(Number of industry a in Greater Sao Paulo/Sum of all industry in Greater Sao Paulo)

To compute for 20 indsutry, funs function will be used and to compute LQ also need to ensure there is no NA values. Hence, in the code chunk below, NA values is repleced by 0 instead.

mapz_final <- mapz_join %>%
  mutate_all(~replace(., is.na(.), 0))%>%
  mutate_at(vars(contains("COMP")),funs(INDUSTRYLQ = (. / COMP_TOT) / (sum(.)/ sum(COMP_TOT))))%>%
  select (name_muni,name_metro,contains(c("COMP")),contains(c("INDUSTRYLQ")))

## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

The code below is the transform the data into SP format in order to plot the map later.

muniState_sp <- as(muni_state,Class="Spatial")

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition

## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition

mapz_final_sp<- as(mapz_final,Class="Spatial")

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition

Exploratory Data Analysis (EDA)

EDA　statistical graphics

The code below is to plot the distribution of the variables by using Histogram as it is easy for visualization and to identify the overall distribution of the data values.

ggplot(data= mapz_final %>% 
        st_set_geometry(NULL)%>%
        as.data.frame() %>%
        select(contains("LQ"),-"COMP_TOT_INDUSTRYLQ") %>%
        normalize()%>%
        gather("Industry","Num_Company",contains("LQ")), 
             aes_string(x = "Num_Company"))+
              facet_wrap("Industry") +
              geom_histogram(bins=10, 
                 color="black", 
                 fill="light blue")

## Warning in min(x[ss_no_NA]): no non-missing arguments to min; returning Inf

## Warning in max(x[ss_no_NA]): no non-missing arguments to max; returning -Inf

## Warning in min(x[ss_no_NA]): no non-missing arguments to min; returning Inf

## Warning: Removed 206 rows containing non-finite values (stat_bin).

From the histograms above we can see that most of the variable does not form a normal distribution and most of them are right skew.

EDA using choropleth Map

The EDA will be using the choropleth map to show Location Quotients by Industry Type.

The code below is to create a function to generate the choropleth map on the location quotients by industry type. The input variable for this function is the indutry type by location quotients.

GenerateDistributionIndustry <- function(industry) 
{
  
industrydistributionmap <- tm_shape(mapz_final_sp)+
  tm_borders(alpha=0.5)+
  tm_layout(main.title=industry,main.title.position = "center",legend.text.size = 0.6)+
  tm_fill(industry)

tmap_arrange(industrydistributionmap)

}

The code below is to store the LQ related columns into a list so that I can leverage on the For Loop to loop through all the 20 industry types and call the GenerateDistributionIndustry() functions to create the choropleth map for the individual industry.

Note: it start at column 2 to skip “CITY” column and i-1 to advoid the “geom” column in the mapz_final.

industrylist <- colnames(mapz_final %>%
           select (contains(c("INDUSTRYLQ")),-"COMP_TOT_INDUSTRYLQ",-"geom"))
LQlist<-list()
for(i in 2:length(industrylist)) 
  {
   LQlist<- c(LQlist,(assign(paste("industry", i-1, sep = ""), GenerateDistributionIndustry(industrylist[[i-1]]))))
  
}

For visualizing purpose, the code below is to put all the 20 choropleth maps into animated form.

walk(LQlist, print)

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

## Warning in sp::proj4string(obj): CRS object has comment, which is lost in output

From the above EDA, it shows that

The below 3 indsutry are well distributed across the São Paulo Macrometropolis area:

Industry G - Trade; repair of motor vehicles and motorcycles
Industry F - Constructionevenly
Indsutry Q - Human health and social services

And also from the map, it also shows that Industry T- Domestic services does not exist in the São Paulo Macrometropolis. Hence, the following analysis we can exclude Domestic services industry.

EDA Correlation Analysis

Before perform cluster analysis, it is important to ensure that the cluster variables are not highly correlated.

NOTE :The following code is to exclude Domestic services industry before the correlation analysis.

mapz_cluster <- as.data.frame(mapz_final)%>% 
          mutate_all(~replace(., is.na(.), 0)) %>%
          select(name_muni,contains("INDUSTRYLQ"), -"COMP_T_INDUSTRYLQ",-"COMP_TOT_INDUSTRYLQ")

The code chunk below will be using corrplot.mixed() function of corrplot package to visualise and analyse the correlation of the input variables.

cluster_vars.cor = cor(mapz_cluster[,2:21])
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

From the above correlation plot above shows that there are no variables are highly correlated. Thus, this suggest that none of the variable need to be removed from the cluster analysis.

Hierarchy Cluster Analysis

Extrating clustering variables

The below code chunk is the make the name_muni as the row name in order to plot the hierarchy cluster analysis and also to extract the clustering variables from the maps_cluster.

row.names(mapz_cluster) <- make.unique(mapz_cluster$name_muni)
LQ_comp <- select(mapz_cluster, c(2:21))

Data Standardisation

To avoid the cluster analysis result is baised to clustering variables with large values, it is useful to standardise the input variables before performing cluster analysis.

In standardisation there are 2 methods, 1) min-max and 2) Z-score. It is important to note that Z-score standardisation method should only be used if the analysis assume all variables come from some normal distribution.

In the EDA　statistical graphics we can see that most of the variables are not normal distribution. Therefore Min-Max standardisation method will be used instead of z-score method.

Min-Max standardisation

To standardise, normalize() of heatmaply package is used to stadardisation the clustering variables by using Min-Max method.

The summary() is then used to display the summary statistics of the standardised clustering variables.

LQ_comp.std<- normalize(LQ_comp)
summary(LQ_comp.std)

##  COMP_A_INDUSTRYLQ  COMP_B_INDUSTRYLQ COMP_C_INDUSTRYLQ COMP_D_INDUSTRYLQ
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.0000    Min.   :0.00000  
##  1st Qu.:0.005384   1st Qu.:0.00000   1st Qu.:0.1787    1st Qu.:0.00000  
##  Median :0.027261   Median :0.01412   Median :0.2839    Median :0.00000  
##  Mean   :0.107719   Mean   :0.06464   Mean   :0.3165    Mean   :0.01851  
##  3rd Qu.:0.114746   3rd Qu.:0.05951   3rd Qu.:0.4210    3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.0000    Max.   :1.00000  
##  COMP_E_INDUSTRYLQ COMP_F_INDUSTRYLQ COMP_G_INDUSTRYLQ COMP_H_INDUSTRYLQ
##  Min.   :0.00000   Min.   :0.0000    Min.   :0.0000    Min.   :0.0000   
##  1st Qu.:0.01955   1st Qu.:0.3177    1st Qu.:0.4831    1st Qu.:0.1747   
##  Median :0.08226   Median :0.4509    Median :0.5383    Median :0.2505   
##  Mean   :0.10537   Mean   :0.4402    Mean   :0.5356    Mean   :0.2963   
##  3rd Qu.:0.14351   3rd Qu.:0.5528    3rd Qu.:0.6025    3rd Qu.:0.3615   
##  Max.   :1.00000   Max.   :1.0000    Max.   :1.0000    Max.   :1.0000   
##  COMP_I_INDUSTRYLQ COMP_J_INDUSTRYLQ COMP_K_INDUSTRYLQ COMP_L_INDUSTRYLQ
##  Min.   :0.0000    Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   
##  1st Qu.:0.1911    1st Qu.:0.02256   1st Qu.:0.07436   1st Qu.:0.1240   
##  Median :0.2322    Median :0.03811   Median :0.14918   Median :0.2936   
##  Mean   :0.2606    Mean   :0.05217   Mean   :0.16285   Mean   :0.2971   
##  3rd Qu.:0.2885    3rd Qu.:0.05514   3rd Qu.:0.21736   3rd Qu.:0.4280   
##  Max.   :1.0000    Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   
##  COMP_M_INDUSTRYLQ COMP_N_INDUSTRYLQ COMP_O_INDUSTRYLQ  COMP_P_INDUSTRYLQ
##  Min.   :0.0000    Min.   :0.0000    Min.   :0.000000   Min.   :0.0000   
##  1st Qu.:0.1335    1st Qu.:0.1086    1st Qu.:0.005648   1st Qu.:0.1594   
##  Median :0.1964    Median :0.1914    Median :0.014244   Median :0.2251   
##  Mean   :0.2096    Mean   :0.2108    Mean   :0.050124   Mean   :0.2316   
##  3rd Qu.:0.2689    3rd Qu.:0.2695    3rd Qu.:0.038530   3rd Qu.:0.3029   
##  Max.   :1.0000    Max.   :1.0000    Max.   :1.000000   Max.   :1.0000   
##  COMP_Q_INDUSTRYLQ COMP_R_INDUSTRYLQ COMP_S_INDUSTRYLQ COMP_U_INDUSTRYLQ 
##  Min.   :0.0000    Min.   :0.0000    Min.   :0.0000    Min.   :0.000000  
##  1st Qu.:0.1935    1st Qu.:0.1590    1st Qu.:0.1269    1st Qu.:0.000000  
##  Median :0.3113    Median :0.2176    Median :0.1604    Median :0.000000  
##  Mean   :0.3256    Mean   :0.2078    Mean   :0.1711    Mean   :0.009808  
##  3rd Qu.:0.4222    3rd Qu.:0.2694    3rd Qu.:0.1989    3rd Qu.:0.000000  
##  Max.   :1.0000    Max.   :1.0000    Max.   :1.0000    Max.   :1.000000

Note : After Min-max standardised the clustering variables the values range of the variable are between 0 to 1.

Visualising the standardised clustering variables

Computing proximity matrix

To compute the proximity matrix by using dist() of R. dist() supports six distance proximity calculations, they are: euclidean, maximum, manhattan, canberra, binary and minkowski. The default is euclidean proximity matrix.

The code chunk below is used to compute the proximity matrix using euclidean method.

proxmat <- dist(LQ_comp, method = 'euclidean')

Computing hierarchical clustering

To compute the hierarchical clustering , the hclust() of R stats will be used. hclust() employed agglomeration method to compute the cluster. 8 clustering algorithms are supported, they are: ward.D, ward.D2, single, complete, average(UPGMA), mcquitty(WPGMA), median(WPGMC) and centroid(UPGMC).

To performs hierarchical cluster for this analysis ward.D method will be used. The hierarchical clustering output is stored in an object of class hclust which describes the tree produced by the clustering process.

hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.6)

Selecting the optimal clustering algorithm

To deal with the challenges on selecting the optimal method of clustering, agnes() function of cluster package will be used. The agnes() function will provide the agglomerative coefficient, which measures the amount of clustering structure found.

The code chunk below will be used to compute the agglomerative coefficients of all hierarchical clustering algorithms.

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(LQ_comp, method = x)$ac
}

map_dbl(m, ac)

##   average    single  complete      ward 
## 0.9587330 0.9305396 0.9725822 0.9873336

The values closer to 1 suggest strong clustering structure. Hence, in this analysis the highest value is ward method. Thus, the following analysis only ward method will be used.

Determining Optimal Clusters

To deal with the challenges on selecting the optimal number of clustering,The gap statistic will be used. As it compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data. The estimate of the optimal clusters will be value that maximize the gap statistic.

To compute the gap statistic, clusGap() of cluster package will be used and visualise the plot by using fviz_gap_stat() of factoextra package.

set.seed(12345)
gap_stat <- clusGap(LQ_comp, FUN = hcut, nstart = 25, K.max = 20, B = 50)
# Print the result
print(gap_stat, method = "firstmax")

## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = LQ_comp, FUNcluster = hcut, K.max = 20, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 1
##           logW   E.logW      gap     SE.sim
##  [1,] 7.218456 8.462920 1.244463 0.02499099
##  [2,] 6.946855 8.125598 1.178743 0.03485009
##  [3,] 6.693381 7.981988 1.288607 0.01957957
##  [4,] 6.624222 7.905851 1.281629 0.01901593
##  [5,] 6.488012 7.841814 1.353802 0.02020927
##  [6,] 6.418956 7.785693 1.366736 0.02071964
##  [7,] 6.376828 7.734372 1.357544 0.01960608
##  [8,] 6.316574 7.688499 1.371925 0.01912219
##  [9,] 6.173703 7.646229 1.472525 0.01996544
## [10,] 6.134732 7.607543 1.472811 0.01941677
## [11,] 6.058396 7.572282 1.513886 0.01908242
## [12,] 6.022341 7.538742 1.516401 0.01893998
## [13,] 5.987075 7.506980 1.519906 0.01887916
## [14,] 5.938941 7.477479 1.538539 0.01881704
## [15,] 5.889481 7.449358 1.559877 0.01864375
## [16,] 5.860065 7.422465 1.562400 0.01821216
## [17,] 5.821592 7.397136 1.575544 0.01828022
## [18,] 5.766986 7.373224 1.606238 0.01771952
## [19,] 5.734121 7.349939 1.615817 0.01715475
## [20,] 5.683892 7.327298 1.643406 0.01709163

fviz_gap_stat(gap_stat)

From the above clusGap chart, it shows that after k = 10 the gap stats increase consistanctly. However, the higher the number of k is not good for the clustering analysis as it will be too heterogenous. A good cluster should be homogeneous hence, 9 or 10 will be a good number of k for this analysis.

Visually-driven hierarchical clustering analysis

Transforming the data frame into a matrix

The code chunk below will be used to transform LQ_comp data frame into a data matri to make the heatmap

LQ_comp_mat <- data.matrix(LQ_comp)

Interactive cluster heatmap

To build an interactive cluster heatmap the heatmaply() function is used and cluster will be set to 9.

heatmaply(normalize(LQ_comp_mat),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 9,
          margins = c(NA,150,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of industry specialisation by location quotients",
          xlab = "Location Quotients",
          ylab = "CITY"
          )

From the cluster Dendrogram with heatmap,we can easily see distribution by LQ industry type and it also shows that industry G -Trade; repair of motor vehicles and motorcycles are more common across the São Paulo Macrometropolis.

Mapping the clusters formed

The code below will be using cutree() of R Base to derive a 9-cluster model.

groups <- as.factor(cutree(hclust_ward, k=9))

To form the join in three steps:

The groups list object will be converted into a matrix
cbind() is used to append groups matrix onto mapz_final to produce an output simple feature object called mapz_cluster
Rename of dplyr package is used to rename as.matrix.groups field as CLUSTER.

mapz_final_cluster <- cbind(mapz_final, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)

To show the choropleth map on the cluster formed, qtm() of tmap package is used.

qtm(mapz_final_cluster, "CLUSTER")

The choropleth map above reveals the clusters are very fragmented. The is one of the major limitation when non-spatial clustering algorithm such as hierarchical cluster analysis method is used. Hence, in the next section will be using spatially constrained clustering method to overcome this limitation.

Spatially Constrained Clustering - SKATER approach

Converting into SpatialPolygonsDataFrame

The code chunk below uses as_Spatial() of sf package to convert mapz_final into a SpatialPolygonDataFrame called mapz_final_sp.

mapz_final_sp <- as_Spatial(mapz_final)

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition

## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition

Computing Neighbour List

To create neighbour list, poly2nd() of spdep package will be used to compute the neighbours list from polygon list.

[Note:snap is set to 0.02 as there is 1 region is not linked to any of the region. Thus, to make it at least 1 neighbour snap will be set to 0.02.]

mapz.nb <- poly2nb(mapz_final_sp, snap =0.02)

summary(mapz.nb)

## Neighbour list object:
## Number of regions: 186 
## Number of nonzero links: 1140 
## Percentage nonzero weights: 3.295179 
## Average number of links: 6.129032 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 15 29 
##  2  9  7 29 39 36 22 17  7  7  4  4  1  1  1 
## 2 least connected regions:
## 9 39 with 1 link
## 1 most connected region:
## 171 with 29 links

From the above result, it shows that the number of average link is 6 and the most connected region is 171.

The first plot command below gives the boundaries. This is followed by the plot of the neighbor list object, with coordinates applied to the original SpatialPolygonDataFrame to extract the centroids of the polygons. These are used as the nodes for the graph representation. We also set the color to blue and specify add=TRUE to plot the network on top of the boundaries.

plot(mapz_final_sp, border=grey(.5),main="São Paulo Macrometropolis")
plot(mapz.nb, coordinates(mapz_final_sp), col="blue", add=TRUE)

Minimum Spanning Tree

Calculating edge costs

To calculste the edge costs, nbcosts() of spdep package is used. It is the distance between it nodes. This function compute this distance using a data.frame with observations vector in each node. it also shows the notion of a generalised weight for a spatial weights matrix

lcosts <- nbcosts(mapz.nb, LQ_comp)

To incorporate these costs into a weights object by converting the neighbour list to a list weights object by specifying the just computed lcosts as the weights. To do this , nb2listw() of spdep package is used and style parameter as B to ensure the cost values are not row-standardised.

mapz.w <- nb2listw(mapz.nb, lcosts, style="B")
summary(mapz.w)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 186 
## Number of nonzero links: 1140 
## Percentage nonzero weights: 3.295179 
## Average number of links: 6.129032 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 15 29 
##  2  9  7 29 39 36 22 17  7  7  4  4  1  1  1 
## 2 least connected regions:
## 9 39 with 1 link
## 1 most connected region:
## 171 with 29 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn       S0      S1       S2
## B 186 34596 30754.69 6396520 68711399

Computing minimum spanning tree

To compute The minimum spanning tree the mstree() of spdep package is used.

mapz.mst <- mstree(mapz.w)

To plot method for the MST by showing the observation numbers of the nodes in addition to the edge as the following code.

plot(mapz_final_sp, border=gray(.5))
plot.mst(mapz.mst, coordinates(mapz_final_sp), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)

Computing spatially constrained clusters using SKATER method

To compute spatially constrained cluster, the skater() of spdep package will be used.

The skater() takes three mandatory arguments:

The first two columns of the MST matrix (i.e. not the cost):mapz.mst[,1:2]
The data matrix (to update the costs as units are being grouped) :LQ_comp
The number of cuts = 8. [Note: It is set to one less than the number of clusters. So, the value specified is not the number of clusters, but the number of cuts in the graph, one less than the number of clusters.]

clust9 <- skater(mapz.mst[,1:2], LQ_comp, method = "euclidean", 8)

The result of the skater() is an object of class skater.

str(clust9)

## List of 8
##  $ groups      : num [1:186] 1 1 1 1 1 1 1 1 8 1 ...
##  $ edges.groups:List of 9
##   ..$ :List of 3
##   .. ..$ node: num [1:169] 31 119 69 60 11 91 81 45 58 27 ...
##   .. ..$ edge: num [1:168, 1:3] 119 69 11 91 81 60 27 45 130 23 ...
##   .. ..$ ssw : num 2141
##   ..$ :List of 3
##   .. ..$ node: num 167
##   .. ..$ edge: num[0 , 1:3] 
##   .. ..$ ssw : num 0
##   ..$ :List of 3
##   .. ..$ node: num 164
##   .. ..$ edge: num[0 , 1:3] 
##   .. ..$ ssw : num 0
##   ..$ :List of 3
##   .. ..$ node: num [1:7] 104 125 127 95 110 114 97
##   .. ..$ edge: num [1:6, 1:3] 125 104 104 127 125 127 95 125 127 110 ...
##   .. ..$ ssw : num 150
##   ..$ :List of 3
##   .. ..$ node: num 89
##   .. ..$ edge: num[0 , 1:3] 
##   .. ..$ ssw : num 0
##   ..$ :List of 3
##   .. ..$ node: num 96
##   .. ..$ edge: num[0 , 1:3] 
##   .. ..$ ssw : num 0
##   ..$ :List of 3
##   .. ..$ node: num 44
##   .. ..$ edge: num[0 , 1:3] 
##   .. ..$ ssw : num 0
##   ..$ :List of 3
##   .. ..$ node: num [1:4] 26 15 9 17
##   .. ..$ edge: num [1:3, 1:3] 26 26 15 15 17 ...
##   .. ..$ ssw : num 39.4
##   ..$ :List of 3
##   .. ..$ node: num 33
##   .. ..$ edge: num[0 , 1:3] 
##   .. ..$ ssw : num 0
##  $ not.prune   : NULL
##  $ candidates  : int [1:9] 1 2 3 4 5 6 7 8 9
##  $ ssto        : num 3907
##  $ ssw         : num [1:9] 3907 3591 3299 3067 2867 ...
##  $ crit        : num [1:2] 1 Inf
##  $ vec.crit    : num [1:186] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "class")= chr "skater"

ccs9 <- clust9$groups
ccs9

##   [1] 1 1 1 1 1 1 1 1 8 1 1 1 1 1 8 1 8 1 1 1 1 1 1 1 1 8 1 1 1 1 1 1 9 1 1 1 1
##  [38] 1 1 1 1 1 1 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 1 1 1 1 4 6 4 1 1 1 1 1 1 4 1 1 1 1 1 4 1
## [112] 1 1 4 1 1 1 1 1 1 1 1 1 1 4 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1

table(ccs9)

## ccs9
##   1   2   3   4   5   6   7   8   9 
## 169   1   1   7   1   1   1   4   1

To plot the pruned tree that shows the 9 clusters on top of the region.

plot(clust9, coordinates(mapz_final_sp), cex.lab=.7,
     groups.colors=c("blue","green3","red", "brown", "pink","yellow","gray","cyan"), cex.circles=0.005)

plot(mapz_final_sp, border=gray(.5), add=TRUE)

Visualising the clusters in choropleth map

The code chunk below is used to plot the newly derived clusters by using SKATER method.

groups_mat <- as.matrix(clust9$groups)
mapz_final_cluster_spatialcluster <- cbind(mapz_final_cluster, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(mapz_final_cluster_spatialcluster, "SP_CLUSTER")

We can now see that it is grouping contiguous objects that are similar into new aggregate areal unit.

Comparison : Hierarchical Clustering Vs Spatially Constrained Hierarchical Clustering Maps

For comparison purpose, it will be better to place both the hierarchical clustering and spatially constrained hierarchical clustering maps next to each other.

hclust.map <- qtm(mapz_final_cluster,
                  "CLUSTER") + 
  tm_borders(alpha = 0.5) 

shclust.map <- qtm(mapz_final_cluster_spatialcluster,
                   "SP_CLUSTER") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(hclust.map, shclust.map,
             asp=NA, ncol=2)

## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

From the comparison map, it shows that the derived clusters by using SKATER method are not fragmented compared to the Hierarchical Clustering and it also shows that cluster 1 is the biggest cluster in the São Paulo Macrometropolis.This also means that the distribution on the type of the industry are similar across the São Paulo Macrometropolis since cluster 1 cover most part of the São Paulo Macrometropolis.

TakeHomeEx03-PengSzuHua

TakeHomeEx03-PengSzuHua

Background

Packages Installation

Data Preparation

Data Sources

Data Import

Data Wrangling

Exploratory Data Analysis (EDA)

EDA　statistical graphics

EDA using choropleth Map

EDA Correlation Analysis

Hierarchy Cluster Analysis

Extrating clustering variables

Data Standardisation

Min-Max standardisation

Visualising the standardised clustering variables

Computing proximity matrix

Computing hierarchical clustering

Selecting the optimal clustering algorithm

Determining Optimal Clusters

Visually-driven hierarchical clustering analysis

Transforming the data frame into a matrix

Interactive cluster heatmap

Mapping the clusters formed

Spatially Constrained Clustering - SKATER approach

Converting into SpatialPolygonsDataFrame

Computing Neighbour List

Minimum Spanning Tree

Calculating edge costs

Computing minimum spanning tree

Computing spatially constrained clusters using SKATER method

Visualising the clusters in choropleth map

Comparison : Hierarchical Clustering Vs Spatially Constrained Hierarchical Clustering Maps

TakeHomeEx03-PengSzuHua

TakeHomeEx03-PengSzuHua

Background

Packages Installation

Data Preparation

Data Sources

Data Import

Data Wrangling

Exploratory Data Analysis (EDA)

EDA statistical graphics

EDA using choropleth Map

EDA Correlation Analysis

Hierarchy Cluster Analysis

Extrating clustering variables

Data Standardisation

Min-Max standardisation

Visualising the standardised clustering variables

Computing proximity matrix

Computing hierarchical clustering

Selecting the optimal clustering algorithm

Determining Optimal Clusters

Visually-driven hierarchical clustering analysis

Transforming the data frame into a matrix

Interactive cluster heatmap

Mapping the clusters formed

Spatially Constrained Clustering - SKATER approach

Converting into SpatialPolygonsDataFrame

Computing Neighbour List

Minimum Spanning Tree

Calculating edge costs

Computing minimum spanning tree

Computing spatially constrained clusters using SKATER method

Visualising the clusters in choropleth map

Comparison : Hierarchical Clustering Vs Spatially Constrained Hierarchical Clustering Maps

EDA　statistical graphics