1 Setting the Scene

Brazil is the world’s fifth-largest country by area and the sixth most populous. Brazil is classified as an upper-middle income economy by the World Bank and a developing country, with the largest share of global wealth in Latin America. It is considered an advanced emerging economy. It has the ninth largest GDP in the world by nominal, and eighth by PPP measures. Behind all these impressive figures, the spatial development of Brazil is highly unequal as shown in the figure on the right. In 2016, the GDP per capita of the poorest municipality is R$3,190.6. On the other hand, the GDP per capita of the richest municipality is R$314,638. Half of the municipalities have GDP per capita less than R$16,000 and the top 25% municipalities earn R$26,155 and above.

In recent year, our government has been encouraging local enterprises to venture into international market and one of the highly recommended country is Brazil (refer to https://www.enterprisesg.gov.sg/overseas-markets/north-latin-america/brazil/market-profile, https://www.enterprisesg.gov.sg/blog/latin-america-tech-landscape, and https://www.enterprisesg.gov.sg/blog/where-to-begin-in-latin-america). However, in view of the spatial development of Brazil that is highly unequal, any company interested to invest in Brazil must have a good understanding of the geographical specialisation and competitiveness of industry in the country.

1.1 The Task

to segment São Paulo Macrometropolis at the municipality level into homogeneous industry specialization clusters by using data provided by Brazilian Institute of Geography and Statistics (IBGE) (https://www.ibge.gov.br/en/homeeng. html).

The specific tasks of the analysis are as follows:

  • Preparing a series of choropleth maps showing the distribution of spatial specialisation by industry type, 2016 at municipality level.
  • Delineating industry specialisation clusters by using hierarchical clustering method and display their distribution by using appropriate thematic mapping technique.
  • Delineating industry specialisation clusters by using spatially constrained clustering method and display their distribution by using appropriate thematic mapping technique.

1.2 The Data

For the purpose of this study, two data sets are provided. They are:

  • BRAZIL_CITIES.csv. This data file consists of 81 columns and 5573 rows. Each row representing one municipality.

  • Data_Dictionary.csv. This file provides meta data of each columns in BRAZIL_CITIES.csv.

Both data files were downloaded from Kaggle (https://www.kaggle.com/crisparada/braziliancities).

Beside these two data files, you are also required to obtain a copy of 2016 municipality boundary file and a copy of the metropolitan boundary file by using an awesome R package called geobr (https://cran.r-project.org/web/packages/geobr/). Before using the package, you are highly recommended to read the vignettes (https://cran.rproject. org/web/packages/geobr/vignettes/intro_to_geobr.html) and the relevant section of the user guide.

1.3 Installing and loading R packages

Before we get started, it is important for us to install the necessary R packages into R and launch these R packages into R environment.

The R packages needed for this exercise are as follows:

  • Spatial data handling
    • geobr, sf, and spdep
  • Geospatial analysis package
    • ClustGeo
  • Attribute data handling
    • tidyverse, especially readr, ggplot2 and dplyr
  • Choropleth mapping
    • tmap
  • comparison handling
    • arsenal The code chunks below installs and launches these R packages into R environment.
packages = c('rgdal', 'spdep', 'ClustGeo',  'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse', 'geobr','arsenal','factoextra','NbClust')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Note: - With tidyverse, we do not have to install readr, ggplot2 and dplyr packages separately. In fact, tidyverse also installed other very useful R packages such as tidyr.

  • cluster package help us to compare the performance of different clustering algorithm.

1.4 New tricks learnt this take home exercise

  • You can insert an R code chunk either using the RStudio toolbar (the Insert button) or the keyboard shortcut Ctrl + Alt + I (Cmd + Option + I on macOS). Source here

2 Data Import and Preparation

2.1 Importing aspatial data into R environment

The csv files will be import using read.csv function of Base R functions.

Initially read_csv function of readr package was used to read the csv file into the environment. However, as the read_csv function was unable to separate the data effectively using the “;” as the separator, read.csv function was used and it has yielded the desired outcome. Besides that, any none English observations or observation with no value in the dataframe will not be read into R.

The code chunks used are shown below:

#encoding = UTF-8 is set to account for the unique portguese names and labels 
Brazil_cities <- read.csv("data/aspatial/BRAZIL_CITIES.csv", sep = ";", encoding = "UTF-8")
colnames(Brazil_cities)
##  [1] "CITY"                   "STATE"                  "CAPITAL"               
##  [4] "IBGE_RES_POP"           "IBGE_RES_POP_BRAS"      "IBGE_RES_POP_ESTR"     
##  [7] "IBGE_DU"                "IBGE_DU_URBAN"          "IBGE_DU_RURAL"         
## [10] "IBGE_POP"               "IBGE_1"                 "IBGE_1.4"              
## [13] "IBGE_5.9"               "IBGE_10.14"             "IBGE_15.59"            
## [16] "IBGE_60."               "IBGE_PLANTED_AREA"      "IBGE_CROP_PRODUCTION_."
## [19] "IDHM.Ranking.2010"      "IDHM"                   "IDHM_Renda"            
## [22] "IDHM_Longevidade"       "IDHM_Educacao"          "LONG"                  
## [25] "LAT"                    "ALT"                    "PAY_TV"                
## [28] "FIXED_PHONES"           "AREA"                   "REGIAO_TUR"            
## [31] "CATEGORIA_TUR"          "ESTIMATED_POP"          "RURAL_URBAN"           
## [34] "GVA_AGROPEC"            "GVA_INDUSTRY"           "GVA_SERVICES"          
## [37] "GVA_PUBLIC"             "GVA_TOTAL"              "TAXES"                 
## [40] "GDP"                    "POP_GDP"                "GDP_CAPITA"            
## [43] "GVA_MAIN"               "MUN_EXPENDIT"           "COMP_TOT"              
## [46] "COMP_A"                 "COMP_B"                 "COMP_C"                
## [49] "COMP_D"                 "COMP_E"                 "COMP_F"                
## [52] "COMP_G"                 "COMP_H"                 "COMP_I"                
## [55] "COMP_J"                 "COMP_K"                 "COMP_L"                
## [58] "COMP_M"                 "COMP_N"                 "COMP_O"                
## [61] "COMP_P"                 "COMP_Q"                 "COMP_R"                
## [64] "COMP_S"                 "COMP_T"                 "COMP_U"                
## [67] "HOTELS"                 "BEDS"                   "Pr_Agencies"           
## [70] "Pu_Agencies"            "Pr_Bank"                "Pu_Bank"               
## [73] "Pr_Assets"              "Pu_Assets"              "Cars"                  
## [76] "Motorcycles"            "Wheeled_tractor"        "UBER"                  
## [79] "MAC"                    "WAL.MART"               "POST_OFFICES"
#encoding = UTF-8 is set to account for the unique portguese names and labels 
Brazil_metadata <- read.csv("data/aspatial/Data_Dictionary.csv", sep = ";", encoding = "UTF-8")
colnames(Brazil_metadata)
## [1] "X.U.FEFF.FIELD" "DESCRIPTION"    "REFERENCE"      "UNIT"          
## [5] "SOURCE"         "X"

The code chunk below is to ascertain the data type of Brazil_cities and Brazil_metadata.

class(Brazil_cities)
## [1] "data.frame"
class(Brazil_metadata)
## [1] "data.frame"

The code chunk below reveal the summary statistics of Brazil_cities data.frame.

summary(Brazil_cities)
##      CITY              STATE              CAPITAL          IBGE_RES_POP     
##  Length:5573        Length:5573        Min.   :0.000000   Min.   :     805  
##  Class :character   Class :character   1st Qu.:0.000000   1st Qu.:    5235  
##  Mode  :character   Mode  :character   Median :0.000000   Median :   10934  
##                                        Mean   :0.004845   Mean   :   34278  
##                                        3rd Qu.:0.000000   3rd Qu.:   23424  
##                                        Max.   :1.000000   Max.   :11253503  
##                                                           NA's   :8         
##  IBGE_RES_POP_BRAS  IBGE_RES_POP_ESTR     IBGE_DU        IBGE_DU_URBAN    
##  Min.   :     805   Min.   :     0.0   Min.   :    239   Min.   :     60  
##  1st Qu.:    5230   1st Qu.:     0.0   1st Qu.:   1572   1st Qu.:    874  
##  Median :   10926   Median :     0.0   Median :   3174   Median :   1846  
##  Mean   :   34200   Mean   :    77.5   Mean   :  10303   Mean   :   8859  
##  3rd Qu.:   23390   3rd Qu.:    10.0   3rd Qu.:   6726   3rd Qu.:   4624  
##  Max.   :11133776   Max.   :119727.0   Max.   :3576148   Max.   :3548433  
##  NA's   :8          NA's   :8          NA's   :10        NA's   :10       
##  IBGE_DU_RURAL      IBGE_POP            IBGE_1            IBGE_1.4     
##  Min.   :    3   Min.   :     174   Min.   :     0.0   Min.   :     5  
##  1st Qu.:  487   1st Qu.:    2801   1st Qu.:    38.0   1st Qu.:   158  
##  Median :  931   Median :    6170   Median :    92.0   Median :   376  
##  Mean   : 1463   Mean   :   27595   Mean   :   383.3   Mean   :  1544  
##  3rd Qu.: 1832   3rd Qu.:   15302   3rd Qu.:   232.0   3rd Qu.:   951  
##  Max.   :33809   Max.   :10463636   Max.   :129464.0   Max.   :514794  
##  NA's   :81      NA's   :8          NA's   :8          NA's   :8       
##     IBGE_5.9        IBGE_10.14       IBGE_15.59         IBGE_60.      
##  Min.   :     7   Min.   :    12   Min.   :     94   Min.   :     29  
##  1st Qu.:   220   1st Qu.:   259   1st Qu.:   1734   1st Qu.:    341  
##  Median :   516   Median :   588   Median :   3841   Median :    722  
##  Mean   :  2069   Mean   :  2381   Mean   :  18212   Mean   :   3004  
##  3rd Qu.:  1300   3rd Qu.:  1478   3rd Qu.:   9628   3rd Qu.:   1724  
##  Max.   :684443   Max.   :783702   Max.   :7058221   Max.   :1293012  
##  NA's   :8        NA's   :8        NA's   :8         NA's   :8        
##  IBGE_PLANTED_AREA   IBGE_CROP_PRODUCTION_. IDHM.Ranking.2010      IDHM       
##  Min.   :      0.0   Min.   :      0        Min.   :   1      Min.   :0.4180  
##  1st Qu.:    910.2   1st Qu.:   2326        1st Qu.:1392      1st Qu.:0.5990  
##  Median :   3471.5   Median :  13846        Median :2783      Median :0.6650  
##  Mean   :  14179.9   Mean   :  57384        Mean   :2783      Mean   :0.6592  
##  3rd Qu.:  11194.2   3rd Qu.:  55619        3rd Qu.:4174      3rd Qu.:0.7180  
##  Max.   :1205669.0   Max.   :3274885        Max.   :5565      Max.   :0.8620  
##  NA's   :3           NA's   :3              NA's   :8         NA's   :8       
##    IDHM_Renda     IDHM_Longevidade IDHM_Educacao         LONG       
##  Min.   :0.4000   Min.   :0.6720   Min.   :0.2070   Min.   :-72.92  
##  1st Qu.:0.5720   1st Qu.:0.7690   1st Qu.:0.4900   1st Qu.:-50.87  
##  Median :0.6540   Median :0.8080   Median :0.5600   Median :-46.52  
##  Mean   :0.6429   Mean   :0.8016   Mean   :0.5591   Mean   :-46.23  
##  3rd Qu.:0.7070   3rd Qu.:0.8360   3rd Qu.:0.6310   3rd Qu.:-41.40  
##  Max.   :0.8910   Max.   :0.8940   Max.   :0.8250   Max.   :-32.44  
##  NA's   :8        NA's   :8        NA's   :8        NA's   :9       
##       LAT               ALT               PAY_TV         FIXED_PHONES    
##  Min.   :-33.688   Min.   :     0.0   Min.   :      1   Min.   :      3  
##  1st Qu.:-22.838   1st Qu.:   169.8   1st Qu.:     88   1st Qu.:    119  
##  Median :-18.089   Median :   406.5   Median :    247   Median :    327  
##  Mean   :-16.444   Mean   :   893.8   Mean   :   3094   Mean   :   6567  
##  3rd Qu.: -8.489   3rd Qu.:   628.9   3rd Qu.:    815   3rd Qu.:   1151  
##  Max.   :  4.585   Max.   :874579.0   Max.   :2047668   Max.   :5543127  
##  NA's   :9         NA's   :9          NA's   :3         NA's   :3        
##      AREA            REGIAO_TUR        CATEGORIA_TUR      ESTIMATED_POP     
##  Length:5573        Length:5573        Length:5573        Min.   :     786  
##  Class :character   Class :character   Class :character   1st Qu.:    5454  
##  Mode  :character   Mode  :character   Mode  :character   Median :   11590  
##                                                           Mean   :   37432  
##                                                           3rd Qu.:   25296  
##                                                           Max.   :12176866  
##                                                           NA's   :3         
##  RURAL_URBAN         GVA_AGROPEC       GVA_INDUSTRY       GVA_SERVICES      
##  Length:5573        Min.   :      0   Min.   :       1   Min.   :        2  
##  Class :character   1st Qu.:   4189   1st Qu.:    1726   1st Qu.:    10112  
##  Mode  :character   Median :  20426   Median :    7424   Median :    31211  
##                     Mean   :  47271   Mean   :  175928   Mean   :   489451  
##                     3rd Qu.:  51227   3rd Qu.:   41022   3rd Qu.:   115406  
##                     Max.   :1402282   Max.   :63306755   Max.   :464656988  
##                     NA's   :3         NA's   :3          NA's   :3          
##    GVA_PUBLIC         GVA_TOTAL             TAXES                GDP           
##  Min.   :       7   Min.   :       17   Min.   :   -14159   Min.   :       15  
##  1st Qu.:   17267   1st Qu.:    42253   1st Qu.:     1305   1st Qu.:    43709  
##  Median :   35866   Median :   119492   Median :     5100   Median :   125153  
##  Mean   :  123768   Mean   :   832987   Mean   :   118864   Mean   :   954584  
##  3rd Qu.:   89245   3rd Qu.:   313963   3rd Qu.:    22197   3rd Qu.:   329539  
##  Max.   :41902893   Max.   :569910503   Max.   :117125387   Max.   :687035890  
##  NA's   :3          NA's   :3           NA's   :3           NA's   :3          
##     POP_GDP           GDP_CAPITA       GVA_MAIN          MUN_EXPENDIT      
##  Min.   :     815   Min.   :  3191   Length:5573        Min.   :1.421e+06  
##  1st Qu.:    5483   1st Qu.:  9058   Class :character   1st Qu.:1.573e+07  
##  Median :   11578   Median : 15870   Mode  :character   Median :2.746e+07  
##  Mean   :   36998   Mean   : 21126                      Mean   :1.043e+08  
##  3rd Qu.:   25085   3rd Qu.: 26155                      3rd Qu.:5.666e+07  
##  Max.   :12038175   Max.   :314638                      Max.   :4.577e+10  
##  NA's   :3          NA's   :3                           NA's   :1492       
##     COMP_TOT            COMP_A            COMP_B            COMP_C        
##  Min.   :     6.0   Min.   :   0.00   Min.   :  0.000   Min.   :    0.00  
##  1st Qu.:    68.0   1st Qu.:   1.00   1st Qu.:  0.000   1st Qu.:    3.00  
##  Median :   162.0   Median :   2.00   Median :  0.000   Median :   11.00  
##  Mean   :   906.8   Mean   :  18.25   Mean   :  1.852   Mean   :   73.44  
##  3rd Qu.:   448.0   3rd Qu.:   8.00   3rd Qu.:  2.000   3rd Qu.:   39.00  
##  Max.   :530446.0   Max.   :1948.00   Max.   :274.000   Max.   :31566.00  
##  NA's   :3          NA's   :3         NA's   :3         NA's   :3         
##      COMP_D             COMP_E            COMP_F             COMP_G        
##  Min.   :  0.0000   Min.   :  0.000   Min.   :    0.00   Min.   :     1.0  
##  1st Qu.:  0.0000   1st Qu.:  0.000   1st Qu.:    1.00   1st Qu.:    32.0  
##  Median :  0.0000   Median :  0.000   Median :    4.00   Median :    74.5  
##  Mean   :  0.4262   Mean   :  2.029   Mean   :   43.26   Mean   :   348.0  
##  3rd Qu.:  0.0000   3rd Qu.:  1.000   3rd Qu.:   15.00   3rd Qu.:   199.0  
##  Max.   :332.0000   Max.   :657.000   Max.   :25222.00   Max.   :150633.0  
##  NA's   :3          NA's   :3         NA's   :3          NA's   :3         
##      COMP_H          COMP_I             COMP_J             COMP_K        
##  Min.   :    0   Min.   :    0.00   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    1   1st Qu.:    2.00   1st Qu.:    0.00   1st Qu.:    0.00  
##  Median :    7   Median :    7.00   Median :    1.00   Median :    0.00  
##  Mean   :   41   Mean   :   55.88   Mean   :   24.74   Mean   :   15.55  
##  3rd Qu.:   25   3rd Qu.:   24.00   3rd Qu.:    5.00   3rd Qu.:    2.00  
##  Max.   :19515   Max.   :29290.00   Max.   :38720.00   Max.   :23738.00  
##  NA's   :3       NA's   :3          NA's   :3          NA's   :3         
##      COMP_L             COMP_M             COMP_N            COMP_O       
##  Min.   :    0.00   Min.   :    0.00   Min.   :    0.0   Min.   :  0.000  
##  1st Qu.:    0.00   1st Qu.:    1.00   1st Qu.:    1.0   1st Qu.:  2.000  
##  Median :    0.00   Median :    4.00   Median :    4.0   Median :  2.000  
##  Mean   :   15.14   Mean   :   51.29   Mean   :   83.7   Mean   :  3.269  
##  3rd Qu.:    3.00   3rd Qu.:   13.00   3rd Qu.:   14.0   3rd Qu.:  3.000  
##  Max.   :14003.00   Max.   :49181.00   Max.   :76757.0   Max.   :204.000  
##  NA's   :3          NA's   :3          NA's   :3         NA's   :3        
##      COMP_P             COMP_Q             COMP_R            COMP_S        
##  Min.   :    0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.00  
##  1st Qu.:    2.00   1st Qu.:    1.00   1st Qu.:   0.00   1st Qu.:    5.00  
##  Median :    6.00   Median :    3.00   Median :   2.00   Median :   12.00  
##  Mean   :   30.96   Mean   :   34.15   Mean   :  12.18   Mean   :   51.61  
##  3rd Qu.:   17.00   3rd Qu.:   12.00   3rd Qu.:   6.00   3rd Qu.:   31.00  
##  Max.   :16030.00   Max.   :22248.00   Max.   :6687.00   Max.   :24832.00  
##  NA's   :3          NA's   :3          NA's   :3         NA's   :3         
##      COMP_T      COMP_U              HOTELS            BEDS        
##  Min.   :0   Min.   :  0.00000   Min.   : 1.000   Min.   :    2.0  
##  1st Qu.:0   1st Qu.:  0.00000   1st Qu.: 1.000   1st Qu.:   40.0  
##  Median :0   Median :  0.00000   Median : 1.000   Median :   82.0  
##  Mean   :0   Mean   :  0.05027   Mean   : 3.131   Mean   :  257.5  
##  3rd Qu.:0   3rd Qu.:  0.00000   3rd Qu.: 3.000   3rd Qu.:  200.0  
##  Max.   :0   Max.   :123.00000   Max.   :97.000   Max.   :13247.0  
##  NA's   :3   NA's   :3           NA's   :4686     NA's   :4686     
##   Pr_Agencies        Pu_Agencies         Pr_Bank          Pu_Bank    
##  Min.   :   0.000   Min.   :  0.000   Min.   : 0.000   Min.   :0.00  
##  1st Qu.:   0.000   1st Qu.:  1.000   1st Qu.: 0.000   1st Qu.:1.00  
##  Median :   1.000   Median :  2.000   Median : 1.000   Median :2.00  
##  Mean   :   3.383   Mean   :  2.829   Mean   : 1.312   Mean   :1.58  
##  3rd Qu.:   2.000   3rd Qu.:  2.000   3rd Qu.: 2.000   3rd Qu.:2.00  
##  Max.   :1693.000   Max.   :626.000   Max.   :83.000   Max.   :8.00  
##  NA's   :2231       NA's   :2231      NA's   :2231     NA's   :2231  
##    Pr_Assets           Pu_Assets              Cars          Motorcycles     
##  Min.   :0.000e+00   Min.   :0.000e+00   Min.   :      2   Min.   :      4  
##  1st Qu.:0.000e+00   1st Qu.:4.047e+07   1st Qu.:    602   1st Qu.:    591  
##  Median :3.231e+07   Median :1.339e+08   Median :   1438   Median :   1285  
##  Mean   :9.180e+09   Mean   :6.005e+09   Mean   :   9859   Mean   :   4879  
##  3rd Qu.:1.148e+08   3rd Qu.:4.970e+08   3rd Qu.:   4086   3rd Qu.:   3294  
##  Max.   :1.947e+13   Max.   :8.016e+12   Max.   :5740995   Max.   :1134570  
##  NA's   :2231        NA's   :2231        NA's   :11        NA's   :11       
##  Wheeled_tractor         UBER           MAC             WAL.MART     
##  Min.   :   0.000   Min.   :1      Min.   :  1.000   Min.   : 1.000  
##  1st Qu.:   0.000   1st Qu.:1      1st Qu.:  1.000   1st Qu.: 1.000  
##  Median :   0.000   Median :1      Median :  2.000   Median : 1.000  
##  Mean   :   5.754   Mean   :1      Mean   :  4.277   Mean   : 2.059  
##  3rd Qu.:   1.000   3rd Qu.:1      3rd Qu.:  3.000   3rd Qu.: 1.750  
##  Max.   :3236.000   Max.   :1      Max.   :130.000   Max.   :26.000  
##  NA's   :11         NA's   :5448   NA's   :5407      NA's   :5471    
##   POST_OFFICES    
##  Min.   :  1.000  
##  1st Qu.:  1.000  
##  Median :  1.000  
##  Mean   :  2.081  
##  3rd Qu.:  2.000  
##  Max.   :225.000  
##  NA's   :120

2.1.1 Extracting aspatial data corresponding to Sao Paulo Macrometropolis

unique(Brazil_cities$STATE) 
##  [1] "GO" "MG" "PA" "CE" "BA" "PR" "SC" "PE" "TO" "MA" "RN" "PI" "RS" "MT" "AC"
## [16] "SP" "ES" "AL" "PB" "MS" "RO" "RR" "AM" "AP" "SE" "RJ" "DF"

After checking through Brazil_cities dataframe, we confirm that “SP” state is Sao Paulo state that we would like to look at. Next, we will extract all observations that belong to this category with the code chunk below.

SP_state <- Brazil_cities %>% 
  filter(STATE == "SP")
#head(SP_state) 
# To get a look at the dataframe 

2.2 Importing geospatial data into R environment

In this section, we will import Brazil 2016 municipality boundary file GIS data and the metropolitan boundary GIS data into R environment.

The geobr package provides quick and easy access to official spatial data sets of Brazil. The syntax of all geobr functions operate on a simple logic that allows users to easily download a wide variety of data sets with updated geometries and harmonized attributes and geographic projections across geographies and years.

The documentation of different functions in the geobr package can be found here.

2.2.1 Download shape files of official metropolitan areas in Brazil as an sf object using read_metro_area function

Description The function returns the shapes of municipalities grouped by their respective metro areas. Metropolitan areas are created by each state in Brazil. The data set includes the municipalities that belong to all metropolitan areas in the country according to state legislation in each year. Orignal data were generated by Institute of Geography. Data at scale 1:250,000, using Geodetic reference system “SIRGAS2000” and CRS(4674).

Usage read_metro_area(year = 2018, simplified = TRUE, showProgress = TRUE)

Arguments year A year number in YYYY format (defaults to 2018) 20 read_micro_region

simplified Logic FALSE or TRUE, indicating whether the function returns the data set with ’original’ resolution or a data set with ’simplified’ borders (Defaults to TRUE). For spatial analysis and statistics users should set simplified = FALSE. Borders have been simplified by removing vertices of borders using st_simplifysf preserving topology with a dTolerance of 100.

showProgress Logical. Defaults to (TRUE) display progress bar

The code chunks used are shown below:

metropolitan <- read_metro_area(year = 2016, simplified = TRUE, showProgress = FALSE)
## Using year 2016

Extracting online the São Paulo Macrometropolis by specifying the state to be “SP”.

SP_metropolitan <- metropolitan %>%
  filter(abbrev_state == "SP")

The code chunk below is used to ascertain the total number of metropolis in the São Paulo Macrometropolis.

unique(SP_metropolitan$name_metro) 
## [1] "Aglomeração Urbana de Jundiaí"                  
## [2] "Aglomeração Urbana de Piracicaba-AU- Piracicaba"
## [3] "RM Baixada Santista"                            
## [4] "RM Campinas"                                    
## [5] "RM de Sorocaba"                                 
## [6] "RM do Vale do Paraíba e Litoral Norte"          
## [7] "RM São Paulo"

Through the output of the code chunk above, we can clearly see that Regional Unit of Bragança Paulista city is missing as we compare the list of name_metro against the list of Region available in Wikipedia. The list below are the 8 main regions in the São Paulo Macrometropolis:

  1. “RM São Paulo” = São Paulo
  2. “RM Campinas” = Campinas
  3. “RM do Vale do Paraíba e Litoral Norte” = São José dos Campos
  4. “RM de Sorocaba” = Sorocaba
  5. “RM Baixada Santista” = Santos
  6. “Aglomeração Urbana de Piracicaba-AU- Piracicaba” = Piracicaba
  7. “Aglomeração Urbana de Jundiaí” = Jundiaí
  8. Bragança Paulista

2.2.2 Plotting the Metropolitan boundary of Brazil in 2016

# plot
  ggplot() +
    geom_sf(data=SP_metropolitan, fill="#2D3E50", color="#FEBF57", size=.15, show.legend = FALSE) +
    labs(subtitle="Metropolitan of Sao Paolo Brazil, 2016", size=8) +
    theme_minimal()

Note: As stated previously, Regional Unit of Bragança Paulista city is the missing region in the Metropolitan data. thus we will be using another function from the geobr package, read_municipality to extract the required municipalities.

2.2.3 Download shape files of Brazilian municipalities as sf objects using read_municipality function

Arguments code_muni The 7-digit code of a municipality. If the two-digit code or a two-letter uppercase abbreviation of a state is passed, (e.g. 33 or “RJ”) the function will load all municipalities of that state. If code_muni=“all”, all municipalities of the country will be loaded.

year Year of the data (defaults to 2010)

simplified Logic FALSE or TRUE, indicating whether the function returns the data set with ’original’ resolution or a data set with ’simplified’ borders (Defaults to TRUE). For spatial analysis and statistics users should set simplified = FALSE. Borders have been simplified by removing vertices of borders using st_simplifysf preserving topology with a dTolerance of 100.

showProgress Logical. Defaults to (TRUE) display progress bar

The code chunks used are shown below:

# Read all municipalities of the country at a given year
mun <- read_municipality(year=2016, simplified = TRUE, showProgress = FALSE)
## Using year 2016
## Loading data for the whole country. This might take a few minutes.

Finding out the different states that are included in the municipality dataset.

unique(mun$abbrev_state) 
##  [1] "RO" "AC" "AM" "RR" "PA" "AP" "TO" "MA" "PI" "CE" "RN" "PB" "PE" "AL" "SE"
## [16] "BA" "MG" "ES" "RJ" "SP" "PR" "SC" "RS" "MS" "MT" "GO" "DF"

2.2.4 Plotting the Municipalities of Brazil in 2016

# plot
ggplot() +
  geom_sf(data=mun, fill="#2D3E50", color="#FEBF57", size=.15, show.legend = TRUE) +
  labs(subtitle="Municipalities of Brazil, 2016", size=8) +
  theme_minimal() 

Now we will try to extract the Regional Unit of Bragança Paulista city from the mun dataset using the code chunk below:

Bra_paulista <- mun %>%
  filter(name_muni == "Bragança Paulista")
#Bragança Paulista

We will also extract the São Paulo state area to visualize whether our gazette region of interest is indeed in the correct location.

Sao_Paulo <- mun %>%
  filter(abbrev_state == "SP")

2.2.5 Plotting the combined layer of Sao_Paulo, SP_metropolitan and Regional Unit of Bragança Paulista city

ggplot() +
    geom_sf(data=Sao_Paulo, fill="#4AE107", color="#0941E7", size=.15, show.legend = TRUE, ) +
    geom_sf(data=SP_metropolitan, fill="#E90729", color="#E6E907", size=.15, show.legend = TRUE, ) +
    labs(subtitle="Metropolis of Brazil, 2016", size=8) +
    theme_minimal() +
    geom_sf(data=Bra_paulista, fill="#FF00CC", color="#030003", size=.15, show.legend = FALSE) 

Note: This is a mistake whereby only the Bragança Paulista municipality is gazetted instead of the whole Regional Unit of Bragança Paulista city which includes:

  1. Atibaia
  2. Bom Jesus dos Perdões
  3. Bragança Paulista
  4. Itatiba
  5. Jarinu
  6. Joanópolis
  7. Morungaba
  8. Nazaré Paulista
  9. Piracaia
  10. Tuiuti
  11. Vargem

These information can be found from here

2.2.6 Extracting the correct municipalities belonging to the Regional Unit of Bragança Paulista city

First we will create a vector of the names of municipalities in the Regional Unit of Bragança Paulista city using the code chunk below:

BP_munis <- c("Atibaia",
              "Bom Jesus dos Perdões",
              "Bragança Paulista",
              "Itatiba",
              "Jarinu",
              "Joanópolis",
              "Morungaba",
              "Nazaré Paulista",
              "Piracaia",
              "Tuiuti",
              "Vargem"
              )

Then, we will extract the relevant municipalities using the name as the index

BP_munis_gis <- Sao_Paulo %>% 
  filter(tolower(name_muni) %in% tolower(BP_munis))
BP_munis_gis
## Simple feature collection with 11 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -46.92735 ymin: -23.31611 xmax: -46.04105 ymax: -22.76762
## geographic CRS: SIRGAS 2000
## First 10 features:
##    code_muni             name_muni code_state abbrev_state
## 1    3504107               Atibaia         35           SP
## 2    3507100 Bom Jesus Dos Perdões         35           SP
## 3    3507605     Bragança Paulista         35           SP
## 4    3523404               Itatiba         35           SP
## 5    3525201                Jarinu         35           SP
## 6    3525508            Joanópolis         35           SP
## 7    3532009             Morungaba         35           SP
## 8    3532405       Nazaré Paulista         35           SP
## 9    3538600              Piracaia         35           SP
## 10   3554953                Tuiuti         35           SP
##                              geom
## 1  MULTIPOLYGON (((-46.66146 -...
## 2  MULTIPOLYGON (((-46.47303 -...
## 3  MULTIPOLYGON (((-46.50858 -...
## 4  MULTIPOLYGON (((-46.76852 -...
## 5  MULTIPOLYGON (((-46.71471 -...
## 6  MULTIPOLYGON (((-46.14383 -...
## 7  MULTIPOLYGON (((-46.80227 -...
## 8  MULTIPOLYGON (((-46.26139 -...
## 9  MULTIPOLYGON (((-46.34589 -...
## 10 MULTIPOLYGON (((-46.6655 -2...

Note: The confirmation that the extraction is done well when the BP_munis_gis returns 11 observations which is aligned with the total number of municipalities in the Regional Unit of Bragança Paulista city. This comparison can also be done by comparing with the length of BP_munis vector.

2.2.7 Concatenating the Sao Paulo Metropolitan Boundary with the Municipalities from the Regional Unit of Bragança Paulista city

SP_area <- bind_rows(SP_metropolitan, BP_munis_gis)

Checking for duplication of data

duplicated(SP_area$name_muni)
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [169]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE

Check the number of rows in the dataframe

nrow(SP_area)
## [1] 175

Since there are duplication, let’s us examine the data that are duplicated

duplicated_rows <- SP_area[duplicated(SP_area$name_muni),]
duplicated_names <- duplicated_rows$name_muni
SP_area %>% 
  filter(name_muni %in% duplicated_names)
## Simple feature collection with 6 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -46.92735 ymin: -23.21495 xmax: -46.65393 ymax: -22.83401
## geographic CRS: SIRGAS 2000
##   code_muni name_muni code_state abbrev_state                    name_metro
## 1   3525201    Jarinu         35           SP Aglomeração Urbana de Jundiaí
## 2   3523404   Itatiba         35           SP                   RM Campinas
## 3   3532009 Morungaba         35           SP                   RM Campinas
## 4   3523404   Itatiba         35           SP                          <NA>
## 5   3525201    Jarinu         35           SP                          <NA>
## 6   3532009 Morungaba         35           SP                          <NA>
##   type subdivision            legislation legislation_date
## 1 AGLO     NÃO TEM Lei Complementar 1.146       24.08.2011
## 2   RM     NÃO TEM   Lei Complementar 870       19.06.2000
## 3   RM     NÃO TEM Lei Complementar 1.234       14.03.2014
## 4 <NA>        <NA>                   <NA>             <NA>
## 5 <NA>        <NA>                   <NA>             <NA>
## 6 <NA>        <NA>                   <NA>             <NA>
##                             geom
## 1 MULTIPOLYGON (((-46.71471 -...
## 2 MULTIPOLYGON (((-46.76852 -...
## 3 MULTIPOLYGON (((-46.80227 -...
## 4 MULTIPOLYGON (((-46.76852 -...
## 5 MULTIPOLYGON (((-46.71471 -...
## 6 MULTIPOLYGON (((-46.80227 -...

Through rigorous examination, it seems that 3 municipalities has been recognised to be in the Regional Unit of Bragança Paulista city instead of their previously identified metropolis. This information can be checked using this link and by examining the output above individually.

We will then proceed to drop the duplicated rows

SP_area <- SP_area[!duplicated(SP_area$name_muni),]

And perform checks for duplication again:

duplicated(SP_area$name_muni)
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE

Check the number of row in the dataframe after the removal of duplication. Since the number is 172 which is 3 rows lesser than the un-processed SP_area dataframe, the duplication removal has been performed effectively.

nrow(SP_area)
## [1] 172

2.2.8 Plotting the complete combined layer of Sao_Paulo, SP_metropolitan and Regional Unit of Bragança Paulista city

g1<- ggplot() +
    geom_sf(data=Sao_Paulo, fill="#4AE107", color="#0941E7", size=.15, show.legend = TRUE, ) +
    geom_sf(data=SP_metropolitan, fill="#E90729", color="#E6E907", size=.15, show.legend = TRUE, ) +
    labs(subtitle="Metropolis of Brazil, 2016", size=8) +
    theme_minimal() +
    geom_sf(data=Bra_paulista, fill="#FF00CC", color="#030003", size=.15, show.legend = FALSE) 


g2 <- ggplot() +
    geom_sf(data=Sao_Paulo, fill="#4AE107", color="#0941E7", size=.15, show.legend = TRUE, ) +
    geom_sf(data=SP_area, fill="#E90729", color="#E6E907", size=.15, show.legend = TRUE, ) +
    labs(subtitle="Metropolis of Brazil (Complete), 2016", size=8) +
    theme_minimal() 


ggarrange(g1, g2, widths = c(2,2))

2.3 Performing join of Apastial and Geospatial data

Joining of geospatial and aspatial data needs to be done to ultimately assist in the analysis of distribution of spatial specialisation by industry type. However, due to the lack of common data field in both geospatial and aspatial dataframe, the challenge lies in the join process.

2.3.1 Inner_join

SP_inner_join <- SP_area %>% inner_join(SP_state, by = c("abbrev_state" = "STATE", "name_muni" = "CITY"))

Counting the number of observations in the newly formed inner joined dataframe

nrow(SP_inner_join)
## [1] 171

Recall that the geospatial dataframe SP_area has 172 observations, we will need to identify which municipality has been removed.

nrow(SP_area)-nrow(SP_inner_join)
## [1] 1

2.3.2 Identifying the removed municipality

geo <- SP_area$name_muni
aspatial <- SP_state$CITY
setdiff(geo, aspatial)
## [1] "Santa Bárbara D'oeste"

After much inspection, SP_area’s municipality saved name was Santa Bárbara D'oeste while SP_state’s CITY name is saved as Santa Bárbara D'Oeste. Due to the capitalized O in the city name in SP_state, the inner join function was unsuccessful. Thus we will rename the city name in SP_state to match the one in SP_area using the code chunk below:

SP_state <- within(SP_state, CITY[CITY =="Santa Bárbara D'Oeste"] <- "Santa Bárbara D'oeste")

Inner_join

SP_inner_join <- SP_area %>% inner_join(SP_state, by = c("abbrev_state" = "STATE", "name_muni" = "CITY"))

Counting the number of observations in the newly formed inner joined dataframe

nrow(SP_inner_join)
## [1] 172
nrow(SP_area)-nrow(SP_inner_join)
## [1] 0

Since the difference in the number of rows in both dataframe is zero, the inner join process is done correctly.

3 Analysis

3.1 Selecting relevant columns from Brazil_cities

names(Brazil_cities)
##  [1] "CITY"                   "STATE"                  "CAPITAL"               
##  [4] "IBGE_RES_POP"           "IBGE_RES_POP_BRAS"      "IBGE_RES_POP_ESTR"     
##  [7] "IBGE_DU"                "IBGE_DU_URBAN"          "IBGE_DU_RURAL"         
## [10] "IBGE_POP"               "IBGE_1"                 "IBGE_1.4"              
## [13] "IBGE_5.9"               "IBGE_10.14"             "IBGE_15.59"            
## [16] "IBGE_60."               "IBGE_PLANTED_AREA"      "IBGE_CROP_PRODUCTION_."
## [19] "IDHM.Ranking.2010"      "IDHM"                   "IDHM_Renda"            
## [22] "IDHM_Longevidade"       "IDHM_Educacao"          "LONG"                  
## [25] "LAT"                    "ALT"                    "PAY_TV"                
## [28] "FIXED_PHONES"           "AREA"                   "REGIAO_TUR"            
## [31] "CATEGORIA_TUR"          "ESTIMATED_POP"          "RURAL_URBAN"           
## [34] "GVA_AGROPEC"            "GVA_INDUSTRY"           "GVA_SERVICES"          
## [37] "GVA_PUBLIC"             "GVA_TOTAL"              "TAXES"                 
## [40] "GDP"                    "POP_GDP"                "GDP_CAPITA"            
## [43] "GVA_MAIN"               "MUN_EXPENDIT"           "COMP_TOT"              
## [46] "COMP_A"                 "COMP_B"                 "COMP_C"                
## [49] "COMP_D"                 "COMP_E"                 "COMP_F"                
## [52] "COMP_G"                 "COMP_H"                 "COMP_I"                
## [55] "COMP_J"                 "COMP_K"                 "COMP_L"                
## [58] "COMP_M"                 "COMP_N"                 "COMP_O"                
## [61] "COMP_P"                 "COMP_Q"                 "COMP_R"                
## [64] "COMP_S"                 "COMP_T"                 "COMP_U"                
## [67] "HOTELS"                 "BEDS"                   "Pr_Agencies"           
## [70] "Pu_Agencies"            "Pr_Bank"                "Pu_Bank"               
## [73] "Pr_Assets"              "Pu_Assets"              "Cars"                  
## [76] "Motorcycles"            "Wheeled_tractor"        "UBER"                  
## [79] "MAC"                    "WAL.MART"               "POST_OFFICES"
names(SP_inner_join)
##  [1] "code_muni"              "name_muni"              "code_state"            
##  [4] "abbrev_state"           "name_metro"             "type"                  
##  [7] "subdivision"            "legislation"            "legislation_date"      
## [10] "CAPITAL"                "IBGE_RES_POP"           "IBGE_RES_POP_BRAS"     
## [13] "IBGE_RES_POP_ESTR"      "IBGE_DU"                "IBGE_DU_URBAN"         
## [16] "IBGE_DU_RURAL"          "IBGE_POP"               "IBGE_1"                
## [19] "IBGE_1.4"               "IBGE_5.9"               "IBGE_10.14"            
## [22] "IBGE_15.59"             "IBGE_60."               "IBGE_PLANTED_AREA"     
## [25] "IBGE_CROP_PRODUCTION_." "IDHM.Ranking.2010"      "IDHM"                  
## [28] "IDHM_Renda"             "IDHM_Longevidade"       "IDHM_Educacao"         
## [31] "LONG"                   "LAT"                    "ALT"                   
## [34] "PAY_TV"                 "FIXED_PHONES"           "AREA"                  
## [37] "REGIAO_TUR"             "CATEGORIA_TUR"          "ESTIMATED_POP"         
## [40] "RURAL_URBAN"            "GVA_AGROPEC"            "GVA_INDUSTRY"          
## [43] "GVA_SERVICES"           "GVA_PUBLIC"             "GVA_TOTAL"             
## [46] "TAXES"                  "GDP"                    "POP_GDP"               
## [49] "GDP_CAPITA"             "GVA_MAIN"               "MUN_EXPENDIT"          
## [52] "COMP_TOT"               "COMP_A"                 "COMP_B"                
## [55] "COMP_C"                 "COMP_D"                 "COMP_E"                
## [58] "COMP_F"                 "COMP_G"                 "COMP_H"                
## [61] "COMP_I"                 "COMP_J"                 "COMP_K"                
## [64] "COMP_L"                 "COMP_M"                 "COMP_N"                
## [67] "COMP_O"                 "COMP_P"                 "COMP_Q"                
## [70] "COMP_R"                 "COMP_S"                 "COMP_T"                
## [73] "COMP_U"                 "HOTELS"                 "BEDS"                  
## [76] "Pr_Agencies"            "Pu_Agencies"            "Pr_Bank"               
## [79] "Pu_Bank"                "Pr_Assets"              "Pu_Assets"             
## [82] "Cars"                   "Motorcycles"            "Wheeled_tractor"       
## [85] "UBER"                   "MAC"                    "WAL.MART"              
## [88] "POST_OFFICES"           "geom"

3.2 Location Quotient calculation for each industry

Formula of Location Quotient (LQ) Location Quotient Let assuming that you are calculating LQ for industry a for municipality i. The formula should be: (number of industry a in municipality i / sum of all industry in municipality i)/(number of industry a in Greater Sao Paulo/sum of all industry in Greater Sao Paulo)

LQ <- SP_inner_join %>%
  mutate(INDUSTRY_A = (`COMP_A`/`COMP_TOT`)/(sum(SP_inner_join$COMP_A)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_B = (`COMP_B`/`COMP_TOT`)/(sum(SP_inner_join$COMP_B)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_C = (`COMP_C`/`COMP_TOT`)/(sum(SP_inner_join$COMP_C)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_D = (`COMP_D`/`COMP_TOT`)/(sum(SP_inner_join$COMP_D)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_E = (`COMP_E`/`COMP_TOT`)/(sum(SP_inner_join$COMP_E)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_F = (`COMP_F`/`COMP_TOT`)/(sum(SP_inner_join$COMP_F)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_G = (`COMP_G`/`COMP_TOT`)/(sum(SP_inner_join$COMP_G)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_H = (`COMP_H`/`COMP_TOT`)/(sum(SP_inner_join$COMP_H)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_I = (`COMP_I`/`COMP_TOT`)/(sum(SP_inner_join$COMP_I)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_J = (`COMP_J`/`COMP_TOT`)/(sum(SP_inner_join$COMP_J)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_K = (`COMP_K`/`COMP_TOT`)/(sum(SP_inner_join$COMP_K)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_L = (`COMP_L`/`COMP_TOT`)/(sum(SP_inner_join$COMP_L)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_M = (`COMP_M`/`COMP_TOT`)/(sum(SP_inner_join$COMP_M)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_N = (`COMP_N`/`COMP_TOT`)/(sum(SP_inner_join$COMP_N)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_O = (`COMP_O`/`COMP_TOT`)/(sum(SP_inner_join$COMP_O)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_P = (`COMP_P`/`COMP_TOT`)/(sum(SP_inner_join$COMP_P)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_Q = (`COMP_Q`/`COMP_TOT`)/(sum(SP_inner_join$COMP_Q)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_R = (`COMP_R`/`COMP_TOT`)/(sum(SP_inner_join$COMP_R)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_S = (`COMP_S`/`COMP_TOT`)/(sum(SP_inner_join$COMP_S)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_T = (`COMP_T`/`COMP_TOT`)/(sum(SP_inner_join$COMP_T)/max(SP_inner_join$COMP_TOT))) %>%
  mutate(INDUSTRY_U = (`COMP_U`/`COMP_TOT`)/(sum(SP_inner_join$COMP_U)/max(SP_inner_join$COMP_TOT))) %>%
  select(name_muni,IBGE_RES_POP,AREA, GVA_AGROPEC, GVA_INDUSTRY, GVA_SERVICES, GVA_PUBLIC, INDUSTRY_A,INDUSTRY_B,INDUSTRY_C,INDUSTRY_D,INDUSTRY_E,INDUSTRY_F,INDUSTRY_G,INDUSTRY_H,INDUSTRY_I,INDUSTRY_J,INDUSTRY_K,INDUSTRY_L,INDUSTRY_M,INDUSTRY_N,INDUSTRY_O,INDUSTRY_P,INDUSTRY_Q,INDUSTRY_R,INDUSTRY_S,INDUSTRY_T,INDUSTRY_U, geom)

summary(LQ)
##   name_muni          IBGE_RES_POP          AREA            GVA_AGROPEC    
##  Length:172         Min.   :    2493   Length:172         Min.   :     0  
##  Class :character   1st Qu.:   15764   Class :character   1st Qu.:  1091  
##  Mode  :character   Median :   46969   Mode  :character   Median : 11829  
##                     Mean   :  178402                      Mean   : 36855  
##                     3rd Qu.:  127853                      3rd Qu.: 32278  
##                     Max.   :11253503                      Max.   :630080  
##                                                                           
##   GVA_INDUSTRY       GVA_SERVICES         GVA_PUBLIC         INDUSTRY_A     
##  Min.   :       3   Min.   :       24   Min.   :      16   Min.   : 0.0000  
##  1st Qu.:   14666   1st Qu.:    68912   1st Qu.:   28712   1st Qu.: 0.1825  
##  Median :  161242   Median :   459650   Median :  154560   Median : 0.8230  
##  Mean   : 1452326   Mean   :  5332188   Mean   :  668834   Mean   : 3.2699  
##  3rd Qu.: 1182734   3rd Qu.:  2016584   3rd Qu.:  507929   3rd Qu.: 3.3586  
##  Max.   :63306755   Max.   :464656988   Max.   :41902893   Max.   :29.9330  
##                                                                             
##    INDUSTRY_B        INDUSTRY_C       INDUSTRY_D        INDUSTRY_E    
##  Min.   : 0.0000   Min.   :0.0000   Min.   : 0.0000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.:0.3315   1st Qu.: 0.0000   1st Qu.:0.3003  
##  Median : 0.6588   Median :0.5079   Median : 0.0000   Median :0.5594  
##  Mean   : 2.7200   Mean   :0.5767   Mean   : 0.3215   Mean   :0.7087  
##  3rd Qu.: 2.4599   3rd Qu.:0.7633   3rd Qu.: 0.0000   3rd Qu.:0.9236  
##  Max.   :43.4756   Max.   :1.7140   Max.   :22.5329   Max.   :6.3925  
##                                                                       
##    INDUSTRY_F       INDUSTRY_G       INDUSTRY_H       INDUSTRY_I    
##  Min.   :0.0000   Min.   :0.1160   Min.   :0.0000   Min.   :0.1364  
##  1st Qu.:0.3218   1st Qu.:0.4906   1st Qu.:0.3378   1st Qu.:0.4355  
##  Median :0.4289   Median :0.5451   Median :0.4675   Median :0.5024  
##  Mean   :0.4269   Mean   :0.5402   Mean   :0.5527   Mean   :0.5738  
##  3rd Qu.:0.5268   3rd Qu.:0.6069   3rd Qu.:0.6732   3rd Qu.:0.6213  
##  Max.   :0.9380   Max.   :1.0147   Max.   :1.8660   Max.   :2.1228  
##                                                                     
##    INDUSTRY_J       INDUSTRY_K        INDUSTRY_L       INDUSTRY_M    
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1026   1st Qu.:0.09882   1st Qu.:0.1333   1st Qu.:0.1644  
##  Median :0.1561   Median :0.15815   Median :0.2744   Median :0.2311  
##  Mean   :0.2118   Mean   :0.17675   Mean   :0.2772   Mean   :0.2533  
##  3rd Qu.:0.2250   3rd Qu.:0.23262   3rd Qu.:0.3742   3rd Qu.:0.3134  
##  Max.   :3.7910   Max.   :1.02673   Max.   :0.8712   Max.   :1.1660  
##                                                                      
##    INDUSTRY_N       INDUSTRY_O        INDUSTRY_P       INDUSTRY_Q    
##  Min.   :0.0000   Min.   : 0.1331   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1578   1st Qu.: 0.6281   1st Qu.:0.3314   1st Qu.:0.1780  
##  Median :0.2459   Median : 1.4407   Median :0.4794   Median :0.2689  
##  Mean   :0.2749   Mean   : 3.3555   Mean   :0.4699   Mean   :0.2875  
##  3rd Qu.:0.3447   3rd Qu.: 3.5280   3rd Qu.:0.6249   3rd Qu.:0.3677  
##  Max.   :1.2462   Max.   :64.2167   Max.   :1.0357   Max.   :0.8574  
##                                                                      
##    INDUSTRY_R       INDUSTRY_S        INDUSTRY_T    INDUSTRY_U     
##  Min.   :0.0000   Min.   :0.06971   Min.   : NA   Min.   :0.00000  
##  1st Qu.:0.3372   1st Qu.:0.34038   1st Qu.: NA   1st Qu.:0.00000  
##  Median :0.4422   Median :0.42615   Median : NA   Median :0.00000  
##  Mean   :0.4218   Mean   :0.43369   Mean   :NaN   Mean   :0.02169  
##  3rd Qu.:0.5389   3rd Qu.:0.51768   3rd Qu.: NA   3rd Qu.:0.00000  
##  Max.   :0.8975   Max.   :1.13087   Max.   : NA   Max.   :2.04474  
##                                     NA's   :172                    
##             geom    
##  MULTIPOLYGON :172  
##  epsg:4674    :  0  
##  +proj=long...:  0  
##                     
##                     
##                     
## 

Note: We will only select the columns stated in the code chunk below as we are investigating the distribution of spatial specialisation by industry type.

names(LQ)
##  [1] "name_muni"    "IBGE_RES_POP" "AREA"         "GVA_AGROPEC"  "GVA_INDUSTRY"
##  [6] "GVA_SERVICES" "GVA_PUBLIC"   "INDUSTRY_A"   "INDUSTRY_B"   "INDUSTRY_C"  
## [11] "INDUSTRY_D"   "INDUSTRY_E"   "INDUSTRY_F"   "INDUSTRY_G"   "INDUSTRY_H"  
## [16] "INDUSTRY_I"   "INDUSTRY_J"   "INDUSTRY_K"   "INDUSTRY_L"   "INDUSTRY_M"  
## [21] "INDUSTRY_N"   "INDUSTRY_O"   "INDUSTRY_P"   "INDUSTRY_Q"   "INDUSTRY_R"  
## [26] "INDUSTRY_S"   "INDUSTRY_T"   "INDUSTRY_U"   "geom"

Rename the columns

new_names <- c("NAME_MUNICIPALITY",
              "POPULATION", 
              "AREA",
              "GVA_AGROPEC",
              "GVA_INDUSTRY",
              "GVA_SERVICES",
              "GVA_PUBLIC",
              "COMP_AGRO",
              "COMP_EXTRACTIVE",
              "COMP_TRANSFORMATION",
              "COMP_GAS_ELECT",
              "COMP_WATER_SEWAGE",
              "COMP_CONSTRUCTION",
              "COMP_VEHICLE_REPAIR",
              "COMP_TRANSPORT_MAIL",
              "COMP_ACCOM_FOOD", 
              "COMP_INFOCOMM",
              "COMP_FINANCE",
              "COMP_REALESTATE",
              "COMP_SCIENCE_TECHNICAL",
              "COMP_ADMIN",
              "COMP_DEFENSE_SS",
              "COMP_EDUCATION",
              "COMP_HEALTH_SOCIAL_SERVICE",
              "COMP_ART_SPORT_RECRE",
              "COMP_OTHER_SERVICES",
              "COMP_DOMESTIC_SERVICES",
              "COMP_INTERNATIONAL_INSTI",
              "geom"
              )

colnames(LQ) <- new_names
colnames(LQ)
##  [1] "NAME_MUNICIPALITY"          "POPULATION"                
##  [3] "AREA"                       "GVA_AGROPEC"               
##  [5] "GVA_INDUSTRY"               "GVA_SERVICES"              
##  [7] "GVA_PUBLIC"                 "COMP_AGRO"                 
##  [9] "COMP_EXTRACTIVE"            "COMP_TRANSFORMATION"       
## [11] "COMP_GAS_ELECT"             "COMP_WATER_SEWAGE"         
## [13] "COMP_CONSTRUCTION"          "COMP_VEHICLE_REPAIR"       
## [15] "COMP_TRANSPORT_MAIL"        "COMP_ACCOM_FOOD"           
## [17] "COMP_INFOCOMM"              "COMP_FINANCE"              
## [19] "COMP_REALESTATE"            "COMP_SCIENCE_TECHNICAL"    
## [21] "COMP_ADMIN"                 "COMP_DEFENSE_SS"           
## [23] "COMP_EDUCATION"             "COMP_HEALTH_SOCIAL_SERVICE"
## [25] "COMP_ART_SPORT_RECRE"       "COMP_OTHER_SERVICES"       
## [27] "COMP_DOMESTIC_SERVICES"     "COMP_INTERNATIONAL_INSTI"  
## [29] "geom"

Since COMP_DOMESTIC_SERVICES, previously known as INDUSTRY_T has 172 NA values, it will be removed from the dataframe.

SP_mun <- LQ %>%
  select(-COMP_DOMESTIC_SERVICES)

Now, we will perform data aggregation to aggregate different industry into bigger categories.

3.3 Data aggregation

According to Investopedia, we will group the attributes in our dataframe into 4 main categories.

We will group the Variables from COMP_AGROPEC to COMP_INTERNATIONAL_INSTI into 4 categories - AGROPEC, INDUSTRY, SERVICES & PUBLIC; for consistency with the GVA Categories that we have. I used the definitions from this to guide me through assigning the sectors to the main 4 categories.

SP_mun_aggre <- SP_mun %>% 
  mutate(`COMP_AGROPEC` = `COMP_AGRO` + `COMP_EXTRACTIVE`) %>%
  mutate(`COMP_INDUSTRY` = `COMP_GAS_ELECT` + `COMP_CONSTRUCTION` + `COMP_TRANSFORMATION`) %>%
  mutate(`COMP_SERVICES` =  `COMP_WATER_SEWAGE` + `COMP_VEHICLE_REPAIR` + `COMP_TRANSPORT_MAIL` + `COMP_INFOCOMM` + `COMP_FINANCE` + `COMP_REALESTATE` + `COMP_SCIENCE_TECHNICAL` + `COMP_ADMIN`+`COMP_ACCOM_FOOD` +`COMP_OTHER_SERVICES` + `COMP_INTERNATIONAL_INSTI`) %>%
  mutate(`COMP_PUBLIC` = `COMP_DEFENSE_SS` + `COMP_EDUCATION` + `COMP_HEALTH_SOCIAL_SERVICE` + `COMP_ART_SPORT_RECRE`) %>%
  select(NAME_MUNICIPALITY, POPULATION, AREA, GVA_AGROPEC, GVA_INDUSTRY, GVA_SERVICES, GVA_PUBLIC, COMP_AGROPEC, COMP_INDUSTRY, COMP_SERVICES, COMP_PUBLIC, geom)

4 EDA- Thematic Mapping of Industrial Variables

5 obtaining the distribution of each variable with histogram visualization

comp_agrop <- ggplot(data=SP_mun_aggre, 
             aes(x= `COMP_AGROPEC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

comp_indus <- ggplot(data=SP_mun_aggre, 
             aes(x= `COMP_INDUSTRY`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

comp_service <- ggplot(data=SP_mun_aggre, 
             aes(x= `COMP_SERVICES`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

comp_public <- ggplot(data=SP_mun_aggre, 
             aes(x= `COMP_PUBLIC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gva_agrop <- ggplot(data=SP_mun_aggre, 
             aes(x= `GVA_AGROPEC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
                 
gva_indus <- ggplot(data=SP_mun_aggre, 
             aes(x= `GVA_INDUSTRY`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gva_service <- ggplot(data=SP_mun_aggre, 
             aes(x= `GVA_SERVICES`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gva_public <- ggplot(data=SP_mun_aggre, 
             aes(x= `GVA_PUBLIC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ggarrange(comp_agrop, comp_indus, comp_service, comp_public,gva_agrop, gva_indus, gva_service, gva_public,
          ncol = 2,
          nrow = 4)

6 Chloropleth mapping

7 Boxmap

boxbreaks <- function(v,mult=1.5) { 
  qv <- unname(quantile(v,na.rm = TRUE))
  iqr <- qv[4] - qv[2]  
  upfence <- qv[4] + mult * iqr  
  lofence <- qv[2] - mult * iqr
  
  # initialize break points vector  
  bb <- vector(mode="numeric",length=7)
  # logic for lower and upper fences
  if (lofence < qv[1]) {  
    # no lower outliers    
    bb[1] <- lofence    
    bb[2] <- floor(qv[1])  
  } else {    
    bb[2] <- lofence    
    bb[1] <- qv[1]
  }
  
  if(upfence > qv[5]) {
  # no upper outliers   
    bb[7] <- upfence   
    bb[6] <- ceiling(qv[5]) 
  } else {   
    bb[6] <- upfence  
    bb[7] <- qv[5]
  } 
  bb[3:5] <- qv[2:4]
  return(bb)
}

get.var <- function(vname,df) { 
  v <- df[vname] %>% st_set_geometry(NULL)
  v <- unname(v[,1])
  return(v)
}

boxmap <- function(vnam,df,titlesize,legx, legy, legtitle=NA,mtitle="Box Map",mult=1.5){  
  var  <- get.var(vnam,df)
  bb <- boxbreaks(var)
  tm_shape(df) +    
    tm_fill(vnam,
            title=legtitle,
            breaks=bb,
            palette="-RdBu", 
            labels = c("lower outlier", "< 25%", "25% - 50%", "50% - 75%","> 75%", "upper outlier")
            ) +
    tm_fill(vnam, 
            df %>% 
              filter(is.na(vnam)), 
            labels = c("NA")
            ) +
  tm_borders(alpha =0.2) +
  tm_layout(main.title = paste("Box Map of ",toString(vnam)," Numbers by Municipality"),
            main.title.size = titlesize,
            main.title.position =  "center",
            frame = FALSE,
            legend.height = legx,
            legend.width = legy
            )
}

bm_qtm <- function(vnam,df,titlesize,legx, legy) {
  tm <- qtm(df, vnam, main.title = paste("Chloropleth of ",toString(vnam)," Numbers by Municipality"), frame = FALSE, main.title.size = titlesize, main.title.position = "center")
  bm <- boxmap(vnam, df, titlesize, legx,legy)
  tmap_arrange(tm,bm, nrow = 2)
}
names(SP_mun_aggre)
##  [1] "NAME_MUNICIPALITY" "POPULATION"        "AREA"             
##  [4] "GVA_AGROPEC"       "GVA_INDUSTRY"      "GVA_SERVICES"     
##  [7] "GVA_PUBLIC"        "COMP_AGROPEC"      "COMP_INDUSTRY"    
## [10] "COMP_SERVICES"     "COMP_PUBLIC"       "geom"

7.1 Agropecurary Sector

bm_qtm("GVA_AGROPEC", SP_mun_aggre, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

bm_qtm("COMP_AGROPEC", SP_mun_aggre, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

7.2 Industrial Sector

bm_qtm("GVA_INDUSTRY", SP_mun_aggre, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

bm_qtm("COMP_INDUSTRY", SP_mun_aggre, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

7.3 Services Sector

bm_qtm("GVA_SERVICES", SP_mun_aggre, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

bm_qtm("COMP_SERVICES", SP_mun_aggre, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

7.4 Public Sector

bm_qtm("GVA_PUBLIC", SP_mun_aggre, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

bm_qtm("COMP_PUBLIC", SP_mun_aggre, 0.8,0.5,0.5)
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).

8 Hierarchial Clustering Analysis

8.1 Extract Clustering Variables

cluster_df <- SP_mun_aggre %>%
  st_set_geometry(NULL) %>%
  select("NAME_MUNICIPALITY", "GVA_AGROPEC","GVA_INDUSTRY", "GVA_SERVICES", "GVA_PUBLIC", "COMP_AGROPEC" ,"COMP_INDUSTRY", "COMP_SERVICES", "COMP_PUBLIC")

head(cluster_df,10)
##       NAME_MUNICIPALITY GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC
## 1              Cabreúva    25784.75   1035198.80   2019585.37  190767.21
## 2  Campo Limpo Paulista        7.87    378706.96    811522.38  287912.58
## 3               Itupeva    39554.78   1448072.53   1934070.02  268109.10
## 4                Jarinu    41414.02    210690.00   1287417.11  123189.79
## 5               Jundiaí   130551.25   7409989.10  23110009.69    1842.82
## 6              Louveira    29487.12   2396464.90      6140.22  280391.40
## 7       Várzea Paulista     1872.03    680125.23    965083.82  391572.39
## 8    Águas De São Pedro        0.00     10504.94     94367.21   20211.04
## 9             Analândia       28.54     39254.36        44.73   25947.92
## 10               Araras   102135.70   1396340.77   2434354.36  491613.15
##    COMP_AGROPEC COMP_INDUSTRY COMP_SERVICES COMP_PUBLIC
## 1     2.5192077     1.6356377      5.028079    2.236264
## 2     0.4932410     1.4602773      3.508875    2.944918
## 3     2.3571852     2.3098980      5.043139    2.836002
## 4     3.6122034     1.1050816      3.958751    3.326196
## 5     1.0585721     1.0198368      4.785635    2.103140
## 6     1.0208685     1.3427092      4.712520    3.523391
## 7     0.1495956     1.9806734      4.419918    1.664542
## 8     4.8854000     0.4172801      4.754708    8.258216
## 9    25.1141667     0.6880818      4.184522    9.473083
## 10    3.6963572     1.4619303      3.955060    1.836958
row.names(cluster_df) <- cluster_df$"NAME_MUNI"
cluster_df <- select(cluster_df, c(2:7))
head(cluster_df,10)
##                      GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC
## Cabreúva                25784.75   1035198.80   2019585.37  190767.21
## Campo Limpo Paulista        7.87    378706.96    811522.38  287912.58
## Itupeva                 39554.78   1448072.53   1934070.02  268109.10
## Jarinu                  41414.02    210690.00   1287417.11  123189.79
## Jundiaí                130551.25   7409989.10  23110009.69    1842.82
## Louveira                29487.12   2396464.90      6140.22  280391.40
## Várzea Paulista          1872.03    680125.23    965083.82  391572.39
## Águas De São Pedro          0.00     10504.94     94367.21   20211.04
## Analândia                  28.54     39254.36        44.73   25947.92
## Araras                 102135.70   1396340.77   2434354.36  491613.15
##                      COMP_AGROPEC COMP_INDUSTRY
## Cabreúva                2.5192077     1.6356377
## Campo Limpo Paulista    0.4932410     1.4602773
## Itupeva                 2.3571852     2.3098980
## Jarinu                  3.6122034     1.1050816
## Jundiaí                 1.0585721     1.0198368
## Louveira                1.0208685     1.3427092
## Várzea Paulista         0.1495956     1.9806734
## Águas De São Pedro      4.8854000     0.4172801
## Analândia              25.1141667     0.6880818
## Araras                  3.6963572     1.4619303

8.2 Standardization

cluster_df.std <- normalize(cluster_df)
summary(cluster_df.std)
##   GVA_AGROPEC        GVA_INDUSTRY        GVA_SERVICES         GVA_PUBLIC       
##  Min.   :0.000000   Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000  
##  1st Qu.:0.001732   1st Qu.:0.0002316   1st Qu.:0.0001483   1st Qu.:0.0006848  
##  Median :0.018774   Median :0.0025469   Median :0.0009892   Median :0.0036881  
##  Mean   :0.058493   Mean   :0.0229410   Mean   :0.0114755   Mean   :0.0159611  
##  3rd Qu.:0.051229   3rd Qu.:0.0186825   3rd Qu.:0.0043399   3rd Qu.:0.0121212  
##  Max.   :1.000000   Max.   :1.0000000   Max.   :1.0000000   Max.   :1.0000000  
##   COMP_AGROPEC      COMP_INDUSTRY    
##  Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.009945   1st Qu.:0.02718  
##  Median :0.042914   Median :0.04153  
##  Mean   :0.113642   Mean   :0.05191  
##  3rd Qu.:0.160037   3rd Qu.:0.05820  
##  Max.   :1.000000   Max.   :1.00000

8.3 Proximity matrix

proxmat <- dist(cluster_df.std, method = "euclidean")

8.4 Hierarchical clustering

hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.3)

8.5 Identify the optimal number of cluster

set.seed(12345)
gap_stat <- clusGap(cluster_df.std, FUN = hcut, nstart = 25, K.max = 20, B =50)

print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = cluster_df.std, FUNcluster = hcut, K.max = 20, B = 50,     nstart = 25)
## B=50 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 1
##            logW   E.logW      gap     SE.sim
##  [1,] 2.3802150 3.781715 1.401500 0.01887029
##  [2,] 2.3031387 3.627648 1.324509 0.02098156
##  [3,] 2.0502192 3.512714 1.462495 0.02410670
##  [4,] 1.8882852 3.427714 1.539429 0.02570221
##  [5,] 1.7246002 3.368021 1.643421 0.02439057
##  [6,] 1.6436401 3.315368 1.671728 0.02382105
##  [7,] 1.4978476 3.268759 1.770911 0.02219327
##  [8,] 1.3708533 3.225469 1.854616 0.02087923
##  [9,] 1.3168705 3.186190 1.869320 0.01998299
## [10,] 1.2678128 3.148082 1.880270 0.02010648
## [11,] 1.1747914 3.111681 1.936890 0.02063577
## [12,] 1.1086447 3.077134 1.968489 0.02033383
## [13,] 1.0733724 3.044673 1.971301 0.01980712
## [14,] 1.0231837 3.013326 1.990143 0.01976356
## [15,] 0.9873555 2.983594 1.996238 0.01908341
## [16,] 0.9624875 2.955132 1.992644 0.01878823
## [17,] 0.9047742 2.927445 2.022671 0.01839123
## [18,] 0.8700301 2.900950 2.030919 0.01800258
## [19,] 0.8390916 2.876066 2.036974 0.01797449
## [20,] 0.7989319 2.851117 2.052185 0.01785649
fviz_gap_stat(gap_stat)

Use k = 5

8.6 Hierarchical dendogram

plot(hclust_ward, cex = 0.4)
rect.hclust(hclust_ward, k = 5, border = 2:5)

cluster_df_mat <- data.matrix(cluster_df.std)
heatmaply(cluster_df_mat,
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 3,
          margins = c(NA,200,60,NA),
          fontsize_row = 2,
          fontsize_col = 5,
          main="Geographic Segmentation of Sao Paulo by Industrial",
          xlab = "Industrial Indicators",
          ylab = "Municipalities of Sao Paulo State"
          )

9 Cluster

groups <- as.factor(cutree(hclust_ward, k=5))
SP_mun_aggre_clust <- cbind(SP_mun_aggre, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)
qtm(SP_mun_aggre_clust, "CLUSTER")

10 Spatially constrained clustering

SP_mun_aggre_sp <- as_Spatial(SP_mun_aggre)
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
##  but +towgs84= values preserved
SP_mun_aggre_sp_crs <- spTransform(SP_mun_aggre_sp, CRS("+init=epsg:4618"))
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum South_American_Datum_1969 in CRS definition

10.1 Find nearest neighbour

sp.nb <- poly2nb(SP_mun_aggre_sp_crs)
summary(sp.nb)
## Neighbour list object:
## Number of regions: 172 
## Number of nonzero links: 872 
## Percentage nonzero weights: 2.947539 
## Average number of links: 5.069767 
## 1 region with no links:
## 100
## Link number distribution:
## 
##  0  1  2  3  4  5  6  7  8  9 10 22 
##  1  4 14 21 35 31 29 17  9  5  5  1 
## 4 least connected regions:
## 8 9 29 36 with 1 link
## 1 most connected region:
## 161 with 22 links
plot(SP_mun_aggre_sp_crs, border=grey(.5))
plot(sp.nb, coordinates(SP_mun_aggre_sp_crs), col="blue", add=TRUE)

One of the munnicipality is left out.

cnt <- card(sp.nb)
island_ind <- match(0, cnt)
print(island_ind)
## [1] 100
nearest <- knearneigh(coordinates(SP_mun_aggre_sp_crs))$nn
sp.nb[[island_ind]] <- nearest[island_ind]

# We need to add this island to the neighbours of its connected municipality as well
sp.nb[[nearest[island_ind]]] <- c(as.integer(100), as.integer(sp.nb[[nearest[island_ind]]]))
plot(SP_mun_aggre_sp_crs, border=grey(.5))
plot(sp.nb, coordinates(SP_mun_aggre_sp_crs), col="blue", add=TRUE)