Brazil is the world’s fifth-largest country by area and the sixth most populous. Brazil is classified as an upper-middle income economy by the World Bank and a developing country, with the largest share of global wealth in Latin America. It is considered an advanced emerging economy. It has the ninth largest GDP in the world by nominal, and eighth by PPP measures. Behind all these impressive figures, the spatial development of Brazil is highly unequal as shown in the figure on the right. In 2016, the GDP per capita of the poorest municipality is R$3,190.6. On the other hand, the GDP per capita of the richest municipality is R$314,638. Half of the municipalities have GDP per capita less than R$16,000 and the top 25% municipalities earn R$26,155 and above.
In recent year, our government has been encouraging local enterprises to venture into international market and one of the highly recommended country is Brazil (refer to https://www.enterprisesg.gov.sg/overseas-markets/north-latin-america/brazil/market-profile, https://www.enterprisesg.gov.sg/blog/latin-america-tech-landscape, and https://www.enterprisesg.gov.sg/blog/where-to-begin-in-latin-america). However, in view of the spatial development of Brazil that is highly unequal, any company interested to invest in Brazil must have a good understanding of the geographical specialisation and competitiveness of industry in the country.
to segment São Paulo Macrometropolis at the municipality level into homogeneous industry specialization clusters by using data provided by Brazilian Institute of Geography and Statistics (IBGE) (https://www.ibge.gov.br/en/homeeng. html).
The specific tasks of the analysis are as follows:
For the purpose of this study, two data sets are provided. They are:
BRAZIL_CITIES.csv. This data file consists of 81 columns and 5573 rows. Each row representing one municipality.
Data_Dictionary.csv. This file provides meta data of each columns in BRAZIL_CITIES.csv.
Both data files were downloaded from Kaggle (https://www.kaggle.com/crisparada/braziliancities).
Beside these two data files, you are also required to obtain a copy of 2016 municipality boundary file and a copy of the metropolitan boundary file by using an awesome R package called geobr (https://cran.r-project.org/web/packages/geobr/). Before using the package, you are highly recommended to read the vignettes (https://cran.rproject. org/web/packages/geobr/vignettes/intro_to_geobr.html) and the relevant section of the user guide.
Before we get started, it is important for us to install the necessary R packages into R and launch these R packages into R environment.
The R packages needed for this exercise are as follows:
packages = c('rgdal', 'spdep', 'ClustGeo', 'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse', 'geobr','arsenal','factoextra','NbClust')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}Note: - With tidyverse, we do not have to install readr, ggplot2 and dplyr packages separately. In fact, tidyverse also installed other very useful R packages such as tidyr.
Ctrl + Alt + I (Cmd + Option + I on macOS). Source hereThe csv files will be import using read.csv function of Base R functions.
Initially read_csv function of readr package was used to read the csv file into the environment. However, as the read_csv function was unable to separate the data effectively using the “;” as the separator, read.csv function was used and it has yielded the desired outcome. Besides that, any none English observations or observation with no value in the dataframe will not be read into R.
The code chunks used are shown below:
#encoding = UTF-8 is set to account for the unique portguese names and labels
Brazil_cities <- read.csv("data/aspatial/BRAZIL_CITIES.csv", sep = ";", encoding = "UTF-8")## [1] "CITY" "STATE" "CAPITAL"
## [4] "IBGE_RES_POP" "IBGE_RES_POP_BRAS" "IBGE_RES_POP_ESTR"
## [7] "IBGE_DU" "IBGE_DU_URBAN" "IBGE_DU_RURAL"
## [10] "IBGE_POP" "IBGE_1" "IBGE_1.4"
## [13] "IBGE_5.9" "IBGE_10.14" "IBGE_15.59"
## [16] "IBGE_60." "IBGE_PLANTED_AREA" "IBGE_CROP_PRODUCTION_."
## [19] "IDHM.Ranking.2010" "IDHM" "IDHM_Renda"
## [22] "IDHM_Longevidade" "IDHM_Educacao" "LONG"
## [25] "LAT" "ALT" "PAY_TV"
## [28] "FIXED_PHONES" "AREA" "REGIAO_TUR"
## [31] "CATEGORIA_TUR" "ESTIMATED_POP" "RURAL_URBAN"
## [34] "GVA_AGROPEC" "GVA_INDUSTRY" "GVA_SERVICES"
## [37] "GVA_PUBLIC" "GVA_TOTAL" "TAXES"
## [40] "GDP" "POP_GDP" "GDP_CAPITA"
## [43] "GVA_MAIN" "MUN_EXPENDIT" "COMP_TOT"
## [46] "COMP_A" "COMP_B" "COMP_C"
## [49] "COMP_D" "COMP_E" "COMP_F"
## [52] "COMP_G" "COMP_H" "COMP_I"
## [55] "COMP_J" "COMP_K" "COMP_L"
## [58] "COMP_M" "COMP_N" "COMP_O"
## [61] "COMP_P" "COMP_Q" "COMP_R"
## [64] "COMP_S" "COMP_T" "COMP_U"
## [67] "HOTELS" "BEDS" "Pr_Agencies"
## [70] "Pu_Agencies" "Pr_Bank" "Pu_Bank"
## [73] "Pr_Assets" "Pu_Assets" "Cars"
## [76] "Motorcycles" "Wheeled_tractor" "UBER"
## [79] "MAC" "WAL.MART" "POST_OFFICES"
#encoding = UTF-8 is set to account for the unique portguese names and labels
Brazil_metadata <- read.csv("data/aspatial/Data_Dictionary.csv", sep = ";", encoding = "UTF-8")## [1] "X.U.FEFF.FIELD" "DESCRIPTION" "REFERENCE" "UNIT"
## [5] "SOURCE" "X"
The code chunk below is to ascertain the data type of Brazil_cities and Brazil_metadata.
## [1] "data.frame"
## [1] "data.frame"
The code chunk below reveal the summary statistics of Brazil_cities data.frame.
## CITY STATE CAPITAL IBGE_RES_POP
## Length:5573 Length:5573 Min. :0.000000 Min. : 805
## Class :character Class :character 1st Qu.:0.000000 1st Qu.: 5235
## Mode :character Mode :character Median :0.000000 Median : 10934
## Mean :0.004845 Mean : 34278
## 3rd Qu.:0.000000 3rd Qu.: 23424
## Max. :1.000000 Max. :11253503
## NA's :8
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN
## Min. : 805 Min. : 0.0 Min. : 239 Min. : 60
## 1st Qu.: 5230 1st Qu.: 0.0 1st Qu.: 1572 1st Qu.: 874
## Median : 10926 Median : 0.0 Median : 3174 Median : 1846
## Mean : 34200 Mean : 77.5 Mean : 10303 Mean : 8859
## 3rd Qu.: 23390 3rd Qu.: 10.0 3rd Qu.: 6726 3rd Qu.: 4624
## Max. :11133776 Max. :119727.0 Max. :3576148 Max. :3548433
## NA's :8 NA's :8 NA's :10 NA's :10
## IBGE_DU_RURAL IBGE_POP IBGE_1 IBGE_1.4
## Min. : 3 Min. : 174 Min. : 0.0 Min. : 5
## 1st Qu.: 487 1st Qu.: 2801 1st Qu.: 38.0 1st Qu.: 158
## Median : 931 Median : 6170 Median : 92.0 Median : 376
## Mean : 1463 Mean : 27595 Mean : 383.3 Mean : 1544
## 3rd Qu.: 1832 3rd Qu.: 15302 3rd Qu.: 232.0 3rd Qu.: 951
## Max. :33809 Max. :10463636 Max. :129464.0 Max. :514794
## NA's :81 NA's :8 NA's :8 NA's :8
## IBGE_5.9 IBGE_10.14 IBGE_15.59 IBGE_60.
## Min. : 7 Min. : 12 Min. : 94 Min. : 29
## 1st Qu.: 220 1st Qu.: 259 1st Qu.: 1734 1st Qu.: 341
## Median : 516 Median : 588 Median : 3841 Median : 722
## Mean : 2069 Mean : 2381 Mean : 18212 Mean : 3004
## 3rd Qu.: 1300 3rd Qu.: 1478 3rd Qu.: 9628 3rd Qu.: 1724
## Max. :684443 Max. :783702 Max. :7058221 Max. :1293012
## NA's :8 NA's :8 NA's :8 NA's :8
## IBGE_PLANTED_AREA IBGE_CROP_PRODUCTION_. IDHM.Ranking.2010 IDHM
## Min. : 0.0 Min. : 0 Min. : 1 Min. :0.4180
## 1st Qu.: 910.2 1st Qu.: 2326 1st Qu.:1392 1st Qu.:0.5990
## Median : 3471.5 Median : 13846 Median :2783 Median :0.6650
## Mean : 14179.9 Mean : 57384 Mean :2783 Mean :0.6592
## 3rd Qu.: 11194.2 3rd Qu.: 55619 3rd Qu.:4174 3rd Qu.:0.7180
## Max. :1205669.0 Max. :3274885 Max. :5565 Max. :0.8620
## NA's :3 NA's :3 NA's :8 NA's :8
## IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG
## Min. :0.4000 Min. :0.6720 Min. :0.2070 Min. :-72.92
## 1st Qu.:0.5720 1st Qu.:0.7690 1st Qu.:0.4900 1st Qu.:-50.87
## Median :0.6540 Median :0.8080 Median :0.5600 Median :-46.52
## Mean :0.6429 Mean :0.8016 Mean :0.5591 Mean :-46.23
## 3rd Qu.:0.7070 3rd Qu.:0.8360 3rd Qu.:0.6310 3rd Qu.:-41.40
## Max. :0.8910 Max. :0.8940 Max. :0.8250 Max. :-32.44
## NA's :8 NA's :8 NA's :8 NA's :9
## LAT ALT PAY_TV FIXED_PHONES
## Min. :-33.688 Min. : 0.0 Min. : 1 Min. : 3
## 1st Qu.:-22.838 1st Qu.: 169.8 1st Qu.: 88 1st Qu.: 119
## Median :-18.089 Median : 406.5 Median : 247 Median : 327
## Mean :-16.444 Mean : 893.8 Mean : 3094 Mean : 6567
## 3rd Qu.: -8.489 3rd Qu.: 628.9 3rd Qu.: 815 3rd Qu.: 1151
## Max. : 4.585 Max. :874579.0 Max. :2047668 Max. :5543127
## NA's :9 NA's :9 NA's :3 NA's :3
## AREA REGIAO_TUR CATEGORIA_TUR ESTIMATED_POP
## Length:5573 Length:5573 Length:5573 Min. : 786
## Class :character Class :character Class :character 1st Qu.: 5454
## Mode :character Mode :character Mode :character Median : 11590
## Mean : 37432
## 3rd Qu.: 25296
## Max. :12176866
## NA's :3
## RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## Length:5573 Min. : 0 Min. : 1 Min. : 2
## Class :character 1st Qu.: 4189 1st Qu.: 1726 1st Qu.: 10112
## Mode :character Median : 20426 Median : 7424 Median : 31211
## Mean : 47271 Mean : 175928 Mean : 489451
## 3rd Qu.: 51227 3rd Qu.: 41022 3rd Qu.: 115406
## Max. :1402282 Max. :63306755 Max. :464656988
## NA's :3 NA's :3 NA's :3
## GVA_PUBLIC GVA_TOTAL TAXES GDP
## Min. : 7 Min. : 17 Min. : -14159 Min. : 15
## 1st Qu.: 17267 1st Qu.: 42253 1st Qu.: 1305 1st Qu.: 43709
## Median : 35866 Median : 119492 Median : 5100 Median : 125153
## Mean : 123768 Mean : 832987 Mean : 118864 Mean : 954584
## 3rd Qu.: 89245 3rd Qu.: 313963 3rd Qu.: 22197 3rd Qu.: 329539
## Max. :41902893 Max. :569910503 Max. :117125387 Max. :687035890
## NA's :3 NA's :3 NA's :3 NA's :3
## POP_GDP GDP_CAPITA GVA_MAIN MUN_EXPENDIT
## Min. : 815 Min. : 3191 Length:5573 Min. :1.421e+06
## 1st Qu.: 5483 1st Qu.: 9058 Class :character 1st Qu.:1.573e+07
## Median : 11578 Median : 15870 Mode :character Median :2.746e+07
## Mean : 36998 Mean : 21126 Mean :1.043e+08
## 3rd Qu.: 25085 3rd Qu.: 26155 3rd Qu.:5.666e+07
## Max. :12038175 Max. :314638 Max. :4.577e+10
## NA's :3 NA's :3 NA's :1492
## COMP_TOT COMP_A COMP_B COMP_C
## Min. : 6.0 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 68.0 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 3.00
## Median : 162.0 Median : 2.00 Median : 0.000 Median : 11.00
## Mean : 906.8 Mean : 18.25 Mean : 1.852 Mean : 73.44
## 3rd Qu.: 448.0 3rd Qu.: 8.00 3rd Qu.: 2.000 3rd Qu.: 39.00
## Max. :530446.0 Max. :1948.00 Max. :274.000 Max. :31566.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_D COMP_E COMP_F COMP_G
## Min. : 0.0000 Min. : 0.000 Min. : 0.00 Min. : 1.0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 32.0
## Median : 0.0000 Median : 0.000 Median : 4.00 Median : 74.5
## Mean : 0.4262 Mean : 2.029 Mean : 43.26 Mean : 348.0
## 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 15.00 3rd Qu.: 199.0
## Max. :332.0000 Max. :657.000 Max. :25222.00 Max. :150633.0
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_H COMP_I COMP_J COMP_K
## Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 7 Median : 7.00 Median : 1.00 Median : 0.00
## Mean : 41 Mean : 55.88 Mean : 24.74 Mean : 15.55
## 3rd Qu.: 25 3rd Qu.: 24.00 3rd Qu.: 5.00 3rd Qu.: 2.00
## Max. :19515 Max. :29290.00 Max. :38720.00 Max. :23738.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_L COMP_M COMP_N COMP_O
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 0.00 Median : 4.00 Median : 4.0 Median : 2.000
## Mean : 15.14 Mean : 51.29 Mean : 83.7 Mean : 3.269
## 3rd Qu.: 3.00 3rd Qu.: 13.00 3rd Qu.: 14.0 3rd Qu.: 3.000
## Max. :14003.00 Max. :49181.00 Max. :76757.0 Max. :204.000
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_P COMP_Q COMP_R COMP_S
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 5.00
## Median : 6.00 Median : 3.00 Median : 2.00 Median : 12.00
## Mean : 30.96 Mean : 34.15 Mean : 12.18 Mean : 51.61
## 3rd Qu.: 17.00 3rd Qu.: 12.00 3rd Qu.: 6.00 3rd Qu.: 31.00
## Max. :16030.00 Max. :22248.00 Max. :6687.00 Max. :24832.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_T COMP_U HOTELS BEDS
## Min. :0 Min. : 0.00000 Min. : 1.000 Min. : 2.0
## 1st Qu.:0 1st Qu.: 0.00000 1st Qu.: 1.000 1st Qu.: 40.0
## Median :0 Median : 0.00000 Median : 1.000 Median : 82.0
## Mean :0 Mean : 0.05027 Mean : 3.131 Mean : 257.5
## 3rd Qu.:0 3rd Qu.: 0.00000 3rd Qu.: 3.000 3rd Qu.: 200.0
## Max. :0 Max. :123.00000 Max. :97.000 Max. :13247.0
## NA's :3 NA's :3 NA's :4686 NA's :4686
## Pr_Agencies Pu_Agencies Pr_Bank Pu_Bank
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.00
## 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:1.00
## Median : 1.000 Median : 2.000 Median : 1.000 Median :2.00
## Mean : 3.383 Mean : 2.829 Mean : 1.312 Mean :1.58
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.:2.00
## Max. :1693.000 Max. :626.000 Max. :83.000 Max. :8.00
## NA's :2231 NA's :2231 NA's :2231 NA's :2231
## Pr_Assets Pu_Assets Cars Motorcycles
## Min. :0.000e+00 Min. :0.000e+00 Min. : 2 Min. : 4
## 1st Qu.:0.000e+00 1st Qu.:4.047e+07 1st Qu.: 602 1st Qu.: 591
## Median :3.231e+07 Median :1.339e+08 Median : 1438 Median : 1285
## Mean :9.180e+09 Mean :6.005e+09 Mean : 9859 Mean : 4879
## 3rd Qu.:1.148e+08 3rd Qu.:4.970e+08 3rd Qu.: 4086 3rd Qu.: 3294
## Max. :1.947e+13 Max. :8.016e+12 Max. :5740995 Max. :1134570
## NA's :2231 NA's :2231 NA's :11 NA's :11
## Wheeled_tractor UBER MAC WAL.MART
## Min. : 0.000 Min. :1 Min. : 1.000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:1 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 0.000 Median :1 Median : 2.000 Median : 1.000
## Mean : 5.754 Mean :1 Mean : 4.277 Mean : 2.059
## 3rd Qu.: 1.000 3rd Qu.:1 3rd Qu.: 3.000 3rd Qu.: 1.750
## Max. :3236.000 Max. :1 Max. :130.000 Max. :26.000
## NA's :11 NA's :5448 NA's :5407 NA's :5471
## POST_OFFICES
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 2.081
## 3rd Qu.: 2.000
## Max. :225.000
## NA's :120
## [1] "GO" "MG" "PA" "CE" "BA" "PR" "SC" "PE" "TO" "MA" "RN" "PI" "RS" "MT" "AC"
## [16] "SP" "ES" "AL" "PB" "MS" "RO" "RR" "AM" "AP" "SE" "RJ" "DF"
After checking through Brazil_cities dataframe, we confirm that “SP” state is Sao Paulo state that we would like to look at. Next, we will extract all observations that belong to this category with the code chunk below.
In this section, we will import Brazil 2016 municipality boundary file GIS data and the metropolitan boundary GIS data into R environment.
The geobr package provides quick and easy access to official spatial data sets of Brazil. The syntax of all geobr functions operate on a simple logic that allows users to easily download a wide variety of data sets with updated geometries and harmonized attributes and geographic projections across geographies and years.
The documentation of different functions in the geobr package can be found here.
Description The function returns the shapes of municipalities grouped by their respective metro areas. Metropolitan areas are created by each state in Brazil. The data set includes the municipalities that belong to all metropolitan areas in the country according to state legislation in each year. Orignal data were generated by Institute of Geography. Data at scale 1:250,000, using Geodetic reference system “SIRGAS2000” and CRS(4674).
Usage read_metro_area(year = 2018, simplified = TRUE, showProgress = TRUE)
Arguments year A year number in YYYY format (defaults to 2018) 20 read_micro_region
simplified Logic FALSE or TRUE, indicating whether the function returns the data set with ’original’ resolution or a data set with ’simplified’ borders (Defaults to TRUE). For spatial analysis and statistics users should set simplified = FALSE. Borders have been simplified by removing vertices of borders using st_simplifysf preserving topology with a dTolerance of 100.
showProgress Logical. Defaults to (TRUE) display progress bar
The code chunks used are shown below:
## Using year 2016
Extracting online the São Paulo Macrometropolis by specifying the state to be “SP”.
The code chunk below is used to ascertain the total number of metropolis in the São Paulo Macrometropolis.
## [1] "Aglomeração Urbana de Jundiaí"
## [2] "Aglomeração Urbana de Piracicaba-AU- Piracicaba"
## [3] "RM Baixada Santista"
## [4] "RM Campinas"
## [5] "RM de Sorocaba"
## [6] "RM do Vale do Paraíba e Litoral Norte"
## [7] "RM São Paulo"
Through the output of the code chunk above, we can clearly see that Regional Unit of Bragança Paulista city is missing as we compare the list of name_metro against the list of Region available in Wikipedia. The list below are the 8 main regions in the São Paulo Macrometropolis:
# plot
ggplot() +
geom_sf(data=SP_metropolitan, fill="#2D3E50", color="#FEBF57", size=.15, show.legend = FALSE) +
labs(subtitle="Metropolitan of Sao Paolo Brazil, 2016", size=8) +
theme_minimal()Note: As stated previously, Regional Unit of Bragança Paulista city is the missing region in the Metropolitan data. thus we will be using another function from the geobr package, read_municipality to extract the required municipalities.
Arguments code_muni The 7-digit code of a municipality. If the two-digit code or a two-letter uppercase abbreviation of a state is passed, (e.g. 33 or “RJ”) the function will load all municipalities of that state. If code_muni=“all”, all municipalities of the country will be loaded.
year Year of the data (defaults to 2010)
simplified Logic FALSE or TRUE, indicating whether the function returns the data set with ’original’ resolution or a data set with ’simplified’ borders (Defaults to TRUE). For spatial analysis and statistics users should set simplified = FALSE. Borders have been simplified by removing vertices of borders using st_simplifysf preserving topology with a dTolerance of 100.
showProgress Logical. Defaults to (TRUE) display progress bar
The code chunks used are shown below:
# Read all municipalities of the country at a given year
mun <- read_municipality(year=2016, simplified = TRUE, showProgress = FALSE)## Using year 2016
## Loading data for the whole country. This might take a few minutes.
Finding out the different states that are included in the municipality dataset.
## [1] "RO" "AC" "AM" "RR" "PA" "AP" "TO" "MA" "PI" "CE" "RN" "PB" "PE" "AL" "SE"
## [16] "BA" "MG" "ES" "RJ" "SP" "PR" "SC" "RS" "MS" "MT" "GO" "DF"
# plot
ggplot() +
geom_sf(data=mun, fill="#2D3E50", color="#FEBF57", size=.15, show.legend = TRUE) +
labs(subtitle="Municipalities of Brazil, 2016", size=8) +
theme_minimal() Now we will try to extract the Regional Unit of Bragança Paulista city from the mun dataset using the code chunk below:
We will also extract the São Paulo state area to visualize whether our gazette region of interest is indeed in the correct location.
ggplot() +
geom_sf(data=Sao_Paulo, fill="#4AE107", color="#0941E7", size=.15, show.legend = TRUE, ) +
geom_sf(data=SP_metropolitan, fill="#E90729", color="#E6E907", size=.15, show.legend = TRUE, ) +
labs(subtitle="Metropolis of Brazil, 2016", size=8) +
theme_minimal() +
geom_sf(data=Bra_paulista, fill="#FF00CC", color="#030003", size=.15, show.legend = FALSE) Note: This is a mistake whereby only the Bragança Paulista municipality is gazetted instead of the whole Regional Unit of Bragança Paulista city which includes:
These information can be found from here
First we will create a vector of the names of municipalities in the Regional Unit of Bragança Paulista city using the code chunk below:
BP_munis <- c("Atibaia",
"Bom Jesus dos Perdões",
"Bragança Paulista",
"Itatiba",
"Jarinu",
"Joanópolis",
"Morungaba",
"Nazaré Paulista",
"Piracaia",
"Tuiuti",
"Vargem"
)Then, we will extract the relevant municipalities using the name as the index
## Simple feature collection with 11 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -46.92735 ymin: -23.31611 xmax: -46.04105 ymax: -22.76762
## geographic CRS: SIRGAS 2000
## First 10 features:
## code_muni name_muni code_state abbrev_state
## 1 3504107 Atibaia 35 SP
## 2 3507100 Bom Jesus Dos Perdões 35 SP
## 3 3507605 Bragança Paulista 35 SP
## 4 3523404 Itatiba 35 SP
## 5 3525201 Jarinu 35 SP
## 6 3525508 Joanópolis 35 SP
## 7 3532009 Morungaba 35 SP
## 8 3532405 Nazaré Paulista 35 SP
## 9 3538600 Piracaia 35 SP
## 10 3554953 Tuiuti 35 SP
## geom
## 1 MULTIPOLYGON (((-46.66146 -...
## 2 MULTIPOLYGON (((-46.47303 -...
## 3 MULTIPOLYGON (((-46.50858 -...
## 4 MULTIPOLYGON (((-46.76852 -...
## 5 MULTIPOLYGON (((-46.71471 -...
## 6 MULTIPOLYGON (((-46.14383 -...
## 7 MULTIPOLYGON (((-46.80227 -...
## 8 MULTIPOLYGON (((-46.26139 -...
## 9 MULTIPOLYGON (((-46.34589 -...
## 10 MULTIPOLYGON (((-46.6655 -2...
Note: The confirmation that the extraction is done well when the BP_munis_gis returns 11 observations which is aligned with the total number of municipalities in the Regional Unit of Bragança Paulista city. This comparison can also be done by comparing with the length of BP_munis vector.
Checking for duplication of data
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [169] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
Check the number of rows in the dataframe
## [1] 175
Since there are duplication, let’s us examine the data that are duplicated
duplicated_rows <- SP_area[duplicated(SP_area$name_muni),]
duplicated_names <- duplicated_rows$name_muni
SP_area %>%
filter(name_muni %in% duplicated_names)## Simple feature collection with 6 features and 9 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -46.92735 ymin: -23.21495 xmax: -46.65393 ymax: -22.83401
## geographic CRS: SIRGAS 2000
## code_muni name_muni code_state abbrev_state name_metro
## 1 3525201 Jarinu 35 SP Aglomeração Urbana de Jundiaí
## 2 3523404 Itatiba 35 SP RM Campinas
## 3 3532009 Morungaba 35 SP RM Campinas
## 4 3523404 Itatiba 35 SP <NA>
## 5 3525201 Jarinu 35 SP <NA>
## 6 3532009 Morungaba 35 SP <NA>
## type subdivision legislation legislation_date
## 1 AGLO NÃO TEM Lei Complementar 1.146 24.08.2011
## 2 RM NÃO TEM Lei Complementar 870 19.06.2000
## 3 RM NÃO TEM Lei Complementar 1.234 14.03.2014
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA>
## geom
## 1 MULTIPOLYGON (((-46.71471 -...
## 2 MULTIPOLYGON (((-46.76852 -...
## 3 MULTIPOLYGON (((-46.80227 -...
## 4 MULTIPOLYGON (((-46.76852 -...
## 5 MULTIPOLYGON (((-46.71471 -...
## 6 MULTIPOLYGON (((-46.80227 -...
Through rigorous examination, it seems that 3 municipalities has been recognised to be in the Regional Unit of Bragança Paulista city instead of their previously identified metropolis. This information can be checked using this link and by examining the output above individually.
We will then proceed to drop the duplicated rows
And perform checks for duplication again:
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE
Check the number of row in the dataframe after the removal of duplication. Since the number is 172 which is 3 rows lesser than the un-processed SP_area dataframe, the duplication removal has been performed effectively.
## [1] 172
g1<- ggplot() +
geom_sf(data=Sao_Paulo, fill="#4AE107", color="#0941E7", size=.15, show.legend = TRUE, ) +
geom_sf(data=SP_metropolitan, fill="#E90729", color="#E6E907", size=.15, show.legend = TRUE, ) +
labs(subtitle="Metropolis of Brazil, 2016", size=8) +
theme_minimal() +
geom_sf(data=Bra_paulista, fill="#FF00CC", color="#030003", size=.15, show.legend = FALSE)
g2 <- ggplot() +
geom_sf(data=Sao_Paulo, fill="#4AE107", color="#0941E7", size=.15, show.legend = TRUE, ) +
geom_sf(data=SP_area, fill="#E90729", color="#E6E907", size=.15, show.legend = TRUE, ) +
labs(subtitle="Metropolis of Brazil (Complete), 2016", size=8) +
theme_minimal()
ggarrange(g1, g2, widths = c(2,2))Joining of geospatial and aspatial data needs to be done to ultimately assist in the analysis of distribution of spatial specialisation by industry type. However, due to the lack of common data field in both geospatial and aspatial dataframe, the challenge lies in the join process.
SP_inner_join <- SP_area %>% inner_join(SP_state, by = c("abbrev_state" = "STATE", "name_muni" = "CITY"))Counting the number of observations in the newly formed inner joined dataframe
## [1] 171
Recall that the geospatial dataframe SP_area has 172 observations, we will need to identify which municipality has been removed.
## [1] 1
## [1] "Santa Bárbara D'oeste"
After much inspection, SP_area’s municipality saved name was Santa Bárbara D'oeste while SP_state’s CITY name is saved as Santa Bárbara D'Oeste. Due to the capitalized O in the city name in SP_state, the inner join function was unsuccessful. Thus we will rename the city name in SP_state to match the one in SP_area using the code chunk below:
Inner_join
SP_inner_join <- SP_area %>% inner_join(SP_state, by = c("abbrev_state" = "STATE", "name_muni" = "CITY"))Counting the number of observations in the newly formed inner joined dataframe
## [1] 172
## [1] 0
Since the difference in the number of rows in both dataframe is zero, the inner join process is done correctly.
## [1] "CITY" "STATE" "CAPITAL"
## [4] "IBGE_RES_POP" "IBGE_RES_POP_BRAS" "IBGE_RES_POP_ESTR"
## [7] "IBGE_DU" "IBGE_DU_URBAN" "IBGE_DU_RURAL"
## [10] "IBGE_POP" "IBGE_1" "IBGE_1.4"
## [13] "IBGE_5.9" "IBGE_10.14" "IBGE_15.59"
## [16] "IBGE_60." "IBGE_PLANTED_AREA" "IBGE_CROP_PRODUCTION_."
## [19] "IDHM.Ranking.2010" "IDHM" "IDHM_Renda"
## [22] "IDHM_Longevidade" "IDHM_Educacao" "LONG"
## [25] "LAT" "ALT" "PAY_TV"
## [28] "FIXED_PHONES" "AREA" "REGIAO_TUR"
## [31] "CATEGORIA_TUR" "ESTIMATED_POP" "RURAL_URBAN"
## [34] "GVA_AGROPEC" "GVA_INDUSTRY" "GVA_SERVICES"
## [37] "GVA_PUBLIC" "GVA_TOTAL" "TAXES"
## [40] "GDP" "POP_GDP" "GDP_CAPITA"
## [43] "GVA_MAIN" "MUN_EXPENDIT" "COMP_TOT"
## [46] "COMP_A" "COMP_B" "COMP_C"
## [49] "COMP_D" "COMP_E" "COMP_F"
## [52] "COMP_G" "COMP_H" "COMP_I"
## [55] "COMP_J" "COMP_K" "COMP_L"
## [58] "COMP_M" "COMP_N" "COMP_O"
## [61] "COMP_P" "COMP_Q" "COMP_R"
## [64] "COMP_S" "COMP_T" "COMP_U"
## [67] "HOTELS" "BEDS" "Pr_Agencies"
## [70] "Pu_Agencies" "Pr_Bank" "Pu_Bank"
## [73] "Pr_Assets" "Pu_Assets" "Cars"
## [76] "Motorcycles" "Wheeled_tractor" "UBER"
## [79] "MAC" "WAL.MART" "POST_OFFICES"
## [1] "code_muni" "name_muni" "code_state"
## [4] "abbrev_state" "name_metro" "type"
## [7] "subdivision" "legislation" "legislation_date"
## [10] "CAPITAL" "IBGE_RES_POP" "IBGE_RES_POP_BRAS"
## [13] "IBGE_RES_POP_ESTR" "IBGE_DU" "IBGE_DU_URBAN"
## [16] "IBGE_DU_RURAL" "IBGE_POP" "IBGE_1"
## [19] "IBGE_1.4" "IBGE_5.9" "IBGE_10.14"
## [22] "IBGE_15.59" "IBGE_60." "IBGE_PLANTED_AREA"
## [25] "IBGE_CROP_PRODUCTION_." "IDHM.Ranking.2010" "IDHM"
## [28] "IDHM_Renda" "IDHM_Longevidade" "IDHM_Educacao"
## [31] "LONG" "LAT" "ALT"
## [34] "PAY_TV" "FIXED_PHONES" "AREA"
## [37] "REGIAO_TUR" "CATEGORIA_TUR" "ESTIMATED_POP"
## [40] "RURAL_URBAN" "GVA_AGROPEC" "GVA_INDUSTRY"
## [43] "GVA_SERVICES" "GVA_PUBLIC" "GVA_TOTAL"
## [46] "TAXES" "GDP" "POP_GDP"
## [49] "GDP_CAPITA" "GVA_MAIN" "MUN_EXPENDIT"
## [52] "COMP_TOT" "COMP_A" "COMP_B"
## [55] "COMP_C" "COMP_D" "COMP_E"
## [58] "COMP_F" "COMP_G" "COMP_H"
## [61] "COMP_I" "COMP_J" "COMP_K"
## [64] "COMP_L" "COMP_M" "COMP_N"
## [67] "COMP_O" "COMP_P" "COMP_Q"
## [70] "COMP_R" "COMP_S" "COMP_T"
## [73] "COMP_U" "HOTELS" "BEDS"
## [76] "Pr_Agencies" "Pu_Agencies" "Pr_Bank"
## [79] "Pu_Bank" "Pr_Assets" "Pu_Assets"
## [82] "Cars" "Motorcycles" "Wheeled_tractor"
## [85] "UBER" "MAC" "WAL.MART"
## [88] "POST_OFFICES" "geom"
Formula of Location Quotient (LQ) Let assuming that you are calculating LQ for industry a for municipality i. The formula should be: (number of industry a in municipality i / sum of all industry in municipality i)/(number of industry a in Greater Sao Paulo/sum of all industry in Greater Sao Paulo)
LQ <- SP_inner_join %>%
mutate(INDUSTRY_A = (`COMP_A`/`COMP_TOT`)/(sum(SP_inner_join$COMP_A)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_B = (`COMP_B`/`COMP_TOT`)/(sum(SP_inner_join$COMP_B)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_C = (`COMP_C`/`COMP_TOT`)/(sum(SP_inner_join$COMP_C)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_D = (`COMP_D`/`COMP_TOT`)/(sum(SP_inner_join$COMP_D)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_E = (`COMP_E`/`COMP_TOT`)/(sum(SP_inner_join$COMP_E)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_F = (`COMP_F`/`COMP_TOT`)/(sum(SP_inner_join$COMP_F)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_G = (`COMP_G`/`COMP_TOT`)/(sum(SP_inner_join$COMP_G)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_H = (`COMP_H`/`COMP_TOT`)/(sum(SP_inner_join$COMP_H)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_I = (`COMP_I`/`COMP_TOT`)/(sum(SP_inner_join$COMP_I)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_J = (`COMP_J`/`COMP_TOT`)/(sum(SP_inner_join$COMP_J)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_K = (`COMP_K`/`COMP_TOT`)/(sum(SP_inner_join$COMP_K)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_L = (`COMP_L`/`COMP_TOT`)/(sum(SP_inner_join$COMP_L)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_M = (`COMP_M`/`COMP_TOT`)/(sum(SP_inner_join$COMP_M)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_N = (`COMP_N`/`COMP_TOT`)/(sum(SP_inner_join$COMP_N)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_O = (`COMP_O`/`COMP_TOT`)/(sum(SP_inner_join$COMP_O)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_P = (`COMP_P`/`COMP_TOT`)/(sum(SP_inner_join$COMP_P)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_Q = (`COMP_Q`/`COMP_TOT`)/(sum(SP_inner_join$COMP_Q)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_R = (`COMP_R`/`COMP_TOT`)/(sum(SP_inner_join$COMP_R)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_S = (`COMP_S`/`COMP_TOT`)/(sum(SP_inner_join$COMP_S)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_T = (`COMP_T`/`COMP_TOT`)/(sum(SP_inner_join$COMP_T)/max(SP_inner_join$COMP_TOT))) %>%
mutate(INDUSTRY_U = (`COMP_U`/`COMP_TOT`)/(sum(SP_inner_join$COMP_U)/max(SP_inner_join$COMP_TOT))) %>%
select(name_muni,IBGE_RES_POP,AREA, GVA_AGROPEC, GVA_INDUSTRY, GVA_SERVICES, GVA_PUBLIC, INDUSTRY_A,INDUSTRY_B,INDUSTRY_C,INDUSTRY_D,INDUSTRY_E,INDUSTRY_F,INDUSTRY_G,INDUSTRY_H,INDUSTRY_I,INDUSTRY_J,INDUSTRY_K,INDUSTRY_L,INDUSTRY_M,INDUSTRY_N,INDUSTRY_O,INDUSTRY_P,INDUSTRY_Q,INDUSTRY_R,INDUSTRY_S,INDUSTRY_T,INDUSTRY_U, geom)
summary(LQ)## name_muni IBGE_RES_POP AREA GVA_AGROPEC
## Length:172 Min. : 2493 Length:172 Min. : 0
## Class :character 1st Qu.: 15764 Class :character 1st Qu.: 1091
## Mode :character Median : 46969 Mode :character Median : 11829
## Mean : 178402 Mean : 36855
## 3rd Qu.: 127853 3rd Qu.: 32278
## Max. :11253503 Max. :630080
##
## GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC INDUSTRY_A
## Min. : 3 Min. : 24 Min. : 16 Min. : 0.0000
## 1st Qu.: 14666 1st Qu.: 68912 1st Qu.: 28712 1st Qu.: 0.1825
## Median : 161242 Median : 459650 Median : 154560 Median : 0.8230
## Mean : 1452326 Mean : 5332188 Mean : 668834 Mean : 3.2699
## 3rd Qu.: 1182734 3rd Qu.: 2016584 3rd Qu.: 507929 3rd Qu.: 3.3586
## Max. :63306755 Max. :464656988 Max. :41902893 Max. :29.9330
##
## INDUSTRY_B INDUSTRY_C INDUSTRY_D INDUSTRY_E
## Min. : 0.0000 Min. :0.0000 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.3315 1st Qu.: 0.0000 1st Qu.:0.3003
## Median : 0.6588 Median :0.5079 Median : 0.0000 Median :0.5594
## Mean : 2.7200 Mean :0.5767 Mean : 0.3215 Mean :0.7087
## 3rd Qu.: 2.4599 3rd Qu.:0.7633 3rd Qu.: 0.0000 3rd Qu.:0.9236
## Max. :43.4756 Max. :1.7140 Max. :22.5329 Max. :6.3925
##
## INDUSTRY_F INDUSTRY_G INDUSTRY_H INDUSTRY_I
## Min. :0.0000 Min. :0.1160 Min. :0.0000 Min. :0.1364
## 1st Qu.:0.3218 1st Qu.:0.4906 1st Qu.:0.3378 1st Qu.:0.4355
## Median :0.4289 Median :0.5451 Median :0.4675 Median :0.5024
## Mean :0.4269 Mean :0.5402 Mean :0.5527 Mean :0.5738
## 3rd Qu.:0.5268 3rd Qu.:0.6069 3rd Qu.:0.6732 3rd Qu.:0.6213
## Max. :0.9380 Max. :1.0147 Max. :1.8660 Max. :2.1228
##
## INDUSTRY_J INDUSTRY_K INDUSTRY_L INDUSTRY_M
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1026 1st Qu.:0.09882 1st Qu.:0.1333 1st Qu.:0.1644
## Median :0.1561 Median :0.15815 Median :0.2744 Median :0.2311
## Mean :0.2118 Mean :0.17675 Mean :0.2772 Mean :0.2533
## 3rd Qu.:0.2250 3rd Qu.:0.23262 3rd Qu.:0.3742 3rd Qu.:0.3134
## Max. :3.7910 Max. :1.02673 Max. :0.8712 Max. :1.1660
##
## INDUSTRY_N INDUSTRY_O INDUSTRY_P INDUSTRY_Q
## Min. :0.0000 Min. : 0.1331 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1578 1st Qu.: 0.6281 1st Qu.:0.3314 1st Qu.:0.1780
## Median :0.2459 Median : 1.4407 Median :0.4794 Median :0.2689
## Mean :0.2749 Mean : 3.3555 Mean :0.4699 Mean :0.2875
## 3rd Qu.:0.3447 3rd Qu.: 3.5280 3rd Qu.:0.6249 3rd Qu.:0.3677
## Max. :1.2462 Max. :64.2167 Max. :1.0357 Max. :0.8574
##
## INDUSTRY_R INDUSTRY_S INDUSTRY_T INDUSTRY_U
## Min. :0.0000 Min. :0.06971 Min. : NA Min. :0.00000
## 1st Qu.:0.3372 1st Qu.:0.34038 1st Qu.: NA 1st Qu.:0.00000
## Median :0.4422 Median :0.42615 Median : NA Median :0.00000
## Mean :0.4218 Mean :0.43369 Mean :NaN Mean :0.02169
## 3rd Qu.:0.5389 3rd Qu.:0.51768 3rd Qu.: NA 3rd Qu.:0.00000
## Max. :0.8975 Max. :1.13087 Max. : NA Max. :2.04474
## NA's :172
## geom
## MULTIPOLYGON :172
## epsg:4674 : 0
## +proj=long...: 0
##
##
##
##
Note: We will only select the columns stated in the code chunk below as we are investigating the distribution of spatial specialisation by industry type.
## [1] "name_muni" "IBGE_RES_POP" "AREA" "GVA_AGROPEC" "GVA_INDUSTRY"
## [6] "GVA_SERVICES" "GVA_PUBLIC" "INDUSTRY_A" "INDUSTRY_B" "INDUSTRY_C"
## [11] "INDUSTRY_D" "INDUSTRY_E" "INDUSTRY_F" "INDUSTRY_G" "INDUSTRY_H"
## [16] "INDUSTRY_I" "INDUSTRY_J" "INDUSTRY_K" "INDUSTRY_L" "INDUSTRY_M"
## [21] "INDUSTRY_N" "INDUSTRY_O" "INDUSTRY_P" "INDUSTRY_Q" "INDUSTRY_R"
## [26] "INDUSTRY_S" "INDUSTRY_T" "INDUSTRY_U" "geom"
Rename the columns
new_names <- c("NAME_MUNICIPALITY",
"POPULATION",
"AREA",
"GVA_AGROPEC",
"GVA_INDUSTRY",
"GVA_SERVICES",
"GVA_PUBLIC",
"COMP_AGRO",
"COMP_EXTRACTIVE",
"COMP_TRANSFORMATION",
"COMP_GAS_ELECT",
"COMP_WATER_SEWAGE",
"COMP_CONSTRUCTION",
"COMP_VEHICLE_REPAIR",
"COMP_TRANSPORT_MAIL",
"COMP_ACCOM_FOOD",
"COMP_INFOCOMM",
"COMP_FINANCE",
"COMP_REALESTATE",
"COMP_SCIENCE_TECHNICAL",
"COMP_ADMIN",
"COMP_DEFENSE_SS",
"COMP_EDUCATION",
"COMP_HEALTH_SOCIAL_SERVICE",
"COMP_ART_SPORT_RECRE",
"COMP_OTHER_SERVICES",
"COMP_DOMESTIC_SERVICES",
"COMP_INTERNATIONAL_INSTI",
"geom"
)
colnames(LQ) <- new_names## [1] "NAME_MUNICIPALITY" "POPULATION"
## [3] "AREA" "GVA_AGROPEC"
## [5] "GVA_INDUSTRY" "GVA_SERVICES"
## [7] "GVA_PUBLIC" "COMP_AGRO"
## [9] "COMP_EXTRACTIVE" "COMP_TRANSFORMATION"
## [11] "COMP_GAS_ELECT" "COMP_WATER_SEWAGE"
## [13] "COMP_CONSTRUCTION" "COMP_VEHICLE_REPAIR"
## [15] "COMP_TRANSPORT_MAIL" "COMP_ACCOM_FOOD"
## [17] "COMP_INFOCOMM" "COMP_FINANCE"
## [19] "COMP_REALESTATE" "COMP_SCIENCE_TECHNICAL"
## [21] "COMP_ADMIN" "COMP_DEFENSE_SS"
## [23] "COMP_EDUCATION" "COMP_HEALTH_SOCIAL_SERVICE"
## [25] "COMP_ART_SPORT_RECRE" "COMP_OTHER_SERVICES"
## [27] "COMP_DOMESTIC_SERVICES" "COMP_INTERNATIONAL_INSTI"
## [29] "geom"
Since COMP_DOMESTIC_SERVICES, previously known as INDUSTRY_T has 172 NA values, it will be removed from the dataframe.
Now, we will perform data aggregation to aggregate different industry into bigger categories.
According to Investopedia, we will group the attributes in our dataframe into 4 main categories.
We will group the Variables from COMP_AGROPEC to COMP_INTERNATIONAL_INSTI into 4 categories - AGROPEC, INDUSTRY, SERVICES & PUBLIC; for consistency with the GVA Categories that we have. I used the definitions from this to guide me through assigning the sectors to the main 4 categories.
SP_mun_aggre <- SP_mun %>%
mutate(`COMP_AGROPEC` = `COMP_AGRO` + `COMP_EXTRACTIVE`) %>%
mutate(`COMP_INDUSTRY` = `COMP_GAS_ELECT` + `COMP_CONSTRUCTION` + `COMP_TRANSFORMATION`) %>%
mutate(`COMP_SERVICES` = `COMP_WATER_SEWAGE` + `COMP_VEHICLE_REPAIR` + `COMP_TRANSPORT_MAIL` + `COMP_INFOCOMM` + `COMP_FINANCE` + `COMP_REALESTATE` + `COMP_SCIENCE_TECHNICAL` + `COMP_ADMIN`+`COMP_ACCOM_FOOD` +`COMP_OTHER_SERVICES` + `COMP_INTERNATIONAL_INSTI`) %>%
mutate(`COMP_PUBLIC` = `COMP_DEFENSE_SS` + `COMP_EDUCATION` + `COMP_HEALTH_SOCIAL_SERVICE` + `COMP_ART_SPORT_RECRE`) %>%
select(NAME_MUNICIPALITY, POPULATION, AREA, GVA_AGROPEC, GVA_INDUSTRY, GVA_SERVICES, GVA_PUBLIC, COMP_AGROPEC, COMP_INDUSTRY, COMP_SERVICES, COMP_PUBLIC, geom)comp_agrop <- ggplot(data=SP_mun_aggre,
aes(x= `COMP_AGROPEC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
comp_indus <- ggplot(data=SP_mun_aggre,
aes(x= `COMP_INDUSTRY`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
comp_service <- ggplot(data=SP_mun_aggre,
aes(x= `COMP_SERVICES`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
comp_public <- ggplot(data=SP_mun_aggre,
aes(x= `COMP_PUBLIC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gva_agrop <- ggplot(data=SP_mun_aggre,
aes(x= `GVA_AGROPEC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gva_indus <- ggplot(data=SP_mun_aggre,
aes(x= `GVA_INDUSTRY`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gva_service <- ggplot(data=SP_mun_aggre,
aes(x= `GVA_SERVICES`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gva_public <- ggplot(data=SP_mun_aggre,
aes(x= `GVA_PUBLIC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
ggarrange(comp_agrop, comp_indus, comp_service, comp_public,gva_agrop, gva_indus, gva_service, gva_public,
ncol = 2,
nrow = 4)boxbreaks <- function(v,mult=1.5) {
qv <- unname(quantile(v,na.rm = TRUE))
iqr <- qv[4] - qv[2]
upfence <- qv[4] + mult * iqr
lofence <- qv[2] - mult * iqr
# initialize break points vector
bb <- vector(mode="numeric",length=7)
# logic for lower and upper fences
if (lofence < qv[1]) {
# no lower outliers
bb[1] <- lofence
bb[2] <- floor(qv[1])
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if(upfence > qv[5]) {
# no upper outliers
bb[7] <- upfence
bb[6] <- ceiling(qv[5])
} else {
bb[6] <- upfence
bb[7] <- qv[5]
}
bb[3:5] <- qv[2:4]
return(bb)
}
get.var <- function(vname,df) {
v <- df[vname] %>% st_set_geometry(NULL)
v <- unname(v[,1])
return(v)
}
boxmap <- function(vnam,df,titlesize,legx, legy, legtitle=NA,mtitle="Box Map",mult=1.5){
var <- get.var(vnam,df)
bb <- boxbreaks(var)
tm_shape(df) +
tm_fill(vnam,
title=legtitle,
breaks=bb,
palette="-RdBu",
labels = c("lower outlier", "< 25%", "25% - 50%", "50% - 75%","> 75%", "upper outlier")
) +
tm_fill(vnam,
df %>%
filter(is.na(vnam)),
labels = c("NA")
) +
tm_borders(alpha =0.2) +
tm_layout(main.title = paste("Box Map of ",toString(vnam)," Numbers by Municipality"),
main.title.size = titlesize,
main.title.position = "center",
frame = FALSE,
legend.height = legx,
legend.width = legy
)
}
bm_qtm <- function(vnam,df,titlesize,legx, legy) {
tm <- qtm(df, vnam, main.title = paste("Chloropleth of ",toString(vnam)," Numbers by Municipality"), frame = FALSE, main.title.size = titlesize, main.title.position = "center")
bm <- boxmap(vnam, df, titlesize, legx,legy)
tmap_arrange(tm,bm, nrow = 2)
}## [1] "NAME_MUNICIPALITY" "POPULATION" "AREA"
## [4] "GVA_AGROPEC" "GVA_INDUSTRY" "GVA_SERVICES"
## [7] "GVA_PUBLIC" "COMP_AGROPEC" "COMP_INDUSTRY"
## [10] "COMP_SERVICES" "COMP_PUBLIC" "geom"
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
cluster_df <- SP_mun_aggre %>%
st_set_geometry(NULL) %>%
select("NAME_MUNICIPALITY", "GVA_AGROPEC","GVA_INDUSTRY", "GVA_SERVICES", "GVA_PUBLIC", "COMP_AGROPEC" ,"COMP_INDUSTRY", "COMP_SERVICES", "COMP_PUBLIC")
head(cluster_df,10)## NAME_MUNICIPALITY GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC
## 1 Cabreúva 25784.75 1035198.80 2019585.37 190767.21
## 2 Campo Limpo Paulista 7.87 378706.96 811522.38 287912.58
## 3 Itupeva 39554.78 1448072.53 1934070.02 268109.10
## 4 Jarinu 41414.02 210690.00 1287417.11 123189.79
## 5 Jundiaí 130551.25 7409989.10 23110009.69 1842.82
## 6 Louveira 29487.12 2396464.90 6140.22 280391.40
## 7 Várzea Paulista 1872.03 680125.23 965083.82 391572.39
## 8 Águas De São Pedro 0.00 10504.94 94367.21 20211.04
## 9 Analândia 28.54 39254.36 44.73 25947.92
## 10 Araras 102135.70 1396340.77 2434354.36 491613.15
## COMP_AGROPEC COMP_INDUSTRY COMP_SERVICES COMP_PUBLIC
## 1 2.5192077 1.6356377 5.028079 2.236264
## 2 0.4932410 1.4602773 3.508875 2.944918
## 3 2.3571852 2.3098980 5.043139 2.836002
## 4 3.6122034 1.1050816 3.958751 3.326196
## 5 1.0585721 1.0198368 4.785635 2.103140
## 6 1.0208685 1.3427092 4.712520 3.523391
## 7 0.1495956 1.9806734 4.419918 1.664542
## 8 4.8854000 0.4172801 4.754708 8.258216
## 9 25.1141667 0.6880818 4.184522 9.473083
## 10 3.6963572 1.4619303 3.955060 1.836958
row.names(cluster_df) <- cluster_df$"NAME_MUNI"
cluster_df <- select(cluster_df, c(2:7))
head(cluster_df,10)## GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC
## Cabreúva 25784.75 1035198.80 2019585.37 190767.21
## Campo Limpo Paulista 7.87 378706.96 811522.38 287912.58
## Itupeva 39554.78 1448072.53 1934070.02 268109.10
## Jarinu 41414.02 210690.00 1287417.11 123189.79
## Jundiaí 130551.25 7409989.10 23110009.69 1842.82
## Louveira 29487.12 2396464.90 6140.22 280391.40
## Várzea Paulista 1872.03 680125.23 965083.82 391572.39
## Águas De São Pedro 0.00 10504.94 94367.21 20211.04
## Analândia 28.54 39254.36 44.73 25947.92
## Araras 102135.70 1396340.77 2434354.36 491613.15
## COMP_AGROPEC COMP_INDUSTRY
## Cabreúva 2.5192077 1.6356377
## Campo Limpo Paulista 0.4932410 1.4602773
## Itupeva 2.3571852 2.3098980
## Jarinu 3.6122034 1.1050816
## Jundiaí 1.0585721 1.0198368
## Louveira 1.0208685 1.3427092
## Várzea Paulista 0.1495956 1.9806734
## Águas De São Pedro 4.8854000 0.4172801
## Analândia 25.1141667 0.6880818
## Araras 3.6963572 1.4619303
## GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC
## Min. :0.000000 Min. :0.0000000 Min. :0.0000000 Min. :0.0000000
## 1st Qu.:0.001732 1st Qu.:0.0002316 1st Qu.:0.0001483 1st Qu.:0.0006848
## Median :0.018774 Median :0.0025469 Median :0.0009892 Median :0.0036881
## Mean :0.058493 Mean :0.0229410 Mean :0.0114755 Mean :0.0159611
## 3rd Qu.:0.051229 3rd Qu.:0.0186825 3rd Qu.:0.0043399 3rd Qu.:0.0121212
## Max. :1.000000 Max. :1.0000000 Max. :1.0000000 Max. :1.0000000
## COMP_AGROPEC COMP_INDUSTRY
## Min. :0.000000 Min. :0.00000
## 1st Qu.:0.009945 1st Qu.:0.02718
## Median :0.042914 Median :0.04153
## Mean :0.113642 Mean :0.05191
## 3rd Qu.:0.160037 3rd Qu.:0.05820
## Max. :1.000000 Max. :1.00000
set.seed(12345)
gap_stat <- clusGap(cluster_df.std, FUN = hcut, nstart = 25, K.max = 20, B =50)
print(gap_stat, method = "firstmax")## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = cluster_df.std, FUNcluster = hcut, K.max = 20, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..20; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 1
## logW E.logW gap SE.sim
## [1,] 2.3802150 3.781715 1.401500 0.01887029
## [2,] 2.3031387 3.627648 1.324509 0.02098156
## [3,] 2.0502192 3.512714 1.462495 0.02410670
## [4,] 1.8882852 3.427714 1.539429 0.02570221
## [5,] 1.7246002 3.368021 1.643421 0.02439057
## [6,] 1.6436401 3.315368 1.671728 0.02382105
## [7,] 1.4978476 3.268759 1.770911 0.02219327
## [8,] 1.3708533 3.225469 1.854616 0.02087923
## [9,] 1.3168705 3.186190 1.869320 0.01998299
## [10,] 1.2678128 3.148082 1.880270 0.02010648
## [11,] 1.1747914 3.111681 1.936890 0.02063577
## [12,] 1.1086447 3.077134 1.968489 0.02033383
## [13,] 1.0733724 3.044673 1.971301 0.01980712
## [14,] 1.0231837 3.013326 1.990143 0.01976356
## [15,] 0.9873555 2.983594 1.996238 0.01908341
## [16,] 0.9624875 2.955132 1.992644 0.01878823
## [17,] 0.9047742 2.927445 2.022671 0.01839123
## [18,] 0.8700301 2.900950 2.030919 0.01800258
## [19,] 0.8390916 2.876066 2.036974 0.01797449
## [20,] 0.7989319 2.851117 2.052185 0.01785649
Use k = 5
heatmaply(cluster_df_mat,
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 3,
margins = c(NA,200,60,NA),
fontsize_row = 2,
fontsize_col = 5,
main="Geographic Segmentation of Sao Paulo by Industrial",
xlab = "Industrial Indicators",
ylab = "Municipalities of Sao Paulo State"
)SP_mun_aggre_clust <- cbind(SP_mun_aggre, as.matrix(groups)) %>%
rename(`CLUSTER`=`as.matrix.groups.`)## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition,
## but +towgs84= values preserved
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum South_American_Datum_1969 in CRS definition
## Neighbour list object:
## Number of regions: 172
## Number of nonzero links: 872
## Percentage nonzero weights: 2.947539
## Average number of links: 5.069767
## 1 region with no links:
## 100
## Link number distribution:
##
## 0 1 2 3 4 5 6 7 8 9 10 22
## 1 4 14 21 35 31 29 17 9 5 5 1
## 4 least connected regions:
## 8 9 29 36 with 1 link
## 1 most connected region:
## 161 with 22 links
plot(SP_mun_aggre_sp_crs, border=grey(.5))
plot(sp.nb, coordinates(SP_mun_aggre_sp_crs), col="blue", add=TRUE)One of the munnicipality is left out.
## [1] 100
nearest <- knearneigh(coordinates(SP_mun_aggre_sp_crs))$nn
sp.nb[[island_ind]] <- nearest[island_ind]
# We need to add this island to the neighbours of its connected municipality as well
sp.nb[[nearest[island_ind]]] <- c(as.integer(100), as.integer(sp.nb[[nearest[island_ind]]]))plot(SP_mun_aggre_sp_crs, border=grey(.5))
plot(sp.nb, coordinates(SP_mun_aggre_sp_crs), col="blue", add=TRUE)