Brazil is the world's fifth-largest country by area and the sixth most populous. Brazil is classified as an upper-middle income economy by the World Bank and a developing country, with the largest share of global wealth in Latin America. It is considered an advanced emerging economy. It has the ninth largest GDP in the world by nominal, and eighth by PPP measures. Behind all this impressive figures,the spatial development of Brazil is highly unequal asshown in the figure on the right. In 2016, the GDP per capita of the poorest municipality is R$3,190.6. On the other hand, the GDP per capita of the richest municipality is R$314,638. Half of the municipalities have GDP per capita less than R$16,000 and the top 25% municipalities earn R$26,155 and above.
In recent year, our government has been encouraging local enterprises to venture into international market and one of the highly recommended country is Brazil (refer to https://www.enterprisesg.gov.sg/overseas-markets/north-latin-america/brazil/market-profile, https://www.enterprisesg.gov.sg/blog/latin-america-tech-landscape, and https://www.enterprisesg.gov.sg/blog/where-to-begin-in-latin-america). However, in view of the spatial development of Brazil that is highly unequal, any company interested to invest in Brazil must have a good understanding of the geographical specialisation and competitiveness of industry in the country.
In this take-home exercise, it is required to segment São Paulo Macrometropolis at the municipality level into homogeneous industry specialization clusters by using data provided by Brazilian Institute of Geography and Statistics (IBGE) (https://www.ibge.gov.br/en/homeeng.html).
The specific tasks of the analysis are as follows:
Preparing a series of choropleth maps showing the distribution of spatial specialisation by industry type, 2016 at municipality level.
Delineating industry specialisation clusters by using hierarchical clustering method and display their distribution by using appropriate thematic mapping technique.
Delineating industry specialisation clusters by using spatially constrained clustering method and display their distribution by using appropriate thematic mapping technique
packages = c('rgdal', 'spdep', 'tmap', 'sf', 'ggpubr', 'cluster', 'factoextra', 'NbClust', 'heatmaply', 'corrplot', 'psych', 'tidyverse','geobr')
for (p in packages) {
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The csv files will be import using the read.csv function, using ; as a separator.
cities_data <- read.csv("data/aspatial/BRAZIL_CITIES.csv", sep = ";", encoding = "UTF-8") # add in encoding = UTF-8 to account for the portguese names that are in a different script as english
dict <- read.csv("data/aspatial/Data_Dictionary.csv", sep = ";", encoding = "UTF-8")Take a brief look at the cities dataframe
glimpse(cities_data)## Rows: 5,573
## Columns: 81
## $ CITY <chr> "Abadia De Goiás", "Abadia Dos Dourados", "A...
## $ STATE <chr> "GO", "MG", "GO", "MG", "PA", "CE", "BA", "B...
## $ CAPITAL <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ IBGE_RES_POP <int> 6876, 6704, 15757, 22690, 141100, 10496, 831...
## $ IBGE_RES_POP_BRAS <int> 6876, 6704, 15609, 22690, 141040, 10496, 831...
## $ IBGE_RES_POP_ESTR <int> 0, 0, 148, 0, 60, 0, 0, 0, 0, 0, 0, 16, 17, ...
## $ IBGE_DU <int> 2137, 2328, 4655, 7694, 31061, 2791, 2572, 4...
## $ IBGE_DU_URBAN <int> 1546, 1481, 3233, 6667, 19057, 1251, 1193, 2...
## $ IBGE_DU_RURAL <int> 591, 847, 1422, 1027, 12004, 1540, 1379, 195...
## $ IBGE_POP <int> 5300, 4154, 10656, 18464, 82956, 4538, 3725,...
## $ IBGE_1 <int> 69, 38, 139, 176, 1354, 98, 37, 167, 69, 12,...
## $ IBGE_1.4 <int> 318, 207, 650, 856, 5567, 323, 156, 733, 302...
## $ IBGE_5.9 <int> 438, 260, 894, 1233, 7618, 421, 263, 978, 37...
## $ IBGE_10.14 <int> 517, 351, 1087, 1539, 8905, 483, 277, 927, 4...
## $ IBGE_15.59 <int> 3542, 2709, 6896, 11979, 53516, 2631, 2319, ...
## $ IBGE_60. <int> 416, 589, 990, 2681, 5996, 582, 673, 803, 81...
## $ IBGE_PLANTED_AREA <int> 319, 4479, 10307, 1862, 25200, 2598, 895, 20...
## $ IBGE_CROP_PRODUCTION_. <int> 1843, 18017, 33085, 7502, 700872, 5234, 3999...
## $ IDHM.Ranking.2010 <int> 1689, 2207, 2202, 1994, 3530, 3522, 4086, 47...
## $ IDHM <dbl> 0.708, 0.690, 0.690, 0.698, 0.628, 0.628, 0....
## $ IDHM_Renda <dbl> 0.687, 0.693, 0.671, 0.720, 0.579, 0.540, 0....
## $ IDHM_Longevidade <dbl> 0.830, 0.839, 0.841, 0.848, 0.798, 0.748, 0....
## $ IDHM_Educacao <dbl> 0.622, 0.563, 0.579, 0.556, 0.537, 0.612, 0....
## $ LONG <dbl> -49.44055, -47.39683, -48.71881, -45.44619, ...
## $ LAT <dbl> -16.758812, -18.487565, -16.182672, -19.1558...
## $ ALT <dbl> 893.60, 753.12, 1017.55, 644.74, 10.12, 403....
## $ PAY_TV <int> 360, 77, 227, 1230, 3389, 29, 952, 51, 55, 1...
## $ FIXED_PHONES <int> 842, 296, 720, 1716, 1218, 34, 335, 222, 392...
## $ AREA <chr> "147.26", "881.06", "1,045.13", "1,817.07", ...
## $ REGIAO_TUR <chr> "", "Caminhos Do Cerrado", "Região Turística...
## $ CATEGORIA_TUR <chr> "", "D", "C", "D", "D", "", "D", "", "", "D"...
## $ ESTIMATED_POP <int> 8583, 6972, 19614, 23223, 156292, 11663, 876...
## $ RURAL_URBAN <chr> "Urbano", "Rural Adjacente", "Rural Adjacent...
## $ GVA_AGROPEC <dbl> 6.20, 50524.57, 42.84, 113824.60, 140463.72,...
## $ GVA_INDUSTRY <dbl> 27991.25, 25917.70, 16728.30, 31002.62, 5861...
## $ GVA_SERVICES <dbl> 74750.32, 62689.23, 138198.58, 172.33, 46812...
## $ GVA_PUBLIC <dbl> 36915.04, 28083.79, 63396.20, 86081.41, 4868...
## $ GVA_TOTAL <dbl> 145857.60, 167215.28, 261161.91, 403241.27, ...
## $ TAXES <dbl> 20554.20, 12873.50, 26822.58, 26994.09, 9518...
## $ GDP <dbl> 166.41, 180.09, 287984.49, 430235.36, 124925...
## $ POP_GDP <int> 8053, 7037, 18427, 23574, 151934, 11483, 921...
## $ GDP_CAPITA <dbl> 20664.57, 25591.70, 15628.40, 18250.42, 8222...
## $ GVA_MAIN <chr> "Demais serviços", "Demais serviços", "Demai...
## $ MUN_EXPENDIT <dbl> 28227691, 17909274, 37513019, NA, NA, NA, NA...
## $ COMP_TOT <int> 284, 476, 288, 621, 931, 86, 191, 87, 285, 6...
## $ COMP_A <int> 5, 6, 5, 18, 4, 1, 6, 2, 5, 2, 0, 8, 3, 1, 4...
## $ COMP_B <int> 1, 6, 9, 1, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0,...
## $ COMP_C <int> 56, 30, 26, 40, 43, 4, 8, 3, 20, 4, 9, 40, 3...
## $ COMP_D <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ COMP_E <int> 2, 2, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0,...
## $ COMP_F <int> 29, 34, 7, 20, 27, 6, 4, 0, 10, 2, 0, 25, 12...
## $ COMP_G <int> 110, 190, 117, 303, 500, 48, 97, 71, 133, 35...
## $ COMP_H <int> 26, 70, 12, 62, 16, 2, 5, 0, 18, 8, 1, 67, 3...
## $ COMP_I <int> 4, 28, 57, 30, 31, 10, 5, 1, 14, 3, 0, 25, 1...
## $ COMP_J <int> 5, 11, 2, 9, 6, 2, 3, 1, 8, 1, 1, 9, 5, 14, ...
## $ COMP_K <int> 0, 0, 1, 6, 1, 0, 1, 0, 0, 1, 0, 4, 3, 3, 0,...
## $ COMP_L <int> 2, 4, 0, 4, 1, 0, 0, 0, 4, 0, 0, 7, 4, 4, 0,...
## $ COMP_M <int> 10, 15, 7, 28, 22, 2, 5, 0, 11, 4, 2, 26, 17...
## $ COMP_N <int> 12, 29, 15, 27, 16, 3, 5, 1, 26, 0, 1, 16, 1...
## $ COMP_O <int> 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 6, 2, 4, 2, 4,...
## $ COMP_P <int> 6, 9, 11, 15, 155, 0, 8, 0, 8, 1, 6, 14, 11,...
## $ COMP_Q <int> 6, 14, 5, 19, 33, 2, 1, 2, 9, 3, 0, 13, 22, ...
## $ COMP_R <int> 1, 6, 1, 9, 15, 0, 2, 0, 4, 0, 0, 4, 6, 6, 0...
## $ COMP_S <int> 5, 19, 8, 27, 56, 4, 38, 4, 12, 3, 4, 23, 38...
## $ COMP_T <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COMP_U <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HOTELS <int> NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA...
## $ BEDS <int> NA, NA, 34, NA, NA, NA, 24, NA, NA, NA, NA, ...
## $ Pr_Agencies <int> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0, 2...
## $ Pu_Agencies <int> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1, 3...
## $ Pr_Bank <int> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0, 2...
## $ Pu_Bank <int> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1, 3...
## $ Pr_Assets <dbl> NA, NA, 33724584, 44974716, 76181384, NA, NA...
## $ Pu_Assets <dbl> NA, NA, 67091904, 371922572, 800078483, NA, ...
## $ Cars <int> 2158, 2227, 2838, 6928, 5277, 553, 896, 613,...
## $ Motorcycles <int> 1246, 1142, 1426, 2953, 25661, 1674, 696, 15...
## $ Wheeled_tractor <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0,...
## $ UBER <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ MAC <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ WAL.MART <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ POST_OFFICES <int> 1, 1, 3, 4, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,...
Extract out Geospatial Data for Sao Paulo Macrometropolis
unique(cities_data$STATE)## [1] "GO" "MG" "PA" "CE" "BA" "PR" "SC" "PE" "TO" "MA" "RN" "PI" "RS" "MT" "AC"
## [16] "SP" "ES" "AL" "PB" "MS" "RO" "RR" "AM" "AP" "SE" "RJ" "DF"
We can see that there are 27 state codes for each of Brazil's 27 states. We also notice that there is one for "SP" - which stands for Sao Paulo, the metropolitian region we are interested in.
Next, let us extract out the GIS data for the region that we are interested in.
sp_data <- cities_data %>% filter(STATE == "SP")
head(sp_data) # To get a look at the dataframe
The required geospatial data will be downloaded using the geobr package. I will be using the read_municipality function to download the municipality level data and using read_metro_area to retrieve the metropolitian regions.
municipalities <- read_municipality(year = 2016, simplified = TRUE, showProgress = FALSE) # add simplified = TRUE for faster download and plotting## Using year 2016
## Loading data for the whole country. This might take a few minutes.
Extract out Geospatial Data for Sao Paulo Municipalities
unique(municipalities$abbrev_state)## [1] "RO" "AC" "AM" "RR" "PA" "AP" "TO" "MA" "PI" "CE" "RN" "PB" "PE" "AL" "SE"
## [16] "BA" "MG" "ES" "RJ" "SP" "PR" "SC" "RS" "MS" "MT" "GO" "DF"
Next, let us extract out the GIS data for the region that we are interested in.
saopaulo_muni <- municipalities %>% filter(abbrev_state == "SP")Do a plot to examine whether the subsetted region is correct
plot(saopaulo_muni$geom)
Cross referencing to the given wikipedia source; this geographical boundary that was subsetted out is correct.
metro <- read_metro_area(year = 2016, simplified = TRUE, showProgress = FALSE)## Using year 2016
saupaulo_metro <- metro %>% filter(abbrev_state == "SP")plot(saupaulo_metro$geom)
The above metropolitian boundaries matches closely with that of the wikipedia page, with the exception of the Braganca Paulista city. Thus, we need to extract it from the municipality level data that we downloaded in the previous section.
From Wikipedia, we can retrieve the names of the municipalities enclosed in Braganca Paulista.
bp_names <- c("Atibaia",
"Bom Jesus dos Perdões",
"Bragança Paulista",
"Itatiba",
"Jarinu",
"Joanópolis",
"Morungaba",
"Nazaré Paulista",
"Piracaia",
"Tuiuti",
"Vargem"
)Next, we use it to subset out the relevant rows in the Sao Paulo municipality GIS data that we have.
bp_gis <- saopaulo_muni %>% filter(tolower(name_muni) %in% tolower(bp_names))
bp_gis
The extracted data has 11 features - which matches the length of the bp_names vector.
This will give us the boundaries for our required study area.
study_area <- bind_rows(saupaulo_metro, bp_gis)Check for duplicates
duplicated(study_area$name_muni)## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [169] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
The binding has introduced duplicates into our dataframe. Let us examine why.
dup_rows <- study_area[duplicated(study_area$name_muni),]
dup_names <- dup_rows$name_muni
study_area %>% filter(name_muni %in% dup_names)This is likely due to the reshuffling of municipalities around, as we can see in the above data that 3 municipalities may have been shifted to the Braganca Paulista region. This is confirmed based on this source.
Drop duplicated columns
study_area <- study_area[!duplicated(study_area$name_muni),]Check for duplicates & Do a plot to check if the boundaries are correct
duplicated(study_area$name_muni)## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE
plot(study_area$geom)We need to join the geospatial data and aspatial data together. However, the lack of the unique identifier municipality code in the aspatial data (which is present in the saopaulo GIS data) presents some challenges. Let us try the following approaches and see which one allows us to perform the join as intended.
From previous inspection, this can be done by joining on municipality name and state code.
sp_inner <- study_area %>% inner_join(sp_data, by = c("abbrev_state" = "STATE", "name_muni" = "CITY"))Evaluate whether the join works as expected
plot(sp_inner$geom)print(nrow(study_area)-nrow(sp_inner))## [1] 1
We can see above that 1 municipality is lost as a result of the inner join.
gis_names <- study_area$name_muni
aspat_names <- sp_data$CITY
setdiff(gis_names, aspat_names)## [1] "Santa Bárbara D'oeste"
Looking into the sp_data dataframe, we can see that this is because it is stored there as Santa Bárbara D'Oeste.
Let us rename the municipality and perform the join again.
sp_data <- within(sp_data, CITY[CITY =="Santa Bárbara D'Oeste"] <- "Santa Bárbara D'oeste")Perform Inner Join
sp_fin <- study_area %>% inner_join(sp_data, by = c("abbrev_state" = "STATE", "name_muni" = "CITY"))print(nrow(study_area)-nrow(sp_fin))## [1] 0
The data is now prepared and ready for analysis.
selected <- c("name_muni",
"IBGE_RES_POP",
"AREA",
"GVA_AGROPEC",
"GVA_INDUSTRY",
"GVA_SERVICES",
"GVA_PUBLIC",
"COMP_A",
"COMP_B",
"COMP_C",
"COMP_D",
"COMP_E",
"COMP_F",
"COMP_G",
"COMP_H",
"COMP_I",
"COMP_J",
"COMP_K",
"COMP_L",
"COMP_M",
"COMP_N",
"COMP_O",
"COMP_P",
"COMP_Q",
"COMP_R",
"COMP_S",
"COMP_T",
"COMP_U",
"geom"
)Select needed columns
sp_fin_filtered <- sp_fin[,selected]Rename columns for easier reference
new_names <- c("NAME_MUNI",
"POPULATION",
"AREA",
"GVA_AGROPEC",
"GVA_INDUSTRY",
"GVA_SERVICES",
"GVA_PUBLIC",
"COMP_AGRO",
"COMP_EXTRACTIVE",
"COMP_TRANSFORMATION",
"COMP_GAS_ELECT",
"COMP_WATER_SEWAGE",
"COMP_CONSTRUCTION",
"COMP_VEHICLE_REPAIR",
"COMP_TRANSPORT_MAIL",
"COMP_ACCOM_FOOD",
"COMP_INFOCOMM",
"COMP_FINANCE",
"COMP_REALESTATE",
"COMP_SCIENCE_TECHNICAL",
"COMP_ADMIN",
"COMP_DEFENSE_SS",
"COMP_EDUCATION",
"COMP_HEALTH_SOCIAL_SERVICE",
"COMP_ART_SPORT_RECRE",
"COMP_OTHER_SERVICES",
"COMP_DOMESTIC_SERVICES",
"COMP_INTERNATIONAL_INSTI",
"geom"
)
colnames(sp_fin_filtered) <- new_namesWe will group the Variables from COMP_AGROPEC to COMP_INTERNATIONAL_INSTI into 4 categories - AGROPEC, INDUSTRY, SERVICES & PUBLIC; for consistency with the GVA Categories that we have. I used the definitions from this page to guide me through assigning the sectors to the main 4 categories.
sp_transformed_v1 <- sp_fin_filtered %>%
mutate(`COMP_AGROPEC` = `COMP_AGRO` + `COMP_EXTRACTIVE`) %>%
mutate(`COMP_INDUSTRY` = `COMP_GAS_ELECT` + `COMP_CONSTRUCTION` + `COMP_TRANSFORMATION`) %>%
mutate(`COMP_SERVICES` = + `COMP_WATER_SEWAGE` + `COMP_VEHICLE_REPAIR` + `COMP_TRANSPORT_MAIL` + `COMP_INFOCOMM` + `COMP_FINANCE` + `COMP_REALESTATE` + `COMP_SCIENCE_TECHNICAL` + `COMP_ADMIN`+`COMP_ACCOM_FOOD` +`COMP_OTHER_SERVICES` + `COMP_DOMESTIC_SERVICES` + `COMP_INTERNATIONAL_INSTI`) %>%
mutate(`COMP_PUBLIC` = `COMP_DEFENSE_SS` + `COMP_EDUCATION` + `COMP_HEALTH_SOCIAL_SERVICE` + `COMP_ART_SPORT_RECRE`) %>%
select(NAME_MUNI, POPULATION, AREA, GVA_AGROPEC, GVA_INDUSTRY, GVA_SERVICES, GVA_PUBLIC, COMP_AGROPEC, COMP_INDUSTRY, COMP_SERVICES, COMP_PUBLIC, geom)Next, we will transform the GVA variables by dividing with population (to account for productivity and industrial competitiveness), and transform the COMP variables by dividing by municipality area (to account for the fact that larger municipalities have more land & vice versa).
sp_transformed <- sp_transformed_v1 %>%
mutate(`GVA_AGROPEC_PC` = GVA_AGROPEC*1000/POPULATION) %>%
mutate(`GVA_INDUSTRY_PC` = GVA_INDUSTRY*1000/POPULATION) %>%
mutate(`GVA_SERVICES_PC` = GVA_SERVICES*1000/POPULATION) %>%
mutate(`GVA_PUBLIC_PC` = GVA_PUBLIC*1000/POPULATION) %>%
mutate(AREA = gsub(",","",AREA)) %>% #important step since some values in the dataframe (with 4 digits); are separated by commas
mutate(AREA = as.numeric(AREA)) %>%
mutate(`COMP_AGROPEC_BA` = COMP_AGROPEC/AREA) %>%
mutate(`COMP_INDUSTRY_BA` = COMP_INDUSTRY/AREA) %>%
mutate(`COMP_SERVICES_BA` = COMP_SERVICES/AREA) %>%
mutate(`COMP_PUBLIC_BA` = COMP_PUBLIC/AREA) %>%
select(NAME_MUNI, GVA_AGROPEC_PC, GVA_INDUSTRY_PC, GVA_SERVICES_PC, GVA_PUBLIC_PC, COMP_AGROPEC_BA, COMP_INDUSTRY_BA, COMP_SERVICES_BA, COMP_PUBLIC_BA, geom)The data is now transformed. We will scale the variables later for the clustering section - but before that, let us get an understanding of the spatial and aspatial distribution of our 8 variables.
In this portion, we will be visualising our transformed variables using histograms. This allows us to get a rough understanding of the variable distribution.
gva_agp <- ggplot(data=sp_transformed,
aes(x= `GVA_AGROPEC_PC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
comp_agp <- ggplot(data=sp_transformed,
aes(x= `COMP_AGROPEC_BA`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gva_industry <- ggplot(data=sp_transformed,
aes(x= `GVA_INDUSTRY_PC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
comp_industry <- ggplot(data=sp_transformed,
aes(x= `COMP_INDUSTRY_BA`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gva_service <- ggplot(data=sp_transformed,
aes(x= `GVA_SERVICES_PC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
comp_service <- ggplot(data=sp_transformed,
aes(x= `COMP_SERVICES_BA`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gva_public <- ggplot(data=sp_transformed,
aes(x= `GVA_PUBLIC_PC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
comp_public <- ggplot(data=sp_transformed,
aes(x= `COMP_PUBLIC_BA`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")Visualise the Variable Distributions together
ggarrange(gva_agp, comp_agp, gva_industry, comp_industry, gva_service, comp_service, gva_public,comp_public,
ncol = 2,
nrow = 4)In this portion, we will be visualising our transformed variables using both chloropleths and box maps. Using boxmaps in addition to chloropleths are important as chloropleths can be easily distorted by the scale and breaks used. We are using chloropleths still as they allow us to see the difference in magnitude by the variable across the different Sao Paulo municipalities.
boxbreaks <- function(v,mult=1.5) {
qv <- unname(quantile(v,na.rm = TRUE))
iqr <- qv[4] - qv[2]
upfence <- qv[4] + mult * iqr
lofence <- qv[2] - mult * iqr
# initialize break points vector
bb <- vector(mode="numeric",length=7)
# logic for lower and upper fences
if (lofence < qv[1]) {
# no lower outliers
bb[1] <- lofence
bb[2] <- floor(qv[1])
} else {
bb[2] <- lofence
bb[1] <- qv[1]
}
if(upfence > qv[5]) {
# no upper outliers
bb[7] <- upfence
bb[6] <- ceiling(qv[5])
} else {
bb[6] <- upfence
bb[7] <- qv[5]
}
bb[3:5] <- qv[2:4]
return(bb)
}
get.var <- function(vname,df) {
v <- df[vname] %>% st_set_geometry(NULL)
v <- unname(v[,1])
return(v)
}
boxmap <- function(vnam,df,titlesize,legx, legy, legtitle=NA,mtitle="Box Map",mult=1.5){
var <- get.var(vnam,df)
bb <- boxbreaks(var)
tm_shape(df) +
tm_fill(vnam,
title=legtitle,
breaks=bb,
palette="-RdBu",
labels = c("lower outlier", "< 25%", "25% - 50%", "50% - 75%","> 75%", "upper outlier")
) +
tm_fill(vnam,
df %>%
filter(is.na(vnam)),
labels = c("NA")
) +
tm_borders(alpha =0.2) +
tm_layout(main.title = paste("Box Map of ",toString(vnam)," Numbers by Municipality"),
main.title.size = titlesize,
main.title.position = "center",
frame = FALSE,
legend.height = legx,
legend.width = legy
)
}
bm_qtm <- function(vnam,df,titlesize,legx, legy) {
tm <- qtm(df, vnam, main.title = paste("Chloropleth of ",toString(vnam)," Numbers by Municipality"), frame = FALSE, main.title.size = titlesize, main.title.position = "center")
bm <- boxmap(vnam, df, titlesize, legx,legy)
tmap_arrange(tm,bm, nrow = 2)
}bm_qtm("GVA_AGROPEC_PC", sp_transformed, 0.8,0.5,0.5)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
bm_qtm("COMP_AGROPEC_BA", sp_transformed, 0.8,0.5,0.5)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
bm_qtm("GVA_INDUSTRY_PC", sp_transformed, 0.8,0.5,0.5)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
bm_qtm("COMP_INDUSTRY_BA", sp_transformed, 0.8,0.5,0.5)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
bm_qtm("GVA_SERVICES_PC", sp_transformed, 0.8,0.5,0.5)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
bm_qtm("COMP_SERVICES_BA", sp_transformed, 0.8,0.5,0.5)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
bm_qtm("GVA_PUBLIC_PC", sp_transformed, 0.8,0.5,0.5)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
bm_qtm("COMP_PUBLIC_BA", sp_transformed, 0.8,0.5,0.5)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
Next, we will examine how the variables are correlated to one another using a correlation plot. This is also done before clustering as we want to ensure that the cluster variables are not highly correlated.
sp.cor = cor(as.data.frame(sp_transformed[2:9])[,1:8])
corrplot.mixed(sp.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black",
tl.cex = 0.75
)cluster_df <- sp_transformed %>%
st_set_geometry(NULL) %>%
select("NAME_MUNI", "GVA_AGROPEC_PC","GVA_INDUSTRY_PC", "GVA_SERVICES_PC", "GVA_PUBLIC_PC", "COMP_AGROPEC_BA" ,"COMP_SERVICES_BA")
head(cluster_df,10)We removed COMP_INDUSTRY_BA and COMP_PUBLIC_BA due to their high correlation with COMP_SERVICES_BA.
Next, we rename the rows by municipality name
row.names(cluster_df) <- cluster_df$"NAME_MUNI"
head(cluster_df,10)Afterwards, we drop the NAME_MUNI column
cluster_df <- select(cluster_df, c(2:7))
head(cluster_df,10)This step is important as the variables that we have selected for are different in terms of units and thus range. We will use Min-Max standardisation as we cannot assume that the variables come from a normal distribution.
cluster_df.std <- normalize(cluster_df)
summary(cluster_df.std)## GVA_AGROPEC_PC GVA_INDUSTRY_PC GVA_SERVICES_PC GVA_PUBLIC_PC
## Min. :0.0000000 Min. :0.000000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0002927 1st Qu.:0.006709 1st Qu.:0.04940 1st Qu.:0.3210
## Median :0.0102740 Median :0.026104 Median :0.08954 Median :0.3561
## Mean :0.0401360 Mean :0.052581 Mean :0.12748 Mean :0.3334
## 3rd Qu.:0.0419550 3rd Qu.:0.060139 3rd Qu.:0.15174 3rd Qu.:0.4039
## Max. :1.0000000 Max. :1.000000 Max. :1.00000 Max. :1.0000
## COMP_AGROPEC_BA COMP_SERVICES_BA
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.02892 1st Qu.:0.002910
## Median :0.05792 Median :0.008356
## Mean :0.10246 Mean :0.041472
## 3rd Qu.:0.11148 3rd Qu.:0.028325
## Max. :1.00000 Max. :1.000000
This chunk computes the euclidean distance between any 2 municipalities using the input variables that we have.
proxmat <- dist(cluster_df.std, method = "euclidean")m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(cluster_df.std, method = x)$ac
}
map_dbl(m, ac)## average single complete ward
## 0.9294905 0.9095148 0.9413205 0.9610303
From the above, we can see the Ward's method provides the strongest clustering structure. Let us use this method for hierarchial clustering of the municipalities.
hclust_ward <- hclust(proxmat, method = 'ward.D')Let us visualise the tree
plot(hclust_ward, cex = 0.3)We will use the gap statistic method which compares total intra-cluster variation for different number of clusters used; with comparison to the expected values under null reference data distriubtion. We want to select the cluster number that maximises the gap statistic.
set.seed(12345)
gap_stat <- clusGap(cluster_df.std, FUN = hcut, nstart = 20, K.max = 10, B =50)
# Print the result
print(gap_stat, method = "firstmax")## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = cluster_df.std, FUNcluster = hcut, K.max = 10, B = 50, nstart = 20)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 1
## logW E.logW gap SE.sim
## [1,] 2.765606 3.813811 1.048205 0.01634477
## [2,] 2.645646 3.666258 1.020612 0.01835091
## [3,] 2.451175 3.597884 1.146709 0.01560244
## [4,] 2.351573 3.539409 1.187836 0.01549761
## [5,] 2.299680 3.489840 1.190160 0.01489049
## [6,] 2.248246 3.447668 1.199422 0.01426494
## [7,] 2.204067 3.409547 1.205480 0.01378066
## [8,] 2.165894 3.374018 1.208124 0.01388462
## [9,] 2.076070 3.341144 1.265074 0.01431569
## [10,] 2.011010 3.310141 1.299131 0.01425596
Visualise it using a plot
fviz_gap_stat(gap_stat)From the above plot, we can see that the Gap Statistic seems to be increasing as k increases. While theoratically it means we can pick a large number for k, this is not a good practice as it will result in lesser than meaningful clusters. Using this source, we will instead pick the first cluster number i where the gradient from cluster i-1 to i is steeper than the gradient from i to i+1. This means we pick k=3 for this context.
plot(hclust_ward, cex = 0.4)
rect.hclust(hclust_ward, k = 3, border = 2:5)cluster_df_mat <- data.matrix(cluster_df.std)heatmaply(cluster_df_mat,
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 3,
margins = c(NA,200,60,NA),
fontsize_row = 2,
fontsize_col = 5,
main="Geographic Segmentation of Sao Paulo by Industrial indicators",
xlab = "Industrial Indicators",
ylab = "Municipalities of Sao Paulo State"
)groups <- as.factor(cutree(hclust_ward, k=3))Map back to original simple feature data frame
sp_transformed_clust <- cbind(sp_transformed, as.matrix(groups)) %>%
rename(`CLUSTER`=`as.matrix.groups.`)Visualise it on a chloropleth
qtm(sp_transformed_clust, "CLUSTER")
The above shows us pretty fragmented clusters - difficult for further analysis.
For this section, we will be looking at the outputs from the hierarchial clustering to understand some differences that are present between the 3 generated clusters. This will be done by plotting them on bar charts.
Create the bar charts
agrp_pc <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`GVA_AGROPEC_PC`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
indust_pc <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`GVA_INDUSTRY_PC`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
services_pc <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`GVA_SERVICES_PC`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
public_pc <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`GVA_PUBLIC_PC`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
comp_agro_ba <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`COMP_AGROPEC_BA`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")
comp_serv_ba <- ggplot(sp_transformed_clust, aes(x = `CLUSTER` , y=`COMP_SERVICES_BA`, fill=`CLUSTER`)) + geom_bar(stat="summary", fun="mean")Arrange them together
ggarrange(agrp_pc, indust_pc, services_pc, public_pc, comp_agro_ba, comp_serv_ba, nrow = 2, ncol =3)To remedy the fragmented clusters of the hierachial clustering method, we will try the spatially constrained clustering method using the SKATER approach.
sp_transformed_sp <- as_Spatial(sp_transformed)## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum Unknown based on GRS80 ellipsoid in CRS definition
## Warning in showSRID(SRS_string, format = "PROJ", multiline = "NO", prefer_proj
## = prefer_proj): Discarded datum Sistema de Referencia Geocentrico para las
## AmericaS 2000 in CRS definition
Change CRS to EPSG 4618. Source
sp_transformed_sp <- spTransform(sp_transformed_sp, CRS("+init=epsg:4618"))## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO", prefer_proj =
## prefer_proj): Discarded datum South_American_Datum_1969 in CRS definition
sp.nb <- poly2nb(sp_transformed_sp)
summary(sp.nb)## Neighbour list object:
## Number of regions: 172
## Number of nonzero links: 872
## Percentage nonzero weights: 2.947539
## Average number of links: 5.069767
## 1 region with no links:
## 100
## Link number distribution:
##
## 0 1 2 3 4 5 6 7 8 9 10 22
## 1 4 14 21 35 31 29 17 9 5 5 1
## 4 least connected regions:
## 8 9 29 36 with 1 link
## 1 most connected region:
## 161 with 22 links
plot(sp_transformed_sp, border=grey(.5))
plot(sp.nb, coordinates(sp_transformed_sp), col="blue", add=TRUE)
We can see the municipality that has the most number of neighbours is the large one in the South Region. In addition, one municipality (in the south east) is not connected to any other municipalities, due to the fact that it is probably an island off the coast of the main land.
We first find out which municipality is the island
cnt <- card(sp.nb)
island_ind <- match(0, cnt)
print(island_ind)## [1] 100
The island municipality's index is 100 in sp.nb.
Now let us add in a neighbour for this municipality using the knearneigh function of the spdep package.
nearest <- knearneigh(coordinates(sp_transformed_sp))$nn
sp.nb[[island_ind]] <- nearest[island_ind]
# We need to add this island to the neighbours of its connected municipality as well
sp.nb[[nearest[island_ind]]] <- c(as.integer(100), as.integer(sp.nb[[nearest[island_ind]]]))plot(sp_transformed_sp, border=grey(.5))
plot(sp.nb, coordinates(sp_transformed_sp), col="blue", add=TRUE) The island municipality now has a neighbour, and we can proceed to compute the edge costs.
lcosts <- nbcosts(sp.nb, cluster_df.std)sp.w <- nb2listw(sp.nb, lcosts, style="B")
summary(sp.w)## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 172
## Number of nonzero links: 874
## Percentage nonzero weights: 2.9543
## Average number of links: 5.081395
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 22
## 5 14 21 35 30 30 17 9 5 5 1
## 5 least connected regions:
## 8 9 29 36 100 with 1 link
## 1 most connected region:
## 161 with 22 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 172 29584 293.1677 304.1517 3442.263
sp.mst <- mstree(sp.w)Check for dimensions of MST
dim(sp.mst)## [1] 171 3
This has 171 edges, which is equal to 172-1. Hence the mst is correct.
Plot MST
plot(sp_transformed_sp, border=gray(.5))
plot.mst(sp.mst, coordinates(sp_transformed_sp),
col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)clust3 <- skater(sp.mst[,1:2], cluster_df.std, method = "euclidean", 2)Check cluster assignment
ccs3 <- clust3$groups
ccs3## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 2 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1
table(ccs3)## ccs3
## 1 2 3
## 165 6 1
Visualise Clusters
plot(sp_transformed_sp, border=gray(.5))
plot(clust3, coordinates(sp_transformed_sp), cex.lab=.4,
groups.colors=c("red","green","blue"), cex.circles=0.005, add=TRUE)## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
groups_mat <- as.matrix(clust3$groups)
sp_transformed_sp_spatialclust <- cbind(sp_transformed_clust, as.factor(groups_mat)) %>%
rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(sp_transformed_sp_spatialclust, "SP_CLUSTER")Let us compare the Spatially Constrained vs Hierarchial Clustering Methods.
hclust.map <- qtm(sp_transformed_clust,
"CLUSTER", legend.outside = T) +
tm_borders(alpha = 0.5)
spclust.map <- qtm(sp_transformed_sp_spatialclust,
"SP_CLUSTER", legend.outside = T) +
tm_borders(alpha = 0.5)
tmap_arrange(hclust.map, spclust.map,
asp=NA, nrow=2)## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
## Warning: One tm layer group has duplicated layer types, which are omitted. To
## draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).