In this exercise, we will determine factors affecting the unequal development of Brazil at the municipality level by using the data provided. The specific task of the analysis are as follows:
Prepare a choropleth map showing the distribution of GDP per capita, 2016 at municipality level.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using multiple linear regression method.
Prepare a choropleth map showing the distribution of the residual of the GDP per capita.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using geographically weighted regression method.
Prepare a series of choropleth maps showing the outputs of the geographically weighted regression model.
The following data sets will be used for this study:
BRAZIL_CITIES.csv in csv format obtained from Kaggle
2016 Brazil Municipality Boundary File obtained using ‘geobr’ package
We will install and launch the relevant R packages required for this exercise in R environment.
packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse', 'geobr')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The 2016 Municipality Boundary Layer will be the geospatial data used for this exercise. It will be downloaded using the ‘geobr’ package.
We wil obtain a list of the geospatial datasets available in ‘geobr’ to select which one we need.
datasets <- list_geobr()
print(datasets)
## # A tibble: 22 x 4
## `function` geography years source
## <chr> <chr> <chr> <chr>
## 1 `read_country` Country 1872, 1900, 1911, 1920, 1933, ~ IBGE
## 2 `read_region` Region 2000, 2001, 2010, 2013, 2014, ~ IBGE
## 3 `read_state` States 1872, 1900, 1911, 1920, 1933, ~ IBGE
## 4 `read_meso_regi~ Meso region 2000, 2001, 2010, 2013, 2014, ~ IBGE
## 5 `read_micro_reg~ Micro region 2000, 2001, 2010, 2013, 2014, ~ IBGE
## 6 `read_intermedi~ Intermediate region 2017 IBGE
## 7 `read_immediate~ Immediate region 2017 IBGE
## 8 `read_municipal~ Municipality 1872, 1900, 1911, 1920, 1933, ~ IBGE
## 9 `read_weighting~ Census weighting are~ 2010 IBGE
## 10 `read_census_tr~ Census tract (setor ~ 2000, 2010 IBGE
## # ... with 12 more rows
We will use the read_minicipality() function to download the shape files of Brazil municipalities as ‘sf’ objects. code_muni is set to “all” to obtain municipalities of all states in Brazil and year is set to ‘2016’ to get data from 2016. The data is in the Geodetic Reference System “SIRGAS2000” and CRS 4674.
mun_2016 <- read_municipality(code_muni = "all", year = 2016, simplified = TRUE)
##
|
| | 0%
|
|=== | 4%
|
|===== | 7%
|
|======== | 11%
|
|========== | 15%
|
|============= | 19%
|
|================ | 22%
|
|================== | 26%
|
|===================== | 30%
|
|======================= | 33%
|
|========================== | 37%
|
|============================= | 41%
|
|=============================== | 44%
|
|================================== | 48%
|
|==================================== | 52%
|
|======================================= | 56%
|
|========================================= | 59%
|
|============================================ | 63%
|
|=============================================== | 67%
|
|================================================= | 70%
|
|==================================================== | 74%
|
|====================================================== | 78%
|
|========================================================= | 81%
|
|============================================================ | 85%
|
|============================================================== | 89%
|
|================================================================= | 93%
|
|=================================================================== | 96%
|
|======================================================================| 100%
test <- st_is_valid(mun_2016)
length(which(test==FALSE))
## [1] 1
There is one invalid polygon as observed from the st_is_valid test. Hence, we need to make this polygon valid.
mun_2016 <- st_make_valid(mun_2016)
test <- st_is_valid(mun_2016)
length(which(test==FALSE))
## [1] 0
The invalid polygon is now valid.
any(is.na(mun_2016))
## [1] FALSE
There are no NA values.
any(duplicated(mun_2016$code_muni))
## [1] FALSE
There are no duplicated values.
Only some of the columns in the dataframe are relevant to our analysis hence we will extract them.
mun_boundary <- mun_2016 %>%
select("name_muni", "abbrev_state", "code_muni", "geom")
# Remove plot axis
no_axis <- theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
# Plot all Brazilian municipalities
ggplot() +
geom_sf(data= mun_2016, fill="#2D3E50", color="#FEBF57", size=.15, show.legend = FALSE) +
labs(subtitle="Municipalities", size=8) +
theme_minimal() +
no_axis
BRAZIL_CITIES.csv is in csv file format, separated by ‘;’. Hence, we will use read.delim() function to import it into R to ensure that decimal points are retained in the data and it is ONLY separated by ‘;’,
brazil_info <- read.delim("data/aspatial/BRAZIL_CITIES.csv", sep = ";")
Viewing Data Type and Fields:
glimpse(brazil_info)
## Rows: 5,573
## Columns: 81
## $ CITY <fct> Abadia De Goiás, Abadia Dos Dourados, Abadi...
## $ STATE <fct> GO, MG, GO, MG, PA, CE, BA, BA, PR, SC, PA, ...
## $ CAPITAL <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ IBGE_RES_POP <int> 6876, 6704, 15757, 22690, 141100, 10496, 831...
## $ IBGE_RES_POP_BRAS <int> 6876, 6704, 15609, 22690, 141040, 10496, 831...
## $ IBGE_RES_POP_ESTR <int> 0, 0, 148, 0, 60, 0, 0, 0, 0, 0, 0, 16, 17, ...
## $ IBGE_DU <int> 2137, 2328, 4655, 7694, 31061, 2791, 2572, 4...
## $ IBGE_DU_URBAN <int> 1546, 1481, 3233, 6667, 19057, 1251, 1193, 2...
## $ IBGE_DU_RURAL <int> 591, 847, 1422, 1027, 12004, 1540, 1379, 195...
## $ IBGE_POP <int> 5300, 4154, 10656, 18464, 82956, 4538, 3725,...
## $ IBGE_1 <int> 69, 38, 139, 176, 1354, 98, 37, 167, 69, 12,...
## $ IBGE_1.4 <int> 318, 207, 650, 856, 5567, 323, 156, 733, 302...
## $ IBGE_5.9 <int> 438, 260, 894, 1233, 7618, 421, 263, 978, 37...
## $ IBGE_10.14 <int> 517, 351, 1087, 1539, 8905, 483, 277, 927, 4...
## $ IBGE_15.59 <int> 3542, 2709, 6896, 11979, 53516, 2631, 2319, ...
## $ IBGE_60. <int> 416, 589, 990, 2681, 5996, 582, 673, 803, 81...
## $ IBGE_PLANTED_AREA <int> 319, 4479, 10307, 1862, 25200, 2598, 895, 20...
## $ IBGE_CROP_PRODUCTION_. <int> 1843, 18017, 33085, 7502, 700872, 5234, 3999...
## $ IDHM.Ranking.2010 <int> 1689, 2207, 2202, 1994, 3530, 3522, 4086, 47...
## $ IDHM <dbl> 0.708, 0.690, 0.690, 0.698, 0.628, 0.628, 0....
## $ IDHM_Renda <dbl> 0.687, 0.693, 0.671, 0.720, 0.579, 0.540, 0....
## $ IDHM_Longevidade <dbl> 0.830, 0.839, 0.841, 0.848, 0.798, 0.748, 0....
## $ IDHM_Educacao <dbl> 0.622, 0.563, 0.579, 0.556, 0.537, 0.612, 0....
## $ LONG <dbl> -49.44055, -47.39683, -48.71881, -45.44619, ...
## $ LAT <dbl> -16.758812, -18.487565, -16.182672, -19.1558...
## $ ALT <dbl> 893.60, 753.12, 1017.55, 644.74, 10.12, 403....
## $ PAY_TV <int> 360, 77, 227, 1230, 3389, 29, 952, 51, 55, 1...
## $ FIXED_PHONES <int> 842, 296, 720, 1716, 1218, 34, 335, 222, 392...
## $ AREA <fct> "147.26", "881.06", "1,045.13", "1,817.07", ...
## $ REGIAO_TUR <fct> , Caminhos Do Cerrado, Região TurÃstica Do...
## $ CATEGORIA_TUR <fct> , D, C, D, D, , D, , , D, E, D, D, D, , E, ,...
## $ ESTIMATED_POP <int> 8583, 6972, 19614, 23223, 156292, 11663, 876...
## $ RURAL_URBAN <fct> Urbano, Rural Adjacente, Rural Adjacente, Ur...
## $ GVA_AGROPEC <dbl> 6.20, 50524.57, 42.84, 113824.60, 140463.72,...
## $ GVA_INDUSTRY <dbl> 27991.25, 25917.70, 16728.30, 31002.62, 5861...
## $ GVA_SERVICES <dbl> 74750.32, 62689.23, 138198.58, 172.33, 46812...
## $ GVA_PUBLIC <dbl> 36915.04, 28083.79, 63396.20, 86081.41, 4868...
## $ GVA_TOTAL <dbl> 145857.60, 167215.28, 261161.91, 403241.27, ...
## $ TAXES <dbl> 20554.20, 12873.50, 26822.58, 26994.09, 9518...
## $ GDP <dbl> 166.41, 180.09, 287984.49, 430235.36, 124925...
## $ POP_GDP <int> 8053, 7037, 18427, 23574, 151934, 11483, 921...
## $ GDP_CAPITA <dbl> 20664.57, 25591.70, 15628.40, 18250.42, 8222...
## $ GVA_MAIN <fct> "Demais serviços", "Demais serviços", "Dem...
## $ MUN_EXPENDIT <dbl> 28227691, 17909274, 37513019, NA, NA, NA, NA...
## $ COMP_TOT <int> 284, 476, 288, 621, 931, 86, 191, 87, 285, 6...
## $ COMP_A <int> 5, 6, 5, 18, 4, 1, 6, 2, 5, 2, 0, 8, 3, 1, 4...
## $ COMP_B <int> 1, 6, 9, 1, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0,...
## $ COMP_C <int> 56, 30, 26, 40, 43, 4, 8, 3, 20, 4, 9, 40, 3...
## $ COMP_D <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ COMP_E <int> 2, 2, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0,...
## $ COMP_F <int> 29, 34, 7, 20, 27, 6, 4, 0, 10, 2, 0, 25, 12...
## $ COMP_G <int> 110, 190, 117, 303, 500, 48, 97, 71, 133, 35...
## $ COMP_H <int> 26, 70, 12, 62, 16, 2, 5, 0, 18, 8, 1, 67, 3...
## $ COMP_I <int> 4, 28, 57, 30, 31, 10, 5, 1, 14, 3, 0, 25, 1...
## $ COMP_J <int> 5, 11, 2, 9, 6, 2, 3, 1, 8, 1, 1, 9, 5, 14, ...
## $ COMP_K <int> 0, 0, 1, 6, 1, 0, 1, 0, 0, 1, 0, 4, 3, 3, 0,...
## $ COMP_L <int> 2, 4, 0, 4, 1, 0, 0, 0, 4, 0, 0, 7, 4, 4, 0,...
## $ COMP_M <int> 10, 15, 7, 28, 22, 2, 5, 0, 11, 4, 2, 26, 17...
## $ COMP_N <int> 12, 29, 15, 27, 16, 3, 5, 1, 26, 0, 1, 16, 1...
## $ COMP_O <int> 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 6, 2, 4, 2, 4,...
## $ COMP_P <int> 6, 9, 11, 15, 155, 0, 8, 0, 8, 1, 6, 14, 11,...
## $ COMP_Q <int> 6, 14, 5, 19, 33, 2, 1, 2, 9, 3, 0, 13, 22, ...
## $ COMP_R <int> 1, 6, 1, 9, 15, 0, 2, 0, 4, 0, 0, 4, 6, 6, 0...
## $ COMP_S <int> 5, 19, 8, 27, 56, 4, 38, 4, 12, 3, 4, 23, 38...
## $ COMP_T <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COMP_U <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HOTELS <int> NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA...
## $ BEDS <int> NA, NA, 34, NA, NA, NA, 24, NA, NA, NA, NA, ...
## $ Pr_Agencies <int> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0, 2...
## $ Pu_Agencies <int> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1, 3...
## $ Pr_Bank <int> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0, 2...
## $ Pu_Bank <int> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1, 3...
## $ Pr_Assets <dbl> NA, NA, 33724584, 44974716, 76181384, NA, NA...
## $ Pu_Assets <dbl> NA, NA, 67091904, 371922572, 800078483, NA, ...
## $ Cars <int> 2158, 2227, 2838, 6928, 5277, 553, 896, 613,...
## $ Motorcycles <int> 1246, 1142, 1426, 2953, 25661, 1674, 696, 15...
## $ Wheeled_tractor <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0,...
## $ UBER <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ MAC <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ WAL.MART <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ POST_OFFICES <int> 1, 1, 3, 4, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,...
Since we are determining the factors affecting the unequal development of Brazil based on the GDP in 2016, we will only extract variables that are relevant to our analysis. We will exclude variables that are not required and do not have a significant impact on GDP per capita. For example, we have the data for resident population regular urban planning for people aged 0-60+. We will only require the data of residents aged 15-59 who are active in the economy.
brazil_variables <- brazil_info %>%
select(1:4, 8:9, 15, 17:18, 20, 24:25, 27:29, 38:39, 42, 44:45, 67, 71:77)
summary(brazil_info)
## CITY STATE CAPITAL IBGE_RES_POP
## Bom Jesus : 5 MG : 853 Min. :0.000000 Min. : 805
## São Domingos : 5 SP : 645 1st Qu.:0.000000 1st Qu.: 5235
## Bonito : 4 RS : 498 Median :0.000000 Median : 10934
## Planalto : 4 BA : 418 Mean :0.004845 Mean : 34278
## São Francisco: 4 PR : 399 3rd Qu.:0.000000 3rd Qu.: 23424
## Santa Helena : 4 SC : 295 Max. :1.000000 Max. :11253503
## (Other) :5547 (Other):2465 NA's :8
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN
## Min. : 805 Min. : 0.0 Min. : 239 Min. : 60
## 1st Qu.: 5230 1st Qu.: 0.0 1st Qu.: 1572 1st Qu.: 874
## Median : 10926 Median : 0.0 Median : 3174 Median : 1846
## Mean : 34200 Mean : 77.5 Mean : 10303 Mean : 8859
## 3rd Qu.: 23390 3rd Qu.: 10.0 3rd Qu.: 6726 3rd Qu.: 4624
## Max. :11133776 Max. :119727.0 Max. :3576148 Max. :3548433
## NA's :8 NA's :8 NA's :10 NA's :10
## IBGE_DU_RURAL IBGE_POP IBGE_1 IBGE_1.4
## Min. : 3 Min. : 174 Min. : 0.0 Min. : 5
## 1st Qu.: 487 1st Qu.: 2801 1st Qu.: 38.0 1st Qu.: 158
## Median : 931 Median : 6170 Median : 92.0 Median : 376
## Mean : 1463 Mean : 27595 Mean : 383.3 Mean : 1544
## 3rd Qu.: 1832 3rd Qu.: 15302 3rd Qu.: 232.0 3rd Qu.: 951
## Max. :33809 Max. :10463636 Max. :129464.0 Max. :514794
## NA's :81 NA's :8 NA's :8 NA's :8
## IBGE_5.9 IBGE_10.14 IBGE_15.59 IBGE_60.
## Min. : 7 Min. : 12 Min. : 94 Min. : 29
## 1st Qu.: 220 1st Qu.: 259 1st Qu.: 1734 1st Qu.: 341
## Median : 516 Median : 588 Median : 3841 Median : 722
## Mean : 2069 Mean : 2381 Mean : 18212 Mean : 3004
## 3rd Qu.: 1300 3rd Qu.: 1478 3rd Qu.: 9628 3rd Qu.: 1724
## Max. :684443 Max. :783702 Max. :7058221 Max. :1293012
## NA's :8 NA's :8 NA's :8 NA's :8
## IBGE_PLANTED_AREA IBGE_CROP_PRODUCTION_. IDHM.Ranking.2010 IDHM
## Min. : 0.0 Min. : 0 Min. : 1 Min. :0.4180
## 1st Qu.: 910.2 1st Qu.: 2326 1st Qu.:1392 1st Qu.:0.5990
## Median : 3471.5 Median : 13846 Median :2783 Median :0.6650
## Mean : 14179.9 Mean : 57384 Mean :2783 Mean :0.6592
## 3rd Qu.: 11194.2 3rd Qu.: 55619 3rd Qu.:4174 3rd Qu.:0.7180
## Max. :1205669.0 Max. :3274885 Max. :5565 Max. :0.8620
## NA's :3 NA's :3 NA's :8 NA's :8
## IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG
## Min. :0.4000 Min. :0.6720 Min. :0.2070 Min. :-72.92
## 1st Qu.:0.5720 1st Qu.:0.7690 1st Qu.:0.4900 1st Qu.:-50.87
## Median :0.6540 Median :0.8080 Median :0.5600 Median :-46.52
## Mean :0.6429 Mean :0.8016 Mean :0.5591 Mean :-46.23
## 3rd Qu.:0.7070 3rd Qu.:0.8360 3rd Qu.:0.6310 3rd Qu.:-41.40
## Max. :0.8910 Max. :0.8940 Max. :0.8250 Max. :-32.44
## NA's :8 NA's :8 NA's :8 NA's :9
## LAT ALT PAY_TV FIXED_PHONES
## Min. :-33.688 Min. : 0.0 Min. : 1 Min. : 3
## 1st Qu.:-22.838 1st Qu.: 169.8 1st Qu.: 88 1st Qu.: 119
## Median :-18.089 Median : 406.5 Median : 247 Median : 327
## Mean :-16.444 Mean : 893.8 Mean : 3094 Mean : 6567
## 3rd Qu.: -8.489 3rd Qu.: 628.9 3rd Qu.: 815 3rd Qu.: 1151
## Max. : 4.585 Max. :874579.0 Max. :2047668 Max. :5543127
## NA's :9 NA's :9 NA's :3 NA's :3
## AREA REGIAO_TUR CATEGORIA_TUR ESTIMATED_POP
## : 3 :2288 :2288 Min. : 786
## 1,227.93: 2 Corredores Das Ã\201guas: 59 A: 51 1st Qu.: 5454
## 100.07 : 2 Vale Do Contestado : 45 B: 168 Median : 11590
## 101.89 : 2 Amazônia Atlântica : 40 C: 521 Mean : 37432
## 102.00 : 2 Araguaia-Tocantins : 39 D:1892 3rd Qu.: 25296
## 106.85 : 2 Cariri : 37 E: 653 Max. :12176866
## (Other) :5560 (Other) :3065 NA's :3
## RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY
## : 3 Min. : 0 Min. : 1
## Intermediário Adjacente: 686 1st Qu.: 4189 1st Qu.: 1726
## Intermediário Remoto : 60 Median : 20426 Median : 7424
## Rural Adjacente :3040 Mean : 47271 Mean : 175928
## Rural Remoto : 323 3rd Qu.: 51227 3rd Qu.: 41022
## Sem classificação : 5 Max. :1402282 Max. :63306755
## Urbano :1456 NA's :3 NA's :3
## GVA_SERVICES GVA_PUBLIC GVA_TOTAL TAXES
## Min. : 2 Min. : 7 Min. : 17 Min. : -14159
## 1st Qu.: 10112 1st Qu.: 17267 1st Qu.: 42253 1st Qu.: 1305
## Median : 31211 Median : 35866 Median : 119492 Median : 5100
## Mean : 489451 Mean : 123768 Mean : 832987 Mean : 118864
## 3rd Qu.: 115406 3rd Qu.: 89245 3rd Qu.: 313963 3rd Qu.: 22197
## Max. :464656988 Max. :41902893 Max. :569910503 Max. :117125387
## NA's :3 NA's :3 NA's :3 NA's :3
## GDP POP_GDP GDP_CAPITA
## Min. : 15 Min. : 815 Min. : 3191
## 1st Qu.: 43709 1st Qu.: 5483 1st Qu.: 9058
## Median : 125153 Median : 11578 Median : 15870
## Mean : 954584 Mean : 36998 Mean : 21126
## 3rd Qu.: 329539 3rd Qu.: 25085 3rd Qu.: 26155
## Max. :687035890 Max. :12038175 Max. :314638
## NA's :3 NA's :3 NA's :3
## GVA_MAIN
## Administração, defesa, educação e saúde públicas e seguridade social :2725
## Demais serviços :1477
## Agricultura, inclusive apoio à agricultura e a pós colheita : 735
## Indústrias de transformação : 261
## Pecuária, inclusive apoio à pecuária : 161
## Eletricidade e gás, água, esgoto, atividades de gestão de resÃduos e descontaminação: 98
## (Other) : 116
## MUN_EXPENDIT COMP_TOT COMP_A COMP_B
## Min. :1.421e+06 Min. : 6.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:1.573e+07 1st Qu.: 68.0 1st Qu.: 1.00 1st Qu.: 0.000
## Median :2.746e+07 Median : 162.0 Median : 2.00 Median : 0.000
## Mean :1.043e+08 Mean : 906.8 Mean : 18.25 Mean : 1.852
## 3rd Qu.:5.666e+07 3rd Qu.: 448.0 3rd Qu.: 8.00 3rd Qu.: 2.000
## Max. :4.577e+10 Max. :530446.0 Max. :1948.00 Max. :274.000
## NA's :1492 NA's :3 NA's :3 NA's :3
## COMP_C COMP_D COMP_E COMP_F
## Min. : 0.00 Min. : 0.0000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.00 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 1.00
## Median : 11.00 Median : 0.0000 Median : 0.000 Median : 4.00
## Mean : 73.44 Mean : 0.4262 Mean : 2.029 Mean : 43.26
## 3rd Qu.: 39.00 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 15.00
## Max. :31566.00 Max. :332.0000 Max. :657.000 Max. :25222.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_G COMP_H COMP_I COMP_J
## Min. : 1.0 Min. : 0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 32.0 1st Qu.: 1 1st Qu.: 2.00 1st Qu.: 0.00
## Median : 74.5 Median : 7 Median : 7.00 Median : 1.00
## Mean : 348.0 Mean : 41 Mean : 55.88 Mean : 24.74
## 3rd Qu.: 199.0 3rd Qu.: 25 3rd Qu.: 24.00 3rd Qu.: 5.00
## Max. :150633.0 Max. :19515 Max. :29290.00 Max. :38720.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_K COMP_L COMP_M COMP_N
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 1.0
## Median : 0.00 Median : 0.00 Median : 4.00 Median : 4.0
## Mean : 15.55 Mean : 15.14 Mean : 51.29 Mean : 83.7
## 3rd Qu.: 2.00 3rd Qu.: 3.00 3rd Qu.: 13.00 3rd Qu.: 14.0
## Max. :23738.00 Max. :14003.00 Max. :49181.00 Max. :76757.0
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_O COMP_P COMP_Q COMP_R
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 0.00
## Median : 2.000 Median : 6.00 Median : 3.00 Median : 2.00
## Mean : 3.269 Mean : 30.96 Mean : 34.15 Mean : 12.18
## 3rd Qu.: 3.000 3rd Qu.: 17.00 3rd Qu.: 12.00 3rd Qu.: 6.00
## Max. :204.000 Max. :16030.00 Max. :22248.00 Max. :6687.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_S COMP_T COMP_U HOTELS
## Min. : 0.00 Min. :0 Min. : 0.00000 Min. : 1.000
## 1st Qu.: 5.00 1st Qu.:0 1st Qu.: 0.00000 1st Qu.: 1.000
## Median : 12.00 Median :0 Median : 0.00000 Median : 1.000
## Mean : 51.61 Mean :0 Mean : 0.05027 Mean : 3.131
## 3rd Qu.: 31.00 3rd Qu.:0 3rd Qu.: 0.00000 3rd Qu.: 3.000
## Max. :24832.00 Max. :0 Max. :123.00000 Max. :97.000
## NA's :3 NA's :3 NA's :3 NA's :4686
## BEDS Pr_Agencies Pu_Agencies Pr_Bank
## Min. : 2.0 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 40.0 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.000
## Median : 82.0 Median : 1.000 Median : 2.000 Median : 1.000
## Mean : 257.5 Mean : 3.383 Mean : 2.829 Mean : 1.312
## 3rd Qu.: 200.0 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :13247.0 Max. :1693.000 Max. :626.000 Max. :83.000
## NA's :4686 NA's :2231 NA's :2231 NA's :2231
## Pu_Bank Pr_Assets Pu_Assets Cars
## Min. :0.00 Min. :0.000e+00 Min. :0.000e+00 Min. : 2
## 1st Qu.:1.00 1st Qu.:0.000e+00 1st Qu.:4.047e+07 1st Qu.: 602
## Median :2.00 Median :3.231e+07 Median :1.339e+08 Median : 1438
## Mean :1.58 Mean :9.180e+09 Mean :6.005e+09 Mean : 9859
## 3rd Qu.:2.00 3rd Qu.:1.148e+08 3rd Qu.:4.970e+08 3rd Qu.: 4086
## Max. :8.00 Max. :1.947e+13 Max. :8.016e+12 Max. :5740995
## NA's :2231 NA's :2231 NA's :2231 NA's :11
## Motorcycles Wheeled_tractor UBER MAC
## Min. : 4 Min. : 0.000 Min. :1 Min. : 1.000
## 1st Qu.: 591 1st Qu.: 0.000 1st Qu.:1 1st Qu.: 1.000
## Median : 1285 Median : 0.000 Median :1 Median : 2.000
## Mean : 4879 Mean : 5.754 Mean :1 Mean : 4.277
## 3rd Qu.: 3294 3rd Qu.: 1.000 3rd Qu.:1 3rd Qu.: 3.000
## Max. :1134570 Max. :3236.000 Max. :1 Max. :130.000
## NA's :11 NA's :11 NA's :5448 NA's :5407
## WAL.MART POST_OFFICES
## Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 1.000 Median : 1.000
## Mean : 2.059 Mean : 2.081
## 3rd Qu.: 1.750 3rd Qu.: 2.000
## Max. :26.000 Max. :225.000
## NA's :5471 NA's :120
Viewing Extracted Dataset:
summary(brazil_variables)
## CITY STATE CAPITAL IBGE_RES_POP
## Bom Jesus : 5 MG : 853 Min. :0.000000 Min. : 805
## São Domingos : 5 SP : 645 1st Qu.:0.000000 1st Qu.: 5235
## Bonito : 4 RS : 498 Median :0.000000 Median : 10934
## Planalto : 4 BA : 418 Mean :0.004845 Mean : 34278
## São Francisco: 4 PR : 399 3rd Qu.:0.000000 3rd Qu.: 23424
## Santa Helena : 4 SC : 295 Max. :1.000000 Max. :11253503
## (Other) :5547 (Other):2465 NA's :8
## IBGE_DU_URBAN IBGE_DU_RURAL IBGE_15.59 IBGE_PLANTED_AREA
## Min. : 60 Min. : 3 Min. : 94 Min. : 0.0
## 1st Qu.: 874 1st Qu.: 487 1st Qu.: 1734 1st Qu.: 910.2
## Median : 1846 Median : 931 Median : 3841 Median : 3471.5
## Mean : 8859 Mean : 1463 Mean : 18212 Mean : 14179.9
## 3rd Qu.: 4624 3rd Qu.: 1832 3rd Qu.: 9628 3rd Qu.: 11194.2
## Max. :3548433 Max. :33809 Max. :7058221 Max. :1205669.0
## NA's :10 NA's :81 NA's :8 NA's :3
## IBGE_CROP_PRODUCTION_. IDHM LONG LAT
## Min. : 0 Min. :0.4180 Min. :-72.92 Min. :-33.688
## 1st Qu.: 2326 1st Qu.:0.5990 1st Qu.:-50.87 1st Qu.:-22.838
## Median : 13846 Median :0.6650 Median :-46.52 Median :-18.089
## Mean : 57384 Mean :0.6592 Mean :-46.23 Mean :-16.444
## 3rd Qu.: 55619 3rd Qu.:0.7180 3rd Qu.:-41.40 3rd Qu.: -8.489
## Max. :3274885 Max. :0.8620 Max. :-32.44 Max. : 4.585
## NA's :3 NA's :8 NA's :9 NA's :9
## PAY_TV FIXED_PHONES AREA GVA_TOTAL
## Min. : 1 Min. : 3 : 3 Min. : 17
## 1st Qu.: 88 1st Qu.: 119 1,227.93: 2 1st Qu.: 42253
## Median : 247 Median : 327 100.07 : 2 Median : 119492
## Mean : 3094 Mean : 6567 101.89 : 2 Mean : 832987
## 3rd Qu.: 815 3rd Qu.: 1151 102.00 : 2 3rd Qu.: 313963
## Max. :2047668 Max. :5543127 106.85 : 2 Max. :569910503
## NA's :3 NA's :3 (Other) :5560 NA's :3
## TAXES GDP_CAPITA MUN_EXPENDIT COMP_TOT
## Min. : -14159 Min. : 3191 Min. :1.421e+06 Min. : 6.0
## 1st Qu.: 1305 1st Qu.: 9058 1st Qu.:1.573e+07 1st Qu.: 68.0
## Median : 5100 Median : 15870 Median :2.746e+07 Median : 162.0
## Mean : 118864 Mean : 21126 Mean :1.043e+08 Mean : 906.8
## 3rd Qu.: 22197 3rd Qu.: 26155 3rd Qu.:5.666e+07 3rd Qu.: 448.0
## Max. :117125387 Max. :314638 Max. :4.577e+10 Max. :530446.0
## NA's :3 NA's :3 NA's :1492 NA's :3
## HOTELS Pr_Bank Pu_Bank Pr_Assets
## Min. : 1.000 Min. : 0.000 Min. :0.00 Min. :0.000e+00
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:1.00 1st Qu.:0.000e+00
## Median : 1.000 Median : 1.000 Median :2.00 Median :3.231e+07
## Mean : 3.131 Mean : 1.312 Mean :1.58 Mean :9.180e+09
## 3rd Qu.: 3.000 3rd Qu.: 2.000 3rd Qu.:2.00 3rd Qu.:1.148e+08
## Max. :97.000 Max. :83.000 Max. :8.00 Max. :1.947e+13
## NA's :4686 NA's :2231 NA's :2231 NA's :2231
## Pu_Assets Cars Motorcycles Wheeled_tractor
## Min. :0.000e+00 Min. : 2 Min. : 4 Min. : 0.000
## 1st Qu.:4.047e+07 1st Qu.: 602 1st Qu.: 591 1st Qu.: 0.000
## Median :1.339e+08 Median : 1438 Median : 1285 Median : 0.000
## Mean :6.005e+09 Mean : 9859 Mean : 4879 Mean : 5.754
## 3rd Qu.:4.970e+08 3rd Qu.: 4086 3rd Qu.: 3294 3rd Qu.: 1.000
## Max. :8.016e+12 Max. :5740995 Max. :1134570 Max. :3236.000
## NA's :2231 NA's :11 NA's :11 NA's :11
As seen from the summary, there are many variables with NA values. Some municipalities have NA values for Longitude and Latitude too. All these will need to be rectified before proceeding.
The values in the area column are factors instead of being numeric. For example, 1100.5 is documented as 1,100.5. We will rectify this by first removing the “,” from the strings and replacing it with "" using gsub(). Next, we will use as.numeric() to convert the string to a numeric variable.
brazil_variables$AREA = as.numeric(gsub(",", "", brazil_variables$AREA))
Rechecking Area Values:
We will use head() to view the first 6 rows in the AREA column.
head(brazil_variables$AREA)
## [1] 147.26 881.06 1045.13 1817.07 1610.65 180.08
The values have been rectified.
Identifying Position of NA Values:
which(is.na(brazil_variables$LONG))
## [1] 472 2702 3117 3581 3761 3806 3821 4490 4606
which(is.na(brazil_variables$LAT))
## [1] 472 2702 3117 3581 3761 3806 3821 4490 4606
Identifying Name of Municipalities with NA values:
brazil_variables[c(472,2702,3117,3581,3761,3806,3821,4490,4606), 1]
## [1] Balneário Rincão Lagoa Dos Patos Mojuà Dos Campos
## [4] ParaÃso Das Ã\201guas Pescaria Brava Pinhal Da Serra
## [7] Pinto Bandeira Santa Terezinha São Caetano
## 5299 Levels: Ângulo Óbidos Óleo Érico Cardoso ... Zortéa
Replacing NA Values with Coordinates:
We will replace the NA values with the correct coordinates taken from https://pt.db-city.com/
brazil_variables$LONG[472] <- -49.2361
brazil_variables$LAT[472] <- -28.8344
brazil_variables$LONG[2702] <- -51.4725
brazil_variables$LAT[2702] <- -31.0697
brazil_variables$LONG[3117] <- -54.6431
brazil_variables$LAT[3117] <- -2.68472
brazil_variables$LONG[3581] <- -53.0102
brazil_variables$LAT[3581] <- -19.0257
brazil_variables$LONG[3761] <- -48.8956
brazil_variables$LAT[3761] <- -28.4247
brazil_variables$LONG[3806] <- -51.1733
brazil_variables$LAT[3806] <- -27.8747
brazil_variables$LONG[3821] <- -51.4503
brazil_variables$LAT[3821] <- -29.0978
brazil_variables$LONG[4490] <- -39.5184
brazil_variables$LAT[4490] <- -12.7498
brazil_variables$LONG[4606] <- -36.1459
brazil_variables$LAT[4606] <- -8.33
Rechecking for NA Longitude/Latitude Values:
any(is.na(brazil_variables$LONG))
## [1] FALSE
any(is.na(brazil_variables$LAT))
## [1] FALSE
The coordinates for all municipalities have been set.
Since there are NA values present for the GDP per capita for some municipalities, we will drop these observations as it is unfair to assume that their GDP per capita is 0.
brazil_variables <- brazil_variables %>% filter(!is.na(GDP_CAPITA))
Replacing NA values with 0
The other variables with NA values can be replaced with 0 to indicate that the data is not present.
brazil_variables[is.na(brazil_variables)] <- 0
Rechecking for NA values
any(is.na(brazil_variables))
## [1] FALSE
All the NA values have been rectified.
any(duplicated(brazil_variables$LAT))
## [1] FALSE
There are no duplicated rows. The dataset has been cleaned and is ready to be used.
Currently, the brazil_variables data frame is aspatial. We will convert it to a sf object for ease of handling later.
brazil_variables.sf <- st_as_sf(brazil_variables, coords = c("LONG", "LAT"), crs = 4674)
Before we can prepare the choropleth map, we need to combine both the geospatial ‘sf’ object (i.e. mun_boundary) and aspatial ‘sf’ object (i.e. brazil_variables.sf) into one. This will be performed by using the st_join function of dplyr package since both are ‘sf’ objects.
brazil_sf <- st_join(mun_boundary, brazil_variables.sf)
Checking for Rows with Na Values:
The number of observations has increased hence we suspect the presence of NA values.
any(is.na(brazil_sf))
## [1] TRUE
Due to the joining of data sets, NA values may have arised from mismatched data. We will proceed to remove them.
Removing Rows with NA values:
brazil_sf <- brazil_sf %>% filter(!is.na(GDP_CAPITA))
Rechecking for NA values:
any(is.na(brazil_sf))
## [1] FALSE
tm_shape(brazil_sf)+
tm_fill("GDP_CAPITA") +
tm_borders(lwd = 0.1, alpha = 1)
Observations:
The west-central region of Brazil (State of Mato Grosso) has relatively higher GDP per capita as compared to other regions. This may be because it is mostly covered with Amazon rainforest, wetlands and savanna plains and the capital is a travel hub.
The capital of Brazil, Brasilia, has slightly higher GDP per capita as compared to many other municipalities.
The Selviria municipality has very high GDP per capita and the municipalities around it have higher GDP per capita compared to majority of the municipalities in Brazil.
There are some small municipalities with high GDP per capita towards Southern Brazil.
Distribution of GDP per capita, 2016 in Brazil:
ggplot(data=brazil_variables.sf, aes(x=`GDP_CAPITA`)) +
geom_histogram(bins=20, color="black", fill="light blue")
The figure above reveals a right skewed distribution. This means that most municipalities have low GDP per capita.
Before building a multiple regression model, we need ensure that the independent variables used are not highly correlated to each other. If highly correlated independent variables are used in building a regression model, the quality of the model will be compromised.
We will plot a Correlation matrix to visualise the relationships between the independent variables using the corrplot package.
Selecting Independent Variables:
corr_plot_variables <- brazil_variables[, -c(11,12,18) ]
Plotting Correlation Matrix:
We will order the variables alphabetically to mine any hidden patterns.
corrplot(cor(corr_plot_variables[, 3:25]), diag = FALSE, order = "alphabet",
tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")
From the correlogram, we can see that the following variables (IBGE_DU_URBAN, CARS, IBGE_15.59, PAY_TV, MUN_EXPENDIT, FIXED_PHONES, GVA_TOTAL, IBGE_RES_POP, TAXES & COMP_TOT) are highly correlated to one another. Therefore, we will only retain MUN_EXPENDIT out of the highly correlated variables.
We will use lm() to calibrate the multiple linear regression model.
variables.mlr <- lm(formula = GDP_CAPITA ~ CAPITAL + IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM + AREA + MUN_EXPENDIT + HOTELS + Pr_Bank + Pu_Bank + Pr_Assets + Pu_Assets + Motorcycles + Wheeled_tractor, data = brazil_variables.sf)
summary(variables.mlr)
##
## Call:
## lm(formula = GDP_CAPITA ~ CAPITAL + IBGE_DU_RURAL + IBGE_PLANTED_AREA +
## IBGE_CROP_PRODUCTION_. + IDHM + AREA + MUN_EXPENDIT + HOTELS +
## Pr_Bank + Pu_Bank + Pr_Assets + Pu_Assets + Motorcycles +
## Wheeled_tractor, data = brazil_variables.sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77386 -6808 -2949 1852 283561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.356e+04 2.574e+03 -20.810 < 2e-16 ***
## CAPITAL 1.888e+03 4.703e+03 0.401 0.688117
## IBGE_DU_RURAL -7.817e-01 1.781e-01 -4.388 1.16e-05 ***
## IBGE_PLANTED_AREA 7.588e-02 1.400e-02 5.418 6.28e-08 ***
## IBGE_CROP_PRODUCTION_. 8.155e-03 4.255e-03 1.916 0.055369 .
## IDHM 1.117e+05 3.936e+03 28.368 < 2e-16 ***
## AREA 5.302e-02 4.292e-02 1.235 0.216784
## MUN_EXPENDIT 3.624e-06 1.026e-06 3.531 0.000417 ***
## HOTELS 7.202e+01 1.032e+02 0.698 0.485503
## Pr_Bank 3.993e+02 3.133e+02 1.274 0.202554
## Pu_Bank 8.999e+02 3.016e+02 2.983 0.002862 **
## Pr_Assets -6.310e-10 1.774e-09 -0.356 0.722029
## Pu_Assets 5.040e-09 2.346e-09 2.149 0.031708 *
## Motorcycles -1.998e-01 3.398e-02 -5.881 4.31e-09 ***
## Wheeled_tractor 2.379e+01 8.018e+00 2.967 0.003024 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17200 on 5555 degrees of freedom
## Multiple R-squared: 0.2861, Adjusted R-squared: 0.2843
## F-statistic: 159 on 14 and 5555 DF, p-value: < 2.2e-16
From the report, we can see that CAPITAL, AREA, HOTELS, Pr_Bank and Pr_Assets are not statistically significant independent variables as compared to the rest. Hence, we will revise our model by removing AREA from our multiple linear regression model.
Calibrating Revised Model:
variables.mlr1 <- lm(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = brazil_variables.sf)
summary(variables.mlr1)
##
## Call:
## lm(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA +
## IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank +
## Pu_Assets + Motorcycles + Wheeled_tractor, data = brazil_variables.sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77118 -6860 -2975 1848 283569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.382e+04 2.468e+03 -21.812 < 2e-16 ***
## IBGE_DU_RURAL -8.005e-01 1.763e-01 -4.541 5.71e-06 ***
## IBGE_PLANTED_AREA 7.495e-02 1.393e-02 5.380 7.76e-08 ***
## IBGE_CROP_PRODUCTION_. 8.814e-03 4.225e-03 2.086 0.037002 *
## IDHM 1.122e+05 3.753e+03 29.909 < 2e-16 ***
## MUN_EXPENDIT 3.822e-06 8.094e-07 4.722 2.40e-06 ***
## Pu_Bank 1.093e+03 2.821e+02 3.872 0.000109 ***
## Pu_Assets 5.431e-09 2.305e-09 2.357 0.018476 *
## Motorcycles -1.760e-01 2.856e-02 -6.163 7.66e-10 ***
## Wheeled_tractor 2.284e+01 7.500e+00 3.045 0.002337 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17200 on 5560 degrees of freedom
## Multiple R-squared: 0.2855, Adjusted R-squared: 0.2843
## F-statistic: 246.8 on 9 and 5560 DF, p-value: < 2.2e-16
All the independent variables are statistically significant.
It is important for us to test the key assumptions when building our model before proceeding to finalise it.
Multicollinearity is a situation where collinearity exists between three or more variables even if no pair of variables have particularly high correlation. This may lead to redundancy between independent variables.In the presence of multicollinearity, the regression model becomes unstable.
We will use ols_vif_tol() of olsrr package to test if there are sign of multicollinearity.
ols_vif_tol(variables.mlr1)
## Variables Tolerance VIF
## 1 IBGE_DU_RURAL 0.5980051 1.672227
## 2 IBGE_PLANTED_AREA 0.1409950 7.092452
## 3 IBGE_CROP_PRODUCTION_. 0.1355743 7.376030
## 4 IDHM 0.6687730 1.495276
## 5 MUN_EXPENDIT 0.1475242 6.778548
## 6 Pu_Bank 0.5836669 1.713306
## 7 Pu_Assets 0.7368104 1.357201
## 8 Motorcycles 0.1486427 6.727543
## 9 Wheeled_tractor 0.3083798 3.242755
Since the VIF of the independent variables are less than 10, we conclude that there are no signs of strong multicollinearity amongst the independent variables.
It is important for us to test the linearity and additivity of the relationship between the dependent and independent variables.
We will use ols_plot_resid_fit() function of olsrr package is used to perform the linearity assumption test.
ols_plot_resid_fit(variables.mlr1)
From the figure, we can see that majority of the data points are scattered near the line passing through 0. This enables us to conclude that the relationships between the dependent and independent variables are generally linear.
We will also test to detect any possible violation of the normality assumption on the residuals of our model.
We will use the ols_plot_resid_hist() function of olsrr package for the normality assumption test. This was chosen over ols_test_normality() as it does not a allow a sample size greater than 5000.
ols_plot_resid_hist(variables.mlr1)
The figure above explicitly shows us that the residuals of the multiple linear regression model resemble a normal distribution.
The Homoscedasticity assumption assumes that the variance of error terms are similar across the values of the independent variables.
We will plot our residuals against predicted values to ascertain whether the points are equally distributed across all the values of the independent variables.
plot(variables.mlr1, 1)
As seen from the plot, the line in red is generally flat, signalling that the disturbances are homoscedastic in nature. There does not seem to be a pattern.
Since the data we have used is geospatial in nature, it can be subjected to spatial autocorrelation, leading to misspecification errors. The presence of spatial autocorrelation will challenge the independence of variables hence it is important to perform this test.
We will first take a look at how the residuals are distributed to detect potential signs of clustering.
Extracting Residuals of Model and Merging them with Data:
We will join the mun_2016 dataset with brazil_variables and convert it to an ‘sf’ object to act as the polygon layer.
We will then extract the residuals and join them to the polygon layer as well as the spatial points layer (brazil_variables.sf). The polygons are for mapping purposes while the points will be used for analysis subsequently.
# Extract Residuals
residuals <- as.data.frame(variables.mlr1$residuals)
# Joining Residuals with Data & Renaming Residuals
point_res_sf <- cbind(brazil_sf, variables.mlr1$residuals) %>%
rename(`residuals` = `variables.mlr1.residuals`)
Choropleth Map to Visualise the Distribution of the Residuals of GDP per capita, 2016 in Brazil:
tm_shape(point_res_sf) +
tm_fill("residuals")
## Variable(s) "residuals" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
From the map that have been plotted, the residuals appear to be randomly distributed across Brazil. We will conduct the Global Moran’s I test to further investigate.
Converting ‘sf’ to ‘sp’ Dataframe:
First, we need to convert our ‘sf’ dataframe to a SpatialPointDataFrame as ‘spdep’ package requires us to do so.
point_res_sp <- as_Spatial(point_res_sf)
Computing Distanced Based Neighbours:
Next, we will determine the upper limit of the distance band for our distance based weight matrix.
coords <- coordinates(point_res_sp)
k <- knn2nb(knearneigh(coords))
kdists <- unlist(nbdists(k, coords, longlat=FALSE))
summary(kdists)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01289 0.10906 0.14690 0.19642 0.22010 3.33432
The largest distance is approximately 3.4. Hence, we will use this distance so that every point has >= 1 neighbour.
Computing Distance Based Weights Matrix:
nb <- dnearneigh(coords, 0, 3.4, longlat=FALSE)
nb_lw <- nb2listw(nb, style='B')
Computing Global Moran’s I:
Null Hypothesis: The residuals for the multiple linear regression model are randomly distributed.
Alternative Hypothesis: The residuals for the multiple linear regression model are not randomly distributed.
Confidence Interval: 0.95
lm.morantest(variables.mlr1,nb_lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA +
## IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank + Pu_Assets +
## Motorcycles + Wheeled_tractor, data = brazil_variables.sf)
## weights: nb_lw
##
## Moran I statistic standard deviate = -1.3091, p-value = 0.9048
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## -1.174246e-03 -1.799330e-04 5.768867e-07
Conclusion:
– The p-value is greater than than 0.05 (alpha value at 95% confidence interval)
– We do not reject the null hypothesis at the 95% confidence interval.
–> The residuals for the multiple linear regression model are randomly distributed.
We will be modelling the GWR using both fixed and adaptive bandwidth schemes. Amongst the two approaches that can be used (AIC & Cross-Validation Approach), we will use Cross-Validation(CV).
bw.fixed <- bw.gwr(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp, approach="cv", kernel="gaussian", adaptive=FALSE, longlat=FALSE)
## Take a cup of tea and have a break, it will take a few minutes.
## -----A kind suggestion from GWmodel development group
## Fixed bandwidth: 34.70094 CV score: 2.120404e+12
## Fixed bandwidth: 21.45065 CV score: 2.107555e+12
## Fixed bandwidth: 13.26152 CV score: 2.079318e+12
## Fixed bandwidth: 8.200357 CV score: 2.02493e+12
## Fixed bandwidth: 5.072388 CV score: 1.949693e+12
## Fixed bandwidth: 3.139197 CV score: 1.905993e+12
## Fixed bandwidth: 1.944419 CV score: 2.204673e+12
## Fixed bandwidth: 3.87761 CV score: 1.907414e+12
## Fixed bandwidth: 2.682833 CV score: 1.979863e+12
## Fixed bandwidth: 3.421246 CV score: 1.894095e+12
## Fixed bandwidth: 3.595562 CV score: 1.897478e+12
## Fixed bandwidth: 3.313513 CV score: 1.894926e+12
## Fixed bandwidth: 3.487829 CV score: 1.894912e+12
## Fixed bandwidth: 3.380096 CV score: 1.894033e+12
## Fixed bandwidth: 3.354663 CV score: 1.894212e+12
## Fixed bandwidth: 3.395814 CV score: 1.894009e+12
## Fixed bandwidth: 3.405528 CV score: 1.894024e+12
## Fixed bandwidth: 3.38981 CV score: 1.894011e+12
## Fixed bandwidth: 3.399524 CV score: 1.894012e+12
## Fixed bandwidth: 3.39352 CV score: 1.894009e+12
## Fixed bandwidth: 3.392103 CV score: 1.894009e+12
## Fixed bandwidth: 3.394396 CV score: 1.894009e+12
## Fixed bandwidth: 3.394938 CV score: 1.894009e+12
## Fixed bandwidth: 3.394062 CV score: 1.894009e+12
## Fixed bandwidth: 3.393855 CV score: 1.894009e+12
## Fixed bandwidth: 3.394189 CV score: 1.894009e+12
## Fixed bandwidth: 3.393983 CV score: 1.894009e+12
## Fixed bandwidth: 3.393934 CV score: 1.894009e+12
## Fixed bandwidth: 3.394013 CV score: 1.894009e+12
The result shows that the recommended bandwidth is 3.402438 kilometres.
gwr.fixed <- gwr.basic(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp, bw=bw.fixed, kernel = 'gaussian', longlat = FALSE)
gwr.fixed
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2020-05-31 18:26:36
## Call:
## gwr.basic(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA +
## IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank +
## Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp,
## bw = bw.fixed, kernel = "gaussian", longlat = FALSE)
##
## Dependent (y) variable: GDP_CAPITA
## Independent variables: IBGE_DU_RURAL IBGE_PLANTED_AREA IBGE_CROP_PRODUCTION_. IDHM MUN_EXPENDIT Pu_Bank Pu_Assets Motorcycles Wheeled_tractor
## Number of data points: 5570
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77118 -6860 -2975 1848 283569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.382e+04 2.468e+03 -21.812 < 2e-16 ***
## IBGE_DU_RURAL -8.005e-01 1.763e-01 -4.541 5.71e-06 ***
## IBGE_PLANTED_AREA 7.495e-02 1.393e-02 5.380 7.76e-08 ***
## IBGE_CROP_PRODUCTION_. 8.814e-03 4.225e-03 2.086 0.037002 *
## IDHM 1.122e+05 3.753e+03 29.909 < 2e-16 ***
## MUN_EXPENDIT 3.822e-06 8.094e-07 4.722 2.40e-06 ***
## Pu_Bank 1.093e+03 2.821e+02 3.872 0.000109 ***
## Pu_Assets 5.431e-09 2.305e-09 2.357 0.018476 *
## Motorcycles -1.760e-01 2.856e-02 -6.163 7.66e-10 ***
## Wheeled_tractor 2.284e+01 7.500e+00 3.045 0.002337 **
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 17200 on 5560 degrees of freedom
## Multiple R-squared: 0.2855
## Adjusted R-squared: 0.2843
## F-statistic: 246.8 on 9 and 5560 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 1.64508e+12
## Sigma(hat): 17188.74
## AIC: 124464.4
## AICc: 124464.4
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: gaussian
## Fixed bandwidth: 3.394013
## Regression points: the same locations as observations are used.
## Distance metric: Euclidean distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu.
## Intercept -9.1424e+04 -5.4536e+04 -2.6593e+04 -1.5657e+04
## IBGE_DU_RURAL -5.8976e+00 -1.1036e+00 -6.1339e-01 -3.6570e-01
## IBGE_PLANTED_AREA -5.4649e-02 3.0439e-02 7.4042e-02 1.2122e-01
## IBGE_CROP_PRODUCTION_. -3.2018e-02 -4.0013e-03 5.7125e-03 1.4905e-02
## IDHM -1.0090e+04 4.3097e+04 7.6575e+04 1.1362e+05
## MUN_EXPENDIT -2.1700e-05 4.1576e-06 6.6813e-06 1.7404e-05
## Pu_Bank -1.0231e+03 7.9581e+02 1.2374e+03 1.8688e+03
## Pu_Assets -4.6589e-07 -8.6199e-08 -3.1347e-09 5.2917e-09
## Motorcycles -9.8492e-01 -2.8808e-01 -2.3099e-01 -1.4970e-01
## Wheeled_tractor -7.5831e+02 -1.9104e+01 2.6848e+01 2.9990e+01
## Max.
## Intercept 3.6888e+04
## IBGE_DU_RURAL 9.7750e-01
## IBGE_PLANTED_AREA 6.3350e-01
## IBGE_CROP_PRODUCTION_. 4.9000e-02
## IDHM 1.6482e+05
## MUN_EXPENDIT 0.0000e+00
## Pu_Bank 2.7312e+03
## Pu_Assets 0.0000e+00
## Motorcycles 2.8650e-01
## Wheeled_tractor 1.3618e+03
## ************************Diagnostic information*************************
## Number of data points: 5570
## Effective number of parameters (2trace(S) - trace(S'S)): 116.6109
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5453.389
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 124177.5
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 124086.9
## Residual sum of squares: 1.519795e+12
## R-square value: 0.3398847
## Adjusted R-square value: 0.3257668
##
## ***********************************************************************
## Program stops at: 2020-05-31 18:26:57
(Screenshots of the computation and calibration are used for ease of knitting)
DM <- gw.dist(dp.locat = coordinates(point_res_sp))
bw.adaptive <- bw.gwr(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp, approach="CV", kernel="gaussian", adaptive=TRUE, longlat=FALSE, dMat=DM)
The result shows that the recommended data points to be used is 350.
gwr.adaptive <- gwr.basic(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp, bw=bw.adaptive, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)
gwr.adaptive
Comparing Fixed vs Adaptive Bandwidth:
The model using adaptive bandwidth has a higher adjusted R-square value than the model using the fixed bandwidth model.
Therefore, we will be using the model with adaptive bandwidth for visualisation purposes.
To visualise the fields in SDF, we need to first convert it into a ‘sf’ data frame.
brazil_adaptive_Sf <- st_as_sf(gwr.adaptive$SDF) %>%
st_transform(crs=4674)
gwr.adaptive.output <- as.data.frame(gwr.adaptive$SDF)
brazil_adaptive_sf <- cbind(point_res_sf, as.matrix(gwr.adaptive.output))
tm_shape(brazil_adaptive_sf) +
tm_fill(col = "Local_R2")
Analysis of Local R Square Values Map:
The local R square values ranging between 0-1 indicate how well the geographic weighted regression model fits the observed GDP per capita values. Very high local R square values suggest that the model is performing very well and vice versa.
The local R square values are very low towards the South of Brazil and very high towards North Eastern Brazil.
This means that our regression model accurately describes the North Eastern regions of Brazil best and is not very indicate of regions in the South.
Renaming Coefficients:
brazil_adaptive_sf <- brazil_adaptive_sf %>%
rename(coeff_IBGE_DU_RURAL = IBGE_DU_RURAL.1) %>%
rename(coeff_IBGE_PLANTED_AREA = IBGE_PLANTED_AREA.1) %>%
rename(coeff_IBGE_CROP_PRODUCTION = IBGE_CROP_PRODUCTION_..1) %>%
rename(coeff_IDHM = IDHM.1) %>%
rename(coeff_MUN_EXPENDIT = MUN_EXPENDIT.1) %>%
rename(coeff_Pu_Bank = Pu_Bank.1) %>%
rename(coeff_Pu_Assets = Pu_Assets.1) %>%
rename(coeff_Motorcycles = Motorcycles.1) %>%
rename(coeff_Wheeled_tractor = Wheeled_tractor.1)
Plotting Coefficients of Independent Variables:
Maps with negative coefficients will be plotted with the “RdBu” palette. maps with only positive coefficients will be plotted with the “Reds” palette and maps with only negative coefficients will be plotted with the “Blues” palette.
coeff_rural_map <- tm_shape(brazil_adaptive_sf)+
tm_fill("coeff_IBGE_DU_RURAL" , palette = "-RdBu")
coeff_plantedarea_map <- tm_shape(brazil_adaptive_sf)+
tm_fill("coeff_IBGE_PLANTED_AREA", palette = "-RdBu")
coeff_crops_map <- tm_shape(brazil_adaptive_sf)+
tm_fill("coeff_IBGE_CROP_PRODUCTION", palette = "-RdBu")
coeff_idhm_map <- tm_shape(brazil_adaptive_sf)+
tm_fill("coeff_IDHM", palette = "Reds")
coeff_mun_exp_map <- tm_shape(brazil_adaptive_sf)+
tm_fill("coeff_MUN_EXPENDIT", palette = "Reds")
coeff_pubank_map <- tm_shape(brazil_adaptive_sf)+
tm_fill("coeff_Pu_Bank", palette = "-RdBu")
coeff_puassets_map <- tm_shape(brazil_adaptive_sf)+
tm_fill("coeff_Pu_Assets", palette = "-RdBu")
coeff_motorcycles_map <- tm_shape(brazil_adaptive_sf)+
tm_fill("coeff_Motorcycles", palette = "Blues")
coeff_tractors_map <- tm_shape(brazil_adaptive_sf)+
tm_fill("coeff_Wheeled_tractor", palette = "-RdBu")
tmap_arrange(coeff_rural_map, coeff_plantedarea_map, coeff_crops_map, coeff_idhm_map, coeff_mun_exp_map, coeff_pubank_map, coeff_puassets_map, coeff_motorcycles_map, coeff_tractors_map, ncol=3, nrow = 3)
Analysis of Coefficients:
For the first plot of IBGE domestic rural units coefficients, most of the areas in Brazil are blue. This means that there is a negative relationship between GDP per Capita and the number of domestic rural units in a municipality. As the number of domestic rural units increases, the GDP per Capita of the municipality decreases. This is an expected observation as more rural units would mean that the area is less developed and urbanised hence the GDP per capita will be lower.
The second plot of IBGE planted area coefficients shows that most of the areas in Brazil are red and the east region has some blue areas. This means that there is a positive relationship between GDP per Capita and the planted area in a municipality. As the area of plantation increases, the GDP per Capita of the municipality increases. This is also an expected observation as a greater planted area would mean that there is higher crop production, leading to an increase in revenue and GDP, causing GDP per capita to increase.
The third plot of IBGE crop production coefficients shows that most of the areas in Brazil are slightly red and the south region is blue. This means that there is a general positive relationship between GDP per Capita and crop production in a municipality. As crop production increases, the GDP per Capita of the municipality increases slightly. The area in the South may be more urbanised than other areas hence an increased crop production may cause a fall in GDP per capita.
The fourth plot of the IDHM coefficients shows that all of the areas in Brazil are red. This means that there is strong positive relationship between GDP per Capita and IDHM value in each municipality. IDHM refers to the Human Development Index(HDI). As HDI increases, the GDP per Capita of the municipality increases significantly. This is an expected observation as the HDI index reflects the development of each municipality. With a higher HDI index, the municipality is deemed to be more developed which causes the GDP per capita of that municipality to be higher too.
The fifth plot of the Mun Expenditure coefficients shows that all of the areas in Brazil are red. This means that there is positive relationship between GDP per Capita and expenditure of municipalities. As the expenditure increases, the GDP per Capita of the municipality. This is true because expenditure is part of the calculation to obtain GDP, hence it will have a positive relationship with GDP per capita as well.
The sixth plot of public bank coefficients shows that most of the areas in Brazil are red and the central-south region is blue. This means that there is a general positive relationship between GDP per Capita and the number of public banks in a municipality. As the number of public banks increases, the GDP per Capita of the municipality increases slightly. This may be because a higher number of banks may lead to more investments being made and investments are a part of GDP calculation hence it affects the GDP per capita positively too.
The seventh plot of public bank assets shows that most of the areas in Brazil are slightly red and regions towards the east and south of Brazil are blue. This means that there is a general positive relationship between GDP per Capita and the public bank assets in a municipality. As the amount of assets increase, the GDP per Capita of the municipality increases slightly. Bank assets are considered to be the money supply of banks. With a higher money supply, more money will be loaned to investors, causing GDP to increase and hence this may cause the positive relationship with GDP per capita.
The eighth plot of the motorcycle coefficients shows that all of the areas in Brazil are blue. This means that there is negative relationship between GDP per Capita and the number of motorcycles in each municipality. As the number of motorcycle increases, the GDP per Capita of the municipality decreases. This may be because motorcycles are a sign of being less developed. More developed municipalities will have more people using cars instead of motorcycles. Hence, if there is a higher number of motorcycles, it may mean that the municipality is less developed, causing the GDP per capita to be lower.
The ninth plot of the number of wheeled tractors shows that most of the areas in Brazil are slightly red with blue regions towards the east. This means that there is general positive relationship between GDP per Capita and number of wheeled tractors in each municipality. As the number of wheeled tractors increases, the GDP per Capita of the municipality increases slightly. Tractors are used for farming purposes hence a higher number of tractors may mean that there is a higher amount of agricultural production. This may lead to increased revenue, causing the GDP per capita to increase.
Conclusion:
In this exercise, we analysed the factors affecting the GDP per capita of Brazil at the municipality level by using the data provided. From our analysis, we can tell that there is no clustering of GDP per capita. Amongst our 9 independent analysis variables, the HDI index seems to be the most important factor affecting GDP per capita of Brazil. This was supported by the very large geographic weighted regression coefficients.