packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse','geobr','heatmaply')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
We will be importing the municipalities of Brazil from the geobr package. Studying the vignettes, the function to obtain our required municipality data from 2016 is read_municipality.
However, re-running the code will involve pulling the municipality data over and over, therefore instead we will save the municipality data into a local file instead for faster loading in future uses.
The code chunk below will run once on my client side and will be commented out for use of the local file instead moving forward.
#Municipal2016 <- read_municipality(year=2016)
#saveRDS(Municipal2016, file = "Municipal2016.rds")
Municipal2016 <- readRDS(file = "Municipal2016.rds")
glimpse(Municipal2016)
## Rows: 5,572
## Columns: 5
## $ code_muni <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100064, ...
## $ name_muni <fct> Alta Floresta D'oeste, Ariquemes, Cabixi, Cacoal, Cere...
## $ code_state <fct> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11...
## $ abbrev_state <fct> RO, RO, RO, RO, RO, RO, RO, RO, RO, RO, RO, RO, RO, RO...
## $ geom <POLYGON [°]> POLYGON ((-62.19465 -11.827..., POLYGON ((-62....
5572 Municipalities are observed in our Simple Feature Dataframe.
st_crs(Municipal2016)
## Coordinate Reference System:
## User input: SIRGAS 2000
## wkt:
## GEOGCRS["SIRGAS 2000",
## DATUM["Sistema de Referencia Geocentrico para las AmericaS 2000",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["unknown"],
## AREA["Latin America - SIRGAS 2000 by country"],
## BBOX[-59.87,-122.19,32.72,-25.28]],
## ID["EPSG",4674]]
We observe that our current projection is in EPSG:4674
Next we will check for invalid polygons in Municipal2016
unique(st_is_valid(Municipal2016))
## [1] TRUE FALSE
It is observed that there are invalid polygons that we will use the st_make_valid() function to fix. st_make_Valid attempts to repair invalidities without only minimal alterations to the input geometries. No vertices are dropped or moved, the structure of the object is simply re-arranged.
Municipal2016 <- st_make_valid(Municipal2016)
Now we’ll recheck to ensure no invalid polygons remain
unique(st_is_valid(Municipal2016))
## [1] TRUE
No invalid polygons remain!
Now we’ll check for any NA data in Municipal2016
any(is.na(Municipal2016))
## [1] FALSE
No NA data is observed in Municipal2016.
# Remove plot axis
no_axis <- theme(axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
# Plot all Brazilian municipalities
ggplot() +
geom_sf(data=Municipal2016, fill="#FFDF00", color="#009C3B", size=.15, show.legend = FALSE) +
labs(subtitle="Brazil Municipalities, 2016", size=8) +
theme_minimal() +
no_axis
We will proceed with importing our provided Aspatial Data.
We will be importing both BRAZIL_CITIES.csv and Data_Dictionary.csv
Further investigations into the csv files reveal that columns are delimited by “;”. Therefore we will be using read_delim instead of read_csv for importing the Aspatial Data. Note that read_csv2 is also an alternative here.
Brazil_Cities<- read_delim ("data/aspatial/BRAZIL_CITIES.csv", ";")
## Parsed with column specification:
## cols(
## .default = col_double(),
## CITY = col_character(),
## STATE = col_character(),
## AREA = col_number(),
## REGIAO_TUR = col_character(),
## CATEGORIA_TUR = col_character(),
## RURAL_URBAN = col_character(),
## GVA_MAIN = col_character()
## )
## See spec(...) for full column specifications.
Data_Dict <- read_delim("data/aspatial/Data_Dictionary.csv", ";")
## Warning: Missing column names filled in: 'X6' [6]
## Parsed with column specification:
## cols(
## FIELD = col_character(),
## DESCRIPTION = col_character(),
## REFERENCE = col_character(),
## UNIT = col_character(),
## SOURCE = col_character(),
## X6 = col_character()
## )
Next we will take a look at our imported Brazil_Cities aspatial data with summary.
summary(Brazil_Cities)
## CITY STATE CAPITAL IBGE_RES_POP
## Length:5573 Length:5573 Min. :0.000000 Min. : 805
## Class :character Class :character 1st Qu.:0.000000 1st Qu.: 5235
## Mode :character Mode :character Median :0.000000 Median : 10934
## Mean :0.004845 Mean : 34278
## 3rd Qu.:0.000000 3rd Qu.: 23424
## Max. :1.000000 Max. :11253503
## NA's :8
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN
## Min. : 805 Min. : 0.0 Min. : 239 Min. : 60
## 1st Qu.: 5230 1st Qu.: 0.0 1st Qu.: 1572 1st Qu.: 874
## Median : 10926 Median : 0.0 Median : 3174 Median : 1846
## Mean : 34200 Mean : 77.5 Mean : 10303 Mean : 8859
## 3rd Qu.: 23390 3rd Qu.: 10.0 3rd Qu.: 6726 3rd Qu.: 4624
## Max. :11133776 Max. :119727.0 Max. :3576148 Max. :3548433
## NA's :8 NA's :8 NA's :10 NA's :10
## IBGE_DU_RURAL IBGE_POP IBGE_1 IBGE_1-4
## Min. : 3 Min. : 174 Min. : 0.0 Min. : 5
## 1st Qu.: 487 1st Qu.: 2801 1st Qu.: 38.0 1st Qu.: 158
## Median : 931 Median : 6170 Median : 92.0 Median : 376
## Mean : 1463 Mean : 27595 Mean : 383.3 Mean : 1544
## 3rd Qu.: 1832 3rd Qu.: 15302 3rd Qu.: 232.0 3rd Qu.: 951
## Max. :33809 Max. :10463636 Max. :129464.0 Max. :514794
## NA's :81 NA's :8 NA's :8 NA's :8
## IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## Min. : 7 Min. : 12 Min. : 94 Min. : 29
## 1st Qu.: 220 1st Qu.: 259 1st Qu.: 1734 1st Qu.: 341
## Median : 516 Median : 588 Median : 3841 Median : 722
## Mean : 2069 Mean : 2381 Mean : 18212 Mean : 3004
## 3rd Qu.: 1300 3rd Qu.: 1478 3rd Qu.: 9628 3rd Qu.: 1724
## Max. :684443 Max. :783702 Max. :7058221 Max. :1293012
## NA's :8 NA's :8 NA's :8 NA's :8
## IBGE_PLANTED_AREA IBGE_CROP_PRODUCTION_$ IDHM Ranking 2010 IDHM
## Min. : 0.0 Min. : 0 Min. : 1 Min. :0.4180
## 1st Qu.: 910.2 1st Qu.: 2326 1st Qu.:1392 1st Qu.:0.5990
## Median : 3471.5 Median : 13846 Median :2783 Median :0.6650
## Mean : 14179.9 Mean : 57384 Mean :2783 Mean :0.6592
## 3rd Qu.: 11194.2 3rd Qu.: 55619 3rd Qu.:4174 3rd Qu.:0.7180
## Max. :1205669.0 Max. :3274885 Max. :5565 Max. :0.8620
## NA's :3 NA's :3 NA's :8 NA's :8
## IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG
## Min. :0.4000 Min. :0.6720 Min. :0.2070 Min. :-72.92
## 1st Qu.:0.5720 1st Qu.:0.7690 1st Qu.:0.4900 1st Qu.:-50.87
## Median :0.6540 Median :0.8080 Median :0.5600 Median :-46.52
## Mean :0.6429 Mean :0.8016 Mean :0.5591 Mean :-46.23
## 3rd Qu.:0.7070 3rd Qu.:0.8360 3rd Qu.:0.6310 3rd Qu.:-41.40
## Max. :0.8910 Max. :0.8940 Max. :0.8250 Max. :-32.44
## NA's :8 NA's :8 NA's :8 NA's :9
## LAT ALT PAY_TV FIXED_PHONES
## Min. :-33.688 Min. : 0.0 Min. : 1 Min. : 3
## 1st Qu.:-22.838 1st Qu.: 169.8 1st Qu.: 88 1st Qu.: 119
## Median :-18.089 Median : 406.5 Median : 247 Median : 327
## Mean :-16.444 Mean : 893.8 Mean : 3094 Mean : 6567
## 3rd Qu.: -8.489 3rd Qu.: 628.9 3rd Qu.: 815 3rd Qu.: 1151
## Max. : 4.585 Max. :874579.0 Max. :2047668 Max. :5543127
## NA's :9 NA's :9 NA's :3 NA's :3
## AREA REGIAO_TUR CATEGORIA_TUR ESTIMATED_POP
## Min. : 3.57 Length:5573 Length:5573 Min. : 786
## 1st Qu.: 204.44 Class :character Class :character 1st Qu.: 5454
## Median : 416.59 Mode :character Mode :character Median : 11590
## Mean : 1517.44 Mean : 37432
## 3rd Qu.: 1026.57 3rd Qu.: 25296
## Max. :159533.33 Max. :12176866
## NA's :3 NA's :3
## RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## Length:5573 Min. : 0 Min. : 1 Min. : 2
## Class :character 1st Qu.: 4189 1st Qu.: 1726 1st Qu.: 10112
## Mode :character Median : 20426 Median : 7424 Median : 31211
## Mean : 47271 Mean : 175928 Mean : 489451
## 3rd Qu.: 51227 3rd Qu.: 41022 3rd Qu.: 115406
## Max. :1402282 Max. :63306755 Max. :464656988
## NA's :3 NA's :3 NA's :3
## GVA_PUBLIC GVA_TOTAL TAXES GDP
## Min. : 7 Min. : 17 Min. : -14159 Min. : 15
## 1st Qu.: 17267 1st Qu.: 42253 1st Qu.: 1305 1st Qu.: 43709
## Median : 35866 Median : 119492 Median : 5100 Median : 125153
## Mean : 123768 Mean : 832987 Mean : 118864 Mean : 954584
## 3rd Qu.: 89245 3rd Qu.: 313963 3rd Qu.: 22197 3rd Qu.: 329539
## Max. :41902893 Max. :569910503 Max. :117125387 Max. :687035890
## NA's :3 NA's :3 NA's :3 NA's :3
## POP_GDP GDP_CAPITA GVA_MAIN MUN_EXPENDIT
## Min. : 815 Min. : 3191 Length:5573 Min. :1.421e+06
## 1st Qu.: 5483 1st Qu.: 9058 Class :character 1st Qu.:1.573e+07
## Median : 11578 Median : 15870 Mode :character Median :2.746e+07
## Mean : 36998 Mean : 21126 Mean :1.043e+08
## 3rd Qu.: 25085 3rd Qu.: 26155 3rd Qu.:5.666e+07
## Max. :12038175 Max. :314638 Max. :4.577e+10
## NA's :3 NA's :3 NA's :1492
## COMP_TOT COMP_A COMP_B COMP_C
## Min. : 6.0 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 68.0 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 3.00
## Median : 162.0 Median : 2.00 Median : 0.000 Median : 11.00
## Mean : 906.8 Mean : 18.25 Mean : 1.852 Mean : 73.44
## 3rd Qu.: 448.0 3rd Qu.: 8.00 3rd Qu.: 2.000 3rd Qu.: 39.00
## Max. :530446.0 Max. :1948.00 Max. :274.000 Max. :31566.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_D COMP_E COMP_F COMP_G
## Min. : 0.0000 Min. : 0.000 Min. : 0.00 Min. : 1.0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 32.0
## Median : 0.0000 Median : 0.000 Median : 4.00 Median : 74.5
## Mean : 0.4262 Mean : 2.029 Mean : 43.26 Mean : 348.0
## 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 15.00 3rd Qu.: 199.0
## Max. :332.0000 Max. :657.000 Max. :25222.00 Max. :150633.0
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_H COMP_I COMP_J COMP_K
## Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 7 Median : 7.00 Median : 1.00 Median : 0.00
## Mean : 41 Mean : 55.88 Mean : 24.74 Mean : 15.55
## 3rd Qu.: 25 3rd Qu.: 24.00 3rd Qu.: 5.00 3rd Qu.: 2.00
## Max. :19515 Max. :29290.00 Max. :38720.00 Max. :23738.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_L COMP_M COMP_N COMP_O
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 0.00 Median : 4.00 Median : 4.0 Median : 2.000
## Mean : 15.14 Mean : 51.29 Mean : 83.7 Mean : 3.269
## 3rd Qu.: 3.00 3rd Qu.: 13.00 3rd Qu.: 14.0 3rd Qu.: 3.000
## Max. :14003.00 Max. :49181.00 Max. :76757.0 Max. :204.000
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_P COMP_Q COMP_R COMP_S
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 5.00
## Median : 6.00 Median : 3.00 Median : 2.00 Median : 12.00
## Mean : 30.96 Mean : 34.15 Mean : 12.18 Mean : 51.61
## 3rd Qu.: 17.00 3rd Qu.: 12.00 3rd Qu.: 6.00 3rd Qu.: 31.00
## Max. :16030.00 Max. :22248.00 Max. :6687.00 Max. :24832.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_T COMP_U HOTELS BEDS
## Min. :0 Min. : 0.00000 Min. : 1.000 Min. : 2.0
## 1st Qu.:0 1st Qu.: 0.00000 1st Qu.: 1.000 1st Qu.: 40.0
## Median :0 Median : 0.00000 Median : 1.000 Median : 82.0
## Mean :0 Mean : 0.05027 Mean : 3.131 Mean : 257.5
## 3rd Qu.:0 3rd Qu.: 0.00000 3rd Qu.: 3.000 3rd Qu.: 200.0
## Max. :0 Max. :123.00000 Max. :97.000 Max. :13247.0
## NA's :3 NA's :3 NA's :4686 NA's :4686
## Pr_Agencies Pu_Agencies Pr_Bank Pu_Bank
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.00
## 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:1.00
## Median : 1.000 Median : 2.000 Median : 1.000 Median :2.00
## Mean : 3.383 Mean : 2.829 Mean : 1.312 Mean :1.58
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.:2.00
## Max. :1693.000 Max. :626.000 Max. :83.000 Max. :8.00
## NA's :2231 NA's :2231 NA's :2231 NA's :2231
## Pr_Assets Pu_Assets Cars Motorcycles
## Min. :0.000e+00 Min. :0.000e+00 Min. : 2 Min. : 4
## 1st Qu.:0.000e+00 1st Qu.:4.047e+07 1st Qu.: 602 1st Qu.: 591
## Median :3.231e+07 Median :1.339e+08 Median : 1438 Median : 1285
## Mean :9.180e+09 Mean :6.005e+09 Mean : 9859 Mean : 4879
## 3rd Qu.:1.148e+08 3rd Qu.:4.970e+08 3rd Qu.: 4086 3rd Qu.: 3294
## Max. :1.947e+13 Max. :8.016e+12 Max. :5740995 Max. :1134570
## NA's :2231 NA's :2231 NA's :11 NA's :11
## Wheeled_tractor UBER MAC WAL-MART
## Min. : 0.000 Min. :1 Min. : 1.000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:1 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 0.000 Median :1 Median : 2.000 Median : 1.000
## Mean : 5.754 Mean :1 Mean : 4.277 Mean : 2.059
## 3rd Qu.: 1.000 3rd Qu.:1 3rd Qu.: 3.000 3rd Qu.: 1.750
## Max. :3236.000 Max. :1 Max. :130.000 Max. :26.000
## NA's :11 NA's :5448 NA's :5407 NA's :5471
## POST_OFFICES
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 2.081
## 3rd Qu.: 2.000
## Max. :225.000
## NA's :120
Many NA values are observed in our data, most concerningly in our LONG & LAT Columns which are required for transformation to geospatial data. We will investigate it further
Brazil_Cities[is.na(Brazil_Cities$LAT),]
## # A tibble: 9 x 81
## CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baln~ SC 0 NA NA NA NA
## 2 Lago~ RS 0 NA NA NA NA
## 3 Moju~ PA 0 NA NA NA NA
## 4 Para~ MS 0 NA NA NA NA
## 5 Pesc~ SC 0 NA NA NA NA
## 6 Pinh~ RS 0 2130 2130 0 745
## 7 Pint~ RS 0 NA NA NA NA
## 8 Sant~ BA 0 9648 9648 0 2891
## 9 São ~ PE 0 NA NA NA NA
## # ... with 74 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## # IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## # `IBGE_10-14` <dbl>, `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>,
## # IBGE_PLANTED_AREA <dbl>, `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking
## # 2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## # IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, PAY_TV <dbl>,
## # FIXED_PHONES <dbl>, AREA <dbl>, REGIAO_TUR <chr>, CATEGORIA_TUR <chr>,
## # ESTIMATED_POP <dbl>, RURAL_URBAN <chr>, GVA_AGROPEC <dbl>,
## # GVA_INDUSTRY <dbl>, GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL
## # ` <dbl>, TAXES <dbl>, GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>,
## # GVA_MAIN <chr>, MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>,
## # COMP_B <dbl>, COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>,
## # COMP_G <dbl>, COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>,
## # COMP_L <dbl>, COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>,
## # COMP_Q <dbl>, COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>,
## # HOTELS <dbl>, BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>,
## # Pr_Bank <dbl>, Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## # Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## # `WAL-MART` <dbl>, POST_OFFICES <dbl>
Brazil_Cities[is.na(Brazil_Cities$LONG),]
## # A tibble: 9 x 81
## CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baln~ SC 0 NA NA NA NA
## 2 Lago~ RS 0 NA NA NA NA
## 3 Moju~ PA 0 NA NA NA NA
## 4 Para~ MS 0 NA NA NA NA
## 5 Pesc~ SC 0 NA NA NA NA
## 6 Pinh~ RS 0 2130 2130 0 745
## 7 Pint~ RS 0 NA NA NA NA
## 8 Sant~ BA 0 9648 9648 0 2891
## 9 São ~ PE 0 NA NA NA NA
## # ... with 74 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## # IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## # `IBGE_10-14` <dbl>, `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>,
## # IBGE_PLANTED_AREA <dbl>, `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking
## # 2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## # IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, PAY_TV <dbl>,
## # FIXED_PHONES <dbl>, AREA <dbl>, REGIAO_TUR <chr>, CATEGORIA_TUR <chr>,
## # ESTIMATED_POP <dbl>, RURAL_URBAN <chr>, GVA_AGROPEC <dbl>,
## # GVA_INDUSTRY <dbl>, GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL
## # ` <dbl>, TAXES <dbl>, GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>,
## # GVA_MAIN <chr>, MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>,
## # COMP_B <dbl>, COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>,
## # COMP_G <dbl>, COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>,
## # COMP_L <dbl>, COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>,
## # COMP_Q <dbl>, COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>,
## # HOTELS <dbl>, BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>,
## # Pr_Bank <dbl>, Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## # Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## # `WAL-MART` <dbl>, POST_OFFICES <dbl>
It is observed that NA rows in both LAT and LONG are the same. For these missing LAT LONG data, we will reference https://pt.db-city.com/ and https://latitude.to/ to find our missing data.
Our following reference in order of the cities listed above are: https://pt.db-city.com/Brasil--Santa-Catarina--Balne%C3%A1rio-Rinc%C3%A3o https://latitude.to/articles-by-country/br/brazil/69447/lagoa-dos-patos https://en.db-city.com/Brazil--Par%C3%A1--Moju%C3%AD-dos-Campos https://pt.db-city.com/Brasil--Mato-Grosso-do-Sul--Para%C3%ADso-das-%C3%81guas https://pt.db-city.com/Brasil--Santa-Catarina--Pescaria-Brava https://pt.db-city.com/Brasil--Rio-Grande-do-Sul--Pinhal-da-Serra https://pt.db-city.com/Brasil--Rio-Grande-do-Sul--Pinto-Bandeira https://pt.db-city.com/Brasil--Bahia--Santa-Teresinha https://pt.db-city.com/Brasil--S%C3%A3o-Paulo--S%C3%A3o-Caetano-do-Sul
We will append these data where our NA rows are. Note that Lagoa Dos Patos and Santa Terezinha have duplicate names in different states hence we will need to specify state.
Brazil_Cities$LAT[Brazil_Cities$CITY == "Balneário Rincão"] <- -28.8344
Brazil_Cities$LONG[Brazil_Cities$CITY == "Balneário Rincão"] <- -49.2361
Brazil_Cities$LAT[Brazil_Cities$CITY == "Lagoa Dos Patos" & Brazil_Cities$STATE == "RS"] <- -31.0697
Brazil_Cities$LONG[Brazil_Cities$CITY == "Lagoa Dos Patos" & Brazil_Cities$STATE == "RS"] <- --51.4725
Brazil_Cities$LAT[Brazil_Cities$CITY == "Mojuí Dos Campos"] <- -2.68472
Brazil_Cities$LONG[Brazil_Cities$CITY == "Mojuí Dos Campos"] <- -54.6431
Brazil_Cities$LAT[Brazil_Cities$CITY == "Paraíso Das Águas"] <- -19.0257
Brazil_Cities$LONG[Brazil_Cities$CITY == "Paraíso Das Águas"] <- -53.0102
Brazil_Cities$LAT[Brazil_Cities$CITY == "Pescaria Brava"] <- -28.4247
Brazil_Cities$LONG[Brazil_Cities$CITY == "Pescaria Brava"] <- -48.8956
Brazil_Cities$LAT[Brazil_Cities$CITY == "Pinhal Da Serra"] <- -27.8747
Brazil_Cities$LONG[Brazil_Cities$CITY == "Pinhal Da Serra"] <- -51.1733
Brazil_Cities$LAT[Brazil_Cities$CITY == "Pinto Bandeira"] <- -29.0978
Brazil_Cities$LONG[Brazil_Cities$CITY == "Pinto Bandeira"] <- -51.4503
Brazil_Cities$LAT[Brazil_Cities$CITY == "Santa Terezinha" & Brazil_Cities$STATE == "BA"] <- -12.7498
Brazil_Cities$LONG[Brazil_Cities$CITY == "Santa Terezinha" & Brazil_Cities$STATE == "BA"] <- -39.5184
Brazil_Cities$LAT[Brazil_Cities$CITY == "São Caetano"] <- -8.33
Brazil_Cities$LONG[Brazil_Cities$CITY == "São Caetano"] <- -36.1459
Further investigation reveals that Lagoa Dos Patos in the State RS is a lagoon and not a municipal, it will be removed.
Brazil_Cities <- Brazil_Cities[!(Brazil_Cities$CITY == "Lagoa Dos Patos" & Brazil_Cities$STATE == "RS" ),]
Brazil_Cities[is.na(Brazil_Cities$LAT),]
## # A tibble: 0 x 81
## # ... with 81 variables: CITY <chr>, STATE <chr>, CAPITAL <dbl>,
## # IBGE_RES_POP <dbl>, IBGE_RES_POP_BRAS <dbl>, IBGE_RES_POP_ESTR <dbl>,
## # IBGE_DU <dbl>, IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>, IBGE_POP <dbl>,
## # IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>, `IBGE_10-14` <dbl>,
## # `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>, IBGE_PLANTED_AREA <dbl>,
## # `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking 2010` <dbl>, IDHM <dbl>,
## # IDHM_Renda <dbl>, IDHM_Longevidade <dbl>, IDHM_Educacao <dbl>, LONG <dbl>,
## # LAT <dbl>, ALT <dbl>, PAY_TV <dbl>, FIXED_PHONES <dbl>, AREA <dbl>,
## # REGIAO_TUR <chr>, CATEGORIA_TUR <chr>, ESTIMATED_POP <dbl>,
## # RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## # GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## # GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## # MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## # COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## # COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## # COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## # COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, HOTELS <dbl>,
## # BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>, Pr_Bank <dbl>,
## # Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## # Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## # `WAL-MART` <dbl>, POST_OFFICES <dbl>
Brazil_Cities[is.na(Brazil_Cities$LONG),]
## # A tibble: 0 x 81
## # ... with 81 variables: CITY <chr>, STATE <chr>, CAPITAL <dbl>,
## # IBGE_RES_POP <dbl>, IBGE_RES_POP_BRAS <dbl>, IBGE_RES_POP_ESTR <dbl>,
## # IBGE_DU <dbl>, IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>, IBGE_POP <dbl>,
## # IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>, `IBGE_10-14` <dbl>,
## # `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>, IBGE_PLANTED_AREA <dbl>,
## # `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking 2010` <dbl>, IDHM <dbl>,
## # IDHM_Renda <dbl>, IDHM_Longevidade <dbl>, IDHM_Educacao <dbl>, LONG <dbl>,
## # LAT <dbl>, ALT <dbl>, PAY_TV <dbl>, FIXED_PHONES <dbl>, AREA <dbl>,
## # REGIAO_TUR <chr>, CATEGORIA_TUR <chr>, ESTIMATED_POP <dbl>,
## # RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## # GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## # GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## # MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## # COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## # COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## # COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## # COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, HOTELS <dbl>,
## # BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>, Pr_Bank <dbl>,
## # Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## # Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## # `WAL-MART` <dbl>, POST_OFFICES <dbl>
We observe no more NA data within our LAT LONG columns.
We will replace the remaining NA values with 0 to represent the absent data
We will then look at our imported Brazil_Cities aspatial data with summary.
glimpse(Data_Dict)
## Rows: 96
## Columns: 6
## $ FIELD <chr> "CITY", "STATE", "CAPITAL", "IBGE_RES_POP", "IBGE_RES_P...
## $ DESCRIPTION <chr> "Name of the City", "Name of the State", "1 if Capital ...
## $ REFERENCE <chr> NA, NA, NA, "2010", "2010", "2010", "2010", "2010", "20...
## $ UNIT <chr> NA, NA, NA, "-", "-", "-", "-", "-", "-", "-", "-", "-"...
## $ SOURCE <chr> "-", "-", "-", "https://sidra.ibge.gov.br/tabela/1497",...
## $ X6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
It appears that Data_Dict has a description column where each rows explain the columns in Brazil_Cities that are vague such as IBGE_DU etc. However we also observe 96 rows where our Brazil_Cities only has 81. We will perform further investigation to find out what the extra rows may be describing.
tail(Data_Dict)
## # A tibble: 6 x 6
## FIELD DESCRIPTION REFERENCE UNIT SOURCE X6
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> <NA> <NA> <NA>
Taking a look at the extra rows, we find that they are rows purely of NA. We will remove these rows.
Data_Dict <- Data_Dict[!(is.na(Data_Dict$FIELD)),]
nrow(Data_Dict)
## [1] 81
We now see that the number of remaining rows in Data_Dict correspond to the number of columns we have in Brazil_Cities.
Data_Dict also provides us the reference year of the data, allowing us to exlude data that is past 2016 from our study; such as HOTELS and Cars etc. However our population data is also old at 2010, we will accept this however as we do not expect the municipality populations to have drastic changes over the 6 years as there were no major disease outbreaks, save for a couple minor outbreaks, one in 2015 - 2016 involving newborns and microcephaly or in 2016 with Guillain-Barré syndrome, these were likely residual effects from a zika outbreak earlier in October 2015. No wars were recorded in Brazil during the period either.
https://www.who.int/csr/don/21-october-2015-zika/en/ https://www.who.int/csr/don/15-december-2015-microcephaly-brazil/en/ https://www.who.int/csr/don/8-february-2016-gbs-brazil/en/
As mentioned, columns with reference years > 2016 will not be considered. Thus we will filter them out .
Brazil_Cities_Filt <- subset(Brazil_Cities, select = c(1:16, 19:26,29,33:66))
From the list of columns in the codechunk below, we will select our independent variables.
colnames(Brazil_Cities_Filt)
## [1] "CITY" "STATE" "CAPITAL"
## [4] "IBGE_RES_POP" "IBGE_RES_POP_BRAS" "IBGE_RES_POP_ESTR"
## [7] "IBGE_DU" "IBGE_DU_URBAN" "IBGE_DU_RURAL"
## [10] "IBGE_POP" "IBGE_1" "IBGE_1-4"
## [13] "IBGE_5-9" "IBGE_10-14" "IBGE_15-59"
## [16] "IBGE_60+" "IDHM Ranking 2010" "IDHM"
## [19] "IDHM_Renda" "IDHM_Longevidade" "IDHM_Educacao"
## [22] "LONG" "LAT" "ALT"
## [25] "AREA" "RURAL_URBAN" "GVA_AGROPEC"
## [28] "GVA_INDUSTRY" "GVA_SERVICES" "GVA_PUBLIC"
## [31] " GVA_TOTAL " "TAXES" "GDP"
## [34] "POP_GDP" "GDP_CAPITA" "GVA_MAIN"
## [37] "MUN_EXPENDIT" "COMP_TOT" "COMP_A"
## [40] "COMP_B" "COMP_C" "COMP_D"
## [43] "COMP_E" "COMP_F" "COMP_G"
## [46] "COMP_H" "COMP_I" "COMP_J"
## [49] "COMP_K" "COMP_L" "COMP_M"
## [52] "COMP_N" "COMP_O" "COMP_P"
## [55] "COMP_Q" "COMP_R" "COMP_S"
## [58] "COMP_T" "COMP_U"
As we know, our dependent variable is GDP_CAPITA.
The independent variable choices for our study will be:
For selection 1, this is the working populace, who consume products and services, earn an income and pay taxes, hence contributing most to the economy. However, higher populated municipals would likely have a naturally higher economically active populace. Thus, we will need to scale it.
It is concluded that GDP and HDI have a strong positive relationship (https://www.researchgate.net/publication/324803815_RELATIONSHIP_BETWEEN_GROSS_DOMESTIC_PRODUCT_AND_HUMAN_DEVELOPMENT_INDEX), therefore we will be selecting the various IDHM indexes in our study.
For selections 2,3 and 4, it is chosen over IDHM itself because IDHM is a summation of all 3 categories, and selecting the individual categories will allow for a finer view of how each component may contribute more to GDP per capita rather than the entire IDHM index itself.
It is also known that GVA or Gross Value Added is used in the estimation of GDP (https://webarchive.nationalarchives.gov.uk/20160107002005/http://www.ons.gov.uk/ons/guide-method/method-quality/specific/economy/national-accounts/gva/index.html). Hence, we will add it to our study.
Once again instead of selecting the summation, we will select the individual attributes instead.
Taxes also form part of the GDP in the Income Approach of calculating GDP, therefore we will add it to our study as well. The numbers may be inflated for more populated cities and will need to be scaled.
On the topic of more populated cities, the higher the population count would likely have higher GDP as the economically active counts will likely be higher relative to less populated cities, meaning more businesses, more consumption and hence, overall higher GDP. However this is my theory, we will have to find out how this truly affects the overall GDP.
Population Density
Taxes
However for taxes, rather than looking at the raw number, it is perhaps better to view it as a % contribution of the overall GDP as it would show the measure of a city’s tax revenue relative to the size of its economy. This is also known as Tax-to-GDP Ratio. (https://www.investopedia.com/terms/t/tax-to-gdp-ratio.asp)
It would be interesting to see if the municipal being in a capital state would have a bearing on the overall municipal’s GDP.
MUN_EXPENDIT would have been considered as well, as government spending does contribute to the GDP, however the vast number of missing values to remove or impute would likely affect our study negatively and thus will not be considered.
It appears IBGE_15-59 gives some error when trying to use the column. We will have to rename it. The column will now be E_Active
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% rename("E_Active" = `IBGE_15-59`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`E_ACTIVE_SC` = `E_Active`/`IBGE_POP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`GVA_AGROPEC_SC` = `GVA_AGROPEC`/`POP_GDP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`GVA_INDUSTRY_SC` = `GVA_INDUSTRY`/`POP_GDP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`GVA_SERVICES_SC` = `GVA_SERVICES`/`POP_GDP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`GVA_PUBLIC_SC` = `GVA_SERVICES`/`POP_GDP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`POP_DENS` = `POP_GDP`/`AREA`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`TTG_RATIO` = `TAXES`/`GDP`)
summary(Brazil_Cities_Filt)
## CITY STATE CAPITAL IBGE_RES_POP
## Length:5572 Length:5572 Min. :0.000000 Min. : 805
## Class :character Class :character 1st Qu.:0.000000 1st Qu.: 5235
## Mode :character Mode :character Median :0.000000 Median : 10934
## Mean :0.004846 Mean : 34278
## 3rd Qu.:0.000000 3rd Qu.: 23424
## Max. :1.000000 Max. :11253503
## NA's :7
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN
## Min. : 805 Min. : 0.0 Min. : 239 Min. : 60
## 1st Qu.: 5230 1st Qu.: 0.0 1st Qu.: 1572 1st Qu.: 874
## Median : 10926 Median : 0.0 Median : 3174 Median : 1846
## Mean : 34200 Mean : 77.5 Mean : 10303 Mean : 8859
## 3rd Qu.: 23390 3rd Qu.: 10.0 3rd Qu.: 6726 3rd Qu.: 4624
## Max. :11133776 Max. :119727.0 Max. :3576148 Max. :3548433
## NA's :7 NA's :7 NA's :9 NA's :9
## IBGE_DU_RURAL IBGE_POP IBGE_1 IBGE_1-4
## Min. : 3 Min. : 174 Min. : 0.0 Min. : 5
## 1st Qu.: 487 1st Qu.: 2801 1st Qu.: 38.0 1st Qu.: 158
## Median : 931 Median : 6170 Median : 92.0 Median : 376
## Mean : 1463 Mean : 27595 Mean : 383.3 Mean : 1544
## 3rd Qu.: 1832 3rd Qu.: 15302 3rd Qu.: 232.0 3rd Qu.: 951
## Max. :33809 Max. :10463636 Max. :129464.0 Max. :514794
## NA's :80 NA's :7 NA's :7 NA's :7
## IBGE_5-9 IBGE_10-14 E_Active IBGE_60+
## Min. : 7 Min. : 12 Min. : 94 Min. : 29
## 1st Qu.: 220 1st Qu.: 259 1st Qu.: 1734 1st Qu.: 341
## Median : 516 Median : 588 Median : 3841 Median : 722
## Mean : 2069 Mean : 2381 Mean : 18212 Mean : 3004
## 3rd Qu.: 1300 3rd Qu.: 1478 3rd Qu.: 9628 3rd Qu.: 1724
## Max. :684443 Max. :783702 Max. :7058221 Max. :1293012
## NA's :7 NA's :7 NA's :7 NA's :7
## IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade
## Min. : 1 Min. :0.4180 Min. :0.4000 Min. :0.6720
## 1st Qu.:1392 1st Qu.:0.5990 1st Qu.:0.5720 1st Qu.:0.7690
## Median :2783 Median :0.6650 Median :0.6540 Median :0.8080
## Mean :2783 Mean :0.6592 Mean :0.6429 Mean :0.8016
## 3rd Qu.:4174 3rd Qu.:0.7180 3rd Qu.:0.7070 3rd Qu.:0.8360
## Max. :5565 Max. :0.8620 Max. :0.8910 Max. :0.8940
## NA's :7 NA's :7 NA's :7 NA's :7
## IDHM_Educacao LONG LAT ALT
## Min. :0.2070 Min. :-72.92 Min. :-33.688 Min. : 0.0
## 1st Qu.:0.4900 1st Qu.:-50.87 1st Qu.:-22.841 1st Qu.: 169.8
## Median :0.5600 Median :-46.52 Median :-18.091 Median : 406.5
## Mean :0.5591 Mean :-46.23 Mean :-16.449 Mean : 893.8
## 3rd Qu.:0.6310 3rd Qu.:-41.40 3rd Qu.: -8.489 3rd Qu.: 628.9
## Max. :0.8250 Max. :-32.44 Max. : 4.585 Max. :874579.0
## NA's :7 NA's :8
## AREA RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY
## Min. : 3.57 Length:5572 Min. : 0 Min. : 1
## 1st Qu.: 204.43 Class :character 1st Qu.: 4189 1st Qu.: 1726
## Median : 415.92 Mode :character Median : 20426 Median : 7424
## Mean : 1515.89 Mean : 47271 Mean : 175928
## 3rd Qu.: 1026.38 3rd Qu.: 51227 3rd Qu.: 41022
## Max. :159533.33 Max. :1402282 Max. :63306755
## NA's :3 NA's :2 NA's :2
## GVA_SERVICES GVA_PUBLIC GVA_TOTAL TAXES
## Min. : 2 Min. : 7 Min. : 17 Min. : -14159
## 1st Qu.: 10112 1st Qu.: 17267 1st Qu.: 42253 1st Qu.: 1305
## Median : 31211 Median : 35866 Median : 119492 Median : 5100
## Mean : 489451 Mean : 123768 Mean : 832987 Mean : 118864
## 3rd Qu.: 115406 3rd Qu.: 89245 3rd Qu.: 313963 3rd Qu.: 22197
## Max. :464656988 Max. :41902893 Max. :569910503 Max. :117125387
## NA's :2 NA's :2 NA's :2 NA's :2
## GDP POP_GDP GDP_CAPITA GVA_MAIN
## Min. : 15 Min. : 815 Min. : 3191 Length:5572
## 1st Qu.: 43709 1st Qu.: 5483 1st Qu.: 9058 Class :character
## Median : 125153 Median : 11578 Median : 15870 Mode :character
## Mean : 954584 Mean : 36998 Mean : 21126
## 3rd Qu.: 329539 3rd Qu.: 25085 3rd Qu.: 26155
## Max. :687035890 Max. :12038175 Max. :314638
## NA's :2 NA's :2 NA's :2
## MUN_EXPENDIT COMP_TOT COMP_A COMP_B
## Min. :1.421e+06 Min. : 6.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:1.573e+07 1st Qu.: 68.0 1st Qu.: 1.00 1st Qu.: 0.000
## Median :2.746e+07 Median : 162.0 Median : 2.00 Median : 0.000
## Mean :1.043e+08 Mean : 906.8 Mean : 18.25 Mean : 1.852
## 3rd Qu.:5.666e+07 3rd Qu.: 448.0 3rd Qu.: 8.00 3rd Qu.: 2.000
## Max. :4.577e+10 Max. :530446.0 Max. :1948.00 Max. :274.000
## NA's :1491 NA's :2 NA's :2 NA's :2
## COMP_C COMP_D COMP_E COMP_F
## Min. : 0.00 Min. : 0.0000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.00 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 1.00
## Median : 11.00 Median : 0.0000 Median : 0.000 Median : 4.00
## Mean : 73.44 Mean : 0.4262 Mean : 2.029 Mean : 43.26
## 3rd Qu.: 39.00 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 15.00
## Max. :31566.00 Max. :332.0000 Max. :657.000 Max. :25222.00
## NA's :2 NA's :2 NA's :2 NA's :2
## COMP_G COMP_H COMP_I COMP_J
## Min. : 1.0 Min. : 0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 32.0 1st Qu.: 1 1st Qu.: 2.00 1st Qu.: 0.00
## Median : 74.5 Median : 7 Median : 7.00 Median : 1.00
## Mean : 348.0 Mean : 41 Mean : 55.88 Mean : 24.74
## 3rd Qu.: 199.0 3rd Qu.: 25 3rd Qu.: 24.00 3rd Qu.: 5.00
## Max. :150633.0 Max. :19515 Max. :29290.00 Max. :38720.00
## NA's :2 NA's :2 NA's :2 NA's :2
## COMP_K COMP_L COMP_M COMP_N
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 1.0
## Median : 0.00 Median : 0.00 Median : 4.00 Median : 4.0
## Mean : 15.55 Mean : 15.14 Mean : 51.29 Mean : 83.7
## 3rd Qu.: 2.00 3rd Qu.: 3.00 3rd Qu.: 13.00 3rd Qu.: 14.0
## Max. :23738.00 Max. :14003.00 Max. :49181.00 Max. :76757.0
## NA's :2 NA's :2 NA's :2 NA's :2
## COMP_O COMP_P COMP_Q COMP_R
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 0.00
## Median : 2.000 Median : 6.00 Median : 3.00 Median : 2.00
## Mean : 3.269 Mean : 30.96 Mean : 34.15 Mean : 12.18
## 3rd Qu.: 3.000 3rd Qu.: 17.00 3rd Qu.: 12.00 3rd Qu.: 6.00
## Max. :204.000 Max. :16030.00 Max. :22248.00 Max. :6687.00
## NA's :2 NA's :2 NA's :2 NA's :2
## COMP_S COMP_T COMP_U E_ACTIVE_SC
## Min. : 0.00 Min. :0 Min. : 0.00000 Min. :0.4716
## 1st Qu.: 5.00 1st Qu.:0 1st Qu.: 0.00000 1st Qu.:0.6087
## Median : 12.00 Median :0 Median : 0.00000 Median :0.6325
## Mean : 51.61 Mean :0 Mean : 0.05027 Mean :0.6308
## 3rd Qu.: 31.00 3rd Qu.:0 3rd Qu.: 0.00000 3rd Qu.:0.6543
## Max. :24832.00 Max. :0 Max. :123.00000 Max. :0.7448
## NA's :2 NA's :2 NA's :2 NA's :7
## GVA_AGROPEC_SC GVA_INDUSTRY_SC GVA_SERVICES_SC GVA_PUBLIC_SC
## Min. : 0.0000 Min. : 0.0001 Min. : 0.00049 Min. : 0.00049
## 1st Qu.: 0.3538 1st Qu.: 0.2608 1st Qu.: 1.61880 1st Qu.: 1.61880
## Median : 1.4255 Median : 0.7902 Median : 3.52067 Median : 3.52067
## Mean : 3.7525 Mean : 3.3071 Mean : 5.60526 Mean : 5.60526
## 3rd Qu.: 4.7315 3rd Qu.: 2.5900 3rd Qu.: 7.73028 3rd Qu.: 7.73028
## Max. :121.0575 Max. :254.9198 Max. :108.80127 Max. :108.80127
## NA's :2 NA's :2 NA's :2 NA's :2
## POP_DENS TTG_RATIO
## Min. : 0.166 Min. : -2.0920
## 1st Qu.: 11.945 1st Qu.: 0.0298
## Median : 25.306 Median : 0.0527
## Mean : 117.225 Mean : 8.5545
## 3rd Qu.: 55.446 3rd Qu.: 0.0972
## Max. :13533.497 Max. :324.6684
## NA's :3 NA's :2
NA data is observed. Next we will have to handle the NA values in our independent variables. We will remove the NA rows.
For our independent data, we observe unique rows with NA data in our POP_DENS, IDHM and E_ACTIVE, removing the NA using these 3 rows are sufficient in removing all the NA data in the other independent variables.
Brazil_Cities_Filt[is.na(Brazil_Cities_Filt$E_Active),]
## # A tibble: 7 x 66
## CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baln~ SC 0 NA NA NA NA
## 2 Moju~ PA 0 NA NA NA NA
## 3 Para~ MS 0 NA NA NA NA
## 4 Pesc~ SC 0 NA NA NA NA
## 5 Pint~ RS 0 NA NA NA NA
## 6 Sant~ BA 0 NA NA NA NA
## 7 São ~ PE 0 NA NA NA NA
## # ... with 59 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## # IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## # `IBGE_10-14` <dbl>, E_Active <dbl>, `IBGE_60+` <dbl>, `IDHM Ranking
## # 2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## # IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, AREA <dbl>,
## # RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## # GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## # GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## # MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## # COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## # COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## # COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## # COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, E_ACTIVE_SC <dbl>,
## # GVA_AGROPEC_SC <dbl>, GVA_INDUSTRY_SC <dbl>, GVA_SERVICES_SC <dbl>,
## # GVA_PUBLIC_SC <dbl>, POP_DENS <dbl>, TTG_RATIO <dbl>
Brazil_Cities_Filt[is.na(Brazil_Cities_Filt$IDHM_Renda),]
## # A tibble: 7 x 66
## CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baln~ SC 0 NA NA NA NA
## 2 Moju~ PA 0 NA NA NA NA
## 3 Para~ MS 0 NA NA NA NA
## 4 Pesc~ SC 0 NA NA NA NA
## 5 Pint~ RS 0 NA NA NA NA
## 6 Sant~ BA 0 9648 9648 0 2891
## 7 São ~ PE 0 NA NA NA NA
## # ... with 59 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## # IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## # `IBGE_10-14` <dbl>, E_Active <dbl>, `IBGE_60+` <dbl>, `IDHM Ranking
## # 2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## # IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, AREA <dbl>,
## # RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## # GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## # GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## # MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## # COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## # COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## # COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## # COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, E_ACTIVE_SC <dbl>,
## # GVA_AGROPEC_SC <dbl>, GVA_INDUSTRY_SC <dbl>, GVA_SERVICES_SC <dbl>,
## # GVA_PUBLIC_SC <dbl>, POP_DENS <dbl>, TTG_RATIO <dbl>
Brazil_Cities_Filt[is.na(Brazil_Cities_Filt$POP_DENS),]
## # A tibble: 3 x 66
## CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Japu~ AM 0 7326 7318 8 1043
## 2 Sant~ BA 0 NA NA NA NA
## 3 São ~ PE 0 NA NA NA NA
## # ... with 59 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## # IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## # `IBGE_10-14` <dbl>, E_Active <dbl>, `IBGE_60+` <dbl>, `IDHM Ranking
## # 2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## # IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, AREA <dbl>,
## # RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## # GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## # GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## # MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## # COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## # COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## # COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## # COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, E_ACTIVE_SC <dbl>,
## # GVA_AGROPEC_SC <dbl>, GVA_INDUSTRY_SC <dbl>, GVA_SERVICES_SC <dbl>,
## # GVA_PUBLIC_SC <dbl>, POP_DENS <dbl>, TTG_RATIO <dbl>
The code chunk below removes the NA values from our independent variables
Brazil_Cities_Filt <- Brazil_Cities_Filt %>%
filter(!is.na(`E_Active`)) %>%
filter(!is.na(`POP_DENS`)) %>%
filter(!is.na(`IDHM_Renda`))
Ensuring that our independent variables do not have any NA values
colnames(Brazil_Cities_Filt)[colSums(is.na(Brazil_Cities_Filt)) > 0]
## [1] "IBGE_DU" "IBGE_DU_URBAN" "IBGE_DU_RURAL" "ALT"
## [5] "MUN_EXPENDIT"
As we observe, our independent variables do not have any NA values remaining.
We will next select the only columns that we will need for the analysis.
BC_IndepVar <- subset(Brazil_Cities_Filt, select = c(1:3,19:23,35,60:66))
BC_IndepVar_sf <- st_as_sf(BC_IndepVar, coords = c("LONG","LAT"), crs=4326) %>%
st_transform(crs=4674)
We will join our newly created BC_IndepVar_sf simple feature data frame with our Municipal2016 simple feature data frame
Municipal_BC <- st_join(Municipal2016, BC_IndepVar_sf)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
Now we’ll check for any NA values present
Municipal_BC[rowSums(is.na(Municipal_BC)) > 0,]
## Simple feature collection with 9 features and 18 fields
## geometry type: GEOMETRY
## dimension: XY
## bbox: xmin: -70.05811 ymin: -33.61653 xmax: -39.43499 ymax: 0.06830957
## geographic CRS: SIRGAS 2000
## code_muni name_muni code_state abbrev_state CITY STATE CAPITAL
## 106 1302108 Japurá 13 AM <NA> <NA> NA
## 225 1504752 Mojuí Dos Campos 15 PA <NA> <NA> NA
## 2175 2928505 Santa Terezinha 29 BA <NA> <NA> NA
## 4504 4212650 Pescaria Brava 42 SC <NA> <NA> NA
## 4606 4220000 Balneário Rincão 42 SC <NA> <NA> NA
## 4607 4300001 Lagoa Mirim 43 RS <NA> <NA> NA
## 4608 4300002 Lagoa Dos Patos 43 RS <NA> <NA> NA
## 4926 4314548 Pinto Bandeira 43 RS <NA> <NA> NA
## 5163 5006275 Paraíso Das Águas 50 MS <NA> <NA> NA
## IDHM_Renda IDHM_Longevidade IDHM_Educacao GDP_CAPITA E_ACTIVE_SC
## 106 NA NA NA NA NA
## 225 NA NA NA NA NA
## 2175 NA NA NA NA NA
## 4504 NA NA NA NA NA
## 4606 NA NA NA NA NA
## 4607 NA NA NA NA NA
## 4608 NA NA NA NA NA
## 4926 NA NA NA NA NA
## 5163 NA NA NA NA NA
## GVA_AGROPEC_SC GVA_INDUSTRY_SC GVA_SERVICES_SC GVA_PUBLIC_SC POP_DENS
## 106 NA NA NA NA NA
## 225 NA NA NA NA NA
## 2175 NA NA NA NA NA
## 4504 NA NA NA NA NA
## 4606 NA NA NA NA NA
## 4607 NA NA NA NA NA
## 4608 NA NA NA NA NA
## 4926 NA NA NA NA NA
## 5163 NA NA NA NA NA
## TTG_RATIO geom
## 106 NA POLYGON ((-69.89499 -0.1310...
## 225 NA MULTIPOLYGON (((-54.58777 -...
## 2175 NA MULTIPOLYGON (((-39.56748 -...
## 4504 NA MULTIPOLYGON (((-48.92608 -...
## 4606 NA MULTIPOLYGON (((-49.20933 -...
## 4607 NA POLYGON ((-52.62241 -32.146...
## 4608 NA POLYGON ((-51.2719 -30.0389...
## 4926 NA POLYGON ((-51.45725 -29.029...
## 5163 NA POLYGON ((-52.90027 -18.782...
We observe that there are 9 rows with NA values. We will proceed to remove them.
Municipal_BC <- Municipal_BC %>%
filter(!is.na(`CITY`))
Ensuring no NA rows remain.
Municipal_BC[rowSums(is.na(Municipal_BC)) > 0,]
## Simple feature collection with 0 features and 18 fields
## bbox: xmin: NA ymin: NA xmax: NA ymax: NA
## geographic CRS: SIRGAS 2000
## [1] code_muni name_muni code_state abbrev_state
## [5] CITY STATE CAPITAL IDHM_Renda
## [9] IDHM_Longevidade IDHM_Educacao GDP_CAPITA E_ACTIVE_SC
## [13] GVA_AGROPEC_SC GVA_INDUSTRY_SC GVA_SERVICES_SC GVA_PUBLIC_SC
## [17] POP_DENS TTG_RATIO geom
## <0 rows> (or 0-length row.names)
From the output we can see no NA rows remain.
Next we will check if there are any invalid polygons in our joined simple feature data frame.
unique(st_is_valid(Municipal_BC))
## [1] TRUE
As we can see there are no invalid polygons.
We will plot a choropleth map showing the distribution of GDP per capita, 2016 at Municipality Level. We will increase the number of classes to have a finer distribution of the GDP_CAPITA.
We will also print two maps with inverse palette so that the macro details are not lost to contrast as well.
GDP2016_CMAP <- tm_shape(Municipal_BC) +
tm_polygons(col = "GDP_CAPITA",n = 15, border.col = 0.15, palette = "Reds") +
tm_layout(title = "Brazil Municipal 2016 GDP Per Capita",
title.size = 1,
title.position = c("center", "top"),
inner.margins = c(0.06, 0.10, 0.10, 0.08),
legend.width = 0.3,
legend.height = 0.3)
GDP2016_INV_CMAP <- tm_shape(Municipal_BC) +
tm_polygons(col = "GDP_CAPITA",n = 15, border.col = 0.15, palette = "-Reds") +
tm_layout(title = "Brazil Municipal 2016 GDP Per Capita",
title.size = 1,
title.position = c("center", "top"),
inner.margins = c(0.06, 0.10, 0.10, 0.08),
legend.width = 0.3,
legend.height = 0.3)
tmap_arrange(GDP2016_CMAP,GDP2016_INV_CMAP)
We can observe a cluster of municipalities with higher GDP_CAPITA in the central region. The area around the central cluster also generally shows a higher GDP_CAPITA relative to the pheripheral (Western and Eastern) regions. The highest GDP_CAPITA is recorded south of that cluster with high GDP_CAPITA small municipals scattered around. Lowest GDP_CAPITAs are observeede around the Western and Eastern municipality regions of Brazil.
We will take a look at the distribution of GDP per capita in a histogram. To make the scales more readable, we will divide GDP_CAPITA by 1000.
ggplot(data=Municipal_BC, aes(x=GDP_CAPITA/1000)) +
geom_histogram(bins=20, color="white", fill="red") +
ggtitle("Brazil Municipality GDP per capita Distribution, 2016") +
labs(x="Brazil Municipality GDP per capita, 2016 ($ 1,000 reais) ", y="count")
We observe a clear right-skewed distribution of GDP per capita or a higher number of municipals with lower GDP_CAPITA. This is to be expected as income is usually skewed.
First we will create the histograms for each independent variable
E_ACTIVE_HIST <- ggplot(data=Municipal_BC, aes(x=E_ACTIVE_SC)) +
geom_histogram(bins=20, color="black", fill="light blue")
RENDA_HIST <- ggplot(data=Municipal_BC, aes(x=IDHM_Renda)) +
geom_histogram(bins=20, color="black", fill="light blue")
LONG_HIST <- ggplot(data=Municipal_BC, aes(x=IDHM_Longevidade)) +
geom_histogram(bins=20, color="black", fill="light blue")
EDU_HIST <- ggplot(data=Municipal_BC, aes(x=IDHM_Educacao)) +
geom_histogram(bins=20, color="black", fill="light blue")
GVA_AGRO_HIST <- ggplot(data=Municipal_BC, aes(x=GVA_AGROPEC_SC)) +
geom_histogram(bins=20, color="black", fill="light blue")
GVA_IND_HIST <- ggplot(data=Municipal_BC, aes(x=GVA_INDUSTRY_SC)) +
geom_histogram(bins=20, color="black", fill="light blue")
GVA_SVC_HIST <- ggplot(data=Municipal_BC, aes(x=GVA_SERVICES_SC)) +
geom_histogram(bins=20, color="black", fill="light blue")
GVA_PUB_HIST <- ggplot(data=Municipal_BC, aes(x=GVA_PUBLIC_SC)) +
geom_histogram(bins=20, color="black", fill="light blue")
POP_DENS_HIST <- ggplot(data=Municipal_BC, aes(x=POP_DENS)) +
geom_histogram(bins=20, color="black", fill="light blue")
TTG_RATIO_HIST <- ggplot(data=Municipal_BC, aes(x=TTG_RATIO)) +
geom_histogram(bins=20, color="black", fill="light blue")
The code chunk below prints the independent variables that we will use
ggarrange(E_ACTIVE_HIST, RENDA_HIST, LONG_HIST, EDU_HIST, GVA_AGRO_HIST, GVA_IND_HIST, GVA_SVC_HIST, GVA_PUB_HIST, POP_DENS_HIST, TTG_RATIO_HIST, ncol = 3, nrow = 4)
We observe that the our E_ACTIVE_SC and IDHM variables are normally distributed and within the 0-1 value range. However our GVA, POP_DENS and TTG_RATIO are all right-skewed with large variable ranges especially with POP_DENS and thus will require normalisation. Note that our dependent variable GDP_CAPITA is not within the range of 0-1 and will require normalisation as well.
Since the data for the variables we are normalising are not normally distributed, we will use the Min-Max method.
Municipal_BC$GVA_AGROPEC_SC <- normalize(Municipal_BC$GVA_AGROPEC_SC)
Municipal_BC$GVA_INDUSTRY_SC <- normalize(Municipal_BC$GVA_INDUSTRY_SC)
Municipal_BC$GVA_SERVICES_SC <- normalize(Municipal_BC$GVA_SERVICES_SC)
Municipal_BC$GVA_PUBLIC_SC <- normalize(Municipal_BC$GVA_PUBLIC_SC)
Municipal_BC$POP_DENS <- normalize(Municipal_BC$POP_DENS)
Municipal_BC$TTG_RATIO <- normalize(Municipal_BC$TTG_RATIO)
Municipal_BC$GDP_CAPITA <- normalize(Municipal_BC$GDP_CAPITA)
GVA_AGRO_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GVA_AGROPEC_SC)) +
geom_histogram(bins=20, color="black", fill="light blue")
GVA_IND_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GVA_INDUSTRY_SC)) +
geom_histogram(bins=20, color="black", fill="light blue")
GVA_SVC_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GVA_SERVICES_SC)) +
geom_histogram(bins=20, color="black", fill="light blue")
GVA_PUB_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GVA_PUBLIC_SC)) +
geom_histogram(bins=20, color="black", fill="light blue")
POP_DENS_HIST_NM <- ggplot(data=Municipal_BC, aes(x=POP_DENS)) +
geom_histogram(bins=20, color="black", fill="light blue")
TTG_RATIO_HIST_NM <- ggplot(data=Municipal_BC, aes(x=TTG_RATIO)) +
geom_histogram(bins=20, color="black", fill="light blue")
GDP_CAPITA_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GDP_CAPITA)) +
geom_histogram(bins=20, color="black", fill="light blue")
ggarrange(GVA_AGRO_HIST_NM, GVA_IND_HIST_NM, GVA_SVC_HIST_NM, GVA_PUB_HIST_NM, POP_DENS_HIST_NM, TTG_RATIO_HIST_NM, GDP_CAPITA_HIST_NM, ncol = 3, nrow = 3)
We see that the data ranges are now betwee 0 and 1
We will first plot the correlation matrix to find high correlation coefficient magnitudes between our independent variables.
The code chunk below is to create a dataframe with our Municipal_BC Simple Feature Data Frame
Municipal_BC_df <- as.data.frame(Municipal_BC)
Next we will use the columns of the newly created data frame to create our correlation matrix
corrplot(cor(Municipal_BC_df[, 7:18]), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")
We observe that GVA_SERVICES_SC and GVA_PUBLIC_SC have a strong positive correlation coefficient at 1 and thus we will omit GVA_PUBLIC_SC from our future analysis.
Next we see that IDHM_Renda has a high correlation coefficient magnitude (>0.8) with IDHM_Longevidade at 0.83 and IDHM_Educaco at 0.82. Hence we will omit IDHM_Renda from our future analysis.
With our highly correlated independent variables removed, we can proceed with calibrating our multiple linear regression model.
For our analysis we will be assuming 95% confidence interval or 0.05 alpha value
Municipal_BC.mlr <- lm(GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + POP_DENS + TTG_RATIO + CAPITAL, data=Municipal_BC)
summary(Municipal_BC.mlr)
##
## Call:
## lm(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao +
## GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + POP_DENS +
## TTG_RATIO + CAPITAL, data = Municipal_BC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10127 -0.01036 -0.00407 0.00328 0.83669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.110319 0.009951 -11.087 < 2e-16 ***
## E_ACTIVE_SC 0.092282 0.016697 5.527 3.41e-08 ***
## IDHM_Longevidade 0.060388 0.012449 4.851 1.26e-06 ***
## IDHM_Educacao 0.033109 0.006210 5.332 1.01e-07 ***
## GVA_AGROPEC_SC 0.376736 0.007214 52.222 < 2e-16 ***
## GVA_INDUSTRY_SC 0.925389 0.010191 90.802 < 2e-16 ***
## GVA_SERVICES_SC 0.360013 0.007123 50.541 < 2e-16 ***
## POP_DENS 0.015377 0.008679 1.772 0.07649 .
## TTG_RATIO 0.013037 0.004636 2.812 0.00493 **
## CAPITAL -0.001007 0.005513 -0.183 0.85505
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02713 on 5553 degrees of freedom
## Multiple R-squared: 0.8273, Adjusted R-squared: 0.8271
## F-statistic: 2957 on 9 and 5553 DF, p-value: < 2.2e-16
At our confidence interval of 95%, we observe that only cAPITAL and POP_DENS are not statistically significant and hence will be dropped from the multiple linear regression model.
This code chunk will calibrate a revised MLR model with the prior statistically insignificant variables dropped.
Municipal_BC.mlr1 <- lm(GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC)
ols_regress(Municipal_BC.mlr1)
## Model Summary
## --------------------------------------------------------------
## R 0.910 RMSE 0.027
## R-Squared 0.827 Coef. Var 47.129
## Adj. R-Squared 0.827 MSE 0.001
## Pred R-Squared 0.826 MAE 0.012
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------
## Regression 19.587 7 2.798 3799.94 0.0000
## Residual 4.090 5555 0.001
## Total 23.677 5562
## -----------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------
## (Intercept) -0.111 0.010 -11.212 0.000 -0.131 -0.092
## E_ACTIVE_SC 0.094 0.017 0.046 5.667 0.000 0.062 0.127
## IDHM_Longevidade 0.060 0.012 0.041 4.800 0.000 0.035 0.084
## IDHM_Educacao 0.034 0.006 0.048 5.472 0.000 0.022 0.046
## GVA_AGROPEC_SC 0.375 0.007 0.304 52.403 0.000 0.361 0.389
## GVA_INDUSTRY_SC 0.924 0.010 0.549 90.909 0.000 0.904 0.944
## GVA_SERVICES_SC 0.362 0.007 0.352 51.289 0.000 0.348 0.375
## TTG_RATIO 0.013 0.005 0.016 2.819 0.005 0.004 0.022
## ----------------------------------------------------------------------------------------------
It is observed that our adjusted R^2 value is at 0.827 which tells us that 82.7% of the variation in our dependent variable is explained by the independent variables.
We will use ols_vif_tol() to test for multicolinearity between our independent variables.
ols_vif_tol(Municipal_BC.mlr1)
## Variables Tolerance VIF
## 1 E_ACTIVE_SC 0.4641728 2.154370
## 2 IDHM_Longevidade 0.4290432 2.330768
## 3 IDHM_Educacao 0.3971184 2.518141
## 4 GVA_AGROPEC_SC 0.9265720 1.079247
## 5 GVA_INDUSTRY_SC 0.8513631 1.174587
## 6 GVA_SERVICES_SC 0.6589498 1.517566
## 7 TTG_RATIO 0.9810219 1.019345
We observe that none of our variables have a VIF of >=10, hence we can conclude that there are no signs of multicolinearity between our independent variables.
The below code chunk will test for the assumption of linearity between the independent and dependent variables.
ols_plot_resid_fit(Municipal_BC.mlr1)
It can be observed from the output plot that most of the data points are scattered around the 0 line. Thus we can conclude that the relationship between the independent variables and the dependent variables are linear.
The code chunk below will test for normality using the ols_plot_resid_hist() function.
ols_plot_resid_hist(Municipal_BC.mlr1)
The output figure reveals that the residual of the multiple linear regression resembles normal distribution
For our spatial autocorrelation testing,
Our confidence interval will be 95% with an alpha value of 0.05.
Null Hypothesis H0: The residuals of the MLR model are randomly distributed.
H1: The residuals of the MLR model are NOT randomly distributed.
We will firstly need to extract our residuals from our MLR model and save it as a dataframe
mlr.output <- as.data.frame(Municipal_BC.mlr1$residuals)
Next we’ll join our newly created residuals data frame with Municipal_BC simple feature object
Municipal_BC.sf <- cbind(Municipal_BC, Municipal_BC.mlr1$residuals) %>%
rename(`MLR_RES` = `Municipal_BC.mlr1.residuals`)
After that we will convert our joined simple feature object into a SpatialPointsDataFrame as spdep only can process sp conformed spatial data objects.
Municipal_BC.sp <- as_Spatial(Municipal_BC.sf)
Municipal_BC.sp
## class : SpatialPolygonsDataFrame
## features : 5563
## extent : -73.99045, -28.83594, -33.75118, 5.271841 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## variables : 19
## names : code_muni, name_muni, code_state, abbrev_state, CITY, STATE, CAPITAL, IDHM_Renda, IDHM_Longevidade, IDHM_Educacao, GDP_CAPITA, E_ACTIVE_SC, GVA_AGROPEC_SC, GVA_INDUSTRY_SC, GVA_SERVICES_SC, ...
## min values : 1100015, Abadia De Goiás, 11, AC, Abadia De Goiás, AC, 0, 0.4, 0.672, 0.207, 0, 0.471631205673759, 0, 0, 0, ...
## max values : 5300108, Zortéa, 53, TO, Zortéa, TO, 1, 0.891, 0.894, 0.825, 1, 0.744760130414532, 1, 1, 1, ...
Next we will visualise our residuals by plotting a choropleth map together with an interactive point symbol map to enable us to have a more precise investigation.
Resid_Cmap <- tm_shape(Municipal_BC) +
tm_polygons(border.col = 0.5) +
tm_shape(Municipal_BC.sp) +
tm_fill(col="MLR_RES",
alpha = 0.6,
style="quantile",
midpoint = 0)+
tm_layout(title = "GDP Per Capita Residual Distribution",
title.size = 1,
title.position = c("center", "top"),
inner.margins = c(0.06, 0.10, 0.10, 0.08),
legend.width = 0.64,
legend.height = 0.64)
Resid_Pmap <- tm_shape(Municipal_BC)+
tm_polygons(border.col = 0.5)+
tm_shape(Municipal_BC.sf)+
tm_dots(col="MLR_RES", alpha=0.6, style="quantile", midpoint = 0)+
tm_layout(title = "GDP Per Capita Residual Point Map",
title.size = 1,
title.position = c("center", "top"),
inner.margins = c(0.06, 0.10, 0.10, 0.08),
legend.width = 0.64,
legend.height = 0.64)
tmap_arrange(Resid_Cmap, Resid_Pmap, ncol = 2)
We will perform more statistical tests to determine if there is spatial autocorrelation, however from the point map we can observe some spatial autocorrelation such as in the south eastern regions of brazil by the shorelines.
Our next steps will involve performing the Moran’s I Test.
But before we do so, we will find the distance summary for computing our distance-based matrix.
coords <- coordinates(Municipal_BC.sp)
k <- knn2nb(knearneigh(coords))
kdists <- unlist(nbdists(k, coords, longlat=FALSE))
summary(kdists)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01289 0.10908 0.14688 0.19638 0.22004 3.33432
The output tells us 3.33432 is the max, therefore we will adopt 3.4 as our upper threshold distance to ensure all units will have at least one neighbour.
The code chunk below will compute the distance-based weight matrix and print out the summary data of our matrix.
nb <- dnearneigh(coords, 0, 3.4, longlat=FALSE)
nb_lw <- nb2listw(nb, style='B')
summary(nb_lw)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 5563
## Number of nonzero links: 2983304
## Percentage nonzero weights: 9.640052
## Average number of links: 536.2761
## Link number distribution:
##
## 3 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 2 1 2 4 1 2 7 13 4 7 6 4 10 9 2 3 7 4 6 5
## 27 28 29 30 31 32 33 34 36 37 38 39 40 41 42 43 44 45 46 47
## 8 6 6 4 5 3 2 2 1 2 1 2 5 3 5 1 3 2 5 6
## 48 49 50 51 52 53 54 55 56 57 58 59 60 61 63 64 65 66 67 69
## 5 5 8 6 5 3 5 12 11 8 8 2 11 3 1 1 1 6 2 5
## 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 90
## 3 3 7 1 3 2 5 3 2 4 2 3 1 3 4 3 4 1 5 4
## 91 92 96 97 98 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
## 3 1 1 1 3 3 1 2 3 3 3 1 1 1 5 1 6 1 1 2
## 115 116 117 118 119 120 121 122 123 124 125 127 128 130 131 132 133 134 135 136
## 1 3 1 10 3 1 1 3 2 1 1 3 3 4 3 2 5 2 4 2
## 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156
## 2 2 4 2 2 3 7 2 4 8 7 7 5 3 5 9 6 5 9 7
## 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 5 7 4 2 6 5 3 7 7 1 7 8 8 4 3 4 3 2 9 7
## 177 178 179 180 181 182 183 184 185 186 187 188 190 191 192 193 194 195 196 197
## 4 2 3 1 3 2 1 3 1 2 2 3 5 1 3 4 5 5 4 2
## 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217
## 4 1 2 4 2 1 2 2 2 1 1 4 3 3 2 4 1 4 6 2
## 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237
## 6 4 4 8 1 2 9 6 3 7 2 4 2 2 3 5 4 5 4 5
## 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257
## 5 6 4 3 6 3 2 7 4 2 6 8 4 6 5 7 8 6 7 7
## 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277
## 2 5 3 7 3 7 6 3 9 5 5 5 5 5 4 11 7 3 7 4
## 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 297 298
## 6 9 6 9 7 6 5 8 6 2 4 11 6 8 3 5 5 8 1 4
## 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318
## 4 7 3 3 4 1 6 7 3 7 7 3 4 7 7 5 7 4 3 7
## 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338
## 7 12 2 4 7 5 6 6 2 7 7 7 3 3 5 5 2 1 5 5
## 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358
## 3 8 3 6 5 7 6 7 4 6 6 5 8 6 3 3 4 4 10 2
## 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378
## 4 6 10 5 3 6 4 4 5 5 9 7 1 6 5 7 8 6 10 6
## 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398
## 7 9 15 8 8 7 12 11 5 6 6 14 9 7 8 4 11 6 8 9
## 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418
## 3 6 4 5 5 6 7 6 9 16 8 3 6 4 6 6 4 6 8 5
## 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438
## 7 6 2 7 7 2 2 3 4 7 6 2 9 7 2 7 7 4 3 10
## 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458
## 7 5 4 7 1 4 4 2 3 4 4 3 4 7 5 3 3 3 7 5
## 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478
## 4 7 7 2 4 3 4 3 8 2 2 4 4 4 6 7 2 2 4 2
## 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498
## 6 3 7 1 2 4 6 2 5 3 3 5 2 3 3 7 6 8 3 2
## 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518
## 4 7 10 6 5 5 6 9 1 3 4 7 11 4 5 8 4 3 6 5
## 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538
## 2 2 3 5 2 5 6 4 2 9 5 2 4 3 7 7 7 8 8 3
## 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558
## 4 4 6 4 5 2 6 2 7 5 10 6 4 7 5 11 5 4 2 11
## 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578
## 2 5 5 11 4 11 7 8 5 6 8 5 4 4 5 6 13 7 5 4
## 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598
## 4 13 11 8 4 5 5 9 4 8 4 7 6 5 15 7 13 7 15 5
## 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618
## 5 6 6 10 7 6 8 7 6 12 5 10 11 6 5 10 8 7 6 8
## 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638
## 9 11 12 10 8 10 4 9 8 9 8 10 7 7 9 8 11 6 15 9
## 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658
## 12 7 11 9 6 11 3 12 7 8 14 11 10 14 6 7 12 10 10 10
## 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678
## 10 11 9 8 12 13 16 3 10 8 12 8 10 13 8 7 5 7 14 9
## 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698
## 8 6 13 11 15 13 12 3 10 9 4 20 13 7 12 12 8 8 10 10
## 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718
## 5 4 12 10 12 16 11 15 7 10 10 12 5 9 9 14 12 11 5 11
## 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738
## 8 9 7 8 8 13 12 12 12 9 15 5 7 12 9 14 8 15 8 5
## 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758
## 5 16 4 11 9 11 8 14 14 9 14 9 4 11 7 9 15 5 8 12
## 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778
## 7 7 13 12 5 9 10 9 9 7 14 9 6 2 12 11 6 12 14 9
## 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798
## 7 6 7 8 11 11 7 6 5 12 6 10 7 8 14 8 12 9 7 16
## 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818
## 6 8 9 8 10 8 12 10 12 5 4 13 11 5 10 9 4 8 10 6
## 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838
## 11 10 12 6 8 11 17 9 10 7 8 4 6 8 10 7 5 11 6 11
## 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858
## 11 10 16 8 4 12 8 8 8 11 12 7 10 10 5 12 7 9 5 4
## 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878
## 10 8 5 15 6 9 10 11 6 8 8 7 8 2 8 8 6 5 7 7
## 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898
## 3 7 8 5 9 3 3 8 5 8 6 7 7 2 7 6 7 8 6 6
## 899 900 901 902 903 904 905 906 907 908 910 911 912 913 914 917 918 920 921 922
## 5 2 7 5 2 4 5 1 3 1 3 1 1 2 1 1 3 2 2 1
## 923 925 928 931 932 934 935 944 948 949 954 955 956
## 2 1 2 1 2 1 1 2 2 1 1 1 1
## 2 least connected regions:
## 125 1524 with 3 links
## 1 most connected region:
## 4032 with 956 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 5563 30946969 2983304 5966608 7764659904
Now we will compute our Moran’s I with the code chunk below.
lm.morantest(Municipal_BC.mlr1,nb_lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade +
## IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC +
## TTG_RATIO, data = Municipal_BC)
## weights: nb_lw
##
## Moran I statistic standard deviate = 8.6329, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 5.550393e-03 -3.663159e-04 4.697237e-07
It is observed here that the p value is less than our alpha value of 0.05, and therefore we reject the null hypothesis that the residual distribution is random.
Also, our Observed Moran’s I is > 0 at 0.005550393. Thus we can conclude that residuals resemble cluster distribution.
The code chunk below will help us find our optimal bandwidth recommendation for our fixed bandwidth gwr model.
bw.fixed <- bw.gwr(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC.sp, approach="CV", kernel="gaussian", adaptive=FALSE, longlat=FALSE)
## Take a cup of tea and have a break, it will take a few minutes.
## -----A kind suggestion from GWmodel development group
## Fixed bandwidth: 34.70094 CV score: 4.121122
## Fixed bandwidth: 21.45065 CV score: 4.11316
## Fixed bandwidth: 13.26152 CV score: 4.096541
## Fixed bandwidth: 8.200357 CV score: 4.071068
## Fixed bandwidth: 5.072388 CV score: 4.044393
## Fixed bandwidth: 3.139197 CV score: 4.036085
## Fixed bandwidth: 1.944419 CV score: 4.063393
## Fixed bandwidth: 3.87761 CV score: 4.036465
## Fixed bandwidth: 2.682833 CV score: 4.040249
## Fixed bandwidth: 3.421246 CV score: 4.035577
## Fixed bandwidth: 3.595562 CV score: 4.035705
## Fixed bandwidth: 3.313513 CV score: 4.035652
We see the output bandwidth recommendation of 3.313513 metres.
gwr.fixed <- gwr.basic(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC.sp, bw=bw.fixed, kernel = 'gaussian', longlat = FALSE)
gwr.fixed
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2020-05-31 23:10:32
## Call:
## gwr.basic(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade +
## IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC +
## TTG_RATIO, data = Municipal_BC.sp, bw = bw.fixed, kernel = "gaussian",
## longlat = FALSE)
##
## Dependent (y) variable: GDP_CAPITA
## Independent variables: E_ACTIVE_SC IDHM_Longevidade IDHM_Educacao GVA_AGROPEC_SC GVA_INDUSTRY_SC GVA_SERVICES_SC TTG_RATIO
## Number of data points: 5563
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10000 -0.01043 -0.00407 0.00335 0.83660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.111384 0.009934 -11.212 < 2e-16 ***
## E_ACTIVE_SC 0.094369 0.016652 5.667 1.52e-08 ***
## IDHM_Longevidade 0.059676 0.012432 4.800 1.63e-06 ***
## IDHM_Educacao 0.033865 0.006189 5.472 4.65e-08 ***
## GVA_AGROPEC_SC 0.375172 0.007159 52.403 < 2e-16 ***
## GVA_INDUSTRY_SC 0.924359 0.010168 90.909 < 2e-16 ***
## GVA_SERVICES_SC 0.361674 0.007052 51.289 < 2e-16 ***
## TTG_RATIO 0.013063 0.004634 2.819 0.00483 **
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.02714 on 5555 degrees of freedom
## Multiple R-squared: 0.8272
## Adjusted R-squared: 0.827
## F-statistic: 3800 on 7 and 5555 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 4.090433
## Sigma(hat): 0.02712116
## AIC: -24333.28
## AICc: -24333.25
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: gaussian
## Fixed bandwidth: 3.421246
## Regression points: the same locations as observations are used.
## Distance metric: Euclidean distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu. Max.
## Intercept -0.2445780 -0.1547615 -0.1166513 -0.0439943 0.0776
## E_ACTIVE_SC -0.1762191 0.0501316 0.1093239 0.1828372 0.2774
## IDHM_Longevidade -0.0413518 0.0118118 0.0227798 0.0590263 0.2692
## IDHM_Educacao -0.0639098 0.0128900 0.0345071 0.0473316 0.0933
## GVA_AGROPEC_SC 0.2532614 0.3469561 0.3626922 0.3711437 0.4089
## GVA_INDUSTRY_SC 0.8059552 0.8955244 0.9467388 1.0010579 1.7183
## GVA_SERVICES_SC 0.1636195 0.3265849 0.3492343 0.3638618 0.6414
## TTG_RATIO -0.0238476 0.0087139 0.0115743 0.0154589 0.0537
## ************************Diagnostic information*************************
## Number of data points: 5563
## Effective number of parameters (2trace(S) - trace(S'S)): 113.7207
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5449.279
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -24569.17
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -24655.64
## Residual sum of squares: 3.816032
## R-square value: 0.8388301
## Adjusted R-square value: 0.8354661
##
## ***********************************************************************
## Program stops at: 2020-05-31 23:10:39
We observe that the adjusted r square value improved from 0.827 to 0.835, thus explaining the variation better than our multiple linear regression model. We also observe a lower GWR AIC value at -24655.64 than the GR AIC at -24333.28 which indicates that the GWR model has a better fit.
We will compare it with adaptive bandwidth to see which is better.
The code chunk below will help us find our optimal number of points for our adaptive bandwidth gwr model.
dMat <- gw.dist(dp.locat = coordinates(Municipal_BC.sp))
bw.adaptive <- bw.gwr(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC.sp, approach="CV", kernel="gaussian",
adaptive=TRUE, longlat=FALSE, dMat=dMat)
## Take a cup of tea and have a break, it will take a few minutes.
## -----A kind suggestion from GWmodel development group
## Adaptive bandwidth: 3445 CV score: 4.101626
## Adaptive bandwidth: 2137 CV score: 4.078878
## Adaptive bandwidth: 1327 CV score: 4.047649
## Adaptive bandwidth: 828 CV score: 4.020536
## Adaptive bandwidth: 518 CV score: 4.006486
## Adaptive bandwidth: 328 CV score: 4.010149
## Adaptive bandwidth: 637 CV score: 4.011214
## Adaptive bandwidth: 446 CV score: 4.006151
## Adaptive bandwidth: 400 CV score: 4.006689
## Adaptive bandwidth: 473 CV score: 4.005898
## Adaptive bandwidth: 491 CV score: 4.006018
## Adaptive bandwidth: 463 CV score: 4.005963
From the output we can see that we are recommended 463 points for our adaptive bandwidth.
gwr.adaptive <- gwr.basic(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC.sp, bw=bw.adaptive, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)
Displaying the output
gwr.adaptive
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2020-05-31 23:11:38
## Call:
## gwr.basic(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade +
## IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC +
## TTG_RATIO, data = Municipal_BC.sp, bw = bw.adaptive, kernel = "gaussian",
## adaptive = TRUE, longlat = FALSE)
##
## Dependent (y) variable: GDP_CAPITA
## Independent variables: E_ACTIVE_SC IDHM_Longevidade IDHM_Educacao GVA_AGROPEC_SC GVA_INDUSTRY_SC GVA_SERVICES_SC TTG_RATIO
## Number of data points: 5563
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10000 -0.01043 -0.00407 0.00335 0.83660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.111384 0.009934 -11.212 < 2e-16 ***
## E_ACTIVE_SC 0.094369 0.016652 5.667 1.52e-08 ***
## IDHM_Longevidade 0.059676 0.012432 4.800 1.63e-06 ***
## IDHM_Educacao 0.033865 0.006189 5.472 4.65e-08 ***
## GVA_AGROPEC_SC 0.375172 0.007159 52.403 < 2e-16 ***
## GVA_INDUSTRY_SC 0.924359 0.010168 90.909 < 2e-16 ***
## GVA_SERVICES_SC 0.361674 0.007052 51.289 < 2e-16 ***
## TTG_RATIO 0.013063 0.004634 2.819 0.00483 **
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.02714 on 5555 degrees of freedom
## Multiple R-squared: 0.8272
## Adjusted R-squared: 0.827
## F-statistic: 3800 on 7 and 5555 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 4.090433
## Sigma(hat): 0.02712116
## AIC: -24333.28
## AICc: -24333.25
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: gaussian
## Adaptive bandwidth: 473 (number of nearest neighbours)
## Regression points: the same locations as observations are used.
## Distance metric: Euclidean distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu. Max.
## Intercept -0.2190601 -0.1533961 -0.1187402 -0.0506453 -0.0256
## E_ACTIVE_SC 0.0219149 0.0543199 0.1058008 0.1819362 0.3436
## IDHM_Longevidade -0.0081273 0.0121306 0.0260083 0.0531487 0.0759
## IDHM_Educacao 0.0038680 0.0149151 0.0324215 0.0483241 0.0983
## GVA_AGROPEC_SC 0.2919661 0.3413579 0.3670434 0.3739130 0.4019
## GVA_INDUSTRY_SC 0.8457256 0.8943751 0.9434476 1.0019872 1.0782
## GVA_SERVICES_SC 0.2304597 0.3165152 0.3468584 0.3673190 0.4012
## TTG_RATIO -0.0011856 0.0084792 0.0115244 0.0157431 0.0223
## ************************Diagnostic information*************************
## Number of data points: 5563
## Effective number of parameters (2trace(S) - trace(S'S)): 71.5749
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5491.425
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -24590.57
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -24645.5
## Residual sum of squares: 3.843693
## R-square value: 0.8376618
## Adjusted R-square value: 0.8355456
##
## ***********************************************************************
## Program stops at: 2020-05-31 23:11:50
For the output for adaptive matrix, we observe that the adjusted r^2 value is at 0.3855456, or marginally better than fixed bandwidth. AIC is the same. We also observe a lower GWR AIC value at -24645.5 than the GR AIC at -24333.28 which indicates that the GWR model has a better fit. However we it is observed that fixed GWR has a lower AIC value at -24655.64.
Since the adjusted r square values between fixed and adaptive bandwidth gwrs or essentially on par, we will use Fixed bandwidth moving foward as it has the better AIC value.
The code chunk below will convert our SDF data into a sf data.frame
Municipal_BC_Fixed <- st_as_sf(gwr.fixed$SDF) %>%
st_transform(crs=4674)
gwr.fixed.output <- as.data.frame(gwr.fixed$SDF)
Municipal_BC_Fixed <- cbind(Municipal_BC_Fixed, as.matrix(gwr.fixed.output))
Municipal_BC_Fixed
## Simple feature collection with 5563 features and 60 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -73.99045 ymin: -33.75118 xmax: -28.83594 ymax: 5.271841
## geographic CRS: SIRGAS 2000
## First 10 features:
## Intercept E_ACTIVE_SC IDHM_Longevidade IDHM_Educacao GVA_AGROPEC_SC
## 1 0.04238358 -1.095105e-01 0.07861155 -0.05096840 0.3241375
## 2 -0.02399894 8.716528e-05 0.06225939 -0.03157196 0.3259902
## 3 0.07584267 -1.733096e-01 0.09611157 -0.06096790 0.3257255
## 4 0.01377714 -7.553315e-02 0.08720081 -0.04963263 0.3262039
## 5 0.06912431 -1.583624e-01 0.09065771 -0.05945466 0.3248705
## 6 0.06865024 -1.641142e-01 0.09672861 -0.05930521 0.3259579
## 7 0.06285221 -1.505709e-01 0.09157375 -0.05812038 0.3252057
## 8 0.01051681 -5.851647e-02 0.06640669 -0.03355619 0.3241697
## 9 0.01451844 -8.528723e-02 0.09567998 -0.05127515 0.3272012
## 10 -0.01139459 -2.501714e-02 0.06142085 -0.02359192 0.3252393
## GVA_INDUSTRY_SC GVA_SERVICES_SC TTG_RATIO y yhat
## 1 1.543784 0.5990802 -0.019444635 0.04990125 0.04599972
## 2 1.573399 0.5482027 -0.008369576 0.05595688 0.03155272
## 3 1.321921 0.6259474 -0.020850347 0.05783450 0.05323487
## 4 1.539014 0.5780136 -0.022612264 0.06081357 0.07573962
## 5 1.412513 0.6205307 -0.022503581 0.06270891 0.07270296
## 6 1.346432 0.6183183 -0.021200888 0.04244695 0.04144850
## 7 1.422878 0.6140240 -0.022660226 0.07657810 0.07341621
## 8 1.634248 0.5671677 -0.002533554 0.02861561 0.03342865
## 9 1.505531 0.5776164 -0.023742653 0.04321665 0.05454719
## 10 1.647379 0.5398315 0.004548765 0.04026828 0.01825436
## residual CV_Score Stud_residual Intercept_SE E_ACTIVE_SC_SE
## 1 0.0039015286 0 0.1529909 0.08744519 0.10628095
## 2 0.0244041622 0 0.9921369 0.07965353 0.09213324
## 3 0.0045996394 0 0.1837896 0.08810145 0.10666639
## 4 -0.0149260521 0 -0.5844040 0.08027562 0.09563057
## 5 -0.0099940523 0 -0.3889078 0.08864302 0.10801825
## 6 0.0009984495 0 0.0383953 0.08654187 0.10468834
## 7 0.0031618971 0 0.1253911 0.08695399 0.10580955
## 8 -0.0048130424 0 -0.1988655 0.09028950 0.10712896
## 9 -0.0113305412 0 -0.4369966 0.07931918 0.09386379
## 10 0.0220139183 0 0.8884765 0.08833568 0.10452726
## IDHM_Longevidade_SE IDHM_Educacao_SE GVA_AGROPEC_SC_SE GVA_INDUSTRY_SC_SE
## 1 0.1110368 0.04588061 0.02652382 0.1579919
## 2 0.1119694 0.04740315 0.03257650 0.2150466
## 3 0.1020531 0.03981730 0.02201388 0.1005255
## 4 0.1058184 0.04277850 0.02465380 0.1544803
## 5 0.1055777 0.04211908 0.02334130 0.1209299
## 6 0.1016634 0.03955925 0.02197706 0.1045009
## 7 0.1046616 0.04156298 0.02311131 0.1219210
## 8 0.1202339 0.05172279 0.03458566 0.2100146
## 9 0.1032246 0.04088969 0.02331786 0.1417038
## 10 0.1209990 0.05267237 0.03912466 0.2400011
## GVA_SERVICES_SC_SE TTG_RATIO_SE Intercept_TV E_ACTIVE_SC_TV
## 1 0.04484955 0.02876817 0.4846874 -1.0303865742
## 2 0.05455090 0.03203154 -0.3012916 0.0009460785
## 3 0.03805070 0.02730878 0.8608561 -1.6247814079
## 4 0.04243829 0.02822399 0.1716229 -0.7898431015
## 5 0.04010442 0.02771053 0.7798055 -1.4660709286
## 6 0.03797883 0.02708284 0.7932604 -1.5676451309
## 7 0.03978546 0.02749665 0.7228215 -1.4230371567
## 8 0.05564012 0.03190093 0.1164787 -0.5462245150
## 9 0.04021954 0.02750499 0.1830382 -0.9086276383
## 10 0.06256852 0.03351537 -0.1289919 -0.2393360792
## IDHM_Longevidade_TV IDHM_Educacao_TV GVA_AGROPEC_SC_TV GVA_INDUSTRY_SC_TV
## 1 0.7079772 -1.1108919 12.220621 9.771289
## 2 0.5560394 -0.6660309 10.006914 7.316547
## 3 0.9417800 -1.5311911 14.796375 13.150111
## 4 0.8240608 -1.1602237 13.231385 9.962527
## 5 0.8586825 -1.4115848 13.918270 11.680424
## 6 0.9514596 -1.4991493 14.831733 12.884406
## 7 0.8749508 -1.3983691 14.071279 11.670487
## 8 0.5523124 -0.6487699 9.372950 7.781592
## 9 0.9269104 -1.2539872 14.032212 10.624491
## 10 0.5076143 -0.4478994 8.312898 6.864047
## GVA_SERVICES_SC_TV TTG_RATIO_TV Local_R2 Intercept.1 E_ACTIVE_SC.1
## 1 13.357552 -0.67590792 0.9231928 0.04238358 -1.095105e-01
## 2 10.049381 -0.26129173 0.9106030 -0.02399894 8.716528e-05
## 3 16.450355 -0.76350348 0.9299215 0.07584267 -1.733096e-01
## 4 13.620094 -0.80117180 0.9240774 0.01377714 -7.553315e-02
## 5 15.472876 -0.81209492 0.9282309 0.06912431 -1.583624e-01
## 6 16.280601 -0.78281617 0.9299548 0.06865024 -1.641142e-01
## 7 15.433375 -0.82410846 0.9283817 0.06285221 -1.505709e-01
## 8 10.193502 -0.07941943 0.9120981 0.01051681 -5.851647e-02
## 9 14.361586 -0.86321250 0.9262483 0.01451844 -8.528723e-02
## 10 8.627846 0.13572177 0.9052905 -0.01139459 -2.501714e-02
## IDHM_Longevidade.1 IDHM_Educacao.1 GVA_AGROPEC_SC.1 GVA_INDUSTRY_SC.1
## 1 0.07861155 -0.05096840 0.3241375 1.543784
## 2 0.06225939 -0.03157196 0.3259902 1.573399
## 3 0.09611157 -0.06096790 0.3257255 1.321921
## 4 0.08720081 -0.04963263 0.3262039 1.539014
## 5 0.09065771 -0.05945466 0.3248705 1.412513
## 6 0.09672861 -0.05930521 0.3259579 1.346432
## 7 0.09157375 -0.05812038 0.3252057 1.422878
## 8 0.06640669 -0.03355619 0.3241697 1.634248
## 9 0.09567998 -0.05127515 0.3272012 1.505531
## 10 0.06142085 -0.02359192 0.3252393 1.647379
## GVA_SERVICES_SC.1 TTG_RATIO.1 y.1 yhat.1 residual.1
## 1 0.5990802 -0.019444635 0.04990125 0.04599972 0.0039015286
## 2 0.5482027 -0.008369576 0.05595688 0.03155272 0.0244041622
## 3 0.6259474 -0.020850347 0.05783450 0.05323487 0.0045996394
## 4 0.5780136 -0.022612264 0.06081357 0.07573962 -0.0149260521
## 5 0.6205307 -0.022503581 0.06270891 0.07270296 -0.0099940523
## 6 0.6183183 -0.021200888 0.04244695 0.04144850 0.0009984495
## 7 0.6140240 -0.022660226 0.07657810 0.07341621 0.0031618971
## 8 0.5671677 -0.002533554 0.02861561 0.03342865 -0.0048130424
## 9 0.5776164 -0.023742653 0.04321665 0.05454719 -0.0113305412
## 10 0.5398315 0.004548765 0.04026828 0.01825436 0.0220139183
## CV_Score.1 Stud_residual.1 Intercept_SE.1 E_ACTIVE_SC_SE.1
## 1 0 0.1529909 0.08744519 0.10628095
## 2 0 0.9921369 0.07965353 0.09213324
## 3 0 0.1837896 0.08810145 0.10666639
## 4 0 -0.5844040 0.08027562 0.09563057
## 5 0 -0.3889078 0.08864302 0.10801825
## 6 0 0.0383953 0.08654187 0.10468834
## 7 0 0.1253911 0.08695399 0.10580955
## 8 0 -0.1988655 0.09028950 0.10712896
## 9 0 -0.4369966 0.07931918 0.09386379
## 10 0 0.8884765 0.08833568 0.10452726
## IDHM_Longevidade_SE.1 IDHM_Educacao_SE.1 GVA_AGROPEC_SC_SE.1
## 1 0.1110368 0.04588061 0.02652382
## 2 0.1119694 0.04740315 0.03257650
## 3 0.1020531 0.03981730 0.02201388
## 4 0.1058184 0.04277850 0.02465380
## 5 0.1055777 0.04211908 0.02334130
## 6 0.1016634 0.03955925 0.02197706
## 7 0.1046616 0.04156298 0.02311131
## 8 0.1202339 0.05172279 0.03458566
## 9 0.1032246 0.04088969 0.02331786
## 10 0.1209990 0.05267237 0.03912466
## GVA_INDUSTRY_SC_SE.1 GVA_SERVICES_SC_SE.1 TTG_RATIO_SE.1 Intercept_TV.1
## 1 0.1579919 0.04484955 0.02876817 0.4846874
## 2 0.2150466 0.05455090 0.03203154 -0.3012916
## 3 0.1005255 0.03805070 0.02730878 0.8608561
## 4 0.1544803 0.04243829 0.02822399 0.1716229
## 5 0.1209299 0.04010442 0.02771053 0.7798055
## 6 0.1045009 0.03797883 0.02708284 0.7932604
## 7 0.1219210 0.03978546 0.02749665 0.7228215
## 8 0.2100146 0.05564012 0.03190093 0.1164787
## 9 0.1417038 0.04021954 0.02750499 0.1830382
## 10 0.2400011 0.06256852 0.03351537 -0.1289919
## E_ACTIVE_SC_TV.1 IDHM_Longevidade_TV.1 IDHM_Educacao_TV.1
## 1 -1.0303865742 0.7079772 -1.1108919
## 2 0.0009460785 0.5560394 -0.6660309
## 3 -1.6247814079 0.9417800 -1.5311911
## 4 -0.7898431015 0.8240608 -1.1602237
## 5 -1.4660709286 0.8586825 -1.4115848
## 6 -1.5676451309 0.9514596 -1.4991493
## 7 -1.4230371567 0.8749508 -1.3983691
## 8 -0.5462245150 0.5523124 -0.6487699
## 9 -0.9086276383 0.9269104 -1.2539872
## 10 -0.2393360792 0.5076143 -0.4478994
## GVA_AGROPEC_SC_TV.1 GVA_INDUSTRY_SC_TV.1 GVA_SERVICES_SC_TV.1 TTG_RATIO_TV.1
## 1 12.220621 9.771289 13.357552 -0.67590792
## 2 10.006914 7.316547 10.049381 -0.26129173
## 3 14.796375 13.150111 16.450355 -0.76350348
## 4 13.231385 9.962527 13.620094 -0.80117180
## 5 13.918270 11.680424 15.472876 -0.81209492
## 6 14.831733 12.884406 16.280601 -0.78281617
## 7 14.071279 11.670487 15.433375 -0.82410846
## 8 9.372950 7.781592 10.193502 -0.07941943
## 9 14.032212 10.624491 14.361586 -0.86321250
## 10 8.312898 6.864047 8.627846 0.13572177
## Local_R2.1 geometry
## 1 0.9231928 MULTIPOLYGON (((-62.19465 -...
## 2 0.9106030 MULTIPOLYGON (((-62.53648 -...
## 3 0.9299215 MULTIPOLYGON (((-60.37075 -...
## 4 0.9240774 MULTIPOLYGON (((-61.0008 -1...
## 5 0.9282309 MULTIPOLYGON (((-61.49976 -...
## 6 0.9299548 MULTIPOLYGON (((-60.50475 -...
## 7 0.9283817 MULTIPOLYGON (((-61.34273 -...
## 8 0.9120981 MULTIPOLYGON (((-63.71199 -...
## 9 0.9262483 MULTIPOLYGON (((-60.94827 -...
## 10 0.9052905 MULTIPOLYGON (((-65.37724 -...
LocalR2_Cmap <- tm_shape(Municipal_BC) +
tm_polygons(border.col = 0.5) +
tm_shape(Municipal_BC_Fixed) +
tm_fill(col="Local_R2",
alpha = 0.6,
border.col="gray60",
border.lwd=1,
style="quantile")+
tm_layout(title = "Local R Squared Distribution",
title.size = 1,
title.position = c("center", "top"),
inner.margins = c(0.06, 0.10, 0.10, 0.08),
legend.width = 0.64,
legend.height = 0.64)
LocalR2_Pmap <- tm_shape(Municipal_BC)+
tm_polygons() +
tm_shape(Municipal_BC_Fixed) +
tm_dots(col = "Local_R2",
alpha = 0.6,
border.col = "gray60",
border.lwd = 1)+
tm_layout(title = "Local R Squared Point Map",
title.size = 1,
title.position = c("center", "top"),
inner.margins = c(0.06, 0.10, 0.10, 0.08),
legend.width = 0.64,
legend.height = 0.64)
tmap_arrange(LocalR2_Cmap, LocalR2_Pmap, ncol = 2)
With a local R2 range between 0.7 to 1 (0.749 - 0.986 to be exact), this indicates to us that at least 74.9% of the variation in our dependent variable is explained by the independent variables, up to 98.6%. Hence, it indicates to us that the local regression model fits the observed y -values well.
The areas with the highest Local R^2 values are in the north west and eastern regions of brazil. This tells us that our independent variables fit the local regression model better relative to the other regions and the relationship between our dependent and indepdent variables (GPD per capita and factors affecting it) are well captured in these regions of brazil.
The central southern region shows the lowest local R^2 values which tells us that the relationship between our dependent and independent variables are not as well captured relative to the other regions although still pretty well at 0.749.
tm_shape(Municipal_BC) +
tm_polygons(border.col = 0.5) +
tm_shape(Municipal_BC_Fixed) +
tm_fill(col="yhat",
alpha = 0.6,)+
tm_layout(title = "yhat Distribution",
title.size = 1,
title.position = c("center", "top"),
inner.margins = c(0.06, 0.10, 0.10, 0.08),
legend.width = 0.64,
legend.height = 0.64)
The values of yhat represent the predicted equation for the best fit line in regression; it is used to differentiate between the predicted data and observed data y (GDP_CAPITA).
It is observed that majority of the areas in brazil have a yhat range of 0.0 to 0.2, with a small cluster in the central and central south region displaying areas with slightly higher yhat.
tm_shape(Municipal_BC) +
tm_polygons(border.col = 0.5) +
tm_shape(Municipal_BC_Fixed) +
tm_fill(col="residual",
alpha = 0.6,
midpoint = 0)+
tm_layout(title = "Residual Distribution",
title.size = 1,
title.position = c("center", "top"),
inner.margins = c(0.06, 0.10, 0.10, 0.08),
legend.width = 0.64,
legend.height = 0.64)
The residual values are obtained from the observed GDP_CAPITA subtracted from the predicted GDP_CAPITA. It is observed from this output that majority of the areas in brazil have a residual value of between -0.2 to 0.0 or 0.0 to 0.2. This indicates to us that the model explains the variation in our dependent variable by the independent variables well.
tm_shape(Municipal_BC) +
tm_polygons() +
tm_shape(Municipal_BC_Fixed) +
tm_fill(col="Intercept_SE",
alpha = 0.6,
border.col="gray60",
border.lwd=1)+
tm_layout(title = "Intercept Std Error Distribution",
title.size = 1,
title.position = c("center", "top"),
inner.margins = c(0.06, 0.10, 0.10, 0.08),
legend.width = 0.64,
legend.height = 0.64)
Intercept SE is a measures the reliability of each coefficient estimate. Low standard errors are preferred as high standard errors may signal that the model has problems where local regression coefficients are potentially collinear and thus has lower confidence in the overall estimates.
From the plot output we see that the north western regions of brazil has higher standard errors relative to the other regions (especially the central eastern and southern regions wherethe standard error is lower.) As mentioned above, a higher standard error may indicate that the local model has problems where local regression coefficients are potentially collinear.