In this take-home exercise, you are tasked with determining the factors affecting the unequal development of Brazil at the municipality level by using the data provided. The specific tasks of the analysis are as follows:
Prepare a choropleth map showing the distribution of GDP per capita, 2016 at municipality level.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using multiple linear regression method.
Prepare a choropleth map showing the distribution of the residual of the GDP per capita.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using geographically weighted regression method.
Prepare a series of choropleth maps showing the outputs of the geographically weighted regression model.
The R packages needed for this exercise are as follows:
Geospatial statistical modelling: GWmodel, heatmaply, spatstat
Spatial data handling: sf, geobr
Attribute data handling: tidyverse, readr, ggplot2 and dplyr
Choropleth mapping: tmap
Saving and loading geospatial data: rgdal (for easier loading of data)
The code chunk below installs and loads these R packages into the R environment.
packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse', 'geobr', 'rgdal', 'heatmaply', 'spatstat')
for (p in packages) {
  # Install the package only if it cannot be loaded, then attach it
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
# Retrieves a quick breakdown of the number of NA rows and invalid polygons/points
Validity_NA_Check <- function(target_st) {
  validity <- st_is_valid(target_st)
  NA_rows <- target_st[rowSums(is.na(target_st)) != 0, ]
  Invalid_rows <- which(validity == FALSE)
  print(paste("For:", deparse(substitute(target_st))))
  print(paste("Number of Invalid polygons/points is:", length(Invalid_rows)))
  print(paste("Number of NA rows is:", nrow(NA_rows)))
}
# Retrieves the exact polygon which is invalid
get_invalid <- function(target_st) {
  validity <- st_is_valid(target_st)
  Invalid_rows <- which(validity == FALSE)
  return(Invalid_rows)
}
# Retrieves the exact rows which contain NA values for you to check the columns
get_NA_rows <- function(target_st) {
  NA_rows <- target_st[rowSums(is.na(target_st)) != 0, ]
  return(NA_rows)
}
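# A cleaning function that replaces NA values in the named column with 0 so that calculations can still be done.
## This function is a little unnecessary as we will not be using the data attached to the geospatial points.
replace_NA_with_zero <- function(x, column_name) {
  # Use [[ ]] so that column_name is evaluated; x$column_name would look for a column literally called "column_name"
  x[[column_name]][is.na(x[[column_name]])] <- 0
  # Return the modified data frame, since R does not change the caller's copy
  return(x)
}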
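As a quick illustration of how an NA-replacement helper of this kind behaves, the sketch below runs one on a toy data frame. The data frame, its values and the helper name `fill_na_zero` are hypothetical and exist only for this example; the point is that `[[ ]]` looks up the column named by the argument, and that the result has to be assigned back.

```r
# Toy data frame standing in for the real dataset (hypothetical values)
toy <- data.frame(city = c("A", "B", "C"), du_rural = c(10, NA, 5))

# Hypothetical helper: replaces NA with 0 in the column whose name is passed in
fill_na_zero <- function(x, column_name) {
  # [[ ]] evaluates column_name; x$column_name would search for a column called "column_name"
  x[[column_name]][is.na(x[[column_name]])] <- 0
  x  # return the modified copy; the caller's object is not changed in place
}

toy_filled <- fill_na_zero(toy, "du_rural")
print(toy_filled$du_rural)
```

Note that `toy` itself still contains the NA afterwards; only the returned copy is filled.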
The BRAZIL_CITIES data is in semicolon-delimited CSV format. The code chunk below uses the read_delim() function of the readr package to import it into R as a tibble data frame called Brazil_cities_raw, and imports the accompanying data dictionary as Reference.
Brazil_cities_raw = read_delim("data/aspatial/BRAZIL_CITIES.csv", ";")
Reference = read_delim("data/aspatial/Data_Dictionary.csv", ";")
summary(Brazil_cities_raw)
## CITY STATE CAPITAL IBGE_RES_POP
## Length:5573 Length:5573 Min. :0.000000 Min. : 805
## Class :character Class :character 1st Qu.:0.000000 1st Qu.: 5235
## Mode :character Mode :character Median :0.000000 Median : 10934
## Mean :0.004845 Mean : 34278
## 3rd Qu.:0.000000 3rd Qu.: 23424
## Max. :1.000000 Max. :11253503
## NA's :8
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN
## Min. : 805 Min. : 0.0 Min. : 239 Min. : 60
## 1st Qu.: 5230 1st Qu.: 0.0 1st Qu.: 1572 1st Qu.: 874
## Median : 10926 Median : 0.0 Median : 3174 Median : 1846
## Mean : 34200 Mean : 77.5 Mean : 10303 Mean : 8859
## 3rd Qu.: 23390 3rd Qu.: 10.0 3rd Qu.: 6726 3rd Qu.: 4624
## Max. :11133776 Max. :119727.0 Max. :3576148 Max. :3548433
## NA's :8 NA's :8 NA's :10 NA's :10
## IBGE_DU_RURAL IBGE_POP IBGE_1 IBGE_1-4
## Min. : 3 Min. : 174 Min. : 0.0 Min. : 5
## 1st Qu.: 487 1st Qu.: 2801 1st Qu.: 38.0 1st Qu.: 158
## Median : 931 Median : 6170 Median : 92.0 Median : 376
## Mean : 1463 Mean : 27595 Mean : 383.3 Mean : 1544
## 3rd Qu.: 1832 3rd Qu.: 15302 3rd Qu.: 232.0 3rd Qu.: 951
## Max. :33809 Max. :10463636 Max. :129464.0 Max. :514794
## NA's :81 NA's :8 NA's :8 NA's :8
## IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## Min. : 7 Min. : 12 Min. : 94 Min. : 29
## 1st Qu.: 220 1st Qu.: 259 1st Qu.: 1734 1st Qu.: 341
## Median : 516 Median : 588 Median : 3841 Median : 722
## Mean : 2069 Mean : 2381 Mean : 18212 Mean : 3004
## 3rd Qu.: 1300 3rd Qu.: 1478 3rd Qu.: 9628 3rd Qu.: 1724
## Max. :684443 Max. :783702 Max. :7058221 Max. :1293012
## NA's :8 NA's :8 NA's :8 NA's :8
## IBGE_PLANTED_AREA IBGE_CROP_PRODUCTION_$ IDHM Ranking 2010 IDHM
## Min. : 0.0 Min. : 0 Min. : 1 Min. :0.4180
## 1st Qu.: 910.2 1st Qu.: 2326 1st Qu.:1392 1st Qu.:0.5990
## Median : 3471.5 Median : 13846 Median :2783 Median :0.6650
## Mean : 14179.9 Mean : 57384 Mean :2783 Mean :0.6592
## 3rd Qu.: 11194.2 3rd Qu.: 55619 3rd Qu.:4174 3rd Qu.:0.7180
## Max. :1205669.0 Max. :3274885 Max. :5565 Max. :0.8620
## NA's :3 NA's :3 NA's :8 NA's :8
## IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG
## Min. :0.4000 Min. :0.6720 Min. :0.2070 Min. :-72.92
## 1st Qu.:0.5720 1st Qu.:0.7690 1st Qu.:0.4900 1st Qu.:-50.87
## Median :0.6540 Median :0.8080 Median :0.5600 Median :-46.52
## Mean :0.6429 Mean :0.8016 Mean :0.5591 Mean :-46.23
## 3rd Qu.:0.7070 3rd Qu.:0.8360 3rd Qu.:0.6310 3rd Qu.:-41.40
## Max. :0.8910 Max. :0.8940 Max. :0.8250 Max. :-32.44
## NA's :8 NA's :8 NA's :8 NA's :9
## LAT ALT PAY_TV FIXED_PHONES
## Min. :-33.688 Min. : 0.0 Min. : 1 Min. : 3
## 1st Qu.:-22.838 1st Qu.: 169.8 1st Qu.: 88 1st Qu.: 119
## Median :-18.089 Median : 406.5 Median : 247 Median : 327
## Mean :-16.444 Mean : 893.8 Mean : 3094 Mean : 6567
## 3rd Qu.: -8.489 3rd Qu.: 628.9 3rd Qu.: 815 3rd Qu.: 1151
## Max. : 4.585 Max. :874579.0 Max. :2047668 Max. :5543127
## NA's :9 NA's :9 NA's :3 NA's :3
## AREA REGIAO_TUR CATEGORIA_TUR ESTIMATED_POP
## Min. : 3.57 Length:5573 Length:5573 Min. : 786
## 1st Qu.: 204.44 Class :character Class :character 1st Qu.: 5454
## Median : 416.59 Mode :character Mode :character Median : 11590
## Mean : 1517.44 Mean : 37432
## 3rd Qu.: 1026.57 3rd Qu.: 25296
## Max. :159533.33 Max. :12176866
## NA's :3 NA's :3
## RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## Length:5573 Min. : 0 Min. : 1 Min. : 2
## Class :character 1st Qu.: 4189 1st Qu.: 1726 1st Qu.: 10112
## Mode :character Median : 20426 Median : 7424 Median : 31211
## Mean : 47271 Mean : 175928 Mean : 489451
## 3rd Qu.: 51227 3rd Qu.: 41022 3rd Qu.: 115406
## Max. :1402282 Max. :63306755 Max. :464656988
## NA's :3 NA's :3 NA's :3
## GVA_PUBLIC GVA_TOTAL TAXES GDP
## Min. : 7 Min. : 17 Min. : -14159 Min. : 15
## 1st Qu.: 17267 1st Qu.: 42253 1st Qu.: 1305 1st Qu.: 43709
## Median : 35866 Median : 119492 Median : 5100 Median : 125153
## Mean : 123768 Mean : 832987 Mean : 118864 Mean : 954584
## 3rd Qu.: 89245 3rd Qu.: 313963 3rd Qu.: 22197 3rd Qu.: 329539
## Max. :41902893 Max. :569910503 Max. :117125387 Max. :687035890
## NA's :3 NA's :3 NA's :3 NA's :3
## POP_GDP GDP_CAPITA GVA_MAIN MUN_EXPENDIT
## Min. : 815 Min. : 3191 Length:5573 Min. :1.421e+06
## 1st Qu.: 5483 1st Qu.: 9058 Class :character 1st Qu.:1.573e+07
## Median : 11578 Median : 15870 Mode :character Median :2.746e+07
## Mean : 36998 Mean : 21126 Mean :1.043e+08
## 3rd Qu.: 25085 3rd Qu.: 26155 3rd Qu.:5.666e+07
## Max. :12038175 Max. :314638 Max. :4.577e+10
## NA's :3 NA's :3 NA's :1492
## COMP_TOT COMP_A COMP_B COMP_C
## Min. : 6.0 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 68.0 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 3.00
## Median : 162.0 Median : 2.00 Median : 0.000 Median : 11.00
## Mean : 906.8 Mean : 18.25 Mean : 1.852 Mean : 73.44
## 3rd Qu.: 448.0 3rd Qu.: 8.00 3rd Qu.: 2.000 3rd Qu.: 39.00
## Max. :530446.0 Max. :1948.00 Max. :274.000 Max. :31566.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_D COMP_E COMP_F COMP_G
## Min. : 0.0000 Min. : 0.000 Min. : 0.00 Min. : 1.0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 32.0
## Median : 0.0000 Median : 0.000 Median : 4.00 Median : 74.5
## Mean : 0.4262 Mean : 2.029 Mean : 43.26 Mean : 348.0
## 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 15.00 3rd Qu.: 199.0
## Max. :332.0000 Max. :657.000 Max. :25222.00 Max. :150633.0
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_H COMP_I COMP_J COMP_K
## Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 7 Median : 7.00 Median : 1.00 Median : 0.00
## Mean : 41 Mean : 55.88 Mean : 24.74 Mean : 15.55
## 3rd Qu.: 25 3rd Qu.: 24.00 3rd Qu.: 5.00 3rd Qu.: 2.00
## Max. :19515 Max. :29290.00 Max. :38720.00 Max. :23738.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_L COMP_M COMP_N COMP_O
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 0.00 Median : 4.00 Median : 4.0 Median : 2.000
## Mean : 15.14 Mean : 51.29 Mean : 83.7 Mean : 3.269
## 3rd Qu.: 3.00 3rd Qu.: 13.00 3rd Qu.: 14.0 3rd Qu.: 3.000
## Max. :14003.00 Max. :49181.00 Max. :76757.0 Max. :204.000
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_P COMP_Q COMP_R COMP_S
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 5.00
## Median : 6.00 Median : 3.00 Median : 2.00 Median : 12.00
## Mean : 30.96 Mean : 34.15 Mean : 12.18 Mean : 51.61
## 3rd Qu.: 17.00 3rd Qu.: 12.00 3rd Qu.: 6.00 3rd Qu.: 31.00
## Max. :16030.00 Max. :22248.00 Max. :6687.00 Max. :24832.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_T COMP_U HOTELS BEDS
## Min. :0 Min. : 0.00000 Min. : 1.000 Min. : 2.0
## 1st Qu.:0 1st Qu.: 0.00000 1st Qu.: 1.000 1st Qu.: 40.0
## Median :0 Median : 0.00000 Median : 1.000 Median : 82.0
## Mean :0 Mean : 0.05027 Mean : 3.131 Mean : 257.5
## 3rd Qu.:0 3rd Qu.: 0.00000 3rd Qu.: 3.000 3rd Qu.: 200.0
## Max. :0 Max. :123.00000 Max. :97.000 Max. :13247.0
## NA's :3 NA's :3 NA's :4686 NA's :4686
## Pr_Agencies Pu_Agencies Pr_Bank Pu_Bank
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.00
## 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:1.00
## Median : 1.000 Median : 2.000 Median : 1.000 Median :2.00
## Mean : 3.383 Mean : 2.829 Mean : 1.312 Mean :1.58
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.:2.00
## Max. :1693.000 Max. :626.000 Max. :83.000 Max. :8.00
## NA's :2231 NA's :2231 NA's :2231 NA's :2231
## Pr_Assets Pu_Assets Cars Motorcycles
## Min. :0.000e+00 Min. :0.000e+00 Min. : 2 Min. : 4
## 1st Qu.:0.000e+00 1st Qu.:4.047e+07 1st Qu.: 602 1st Qu.: 591
## Median :3.231e+07 Median :1.339e+08 Median : 1438 Median : 1285
## Mean :9.180e+09 Mean :6.005e+09 Mean : 9859 Mean : 4879
## 3rd Qu.:1.148e+08 3rd Qu.:4.970e+08 3rd Qu.: 4086 3rd Qu.: 3294
## Max. :1.947e+13 Max. :8.016e+12 Max. :5740995 Max. :1134570
## NA's :2231 NA's :2231 NA's :11 NA's :11
## Wheeled_tractor UBER MAC WAL-MART
## Min. : 0.000 Min. :1 Min. : 1.000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:1 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 0.000 Median :1 Median : 2.000 Median : 1.000
## Mean : 5.754 Mean :1 Mean : 4.277 Mean : 2.059
## 3rd Qu.: 1.000 3rd Qu.:1 3rd Qu.: 3.000 3rd Qu.: 1.750
## Max. :3236.000 Max. :1 Max. :130.000 Max. :26.000
## NA's :11 NA's :5448 NA's :5407 NA's :5471
## POST_OFFICES
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 2.081
## 3rd Qu.: 2.000
## Max. :225.000
## NA's :120
Extensive data cleaning is required before the data can be used to formulate regressions.
Unfortunately, many columns contain missing values, and several of them (such as HOTELS, UBER, MAC and WAL-MART) are missing for most rows, so almost every row is missing at least one value. We will clean the dataset as best we can in order to formulate the indicators we need to test which variables affect GDP per capita.
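Rather than reading the NA counts off the summary output, they can also be computed directly. The sketch below shows one way to do this on a small toy data frame; the values are made up, and only the column names are borrowed from the dataset.

```r
# Toy data frame with hypothetical values
toy <- data.frame(
  GDP_CAPITA = c(100, NA, 300, 400),
  HOTELS     = c(NA, NA, 2, NA),
  AREA       = c(1.5, 2.5, NA, 4.0)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows with at least one NA
rows_with_na <- sum(!complete.cases(toy))

print(na_per_column)
print(rows_with_na)
```

Applied to Brazil_cities_raw, the same two calls would give the per-column NA counts and the number of incomplete rows.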
which(duplicated(Brazil_cities_raw[,1]))
## [1] 48 50 51 91 142 143 159 179 207 226 261 270 318 352 370
## [16] 418 434 484 497 508 517 539 551 563 582 583 591 634 635 644
## [31] 657 670 671 676 677 678 679 693 703 704 709 715 716 717 730
## [46] 766 813 851 856 857 877 885 939 957 973 1007 1009 1015 1041 1042
## [61] 1049 1058 1089 1102 1162 1184 1210 1212 1217 1306 1317 1351 1353 1485 1486
## [76] 1535 1620 1646 1673 1699 1723 1748 1762 1790 1805 1827 1901 1982 2004 2006
## [91] 2062 2072 2163 2189 2195 2198 2253 2258 2273 2285 2327 2343 2344 2375 2381
## [106] 2393 2465 2489 2514 2531 2539 2547 2557 2640 2652 2661 2662 2702 2707 2713
## [121] 2724 2744 2935 2992 3053 3062 3082 3135 3151 3182 3213 3216 3217 3245 3251
## [136] 3298 3324 3354 3356 3357 3378 3387 3390 3405 3406 3422 3483 3484 3490 3502
## [151] 3521 3533 3536 3552 3580 3625 3635 3659 3670 3693 3702 3764 3785 3789 3811
## [166] 3813 3845 3868 3880 3881 3882 4003 4008 4015 4019 4025 4027 4031 4040 4073
## [181] 4092 4116 4141 4148 4152 4158 4195 4201 4232 4296 4312 4324 4351 4363 4369
## [196] 4370 4397 4401 4402 4403 4407 4408 4409 4411 4419 4422 4423 4424 4433 4454
## [211] 4473 4482 4488 4489 4490 4499 4538 4590 4611 4617 4618 4619 4620 4643 4644
## [226] 4645 4651 4663 4674 4686 4688 4724 4776 4829 4862 4891 4912 4917 4924 4937
## [241] 4941 5027 5038 5074 5077 5085 5115 5145 5156 5159 5162 5164 5191 5207 5222
## [256] 5226 5258 5302 5305 5306 5340 5346 5425 5435 5439 5450 5457 5471 5472 5473
## [271] 5491 5498 5499 5556
which(duplicated(Brazil_cities_raw[,1:2]))
## integer(0)
A large number of city names are repeated: 274 rows share their name with an earlier row, although no combination of CITY and STATE is repeated. Duplicate names could cause problems in later joining operations, so we create unique identifiers by combining CITY with the STATE column.
Brazil_cities_uniques <- cbind(CITY_STATE = paste(Brazil_cities_raw$CITY, Brazil_cities_raw$STATE, sep="_"), Brazil_cities_raw)
which(duplicated(Brazil_cities_uniques[,1]))
## integer(0)
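One detail worth keeping in mind is that duplicated() flags only the second and later occurrences of a value, not the first. The sketch below, on a toy data frame with hypothetical rows, shows how to list every row that shares a name and how a composite key removes the ambiguity.

```r
# Toy data frame (hypothetical rows): one city name appears in three states
toy <- data.frame(
  CITY  = c("Bonito", "Santa Cruz", "Bonito", "Bonito"),
  STATE = c("MS", "PB", "PE", "BA"),
  stringsAsFactors = FALSE
)

# duplicated() marks only the later copies of a repeated name
later_copies <- which(duplicated(toy$CITY))

# Scanning from both ends marks every row that shares its name with another row
all_sharing <- which(duplicated(toy$CITY) | duplicated(toy$CITY, fromLast = TRUE))

# A composite key built from CITY and STATE is unique
toy$CITY_STATE <- paste(toy$CITY, toy$STATE, sep = "_")
n_dup_keys <- anyDuplicated(toy$CITY_STATE)  # 0 means no duplicated keys

print(later_copies)
print(all_sharing)
print(n_dup_keys)
```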
For the purpose of our analysis, since we are looking at factors that might contribute to differences in GDP per capita, we will remove the variables which, according to the Data_Dictionary, were recorded after 2016.
NOTE: This is important because building explanatory models on factors measured after the event would be a logical fallacy and could introduce reverse causation. Variables such as AREA are unaffected, since they stay constant over time. We will still have enough original and derived variables to perform our analysis.
We will also remove MUN_EXPENDIT because of its large number of missing values (1,492 NAs) and our inability to properly estimate them from external sources. Because so many rows are affected, it is better to remove the entire column than to remove the rows.
Lastly, we will also remove COMP_T, as it contains only zeros and therefore carries no information.
drops <- c("IBGE_PLANTED_AREA","IBGE_CROP_PRODUCTION_$", "PAY_TV", "FIXED_PHONES", "ESTIMATED_POP", "REGIAO_TUR", "CATEGORIA_TUR", "HOTELS", "BEDS", "Pr_Agencies", "Pu_Agencies", "Pr_Bank", "Pu_Bank", "Pr_Assets", "Pu_Assets", "Cars", "Motorcycles", "Wheeled_tractor", "UBER", "MAC", "WAL-MART", "POST_OFFICES", "MUN_EXPENDIT", "COMP_T")
Brazil_cities_2016 <- Brazil_cities_uniques[ , !(names(Brazil_cities_uniques) %in% drops)]
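Dropping columns with `%in%` silently ignores any name that does not match a column, so a misspelled entry leaves that column in place without a warning. The sketch below, on a toy data frame with hypothetical values and a deliberately misspelled name, shows a simple check with setdiff() that could be run before dropping.

```r
# Toy data frame (hypothetical values); check.names = FALSE keeps names as typed
toy <- data.frame(GDP_CAPITA = c(1, 2), UBER = c(NA, 1), MUN_EXPENDIT = c(NA, NA),
                  check.names = FALSE)

# "MUN_EXPENDITURE" is deliberately misspelled for illustration
drops <- c("UBER", "MUN_EXPENDITURE")

# Names in drops that do not exist in the data frame
not_found <- setdiff(drops, names(toy))

# The drop itself ignores the misspelled name, so MUN_EXPENDIT survives
kept <- toy[, !(names(toy) %in% drops), drop = FALSE]

print(not_found)
print(names(kept))
```

An empty `not_found` would confirm that every name in `drops` matched a real column.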
If the dependent variable is missing for a city, that city unfortunately cannot be used in our analysis.
Missing_GDP_PC <- Brazil_cities_2016[(is.na(Brazil_cities_2016$GDP_CAPITA))!=0,]
Missing_GDP_PC
## CITY_STATE CITY STATE CAPITAL IBGE_RES_POP
## 2702 Lagoa Dos Patos_RS Lagoa Dos Patos RS 0 NA
## 4482 Santa Teresinha_BA Santa Teresinha BA 0 NA
## 4606 São Caetano_PE São Caetano PE 0 NA
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL
## 2702 NA NA NA NA NA
## 4482 NA NA NA NA NA
## 4606 NA NA NA NA NA
## IBGE_POP IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## 2702 NA NA NA NA NA NA NA
## 4482 NA NA NA NA NA NA NA
## 4606 NA NA NA NA NA NA NA
## IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG
## 2702 NA NA NA NA NA NA
## 4482 4493 0.59 0.549 0.804 0.459 -39.52114
## 4606 NA NA NA NA NA NA
## LAT ALT AREA RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY
## 2702 NA NA 10158.75 <NA> NA NA
## 4482 -12.77285 222.51 NA <NA> NA NA
## 4606 NA NA NA <NA> NA NA
## GVA_SERVICES GVA_PUBLIC GVA_TOTAL TAXES GDP POP_GDP GDP_CAPITA GVA_MAIN
## 2702 NA NA NA NA NA NA NA <NA>
## 4482 NA NA NA NA NA NA NA <NA>
## 4606 NA NA NA NA NA NA NA <NA>
## COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 2702 NA NA NA NA NA NA NA NA NA NA
## 4482 NA NA NA NA NA NA NA NA NA NA
## 4606 NA NA NA NA NA NA NA NA NA NA
## COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 2702 NA NA NA NA NA NA NA NA NA NA
## 4482 NA NA NA NA NA NA NA NA NA NA
## 4606 NA NA NA NA NA NA NA NA NA NA
## COMP_U
## 2702 NA
## 4482 NA
## 4606 NA
According to Wikipedia, the number of municipalities in Brazil should amount to 5,570. However, our dataset includes 5,573 rows, which means that the 3 cities with missing GDP per capita are probably surplus entries that are not accounted for in the official count. We will therefore remove these cities, assuming they are irrelevant to our study. Source: https://en.wikipedia.org/wiki/Municipalities_of_Brazil
Brazil_cities_allGDPC <- Brazil_cities_2016[(is.na(Brazil_cities_2016$GDP_CAPITA))==0,]
#summary((Brazil_cities_allGDPC))
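# Since the same "keep rows where a column is not NA" pattern recurs throughout the cleaning,
# it could be wrapped in a small helper that also reports how many rows were removed.
# This is only a sketch: drop_missing() and the toy data below are hypothetical and
# are not part of the original workflow.

```r
# Hypothetical helper: keeps rows where `col` is not NA and reports the number dropped
drop_missing <- function(df, col) {
  keep <- !is.na(df[[col]])
  message(sum(!keep), " row(s) dropped for missing ", col)
  df[keep, , drop = FALSE]
}

# Toy data frame (hypothetical values)
toy <- data.frame(CITY_STATE = c("A_X", "B_Y", "C_Z"),
                  GDP_CAPITA = c(10, NA, 30))

toy_clean <- drop_missing(toy, "GDP_CAPITA")
print(nrow(toy_clean))
```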
Brazil_cities_allGDPC[(is.na(Brazil_cities_allGDPC$IBGE_RES_POP_ESTR))!=0,]
## CITY_STATE CITY STATE CAPITAL IBGE_RES_POP
## 472 Balneário Rincão_SC Balneário Rincão SC 0 NA
## 3117 Mojuí Dos Campos_PA Mojuí Dos Campos PA 0 NA
## 3581 Paraíso Das Águas_MS Paraíso Das Águas MS 0 NA
## 3761 Pescaria Brava_SC Pescaria Brava SC 0 NA
## 3821 Pinto Bandeira_RS Pinto Bandeira RS 0 NA
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL
## 472 NA NA NA NA NA
## 3117 NA NA NA NA NA
## 3581 NA NA NA NA NA
## 3761 NA NA NA NA NA
## 3821 NA NA NA NA NA
## IBGE_POP IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## 472 NA NA NA NA NA NA NA
## 3117 NA NA NA NA NA NA NA
## 3581 NA NA NA NA NA NA NA
## 3761 NA NA NA NA NA NA NA
## 3821 NA NA NA NA NA NA NA
## IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG LAT
## 472 NA NA NA NA NA NA NA
## 3117 NA NA NA NA NA NA NA
## 3581 NA NA NA NA NA NA NA
## 3761 NA NA NA NA NA NA NA
## 3821 NA NA NA NA NA NA NA
## ALT AREA RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## 472 NA 63.43 Sem classificação 2045.03 51257.53 96248.50
## 3117 NA 4988.24 Sem classificação 42123.35 7.20 28168.56
## 3581 NA 5061.43 Sem classificação 210844.60 146514.00 68393.39
## 3761 NA 106.85 Sem classificação 3167.11 5812.35 29.46
## 3821 NA 104.86 Sem classificação 19067.89 4366.36 9652.04
## GVA_PUBLIC GVA_TOTAL TAXES GDP POP_GDP GDP_CAPITA
## 472 52820.64 202371.69 14863.05 217234.75 12212 17788.63
## 3117 55645.41 133135.10 4177.94 137313.05 15548 8831.56
## 3581 36606.37 462358.36 21594.41 483952.77 5251 92163.92
## 3761 39700.00 78.14 4505.77 82645.86 9908 8341.33
## 3821 14620.12 47.71 4064.74 51771.14 2847 18184.45
## GVA_MAIN
## 472 Demais serviços
## 3117 Administração, defesa, educação e saúde públicas e seguridade social
## 3581 Agricultura, inclusive apoio à agricultura e a pós colheita
## 3761 Administração, defesa, educação e saúde públicas e seguridade social
## 3821 Agricultura, inclusive apoio à agricultura e a pós colheita
## COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 472 270 1 1 16 0 2 47 112 8 13
## 3117 78 0 0 3 0 0 2 14 6 0
## 3581 129 5 1 0 1 2 9 57 21 7
## 3761 105 1 1 22 0 2 6 36 7 3
## 3821 63 1 0 12 0 0 4 18 7 5
## COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 472 3 6 11 10 23 2 3 6 1 5
## 3117 0 0 0 2 2 0 41 2 0 6
## 3581 1 0 0 4 9 2 3 2 0 5
## 3761 1 0 1 1 1 2 14 0 1 6
## 3821 0 0 2 2 2 1 1 1 3 4
## COMP_U
## 472 0
## 3117 0
## 3581 0
## 3761 0
## 3821 0
Because so much data is missing for these cities, we will remove them: we cannot properly estimate their population at the relevant dates unless that data is provided to us. As only 5 cities are affected, the remaining 5,565 are more than sufficient for our analysis.
Brazil_cities_allpop <- Brazil_cities_allGDPC[(is.na(Brazil_cities_allGDPC$IBGE_RES_POP_ESTR))==0,]
#summary(Brazil_cities_allpop)
# Brazil_cities_allpop[(is.na(Brazil_cities_allpop$IBGE_DU_RURAL))!=0,]
Comparing the IBGE_DU and IBGE_DU_URBAN values shows that the NA values in IBGE_DU_RURAL stand for missing zeros, since all domestic units in these municipalities are classified as urban. We will therefore fill the NA values in this column with 0.
Brazil_cities_allpop$IBGE_DU_RURAL[is.na(Brazil_cities_allpop$IBGE_DU_RURAL)] <- 0
#summary(Brazil_cities_allpop)
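The reasoning behind this fill can be checked explicitly before applying it. The sketch below uses a toy data frame with hypothetical values; it assumes that the total and urban counts are present wherever the rural count is missing, which would need separate handling in rows where IBGE_DU itself is NA.

```r
# Toy data frame (hypothetical values): rural count missing where all units are urban
toy <- data.frame(
  IBGE_DU       = c(100, 250, 80),
  IBGE_DU_URBAN = c(100, 200, 80),
  IBGE_DU_RURAL = c(NA, 50, NA)
)

missing_rural <- is.na(toy$IBGE_DU_RURAL)

# Check the claim: wherever rural is missing, total equals urban
all_urban <- toy$IBGE_DU[missing_rural] == toy$IBGE_DU_URBAN[missing_rural]
stopifnot(all(all_urban))

# Fill with total minus urban, which is 0 when the check above holds
toy$IBGE_DU_RURAL[missing_rural] <-
  toy$IBGE_DU[missing_rural] - toy$IBGE_DU_URBAN[missing_rural]

print(toy$IBGE_DU_RURAL)
```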
In this case, IBGE_DU in the reference refers to “Domestic Units”. Upon further investigation, this refers to Permanent Private Housing Units. We determined this by viewing the source of the data and reading the additional description at the top of the webpage. Source: https://sidra.ibge.gov.br/tabela/3495
Unfortunately, the source data does not provide the values we need for the municipalities where IBGE_DU is missing. However, we can use an alternate data source, the IBGE website report, to find a good estimate of these values. Although the values are not exact due to some later corrections, checking them against cities where the IBGE_DU values are known, such as Petrolina and Sao Paulo, confirms that the data is at least reasonably accurate.
Source: https://cidades.ibge.gov.br/brasil/pb/marcacao/pesquisa/23/25124?tipo=ranking&indicador=29522
From this, we can make a reasonable estimate for the municipalities with missing IBGE_DU values listed below.
Brazil_cities_allpop[(is.na(Brazil_cities_allpop$IBGE_DU))!=0,]
## CITY_STATE CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BRAS
## 2937 Marcação_PB Marcação PB 0 7609 7609
## 5367 Uiramutã_RR Uiramutã RR 0 8375 8375
## IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL IBGE_POP IBGE_1
## 2937 0 NA NA 0 2838 45
## 5367 0 NA NA 0 794 19
## IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+ IDHM Ranking 2010 IDHM
## 2937 211 277 266 1701 338 5404 0.529
## 5367 83 129 110 424 29 5561 0.453
## IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG LAT ALT
## 2937 0.525 0.691 0.408 -35.01392 -6.770054 92.93
## 5367 0.439 0.766 0.276 -60.19572 4.585440 605.80
## AREA RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC
## 2937 123.83 Rural Adjacente 23738.38 1724.29 11192.86 37551.82
## 5367 8065.56 Rural Remoto 9864.83 1189.55 4.75 87.28
## GVA_TOTAL TAXES GDP POP_GDP GDP_CAPITA
## 2937 74207.34 1436.36 75643.7 8475 8925.51
## 5367 103089.25 0.59 103680.3 9664 10728.51
## GVA_MAIN
## 2937 Administração, defesa, educação e saúde públicas e seguridade social
## 5367 Administração, defesa, educação e saúde públicas e seguridade social
## COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 2937 36 2 0 2 0 0 0 15 1 1
## 5367 8 0 0 0 0 0 0 7 0 0
## COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 2937 0 0 0 1 1 2 5 0 0 6
## 5367 0 0 0 0 0 1 0 0 0 0
## COMP_U
## 2937 0
## 5367 0
Brazil_cities_allpop$IBGE_DU[which(Brazil_cities_allpop$CITY_STATE == "Marcação_PB")] <- 2040
Brazil_cities_allpop$IBGE_DU_URBAN[which(Brazil_cities_allpop$CITY_STATE == "Marcação_PB")] <- 824
Brazil_cities_allpop$IBGE_DU_RURAL[which(Brazil_cities_allpop$CITY_STATE == "Marcação_PB")] <- 1216
Brazil_cities_allpop$IBGE_DU[which(Brazil_cities_allpop$CITY_STATE == "Uiramutã_RR")] <- 1444
Brazil_cities_allpop$IBGE_DU_URBAN[which(Brazil_cities_allpop$CITY_STATE == "Uiramutã_RR")] <- 219
Brazil_cities_allpop$IBGE_DU_RURAL[which(Brazil_cities_allpop$CITY_STATE == "Uiramutã_RR")] <- 1225
#summary(Brazil_cities_allpop)
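# The manual corrections above could also be kept in a small lookup table, which makes it
# easy to verify that urban and rural units add up to the total before they are applied.
# This is only a sketch of that alternative: the correction values are the ones used above,
# while the toy data frame and the "Other_XX" row are hypothetical.

```r
# Toy data frame standing in for Brazil_cities_allpop ("Other_XX" is hypothetical)
toy <- data.frame(
  CITY_STATE    = c("Marcação_PB", "Uiramutã_RR", "Other_XX"),
  IBGE_DU       = c(NA, NA, 500),
  IBGE_DU_URBAN = c(NA, NA, 300),
  IBGE_DU_RURAL = c(0, 0, 200),
  stringsAsFactors = FALSE
)

# Lookup table of corrections
fixes <- data.frame(
  CITY_STATE    = c("Marcação_PB", "Uiramutã_RR"),
  IBGE_DU       = c(2040, 1444),
  IBGE_DU_URBAN = c(824, 219),
  IBGE_DU_RURAL = c(1216, 1225),
  stringsAsFactors = FALSE
)

# Consistency check: urban + rural must equal the total
stopifnot(all(fixes$IBGE_DU == fixes$IBGE_DU_URBAN + fixes$IBGE_DU_RURAL))

# Locate the rows to correct and overwrite the three columns at once
cols <- c("IBGE_DU", "IBGE_DU_URBAN", "IBGE_DU_RURAL")
idx <- match(fixes$CITY_STATE, toy$CITY_STATE)
toy[idx, cols] <- fixes[, cols]

print(toy)
```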
Brazil_cities_allpop[(is.na(Brazil_cities_allpop$LONG))!=0,]
## CITY_STATE CITY STATE CAPITAL IBGE_RES_POP
## 3806 Pinhal Da Serra_RS Pinhal Da Serra RS 0 2130
## 4490 Santa Terezinha_BA Santa Terezinha BA 0 9648
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL
## 3806 2130 0 745 180 565
## 4490 9648 0 2891 734 2157
## IBGE_POP IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## 3806 478 11 22 34 32 312 67
## 4490 2332 40 126 191 217 1419 339
## IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG LAT
## 3806 3121 0.65 0.641 0.835 0.513 NA NA
## 4490 NA NA NA NA NA NA NA
## ALT AREA RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## 3806 NA 438.11 Rural Adjacente 56030.9 267670.32 15.85
## 4490 NA 719.26 Rural Adjacente 13235.2 5398.61 17754.37
## GVA_PUBLIC GVA_TOTAL TAXES GDP POP_GDP GDP_CAPITA
## 3806 19831.52 359.38 25222.60 384602.56 2115 181845.18
## 4490 32630.97 69019.14 3149.33 72168.48 10619 6796.16
## GVA_MAIN
## 3806 Eletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação
## 4490 Administração, defesa, educação e saúde públicas e seguridade social
## COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 3806 45 1 0 2 1 1 3 23 2 4
## 4490 74 2 1 4 0 0 3 37 0 3
## COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 3806 0 0 0 0 1 2 1 1 0 3
## 4490 1 0 0 1 2 2 12 2 0 4
## COMP_U
## 3806 0
## 4490 0
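The missing coordinates and altitudes for these two municipalities are filled in using the following sources.
Source: https://www.latlong.net/
Source: https://www.freemaptools.com/elevation-finder.htm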
Brazil_cities_allpop$LONG[which(Brazil_cities_allpop$CITY_STATE == "Pinhal Da Serra_RS")] <- -51.171909
Brazil_cities_allpop$LAT[which(Brazil_cities_allpop$CITY_STATE == "Pinhal Da Serra_RS")] <- -27.874420
Brazil_cities_allpop$ALT[which(Brazil_cities_allpop$CITY_STATE == "Pinhal Da Serra_RS")] <- 918
Brazil_cities_allpop$LONG[which(Brazil_cities_allpop$CITY_STATE == "Santa Terezinha_BA")] <- -39.5184
Brazil_cities_allpop$LAT[which(Brazil_cities_allpop$CITY_STATE == "Santa Terezinha_BA")] <- -12.7498
Brazil_cities_allpop$ALT[which(Brazil_cities_allpop$CITY_STATE == "Santa Terezinha_BA")] <- 210
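Hand-entered coordinates are easy to mistype, for example by swapping latitude and longitude or dropping a minus sign. As a rough sanity check, the sketch below tests whether the two filled-in points fall inside the LONG and LAT ranges reported by the summary of the raw data. The variable names are illustrative only.

```r
# Coordinates entered by hand above
filled <- data.frame(
  CITY_STATE = c("Pinhal Da Serra_RS", "Santa Terezinha_BA"),
  LONG = c(-51.171909, -39.5184),
  LAT  = c(-27.874420, -12.7498)
)

# Min and max of LONG and LAT taken from the summary of the raw data
long_range <- c(-72.92, -32.44)
lat_range  <- c(-33.688, 4.585)

# TRUE for each point that lies inside both ranges
in_bounds <- filled$LONG >= long_range[1] & filled$LONG <= long_range[2] &
  filled$LAT >= lat_range[1] & filled$LAT <= lat_range[2]

print(in_bounds)
```

This only shows that the points lie within the extent of the other municipalities; it does not confirm that they are in the right place.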
Next, we check for missing AREA values and fill them in from Wikipedia. Source: https://en.wikipedia.org/wiki/Japur%C3%A1
Brazil_cities_allpop[(is.na(Brazil_cities_allpop$AREA))!=0,]
## CITY_STATE CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BRAS
## 2531 Japurá_AM Japurá AM 0 7326 7318
## IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL IBGE_POP IBGE_1
## 2531 8 1043 583 460 3235 92
## IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+ IDHM Ranking 2010 IDHM
## 2531 369 435 478 1764 97 5451 0.522
## IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG LAT ALT AREA
## 2531 0.552 0.748 0.345 -66.9969 -1.880845 69.84 NA
## RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC GVA_TOTAL
## 2531 Rural Remoto 16398.64 2146.9 9908.92 29244.3 57.7
## TAXES GDP POP_GDP GDP_CAPITA
## 2531 1489.89 59.19 4660 12701.43
## GVA_MAIN
## 2531 Administração, defesa, educação e saúde públicas e seguridade social
## COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 2531 16 0 0 0 0 0 0 13 0 0
## COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 2531 0 0 0 0 1 2 0 0 0 0
## COMP_U
## 2531 0
Brazil_cities_allpop$AREA[which(Brazil_cities_allpop$CITY_STATE == "Japurá_AM")] <- 55791
#summary(Brazil_cities_allpop)
Brazil_cities_allpop[(is.na(Brazil_cities_allpop$IDHM))!=0,]
## CITY_STATE CITY STATE CAPITAL IBGE_RES_POP
## 4490 Santa Terezinha_BA Santa Terezinha BA 0 9648
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL
## 4490 9648 0 2891 734 2157
## IBGE_POP IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## 4490 2332 40 126 191 217 1419 339
## IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG
## 4490 NA NA NA NA NA -39.5184
## LAT ALT AREA RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## 4490 -12.7498 210 719.26 Rural Adjacente 13235.2 5398.61 17754.37
## GVA_PUBLIC GVA_TOTAL TAXES GDP POP_GDP GDP_CAPITA
## 4490 32630.97 69019.14 3149.33 72168.48 10619 6796.16
## GVA_MAIN
## 4490 Administração, defesa, educação e saúde públicas e seguridade social
## COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 4490 74 2 1 4 0 0 3 37 0 3
## COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 4490 1 0 0 1 2 2 12 2 0 4
## COMP_U
## 4490 0
Unfortunately, we will not be able to use this data point, as we are unable to fill in the missing Human Development Index values. For the purpose of this study, this row will also be excluded.
Brazil_cities_cleaned<- Brazil_cities_allpop[(is.na(Brazil_cities_allpop$IDHM))==0,]
summary(Brazil_cities_cleaned)
## CITY_STATE CITY STATE
## Abadia De Goiás_GO : 1 Length:5564 Length:5564
## Abadia Dos Dourados_MG: 1 Class :character Class :character
## Abadiânia_GO : 1 Mode :character Mode :character
## Abaeté_MG : 1
## Abaetetuba_PA : 1
## Abaiara_CE : 1
## (Other) :5558
## CAPITAL IBGE_RES_POP IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR
## Min. :0.000000 Min. : 805 Min. : 805 Min. : 0.00
## 1st Qu.:0.000000 1st Qu.: 5234 1st Qu.: 5228 1st Qu.: 0.00
## Median :0.000000 Median : 10935 Median : 10930 Median : 0.00
## Mean :0.004853 Mean : 34282 Mean : 34205 Mean : 77.52
## 3rd Qu.:0.000000 3rd Qu.: 23446 3rd Qu.: 23392 3rd Qu.: 10.00
## Max. :1.000000 Max. :11253503 Max. :11133776 Max. :119727.00
##
## IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL IBGE_POP
## Min. : 239 Min. : 60 Min. : 0.0 Min. : 174
## 1st Qu.: 1572 1st Qu.: 874 1st Qu.: 471.8 1st Qu.: 2802
## Median : 3174 Median : 1845 Median : 918.5 Median : 6174
## Mean : 10301 Mean : 8857 Mean : 1443.8 Mean : 27599
## 3rd Qu.: 6726 3rd Qu.: 4622 3rd Qu.: 1813.0 3rd Qu.: 15303
## Max. :3576148 Max. :3548433 Max. :33809.0 Max. :10463636
##
## IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14
## Min. : 0.0 Min. : 5.0 Min. : 7 Min. : 12.0
## 1st Qu.: 38.0 1st Qu.: 158.0 1st Qu.: 220 1st Qu.: 259.8
## Median : 92.0 Median : 376.5 Median : 516 Median : 588.5
## Mean : 383.3 Mean : 1544.8 Mean : 2070 Mean : 2381.8
## 3rd Qu.: 232.0 3rd Qu.: 951.2 3rd Qu.: 1300 3rd Qu.: 1478.2
## Max. :129464.0 Max. :514794.0 Max. :684443 Max. :783702.0
##
## IBGE_15-59 IBGE_60+ IDHM Ranking 2010 IDHM
## Min. : 94 Min. : 29.0 Min. : 1 Min. :0.4180
## 1st Qu.: 1735 1st Qu.: 341.0 1st Qu.:1392 1st Qu.:0.5990
## Median : 3842 Median : 722.5 Median :2782 Median :0.6650
## Mean : 18215 Mean : 3004.7 Mean :2783 Mean :0.6592
## 3rd Qu.: 9629 3rd Qu.: 1724.2 3rd Qu.:4173 3rd Qu.:0.7180
## Max. :7058221 Max. :1293012.0 Max. :5565 Max. :0.8620
##
## IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG
## Min. :0.4000 Min. :0.6720 Min. :0.2070 Min. :-72.92
## 1st Qu.:0.5720 1st Qu.:0.7690 1st Qu.:0.4900 1st Qu.:-50.87
## Median :0.6540 Median :0.8080 Median :0.5600 Median :-46.52
## Mean :0.6429 Mean :0.8016 Mean :0.5591 Mean :-46.23
## 3rd Qu.:0.7070 3rd Qu.:0.8360 3rd Qu.:0.6310 3rd Qu.:-41.41
## Max. :0.8910 Max. :0.8940 Max. :0.8250 Max. :-32.44
##
## LAT ALT AREA RURAL_URBAN
## Min. :-33.688 Min. : 0.0 Min. : 3.57 Length:5564
## 1st Qu.:-22.839 1st Qu.: 169.8 1st Qu.: 204.53 Class :character
## Median :-18.091 Median : 406.5 Median : 416.59 Mode :character
## Mean :-16.447 Mean : 894.0 Mean : 1525.29
## 3rd Qu.: -8.489 3rd Qu.: 629.1 3rd Qu.: 1026.44
## Max. : 4.585 Max. :874579.0 Max. :159533.33
##
## GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC
## Min. : 0 Min. : 1 Min. : 2 Min. : 7
## 1st Qu.: 4192 1st Qu.: 1725 1st Qu.: 10113 1st Qu.: 17258
## Median : 20432 Median : 7428 Median : 31214 Median : 35837
## Mean : 47270 Mean : 176080 Mean : 489940 Mean : 123860
## 3rd Qu.: 51239 3rd Qu.: 41015 3rd Qu.: 115552 3rd Qu.: 89328
## Max. :1402282 Max. :63306755 Max. :464656988 Max. :41902893
##
## GVA_TOTAL TAXES GDP POP_GDP
## Min. : 17 Min. : -14159 Min. : 15 Min. : 815
## 1st Qu.: 42254 1st Qu.: 1302 1st Qu.: 43691 1st Qu.: 5486
## Median : 119492 Median : 5108 Median : 125153 Median : 11584
## Mean : 833729 Mean : 118983 Mean : 955425 Mean : 37028
## 3rd Qu.: 314039 3rd Qu.: 22219 3rd Qu.: 329733 3rd Qu.: 25105
## Max. :569910503 Max. :117125387 Max. :687035890 Max. :12038175
##
## GDP_CAPITA GVA_MAIN COMP_TOT COMP_A
## Min. : 3191 Length:5564 Min. : 6.0 Min. : 0.00
## 1st Qu.: 9062 Class :character 1st Qu.: 68.0 1st Qu.: 1.00
## Median : 15870 Mode :character Median : 162.0 Median : 2.00
## Mean : 21122 Mean : 907.6 Mean : 18.27
## 3rd Qu.: 26155 3rd Qu.: 449.2 3rd Qu.: 8.00
## Max. :314638 Max. :530446.0 Max. :1948.00
##
## COMP_B COMP_C COMP_D COMP_E
## Min. : 0.000 Min. : 0.00 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 0.000 Median : 11.00 Median : 0.0000 Median : 0.000
## Mean : 1.853 Mean : 73.51 Mean : 0.4265 Mean : 2.031
## 3rd Qu.: 2.000 3rd Qu.: 39.00 3rd Qu.: 0.0000 3rd Qu.: 1.000
## Max. :274.000 Max. :31566.00 Max. :332.0000 Max. :657.000
##
## COMP_F COMP_G COMP_H COMP_I
## Min. : 0.00 Min. : 1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.00 1st Qu.: 32.0 1st Qu.: 1.00 1st Qu.: 2.00
## Median : 4.00 Median : 75.0 Median : 7.00 Median : 7.00
## Mean : 43.29 Mean : 348.3 Mean : 41.03 Mean : 55.93
## 3rd Qu.: 15.00 3rd Qu.: 200.0 3rd Qu.: 25.00 3rd Qu.: 24.00
## Max. :25222.00 Max. :150633.0 Max. :19515.00 Max. :29290.00
##
## COMP_J COMP_K COMP_L COMP_M
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 1.00
## Median : 1.00 Median : 0.00 Median : 0.00 Median : 4.00
## Mean : 24.77 Mean : 15.57 Mean : 15.15 Mean : 51.34
## 3rd Qu.: 5.00 3rd Qu.: 2.00 3rd Qu.: 3.00 3rd Qu.: 13.00
## Max. :38720.00 Max. :23738.00 Max. :14003.00 Max. :49181.00
##
## COMP_N COMP_O COMP_P COMP_Q
## Min. : 0.00 Min. : 1.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.00 1st Qu.: 2.000 1st Qu.: 2.00 1st Qu.: 1.00
## Median : 4.00 Median : 2.000 Median : 6.00 Median : 3.00
## Mean : 83.78 Mean : 3.271 Mean : 30.98 Mean : 34.18
## 3rd Qu.: 14.00 3rd Qu.: 3.000 3rd Qu.: 17.00 3rd Qu.: 12.00
## Max. :76757.00 Max. :204.000 Max. :16030.00 Max. :22248.00
##
## COMP_R COMP_S COMP_U
## Min. : 0.00 Min. : 0.00 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.: 5.00 1st Qu.: 0.00000
## Median : 2.00 Median : 12.00 Median : 0.00000
## Mean : 12.19 Mean : 51.66 Mean : 0.05032
## 3rd Qu.: 6.00 3rd Qu.: 31.00 3rd Qu.: 0.00000
## Max. :6687.00 Max. :24832.00 Max. :123.00000
##
Overall, we lost a total of 9 rows during data cleaning: 3 were missing the dependent variable (GDP per capita), 5 were missing a large number of variables, and 1 was missing IDHM values.
Overall, we reduced the number of variables from 81 to 58. We added 1 variable as a unique identifier, removed 22 variables because they were recorded after the year of our dependent variable (2016), removed 1 because all of its values were 0, and removed 1 because a large portion of its values were missing.
To formulate our indicators, we need to create some derived variables so that the indicators in our explanatory model are not correlated with one another, or with the dependent variable, through a shared underlying factor. Since our dependent variable is a metric divided by population, we need to process any values that depend on population in some way.
We will be taking 3 different approaches in this case.
Using Ratios rather than counts for metrics where we have totals. E.g. (foreign resident population / total resident population)
Using the values divided by POP_GDP which is the population scale used to formulate GDP Per capita.
We can derive more variables for our analysis by converting categorical variables into binary arrays. This will allow us to retain our categorical variables during our regression by making them into dummy variables.
Examining GVA_MAIN
unique(Brazil_cities_cleaned[,37])
## [1] "Demais serviços"
## [2] "Administração, defesa, educação e saúde públicas e seguridade social"
## [3] "Agricultura, inclusive apoio à agricultura e a pós colheita"
## [4] "Indústrias de transformação"
## [5] "Pecuária, inclusive apoio à pecuária"
## [6] "Eletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação"
## [7] "Comércio e reparação de veículos automotores e motocicletas"
## [8] "Indústrias extrativas"
## [9] "Construção"
## [10] "Produção florestal, pesca e aquicultura"
Examining RURAL_URBAN
unique(Brazil_cities_cleaned[,27])
## [1] "Urbano" "Rural Adjacente"
## [3] "Rural Remoto" "Intermediário Adjacente"
## [5] "Intermediário Remoto"
Creating Dummy Variable Arrays
Brazil_cities_CAT <- cbind(Brazil_cities_cleaned, as.data.frame(with(Brazil_cities_cleaned, model.matrix(~ RURAL_URBAN + 0))))
Brazil_cities_CAT <- cbind(Brazil_cities_CAT, as.data.frame(with(Brazil_cities_cleaned, model.matrix(~ GVA_MAIN + 0))))
Dropping Categorical Columns
dropCategorical <- c("GVA_MAIN", "RURAL_URBAN")
Brazil_cities_withDummy <- Brazil_cities_CAT[ , !(names(Brazil_cities_CAT) %in% dropCategorical)]
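To make the dummy-variable step concrete, here is a minimal sketch on a made-up data frame. The column name `zone` and its values are assumptions for illustration only; the point is how `model.matrix()` with `+ 0` yields one 0/1 column per category level.

```r
# Toy data: the column name and values are hypothetical, not from the dataset.
toy <- data.frame(zone = c("Urbano", "Rural Remoto", "Urbano", "Rural Adjacente"))

# `+ 0` removes the intercept, so every level gets its own 0/1 column.
dummies <- as.data.frame(model.matrix(~ zone + 0, data = toy))
print(dummies)

# Each row belongs to exactly one level, so each row of dummies sums to 1.
row_totals <- rowSums(dummies)
```

Because every row sums to 1, the full set of dummy columns is perfectly collinear with a regression intercept, which matters later when the model is fitted.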
To control for differences in population, we take ratios instead of raw counts, which gives a better picture of the makeup of each municipality.
After examining the data and its source, there appears to be an error in the GVA totals, which would greatly affect our GVA ratios. On inspecting the source data, all other GVA values are correct; only the totals are wrong. It is not clear where the values in the totals come from, so we replace them by summing the values of each GVA category to form new GVA totals.
Brazil_cities_withDummy <- Brazil_cities_withDummy %>%
mutate(` GVA_TOTAL ` = as.numeric(rowSums(.[27:30])))
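The positional index `27:30` above is assumed to cover the four GVA component columns. As a hedged alternative sketch, the same total can be rebuilt by column name, which keeps working if the column order changes. The toy values and the name `GVA_TOTAL_NEW` are made up for illustration.

```r
# Toy data with the four GVA components (values are hypothetical).
toy <- data.frame(
  GVA_AGROPEC  = c(10, 0, 5),
  GVA_INDUSTRY = c(20, 3, 5),
  GVA_SERVICES = c(30, 4, 5),
  GVA_PUBLIC   = c(40, 1, 5)
)

# Select the components by name instead of by position.
gva_parts <- c("GVA_AGROPEC", "GVA_INDUSTRY", "GVA_SERVICES", "GVA_PUBLIC")

# Row-wise sum of the components gives the rebuilt total.
toy$GVA_TOTAL_NEW <- rowSums(toy[, gva_parts])
print(toy)
```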
Brazil_cities_Derived <- Brazil_cities_withDummy %>%
# Foreign vs local population
mutate(RES_BRAZ_POP_RATIO = ifelse((IBGE_RES_POP_BRAS == 0), 0, (IBGE_RES_POP_BRAS/IBGE_RES_POP))) %>%
mutate(RES_FOREIGN_POP_RATIO = ifelse((IBGE_RES_POP_ESTR == 0), 0, (IBGE_RES_POP_ESTR/IBGE_RES_POP))) %>%
# Rural vs Urban Domestic Units
mutate(DOM_URBAN_RATIO = ifelse((IBGE_DU_URBAN == 0), 0, (IBGE_DU_URBAN/IBGE_DU)))%>%
mutate(DOM_RURAL_RATIO = ifelse((IBGE_DU_RURAL == 0), 0, (IBGE_DU_RURAL/IBGE_DU)))%>%
# Residential Population Age Ratios
mutate(POP_BEL_ONE_RATIO = ifelse((IBGE_1 == 0), 0, (IBGE_1/IBGE_POP)))%>%
mutate(POP_ONE_to_FOUR_RATIO = ifelse((`IBGE_1-4` == 0), 0, (`IBGE_1-4`/IBGE_POP)))%>%
mutate(POP_FIVE_to_NINE_RATIO = ifelse((`IBGE_5-9` == 0), 0, (`IBGE_5-9`/IBGE_POP)))%>%
mutate(POP_TEN_to_FOURTEEN_RATIO = ifelse((`IBGE_10-14` == 0), 0, (`IBGE_10-14`/IBGE_POP)))%>%
mutate(POP_WORKING_RATIO = ifelse((`IBGE_15-59` == 0), 0, (`IBGE_15-59`/IBGE_POP))) %>%
mutate(POP_ELDERLY_RATIO = ifelse((`IBGE_60+` == 0), 0, (`IBGE_60+`/IBGE_POP)))%>%
# Gross Added Value Ratios
mutate(GVA_AGROPEC_RATIO = ifelse((GVA_AGROPEC == 0), 0, (GVA_AGROPEC/as.numeric(` GVA_TOTAL `))))%>%
mutate(GVA_INDUSTRY_RATIO = ifelse((GVA_INDUSTRY == 0), 0, (GVA_INDUSTRY/as.numeric(` GVA_TOTAL `))))%>%
mutate(GVA_SERVICES_RATIO = ifelse((GVA_SERVICES == 0), 0, (GVA_SERVICES/as.numeric(` GVA_TOTAL `))))%>%
mutate(GVA_PUBLIC_RATIO = ifelse((GVA_PUBLIC == 0), 0, (GVA_PUBLIC/as.numeric(` GVA_TOTAL `))))%>%
# Company Ratios
mutate(COM_A_RATIO = ifelse((COMP_A == 0), 0, (COMP_A/COMP_TOT)))%>%
mutate(COM_B_RATIO = ifelse((COMP_B == 0), 0, (COMP_B/COMP_TOT)))%>%
mutate(COM_C_RATIO = ifelse((COMP_C == 0), 0, (COMP_C/COMP_TOT)))%>%
mutate(COM_D_RATIO = ifelse((COMP_D == 0), 0, (COMP_D/COMP_TOT)))%>%
mutate(COM_E_RATIO = ifelse((COMP_E == 0), 0, (COMP_E/COMP_TOT)))%>%
mutate(COM_F_RATIO = ifelse((COMP_F == 0), 0, (COMP_F/COMP_TOT)))%>%
mutate(COM_G_RATIO = ifelse((COMP_G == 0), 0, (COMP_G/COMP_TOT)))%>%
mutate(COM_H_RATIO = ifelse((COMP_H == 0), 0, (COMP_H/COMP_TOT)))%>%
mutate(COM_I_RATIO = ifelse((COMP_I == 0), 0, (COMP_I/COMP_TOT)))%>%
mutate(COM_J_RATIO = ifelse((COMP_J == 0), 0, (COMP_J/COMP_TOT)))%>%
mutate(COM_K_RATIO = ifelse((COMP_K == 0), 0, (COMP_K/COMP_TOT)))%>%
mutate(COM_L_RATIO = ifelse((COMP_L == 0), 0, (COMP_L/COMP_TOT)))%>%
mutate(COM_M_RATIO = ifelse((COMP_M == 0), 0, (COMP_M/COMP_TOT)))%>%
mutate(COM_N_RATIO = ifelse((COMP_N == 0), 0, (COMP_N/COMP_TOT)))%>%
mutate(COM_O_RATIO = ifelse((COMP_O == 0), 0, (COMP_O/COMP_TOT)))%>%
mutate(COM_P_RATIO = ifelse((COMP_P == 0), 0, (COMP_P/COMP_TOT)))%>%
mutate(COM_Q_RATIO = ifelse((COMP_Q == 0), 0, (COMP_Q/COMP_TOT)))%>%
mutate(COM_R_RATIO = ifelse((COMP_R == 0), 0, (COMP_R/COMP_TOT)))%>%
mutate(COM_S_RATIO = ifelse((COMP_S == 0), 0, (COMP_S/COMP_TOT)))%>%
mutate(COM_U_RATIO = ifelse((COMP_U == 0), 0, (COMP_U/COMP_TOT)))
Brazil_cities_Derived <- Brazil_cities_Derived %>%
mutate(POP_DENSITY = POP_GDP/AREA)
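The repeated `mutate()` calls above all follow one pattern: a part divided by its total, with 0 returned in the degenerate case. A compact sketch of that pattern is shown below. Note two deliberate differences, both assumptions of this sketch: the hypothetical helper `safe_ratio()` guards on a zero denominator rather than a zero numerator, and the new columns are named `COMP_*_RATIO` rather than `COM_*_RATIO`.

```r
# Hypothetical helper: part / total, returning 0 where the total is 0.
safe_ratio <- function(num, den) {
  ifelse(den == 0, 0, num / den)
}

# Toy company counts (values are made up).
toy <- data.frame(COMP_A = c(2, 0, 0), COMP_B = c(6, 5, 0), COMP_TOT = c(8, 5, 0))

# Apply the helper to every part column in one pass.
parts <- c("COMP_A", "COMP_B")
ratios <- sapply(toy[parts], safe_ratio, den = toy$COMP_TOT)
colnames(ratios) <- paste0(parts, "_RATIO")

toy <- cbind(toy, ratios)
print(toy)
```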
#summary(Brazil_cities_Derived)
The data looks good. Note that several ratios have a maximum value below 1. Because the ratios already lie on a 0 to 1 scale, it would be prudent not to normalize them.
Brazil_cities_Derived[73:74]%>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Brazil_cities_Derived[75:76] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Brazil_cities_Derived[77:82] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Brazil_cities_Derived[83:86] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Brazil_cities_Derived[87:106] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Brazil_cities.sf <- st_as_sf(Brazil_cities_Derived,
coords = c("LONG", "LAT"),
crs=4326) %>%
st_transform(crs=4674)
#head(Brazil_cities.sf)
We transform the CRS to EPSG:4674 (SIRGAS 2000), as per the geobr documentation, so that the data points map accurately onto the municipality boundaries of Brazil.
Validity_NA_Check(Brazil_cities.sf)
## [1] "For: Brazil_cities.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 0"
#muni.sf <- read_municipality(year=2010)
We load the municipality boundaries for 2010 so that they align with the latitude and longitude data in our aspatial dataset, which is dated 2010. The call is commented out because we save the data locally after cleaning to reduce processing time.
#Validity_NA_Check(muni.sf)
#muni.sf <- st_make_valid(muni.sf)
#Validity_NA_Check(muni.sf)
#muni.sp <- as_Spatial(muni.sf)
#writeOGR(muni.sp, "./data/geospatial", "Brazil_Muni", driver="ESRI Shapefile")
The above were commented out to reduce loading times. We will load in the file locally and check the validity.
tmap_mode("plot")
muni_loaded.sf <- st_read(dsn="data/geospatial", layer="Brazil_Muni")
## Reading layer `Brazil_Muni' from data source `D:\GSA\Take_Home_EX04\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 5567 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -73.99045 ymin: -33.75208 xmax: -28.83609 ymax: 5.271841
## geographic CRS: GRS 1980(IUGG, 1980)
st_crs(muni_loaded.sf) <- 4674
qtm(muni_loaded.sf)
Validity_NA_Check(muni_loaded.sf)
## [1] "For: muni_loaded.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 0"
muni_loaded_w_unique.sf <- cbind(CITY_STATE_M = paste(muni_loaded.sf$name_mn, muni_loaded.sf$abbrv_s, sep="_"), muni_loaded.sf)
tm_shape(muni_loaded_w_unique.sf)+
tm_fill(col= "code_mn")+
tm_shape(Brazil_cities.sf)+
tm_dots(size = 0.01)
Based on the map above, we can observe that the points are accurately mapped to their respective municipalities in Brazil. We will create a combined data frame to allow us to perform our next phase of choropleth mapping.
#Brazil_cities.sf <- Brazil_cities.sf[!(Brazil_cities.sf$CITY_STATE =="Fernando De Noronha_PE"), ]
tmap_mode("plot")
Brazil_super.sf <- st_join(muni_loaded_w_unique.sf, Brazil_cities.sf, join=st_intersects)
Validity_NA_Check(Brazil_super.sf)
## [1] "For: Brazil_super.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 3"
Checking NA Row locations
temp_NA <- Brazil_super.sf[rowSums(is.na(Brazil_super.sf))!=0,]
as.character(temp_NA$name_mn)
## [1] "Santa Teresinha" "Lagoa Mirim" "Lagoa Dos Patos"
Based on the data above, we can see that 2 of the polygons with NA are lakes and the last one is Santa Teresinha which we removed because of missing values in the data cleaning. This means that the rest of the polygons should have the data mapped to them correctly, unless there are double points in them.
Removing NA rows
Brazil_super_cleaned.sf<- Brazil_super.sf[rowSums(is.na(Brazil_super.sf))==0,]
Validity_NA_Check(Brazil_super_cleaned.sf)
## [1] "For: Brazil_super_cleaned.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 0"
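The row filter used above, `rowSums(is.na(x)) != 0`, flags any row with at least one missing value. A small base R sketch on made-up data shows the idea and that it matches `complete.cases()`; the column names and values are assumptions for illustration.

```r
# Toy attribute table with some missing values (hypothetical data).
toy <- data.frame(
  name = c("A", "B", "C"),
  gdp  = c(1, NA, 3),
  idhm = c(0.5, 0.6, NA)
)

# TRUE for rows that contain at least one NA.
has_na <- rowSums(is.na(toy)) != 0

na_rows <- toy[has_na, ]   # rows to inspect
clean   <- toy[!has_na, ]  # rows to keep

print(na_rows)
print(clean)
```

On an sf object the geometry column is part of the row as well, so it is worth inspecting the flagged rows, as done above, before dropping them.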
Checking for duplicates
dim(Brazil_super_cleaned.sf[duplicated(Brazil_super_cleaned.sf$CITY_STATE.x),])
## [1] 0 110
There are no duplicate rows, which means that each polygon has only one data point attached to it.
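As a small illustration of why the check is run on the combined city and state key rather than the city name alone, consider the sketch below. The toy rows are made up for illustration; `duplicated()` flags the second and later occurrences of a value.

```r
# Toy table where one city name appears in two states (hypothetical rows).
toy <- data.frame(
  CITY  = c("Santa Teresinha", "Santa Teresinha", "Abaiara"),
  STATE = c("BA", "PB", "CE")
)

# Build the combined key in the same way as CITY_STATE.
toy$CITY_STATE <- paste(toy$CITY, toy$STATE, sep = "_")

dup_city <- sum(duplicated(toy$CITY))        # city name alone collides
dup_key  <- sum(duplicated(toy$CITY_STATE))  # combined key does not

print(c(dup_city = dup_city, dup_key = dup_key))
```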
tmap_mode("plot")
tm_shape(Brazil_super_cleaned.sf)+
tm_fill(col= "GDP_CAPITA",
style="jenks",
title = "GDP per Capita",
palette ="Greens")+
tm_layout(main.title = "Distribution of GDP per Capita by Municipality \n(Jenks classification)",
main.title.position = "center",
main.title.size = 1,
legend.height = 0.45,
legend.width = 0.35,
legend.outside = FALSE,
legend.position = c("right", "bottom"),
frame = FALSE) +
tm_borders(alpha = 0.1)
Based on the map above, we can see a surprising result for GDP per capita. The highest values appear in the satellite cities around Sao Paulo rather than in the main city itself. Additionally, very far inland, in areas like Selviria and Campos De Júlio, we can also see concentrations of higher GDP per capita. This could be due to a lower population in regions that still generate a large amount of production. This is surprising given the larger areas of these polygons.
What is even more surprising is that the two main cities in Brazil, Rio De Janeiro and Sao Paulo, only have GDP per capita of 50,690 and 57,071 respectively. This is most likely due to a much larger population concentrated in these smaller areas, which is concerning from a social development standpoint.
dropsAbrev <- c("CITY_STATE_M", "code_mn", "name_mn", "cod_stt", "abbrv_s", "CITY", "STATE")
Brazil_reg.sf <- Brazil_super_cleaned.sf[ , !(names(Brazil_super_cleaned.sf) %in% dropsAbrev)]
Brazil_numeric_vars <- cbind(Brazil_reg.sf[,3:28]%>%
st_set_geometry(NULL), Brazil_reg.sf[,32:52]%>%
st_set_geometry(NULL), Brazil_reg.sf[,102]%>%
st_set_geometry(NULL))
Brazil_numeric_vars.norm <- normalize(Brazil_numeric_vars)
Brazil_Ratios_vars <- Brazil_reg.sf[,68:101] %>%
st_set_geometry(NULL)
Brazil_Categorical_vars <- cbind(Brazil_reg.sf[,2]%>%
st_set_geometry(NULL), Brazil_reg.sf[,53:67]%>%
st_set_geometry(NULL))
dropsReg <- c("CITY_STATE", "GDP", "GDP_CAPITA", "POP_GDP")
Brazil_All_vars <- Brazil_reg.sf[ , !(names(Brazil_reg.sf) %in% dropsReg)] %>%
st_set_geometry(NULL)
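The `normalize()` call above comes from heatmaply and, to our understanding, rescales each numeric column to the 0 to 1 range. The sketch below shows that min-max rescaling in base R on made-up values; the helper name `min_max` is an assumption and not part of any package.

```r
# Hypothetical helper: rescale a numeric vector to the 0 to 1 range.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column would give 0/0, so return zeros instead.
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy numeric columns (values are made up).
toy <- data.frame(ALT = c(0, 500, 1000), AREA = c(10, 20, 50))

# Apply the rescaling column by column.
toy_norm <- as.data.frame(lapply(toy, min_max))
print(toy_norm)
```

One consequence of min-max scaling is that a single extreme value compresses everything else towards 0. This is visible in the data: ALT has a maximum of 874579 against a third quartile of 629.1.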
corrplot(cor(Brazil_numeric_vars.norm, use = "complete.obs"), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.5, method = "square", type = "upper")
corrplot(cor(Brazil_Ratios_vars, use = "complete.obs"), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")
corrplot(cor(Brazil_Categorical_vars, use = "complete.obs"), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.5, method = "square", type = "upper")
# The plot of all variables is commented out for display reasons, although it was checked during the analysis to ensure the variables do not correlate too strongly
# corrplot(cor(Brazil_All_vars, use = "complete.obs"), diag = FALSE, order = "AOE",
#tl.pos = "td", tl.cex = 0.5, method = "square", type = "upper")
As expected, a number of indicators in our numeric dataset are clearly highly correlated with one another, notably the IBGE, GVA, TAXES and COMP numbers. Because of their correlation with COMP_TOT, we will use COMP_TOT as a single metric to capture all of them, as it is the likely driver of those variables (particularly taxes). We will also use IDHM as a single measure for all the IDHM indicators, although there will be some loss of information.
Within the ratios, we can see that the youth population ratios are very highly correlated with one another. As these are ratios of the same total, we can sum them to give a new youth metric instead. Additionally, because DOM_RURAL_RATIO and DOM_URBAN_RATIO, and likewise RES_BRAZ_POP_RATIO and RES_FOREIGN_POP_RATIO, are complements of each other, we only need one from each pair as an indicator. In our case, we choose the foreign population ratio and the urban domestic units ratio.
Brazil_numeric_vars_pro <- Brazil_numeric_vars.norm %>% select("ALT", "AREA", "IDHM", "POP_DENSITY", "COMP_TOT")
Brazil_Ratios_vars_pro <- Brazil_Ratios_vars %>%
mutate( POP_YOUTH_RATIO = as.numeric((POP_BEL_ONE_RATIO + POP_ONE_to_FOUR_RATIO + POP_FIVE_to_NINE_RATIO + POP_TEN_to_FOURTEEN_RATIO)))
dropsRatios <- c("POP_BEL_ONE_RATIO", "POP_ONE_to_FOUR_RATIO", "POP_FIVE_to_NINE_RATIO", "POP_TEN_to_FOURTEEN_RATIO", "RES_BRAZ_POP_RATIO", "DOM_RURAL_RATIO")
Brazil_Ratios_vars_pro <- Brazil_Ratios_vars_pro[ , !(names(Brazil_Ratios_vars_pro) %in% dropsRatios)]
Brazil_indicators <- cbind(Brazil_Ratios_vars_pro, Brazil_Categorical_vars, Brazil_numeric_vars_pro)
corrplot(cor(Brazil_indicators, use = "complete.obs"), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.4, number.cex= 0.3, method = "number", type = "upper")
Based on our correlation plot, we do not see any pair of variables with a correlation beyond 0.75. As such, we will use these variables in our regression.
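Reading a threshold off a dense correlation plot can be error-prone, so the same check can be done programmatically. The sketch below uses simulated data, with made-up variable names, to list every pair whose absolute correlation exceeds 0.75.

```r
set.seed(42)
n <- 200
a <- rnorm(n)

# Simulated data: b is almost a copy of a, c is independent noise.
toy <- data.frame(a = a, b = a + rnorm(n, sd = 0.1), c = rnorm(n))

cm <- cor(toy, use = "complete.obs")

# Blank out the diagonal and lower triangle so each pair appears once.
cm[lower.tri(cm, diag = TRUE)] <- NA

# Locate the cells above the threshold (NA cells are ignored by which()).
high <- which(abs(cm) > 0.75, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(high_pairs)
```

An empty result would confirm that no pair crosses the chosen threshold.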
polygon_frame <- Brazil_reg.sf %>% select("CITY_STATE")
joining_frame <- Brazil_reg.sf %>% select("CITY_STATE", "GDP_CAPITA") %>% st_set_geometry(NULL)
joining_frame_states <- cbind(joining_frame, Brazil_indicators)
# Normally we would join on an index, but after checking the data we found that the row order
# matches Brazil_reg.sf, so the indicators are attached to the correct polygons
Brazil_Indicators.sf <- left_join(polygon_frame, joining_frame_states, by="CITY_STATE")
Validity_NA_Check(Brazil_Indicators.sf)
## [1] "For: Brazil_Indicators.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 0"
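Since `cbind()` pairs rows purely by position, it can help to verify the alignment explicitly, or to reorder by key before binding. The following is a minimal sketch on made-up tables; the key values and the IDHM numbers are assumptions for illustration.

```r
# Two toy tables sharing a key, deliberately stored in different row orders.
left  <- data.frame(CITY_STATE = c("A_X", "B_Y", "C_Z"), GDP_CAPITA = c(10, 20, 30))
right <- data.frame(CITY_STATE = c("C_Z", "A_X", "B_Y"), IDHM = c(0.7, 0.5, 0.6))

# If this is FALSE, a positional cbind() would pair the wrong rows.
aligned <- identical(left$CITY_STATE, right$CITY_STATE)

# Look up each left key in the right table and fail loudly on a missing key.
idx <- match(left$CITY_STATE, right$CITY_STATE)
stopifnot(!anyNA(idx))

# Attach the right-hand values in the left-hand order.
joined <- left
joined$IDHM <- right$IDHM[idx]
print(joined)
```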
summary(Brazil_Indicators.sf)
## CITY_STATE GDP_CAPITA RES_FOREIGN_POP_RATIO
## Abadia De Goiás_GO : 1 Min. : 3191 Min. :0.0000000
## Abadia Dos Dourados_MG: 1 1st Qu.: 9062 1st Qu.:0.0000000
## Abadiânia_GO : 1 Median : 15870 Median :0.0000000
## Abaeté_MG : 1 Mean : 21122 Mean :0.0007593
## Abaetetuba_PA : 1 3rd Qu.: 26155 3rd Qu.:0.0006992
## Abaiara_CE : 1 Max. :314638 Max. :0.3772182
## (Other) :5558
## DOM_URBAN_RATIO POP_WORKING_RATIO POP_ELDERLY_RATIO GVA_AGROPEC_RATIO
## Min. :0.04553 Min. :0.4716 Min. :0.02255 Min. :0.00000
## 1st Qu.:0.49148 1st Qu.:0.6087 1st Qu.:0.09799 1st Qu.:0.03364
## Median :0.66263 Median :0.6325 Median :0.11921 Median :0.15062
## Mean :0.65205 Mean :0.6308 Mean :0.12009 Mean :0.21034
## 3rd Qu.:0.83040 3rd Qu.:0.6543 3rd Qu.:0.14103 3rd Qu.:0.34094
## Max. :1.00000 Max. :0.7448 Max. :0.42199 Max. :0.99877
##
## GVA_INDUSTRY_RATIO GVA_SERVICES_RATIO GVA_PUBLIC_RATIO COM_A_RATIO
## Min. :0.0000157 Min. :0.0000461 Min. :0.0000433 Min. :0.000000
## 1st Qu.:0.0368730 1st Qu.:0.1985910 1st Qu.:0.1448472 1st Qu.:0.001569
## Median :0.0714602 Median :0.3117002 Median :0.2948082 Median :0.011803
## Mean :0.1377745 Mean :0.3260963 Mean :0.3257928 Mean :0.039408
## 3rd Qu.:0.1795132 3rd Qu.:0.4600063 3rd Qu.:0.4966551 3rd Qu.:0.031915
## Max. :0.9991868 Max. :0.9995977 Max. :0.9996029 Max. :0.917085
##
## COM_B_RATIO COM_C_RATIO COM_D_RATIO COM_E_RATIO
## Min. :0.000000 Min. :0.00000 Min. :0.0000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.03636 1st Qu.:0.0000000 1st Qu.:0.000000
## Median :0.000000 Median :0.06590 Median :0.0000000 Median :0.000000
## Mean :0.006019 Mean :0.07967 Mean :0.0007847 Mean :0.002508
## 3rd Qu.:0.005188 3rd Qu.:0.10593 3rd Qu.:0.0000000 3rd Qu.:0.003226
## Max. :0.333333 Max. :0.54518 Max. :0.4444444 Max. :0.083333
##
## COM_F_RATIO COM_G_RATIO COM_H_RATIO COM_I_RATIO
## Min. :0.00000 Min. :0.01789 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.01389 1st Qu.:0.38980 1st Qu.:0.01562 1st Qu.:0.02128
## Median :0.02778 Median :0.46396 Median :0.03757 Median :0.04167
## Mean :0.03130 Mean :0.47234 Mean :0.04955 Mean :0.04567
## 3rd Qu.:0.04348 3rd Qu.:0.55263 3rd Qu.:0.07052 3rd Qu.:0.06202
## Max. :0.29213 Max. :0.89091 Max. :0.43689 Max. :0.52542
##
## COM_J_RATIO COM_K_RATIO COM_L_RATIO COM_M_RATIO
## Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.01144
## Median :0.007299 Median :0.000000 Median :0.000000 Median :0.02362
## Mean :0.009054 Mean :0.003933 Mean :0.005450 Mean :0.02536
## 3rd Qu.:0.013982 3rd Qu.:0.006112 3rd Qu.:0.008601 3rd Qu.:0.03659
## Max. :0.417249 Max. :0.087912 Max. :0.156863 Max. :0.24444
##
## COM_N_RATIO COM_O_RATIO COM_P_RATIO COM_Q_RATIO
## Min. :0.00000 Min. :0.0001764 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.01802 1st Qu.:0.0058954 1st Qu.:0.01786 1st Qu.:0.006615
## Median :0.02924 Median :0.0153846 Median :0.02985 Median :0.019946
## Mean :0.03553 Mean :0.0277867 Mean :0.04350 Mean :0.022028
## 3rd Qu.:0.04496 3rd Qu.:0.0361664 3rd Qu.:0.04878 3rd Qu.:0.033033
## Max. :0.33527 Max. :0.3636364 Max. :0.83673 Max. :0.214286
##
## COM_R_RATIO COM_S_RATIO COM_U_RATIO POP_YOUTH_RATIO
## Min. :0.000000 Min. :0.00000 Min. :0.000e+00 Min. :0.1064
## 1st Qu.:0.000000 1st Qu.:0.04116 1st Qu.:0.000e+00 1st Qu.:0.2153
## Median :0.009091 Median :0.06395 Median :0.000e+00 Median :0.2452
## Mean :0.010772 Mean :0.08933 Mean :2.036e-06 Mean :0.2491
## 3rd Qu.:0.015310 3rd Qu.:0.11147 3rd Qu.:0.000e+00 3rd Qu.:0.2771
## Max. :0.166667 Max. :0.56716 Max. :2.985e-03 Max. :0.4408
##
## CAPITAL RURAL_URBANIntermediário Adjacente
## Min. :0.000000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.000000 Median :0.0000
## Mean :0.004853 Mean :0.1233
## 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.0000
##
## RURAL_URBANIntermediário Remoto RURAL_URBANRural Adjacente
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000
## Mean :0.01078 Mean :0.5462
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000
##
## RURAL_URBANRural Remoto RURAL_URBANUrbano
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.05805 Mean :0.2617
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000
##
## GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4892
## 3rd Qu.:1.0000
## Max. :1.0000
##
## GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1317
## 3rd Qu.:0.0000
## Max. :1.0000
##
## GVA_MAINComércio e reparação de veículos automotores e motocicletas
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.008267
## 3rd Qu.:0.000000
## Max. :1.000000
##
## GVA_MAINConstrução GVA_MAINDemais serviços
## Min. :0.000000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.000000 Median :0.0000
## Mean :0.001258 Mean :0.2653
## 3rd Qu.:0.000000 3rd Qu.:1.0000
## Max. :1.000000 Max. :1.0000
##
## GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.01761
## 3rd Qu.:0.00000
## Max. :1.00000
##
## GVA_MAINIndústrias de transformação GVA_MAINIndústrias extrativas
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.04691 Mean :0.00629
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
##
## GVA_MAINPecuária, inclusive apoio à pecuária
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.02894
## 3rd Qu.:0.00000
## Max. :1.00000
##
## GVA_MAINProdução florestal, pesca e aquicultura ALT
## Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.0001941
## Median :0.000000 Median :0.0004648
## Mean :0.004493 Mean :0.0010222
## 3rd Qu.:0.000000 3rd Qu.:0.0007193
## Max. :1.000000 Max. :1.0000000
##
## AREA IDHM POP_DENSITY COMP_TOT
## Min. :0.000000 Min. :0.0000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.001260 1st Qu.:0.4077 1st Qu.:0.000876 1st Qu.:0.0001169
## Median :0.002589 Median :0.5563 Median :0.001864 Median :0.0002941
## Mean :0.009539 Mean :0.5432 Mean :0.008659 Mean :0.0016997
## 3rd Qu.:0.006412 3rd Qu.:0.6757 3rd Qu.:0.004091 3rd Qu.:0.0008356
## Max. :1.000000 Max. :1.0000 Max. :1.000000 Max. :1.0000000
##
## geometry
## MULTIPOLYGON :5564
## epsg:4674 : 0
## +proj=long...: 0
##
##
##
##
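When performing a multiple linear regression, we need to define our hypotheses:
* Null hypothesis: none of the explanatory variables helps to explain GDP per capita (all slope coefficients are zero, so the mean alone describes the data)
* Alternative hypothesis: at least one explanatory variable has a non-zero coefficient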
We select a confidence level of 95% for this analysis, meaning we need a p-value below an alpha of 0.05 in order to reject the null hypothesis.
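Because we have categorical dummy variables and groups of ratios that sum to 1, one member of each group has to act as the baseline. Here we pass all of them to lm() and let it drop the redundant member of each group, which is reported as NA in the output: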
GDPPC.mlr<- lm(GDP_CAPITA ~ ., data=Brazil_Indicators.sf[2:53] %>% st_set_geometry(NULL))
summary(GDPPC.mlr)
##
## Call:
## lm(formula = GDP_CAPITA ~ ., data = Brazil_Indicators.sf[2:53] %>%
## st_set_geometry(NULL))
##
## Residuals:
## Min 1Q Median 3Q Max
## -40924 -5198 -713 3256 246925
##
## Coefficients: (5 not defined because of singularities)
## Estimate
## (Intercept) 5457565.3
## RES_FOREIGN_POP_RATIO -5490.9
## DOM_URBAN_RATIO -1953.4
## POP_WORKING_RATIO 37961.8
## POP_ELDERLY_RATIO -34416.2
## GVA_AGROPEC_RATIO 8195.2
## GVA_INDUSTRY_RATIO 22856.5
## GVA_SERVICES_RATIO 4935.0
## GVA_PUBLIC_RATIO NA
## COM_A_RATIO -5474832.5
## COM_B_RATIO -5510719.9
## COM_C_RATIO -5504450.0
## COM_D_RATIO -5436381.1
## COM_E_RATIO -5436230.7
## COM_F_RATIO -5477177.6
## COM_G_RATIO -5479871.4
## COM_H_RATIO -5458575.7
## COM_I_RATIO -5467005.8
## COM_J_RATIO -5467998.6
## COM_K_RATIO -5364519.1
## COM_L_RATIO -5373763.3
## COM_M_RATIO -5460390.8
## COM_N_RATIO -5466424.3
## COM_O_RATIO -5461544.0
## COM_P_RATIO -5478092.3
## COM_Q_RATIO -5487459.3
## COM_R_RATIO -5479679.6
## COM_S_RATIO -5477920.3
## COM_U_RATIO NA
## POP_YOUTH_RATIO NA
## CAPITAL -8485.0
## `RURAL_URBANIntermediário Adjacente` -370.4
## `RURAL_URBANIntermediário Remoto` 4839.7
## `RURAL_URBANRural Adjacente` 1555.1
## `RURAL_URBANRural Remoto` 4727.9
## RURAL_URBANUrbano NA
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` -9635.6
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita` 2708.8
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` 21939.3
## GVA_MAINConstrução -7020.6
## `GVA_MAINDemais serviços` -8217.5
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` 19968.9
## `GVA_MAINIndústrias de transformação` 13099.6
## `GVA_MAINIndústrias extrativas` 14549.4
## `GVA_MAINPecuária, inclusive apoio à pecuária` -5565.2
## `GVA_MAINProdução florestal, pesca e aquicultura` NA
## ALT -5481.8
## AREA 8959.2
## IDHM 36695.3
## POP_DENSITY 7831.5
## COMP_TOT 36690.6
## Std. Error
## (Intercept) 5758288.1
## RES_FOREIGN_POP_RATIO 56651.7
## DOM_URBAN_RATIO 1462.6
## POP_WORKING_RATIO 10630.0
## POP_ELDERLY_RATIO 7766.5
## GVA_AGROPEC_RATIO 1309.1
## GVA_INDUSTRY_RATIO 1634.1
## GVA_SERVICES_RATIO 1161.4
## GVA_PUBLIC_RATIO NA
## COM_A_RATIO 5758215.0
## COM_B_RATIO 5758386.0
## COM_C_RATIO 5758212.1
## COM_D_RATIO 5758474.6
## COM_E_RATIO 5758512.8
## COM_F_RATIO 5758246.2
## COM_G_RATIO 5758265.9
## COM_H_RATIO 5758277.5
## COM_I_RATIO 5757930.3
## COM_J_RATIO 5758076.2
## COM_K_RATIO 5757653.4
## COM_L_RATIO 5757834.4
## COM_M_RATIO 5758161.2
## COM_N_RATIO 5757990.4
## COM_O_RATIO 5758278.8
## COM_P_RATIO 5758244.6
## COM_Q_RATIO 5758228.4
## COM_R_RATIO 5758306.1
## COM_S_RATIO 5758244.3
## COM_U_RATIO NA
## POP_YOUTH_RATIO NA
## CAPITAL 3355.8
## `RURAL_URBANIntermediário Adjacente` 738.4
## `RURAL_URBANIntermediário Remoto` 2083.3
## `RURAL_URBANRural Adjacente` 695.9
## `RURAL_URBANRural Remoto` 1085.4
## RURAL_URBANUrbano NA
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` 2983.5
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita` 2991.4
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` 3698.9
## GVA_MAINConstrução 6278.1
## `GVA_MAINDemais serviços` 3025.8
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` 3389.8
## `GVA_MAINIndústrias de transformação` 3168.8
## `GVA_MAINIndústrias extrativas` 3923.8
## `GVA_MAINPecuária, inclusive apoio à pecuária` 3170.5
## `GVA_MAINProdução florestal, pesca e aquicultura` NA
## ALT 9989.1
## AREA 6234.3
## IDHM 2842.5
## POP_DENSITY 4947.6
## COMP_TOT 14889.5
## t value
## (Intercept) 0.948
## RES_FOREIGN_POP_RATIO -0.097
## DOM_URBAN_RATIO -1.336
## POP_WORKING_RATIO 3.571
## POP_ELDERLY_RATIO -4.431
## GVA_AGROPEC_RATIO 6.260
## GVA_INDUSTRY_RATIO 13.987
## GVA_SERVICES_RATIO 4.249
## GVA_PUBLIC_RATIO NA
## COM_A_RATIO -0.951
## COM_B_RATIO -0.957
## COM_C_RATIO -0.956
## COM_D_RATIO -0.944
## COM_E_RATIO -0.944
## COM_F_RATIO -0.951
## COM_G_RATIO -0.952
## COM_H_RATIO -0.948
## COM_I_RATIO -0.949
## COM_J_RATIO -0.950
## COM_K_RATIO -0.932
## COM_L_RATIO -0.933
## COM_M_RATIO -0.948
## COM_N_RATIO -0.949
## COM_O_RATIO -0.948
## COM_P_RATIO -0.951
## COM_Q_RATIO -0.953
## COM_R_RATIO -0.952
## COM_S_RATIO -0.951
## COM_U_RATIO NA
## POP_YOUTH_RATIO NA
## CAPITAL -2.528
## `RURAL_URBANIntermediário Adjacente` -0.502
## `RURAL_URBANIntermediário Remoto` 2.323
## `RURAL_URBANRural Adjacente` 2.235
## `RURAL_URBANRural Remoto` 4.356
## RURAL_URBANUrbano NA
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` -3.230
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita` 0.906
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` 5.931
## GVA_MAINConstrução -1.118
## `GVA_MAINDemais serviços` -2.716
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` 5.891
## `GVA_MAINIndústrias de transformação` 4.134
## `GVA_MAINIndústrias extrativas` 3.708
## `GVA_MAINPecuária, inclusive apoio à pecuária` -1.755
## `GVA_MAINProdução florestal, pesca e aquicultura` NA
## ALT -0.549
## AREA 1.437
## IDHM 12.910
## POP_DENSITY 1.583
## COMP_TOT 2.464
## Pr(>|t|)
## (Intercept) 0.343285
## RES_FOREIGN_POP_RATIO 0.922791
## DOM_URBAN_RATIO 0.181731
## POP_WORKING_RATIO 0.000358
## POP_ELDERLY_RATIO 9.55e-06
## GVA_AGROPEC_RATIO 4.14e-10
## GVA_INDUSTRY_RATIO < 2e-16
## GVA_SERVICES_RATIO 2.18e-05
## GVA_PUBLIC_RATIO NA
## COM_A_RATIO 0.341754
## COM_B_RATIO 0.338614
## COM_C_RATIO 0.339149
## COM_D_RATIO 0.345177
## COM_E_RATIO 0.345194
## COM_F_RATIO 0.341550
## COM_G_RATIO 0.341315
## COM_H_RATIO 0.343195
## COM_I_RATIO 0.342421
## COM_J_RATIO 0.342346
## COM_K_RATIO 0.351522
## COM_L_RATIO 0.350708
## COM_M_RATIO 0.343025
## COM_N_RATIO 0.342477
## COM_O_RATIO 0.342933
## COM_P_RATIO 0.341470
## COM_Q_RATIO 0.340643
## COM_R_RATIO 0.341335
## COM_S_RATIO 0.341485
## COM_U_RATIO NA
## POP_YOUTH_RATIO NA
## CAPITAL 0.011484
## `RURAL_URBANIntermediário Adjacente` 0.615964
## `RURAL_URBANIntermediário Remoto` 0.020211
## `RURAL_URBANRural Adjacente` 0.025484
## `RURAL_URBANRural Remoto` 1.35e-05
## RURAL_URBANUrbano NA
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` 0.001247
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita` 0.365227
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` 3.19e-09
## GVA_MAINConstrução 0.263506
## `GVA_MAINDemais serviços` 0.006632
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` 4.07e-09
## `GVA_MAINIndústrias de transformação` 3.62e-05
## `GVA_MAINIndústrias extrativas` 0.000211
## `GVA_MAINPecuária, inclusive apoio à pecuária` 0.079260
## `GVA_MAINProdução florestal, pesca e aquicultura` NA
## ALT 0.583182
## AREA 0.150749
## IDHM < 2e-16
## POP_DENSITY 0.113508
## COMP_TOT 0.013763
##
## (Intercept)
## RES_FOREIGN_POP_RATIO
## DOM_URBAN_RATIO
## POP_WORKING_RATIO ***
## POP_ELDERLY_RATIO ***
## GVA_AGROPEC_RATIO ***
## GVA_INDUSTRY_RATIO ***
## GVA_SERVICES_RATIO ***
## GVA_PUBLIC_RATIO
## COM_A_RATIO
## COM_B_RATIO
## COM_C_RATIO
## COM_D_RATIO
## COM_E_RATIO
## COM_F_RATIO
## COM_G_RATIO
## COM_H_RATIO
## COM_I_RATIO
## COM_J_RATIO
## COM_K_RATIO
## COM_L_RATIO
## COM_M_RATIO
## COM_N_RATIO
## COM_O_RATIO
## COM_P_RATIO
## COM_Q_RATIO
## COM_R_RATIO
## COM_S_RATIO
## COM_U_RATIO
## POP_YOUTH_RATIO
## CAPITAL *
## `RURAL_URBANIntermediário Adjacente`
## `RURAL_URBANIntermediário Remoto` *
## `RURAL_URBANRural Adjacente` *
## `RURAL_URBANRural Remoto` ***
## RURAL_URBANUrbano
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` **
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita`
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` ***
## GVA_MAINConstrução
## `GVA_MAINDemais serviços` **
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` ***
## `GVA_MAINIndústrias de transformação` ***
## `GVA_MAINIndústrias extrativas` ***
## `GVA_MAINPecuária, inclusive apoio à pecuária` .
## `GVA_MAINProdução florestal, pesca e aquicultura`
## ALT
## AREA
## IDHM ***
## POP_DENSITY
## COMP_TOT *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14600 on 5518 degrees of freedom
## Multiple R-squared: 0.488, Adjusted R-squared: 0.4838
## F-statistic: 116.9 on 45 and 5518 DF, p-value: < 2.2e-16
The F-statistic has a p-value below 0.05, so we can reject the null hypothesis that all slope coefficients are zero, that is, that a model predicting the mean GDP per capita for every municipality fits as well as ours.
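As a quick sanity check, the reported F-statistic can be reproduced from the R-squared and the degrees of freedom printed in the summary. This is only a sketch using the rounded values from the printout, so the result agrees with the reported figure up to rounding.

```r
# Values copied from the lm() summary above
r_squared <- 0.488   # Multiple R-squared (rounded in the printout)
k         <- 45      # numerator degrees of freedom (estimated slopes)
df_resid  <- 5518    # residual degrees of freedom

# F = explained variation per slope / unexplained variation per residual df
f_stat  <- (r_squared / k) / ((1 - r_squared) / df_resid)
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

f_stat    # close to the reported 116.9; the small gap comes from rounding
p_value   # far below 0.05
```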
The company-type ratios (COM_A_RATIO to COM_U_RATIO) do not appear to contribute significantly to GDP per capita. The altitude and area of the municipality are also not significant, and the same holds for population density, the ratio of foreign residents in the population, and the share of urbanized households. Some of the GVA main categories (agriculture, construction and livestock) are not statistically significant either, so we remove them as well. Lastly, the urban/rural classifications are mostly significant, except for Intermediário Adjacente (p = 0.62), probably because that category sits in between several of the others; we keep all four classification dummies for the next run.
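The NA rows in the summary above (for example GVA_PUBLIC_RATIO, COM_U_RATIO and POP_YOUTH_RATIO) are what lm() prints when a predictor is an exact linear combination of other predictors. One likely explanation, which is an assumption because the construction of these columns is not shown in this section, is that each group of ratios sums to one. The sketch below reproduces the behaviour on simulated data with hypothetical share_a, share_b and share_c columns.

```r
set.seed(1)
n <- 200

# Hypothetical sector shares that sum to exactly 1 in every row
share_a <- runif(n, 0, 0.5)
share_b <- runif(n, 0, 0.5)
share_c <- 1 - share_a - share_b

y <- 10 + 5 * share_a - 3 * share_b + rnorm(n)

fit <- lm(y ~ share_a + share_b + share_c)

# share_c is NA: it is fully determined by the intercept and the other shares
coef(fit)

# alias() reports which terms are exact combinations of the others
alias(fit)$Complete
```

Dropping one ratio from each group (or one level from each dummy set) removes the NA rows without changing the fitted values.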
Brazil_sig_Indic.sf <- Brazil_Indicators.sf %>%
  select("CITY_STATE", "GDP_CAPITA",
         "POP_WORKING_RATIO", "POP_ELDERLY_RATIO",
         "GVA_AGROPEC_RATIO", "GVA_INDUSTRY_RATIO", "GVA_SERVICES_RATIO",
         "CAPITAL",
         "RURAL_URBANIntermediário Adjacente",
         "RURAL_URBANIntermediário Remoto",
         "RURAL_URBANRural Adjacente",
         "RURAL_URBANRural Remoto",
         "GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social",
         "GVA_MAINComércio e reparação de veículos automotores e motocicletas",
         "GVA_MAINDemais serviços",
         "GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação",
         "GVA_MAINIndústrias de transformação",
         "GVA_MAINIndústrias extrativas",
         "IDHM", "COMP_TOT")
GDPPC_sig.mlr <- lm(GDP_CAPITA ~ ., data = Brazil_sig_Indic.sf[2:21] %>% st_set_geometry(NULL))
summary(GDPPC_sig.mlr)
##
## Call:
## lm(formula = GDP_CAPITA ~ ., data = Brazil_sig_Indic.sf[2:21] %>%
## st_set_geometry(NULL))
##
## Residuals:
## Min 1Q Median 3Q Max
## -42585 -5379 -942 3078 252473
##
## Coefficients:
## Estimate
## (Intercept) -13666.5
## POP_WORKING_RATIO 28125.7
## POP_ELDERLY_RATIO -43172.8
## GVA_AGROPEC_RATIO 8705.5
## GVA_INDUSTRY_RATIO 22762.2
## GVA_SERVICES_RATIO 5337.2
## CAPITAL -4722.8
## `RURAL_URBANIntermediário Adjacente` -1085.7
## `RURAL_URBANIntermediário Remoto` 4783.1
## `RURAL_URBANRural Adjacente` 1129.0
## `RURAL_URBANRural Remoto` 4452.0
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` -11208.5
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` 21417.8
## `GVA_MAINDemais serviços` -9095.2
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` 19564.9
## `GVA_MAINIndústrias de transformação` 11273.0
## `GVA_MAINIndústrias extrativas` 15435.8
## IDHM 39405.4
## COMP_TOT 55830.7
## Std. Error
## (Intercept) 6122.4
## POP_WORKING_RATIO 10242.9
## POP_ELDERLY_RATIO 7379.9
## GVA_AGROPEC_RATIO 1296.0
## GVA_INDUSTRY_RATIO 1636.9
## GVA_SERVICES_RATIO 1153.6
## CAPITAL 3279.7
## `RURAL_URBANIntermediário Adjacente` 732.7
## `RURAL_URBANIntermediário Remoto` 2012.5
## `RURAL_URBANRural Adjacente` 627.3
## `RURAL_URBANRural Remoto` 1037.5
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` 707.0
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` 2295.3
## `GVA_MAINDemais serviços` 774.5
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` 1733.3
## `GVA_MAINIndústrias de transformação` 1222.3
## `GVA_MAINIndústrias extrativas` 2652.6
## IDHM 2416.5
## COMP_TOT 14553.1
## t value
## (Intercept) -2.232
## POP_WORKING_RATIO 2.746
## POP_ELDERLY_RATIO -5.850
## GVA_AGROPEC_RATIO 6.717
## GVA_INDUSTRY_RATIO 13.906
## GVA_SERVICES_RATIO 4.627
## CAPITAL -1.440
## `RURAL_URBANIntermediário Adjacente` -1.482
## `RURAL_URBANIntermediário Remoto` 2.377
## `RURAL_URBANRural Adjacente` 1.800
## `RURAL_URBANRural Remoto` 4.291
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` -15.853
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` 9.331
## `GVA_MAINDemais serviços` -11.743
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` 11.288
## `GVA_MAINIndústrias de transformação` 9.223
## `GVA_MAINIndústrias extrativas` 5.819
## IDHM 16.307
## COMP_TOT 3.836
## Pr(>|t|)
## (Intercept) 0.025639
## POP_WORKING_RATIO 0.006054
## POP_ELDERLY_RATIO 5.19e-09
## GVA_AGROPEC_RATIO 2.04e-11
## GVA_INDUSTRY_RATIO < 2e-16
## GVA_SERVICES_RATIO 3.80e-06
## CAPITAL 0.149924
## `RURAL_URBANIntermediário Adjacente` 0.138455
## `RURAL_URBANIntermediário Remoto` 0.017506
## `RURAL_URBANRural Adjacente` 0.071941
## `RURAL_URBANRural Remoto` 1.81e-05
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` < 2e-16
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` < 2e-16
## `GVA_MAINDemais serviços` < 2e-16
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` < 2e-16
## `GVA_MAINIndústrias de transformação` < 2e-16
## `GVA_MAINIndústrias extrativas` 6.25e-09
## IDHM < 2e-16
## COMP_TOT 0.000126
##
## (Intercept) *
## POP_WORKING_RATIO **
## POP_ELDERLY_RATIO ***
## GVA_AGROPEC_RATIO ***
## GVA_INDUSTRY_RATIO ***
## GVA_SERVICES_RATIO ***
## CAPITAL
## `RURAL_URBANIntermediário Adjacente`
## `RURAL_URBANIntermediário Remoto` *
## `RURAL_URBANRural Adjacente` .
## `RURAL_URBANRural Remoto` ***
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social` ***
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas` ***
## `GVA_MAINDemais serviços` ***
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` ***
## `GVA_MAINIndústrias de transformação` ***
## `GVA_MAINIndústrias extrativas` ***
## IDHM ***
## COMP_TOT ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14840 on 5545 degrees of freedom
## Multiple R-squared: 0.4685, Adjusted R-squared: 0.4668
## F-statistic: 271.5 on 18 and 5545 DF, p-value: < 2.2e-16
In the new regression, some of the variables are no longer significant at the 5% level: the CAPITAL indicator (p = 0.15) and the two Adjacente classifications, Intermediário Adjacente (p = 0.14) and Rural Adjacente (p = 0.07). We will run the regression again without them.
dropsInsig <- c("CAPITAL", "RURAL_URBANIntermediário Adjacente", "RURAL_URBANRural Adjacente")
Brazil_sig_Indic.sf <- Brazil_sig_Indic.sf[ , !(names(Brazil_sig_Indic.sf) %in% dropsInsig)]
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'RURAL_URBANIntermediário Remoto'] <- 'CAT_INTERMEDIATE_REMOTE'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'RURAL_URBANRural Remoto'] <- 'CAT_RURAL_REMOTE'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social'] <- 'GVA_MAIN_Public_Sector'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINComércio e reparação de veículos automotores e motocicletas'] <- 'GVA_MAIN_Commercial'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINDemais serviços'] <- 'GVA_MAIN_Other_services'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação'] <- 'GVA_MAIN_Public_Utilities'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINIndústrias de transformação'] <- 'GVA_MAIN_Industry_transformation'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINIndústrias extrativas'] <- 'GVA_MAIN_Industrial'
GDPPC_sig2.mlr<- lm(GDP_CAPITA ~ ., data=Brazil_sig_Indic.sf[2:18] %>% st_set_geometry(NULL))
summary(GDPPC_sig2.mlr)
##
## Call:
## lm(formula = GDP_CAPITA ~ ., data = Brazil_sig_Indic.sf[2:18] %>%
## st_set_geometry(NULL))
##
## Residuals:
## Min 1Q Median 3Q Max
## -42091 -5367 -884 3055 252671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13922.8 6105.4 -2.280 0.022622 *
## POP_WORKING_RATIO 29840.1 10241.3 2.914 0.003586 **
## POP_ELDERLY_RATIO -37947.9 6994.2 -5.426 6.02e-08 ***
## GVA_AGROPEC_RATIO 9008.2 1287.0 6.999 2.88e-12 ***
## GVA_INDUSTRY_RATIO 22507.7 1633.3 13.780 < 2e-16 ***
## GVA_SERVICES_RATIO 4989.7 1148.6 4.344 1.42e-05 ***
## CAT_INTERMEDIATE_REMOTE 4304.2 1967.3 2.188 0.028725 *
## CAT_RURAL_REMOTE 3815.4 908.6 4.199 2.72e-05 ***
## GVA_MAIN_Public_Sector -11314.6 706.3 -16.019 < 2e-16 ***
## GVA_MAIN_Commercial 21167.5 2293.4 9.230 < 2e-16 ***
## GVA_MAIN_Other_services -9509.9 751.3 -12.657 < 2e-16 ***
## GVA_MAIN_Public_Utilities 19421.8 1734.1 11.200 < 2e-16 ***
## GVA_MAIN_Industry_transformation 11100.7 1220.7 9.094 < 2e-16 ***
## GVA_MAIN_Industrial 15348.7 2654.9 5.781 7.82e-09 ***
## IDHM 38163.3 2351.1 16.232 < 2e-16 ***
## COMP_TOT 46378.4 12894.6 3.597 0.000325 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14850 on 5548 degrees of freedom
## Multiple R-squared: 0.4671, Adjusted R-squared: 0.4657
## F-statistic: 324.2 on 15 and 5548 DF, p-value: < 2.2e-16
The final regression has an adjusted R-squared of 0.4657, which is fairly low: more than half of the variation in GDP per capita is still unexplained by our model. The adjusted R-squared has also decreased slightly as we refined the model (from 0.4838 to 0.4668 to 0.4657). The F-statistic still allows us to reject the null hypothesis that a model predicting the mean for every municipality fits the data equally well.
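The adjusted R-squared values of the three models can be recomputed from the printed R-squared, the number of observations (5564) and the number of slopes in each model. The sketch below uses the rounded figures from the summaries, so it matches the reported values only up to rounding.

```r
# Adjusted R-squared penalises R-squared for the number of predictors k
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

n <- 5564  # number of municipalities in the regression

models <- data.frame(
  model = c("all variables", "first reduction", "final"),
  r2    = c(0.488, 0.4685, 0.4671),  # Multiple R-squared from each summary
  k     = c(45, 18, 15)              # slopes estimated in each model
)
models$adj_r2 <- adj_r2(models$r2, n, models$k)
models
```

The drop in adjusted R-squared between the first and the final model is small relative to the thirty predictors removed, which is the trade-off being made for a simpler model.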
The regression above, which we validate below, shows how each variable is associated with GDP per capita. Unsurprisingly, the total number of companies is significantly and positively associated with GDP per capita. This is most probably because more companies mean more jobs, so more people can be employed. If we wanted to investigate further, we could examine whether the ratio of companies to population has an effect on GDP per capita.
The working population ratio has a positive coefficient while the elderly ratio has a negative one. This is in line with the logic that a larger economically active share of the population contributes to GDP per capita, whereas a larger share of elderly dependents results in lower GDP per capita. For the Gross Value Added ratios by sector, all three contribute positively to GDP per capita, but the industry ratio has a much larger coefficient (22,507.7) than the agricultural (9,008.2) and services (4,989.7) ratios. This is most probably due to the way GDP per capita is calculated, with manufacturing sectors contributing more to it than others.
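Because the GVA ratios run from 0 to 1, a coefficient is the change in predicted GDP per capita for a move from a 0% to a 100% share, which is hard to picture. A small worked example, using the final-model estimates and a hypothetical 10 percentage point increase, puts the three sectors on a more readable scale (in the units of GDP_CAPITA, holding the other predictors fixed).

```r
# Estimates copied from the final regression summary
coefs <- c(
  GVA_AGROPEC_RATIO  = 9008.2,
  GVA_INDUSTRY_RATIO = 22507.7,
  GVA_SERVICES_RATIO = 4989.7
)

# Hypothetical change: each share rises by 10 percentage points (0.10)
delta  <- 0.10
effect <- coefs * delta

# Predicted change in GDP per capita for each sector share
round(effect, 1)

# Industry coefficient relative to the other two
round(coefs[["GVA_INDUSTRY_RATIO"]] / coefs, 2)
```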
The IDHM, which is our Human Development Index, is also positively associated with GDP per capita. However, it is not certain that this is a causal relationship, as there might be reverse causality: higher GDP per capita often leads to better outcomes in life. Because the IDHM was recorded in 2010 and GDP per capita in 2016, the timing is at least consistent with a higher HDI leading to greater GDP per capita, although it does not prove it.
In terms of our categorical variables, being classified as a Rural Remote or Intermediate Remote region is positively associated with GDP per capita. This roughly matches our choropleth map, which showed inland areas with higher GDP per capita than areas one would expect to be more urbanized. This could be due to a lower population in these remote areas and a greater focus on industrial or manufacturing jobs.
Interestingly, municipalities whose main Gross Value Added sector is public services (public administration, defense, education, health and social security) have a negative coefficient, meaning lower GDP per capita relative to the baseline categories. This may be due to municipalities being specialized for certain government functions. Other services follow the same negative pattern, although it is not clear why this is the case. As expected, municipalities whose main economic activity is commerce have the largest positive coefficient, but surprisingly public utilities (electricity and gas, water, sewage, waste management and decontamination) come in close behind, ahead of municipalities labelled as extractive industry (GVA_MAIN_Industrial) and manufacturing (GVA_MAIN_Industry_transformation).
VIF <- ols_vif_tol(GDPPC_sig2.mlr)
VIF
## Variables Tolerance VIF
## 1 POP_WORKING_RATIO 0.3671431 2.723734
## 2 POP_ELDERLY_RATIO 0.7191852 1.390463
## 3 GVA_AGROPEC_RATIO 0.5633825 1.774993
## 4 GVA_INDUSTRY_RATIO 0.5075373 1.970299
## 5 GVA_SERVICES_RATIO 0.6316755 1.583091
## 6 CAT_INTERMEDIATE_REMOTE 0.9602515 1.041394
## 7 CAT_RURAL_REMOTE 0.8783178 1.138540
## 8 GVA_MAIN_Public_Sector 0.3180169 3.144487
## 9 GVA_MAIN_Commercial 0.9193246 1.087755
## 10 GVA_MAIN_Other_services 0.3603482 2.775094
## 11 GVA_MAIN_Public_Utilities 0.7619520 1.312419
## 12 GVA_MAIN_Industry_transformation 0.5950853 1.680431
## 13 GVA_MAIN_Industrial 0.8998277 1.111324
## 14 IDHM 0.2730812 3.661915
## 15 COMP_TOT 0.9651462 1.036112
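The VIF analysis shows that every variable has a VIF well below the common rule-of-thumb thresholds of 5 or 10 (the largest is 3.66, for IDHM), so none of the predictors is redundant, in line with the correlation analysis done earlier.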
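For readers unfamiliar with the two columns in the table: tolerance is 1 minus the R-squared obtained by regressing a predictor on all the other predictors, and the VIF is its reciprocal. The sketch below computes both by hand on simulated data (the variables x1, x2 and x3 are made up for illustration) rather than on the Brazil dataset.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# Regress each predictor on the remaining ones and convert R-squared to VIF
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
tolerance <- 1 / vif_manual

round(cbind(tolerance, vif_manual), 3)

# The same relationship holds in the table above, e.g. for IDHM
idhm_vif <- 1 / 0.2730812
idhm_vif
```

The correlated pair x1 and x2 ends up with clearly higher VIFs than x3, which is the pattern one would look for in the real table.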
ols_plot_resid_fit(GDPPC_sig2.mlr)
In the plot above, the residuals are scattered fairly evenly around the zero line. This suggests that the model satisfies the linearity assumption required for multiple linear regression. Additionally, there are no obvious signs of heteroscedasticity in the plot.
ols_plot_resid_hist(GDPPC_sig2.mlr)
The figure suggests that the residuals of the multiple linear regression model resemble a normal distribution, which satisfies the normality assumption. We would normally use ols_test_normality() to test this assumption further, but the function is limited to sample sizes between 3 and 5000 and we have 5564 observations. We therefore skip this step and treat the plot as sufficient evidence, keeping in mind that the residual summary (minimum -42091, maximum 252671) points to a long right tail.
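One possible work-around for the sample-size limit, which we did not apply in this analysis, is to run the Shapiro-Wilk test on a random subsample of 5000 residuals. The sketch below uses simulated residuals as a stand-in for GDPPC_sig2.mlr$residuals; the seed and the simulated values are assumptions for illustration only.

```r
set.seed(2020)

# Simulated stand-in for the 5564 model residuals (not the real residuals)
res <- rnorm(5564, mean = 0, sd = 14850)

# shapiro.test() refuses samples larger than 5000 and raises an error
too_big <- tryCatch(shapiro.test(res), error = function(e) conditionMessage(e))
too_big

# Work-around: test a random subsample of 5000 residuals instead
sub <- sample(res, 5000)
sw  <- shapiro.test(sub)
sw$p.value
```

With samples this large, formal normality tests flag even small departures, so a Q-Q plot of all residuals (qqnorm() followed by qqline()) is a useful complement.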
The model is built on geographically referenced attributes, so it is also important to visualize the residuals of the model to check for spatial autocorrelation.
mlr.output <- as.data.frame(GDPPC_sig2.mlr$residuals)
Brazil_residual.sf <- cbind(Brazil_sig_Indic.sf,
                            GDPPC_sig2.mlr$residuals) %>%
  rename(`MLR_RES` = `GDPPC_sig2.mlr.residuals`)
tmap_mode("plot")
tm_shape(Brazil_residual.sf) +
  tm_fill("MLR_RES",
          n = 6,
          style = "quantile",
          palette = "RdYlBu") +
  tm_borders(alpha = 0.5)
The residual map does not show a clear sign of clustering or any other geospatial pattern in the distribution. We can test this formally using Moran’s I.
For this, we use the point locations of the municipalities, since we already have them. We assume that the row order of the points matches that of the regression data, as we have not sorted either dataset.
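The row-order assumption can be checked cheaply before building the spatial weights. The sketch below uses two small hypothetical data frames keyed by CITY_STATE (standing in for the attribute tables of the point layer and the regression data) to show the check and, if needed, the realignment.

```r
# Hypothetical stand-ins for the attribute tables of the two layers
points_df <- data.frame(CITY_STATE = c("A_GO", "B_MG", "C_PA"))
model_df  <- data.frame(CITY_STATE = c("A_GO", "B_MG", "C_PA"),
                        res = c(1.2, -0.4, 0.3))

# TRUE only if both key columns list the same municipalities in the same order
same_order <- function(a, b) identical(as.character(a), as.character(b))

aligned <- same_order(points_df$CITY_STATE, model_df$CITY_STATE)

# If the order differs, match() realigns one table to the other
shuffled      <- model_df[c(3, 1, 2), ]
misaligned    <- same_order(points_df$CITY_STATE, shuffled$CITY_STATE)
realigned     <- shuffled[match(points_df$CITY_STATE, shuffled$CITY_STATE), ]
aligned_again <- same_order(points_df$CITY_STATE, realigned$CITY_STATE)

c(aligned = aligned, misaligned = misaligned, aligned_again = aligned_again)
```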
Brazil_cities.sp <- as_Spatial(Brazil_cities.sf)
#st_crs(Brazil_cities.sf)
proj4string(Brazil_cities.sp)
## [1] "+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs"
coords <- coordinates(Brazil_cities.sp)
# Distance (in km) from each municipality to its nearest neighbour
k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1, coords, longlat = TRUE))
summary(k1dists)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6029 9.1046 13.1276 17.0081 19.7337 363.0083
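# 364 km is just above the largest nearest-neighbour distance (363.0 km),
# so every municipality has at least one neighbour
nb <- dnearneigh(coordinates(Brazil_cities.sp), 0, 364, longlat = TRUE)
nb_lw <- nb2listw(nb, style = 'W')
lm.morantest(GDPPC_sig2.mlr, nb_lw)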
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = GDP_CAPITA ~ ., data = Brazil_sig_Indic.sf[2:18]
## %>% st_set_geometry(NULL))
## weights: nb_lw
##
## Moran I statistic standard deviate = 0.69146, p-value = 0.2446
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 6.391678e-04 -1.799828e-04 1.403444e-06
In the global Moran’s I test, the p-value (0.2446) is above 0.05, so we are unable to reject the null hypothesis that the residuals are randomly distributed in space. We therefore find no evidence of spatial autocorrelation in the regression residuals, which allows us to trust the relationships in our model a little more.
We will now try to refine our regression using the GWmodel package.
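The standard deviate and p-value in the test output follow directly from the three sample estimates printed at the bottom. As a worked check, using only the numbers reported above:

```r
# Values copied from the lm.morantest() output above
observed    <- 6.391678e-04
expectation <- -1.799828e-04
variance    <- 1.403444e-06

# Standard deviate: how many standard errors the observed I lies above its expectation
z <- (observed - expectation) / sqrt(variance)

# One-sided p-value, matching "alternative hypothesis: greater"
p_one_sided <- pnorm(z, lower.tail = FALSE)

c(z = z, p = p_one_sided)
```

The observed Moran's I is less than one standard error above its expectation, which is why the test does not reject spatial randomness.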
Joint_sf <- left_join(Brazil_cities.sf[,1], Brazil_sig_Indic.sf %>% st_set_geometry(NULL))
Joint_sp <- as_Spatial(Joint_sf)
summary(Joint_sp@data)
## CITY_STATE GDP_CAPITA POP_WORKING_RATIO
## Abadia De Goiás_GO : 1 Min. : 3191 Min. :0.4716
## Abadia Dos Dourados_MG: 1 1st Qu.: 9062 1st Qu.:0.6087
## Abadiânia_GO : 1 Median : 15870 Median :0.6325
## Abaeté_MG : 1 Mean : 21122 Mean :0.6308
## Abaetetuba_PA : 1 3rd Qu.: 26155 3rd Qu.:0.6543
## Abaiara_CE : 1 Max. :314638 Max. :0.7448
## (Other) :5558
## POP_ELDERLY_RATIO GVA_AGROPEC_RATIO GVA_INDUSTRY_RATIO GVA_SERVICES_RATIO
## Min. :0.02255 Min. :0.00000 Min. :0.0000157 Min. :0.0000461
## 1st Qu.:0.09799 1st Qu.:0.03364 1st Qu.:0.0368730 1st Qu.:0.1985910
## Median :0.11921 Median :0.15062 Median :0.0714602 Median :0.3117002
## Mean :0.12009 Mean :0.21034 Mean :0.1377745 Mean :0.3260963
## 3rd Qu.:0.14103 3rd Qu.:0.34094 3rd Qu.:0.1795132 3rd Qu.:0.4600063
## Max. :0.42199 Max. :0.99877 Max. :0.9991868 Max. :0.9995977
##
## CAT_INTERMEDIATE_REMOTE CAT_RURAL_REMOTE GVA_MAIN_Public_Sector
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.01078 Mean :0.05805 Mean :0.4892
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## GVA_MAIN_Commercial GVA_MAIN_Other_services GVA_MAIN_Public_Utilities
## Min. :0.000000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.000000 Median :0.0000 Median :0.00000
## Mean :0.008267 Mean :0.2653 Mean :0.01761
## 3rd Qu.:0.000000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.0000 Max. :1.00000
##
## GVA_MAIN_Industry_transformation GVA_MAIN_Industrial IDHM
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.4077
## Median :0.00000 Median :0.00000 Median :0.5563
## Mean :0.04691 Mean :0.00629 Mean :0.5432
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.6757
## Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## COMP_TOT
## Min. :0.0000000
## 1st Qu.:0.0001169
## Median :0.0002941
## Mean :0.0016997
## 3rd Qu.:0.0008356
## Max. :1.0000000
##
Building the Fixed Bandwidth GWR Model
We will be using a fixed bandwidth here because the municipality polygons in Brazil vary so widely.
#bw.fixed <- bw.gwr(formula = GDP_CAPITA ~ POP_WORKING_RATIO + POP_ELDERLY_RATIO + GVA_AGROPEC_RATIO + GVA_INDUSTRY_RATIO + GVA_SERVICES_RATIO + CAT_INTERMEDIATE_REMOTE + CAT_RURAL_REMOTE + GVA_MAIN_Public_Sector + GVA_MAIN_Commercial + GVA_MAIN_Other_services+ GVA_MAIN_Public_Utilities + GVA_MAIN_Industry_transformation + GVA_MAIN_Industrial + IDHM + COMP_TOT, data=Joint_sp, approach= "AIC", kernel="gaussian", adaptive=FALSE, longlat=TRUE)
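# Could not resolve the issue with bw.gwr(), so the call above is left commented out
Since the bandwidth search could not be run, we take the 364 km distance established earlier for the Moran’s I neighbour list as the fixed bandwidth.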
gwr.fixed <- gwr.basic(formula = GDP_CAPITA ~ POP_WORKING_RATIO + POP_ELDERLY_RATIO + GVA_AGROPEC_RATIO + GVA_INDUSTRY_RATIO + GVA_SERVICES_RATIO + CAT_INTERMEDIATE_REMOTE + CAT_RURAL_REMOTE + GVA_MAIN_Public_Sector + GVA_MAIN_Commercial + GVA_MAIN_Other_services+ GVA_MAIN_Public_Utilities + GVA_MAIN_Industry_transformation + GVA_MAIN_Industrial + IDHM + COMP_TOT, data=Joint_sp, bw=364, kernel = 'gaussian', longlat = TRUE)
gwr.fixed
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2020-06-01 00:49:36
## Call:
## gwr.basic(formula = GDP_CAPITA ~ POP_WORKING_RATIO + POP_ELDERLY_RATIO +
## GVA_AGROPEC_RATIO + GVA_INDUSTRY_RATIO + GVA_SERVICES_RATIO +
## CAT_INTERMEDIATE_REMOTE + CAT_RURAL_REMOTE + GVA_MAIN_Public_Sector +
## GVA_MAIN_Commercial + GVA_MAIN_Other_services + GVA_MAIN_Public_Utilities +
## GVA_MAIN_Industry_transformation + GVA_MAIN_Industrial +
## IDHM + COMP_TOT, data = Joint_sp, bw = 364, kernel = "gaussian",
## longlat = TRUE)
##
## Dependent (y) variable: GDP_CAPITA
## Independent variables: POP_WORKING_RATIO POP_ELDERLY_RATIO GVA_AGROPEC_RATIO GVA_INDUSTRY_RATIO GVA_SERVICES_RATIO CAT_INTERMEDIATE_REMOTE CAT_RURAL_REMOTE GVA_MAIN_Public_Sector GVA_MAIN_Commercial GVA_MAIN_Other_services GVA_MAIN_Public_Utilities GVA_MAIN_Industry_transformation GVA_MAIN_Industrial IDHM COMP_TOT
## Number of data points: 5564
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42091 -5367 -884 3055 252671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13922.8 6105.4 -2.280 0.022622 *
## POP_WORKING_RATIO 29840.1 10241.3 2.914 0.003586 **
## POP_ELDERLY_RATIO -37947.9 6994.2 -5.426 6.02e-08 ***
## GVA_AGROPEC_RATIO 9008.2 1287.0 6.999 2.88e-12 ***
## GVA_INDUSTRY_RATIO 22507.7 1633.3 13.780 < 2e-16 ***
## GVA_SERVICES_RATIO 4989.7 1148.6 4.344 1.42e-05 ***
## CAT_INTERMEDIATE_REMOTE 4304.2 1967.3 2.188 0.028725 *
## CAT_RURAL_REMOTE 3815.4 908.6 4.199 2.72e-05 ***
## GVA_MAIN_Public_Sector -11314.6 706.3 -16.019 < 2e-16 ***
## GVA_MAIN_Commercial 21167.5 2293.4 9.230 < 2e-16 ***
## GVA_MAIN_Other_services -9509.9 751.3 -12.657 < 2e-16 ***
## GVA_MAIN_Public_Utilities 19421.8 1734.1 11.200 < 2e-16 ***
## GVA_MAIN_Industry_transformation 11100.7 1220.7 9.094 < 2e-16 ***
## GVA_MAIN_Industrial 15348.7 2654.9 5.781 7.82e-09 ***
## IDHM 38163.3 2351.1 16.232 < 2e-16 ***
## COMP_TOT 46378.4 12894.6 3.597 0.000325 ***
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 14850 on 5548 degrees of freedom
## Multiple R-squared: 0.4671
## Adjusted R-squared: 0.4657
## F-statistic: 324.2 on 15 and 5548 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 1.223844e+12
## Sigma(hat): 14833.64
## AIC: 122702.5
## AICc: 122702.6
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: gaussian
## Fixed bandwidth: 364
## Regression points: the same locations as observations are used.
## Distance metric: Great Circle distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu.
## Intercept -92329.16 -40715.76 -13121.18 6452.30
## POP_WORKING_RATIO -50961.49 9286.58 18723.07 58946.25
## POP_ELDERLY_RATIO -300167.48 -53694.90 -40170.12 -24763.62
## GVA_AGROPEC_RATIO -1419.48 2350.16 8997.05 20989.29
## GVA_INDUSTRY_RATIO -4759.82 10065.73 29567.73 33829.25
## GVA_SERVICES_RATIO -10466.24 1004.91 9619.26 14927.33
## CAT_INTERMEDIATE_REMOTE -2499.98 1220.93 5936.90 19238.03
## CAT_RURAL_REMOTE -2115.10 255.87 2455.88 5966.25
## GVA_MAIN_Public_Sector -18916.73 -10431.07 -8711.67 -8063.23
## GVA_MAIN_Commercial -14577.52 11193.58 16074.03 33082.35
## GVA_MAIN_Other_services -32806.76 -7849.04 -7358.40 -5494.94
## GVA_MAIN_Public_Utilities -21453.80 11024.04 18877.86 28535.35
## GVA_MAIN_Industry_transformation -62782.35 7649.03 13818.96 18882.11
## GVA_MAIN_Industrial -12371.02 5276.13 19584.56 23744.54
## IDHM 1910.09 15644.84 37478.87 48643.06
## COMP_TOT -601785.36 38835.86 57729.70 102521.38
## Max.
## Intercept 26461.8
## POP_WORKING_RATIO 111318.4
## POP_ELDERLY_RATIO 53442.4
## GVA_AGROPEC_RATIO 31812.6
## GVA_INDUSTRY_RATIO 45073.8
## GVA_SERVICES_RATIO 26536.2
## CAT_INTERMEDIATE_REMOTE 40003.4
## CAT_RURAL_REMOTE 15040.3
## GVA_MAIN_Public_Sector 1064.7
## GVA_MAIN_Commercial 56781.7
## GVA_MAIN_Other_services 4793.0
## GVA_MAIN_Public_Utilities 38054.1
## GVA_MAIN_Industry_transformation 27756.8
## GVA_MAIN_Industrial 34343.3
## IDHM 132568.6
## COMP_TOT 1189196.5
## ************************Diagnostic information*************************
## Number of data points: 5564
## Effective number of parameters (2trace(S) - trace(S'S)): 215.9051
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5348.095
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 122049.3
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 121879.2
## Residual sum of squares: 1.032151e+12
## R-square value: 0.5506
## Adjusted R-square value: 0.5324541
##
## ***********************************************************************
## Program stops at: 2020-06-01 00:50:22
Using the maximum nearest-neighbour distance established earlier as the bandwidth, the adjusted R-squared rises from 0.4657 for the global model to 0.5325 and the AICc falls from 122702.6 to 122049.3, so the geographically weighted model fits better overall. However, we still need to check how the local R-squared is distributed geographically below.
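To give a feel for what a fixed Gaussian bandwidth of 364 km implies, the sketch below evaluates the standard Gaussian kernel, w = exp(-0.5 * (d / b)^2), at a few illustrative distances. The distances are arbitrary examples, and we assume this is the form used for kernel = "gaussian"; the package documentation is the authority on the exact definition.

```r
# Standard Gaussian distance-decay kernel (assumed form, see lead-in)
gaussian_weight <- function(d, bw) exp(-0.5 * (d / bw)^2)

bw        <- 364                          # fixed bandwidth in km, as used above
distances <- c(0, 100, 364, 728, 1000)    # illustrative distances in km

w <- gaussian_weight(distances, bw)
data.frame(distance_km = distances, weight = round(w, 3))
```

Under this kernel, observations one bandwidth away still carry roughly 60% of the weight of the regression point itself, so each local regression draws on a fairly wide neighbourhood.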
GWR.sf <- st_as_sf(gwr.fixed$SDF) %>%
st_transform(4674)
GWR.sf.transformed <- st_transform(GWR.sf, 4674)
gwr.fixed.output <- as.data.frame(gwr.fixed$SDF)
Brazil_sig_Indic.sf.fixed <- cbind(Brazil_sig_Indic.sf, as.matrix(gwr.fixed.output))
range(Brazil_sig_Indic.sf.fixed$Local_R2)
## [1] 0.4524265 0.9703265
summary(gwr.fixed$SDF$yhat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -15511 8943 17909 21388 29558 104354
tm_shape(Brazil_sig_Indic.sf.fixed) +
  tm_fill(col = "Local_R2",
          style = "jenks",
          palette = "Greens",
          title = "R-squared Values")
The map does not show any obvious spatial pattern in the local R-squared values, which range from 0.45 to 0.97. The model clearly explains some areas better than others, but it is not clear why this is the case.