Brazil is one of the largest countries which is growing rapidly. However, the wealth is unequally distributed across Brazil. To put it in a simple perspective, half of the municipalities with GDP per capita less than R$16000 and the top 25% municipalities earn R$26155 and above. This report will determine the factors affecting unequal development across Brazil at a muncipality level through a geographically weighed regression model.
packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse','geobr',"lmtest","ggpubr")
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
There are two datasets which are imported for this report.
say some more bull shit.
#brazil_cities <- read.csv2("data/aspatial/BRAZIL_CITIES.csv")
brazil_cities <- read_delim(file = "data/aspatial/BRAZIL_CITIES.csv", delim = ";")
metadata <- read.csv2("data/aspatial/Data_Dictionary.csv")
In order to acquire the municipality map of Brazil, the read_municipality function is used from the “geobr” package. This package is built by the Brazil government, and provides quick and easy access to official spatial data sets of Brazil. As most of the aspatial data is collected in 2016, we will also be using the municipality map from the year 2016 in order to be consistent with the aspatial data.
#municipality <- read_municipality(year=2016)
#municipality <- brazil_sp
Firstly, we will check if any Coordinate Reference System is ai tssigned to the municipality data.
#st_crs(municipality)
As seen in the output above, the CRS assigned is 4674 From epsg.io, the CRS with EPSG code 29101 is one of the most accurate projection system for Brazil. Hence we will convert our municipality data into EPSG 29101 format.
#municipality <- st_transform(municipality,4674)
municipality <- st_read(dsn = "data/muni_sf", layer = "muni_sf")
## Reading layer `muni_sf' from data source `/Users/Amey/Desktop/Y2S2/IS415/assignment4/data/muni_sf' using driver `ESRI Shapefile'
## Simple feature collection with 5572 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -73.99045 ymin: -33.75118 xmax: -28.83594 ymax: 5.271841
## CRS: 4674
We can examine the spatial data from the code below.
tm_shape(municipality29101)+
tm_borders()
In the next section, we will perform data wrangling on aspatial data.
We will first have a look at the main dataset brazil_cities through the summary function.
summary(brazil_cities)
## CITY STATE CAPITAL IBGE_RES_POP
## Length:5573 Length:5573 Min. :0.000000 Min. : 805
## Class :character Class :character 1st Qu.:0.000000 1st Qu.: 5235
## Mode :character Mode :character Median :0.000000 Median : 10934
## Mean :0.004845 Mean : 34278
## 3rd Qu.:0.000000 3rd Qu.: 23424
## Max. :1.000000 Max. :11253503
## NA's :8
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN
## Min. : 805 Min. : 0.0 Min. : 239 Min. : 60
## 1st Qu.: 5230 1st Qu.: 0.0 1st Qu.: 1572 1st Qu.: 874
## Median : 10926 Median : 0.0 Median : 3174 Median : 1846
## Mean : 34200 Mean : 77.5 Mean : 10303 Mean : 8859
## 3rd Qu.: 23390 3rd Qu.: 10.0 3rd Qu.: 6726 3rd Qu.: 4624
## Max. :11133776 Max. :119727.0 Max. :3576148 Max. :3548433
## NA's :8 NA's :8 NA's :10 NA's :10
## IBGE_DU_RURAL IBGE_POP IBGE_1 IBGE_1-4
## Min. : 3 Min. : 174 Min. : 0.0 Min. : 5
## 1st Qu.: 487 1st Qu.: 2801 1st Qu.: 38.0 1st Qu.: 158
## Median : 931 Median : 6170 Median : 92.0 Median : 376
## Mean : 1463 Mean : 27595 Mean : 383.3 Mean : 1544
## 3rd Qu.: 1832 3rd Qu.: 15302 3rd Qu.: 232.0 3rd Qu.: 951
## Max. :33809 Max. :10463636 Max. :129464.0 Max. :514794
## NA's :81 NA's :8 NA's :8 NA's :8
## IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## Min. : 7 Min. : 12 Min. : 94 Min. : 29
## 1st Qu.: 220 1st Qu.: 259 1st Qu.: 1734 1st Qu.: 341
## Median : 516 Median : 588 Median : 3841 Median : 722
## Mean : 2069 Mean : 2381 Mean : 18212 Mean : 3004
## 3rd Qu.: 1300 3rd Qu.: 1478 3rd Qu.: 9628 3rd Qu.: 1724
## Max. :684443 Max. :783702 Max. :7058221 Max. :1293012
## NA's :8 NA's :8 NA's :8 NA's :8
## IBGE_PLANTED_AREA IBGE_CROP_PRODUCTION_$ IDHM Ranking 2010 IDHM
## Min. : 0.0 Min. : 0 Min. : 1 Min. :0.4180
## 1st Qu.: 910.2 1st Qu.: 2326 1st Qu.:1392 1st Qu.:0.5990
## Median : 3471.5 Median : 13846 Median :2783 Median :0.6650
## Mean : 14179.9 Mean : 57384 Mean :2783 Mean :0.6592
## 3rd Qu.: 11194.2 3rd Qu.: 55619 3rd Qu.:4174 3rd Qu.:0.7180
## Max. :1205669.0 Max. :3274885 Max. :5565 Max. :0.8620
## NA's :3 NA's :3 NA's :8 NA's :8
## IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG
## Min. :0.4000 Min. :0.6720 Min. :0.2070 Min. :-72.92
## 1st Qu.:0.5720 1st Qu.:0.7690 1st Qu.:0.4900 1st Qu.:-50.87
## Median :0.6540 Median :0.8080 Median :0.5600 Median :-46.52
## Mean :0.6429 Mean :0.8016 Mean :0.5591 Mean :-46.23
## 3rd Qu.:0.7070 3rd Qu.:0.8360 3rd Qu.:0.6310 3rd Qu.:-41.40
## Max. :0.8910 Max. :0.8940 Max. :0.8250 Max. :-32.44
## NA's :8 NA's :8 NA's :8 NA's :9
## LAT ALT PAY_TV FIXED_PHONES
## Min. :-33.688 Min. : 0.0 Min. : 1 Min. : 3
## 1st Qu.:-22.838 1st Qu.: 169.8 1st Qu.: 88 1st Qu.: 119
## Median :-18.089 Median : 406.5 Median : 247 Median : 327
## Mean :-16.444 Mean : 893.8 Mean : 3094 Mean : 6567
## 3rd Qu.: -8.489 3rd Qu.: 628.9 3rd Qu.: 815 3rd Qu.: 1151
## Max. : 4.585 Max. :874579.0 Max. :2047668 Max. :5543127
## NA's :9 NA's :9 NA's :3 NA's :3
## AREA REGIAO_TUR CATEGORIA_TUR ESTIMATED_POP
## Min. : 3.57 Length:5573 Length:5573 Min. : 786
## 1st Qu.: 204.44 Class :character Class :character 1st Qu.: 5454
## Median : 416.59 Mode :character Mode :character Median : 11590
## Mean : 1517.44 Mean : 37432
## 3rd Qu.: 1026.57 3rd Qu.: 25296
## Max. :159533.33 Max. :12176866
## NA's :3 NA's :3
## RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## Length:5573 Min. : 0 Min. : 1 Min. : 2
## Class :character 1st Qu.: 4189 1st Qu.: 1726 1st Qu.: 10112
## Mode :character Median : 20426 Median : 7424 Median : 31211
## Mean : 47271 Mean : 175928 Mean : 489451
## 3rd Qu.: 51227 3rd Qu.: 41022 3rd Qu.: 115406
## Max. :1402282 Max. :63306755 Max. :464656988
## NA's :3 NA's :3 NA's :3
## GVA_PUBLIC GVA_TOTAL TAXES GDP
## Min. : 7 Min. : 17 Min. : -14159 Min. : 15
## 1st Qu.: 17267 1st Qu.: 42253 1st Qu.: 1305 1st Qu.: 43709
## Median : 35866 Median : 119492 Median : 5100 Median : 125153
## Mean : 123768 Mean : 832987 Mean : 118864 Mean : 954584
## 3rd Qu.: 89245 3rd Qu.: 313963 3rd Qu.: 22197 3rd Qu.: 329539
## Max. :41902893 Max. :569910503 Max. :117125387 Max. :687035890
## NA's :3 NA's :3 NA's :3 NA's :3
## POP_GDP GDP_CAPITA GVA_MAIN MUN_EXPENDIT
## Min. : 815 Min. : 3191 Length:5573 Min. :1.421e+06
## 1st Qu.: 5483 1st Qu.: 9058 Class :character 1st Qu.:1.573e+07
## Median : 11578 Median : 15870 Mode :character Median :2.746e+07
## Mean : 36998 Mean : 21126 Mean :1.043e+08
## 3rd Qu.: 25085 3rd Qu.: 26155 3rd Qu.:5.666e+07
## Max. :12038175 Max. :314638 Max. :4.577e+10
## NA's :3 NA's :3 NA's :1492
## COMP_TOT COMP_A COMP_B COMP_C
## Min. : 6.0 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 68.0 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 3.00
## Median : 162.0 Median : 2.00 Median : 0.000 Median : 11.00
## Mean : 906.8 Mean : 18.25 Mean : 1.852 Mean : 73.44
## 3rd Qu.: 448.0 3rd Qu.: 8.00 3rd Qu.: 2.000 3rd Qu.: 39.00
## Max. :530446.0 Max. :1948.00 Max. :274.000 Max. :31566.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_D COMP_E COMP_F COMP_G
## Min. : 0.0000 Min. : 0.000 Min. : 0.00 Min. : 1.0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 32.0
## Median : 0.0000 Median : 0.000 Median : 4.00 Median : 74.5
## Mean : 0.4262 Mean : 2.029 Mean : 43.26 Mean : 348.0
## 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 15.00 3rd Qu.: 199.0
## Max. :332.0000 Max. :657.000 Max. :25222.00 Max. :150633.0
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_H COMP_I COMP_J COMP_K
## Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 7 Median : 7.00 Median : 1.00 Median : 0.00
## Mean : 41 Mean : 55.88 Mean : 24.74 Mean : 15.55
## 3rd Qu.: 25 3rd Qu.: 24.00 3rd Qu.: 5.00 3rd Qu.: 2.00
## Max. :19515 Max. :29290.00 Max. :38720.00 Max. :23738.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_L COMP_M COMP_N COMP_O
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 0.00 Median : 4.00 Median : 4.0 Median : 2.000
## Mean : 15.14 Mean : 51.29 Mean : 83.7 Mean : 3.269
## 3rd Qu.: 3.00 3rd Qu.: 13.00 3rd Qu.: 14.0 3rd Qu.: 3.000
## Max. :14003.00 Max. :49181.00 Max. :76757.0 Max. :204.000
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_P COMP_Q COMP_R COMP_S
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 1.00 1st Qu.: 0.00 1st Qu.: 5.00
## Median : 6.00 Median : 3.00 Median : 2.00 Median : 12.00
## Mean : 30.96 Mean : 34.15 Mean : 12.18 Mean : 51.61
## 3rd Qu.: 17.00 3rd Qu.: 12.00 3rd Qu.: 6.00 3rd Qu.: 31.00
## Max. :16030.00 Max. :22248.00 Max. :6687.00 Max. :24832.00
## NA's :3 NA's :3 NA's :3 NA's :3
## COMP_T COMP_U HOTELS BEDS
## Min. :0 Min. : 0.00000 Min. : 1.000 Min. : 2.0
## 1st Qu.:0 1st Qu.: 0.00000 1st Qu.: 1.000 1st Qu.: 40.0
## Median :0 Median : 0.00000 Median : 1.000 Median : 82.0
## Mean :0 Mean : 0.05027 Mean : 3.131 Mean : 257.5
## 3rd Qu.:0 3rd Qu.: 0.00000 3rd Qu.: 3.000 3rd Qu.: 200.0
## Max. :0 Max. :123.00000 Max. :97.000 Max. :13247.0
## NA's :3 NA's :3 NA's :4686 NA's :4686
## Pr_Agencies Pu_Agencies Pr_Bank Pu_Bank
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.00
## 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:1.00
## Median : 1.000 Median : 2.000 Median : 1.000 Median :2.00
## Mean : 3.383 Mean : 2.829 Mean : 1.312 Mean :1.58
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.:2.00
## Max. :1693.000 Max. :626.000 Max. :83.000 Max. :8.00
## NA's :2231 NA's :2231 NA's :2231 NA's :2231
## Pr_Assets Pu_Assets Cars Motorcycles
## Min. :0.000e+00 Min. :0.000e+00 Min. : 2 Min. : 4
## 1st Qu.:0.000e+00 1st Qu.:4.047e+07 1st Qu.: 602 1st Qu.: 591
## Median :3.231e+07 Median :1.339e+08 Median : 1438 Median : 1285
## Mean :9.180e+09 Mean :6.005e+09 Mean : 9859 Mean : 4879
## 3rd Qu.:1.148e+08 3rd Qu.:4.970e+08 3rd Qu.: 4086 3rd Qu.: 3294
## Max. :1.947e+13 Max. :8.016e+12 Max. :5740995 Max. :1134570
## NA's :2231 NA's :2231 NA's :11 NA's :11
## Wheeled_tractor UBER MAC WAL-MART
## Min. : 0.000 Min. :1 Min. : 1.000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:1 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 0.000 Median :1 Median : 2.000 Median : 1.000
## Mean : 5.754 Mean :1 Mean : 4.277 Mean : 2.059
## 3rd Qu.: 1.000 3rd Qu.:1 3rd Qu.: 3.000 3rd Qu.: 1.750
## Max. :3236.000 Max. :1 Max. :130.000 Max. :26.000
## NA's :11 NA's :5448 NA's :5407 NA's :5471
## POST_OFFICES
## Min. : 1.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 2.081
## 3rd Qu.: 2.000
## Max. :225.000
## NA's :120
The first column “CITY” refers to the municipality. As seen in the above summary, there are 79 variables in relation to each municipality. There are several pressing issues regarding this dataset which needs to be fixed before analysis is performed. Firstly, it is very evident that all the variable names are not self explainatory from their variable names, and hence we will be renaming them so that they can be identified easily while performing analysis. The description of the data can be found in the second file imported(Data_dictionary.csv) which briefly describes what each variable refers to along with the source and year of data. Secondly, there are several variables which consists of NA values, which need to be dealt with. As we will be creating an explanatory model to explain factors affecting the GDP per capita at the municipality level, we will be removing variables which are not related to GDP per capita. After that, we will further perform data cleaning in order to deal with NA values.
Before we perform spatial analysis, we will convert the dataframe object into a sf object from the
brazil_cities <- brazil_cities %>% drop_na(LONG)
brazil.sf <- st_as_sf(brazil_cities,
coords = c("LONG", "LAT"),
crs=4326)%>%
st_transform(4674)
st_crs(brazil.sf)
## Coordinate Reference System:
## User input: EPSG:4674
## wkt:
## GEOGCS["SIRGAS 2000",
## DATUM["Sistema_de_Referencia_Geocentrico_para_las_AmericaS_2000",
## SPHEROID["GRS 1980",6378137,298.257222101,
## AUTHORITY["EPSG","7019"]],
## TOWGS84[0,0,0,0,0,0,0],
## AUTHORITY["EPSG","6674"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4674"]]
head(brazil.sf)
## Simple feature collection with 6 features and 79 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -49.44055 ymin: -19.15585 xmax: -39.04755 ymax: -1.72347
## CRS: EPSG:4674
## # A tibble: 6 x 80
## CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR… IBGE_RES_POP_ES… IBGE_DU
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abad… GO 0 6876 6876 0 2137
## 2 Abad… MG 0 6704 6704 0 2328
## 3 Abad… GO 0 15757 15609 148 4655
## 4 Abae… MG 0 22690 22690 0 7694
## 5 Abae… PA 0 141100 141040 60 31061
## 6 Abai… CE 0 10496 10496 0 2791
## # … with 73 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## # IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## # `IBGE_10-14` <dbl>, `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>,
## # IBGE_PLANTED_AREA <dbl>, `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking
## # 2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## # IDHM_Educacao <dbl>, ALT <dbl>, PAY_TV <dbl>, FIXED_PHONES <dbl>,
## # AREA <dbl>, REGIAO_TUR <chr>, CATEGORIA_TUR <chr>, ESTIMATED_POP <dbl>,
## # RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## # GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## # GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## # MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## # COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## # COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## # COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## # COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, HOTELS <dbl>,
## # BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>, Pr_Bank <dbl>,
## # Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## # Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## # `WAL-MART` <dbl>, POST_OFFICES <dbl>, geometry <POINT [°]>
plot(brazil.sf$geometry)
data <- st_join(municipality,brazil.sf)
tmap_mode("plot")
tm_shape(data)+
tm_polygons("GDP_CAPITA",border.alpha = 0.7,legend.hist=TRUE)+
tm_layout(legend.outside = TRUE)
Firstly, we will remove the columns with a lot of NA values as these data will not be a good fit for our model. We will re-examine the data from the summary function through which we understand the range of the data as well as the number of NA values in each column.
summary(data)
## code_mn name_mn cod_stt abbrv_s
## Min. :1100015 Bom Jesus : 5 31 : 853 MG : 853
## 1st Qu.:2512175 São Domingos: 5 35 : 645 SP : 645
## Median :3146354 Bonito : 4 43 : 499 RS : 499
## Mean :3253966 Planalto : 4 29 : 417 BA : 417
## 3rd Qu.:4119264 Santa Helena: 4 41 : 399 PR : 399
## Max. :5300108 Santa Inês : 4 42 : 295 SC : 295
## (Other) :5546 (Other):2464 (Other):2464
## CITY STATE CAPITAL IBGE_RES_POP
## Length:5572 Length:5572 Min. :0.000000 Min. : 805
## Class :character Class :character 1st Qu.:0.000000 1st Qu.: 5236
## Mode :character Mode :character Median :0.000000 Median : 10936
## Mean :0.004853 Mean : 34288
## 3rd Qu.:0.000000 3rd Qu.: 23468
## Max. :1.000000 Max. :11253503
## NA's :8 NA's :9
## IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN
## Min. : 805 Min. : 0.00 Min. : 239 Min. : 60
## 1st Qu.: 5230 1st Qu.: 0.00 1st Qu.: 1573 1st Qu.: 874
## Median : 10934 Median : 0.00 Median : 3176 Median : 1850
## Mean : 34210 Mean : 77.53 Mean : 10306 Mean : 8862
## 3rd Qu.: 23394 3rd Qu.: 10.00 3rd Qu.: 6727 3rd Qu.: 4627
## Max. :11133776 Max. :119727.00 Max. :3576148 Max. :3548433
## NA's :9 NA's :9 NA's :11 NA's :11
## IBGE_DU_RURAL IBGE_POP IBGE_1 IBGE_1-4
## Min. : 3 Min. : 174 Min. : 0.0 Min. : 5.0
## 1st Qu.: 487 1st Qu.: 2802 1st Qu.: 38.0 1st Qu.: 158.0
## Median : 931 Median : 6177 Median : 92.0 Median : 377.0
## Mean : 1463 Mean : 27604 Mean : 383.4 Mean : 1545.1
## 3rd Qu.: 1832 3rd Qu.: 15304 3rd Qu.: 232.0 3rd Qu.: 951.5
## Max. :33809 Max. :10463636 Max. :129464.0 Max. :514794.0
## NA's :82 NA's :9 NA's :9 NA's :9
## IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## Min. : 7.0 Min. : 12 Min. : 94 Min. : 29
## 1st Qu.: 220.5 1st Qu.: 260 1st Qu.: 1735 1st Qu.: 341
## Median : 516.0 Median : 589 Median : 3842 Median : 723
## Mean : 2070.0 Mean : 2382 Mean : 18218 Mean : 3005
## 3rd Qu.: 1300.5 3rd Qu.: 1478 3rd Qu.: 9630 3rd Qu.: 1724
## Max. :684443.0 Max. :783702 Max. :7058221 Max. :1293012
## NA's :9 NA's :9 NA's :9 NA's :9
## IBGE_PLANTED_AREA IBGE_CROP_PRODUCTION_$ IDHM Ranking 2010 IDHM
## Min. : 0.0 Min. : 0 Min. : 1 Min. :0.4180
## 1st Qu.: 911.5 1st Qu.: 2330 1st Qu.:1392 1st Qu.:0.5990
## Median : 3473.0 Median : 13845 Median :2782 Median :0.6650
## Mean : 14172.6 Mean : 57362 Mean :2783 Mean :0.6592
## 3rd Qu.: 11172.5 3rd Qu.: 55579 3rd Qu.:4174 3rd Qu.:0.7180
## Max. :1205669.0 Max. :3274885 Max. :5565 Max. :0.8620
## NA's :9 NA's :9 NA's :8 NA's :8
## IDHM_Renda IDHM_Longevidade IDHM_Educacao ALT
## Min. :0.4000 Min. :0.6720 Min. :0.2070 Min. : 0.0
## 1st Qu.:0.5720 1st Qu.:0.7690 1st Qu.:0.4900 1st Qu.: 169.8
## Median :0.6540 Median :0.8080 Median :0.5600 Median : 406.5
## Mean :0.6429 Mean :0.8016 Mean :0.5591 Mean : 893.8
## 3rd Qu.:0.7070 3rd Qu.:0.8360 3rd Qu.:0.6310 3rd Qu.: 628.9
## Max. :0.8910 Max. :0.8940 Max. :0.8250 Max. :874579.0
## NA's :8 NA's :8 NA's :8 NA's :8
## PAY_TV FIXED_PHONES AREA REGIAO_TUR
## Min. : 1 Min. : 3 Min. : 3.57 Length:5572
## 1st Qu.: 88 1st Qu.: 119 1st Qu.: 204.49 Class :character
## Median : 247 Median : 328 Median : 415.87 Mode :character
## Mean : 3097 Mean : 6574 Mean : 1515.73
## 3rd Qu.: 816 3rd Qu.: 1151 3rd Qu.: 1026.16
## Max. :2047668 Max. :5543127 Max. :159533.33
## NA's :8 NA's :8 NA's :10
## CATEGORIA_TUR ESTIMATED_POP RURAL_URBAN GVA_AGROPEC
## Length:5572 Min. : 786 Length:5572 Min. : 0
## Class :character 1st Qu.: 5454 Class :character 1st Qu.: 4190
## Mode :character Median : 11591 Mode :character Median : 20430
## Mean : 37463 Mean : 47268
## 3rd Qu.: 25306 3rd Qu.: 51216
## Max. :12176866 Max. :1402282
## NA's :8 NA's :9
## GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC GVA_TOTAL
## Min. : 1 Min. : 2 Min. : 7 Min. : 17
## 1st Qu.: 1725 1st Qu.: 10117 1st Qu.: 17256 1st Qu.: 42345
## Median : 7425 Median : 31216 Median : 35865 Median : 119504
## Mean : 176063 Mean : 490028 Mean : 123879 Mean : 833879
## 3rd Qu.: 40995 3rd Qu.: 115583 3rd Qu.: 89339 3rd Qu.: 314089
## Max. :63306755 Max. :464656988 Max. :41902893 Max. :569910503
## NA's :9 NA's :9 NA's :9 NA's :9
## TAXES GDP POP_GDP GDP_CAPITA
## Min. : -14159 Min. : 15 Min. : 815 Min. : 3191
## 1st Qu.: 1302 1st Qu.: 43676 1st Qu.: 5488 1st Qu.: 9062
## Median : 5107 Median : 125111 Median : 11584 Median : 15866
## Mean : 119000 Mean : 955528 Mean : 37035 Mean : 21093
## 3rd Qu.: 22208 3rd Qu.: 329361 3rd Qu.: 25108 3rd Qu.: 26155
## Max. :117125387 Max. :687035890 Max. :12038175 Max. :314638
## NA's :9 NA's :9 NA's :9 NA's :9
## GVA_MAIN MUN_EXPENDIT COMP_TOT COMP_A
## Length:5572 Min. :1.421e+06 Min. : 6.0 Min. : 0.00
## Class :character 1st Qu.:1.574e+07 1st Qu.: 68.0 1st Qu.: 1.00
## Mode :character Median :2.746e+07 Median : 162.0 Median : 2.00
## Mean :1.044e+08 Mean : 907.8 Mean : 18.27
## 3rd Qu.:5.679e+07 3rd Qu.: 449.5 3rd Qu.: 8.00
## Max. :4.577e+10 Max. :530446.0 Max. :1948.00
## NA's :1497 NA's :9 NA's :9
## COMP_B COMP_C COMP_D COMP_E
## Min. : 0.000 Min. : 0.00 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 0.000 Median : 11.00 Median : 0.0000 Median : 0.000
## Mean : 1.853 Mean : 73.53 Mean : 0.4264 Mean : 2.031
## 3rd Qu.: 2.000 3rd Qu.: 39.00 3rd Qu.: 0.0000 3rd Qu.: 1.000
## Max. :274.000 Max. :31566.00 Max. :332.0000 Max. :657.000
## NA's :9 NA's :9 NA's :9 NA's :9
## COMP_F COMP_G COMP_H COMP_I
## Min. : 0.0 Min. : 1.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 32.0 1st Qu.: 1.00 1st Qu.: 2.00
## Median : 4.0 Median : 75.0 Median : 7.00 Median : 7.00
## Mean : 43.3 Mean : 348.4 Mean : 41.04 Mean : 55.94
## 3rd Qu.: 15.0 3rd Qu.: 200.0 3rd Qu.: 25.00 3rd Qu.: 24.00
## Max. :25222.0 Max. :150633.0 Max. :19515.00 Max. :29290.00
## NA's :9 NA's :9 NA's :9 NA's :9
## COMP_J COMP_K COMP_L COMP_M
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 1.00
## Median : 1.00 Median : 0.00 Median : 0.00 Median : 4.00
## Mean : 24.77 Mean : 15.57 Mean : 15.16 Mean : 51.35
## 3rd Qu.: 5.00 3rd Qu.: 2.00 3rd Qu.: 3.00 3rd Qu.: 13.00
## Max. :38720.00 Max. :23738.00 Max. :14003.00 Max. :49181.00
## NA's :9 NA's :9 NA's :9 NA's :9
## COMP_N COMP_O COMP_P COMP_Q
## Min. : 0.0 Min. : 1.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 2.000 1st Qu.: 2.00 1st Qu.: 1.00
## Median : 4.0 Median : 2.000 Median : 6.00 Median : 3.00
## Mean : 83.8 Mean : 3.271 Mean : 30.99 Mean : 34.19
## 3rd Qu.: 14.0 3rd Qu.: 3.000 3rd Qu.: 17.00 3rd Qu.: 12.00
## Max. :76757.0 Max. :204.000 Max. :16030.00 Max. :22248.00
## NA's :9 NA's :9 NA's :9 NA's :9
## COMP_R COMP_S COMP_T COMP_U
## Min. : 0.00 Min. : 0.00 Min. :0 Min. : 0.00000
## 1st Qu.: 0.00 1st Qu.: 5.00 1st Qu.:0 1st Qu.: 0.00000
## Median : 2.00 Median : 12.00 Median :0 Median : 0.00000
## Mean : 12.19 Mean : 51.66 Mean :0 Mean : 0.05033
## 3rd Qu.: 6.00 3rd Qu.: 31.00 3rd Qu.:0 3rd Qu.: 0.00000
## Max. :6687.00 Max. :24832.00 Max. :0 Max. :123.00000
## NA's :9 NA's :9 NA's :9 NA's :9
## HOTELS BEDS Pr_Agencies Pu_Agencies
## Min. : 1.000 Min. : 2.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 40.0 1st Qu.: 0.000 1st Qu.: 1.00
## Median : 1.000 Median : 82.0 Median : 1.000 Median : 2.00
## Mean : 3.131 Mean : 257.5 Mean : 3.384 Mean : 2.83
## 3rd Qu.: 3.000 3rd Qu.: 200.0 3rd Qu.: 2.000 3rd Qu.: 2.00
## Max. :97.000 Max. :13247.0 Max. :1693.000 Max. :626.00
## NA's :4685 NA's :4685 NA's :2231 NA's :2231
## Pr_Bank Pu_Bank Pr_Assets Pu_Assets
## Min. : 0.000 Min. :0.00 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 0.000 1st Qu.:1.00 1st Qu.:0.000e+00 1st Qu.:4.048e+07
## Median : 1.000 Median :2.00 Median :3.234e+07 Median :1.339e+08
## Mean : 1.312 Mean :1.58 Mean :9.183e+09 Mean :6.007e+09
## 3rd Qu.: 2.000 3rd Qu.:2.00 3rd Qu.:1.149e+08 3rd Qu.:4.976e+08
## Max. :83.000 Max. :8.00 Max. :1.947e+13 Max. :8.016e+12
## NA's :2231 NA's :2231 NA's :2231 NA's :2231
## Cars Motorcycles Wheeled_tractor UBER
## Min. : 2 Min. : 4 Min. : 0.000 Min. :1
## 1st Qu.: 602 1st Qu.: 591 1st Qu.: 0.000 1st Qu.:1
## Median : 1440 Median : 1286 Median : 0.000 Median :1
## Mean : 9869 Mean : 4883 Mean : 5.759 Mean :1
## 3rd Qu.: 4091 3rd Qu.: 3299 3rd Qu.: 1.000 3rd Qu.:1
## Max. :5740995 Max. :1134570 Max. :3236.000 Max. :1
## NA's :16 NA's :16 NA's :16 NA's :5447
## MAC WAL-MART POST_OFFICES geometry
## Min. : 1.000 Min. : 1.000 Min. : 1.000 MULTIPOLYGON :5572
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 epsg:4674 : 0
## Median : 2.000 Median : 1.000 Median : 1.000 +proj=long...: 0
## Mean : 4.277 Mean : 2.059 Mean : 2.081
## 3rd Qu.: 3.000 3rd Qu.: 1.750 3rd Qu.: 2.000
## Max. :130.000 Max. :26.000 Max. :225.000
## NA's :5406 NA's :5470 NA's :126
As seen above, there are various variables which have majority of its values as NA. We will be removing these variables from our dataset from the code below.
data$HOTELS<-NULL
data$`WAL-MART`<-NULL
data$MAC<-NULL
data$UBER<-NULL
data$Pu_Assets<-NULL
data$Pr_Assets<-NULL
data$Pu_Bank<-NULL
data$Pr_Bank<-NULL
data$Pu_Agencies<-NULL
data$BEDS <- NULL
data$Pr_Agencies <- NULL
Our dependent variable for the analysis will be GDP per capita. Hence, we should keep variables which are potential factors of the GDP per capita in order to treat them as independent variables. Hence, we will now be eliminating variables which are not related to GDP per capita. Furthermore, there are various columns which are derived from other columns. For example, GDP per capita is found by diving GDP by the population. This makes GDP and population data directly correlated with our dependent variable. If one column is directly related to other columns, there will be a problem of multicollinearity. To avoid this problem, we will be eliminating such data.
data <- data %>%
mutate(FOREIGNERS_PERCENTAGE=(IBGE_RES_POP_ESTR/IBGE_RES_POP)*100)
data$IBGE_RES_POP_ESTR <- NULL
data$IBGE_RES_POP <- NULL
data$IBGE_RES_POP_BRAS <- NULL
data <- data %>%
mutate(DU_URBAN_PERCENTAGE = (IBGE_DU_URBAN/IBGE_DU)*100)
data$IBGE_DU_RURAL <- NULL
data$IBGE_DU_URBAN <- NULL
data$IBGE_DU <- NULL
data$DU_RURAL_PERCENTAGE<-NULL
IBGE_60+. The resident population has been divided into multiple age groups. It is very obvious that all the age groups do not contribute to the GDP/Capita. Only the working population contributes to the GDP per capita. Hence, we will be only considering keeping the IBGE_15-59 which consists of number of people aged between 15 and 59. One limitation is that the active population is usually above the age of 22 in Brazil, however given the dataset, we will use IBGE_15-59 due to the lack of option. As the GDP per capita is sensitive to the number of people, we will use percentage of active population instead of using the direct count. Do note that this population is only consists of the urban population.data <- data %>%
mutate(ACTIVE_PERCENTAGE=(`IBGE_15-59`/IBGE_POP)*100)
data$IBGE_1 <- NULL
data$`IBGE_1-4`<-NULL
data$`IBGE_5-9` <- NULL
data$`IBGE_10-14` <- NULL
data$`IBGE_60+` <- NULL
data$IBGE_PLANTED_AREA <- data$IBGE_PLANTED_AREA * 10000
data$AREA <- data$AREA * 1000000
data <- data %>%
mutate(PLANTED_AREA_PERCENTAGE=(IBGE_PLANTED_AREA/AREA)*100)
data$IBGE_PLANTED_AREA <- NULL
data$`IDHM Ranking 2010`<-NULL
data$IDHM_Longevidade <- NULL
data$IDHM_Educacao <- NULL
data$IDHM_Renda <- NULL
data$PAY_TV <- NULL
data$FIXED_PHONES <- NULL
data$REGIAO_TUR <- NULL
data$CATEGORIA_TUR <- NULL
unique(data$RURAL_URBAN)
## [1] "Intermediário Adjacente" "Urbano"
## [3] "Rural Adjacente" "Rural Remoto"
## [5] "Intermediário Remoto" NA
data$Intermediário_Adjacente<-ifelse(data$RURAL_URBAN == "Intermediário Adjacente",
c(1), c(0))
data$Intermediário_Remoto<-ifelse(data$RURAL_URBAN == "Intermediário Remoto",
c(1), c(0))
data$Rural_Adjacente<-ifelse(data$RURAL_URBAN == "Rural Adjacente",
c(1), c(0))
data$Rural_Remoto<-ifelse(data$RURAL_URBAN == "Rural Remoto",
c(1), c(0))
data$RURAL_URBAN <- NULL
data <- data%>%
mutate(GVA_AGROPEC_Percentage=(GVA_AGROPEC/` GVA_TOTAL `)*100) %>%
mutate(GVA_INDUSTRY_Percentage=(GVA_INDUSTRY/` GVA_TOTAL `)*100) %>%
mutate(GVA_SERVICES_Percentage=(GVA_SERVICES/` GVA_TOTAL `)*100) %>%
mutate(GVA_PUBLIC_Percentage=(GVA_PUBLIC/` GVA_TOTAL `)*100)
data$GVA_AGROPEC <- NULL
data$GVA_INDUSTRY <- NULL
data$GVA_SERVICES <- NULL
data$GVA_PUBLIC <- NULL
data$` GVA_TOTAL ` <- NULL
data <- data %>%
mutate(TAX_CAPITA=TAXES/POP_GDP)
data$TAXES <- NULL
data$MUN_EXPENDIT <- NULL
data <- data %>%
mutate(COMP_AGRICULTURE=(COMP_A/COMP_TOT)*100)%>%
mutate(COMP_INDUSTRY=((COMP_A+COMP_B+COMP_F+COMP_G)/COMP_TOT)*100)%>%
mutate(COMP_SERVICES=((COMP_I+COMP_J+COMP_K+COMP_L+COMP_M+COMP_N+COMP_T+COMP_U)/COMP_TOT)*100)%>%
mutate(COMP_PUBLIC=((COMP_D+COMP_E+COMP_H+COMP_O+COMP_P+COMP_Q+COMP_R+COMP_S)/COMP_TOT)*100)
data$COMP_A <- NULL
data$COMP_B <- NULL
data$COMP_C <- NULL
data$COMP_D <- NULL
data$COMP_E <- NULL
data$COMP_F <- NULL
data$COMP_G <- NULL
data$COMP_H <- NULL
data$COMP_I <- NULL
data$COMP_J <- NULL
data$COMP_K <- NULL
data$COMP_L <- NULL
data$COMP_M <- NULL
data$COMP_N <- NULL
data$COMP_O <- NULL
data$COMP_P <- NULL
data$COMP_Q <- NULL
data$COMP_R <- NULL
data$COMP_S <- NULL
data$COMP_T <- NULL
data$COMP_U <- NULL
data$COMP_TOT <- NULL
data <- data %>%
mutate(CAR_RATIO=Cars/POP_GDP) %>%
mutate(MOTORCYCLE_RATIO=Motorcycles/POP_GDP)
data$POP_GDP <- NULL
data$Cars <- NULL
data$Motorcycles <- NULL
data$Wheeled_tractor <- NULL
data$POST_OFFICES<- NULL
data$AREA<- NULL
data$GVA_MAIN<- NULL
data$`IBGE_15-59`<-NULL
data$`IBGE_CROP_PRODUCTION_$`<-NULL
There are two columns for both municipality name and state abbreviation. We will remove the duplicated columns.
The variable ESTIMATED_POP is dated as of 2016. As we do not require population as our independent variable, we will be removing it as well.
Lastly, GDP is directly related to GDP_CAPITA, and hence we cannot keep it as our independent variable.
data$CITY <- NULL
data$STATE<- NULL
data$code_mn<- NULL
data$cod_stt<- NULL
data$abbrv_s<- NULL
data$ESTIMATED_POP <- NULL
data$IBGE_POP <- NULL
data$GDP <- NULL
glimpse(data)
## Rows: 5,572
## Columns: 25
## $ name_mn <fct> Alta Floresta D'oeste, Ariquemes, Cabixi, Cac…
## $ CAPITAL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ IDHM <dbl> 0.640, 0.700, 0.650, 0.718, 0.690, 0.685, 0.6…
## $ ALT <dbl> 337.74, 138.69, 236.06, 177.45, 262.81, 419.0…
## $ GDP_CAPITA <dbl> 18732.17, 20618.18, 21202.96, 22130.78, 22721…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-62.19465 -..., …
## $ FOREIGNERS_PERCENTAGE <dbl> 0.00000000, 0.12506502, 0.04752099, 0.0483620…
## $ DU_URBAN_PERCENTAGE <dbl> 59.18087, 84.34058, 44.82062, 80.72780, 85.31…
## $ ACTIVE_PERCENTAGE <dbl> 65.36511, 66.60103, 64.78245, 68.11618, 65.25…
## $ PLANTED_AREA_PERCENTAGE <dbl> 2.58779148, 1.60733932, 31.25955796, 4.793178…
## $ Intermediário_Adjacente <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Intermediário_Remoto <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Rural_Adjacente <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Rural_Remoto <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ GVA_AGROPEC_Percentage <dbl> 3.654743e+01, 7.374052e+00, 4.621590e+01, 1.0…
## $ GVA_INDUSTRY_Percentage <dbl> 6.878626e-03, 1.795178e+01, 3.616531e+00, 1.3…
## $ GVA_SERVICES_Percentage <dbl> 2.517734e+01, 4.473012e-02, 1.884524e+01, 4.8…
## $ GVA_PUBLIC_Percentage <dbl> 3.139653e+01, 2.994422e+01, 3.132234e+01, 2.7…
## $ TAX_CAPITA <dbl> 0.9090476751, 2.0406429893, 0.0008761329, 2.2…
## $ COMP_AGRICULTURE <dbl> 0.5905512, 1.0805943, 0.0000000, 0.7042254, 1…
## $ COMP_INDUSTRY <dbl> 59.05512, 53.35434, 55.00000, 50.97508, 52.24…
## $ COMP_SERVICES <dbl> 10.629921, 18.640252, 15.000000, 19.609967, 1…
## $ COMP_PUBLIC <dbl> 22.83465, 17.96488, 18.33333, 19.66414, 24.01…
## $ CAR_RATIO <dbl> 0.09660472, 0.16537924, 0.14708221, 0.1998361…
## $ MOTORCYCLE_RATIO <dbl> 0.3633655, 0.4028859, 0.2922563, 0.4332533, 0…
We will first check if our data has any NA values.
sum(is.na(data))
## [1] 223
As there are NA values present, we will drop the rows with NA values.
data <- data %>% drop_na()
We have created our final dataset. Now, we will extract the dependent variable and plot a correlation plot of all the indepedent variables in order to investigate for any multicollinearity.
GDP_CAPITA <- data$GDP_CAPITA
data$GDP_CAPITA <- NULL
data$GDP_CAPITA <- GDP_CAPITA
dependent_variables <- data
st_geometry(dependent_variables)<-NULL
corrplot(cor(dependent_variables[,2:23]), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper",
number.cex = 0.7)
There are multiple arguments passed through the corrplot() function in order to plot the correlation matrix. We have chosen “AOE” as the ordering method for our correlation matrix which orders the variables in their respective order of the eigenvectors. We have passed “number” as our argument for method, which allows us to visualise the numerical vlaue of the correlation coefficients.
We will classify two variables to be significantly correlated if their correlation constant is above 0.80. As seen in the diagram above, there is only one pair of variables which is significantly correlated – Car ratio and IDHM with R value of 0.86. Between the two variables, we will eliminate car ratio as IHDM consists information about life expectancy, income, and education which is more crucial for our model. We will now perform exploratory data analysis on both our depedent and indepedent variables.
In this section, each variable will be analysed by plotting histograms and box plots. Through these diagrams, we can understand the distribution of the data as well as understand the distribution of outliers.
The dependent variable for the regression model is going to be GDP per capita. The unit of GDP per capita is measured in 1000 real$. Real$ is the official currency of Brazil.
ggplot(data=dependent_variables,
aes_string(x= "GDP_CAPITA")) +
geom_histogram(bins=20,
color="black",
fill="light blue")
The distribution above reveals a right skewed distribution. This means that most of the municipalities in Brazil have a low GDP per capita, with a few exceptions which have extremely high GDP per capita as compared to others. Statistically, the skewed dsitribution can be normalised by using log transformation. The code chunk below is used to derive a new variable called GDP_CAPITA.log by using a log transformation on the variable GDP_CAPITA.
dependent_variables$GDP_CAPITA.log <- log(dependent_variables$GDP_CAPITA)
data$GDP_CAPITA.log <- log(dependent_variables$GDP_CAPITA)
We will now plot a histogram with the transformed data.
ggplot(data=dependent_variables,
aes_string(x= "GDP_CAPITA.log")) +
geom_histogram(bins=20,
color="black",
fill="light blue")
With the transformation, the distribution is less skewed and hence will help us in the regression analysis. Moving on, we will perform exploratory data analysis on the independent variables. Note that both GDP_CAPITA and GDP_CAPITA.log will be fitted in the multiple regression model and will be analysed in order to find out which model performs better.
Firsly, we will create two functions which will help us plot histograms and bar plots of the independent variables. Note that we will need to visualise dummy variables through bar plots.
plot_data <- function(maindata,attribute){
return(ggplot(data=maindata,
aes_string(x= attribute)) +
geom_histogram(bins=20,
color="black",
fill="light blue"))
}
bar_plot <- function(maindata,attribute){
return(ggplot(data=maindata, aes_string(x=attribute))+
geom_bar(fill="light blue"))
}
All the plots are stored in variables through the code below. This is done by using the functions made in the code above.
CAPITAL <- bar_plot(dependent_variables,"CAPITAL")
IDHM <- plot_data(dependent_variables,"IDHM")
ALT <- plot_data(dependent_variables,"ALT")
FOREIGNERS_PERCENTAGE <- plot_data(dependent_variables,"FOREIGNERS_PERCENTAGE")
DU_URBAN_PERCENTAGE <- plot_data(dependent_variables,"DU_URBAN_PERCENTAGE")
ACTIVE_PERCENTAGE <- plot_data(dependent_variables,"ACTIVE_PERCENTAGE")
PLANTED_AREA_PERCENTAGE <- plot_data(dependent_variables,"PLANTED_AREA_PERCENTAGE")
Intermediário_Adjacente <- bar_plot(dependent_variables,"Intermediário_Adjacente")
Intermediário_Remoto <- bar_plot(dependent_variables,"Intermediário_Remoto")
Rural_Adjacente <- bar_plot(dependent_variables,"Rural_Adjacente")
Rural_Remoto <- bar_plot(dependent_variables,"Rural_Remoto")
GVA_AGROPEC_Percentage <- plot_data(dependent_variables,"GVA_AGROPEC_Percentage")
GVA_INDUSTRY_Percentage <- plot_data(dependent_variables,"GVA_INDUSTRY_Percentage")
GVA_SERVICES_Percentage <- plot_data(dependent_variables,"GVA_SERVICES_Percentage")
GVA_PUBLIC_Percentage <- plot_data(dependent_variables,"GVA_PUBLIC_Percentage")
TAX_CAPITA <- plot_data(dependent_variables,"TAX_CAPITA")
COMP_AGRICULTURE <- plot_data(dependent_variables,"COMP_AGRICULTURE")
COMP_INDUSTRY <- plot_data(dependent_variables,"COMP_INDUSTRY")
COMP_SERVICES <- plot_data(dependent_variables,"COMP_SERVICES")
COMP_PUBLIC <- plot_data(dependent_variables,"COMP_PUBLIC")
MOTORCYCLE_RATIO <- plot_data(dependent_variables,"MOTORCYCLE_RATIO")
Now, we will be plotting all the histograms in a tabular form so that it is more presentable.
ggarrange(CAPITAL, IDHM, ALT, FOREIGNERS_PERCENTAGE,
DU_URBAN_PERCENTAGE, ACTIVE_PERCENTAGE, PLANTED_AREA_PERCENTAGE, Intermediário_Remoto, Rural_Adjacente, Rural_Remoto, GVA_AGROPEC_Percentage, GVA_INDUSTRY_Percentage,GVA_SERVICES_Percentage,GVA_PUBLIC_Percentage,TAX_CAPITA,COMP_AGRICULTURE,COMP_INDUSTRY,COMP_SERVICES,COMP_PUBLIC,
MOTORCYCLE_RATIO,
ncol = 3,
nrow = 3)
## $`1`
##
## $`2`
##
## $`3`
##
## attr(,"class")
## [1] "list" "ggarrange"
We can divide the analysis into two parts:
(1) Dummy variables: Bar graphs are plotted which show the proportion of data which have 0 (FALSE) and 1 (TRUE). We do not need to transform this dataset as it is prepared to be fit in a regression model.
(2) Normal variables: Many of these variables are right skewed while others are normally distributed. As we have seen in the indepedent variable exploratory analysis and in the general scene of Brazil, there is prominent inequality. Hence, we will not standardise any of the data so that we do not lose any information.
In order to build a regression model, we will be using the lm function which is used to fit linear models. Through this function, we will also find out the dependent variables which are statistically significant. We will fit the model according to both GDP_CAPITA and GDP_CAPITA.log in order to analyse the differences between the two models.
gdp_capita.lmr <- lm(formula = GDP_CAPITA ~ CAPITAL + IDHM + ALT+ FOREIGNERS_PERCENTAGE +
DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Intermediário_Remoto+ Rural_Adjacente + Rural_Remoto + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC + MOTORCYCLE_RATIO, data = dependent_variables)
summary(gdp_capita.lmr)
##
## Call:
## lm(formula = GDP_CAPITA ~ CAPITAL + IDHM + ALT + FOREIGNERS_PERCENTAGE +
## DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE +
## Intermediário_Adjacente + Intermediário_Remoto + Rural_Adjacente +
## Rural_Remoto + GVA_AGROPEC_Percentage + GVA_INDUSTRY_Percentage +
## GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA +
## COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC +
## MOTORCYCLE_RATIO, data = dependent_variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66845 -5008 -1709 1936 290381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.954e+04 6.346e+03 -10.958 < 2e-16 ***
## CAPITAL -3.177e+03 2.545e+03 -1.248 0.212025
## IDHM 6.528e+04 4.757e+03 13.724 < 2e-16 ***
## ALT -3.649e-03 1.006e-02 -0.363 0.716880
## FOREIGNERS_PERCENTAGE -3.672e+01 3.226e+02 -0.114 0.909397
## DU_URBAN_PERCENTAGE -5.992e+01 1.250e+01 -4.793 1.69e-06 ***
## ACTIVE_PERCENTAGE 4.590e+02 8.733e+01 5.256 1.53e-07 ***
## PLANTED_AREA_PERCENTAGE 1.213e+02 7.074e+00 17.144 < 2e-16 ***
## Intermediário_Adjacente 1.623e+03 6.316e+02 2.570 0.010209 *
## Intermediário_Remoto 8.797e+03 1.742e+03 5.051 4.54e-07 ***
## Rural_Adjacente 3.197e+03 5.646e+02 5.661 1.58e-08 ***
## Rural_Remoto 7.116e+03 9.084e+02 7.834 5.63e-15 ***
## GVA_AGROPEC_Percentage 1.089e-01 2.240e-02 4.860 1.21e-06 ***
## GVA_INDUSTRY_Percentage 4.130e-01 3.081e-02 13.404 < 2e-16 ***
## GVA_SERVICES_Percentage -1.218e-01 2.006e-02 -6.071 1.36e-09 ***
## GVA_PUBLIC_Percentage -7.486e-02 1.751e-02 -4.275 1.94e-05 ***
## TAX_CAPITA 4.065e+03 6.925e+01 58.709 < 2e-16 ***
## COMP_AGRICULTURE 1.186e+02 2.149e+01 5.522 3.51e-08 ***
## COMP_INDUSTRY 1.152e+02 3.233e+01 3.564 0.000369 ***
## COMP_SERVICES 2.602e+02 3.977e+01 6.543 6.58e-11 ***
## COMP_PUBLIC 1.200e+02 3.255e+01 3.687 0.000229 ***
## MOTORCYCLE_RATIO -7.288e+03 2.820e+03 -2.584 0.009793 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12880 on 5530 degrees of freedom
## Multiple R-squared: 0.596, Adjusted R-squared: 0.5944
## F-statistic: 388.4 on 21 and 5530 DF, p-value: < 2.2e-16
The above model suggests we need to eliminate three variables when working with confidence interval of 95%. This model has an R-Square value of 0.5944 which signifies that 59.44% of the variation in the dependent variable is due to the variation in independent variables.
Similar to the ethod used above, we will now use logarithmic GDP_CAPITA values to build our regression model.
gdp_capita.lmr <- lm(formula = GDP_CAPITA.log ~ CAPITAL + IDHM + ALT+ FOREIGNERS_PERCENTAGE +
DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Intermediário_Remoto+ Rural_Adjacente + Rural_Remoto + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC + MOTORCYCLE_RATIO, data = dependent_variables)
summary(gdp_capita.lmr)
##
## Call:
## lm(formula = GDP_CAPITA.log ~ CAPITAL + IDHM + ALT + FOREIGNERS_PERCENTAGE +
## DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE +
## Intermediário_Adjacente + Intermediário_Remoto + Rural_Adjacente +
## Rural_Remoto + GVA_AGROPEC_Percentage + GVA_INDUSTRY_Percentage +
## GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA +
## COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC +
## MOTORCYCLE_RATIO, data = dependent_variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87518 -0.21594 -0.04313 0.16704 2.98407
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.635e+00 1.761e-01 32.005 < 2e-16 ***
## CAPITAL -1.164e-01 7.061e-02 -1.648 0.09943 .
## IDHM 4.486e+00 1.320e-01 33.994 < 2e-16 ***
## ALT -8.491e-08 2.792e-07 -0.304 0.76105
## FOREIGNERS_PERCENTAGE 8.219e-03 8.952e-03 0.918 0.35857
## DU_URBAN_PERCENTAGE -2.398e-03 3.469e-04 -6.913 5.26e-12 ***
## ACTIVE_PERCENTAGE 1.847e-02 2.423e-03 7.625 2.86e-14 ***
## PLANTED_AREA_PERCENTAGE 5.311e-03 1.963e-04 27.059 < 2e-16 ***
## Intermediário_Adjacente 5.147e-02 1.752e-02 2.937 0.00332 **
## Intermediário_Remoto 3.605e-01 4.832e-02 7.460 9.97e-14 ***
## Rural_Adjacente 6.986e-02 1.567e-02 4.460 8.37e-06 ***
## Rural_Remoto 2.603e-01 2.520e-02 10.327 < 2e-16 ***
## GVA_AGROPEC_Percentage 5.990e-06 6.215e-07 9.638 < 2e-16 ***
## GVA_INDUSTRY_Percentage 1.024e-05 8.548e-07 11.981 < 2e-16 ***
## GVA_SERVICES_Percentage -3.256e-06 5.567e-07 -5.850 5.21e-09 ***
## GVA_PUBLIC_Percentage -4.120e-06 4.858e-07 -8.481 < 2e-16 ***
## TAX_CAPITA 7.148e-02 1.921e-03 37.205 < 2e-16 ***
## COMP_AGRICULTURE 3.451e-03 5.961e-04 5.790 7.45e-09 ***
## COMP_INDUSTRY -4.033e-03 8.970e-04 -4.496 7.07e-06 ***
## COMP_SERVICES 8.580e-03 1.103e-03 7.775 8.91e-15 ***
## COMP_PUBLIC -2.845e-03 9.032e-04 -3.149 0.00164 **
## MOTORCYCLE_RATIO -1.471e-02 7.825e-02 -0.188 0.85092
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3573 on 5530 degrees of freedom
## Multiple R-squared: 0.7271, Adjusted R-squared: 0.7261
## F-statistic: 701.7 on 21 and 5530 DF, p-value: < 2.2e-16
The above model suggests we need to eliminate three variables when working with confidence interval of 95%. This model has an R-Square value of 0.7261 which signifies that 72.61% of the variation in the dependent variable is due to the variation in independent variables. As the R-square value of the lorithmic model is significantly better than that of the model used with raw scores, we will use the logarithmic model. Note that this model is also chosen as the linearity assumptions will be more accurately met as compared to other model.
We will be using confidence interval of 95%. Therefore, with reference to the summary above, it is clear that not all the independent variables are statistically significant. The insignificant variables are:
(1) CAPITAL
(2) ALT
(3) FOREIGNERS_PERCENTAGE
(5) MOTORCYCLE_RATIO
We will revise the model by removing those variables which are not statistically significant.
gdp_capita.lmr <- lm(formula = GDP_CAPITA.log ~ IDHM +
DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Intermediário_Remoto+ Rural_Adjacente + Rural_Remoto + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC , data = dependent_variables)
summary(gdp_capita.lmr)
##
## Call:
## lm(formula = GDP_CAPITA.log ~ IDHM + DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE +
## PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente + Intermediário_Remoto +
## Rural_Adjacente + Rural_Remoto + GVA_AGROPEC_Percentage +
## GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage +
## TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES +
## COMP_PUBLIC, data = dependent_variables)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.87458 -0.21720 -0.04254 0.16651 2.98463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.655e+00 1.749e-01 32.334 < 2e-16 ***
## IDHM 4.481e+00 1.297e-01 34.557 < 2e-16 ***
## DU_URBAN_PERCENTAGE -2.397e-03 3.457e-04 -6.933 4.59e-12 ***
## ACTIVE_PERCENTAGE 1.828e-02 2.402e-03 7.610 3.21e-14 ***
## PLANTED_AREA_PERCENTAGE 5.338e-03 1.951e-04 27.356 < 2e-16 ***
## Intermediário_Adjacente 5.203e-02 1.750e-02 2.973 0.00296 **
## Intermediário_Remoto 3.670e-01 4.787e-02 7.666 2.08e-14 ***
## Rural_Adjacente 7.049e-02 1.563e-02 4.510 6.63e-06 ***
## Rural_Remoto 2.612e-01 2.517e-02 10.377 < 2e-16 ***
## GVA_AGROPEC_Percentage 5.998e-06 6.212e-07 9.656 < 2e-16 ***
## GVA_INDUSTRY_Percentage 1.029e-05 8.539e-07 12.047 < 2e-16 ***
## GVA_SERVICES_Percentage -3.279e-06 5.559e-07 -5.898 3.90e-09 ***
## GVA_PUBLIC_Percentage -4.118e-06 4.857e-07 -8.478 < 2e-16 ***
## TAX_CAPITA 7.152e-02 1.919e-03 37.275 < 2e-16 ***
## COMP_AGRICULTURE 3.478e-03 5.843e-04 5.953 2.80e-09 ***
## COMP_INDUSTRY -4.097e-03 8.951e-04 -4.577 4.81e-06 ***
## COMP_SERVICES 8.438e-03 1.080e-03 7.812 6.66e-15 ***
## COMP_PUBLIC -2.925e-03 8.982e-04 -3.256 0.00114 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3573 on 5534 degrees of freedom
## Multiple R-squared: 0.7269, Adjusted R-squared: 0.7261
## F-statistic: 866.6 on 17 and 5534 DF, p-value: < 2.2e-16
The adjusted R-square values remain the same at 0.7261 after corecting the model.
There are multiple assumptions made while fitting a linear model. These assumptions are:
(1) Linearity assumption
(2) Normality assumption
(3) Homoscedasticity
(4) No correlation between residuals
We will be performing tests on each of these assumptions to ensure that our model is accuarte.
We will first be checking the linearity assumption. The assumption is that the relationship between the depedent variable and indepedent variables is approximately linear. We can check this assumption through two plots.
The first graph we are plotting is the Actual vs Fitted plot.
ols_plot_obs_fit(gdp_capita.lmr, print_plot = TRUE)
It is evident that almost all the points are close to the regressed diagonal line. The exception lies with points representing higher GDP_CAPITA as most of them are further away from the regressed line. However, as our sample size is more than 5000, the proportion of data which is away from the regressed line, indicating that an approximate linear relationship exists between GDP_CAPITA and the indepedent variables.
To be sure of our assumption, we can plot another graph, which is the Residual vs Fitted plot.
ols_plot_resid_fit(gdp_capita.lmr, print_plot = TRUE)
From this graph we can derive few things:
(1) The relationship between dependent variable and indepedent variables represent a linear relationship as most of the residual values are spread around the horizontal line.
(2) The residuals form an approximate band around the 0 line indicating homogenity of error variance.
(3) There are few data points which are away from the horizontal line, indicating that there are few outliers present in the data.
Multicollinearity occurs when independent variables in a regression model are correlated. If multicolinearity exists between our variables, we are failing to the assumption that all our independent variables are truly independent and not related to one another. Multicollinearity can reduce the precision of the estimated coefficients, which weakens the statistical power of our regression model. Therefore, we will check if multicolinearity exists from VIF (Variance Inflation Factor) scores. We will keep the threshold for the VIF scores as 10, i.e. if any VIF score is more than 10, we will eliminate that variable.
ols_vif_tol(gdp_capita.lmr)
## Variables Tolerance VIF
## 1 IDHM 0.2644779 3.781035
## 2 DU_URBAN_PERCENTAGE 0.4190134 2.386558
## 3 ACTIVE_PERCENTAGE 0.3881498 2.576325
## 4 PLANTED_AREA_PERCENTAGE 0.8090001 1.236094
## 5 Intermediário_Adjacente 0.6929264 1.443155
## 6 Intermediário_Remoto 0.9384841 1.065548
## 7 Rural_Adjacente 0.3796542 2.633976
## 8 Rural_Remoto 0.6663301 1.500758
## 9 GVA_AGROPEC_Percentage 0.6817505 1.466812
## 10 GVA_INDUSTRY_Percentage 0.7070687 1.414290
## 11 GVA_SERVICES_Percentage 0.4817096 2.075940
## 12 GVA_PUBLIC_Percentage 0.5765119 1.734570
## 13 TAX_CAPITA 0.8012259 1.248087
## 14 COMP_AGRICULTURE 0.7250452 1.379224
## 15 COMP_INDUSTRY 0.1971123 5.073251
## 16 COMP_SERVICES 0.3543089 2.822396
## 17 COMP_PUBLIC 0.2439240 4.099638
Since the VIF of the independent variables are less than 10, we can safely conclude that there are no sign of multicollinearity among the independent variables.
ols_plot_resid_hist(gdp_capita.lmr)
As seen above, the residual data resembles a normal distribution. Most of the statistical tests which check for normality in one-sample data do not allow a sample size more than 5000. The only test which allows a sample size more than 5000 is the Kolmogorov-Smirnov test, however, this test is not designed for such a big dataset. To confirm that our data does not violdate the normality assumption, we will be plotting a density plot and QQ plot to determine normality in the residual data.
# Density plot
ggdensity(gdp_capita.lmr$residuals, fill = "lightgray")
From the diagram above, we are able to notice a grpah which resembles a normal distribution for our residual data. To double confirm our findings, we will also be plotting the QQ plot from the code below.
# QQ plot
ggqqplot(gdp_capita.lmr$residuals)
The slope of the line is roughly 45 degrees and almost all the points are located in close proximity of the slope. The exception is for larger data. Given the large dataset which contains more than 5000 points, having little variation in the outliers is absolutely understandable. Hence, from the graph we can conclude that our residual data does not violate the normality assumption.
In this section, we will be checking if our residual data from the linear regression is spatially correlated. Firstly, we will add the residual data in our sf data and then convert the sf data into spatial polygon data frame format because spdep functions can only process spatial data objects.
data$residuals <- gdp_capita.lmr$residuals
data.sp <- as_Spatial(data)
data.sp
## class : SpatialPolygonsDataFrame
## features : 5552
## extent : -73.99045, -28.83594, -33.75118, 4.884623 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## variables : 26
## names : name_mn, CAPITAL, IDHM, ALT, FOREIGNERS_PERCENTAGE, DU_URBAN_PERCENTAGE, ACTIVE_PERCENTAGE, PLANTED_AREA_PERCENTAGE, Intermediário_Adjacente, Intermediário_Remoto, Rural_Adjacente, Rural_Remoto, GVA_AGROPEC_Percentage, GVA_INDUSTRY_Percentage, GVA_SERVICES_Percentage, ...
## min values : Abadia De Goiás, 0, 0.418, 0, 0, 4.55261775520925, 47.1631205673759, 0, 0, 0, 0, 0, 0, 0.00142261767428109, 0.00670618854896865, ...
## max values : Zortéa, 1, 0.862, 874579, 37.7218184890992, 100, 74.4760130414532, 185.942941120092, 1, 1, 1, 1, 76088.5597874933, 87480.2651309903, 83553.5525749495, ...
To visualise the residuals, we will use tmap package to display the distribution of the residuals.
tm_shape(data)+
tm_polygons("residuals", border.col = "black",border.alpha = 0.2)
From the figure above, we can infer that spatial autocorrelation of residuals exist. In order to find out if the autocorrelation is statistically significant, we will perform global moran’s I test. To perform this test, we need a weight matrix.
We will be creating a distance based weight matrix. Firstly, we will calculate the distance for which all the municipalities have at least one neighbour.
coords <- coordinates(data.sp)
k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1,coords,longlat = TRUE))
summary(k1dists)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.386 11.778 15.881 21.366 23.963 369.663
As seen above, 370 km is the cut off distance such that all the municipalities have at least one neighbour. We will compute the distance-based weight matrix by using dnearneigh() function of spdep.
nb <- dnearneigh(coordinates(data.sp), 0, 370, longlat = TRUE)
summary(nb)
## Neighbour list object:
## Number of regions: 5552
## Number of nonzero links: 3021758
## Percentage nonzero weights: 9.80304
## Average number of links: 544.2648
## Link number distribution:
##
## 1 3 7 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 1 1 1 5 1 3 4 10 9 4 7 5 8 8 6 3 2 7 6 8
## 26 27 28 29 30 31 32 33 34 36 37 38 39 40 41 42 43 44 45 46
## 5 5 7 6 3 4 4 1 2 1 1 2 3 4 5 3 2 2 5 3
## 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
## 8 4 4 6 7 2 8 8 9 8 5 13 6 4 2 1 2 2 4 2
## 67 70 71 72 73 74 75 76 77 78 80 81 82 83 84 85 86 87 88 89
## 4 3 5 4 4 2 4 5 5 1 5 3 3 1 2 4 1 3 2 3
## 90 91 92 94 95 97 98 100 101 102 103 104 105 106 107 108 109 110 111 112
## 5 1 1 2 3 3 1 3 3 2 1 2 1 1 6 2 1 1 5 3
## 113 114 115 116 117 118 119 120 121 123 124 125 126 128 129 130 131 132 133 134
## 2 5 5 4 1 1 5 2 2 1 3 5 1 1 2 3 6 1 4 4
## 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
## 1 4 3 4 2 2 6 3 8 7 10 5 4 4 6 7 6 8 4 7
## 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174
## 4 7 3 7 3 2 6 14 3 5 5 5 8 5 4 4 5 3 7 2
## 175 177 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 198
## 5 2 2 4 2 3 1 1 1 4 1 2 3 5 8 2 5 3 2 3
## 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 218 219 220
## 3 3 4 1 2 1 2 2 5 3 2 4 3 8 4 4 6 4 8 1
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 3 7 3 4 4 5 3 7 4 4 2 5 1 2 6 7 5 2 7 5
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 2 4 3 9 6 10 6 3 6 6 1 9 3 7 8 6 3 2 4 7
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 2 5 8 9 2 5 8 4 5 7 5 10 6 9 7 6 6 5 9 9
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 5 7 4 7 5 4 5 5 9 5 5 6 5 1 7 3 2 3 7 4
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 7 4 6 4 4 7 5 4 6 2 7 7 5 2 4 8 8 2 7 5
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 9 8 5 6 6 6 5 2 4 10 4 3 9 3 8 6 5 6 6 4
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 5 5 3 5 7 5 2 7 6 7 6 6 2 2 3 4 5 7 6 9
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 4 3 8 4 8 7 4 7 8 4 4 7 6 6 6 10 8 4 7 9
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 9 10 7 9 9 14 14 7 4 14 3 10 3 7 6 3 9 5 10 7
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 4 9 6 7 3 4 6 3 1 8 3 6 3 6 7 1 5 6 5 5
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 7 6 11 3 2 1 3 7 3 6 4 3 3 3 4 4 4 3 2 3
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 3 6 7 3 1 3 3 6 3 1 2 3 6 6 4 4 5 5 5 5
## 461 462 463 464 465 466 467 468 469 470 471 473 474 475 476 477 478 479 481 482
## 5 1 3 3 5 3 6 1 2 4 3 2 4 4 2 5 4 6 3 4
## 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 503
## 4 2 6 1 7 7 6 3 5 3 5 4 2 3 6 3 8 3 4 2
## 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523
## 5 4 2 6 5 3 3 10 2 8 5 2 7 6 6 3 4 4 6 3
## 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543
## 6 4 2 5 5 3 1 6 6 8 7 2 8 5 3 6 1 5 5 5
## 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563
## 3 9 4 5 4 5 7 3 8 6 8 7 15 4 6 5 4 13 3 6
## 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583
## 10 7 8 5 9 5 5 7 5 11 7 9 11 10 5 9 11 6 4 4
## 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603
## 8 8 8 9 7 7 7 5 6 7 6 7 5 7 9 8 8 9 10 9
## 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623
## 11 8 7 3 5 9 4 10 4 3 6 6 5 6 7 4 10 10 6 8
## 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643
## 9 12 9 7 9 11 8 4 6 8 10 9 5 9 13 10 9 8 7 11
## 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663
## 8 8 10 5 3 12 11 12 3 11 11 9 9 9 11 11 10 6 6 13
## 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683
## 14 7 6 8 9 8 7 6 7 7 6 9 13 14 10 10 14 9 12 13
## 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703
## 7 6 7 7 16 6 15 8 9 7 11 8 8 11 12 5 13 9 13 9
## 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723
## 16 8 7 9 4 11 11 7 8 12 10 6 16 15 7 15 6 9 8 8
## 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743
## 10 8 7 10 10 8 8 7 14 13 13 12 6 12 9 5 4 12 8 6
## 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763
## 7 16 7 7 11 15 11 8 11 6 8 13 11 6 7 9 5 7 6 10
## 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783
## 16 8 9 7 10 12 5 14 4 5 13 10 11 6 4 7 6 14 5 5
## 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803
## 12 4 12 6 8 12 3 10 9 7 9 8 2 8 9 4 6 8 8 6
## 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823
## 13 9 7 9 3 8 12 10 10 4 10 6 4 9 10 7 8 4 20 6
## 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843
## 8 12 11 7 7 12 7 9 6 6 7 7 7 8 5 5 6 4 8 10
## 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863
## 11 10 6 4 7 6 9 13 13 4 11 12 7 6 11 7 3 9 7 9
## 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883
## 8 5 8 7 13 12 6 9 2 7 6 5 10 6 12 11 9 4 5 4
## 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903
## 8 8 12 8 4 4 5 3 8 6 7 6 6 6 5 6 4 7 6 9
## 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923
## 7 4 12 9 5 6 5 5 9 5 7 7 6 5 10 5 3 4 4 2
## 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 944
## 10 6 5 8 4 9 3 5 5 7 5 5 3 5 5 3 2 1 2 2
## 945 946 947 948 951 954 955 959 963 968 970 971
## 1 1 1 1 4 2 1 1 1 1 1 1
## 1 least connected region:
## 1517 with 1 link
## 1 most connected region:
## 4148 with 971 links
Next, nb2listw() of spdep packge will be used to convert the output neighbours lists (i.e. nb) into a spatial weights.
nb_lw <- nb2listw(nb, style = 'W')
summary(nb_lw)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 5552
## Number of nonzero links: 3021758
## Percentage nonzero weights: 9.80304
## Average number of links: 544.2648
## Link number distribution:
##
## 1 3 7 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 1 1 1 5 1 3 4 10 9 4 7 5 8 8 6 3 2 7 6 8
## 26 27 28 29 30 31 32 33 34 36 37 38 39 40 41 42 43 44 45 46
## 5 5 7 6 3 4 4 1 2 1 1 2 3 4 5 3 2 2 5 3
## 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
## 8 4 4 6 7 2 8 8 9 8 5 13 6 4 2 1 2 2 4 2
## 67 70 71 72 73 74 75 76 77 78 80 81 82 83 84 85 86 87 88 89
## 4 3 5 4 4 2 4 5 5 1 5 3 3 1 2 4 1 3 2 3
## 90 91 92 94 95 97 98 100 101 102 103 104 105 106 107 108 109 110 111 112
## 5 1 1 2 3 3 1 3 3 2 1 2 1 1 6 2 1 1 5 3
## 113 114 115 116 117 118 119 120 121 123 124 125 126 128 129 130 131 132 133 134
## 2 5 5 4 1 1 5 2 2 1 3 5 1 1 2 3 6 1 4 4
## 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
## 1 4 3 4 2 2 6 3 8 7 10 5 4 4 6 7 6 8 4 7
## 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174
## 4 7 3 7 3 2 6 14 3 5 5 5 8 5 4 4 5 3 7 2
## 175 177 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 198
## 5 2 2 4 2 3 1 1 1 4 1 2 3 5 8 2 5 3 2 3
## 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 218 219 220
## 3 3 4 1 2 1 2 2 5 3 2 4 3 8 4 4 6 4 8 1
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 3 7 3 4 4 5 3 7 4 4 2 5 1 2 6 7 5 2 7 5
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 2 4 3 9 6 10 6 3 6 6 1 9 3 7 8 6 3 2 4 7
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 2 5 8 9 2 5 8 4 5 7 5 10 6 9 7 6 6 5 9 9
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 5 7 4 7 5 4 5 5 9 5 5 6 5 1 7 3 2 3 7 4
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 7 4 6 4 4 7 5 4 6 2 7 7 5 2 4 8 8 2 7 5
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 9 8 5 6 6 6 5 2 4 10 4 3 9 3 8 6 5 6 6 4
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 5 5 3 5 7 5 2 7 6 7 6 6 2 2 3 4 5 7 6 9
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 4 3 8 4 8 7 4 7 8 4 4 7 6 6 6 10 8 4 7 9
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 9 10 7 9 9 14 14 7 4 14 3 10 3 7 6 3 9 5 10 7
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 4 9 6 7 3 4 6 3 1 8 3 6 3 6 7 1 5 6 5 5
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 7 6 11 3 2 1 3 7 3 6 4 3 3 3 4 4 4 3 2 3
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 3 6 7 3 1 3 3 6 3 1 2 3 6 6 4 4 5 5 5 5
## 461 462 463 464 465 466 467 468 469 470 471 473 474 475 476 477 478 479 481 482
## 5 1 3 3 5 3 6 1 2 4 3 2 4 4 2 5 4 6 3 4
## 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 503
## 4 2 6 1 7 7 6 3 5 3 5 4 2 3 6 3 8 3 4 2
## 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523
## 5 4 2 6 5 3 3 10 2 8 5 2 7 6 6 3 4 4 6 3
## 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543
## 6 4 2 5 5 3 1 6 6 8 7 2 8 5 3 6 1 5 5 5
## 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563
## 3 9 4 5 4 5 7 3 8 6 8 7 15 4 6 5 4 13 3 6
## 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583
## 10 7 8 5 9 5 5 7 5 11 7 9 11 10 5 9 11 6 4 4
## 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603
## 8 8 8 9 7 7 7 5 6 7 6 7 5 7 9 8 8 9 10 9
## 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623
## 11 8 7 3 5 9 4 10 4 3 6 6 5 6 7 4 10 10 6 8
## 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643
## 9 12 9 7 9 11 8 4 6 8 10 9 5 9 13 10 9 8 7 11
## 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663
## 8 8 10 5 3 12 11 12 3 11 11 9 9 9 11 11 10 6 6 13
## 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683
## 14 7 6 8 9 8 7 6 7 7 6 9 13 14 10 10 14 9 12 13
## 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703
## 7 6 7 7 16 6 15 8 9 7 11 8 8 11 12 5 13 9 13 9
## 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723
## 16 8 7 9 4 11 11 7 8 12 10 6 16 15 7 15 6 9 8 8
## 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743
## 10 8 7 10 10 8 8 7 14 13 13 12 6 12 9 5 4 12 8 6
## 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763
## 7 16 7 7 11 15 11 8 11 6 8 13 11 6 7 9 5 7 6 10
## 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783
## 16 8 9 7 10 12 5 14 4 5 13 10 11 6 4 7 6 14 5 5
## 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803
## 12 4 12 6 8 12 3 10 9 7 9 8 2 8 9 4 6 8 8 6
## 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823
## 13 9 7 9 3 8 12 10 10 4 10 6 4 9 10 7 8 4 20 6
## 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843
## 8 12 11 7 7 12 7 9 6 6 7 7 7 8 5 5 6 4 8 10
## 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863
## 11 10 6 4 7 6 9 13 13 4 11 12 7 6 11 7 3 9 7 9
## 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883
## 8 5 8 7 13 12 6 9 2 7 6 5 10 6 12 11 9 4 5 4
## 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903
## 8 8 12 8 4 4 5 3 8 6 7 6 6 6 5 6 4 7 6 9
## 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923
## 7 4 12 9 5 6 5 5 9 5 7 7 6 5 10 5 3 4 4 2
## 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 944
## 10 6 5 8 4 9 3 5 5 7 5 5 3 5 5 3 2 1 2 2
## 945 946 947 948 951 954 955 959 963 968 970 971
## 1 1 1 1 4 2 1 1 1 1 1 1
## 1 least connected region:
## 1517 with 1 link
## 1 most connected region:
## 4148 with 971 links
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 5552 30824704 5552 45.51765 22335.81
Next, lm.morantest() of spdep package will be used to perform Moran’s I test for residual spatial autocorrelation. The null hypothesis for this test is that the residuals are randomly distributed among the municipalities of Brazil. This test will be conducted at a 95% confidence interval.
lm.morantest(gdp_capita.lmr, nb_lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = GDP_CAPITA.log ~ IDHM + DU_URBAN_PERCENTAGE +
## ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente +
## Intermediário_Remoto + Rural_Adjacente + Rural_Remoto +
## GVA_AGROPEC_Percentage + GVA_INDUSTRY_Percentage +
## GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA +
## COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC, data =
## dependent_variables)
## weights: nb_lw
##
## Moran I statistic standard deviate = 98.893, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 1.095106e-01 -5.332931e-04 1.238219e-06
The p-value is 2.2e-16, which is less than 0.05. Hence, we reject the null hypothesis. We can conclude that the residual data are not randomly distributed . The moran’s I value is 0.02, which is very close to 0 indicating that the magnitude of clustering is low, even though we reject the null hypothesis. The findings from this test imply that there are confounding variables which affect the GDP per ceapita in the municipalities in Brazil. Even though the test indicates that the relationships in the model is stationary, this is acceptable as there are numerous more reasons which are culturally, politically, and socially linked factors which we do not have in this dataset.
One of the important assumptions of linear regression is that, there should be no heteroscedasticity of residuals. In simpler terms, this means that the variance of residuals should not increase with fitted values of independent variable. It is customary to check for heteroscedasticity of residuals once you build the linear regression model. The reason is, we want to check if the model thus built is unable to explain some pattern in the response variable (Y), that eventually shows up in the residuals.
We will perform Breusch Pagan Test for Heteroskedasticity.
Null hypothesis: The variance is constant, i.e. the data is there is no heteroscedasticity.
Alternate hypothesis: The variance is not constant, i.e. the data is there is heteroscedasticity.
Significance level: 0.05
ols_test_breusch_pagan(gdp_capita.lmr)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ------------------------------------------
## Response : GDP_CAPITA.log
## Variables: fitted values of GDP_CAPITA.log
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 243.6201
## Prob > Chi2 = 6.38839e-55
As the p-value for this test is less than 0.05, we reject the null hypothesis. We can conclude that there is heteroscedasticity in our data with 95% confidence. This test suggests that there is a lot of noise in our data which leads to inconsistent variation of data. This is not good for our regression model, however, given the number of variables and observations, building a perfect model is very hard. Therefore, this is one of the drawbacks of the regression model.
Most of the assumptions of linear regression models have been met as seen above. From the tests above, we were not able to prove if the data is non-stationary as autocorrelation was found when Moran’s test was conducted. Furthermore, through Breusch Pagan Test, we were also able to find out that heterescedacity is present in our regression model which is a drawback for our model.
The drawbacks of this regression model will be overcomed by making a Geographically Weighted Regression Model. This model will account for the spatial location for each municipality when factoring the role each independent variable play.
To analyse our multiple regression model, we will be analysing how each independent variable affects GDP per capita. This relationship can be found by analysing the coefficients of each of the independent variables. These coefficients can be obtained from the code below.
coefficients <- as.data.frame(gdp_capita.lmr$coefficients)
colnames(coefficients) <- c("Value")
print(coefficients)
## Value
## (Intercept) 5.655193e+00
## IDHM 4.481006e+00
## DU_URBAN_PERCENTAGE -2.396971e-03
## ACTIVE_PERCENTAGE 1.827565e-02
## PLANTED_AREA_PERCENTAGE 5.337640e-03
## Intermediário_Adjacente 5.203362e-02
## Intermediário_Remoto 3.669998e-01
## Rural_Adjacente 7.048606e-02
## Rural_Remoto 2.611680e-01
## GVA_AGROPEC_Percentage 5.998346e-06
## GVA_INDUSTRY_Percentage 1.028729e-05
## GVA_SERVICES_Percentage -3.278772e-06
## GVA_PUBLIC_Percentage -4.117840e-06
## TAX_CAPITA 7.152464e-02
## COMP_AGRICULTURE 3.477837e-03
## COMP_INDUSTRY -4.097323e-03
## COMP_SERVICES 8.437904e-03
## COMP_PUBLIC -2.924581e-03
There are a total of 17 variables in our model. Out of the seventeen variable, 5 variables are negative. Negative coefficients tell us that there is an negative relationship between the independent variable and the dependent variables. Note that the dependent variable, GDP_CAPITA has the unit of 1000 real (~S$265) which is why all our coefficients are in decimal place. For easier interpretation, we will multiply our dependent variable by 1000 to make the unit of GDP_CAPITA as 1 real.
coefficients$Value <- coefficients$Value *1000
coefficients
## Value
## (Intercept) 5.655193e+03
## IDHM 4.481006e+03
## DU_URBAN_PERCENTAGE -2.396971e+00
## ACTIVE_PERCENTAGE 1.827565e+01
## PLANTED_AREA_PERCENTAGE 5.337640e+00
## Intermediário_Adjacente 5.203362e+01
## Intermediário_Remoto 3.669998e+02
## Rural_Adjacente 7.048606e+01
## Rural_Remoto 2.611680e+02
## GVA_AGROPEC_Percentage 5.998346e-03
## GVA_INDUSTRY_Percentage 1.028729e-02
## GVA_SERVICES_Percentage -3.278772e-03
## GVA_PUBLIC_Percentage -4.117840e-03
## TAX_CAPITA 7.152464e+01
## COMP_AGRICULTURE 3.477837e+00
## COMP_INDUSTRY -4.097323e+00
## COMP_SERVICES 8.437904e+00
## COMP_PUBLIC -2.924581e+00
The five variables which are negative are:
(1) COMP_INDUSTRY -4.097323
(2) COMP_PUBLIC -2.924581
(3) DU_URBAN_PERCENTAGE -2.396971
(4) GVA_PUBLIC_Percentage -0.004117840
(5) GVA_SERVICES_Percentage -0.003278772
COMP_INDUSTRY (Percentage of Company in Industrial Services), COMP_PUBLIC (Percentage of Company in Public Services) and DU_URBAN_PERCENTAGE (Percentage of domestic units which are urban) have a strong negative relationship with these GDP_CAPITA. With increase in proportion for each of these variables, there is a decrease in GDP per capita. As this is a comparitive study done for all the municipalities, these coefficents suggest that municipalities which are heavily focused in industrial services and public services have lower GDP per capita than other municipalities which have higher proprotion of other type of companies. GVA_PUBLIC_Percentage (Percentage of Gross Value Add By Public Services) and GVA_SERVICES_Percentage (Percentage of Gross Value Add By Commercial Services) are negative but very close to zero suggesting that they do not have a significant relationship on GDP_CAPITA.
The twelve variables which are positive are:
(1) IDHM 4481.006
(2)
(3)
(4) TAX_CAPITA 71.52464
(5)
(6)
(7) ACTIVE_PERCENTAGE 18.27565
(8) COMP_SERVICES 8.437904
(9) PLANTED_AREA_PERCENTAGE 5.337640
(10) COMP_AGRICULTURE 3.477837
(11) GVA_INDUSTRY_Percentage 0.01028729
(12) GVA_AGROPEC_Percentage 0.005998346
The variables displayed above are the independent variables which have a positive relationship with GDP_CAPITA. Other than for the dummy variables marked above, this means that the single unit increase in the variable leads to an increase in GDP_CAPITA with an amount of the value of the coefficient. For example, IDHM (Human Development Index) has a coefficient of 4481.006. This means that for every one unit of increase in IDHM, the GDP_CAPITA increases by roughly 4481 real. As IDHM ranges from 0-1, a better understanding of this coefficient will be that for every 0.1 increase in IDHM, the GDP_CAPITA increases by roughly 448.1 real. IDHM is a composite index consisting of education, life expectancy, and income. this indicates that public infrastructure such as schools and hospitals significantly contribute to the increase in GDP per capita, and are essential to every municipality.
The dummy variable is analysed in a different way. For example, Intermediário_Remoto has a coefficient of 366.9998. This means that if a municipality is Intermediário_Remoto then it is likely to have 367 real more in their GDP per capita than those municipalities which dont belong to this category. Similar conclusions can be drawn from the other dummy variables. To be very honest, it is very surprising to notice that Rural and remote municipalities tend to have more GDP_CAPITA. One of the major reasons for this is due to lower population as compared to urban areas.
A geographically weighted regression is a local form of linear regression used to model spatially varying relationships. We have already created an exploratory multiple linear regression model for GDP_CAPITA. GWR will increase the accuracy of our model as it will construct a separate equation for every municipality in the dataset incorporating the dependent and explanatory variables of features falling within the bandwidth of each target feauture. This means that each equation is calibrated using a different weighting of the observations contained in the data set. There are many variations of a geographically weighted regression. The parameters chosen for our regression model are described below.
The assumption in a GWR model is that observations nearby one another have a greater influence on one another’s parameter estimates than observations farther apart. The weight assigned to each observation is based on a distance decay function from the centroid of each municipality.The weights are determined using a kernel, which is a distance decay function that determines how quickly weights decrease as distances increase. We have chosen our kernel as gausian as it assigns a weight of one to the regression feature, and weights for the surrounding features smoothly and gradually decrease as the distance from the regression feature increases. As we have more than 5000 municipalities in our regression analysis, the gausian kernel is the most appropriate weighting scheme as municipalities which are further away from the regression feauture are also considered but are given a very low weightage. This is because the gaussian kernel never reaches zero, hence each municipality have weights for all the other municipalities based on distance.
There are two types of bandwidths we can use: Fixed or Adapative. Fixed bandwidth will consider all the neighbours in a given distance whereas adaptive bandwidth will consider optimal number of neighbours. Adaptive bandwidths require the calculation of a distance matrix, which will require more than 5000 * 5000 calculations to build the matrix. As we will be publishing this report on RPubs, we will avoid using adaptive bandwidth as the matrix data file itself takes up 250 MB.
For adaptive bandwidth, GWR will compute the optimal distance (for a fixed kernel) while minimising the Cross validation score.
The approach used in calculating neighbour is Cross Validation. In geographically weighted regression, one must determine a window size which will be used to subset the data locally. The cross-validation procedure is used to determine a globally optimal window size. The CV score is better than the alternative approach (AIC score) for this case, as the degrees of freedom change in the AIC approach while adjusting the bandwidth, which is not required in this model.
The function bw.gwr() will be used to compute the fixed bandwidth. As described above, we will be using a fixed bandwidth on a gausian kernel which aims to minimise cross validation scores.
bw.fixed <- bw.gwr(formula = GDP_CAPITA.log ~ IDHM +
DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Rural_Adjacente + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + Rural_Remoto+ GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC , data = data.sp, approach = "CV", kernel="gaussian", adaptive=FALSE, longlat=TRUE)
As we have found the bandwidth, we will create the geogrpahically weighted model and determine if our model has improved by considering the spatial relationships.
gwr.fixed <- gwr.basic(formula = GDP_CAPITA.log ~ IDHM + DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Rural_Adjacente + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + Rural_Remoto+ GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC , data = data.sp, bw = 520.5, kernel = 'gaussian', longlat = TRUE)
gwr.fixed
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2020-06-01 03:09:22
## Call:
## gwr.basic(formula = GDP_CAPITA.log ~ IDHM + DU_URBAN_PERCENTAGE +
## ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente +
## Rural_Adjacente + GVA_AGROPEC_Percentage + GVA_INDUSTRY_Percentage +
## GVA_SERVICES_Percentage + Rural_Remoto + GVA_PUBLIC_Percentage +
## TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES +
## COMP_PUBLIC, data = data.sp, bw = 520.5, kernel = "gaussian",
## longlat = TRUE)
##
## Dependent (y) variable: GDP_CAPITA.log
## Independent variables: IDHM DU_URBAN_PERCENTAGE ACTIVE_PERCENTAGE PLANTED_AREA_PERCENTAGE Intermediário_Adjacente Rural_Adjacente GVA_AGROPEC_Percentage GVA_INDUSTRY_Percentage GVA_SERVICES_Percentage Rural_Remoto GVA_PUBLIC_Percentage TAX_CAPITA COMP_AGRICULTURE COMP_INDUSTRY COMP_SERVICES COMP_PUBLIC
## Number of data points: 5552
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88482 -0.21926 -0.04419 0.16588 2.99060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.768e+00 1.752e-01 32.925 < 2e-16 ***
## IDHM 4.466e+00 1.303e-01 34.264 < 2e-16 ***
## DU_URBAN_PERCENTAGE -2.494e-03 3.473e-04 -7.181 7.82e-13 ***
## ACTIVE_PERCENTAGE 1.707e-02 2.409e-03 7.086 1.56e-12 ***
## PLANTED_AREA_PERCENTAGE 5.319e-03 1.961e-04 27.123 < 2e-16 ***
## Intermediário_Adjacente 3.133e-02 1.738e-02 1.802 0.07161 .
## Rural_Adjacente 4.590e-02 1.538e-02 2.985 0.00285 **
## GVA_AGROPEC_Percentage 6.218e-06 6.237e-07 9.969 < 2e-16 ***
## GVA_INDUSTRY_Percentage 1.030e-05 8.583e-07 11.998 < 2e-16 ***
## GVA_SERVICES_Percentage -3.347e-06 5.587e-07 -5.990 2.23e-09 ***
## Rural_Remoto 2.318e-01 2.500e-02 9.272 < 2e-16 ***
## GVA_PUBLIC_Percentage -4.187e-06 4.882e-07 -8.578 < 2e-16 ***
## TAX_CAPITA 7.163e-02 1.929e-03 37.138 < 2e-16 ***
## COMP_AGRICULTURE 3.388e-03 5.872e-04 5.770 8.38e-09 ***
## COMP_INDUSTRY -3.976e-03 8.997e-04 -4.420 1.01e-05 ***
## COMP_SERVICES 8.147e-03 1.085e-03 7.509 6.90e-14 ***
## COMP_PUBLIC -2.965e-03 9.029e-04 -3.284 0.00103 **
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.3591 on 5535 degrees of freedom
## Multiple R-squared: 0.724
## Adjusted R-squared: 0.7232
## F-statistic: 907.6 on 16 and 5535 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 713.8964
## Sigma(hat): 0.3586502
## AIC: 4403.766
## AICc: 4403.89
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: gaussian
## Fixed bandwidth: 520.5
## Regression points: the same locations as observations are used.
## Distance metric: Great Circle distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu.
## Intercept 3.5918e+00 4.5969e+00 6.4049e+00 7.1170e+00
## IDHM 7.0023e-01 2.3241e+00 3.0105e+00 3.7890e+00
## DU_URBAN_PERCENTAGE -1.2126e-02 -2.9675e-03 -1.5261e-03 1.2966e-03
## ACTIVE_PERCENTAGE -4.0936e-02 1.6111e-02 2.0225e-02 3.6364e-02
## PLANTED_AREA_PERCENTAGE 5.1165e-04 3.7827e-03 4.2643e-03 4.8486e-03
## Intermediário_Adjacente -8.7305e-02 1.0146e-02 3.1268e-02 5.6163e-02
## Rural_Adjacente -9.7066e-02 -1.7413e-02 6.0540e-02 8.0944e-02
## GVA_AGROPEC_Percentage 4.6438e-07 5.2591e-06 5.5373e-06 6.0705e-06
## GVA_INDUSTRY_Percentage -8.6256e-06 9.4129e-06 1.0547e-05 1.1588e-05
## GVA_SERVICES_Percentage -1.0004e-05 -3.0248e-06 -2.6983e-06 -1.6041e-06
## Rural_Remoto -1.5925e-01 -3.7663e-04 1.1742e-01 2.7789e-01
## GVA_PUBLIC_Percentage -1.3185e-05 -1.0805e-05 -5.8453e-06 -1.8408e-06
## TAX_CAPITA 5.8351e-02 6.1810e-02 6.8618e-02 1.1677e-01
## COMP_AGRICULTURE 3.4866e-04 9.5571e-04 2.4816e-03 1.7337e-02
## COMP_INDUSTRY -1.2177e-02 -4.1437e-03 7.1787e-05 1.1738e-03
## COMP_SERVICES 3.0606e-03 9.7136e-03 1.0167e-02 1.0535e-02
## COMP_PUBLIC -1.2587e-02 -2.3025e-03 1.5551e-03 3.3808e-03
## Max.
## Intercept 10.2177
## IDHM 4.2075
## DU_URBAN_PERCENTAGE 0.0020
## ACTIVE_PERCENTAGE 0.0551
## PLANTED_AREA_PERCENTAGE 0.0204
## Intermediário_Adjacente 0.1019
## Rural_Adjacente 0.1641
## GVA_AGROPEC_Percentage 0.0000
## GVA_INDUSTRY_Percentage 0.0000
## GVA_SERVICES_Percentage 0.0000
## Rural_Remoto 0.4139
## GVA_PUBLIC_Percentage 0.0000
## TAX_CAPITA 0.1970
## COMP_AGRICULTURE 0.0298
## COMP_INDUSTRY 0.0176
## COMP_SERVICES 0.0235
## COMP_PUBLIC 0.0121
## ************************Diagnostic information*************************
## Number of data points: 5552
## Effective number of parameters (2trace(S) - trace(S'S)): 137.9105
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5414.09
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 2985.55
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 2879.408
## Residual sum of squares: 536.2443
## R-square value: 0.7927114
## Adjusted R-square value: 0.7874303
##
## ***********************************************************************
## Program stops at: 2020-06-01 03:10:14
By incorporating the GWR model in our multiple regression model, our Adjusted R-Square value has increased from 0.7232 to 0.787427. This means that the spatial weights have justified for 6.42% of the variation in GDP_CAPITA. Due to the spatial weights, our model has significantly improved as nearly 80% of the variation in GDP_CAPITA is justified.
We will first obtain the data from the GWR model through the code below.
gdp_capita.fixed <- st_as_sf(gwr.fixed$SDF) %>%
st_transform(crs=4674)
one <- tm_shape(gdp_capita.fixed)+
tm_polygons("Local_R2",border.col = "black", border.alpha = 0.2, style="quantile")
two <- tm_shape(data)+
tm_polygons("GDP_CAPITA",border.col = "black", border.alpha = 0.2, style="quantile")
tmap_arrange(one,two,ncol = 2,nrow = 1)
The above diagram shows the Local R-square value and the GDP_CAPITA for municipalities in Brazil. The point of this comparison is not to analyse each municipality independently but rather to understand the patterns discovered. Local R-square value indicate the amount of variation in the GDP_CAPITA which is due to the dependent variables for each municipality. We can draw strong observations from the map above. The eastern region of Brazil has low GDP_CAPITA, and has the highest R-square values. This means that our GWR model works better for municipalities which have lesser GDP_CAPITA as it is explained from the independent variables chosen. It should also be noted that the similar inverse relationship is observed in the central region of Brazil(which has the highest GDP_CAPITA, but lowest R-square value). Another thing to note is that we have used quartile breaks in GDP_CAPITA, and it is very evident that the last break consists majority of the data. This was done to understand the relationship of between Local R-square and GDP_CAPITA, which would not be possible otherwise due to the extremely skewed nature of the data. The exception to this analysis is the southern region of Brazil which has higher GDP_CAPITA and a relatively high local r-square value. We can infer that there are many other variables which need to be included in order to understand the richer part of Brazil as this model severely lacks that understanding.
We will first manupulate the data in order to sort out regions which are statistically significant when comparing the parameter estimates for each of the dependent variables.
gdp_capita.fixed <- gdp_capita.fixed %>%
mutate(IDHM_TV=abs(IDHM_TV)) %>%
mutate(DU_URBAN_PERCENTAGE_TV=abs(DU_URBAN_PERCENTAGE_TV)) %>%
mutate(ACTIVE_PERCENTAGE_TV=abs(ACTIVE_PERCENTAGE_TV)) %>%
mutate(PLANTED_AREA_PERCENTAGE_TV=abs(PLANTED_AREA_PERCENTAGE_TV)) %>%
mutate(Intermediário_Adjacente=abs(Intermediário_Adjacente_TV)) %>%
mutate(Rural_Adjacente=abs(Rural_Adjacente_TV)) %>%
mutate(GVA_AGROPEC_Percentage=abs(GVA_AGROPEC_Percentage_TV)) %>%
mutate(GVA_SERVICES_Percentage=abs(GVA_SERVICES_Percentage_TV)) %>%
mutate(Rural_Remoto_TV=abs(Rural_Remoto_TV)) %>%
mutate(GVA_PUBLIC_Percentage_TV=abs(GVA_PUBLIC_Percentage_TV)) %>%
mutate(TAX_CAPITA_TV=abs(TAX_CAPITA_TV)) %>%
mutate(COMP_AGRICULTURE_TV=abs(COMP_AGRICULTURE_TV)) %>%
mutate(COMP_INDUSTRY_TV=abs(COMP_INDUSTRY_TV)) %>%
mutate(COMP_SERVICES_TV=abs(COMP_SERVICES_TV))
The objective of this section is to map out parameter estimates which are significant. It will be plotted along side the GDP_CAPITA raw values in order to draw understanding from the GWR model. Note that the confidence interval used in this report is 95%, hence the municipalities will be considered to have a significant value for a particular independent if the absolute value of the t-statistic is more than or equal to 1.96. In order to perfrom this analysis, the code below creates functions which will help us arrange the choropleth maps in neat manner.
makeMap <- function(variable,tscore){
one <- tm_shape(gdp_capita.fixed)+
tm_polygons(variable,border.col = "black", border.alpha = 0.2)
two <- tm_shape(gdp_capita.fixed)+
tm_polygons(tscore,border.col = "black", border.alpha = 0.2, breaks=c(0,1.96,Inf))
three <- tm_shape(data)+
tm_polygons("GDP_CAPITA",border.col = "black", border.alpha = 0.2, style="quantile")
return(tmap_arrange(one,two,three,ncol=3,nrow = 1))
}
IDHM <- makeMap("IDHM","IDHM_TV")
DU_URBAN_PERCENTAGE <- makeMap("DU_URBAN_PERCENTAGE","DU_URBAN_PERCENTAGE_TV")
ACTIVE_PERCENTAGE <- makeMap("ACTIVE_PERCENTAGE","ACTIVE_PERCENTAGE_TV")
PLANTED_AREA_PERCENTAGE <- makeMap("PLANTED_AREA_PERCENTAGE","PLANTED_AREA_PERCENTAGE_TV")
Intermediário_Adjacente <- makeMap("Intermediário_Adjacente","Intermediário_Adjacente_TV")
Rural_Adjacente <- makeMap("Rural_Adjacente","Rural_Adjacente_TV")
GVA_AGROPEC_Percentage <- makeMap("GVA_AGROPEC_Percentage","GVA_AGROPEC_Percentage_TV")
GVA_INDUSTRY_Percentage <- makeMap("GVA_AGROPEC_Percentage","GVA_INDUSTRY_Percentage_TV")
GVA_SERVICES_Percentage <- makeMap("GVA_SERVICES_Percentage","GVA_SERVICES_Percentage_TV")
Rural_Remoto <- makeMap("Rural_Remoto","Rural_Remoto_TV")
GVA_PUBLIC_Percentage <- makeMap("GVA_PUBLIC_Percentage","GVA_PUBLIC_Percentage_TV")
TAX_CAPITA <- makeMap("TAX_CAPITA","TAX_CAPITA_TV")
COMP_AGRICULTURE <- makeMap("COMP_AGRICULTURE","COMP_AGRICULTURE_TV")
COMP_INDUSTRY <- makeMap("COMP_INDUSTRY","COMP_INDUSTRY_TV")
COMP_SERVICES <- makeMap("COMP_SERVICES","COMP_SERVICES_TV")
IDHM
It is very prominent that IDHM is significant in almost all the municipalities in Brazil. This is because this index captures education, life expectancy and income, which is key to GDP_CAPITA. This variable was also found to have the highest correlation coefficient with GDP_CAPITA when compared in the multiple linear regression model as well. When we account for spatial variation, IDHM tops the list as the most significant independent variable, indicating that it is one of the most important methods through which GDP_CAPITA can be estimated.
DU_URBAN_PERCENTAGE
The percentage of urban units is found to be significant in the western, eastern and sothern regions of Brazil. It can be inferred that this independent variable is significant only when its is either too low or too high, indicating that it does not play a big factor in determining GDP_CAPITA in municipalities which have average urban housing rates.
ACTIVE_PERCENTAGE
Active percentage reflects the percentage of population which is performing economically active work. Almost all the municipalities consider this variable as a significant variable regardless of their active percentage indicating that it one of the most important factor when determining the GDP_CAPITA.
PLANTED_AREA_PERCENTAGE
Planted_area_percentage reflects the amount of land dedicated to farming. As seen that it is a significant factor in almost all the municipalities in Brazil indicating Brazil’s dependence on the agricultural economy. We can also observe that this proportion is lesser in economically richer areas.
Intermediário_Adjacente
This index determines semi-rural areas. It is evident that these regions have a higher GDP_CAPITA as compared to the other types of regions.
Rural_Adjacente
GVA_AGROPEC_Percentage
This index tells us the percentage of value addition by the agricultural industry. It is significant in almost all the municipalities in Brazil except the northern regions. Agriculture is one of the biggest factors contributing to the GDP_CAPITA, as we have noticed from the other variables analysed above.
GVA_SERVICES_Percentage
This variable is insignificant in almost all the municipalities in Brazil.
The model was significantly improved when the GWR model was adopted. Through this method, we could explain how each independent variable played a role in different regions which was key to our understanding of the different factors which affect the GDP_CAPITA. Local statistics help us justify the findings from global statistics and provide us insights on a deeper level through which we can uncover findings which were not visible before. We understand how spatial relationships play an important role and dig deeper to find out variables which do not affect GDP_CAPITA at all!