1 Getting Started

1.1 Overview

1.2 Importing and Loading Packages

packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse','geobr','heatmaply')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

1.3 Importing Geospatial Data

We will be importing the municipalities of Brazil from the geobr package. Studying the vignettes, the function to obtain our required municipality data from 2016 is read_municipality.

However, re-running the code will involve pulling the municipality data over and over, therefore instead we will save the municipality data into a local file instead for faster loading in future uses.

The code chunk below will run once on my client side and will be commented out for use of the local file instead moving forward.

#Municipal2016 <- read_municipality(year=2016)
#saveRDS(Municipal2016, file = "Municipal2016.rds")

Municipal2016 <- readRDS(file = "Municipal2016.rds")

1.3.1 Studying the Imported Geospatial Data

glimpse(Municipal2016)

## Rows: 5,572
## Columns: 5
## $ code_muni    <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100064, ...
## $ name_muni    <fct> Alta Floresta D'oeste, Ariquemes, Cabixi, Cacoal, Cere...
## $ code_state   <fct> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11...
## $ abbrev_state <fct> RO, RO, RO, RO, RO, RO, RO, RO, RO, RO, RO, RO, RO, RO...
## $ geom         <POLYGON [°]> POLYGON ((-62.19465 -11.827..., POLYGON ((-62....

5572 Municipalities are observed in our Simple Feature Dataframe.

1.3.2 Geospatial Data Wrangling

st_crs(Municipal2016)

## Coordinate Reference System:
##   User input: SIRGAS 2000 
##   wkt:
## GEOGCRS["SIRGAS 2000",
##     DATUM["Sistema de Referencia Geocentrico para las AmericaS 2000",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["Latin America - SIRGAS 2000 by country"],
##         BBOX[-59.87,-122.19,32.72,-25.28]],
##     ID["EPSG",4674]]

We observe that our current projection is in EPSG:4674

Next we will check for invalid polygons in Municipal2016

unique(st_is_valid(Municipal2016))

## [1]  TRUE FALSE

It is observed that there are invalid polygons that we will use the st_make_valid() function to fix. st_make_Valid attempts to repair invalidities without only minimal alterations to the input geometries. No vertices are dropped or moved, the structure of the object is simply re-arranged.

Municipal2016 <- st_make_valid(Municipal2016)

Now we’ll recheck to ensure no invalid polygons remain

unique(st_is_valid(Municipal2016))

## [1] TRUE

No invalid polygons remain!

Now we’ll check for any NA data in Municipal2016

any(is.na(Municipal2016))

## [1] FALSE

No NA data is observed in Municipal2016.

1.3.3 Visualising Brazil Municipalities 2016

# Remove plot axis
no_axis <- theme(axis.title=element_blank(),
                 axis.text=element_blank(),
                 axis.ticks=element_blank())

# Plot all Brazilian municipalities
ggplot() +
    geom_sf(data=Municipal2016, fill="#FFDF00", color="#009C3B", size=.15, show.legend = FALSE) +
    labs(subtitle="Brazil Municipalities, 2016", size=8) +
    theme_minimal() +
    no_axis

We will proceed with importing our provided Aspatial Data.

1.4 Importing Aspatial Data

We will be importing both BRAZIL_CITIES.csv and Data_Dictionary.csv

Further investigations into the csv files reveal that columns are delimited by “;”. Therefore we will be using read_delim instead of read_csv for importing the Aspatial Data. Note that read_csv2 is also an alternative here.

Brazil_Cities<- read_delim ("data/aspatial/BRAZIL_CITIES.csv", ";")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   CITY = col_character(),
##   STATE = col_character(),
##   AREA = col_number(),
##   REGIAO_TUR = col_character(),
##   CATEGORIA_TUR = col_character(),
##   RURAL_URBAN = col_character(),
##   GVA_MAIN = col_character()
## )

## See spec(...) for full column specifications.

Data_Dict <- read_delim("data/aspatial/Data_Dictionary.csv", ";")

## Warning: Missing column names filled in: 'X6' [6]

## Parsed with column specification:
## cols(
##   FIELD = col_character(),
##   DESCRIPTION = col_character(),
##   REFERENCE = col_character(),
##   UNIT = col_character(),
##   SOURCE = col_character(),
##   X6 = col_character()
## )

1.4.1 Studying the Imported Aspatial Data (Brazil_Cities)

Next we will take a look at our imported Brazil_Cities aspatial data with summary.

summary(Brazil_Cities)

##      CITY              STATE              CAPITAL          IBGE_RES_POP     
##  Length:5573        Length:5573        Min.   :0.000000   Min.   :     805  
##  Class :character   Class :character   1st Qu.:0.000000   1st Qu.:    5235  
##  Mode  :character   Mode  :character   Median :0.000000   Median :   10934  
##                                        Mean   :0.004845   Mean   :   34278  
##                                        3rd Qu.:0.000000   3rd Qu.:   23424  
##                                        Max.   :1.000000   Max.   :11253503  
##                                                           NA's   :8         
##  IBGE_RES_POP_BRAS  IBGE_RES_POP_ESTR     IBGE_DU        IBGE_DU_URBAN    
##  Min.   :     805   Min.   :     0.0   Min.   :    239   Min.   :     60  
##  1st Qu.:    5230   1st Qu.:     0.0   1st Qu.:   1572   1st Qu.:    874  
##  Median :   10926   Median :     0.0   Median :   3174   Median :   1846  
##  Mean   :   34200   Mean   :    77.5   Mean   :  10303   Mean   :   8859  
##  3rd Qu.:   23390   3rd Qu.:    10.0   3rd Qu.:   6726   3rd Qu.:   4624  
##  Max.   :11133776   Max.   :119727.0   Max.   :3576148   Max.   :3548433  
##  NA's   :8          NA's   :8          NA's   :10        NA's   :10       
##  IBGE_DU_RURAL      IBGE_POP            IBGE_1            IBGE_1-4     
##  Min.   :    3   Min.   :     174   Min.   :     0.0   Min.   :     5  
##  1st Qu.:  487   1st Qu.:    2801   1st Qu.:    38.0   1st Qu.:   158  
##  Median :  931   Median :    6170   Median :    92.0   Median :   376  
##  Mean   : 1463   Mean   :   27595   Mean   :   383.3   Mean   :  1544  
##  3rd Qu.: 1832   3rd Qu.:   15302   3rd Qu.:   232.0   3rd Qu.:   951  
##  Max.   :33809   Max.   :10463636   Max.   :129464.0   Max.   :514794  
##  NA's   :81      NA's   :8          NA's   :8          NA's   :8       
##     IBGE_5-9        IBGE_10-14       IBGE_15-59         IBGE_60+      
##  Min.   :     7   Min.   :    12   Min.   :     94   Min.   :     29  
##  1st Qu.:   220   1st Qu.:   259   1st Qu.:   1734   1st Qu.:    341  
##  Median :   516   Median :   588   Median :   3841   Median :    722  
##  Mean   :  2069   Mean   :  2381   Mean   :  18212   Mean   :   3004  
##  3rd Qu.:  1300   3rd Qu.:  1478   3rd Qu.:   9628   3rd Qu.:   1724  
##  Max.   :684443   Max.   :783702   Max.   :7058221   Max.   :1293012  
##  NA's   :8        NA's   :8        NA's   :8         NA's   :8        
##  IBGE_PLANTED_AREA   IBGE_CROP_PRODUCTION_$ IDHM Ranking 2010      IDHM       
##  Min.   :      0.0   Min.   :      0        Min.   :   1      Min.   :0.4180  
##  1st Qu.:    910.2   1st Qu.:   2326        1st Qu.:1392      1st Qu.:0.5990  
##  Median :   3471.5   Median :  13846        Median :2783      Median :0.6650  
##  Mean   :  14179.9   Mean   :  57384        Mean   :2783      Mean   :0.6592  
##  3rd Qu.:  11194.2   3rd Qu.:  55619        3rd Qu.:4174      3rd Qu.:0.7180  
##  Max.   :1205669.0   Max.   :3274885        Max.   :5565      Max.   :0.8620  
##  NA's   :3           NA's   :3              NA's   :8         NA's   :8       
##    IDHM_Renda     IDHM_Longevidade IDHM_Educacao         LONG       
##  Min.   :0.4000   Min.   :0.6720   Min.   :0.2070   Min.   :-72.92  
##  1st Qu.:0.5720   1st Qu.:0.7690   1st Qu.:0.4900   1st Qu.:-50.87  
##  Median :0.6540   Median :0.8080   Median :0.5600   Median :-46.52  
##  Mean   :0.6429   Mean   :0.8016   Mean   :0.5591   Mean   :-46.23  
##  3rd Qu.:0.7070   3rd Qu.:0.8360   3rd Qu.:0.6310   3rd Qu.:-41.40  
##  Max.   :0.8910   Max.   :0.8940   Max.   :0.8250   Max.   :-32.44  
##  NA's   :8        NA's   :8        NA's   :8        NA's   :9       
##       LAT               ALT               PAY_TV         FIXED_PHONES    
##  Min.   :-33.688   Min.   :     0.0   Min.   :      1   Min.   :      3  
##  1st Qu.:-22.838   1st Qu.:   169.8   1st Qu.:     88   1st Qu.:    119  
##  Median :-18.089   Median :   406.5   Median :    247   Median :    327  
##  Mean   :-16.444   Mean   :   893.8   Mean   :   3094   Mean   :   6567  
##  3rd Qu.: -8.489   3rd Qu.:   628.9   3rd Qu.:    815   3rd Qu.:   1151  
##  Max.   :  4.585   Max.   :874579.0   Max.   :2047668   Max.   :5543127  
##  NA's   :9         NA's   :9          NA's   :3         NA's   :3        
##       AREA            REGIAO_TUR        CATEGORIA_TUR      ESTIMATED_POP     
##  Min.   :     3.57   Length:5573        Length:5573        Min.   :     786  
##  1st Qu.:   204.44   Class :character   Class :character   1st Qu.:    5454  
##  Median :   416.59   Mode  :character   Mode  :character   Median :   11590  
##  Mean   :  1517.44                                         Mean   :   37432  
##  3rd Qu.:  1026.57                                         3rd Qu.:   25296  
##  Max.   :159533.33                                         Max.   :12176866  
##  NA's   :3                                                 NA's   :3         
##  RURAL_URBAN         GVA_AGROPEC       GVA_INDUSTRY       GVA_SERVICES      
##  Length:5573        Min.   :      0   Min.   :       1   Min.   :        2  
##  Class :character   1st Qu.:   4189   1st Qu.:    1726   1st Qu.:    10112  
##  Mode  :character   Median :  20426   Median :    7424   Median :    31211  
##                     Mean   :  47271   Mean   :  175928   Mean   :   489451  
##                     3rd Qu.:  51227   3rd Qu.:   41022   3rd Qu.:   115406  
##                     Max.   :1402282   Max.   :63306755   Max.   :464656988  
##                     NA's   :3         NA's   :3          NA's   :3          
##    GVA_PUBLIC         GVA_TOTAL             TAXES                GDP           
##  Min.   :       7   Min.   :       17   Min.   :   -14159   Min.   :       15  
##  1st Qu.:   17267   1st Qu.:    42253   1st Qu.:     1305   1st Qu.:    43709  
##  Median :   35866   Median :   119492   Median :     5100   Median :   125153  
##  Mean   :  123768   Mean   :   832987   Mean   :   118864   Mean   :   954584  
##  3rd Qu.:   89245   3rd Qu.:   313963   3rd Qu.:    22197   3rd Qu.:   329539  
##  Max.   :41902893   Max.   :569910503   Max.   :117125387   Max.   :687035890  
##  NA's   :3          NA's   :3           NA's   :3           NA's   :3          
##     POP_GDP           GDP_CAPITA       GVA_MAIN          MUN_EXPENDIT      
##  Min.   :     815   Min.   :  3191   Length:5573        Min.   :1.421e+06  
##  1st Qu.:    5483   1st Qu.:  9058   Class :character   1st Qu.:1.573e+07  
##  Median :   11578   Median : 15870   Mode  :character   Median :2.746e+07  
##  Mean   :   36998   Mean   : 21126                      Mean   :1.043e+08  
##  3rd Qu.:   25085   3rd Qu.: 26155                      3rd Qu.:5.666e+07  
##  Max.   :12038175   Max.   :314638                      Max.   :4.577e+10  
##  NA's   :3          NA's   :3                           NA's   :1492       
##     COMP_TOT            COMP_A            COMP_B            COMP_C        
##  Min.   :     6.0   Min.   :   0.00   Min.   :  0.000   Min.   :    0.00  
##  1st Qu.:    68.0   1st Qu.:   1.00   1st Qu.:  0.000   1st Qu.:    3.00  
##  Median :   162.0   Median :   2.00   Median :  0.000   Median :   11.00  
##  Mean   :   906.8   Mean   :  18.25   Mean   :  1.852   Mean   :   73.44  
##  3rd Qu.:   448.0   3rd Qu.:   8.00   3rd Qu.:  2.000   3rd Qu.:   39.00  
##  Max.   :530446.0   Max.   :1948.00   Max.   :274.000   Max.   :31566.00  
##  NA's   :3          NA's   :3         NA's   :3         NA's   :3         
##      COMP_D             COMP_E            COMP_F             COMP_G        
##  Min.   :  0.0000   Min.   :  0.000   Min.   :    0.00   Min.   :     1.0  
##  1st Qu.:  0.0000   1st Qu.:  0.000   1st Qu.:    1.00   1st Qu.:    32.0  
##  Median :  0.0000   Median :  0.000   Median :    4.00   Median :    74.5  
##  Mean   :  0.4262   Mean   :  2.029   Mean   :   43.26   Mean   :   348.0  
##  3rd Qu.:  0.0000   3rd Qu.:  1.000   3rd Qu.:   15.00   3rd Qu.:   199.0  
##  Max.   :332.0000   Max.   :657.000   Max.   :25222.00   Max.   :150633.0  
##  NA's   :3          NA's   :3         NA's   :3          NA's   :3         
##      COMP_H          COMP_I             COMP_J             COMP_K        
##  Min.   :    0   Min.   :    0.00   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    1   1st Qu.:    2.00   1st Qu.:    0.00   1st Qu.:    0.00  
##  Median :    7   Median :    7.00   Median :    1.00   Median :    0.00  
##  Mean   :   41   Mean   :   55.88   Mean   :   24.74   Mean   :   15.55  
##  3rd Qu.:   25   3rd Qu.:   24.00   3rd Qu.:    5.00   3rd Qu.:    2.00  
##  Max.   :19515   Max.   :29290.00   Max.   :38720.00   Max.   :23738.00  
##  NA's   :3       NA's   :3          NA's   :3          NA's   :3         
##      COMP_L             COMP_M             COMP_N            COMP_O       
##  Min.   :    0.00   Min.   :    0.00   Min.   :    0.0   Min.   :  0.000  
##  1st Qu.:    0.00   1st Qu.:    1.00   1st Qu.:    1.0   1st Qu.:  2.000  
##  Median :    0.00   Median :    4.00   Median :    4.0   Median :  2.000  
##  Mean   :   15.14   Mean   :   51.29   Mean   :   83.7   Mean   :  3.269  
##  3rd Qu.:    3.00   3rd Qu.:   13.00   3rd Qu.:   14.0   3rd Qu.:  3.000  
##  Max.   :14003.00   Max.   :49181.00   Max.   :76757.0   Max.   :204.000  
##  NA's   :3          NA's   :3          NA's   :3         NA's   :3        
##      COMP_P             COMP_Q             COMP_R            COMP_S        
##  Min.   :    0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.00  
##  1st Qu.:    2.00   1st Qu.:    1.00   1st Qu.:   0.00   1st Qu.:    5.00  
##  Median :    6.00   Median :    3.00   Median :   2.00   Median :   12.00  
##  Mean   :   30.96   Mean   :   34.15   Mean   :  12.18   Mean   :   51.61  
##  3rd Qu.:   17.00   3rd Qu.:   12.00   3rd Qu.:   6.00   3rd Qu.:   31.00  
##  Max.   :16030.00   Max.   :22248.00   Max.   :6687.00   Max.   :24832.00  
##  NA's   :3          NA's   :3          NA's   :3         NA's   :3         
##      COMP_T      COMP_U              HOTELS            BEDS        
##  Min.   :0   Min.   :  0.00000   Min.   : 1.000   Min.   :    2.0  
##  1st Qu.:0   1st Qu.:  0.00000   1st Qu.: 1.000   1st Qu.:   40.0  
##  Median :0   Median :  0.00000   Median : 1.000   Median :   82.0  
##  Mean   :0   Mean   :  0.05027   Mean   : 3.131   Mean   :  257.5  
##  3rd Qu.:0   3rd Qu.:  0.00000   3rd Qu.: 3.000   3rd Qu.:  200.0  
##  Max.   :0   Max.   :123.00000   Max.   :97.000   Max.   :13247.0  
##  NA's   :3   NA's   :3           NA's   :4686     NA's   :4686     
##   Pr_Agencies        Pu_Agencies         Pr_Bank          Pu_Bank    
##  Min.   :   0.000   Min.   :  0.000   Min.   : 0.000   Min.   :0.00  
##  1st Qu.:   0.000   1st Qu.:  1.000   1st Qu.: 0.000   1st Qu.:1.00  
##  Median :   1.000   Median :  2.000   Median : 1.000   Median :2.00  
##  Mean   :   3.383   Mean   :  2.829   Mean   : 1.312   Mean   :1.58  
##  3rd Qu.:   2.000   3rd Qu.:  2.000   3rd Qu.: 2.000   3rd Qu.:2.00  
##  Max.   :1693.000   Max.   :626.000   Max.   :83.000   Max.   :8.00  
##  NA's   :2231       NA's   :2231      NA's   :2231     NA's   :2231  
##    Pr_Assets           Pu_Assets              Cars          Motorcycles     
##  Min.   :0.000e+00   Min.   :0.000e+00   Min.   :      2   Min.   :      4  
##  1st Qu.:0.000e+00   1st Qu.:4.047e+07   1st Qu.:    602   1st Qu.:    591  
##  Median :3.231e+07   Median :1.339e+08   Median :   1438   Median :   1285  
##  Mean   :9.180e+09   Mean   :6.005e+09   Mean   :   9859   Mean   :   4879  
##  3rd Qu.:1.148e+08   3rd Qu.:4.970e+08   3rd Qu.:   4086   3rd Qu.:   3294  
##  Max.   :1.947e+13   Max.   :8.016e+12   Max.   :5740995   Max.   :1134570  
##  NA's   :2231        NA's   :2231        NA's   :11        NA's   :11       
##  Wheeled_tractor         UBER           MAC             WAL-MART     
##  Min.   :   0.000   Min.   :1      Min.   :  1.000   Min.   : 1.000  
##  1st Qu.:   0.000   1st Qu.:1      1st Qu.:  1.000   1st Qu.: 1.000  
##  Median :   0.000   Median :1      Median :  2.000   Median : 1.000  
##  Mean   :   5.754   Mean   :1      Mean   :  4.277   Mean   : 2.059  
##  3rd Qu.:   1.000   3rd Qu.:1      3rd Qu.:  3.000   3rd Qu.: 1.750  
##  Max.   :3236.000   Max.   :1      Max.   :130.000   Max.   :26.000  
##  NA's   :11         NA's   :5448   NA's   :5407      NA's   :5471    
##   POST_OFFICES    
##  Min.   :  1.000  
##  1st Qu.:  1.000  
##  Median :  1.000  
##  Mean   :  2.081  
##  3rd Qu.:  2.000  
##  Max.   :225.000  
##  NA's   :120

Many NA values are observed in our data, most concerningly in our LONG & LAT Columns which are required for transformation to geospatial data. We will investigate it further

Brazil_Cities[is.na(Brazil_Cities$LAT),]

## # A tibble: 9 x 81
##   CITY  STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
##   <chr> <chr>   <dbl>        <dbl>            <dbl>            <dbl>   <dbl>
## 1 Baln~ SC          0           NA               NA               NA      NA
## 2 Lago~ RS          0           NA               NA               NA      NA
## 3 Moju~ PA          0           NA               NA               NA      NA
## 4 Para~ MS          0           NA               NA               NA      NA
## 5 Pesc~ SC          0           NA               NA               NA      NA
## 6 Pinh~ RS          0         2130             2130                0     745
## 7 Pint~ RS          0           NA               NA               NA      NA
## 8 Sant~ BA          0         9648             9648                0    2891
## 9 São ~ PE          0           NA               NA               NA      NA
## # ... with 74 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## #   IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## #   `IBGE_10-14` <dbl>, `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>,
## #   IBGE_PLANTED_AREA <dbl>, `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking
## #   2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## #   IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, PAY_TV <dbl>,
## #   FIXED_PHONES <dbl>, AREA <dbl>, REGIAO_TUR <chr>, CATEGORIA_TUR <chr>,
## #   ESTIMATED_POP <dbl>, RURAL_URBAN <chr>, GVA_AGROPEC <dbl>,
## #   GVA_INDUSTRY <dbl>, GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL
## #   ` <dbl>, TAXES <dbl>, GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>,
## #   GVA_MAIN <chr>, MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>,
## #   COMP_B <dbl>, COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>,
## #   COMP_G <dbl>, COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>,
## #   COMP_L <dbl>, COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>,
## #   COMP_Q <dbl>, COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>,
## #   HOTELS <dbl>, BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>,
## #   Pr_Bank <dbl>, Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## #   Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## #   `WAL-MART` <dbl>, POST_OFFICES <dbl>

Brazil_Cities[is.na(Brazil_Cities$LONG),]

## # A tibble: 9 x 81
##   CITY  STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
##   <chr> <chr>   <dbl>        <dbl>            <dbl>            <dbl>   <dbl>
## 1 Baln~ SC          0           NA               NA               NA      NA
## 2 Lago~ RS          0           NA               NA               NA      NA
## 3 Moju~ PA          0           NA               NA               NA      NA
## 4 Para~ MS          0           NA               NA               NA      NA
## 5 Pesc~ SC          0           NA               NA               NA      NA
## 6 Pinh~ RS          0         2130             2130                0     745
## 7 Pint~ RS          0           NA               NA               NA      NA
## 8 Sant~ BA          0         9648             9648                0    2891
## 9 São ~ PE          0           NA               NA               NA      NA
## # ... with 74 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## #   IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## #   `IBGE_10-14` <dbl>, `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>,
## #   IBGE_PLANTED_AREA <dbl>, `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking
## #   2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## #   IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, PAY_TV <dbl>,
## #   FIXED_PHONES <dbl>, AREA <dbl>, REGIAO_TUR <chr>, CATEGORIA_TUR <chr>,
## #   ESTIMATED_POP <dbl>, RURAL_URBAN <chr>, GVA_AGROPEC <dbl>,
## #   GVA_INDUSTRY <dbl>, GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL
## #   ` <dbl>, TAXES <dbl>, GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>,
## #   GVA_MAIN <chr>, MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>,
## #   COMP_B <dbl>, COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>,
## #   COMP_G <dbl>, COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>,
## #   COMP_L <dbl>, COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>,
## #   COMP_Q <dbl>, COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>,
## #   HOTELS <dbl>, BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>,
## #   Pr_Bank <dbl>, Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## #   Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## #   `WAL-MART` <dbl>, POST_OFFICES <dbl>

It is observed that NA rows in both LAT and LONG are the same. For these missing LAT LONG data, we will reference https://pt.db-city.com/ and https://latitude.to/ to find our missing data.

Our following reference in order of the cities listed above are: https://pt.db-city.com/Brasil--Santa-Catarina--Balne%C3%A1rio-Rinc%C3%A3o https://latitude.to/articles-by-country/br/brazil/69447/lagoa-dos-patos https://en.db-city.com/Brazil--Par%C3%A1--Moju%C3%AD-dos-Campos https://pt.db-city.com/Brasil--Mato-Grosso-do-Sul--Para%C3%ADso-das-%C3%81guas https://pt.db-city.com/Brasil--Santa-Catarina--Pescaria-Brava https://pt.db-city.com/Brasil--Rio-Grande-do-Sul--Pinhal-da-Serra https://pt.db-city.com/Brasil--Rio-Grande-do-Sul--Pinto-Bandeira https://pt.db-city.com/Brasil--Bahia--Santa-Teresinha https://pt.db-city.com/Brasil--S%C3%A3o-Paulo--S%C3%A3o-Caetano-do-Sul

We will append these data where our NA rows are. Note that Lagoa Dos Patos and Santa Terezinha have duplicate names in different states hence we will need to specify state.

Brazil_Cities$LAT[Brazil_Cities$CITY == "Balneário Rincão"] <- -28.8344
Brazil_Cities$LONG[Brazil_Cities$CITY == "Balneário Rincão"] <- -49.2361

Brazil_Cities$LAT[Brazil_Cities$CITY == "Lagoa Dos Patos" & Brazil_Cities$STATE == "RS"] <- -31.0697
Brazil_Cities$LONG[Brazil_Cities$CITY == "Lagoa Dos Patos" & Brazil_Cities$STATE == "RS"] <- --51.4725

Brazil_Cities$LAT[Brazil_Cities$CITY == "Mojuí Dos Campos"] <- -2.68472
Brazil_Cities$LONG[Brazil_Cities$CITY == "Mojuí Dos Campos"] <- -54.6431

Brazil_Cities$LAT[Brazil_Cities$CITY == "Paraíso Das Águas"] <- -19.0257
Brazil_Cities$LONG[Brazil_Cities$CITY == "Paraíso Das Águas"] <- -53.0102

Brazil_Cities$LAT[Brazil_Cities$CITY == "Pescaria Brava"] <- -28.4247
Brazil_Cities$LONG[Brazil_Cities$CITY == "Pescaria Brava"] <- -48.8956

Brazil_Cities$LAT[Brazil_Cities$CITY == "Pinhal Da Serra"] <- -27.8747
Brazil_Cities$LONG[Brazil_Cities$CITY == "Pinhal Da Serra"] <- -51.1733

Brazil_Cities$LAT[Brazil_Cities$CITY == "Pinto Bandeira"] <- -29.0978
Brazil_Cities$LONG[Brazil_Cities$CITY == "Pinto Bandeira"] <- -51.4503

Brazil_Cities$LAT[Brazil_Cities$CITY == "Santa Terezinha" & Brazil_Cities$STATE == "BA"] <- -12.7498
Brazil_Cities$LONG[Brazil_Cities$CITY == "Santa Terezinha" & Brazil_Cities$STATE == "BA"] <- -39.5184

Brazil_Cities$LAT[Brazil_Cities$CITY == "São Caetano"] <- -8.33
Brazil_Cities$LONG[Brazil_Cities$CITY == "São Caetano"] <- -36.1459

Further investigation reveals that Lagoa Dos Patos in the State RS is a lagoon and not a municipal, it will be removed.

Brazil_Cities <- Brazil_Cities[!(Brazil_Cities$CITY == "Lagoa Dos Patos" & Brazil_Cities$STATE == "RS" ),]

Brazil_Cities[is.na(Brazil_Cities$LAT),]

## # A tibble: 0 x 81
## # ... with 81 variables: CITY <chr>, STATE <chr>, CAPITAL <dbl>,
## #   IBGE_RES_POP <dbl>, IBGE_RES_POP_BRAS <dbl>, IBGE_RES_POP_ESTR <dbl>,
## #   IBGE_DU <dbl>, IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>, IBGE_POP <dbl>,
## #   IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>, `IBGE_10-14` <dbl>,
## #   `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>, IBGE_PLANTED_AREA <dbl>,
## #   `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking 2010` <dbl>, IDHM <dbl>,
## #   IDHM_Renda <dbl>, IDHM_Longevidade <dbl>, IDHM_Educacao <dbl>, LONG <dbl>,
## #   LAT <dbl>, ALT <dbl>, PAY_TV <dbl>, FIXED_PHONES <dbl>, AREA <dbl>,
## #   REGIAO_TUR <chr>, CATEGORIA_TUR <chr>, ESTIMATED_POP <dbl>,
## #   RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## #   GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## #   GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## #   MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## #   COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## #   COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## #   COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## #   COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, HOTELS <dbl>,
## #   BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>, Pr_Bank <dbl>,
## #   Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## #   Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## #   `WAL-MART` <dbl>, POST_OFFICES <dbl>

Brazil_Cities[is.na(Brazil_Cities$LONG),]

## # A tibble: 0 x 81
## # ... with 81 variables: CITY <chr>, STATE <chr>, CAPITAL <dbl>,
## #   IBGE_RES_POP <dbl>, IBGE_RES_POP_BRAS <dbl>, IBGE_RES_POP_ESTR <dbl>,
## #   IBGE_DU <dbl>, IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>, IBGE_POP <dbl>,
## #   IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>, `IBGE_10-14` <dbl>,
## #   `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>, IBGE_PLANTED_AREA <dbl>,
## #   `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking 2010` <dbl>, IDHM <dbl>,
## #   IDHM_Renda <dbl>, IDHM_Longevidade <dbl>, IDHM_Educacao <dbl>, LONG <dbl>,
## #   LAT <dbl>, ALT <dbl>, PAY_TV <dbl>, FIXED_PHONES <dbl>, AREA <dbl>,
## #   REGIAO_TUR <chr>, CATEGORIA_TUR <chr>, ESTIMATED_POP <dbl>,
## #   RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## #   GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## #   GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## #   MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## #   COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## #   COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## #   COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## #   COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, HOTELS <dbl>,
## #   BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>, Pr_Bank <dbl>,
## #   Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## #   Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## #   `WAL-MART` <dbl>, POST_OFFICES <dbl>

We observe no more NA data within our LAT LONG columns.

We will replace the remaining NA values with 0 to represent the absent data

1.4.2 Studying the Imported Aspatial Data (Data_Dict)

We will then look at our imported Brazil_Cities aspatial data with summary.

glimpse(Data_Dict)

## Rows: 96
## Columns: 6
## $ FIELD       <chr> "CITY", "STATE", "CAPITAL", "IBGE_RES_POP", "IBGE_RES_P...
## $ DESCRIPTION <chr> "Name of the City", "Name of the State", "1 if Capital ...
## $ REFERENCE   <chr> NA, NA, NA, "2010", "2010", "2010", "2010", "2010", "20...
## $ UNIT        <chr> NA, NA, NA, "-", "-", "-", "-", "-", "-", "-", "-", "-"...
## $ SOURCE      <chr> "-", "-", "-", "https://sidra.ibge.gov.br/tabela/1497",...
## $ X6          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

It appears that Data_Dict has a description column where each rows explain the columns in Brazil_Cities that are vague such as IBGE_DU etc. However we also observe 96 rows where our Brazil_Cities only has 81. We will perform further investigation to find out what the extra rows may be describing.

tail(Data_Dict)

## # A tibble: 6 x 6
##   FIELD DESCRIPTION REFERENCE UNIT  SOURCE X6   
##   <chr> <chr>       <chr>     <chr> <chr>  <chr>
## 1 <NA>  <NA>        <NA>      <NA>  <NA>   <NA> 
## 2 <NA>  <NA>        <NA>      <NA>  <NA>   <NA> 
## 3 <NA>  <NA>        <NA>      <NA>  <NA>   <NA> 
## 4 <NA>  <NA>        <NA>      <NA>  <NA>   <NA> 
## 5 <NA>  <NA>        <NA>      <NA>  <NA>   <NA> 
## 6 <NA>  <NA>        <NA>      <NA>  <NA>   <NA>

Taking a look at the extra rows, we find that they are rows purely of NA. We will remove these rows.

1.4.3 Removing NA Rows in Data_Dict

Data_Dict <- Data_Dict[!(is.na(Data_Dict$FIELD)),]

nrow(Data_Dict)

## [1] 81

We now see that the number of remaining rows in Data_Dict correspond to the number of columns we have in Brazil_Cities.

Data_Dict also provides us the reference year of the data, allowing us to exlude data that is past 2016 from our study; such as HOTELS and Cars etc. However our population data is also old at 2010, we will accept this however as we do not expect the municipality populations to have drastic changes over the 6 years as there were no major disease outbreaks, save for a couple minor outbreaks, one in 2015 - 2016 involving newborns and microcephaly or in 2016 with Guillain-Barré syndrome, these were likely residual effects from a zika outbreak earlier in October 2015. No wars were recorded in Brazil during the period either.

https://www.who.int/csr/don/21-october-2015-zika/en/ https://www.who.int/csr/don/15-december-2015-microcephaly-brazil/en/ https://www.who.int/csr/don/8-february-2016-gbs-brazil/en/

1.4.4 Selecting Independent Variables

As mentioned, columns with reference years > 2016 will not be considered. Thus we will filter them out .

Brazil_Cities_Filt <- subset(Brazil_Cities, select = c(1:16, 19:26,29,33:66))

From the list of columns in the codechunk below, we will select our independent variables.

colnames(Brazil_Cities_Filt)

##  [1] "CITY"              "STATE"             "CAPITAL"          
##  [4] "IBGE_RES_POP"      "IBGE_RES_POP_BRAS" "IBGE_RES_POP_ESTR"
##  [7] "IBGE_DU"           "IBGE_DU_URBAN"     "IBGE_DU_RURAL"    
## [10] "IBGE_POP"          "IBGE_1"            "IBGE_1-4"         
## [13] "IBGE_5-9"          "IBGE_10-14"        "IBGE_15-59"       
## [16] "IBGE_60+"          "IDHM Ranking 2010" "IDHM"             
## [19] "IDHM_Renda"        "IDHM_Longevidade"  "IDHM_Educacao"    
## [22] "LONG"              "LAT"               "ALT"              
## [25] "AREA"              "RURAL_URBAN"       "GVA_AGROPEC"      
## [28] "GVA_INDUSTRY"      "GVA_SERVICES"      "GVA_PUBLIC"       
## [31] " GVA_TOTAL "       "TAXES"             "GDP"              
## [34] "POP_GDP"           "GDP_CAPITA"        "GVA_MAIN"         
## [37] "MUN_EXPENDIT"      "COMP_TOT"          "COMP_A"           
## [40] "COMP_B"            "COMP_C"            "COMP_D"           
## [43] "COMP_E"            "COMP_F"            "COMP_G"           
## [46] "COMP_H"            "COMP_I"            "COMP_J"           
## [49] "COMP_K"            "COMP_L"            "COMP_M"           
## [52] "COMP_N"            "COMP_O"            "COMP_P"           
## [55] "COMP_Q"            "COMP_R"            "COMP_S"           
## [58] "COMP_T"            "COMP_U"

As we know, our dependent variable is GDP_CAPITA.

The independent variable choices for our study will be:

IBGE_15-59

For selection 1, this is the working populace, who consume products and services, earn an income and pay taxes, hence contributing most to the economy. However, higher populated municipals would likely have a naturally higher economically active populace. Thus, we will need to scale it.

It is concluded that GDP and HDI have a strong positive relationship (https://www.researchgate.net/publication/324803815_RELATIONSHIP_BETWEEN_GROSS_DOMESTIC_PRODUCT_AND_HUMAN_DEVELOPMENT_INDEX), therefore we will be selecting the various IDHM indexes in our study.

IDHM_Renda
IDHM_Longevidade
IDHM_Educacao

For selections 2,3 and 4, it is chosen over IDHM itself because IDHM is a summation of all 3 categories, and selecting the individual categories will allow for a finer view of how each component may contribute more to GDP per capita rather than the entire IDHM index itself.

It is also known that GVA or Gross Value Added is used in the estimation of GDP (https://webarchive.nationalarchives.gov.uk/20160107002005/http://www.ons.gov.uk/ons/guide-method/method-quality/specific/economy/national-accounts/gva/index.html). Hence, we will add it to our study.

Once again instead of selecting the summation, we will select the individual attributes instead.

GVA_AGROPEC
GVA_INDUSTRY
GVA_SERVICES
GVA_PUBLIC

Taxes also form part of the GDP in the Income Approach of calculating GDP, therefore we will add it to our study as well. The numbers may be inflated for more populated cities and will need to be scaled.

On the topic of more populated cities, the higher the population count would likely have higher GDP as the economically active counts will likely be higher relative to less populated cities, meaning more businesses, more consumption and hence, overall higher GDP. However this is my theory, we will have to find out how this truly affects the overall GDP.

Population Density
Taxes

However for taxes, rather than looking at the raw number, it is perhaps better to view it as a % contribution of the overall GDP as it would show the measure of a city’s tax revenue relative to the size of its economy. This is also known as Tax-to-GDP Ratio. (https://www.investopedia.com/terms/t/tax-to-gdp-ratio.asp)

Capital

It would be interesting to see if the municipal being in a capital state would have a bearing on the overall municipal’s GDP.

MUN_EXPENDIT would have been considered as well, as government spending does contribute to the GDP, however the vast number of missing values to remove or impute would likely affect our study negatively and thus will not be considered.

1.4.5 Scaling Independent Variables

It appears IBGE_15-59 gives some error when trying to use the column. We will have to rename it. The column will now be E_Active

Brazil_Cities_Filt <- Brazil_Cities_Filt %>% rename("E_Active" = `IBGE_15-59`)

Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`E_ACTIVE_SC` = `E_Active`/`IBGE_POP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`GVA_AGROPEC_SC` = `GVA_AGROPEC`/`POP_GDP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`GVA_INDUSTRY_SC` = `GVA_INDUSTRY`/`POP_GDP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`GVA_SERVICES_SC` = `GVA_SERVICES`/`POP_GDP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`GVA_PUBLIC_SC` = `GVA_SERVICES`/`POP_GDP`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`POP_DENS` = `POP_GDP`/`AREA`)
Brazil_Cities_Filt <- Brazil_Cities_Filt %>% mutate(`TTG_RATIO` = `TAXES`/`GDP`)

summary(Brazil_Cities_Filt)

##      CITY              STATE              CAPITAL          IBGE_RES_POP     
##  Length:5572        Length:5572        Min.   :0.000000   Min.   :     805  
##  Class :character   Class :character   1st Qu.:0.000000   1st Qu.:    5235  
##  Mode  :character   Mode  :character   Median :0.000000   Median :   10934  
##                                        Mean   :0.004846   Mean   :   34278  
##                                        3rd Qu.:0.000000   3rd Qu.:   23424  
##                                        Max.   :1.000000   Max.   :11253503  
##                                                           NA's   :7         
##  IBGE_RES_POP_BRAS  IBGE_RES_POP_ESTR     IBGE_DU        IBGE_DU_URBAN    
##  Min.   :     805   Min.   :     0.0   Min.   :    239   Min.   :     60  
##  1st Qu.:    5230   1st Qu.:     0.0   1st Qu.:   1572   1st Qu.:    874  
##  Median :   10926   Median :     0.0   Median :   3174   Median :   1846  
##  Mean   :   34200   Mean   :    77.5   Mean   :  10303   Mean   :   8859  
##  3rd Qu.:   23390   3rd Qu.:    10.0   3rd Qu.:   6726   3rd Qu.:   4624  
##  Max.   :11133776   Max.   :119727.0   Max.   :3576148   Max.   :3548433  
##  NA's   :7          NA's   :7          NA's   :9         NA's   :9        
##  IBGE_DU_RURAL      IBGE_POP            IBGE_1            IBGE_1-4     
##  Min.   :    3   Min.   :     174   Min.   :     0.0   Min.   :     5  
##  1st Qu.:  487   1st Qu.:    2801   1st Qu.:    38.0   1st Qu.:   158  
##  Median :  931   Median :    6170   Median :    92.0   Median :   376  
##  Mean   : 1463   Mean   :   27595   Mean   :   383.3   Mean   :  1544  
##  3rd Qu.: 1832   3rd Qu.:   15302   3rd Qu.:   232.0   3rd Qu.:   951  
##  Max.   :33809   Max.   :10463636   Max.   :129464.0   Max.   :514794  
##  NA's   :80      NA's   :7          NA's   :7          NA's   :7       
##     IBGE_5-9        IBGE_10-14        E_Active          IBGE_60+      
##  Min.   :     7   Min.   :    12   Min.   :     94   Min.   :     29  
##  1st Qu.:   220   1st Qu.:   259   1st Qu.:   1734   1st Qu.:    341  
##  Median :   516   Median :   588   Median :   3841   Median :    722  
##  Mean   :  2069   Mean   :  2381   Mean   :  18212   Mean   :   3004  
##  3rd Qu.:  1300   3rd Qu.:  1478   3rd Qu.:   9628   3rd Qu.:   1724  
##  Max.   :684443   Max.   :783702   Max.   :7058221   Max.   :1293012  
##  NA's   :7        NA's   :7        NA's   :7         NA's   :7        
##  IDHM Ranking 2010      IDHM          IDHM_Renda     IDHM_Longevidade
##  Min.   :   1      Min.   :0.4180   Min.   :0.4000   Min.   :0.6720  
##  1st Qu.:1392      1st Qu.:0.5990   1st Qu.:0.5720   1st Qu.:0.7690  
##  Median :2783      Median :0.6650   Median :0.6540   Median :0.8080  
##  Mean   :2783      Mean   :0.6592   Mean   :0.6429   Mean   :0.8016  
##  3rd Qu.:4174      3rd Qu.:0.7180   3rd Qu.:0.7070   3rd Qu.:0.8360  
##  Max.   :5565      Max.   :0.8620   Max.   :0.8910   Max.   :0.8940  
##  NA's   :7         NA's   :7        NA's   :7        NA's   :7       
##  IDHM_Educacao         LONG             LAT               ALT          
##  Min.   :0.2070   Min.   :-72.92   Min.   :-33.688   Min.   :     0.0  
##  1st Qu.:0.4900   1st Qu.:-50.87   1st Qu.:-22.841   1st Qu.:   169.8  
##  Median :0.5600   Median :-46.52   Median :-18.091   Median :   406.5  
##  Mean   :0.5591   Mean   :-46.23   Mean   :-16.449   Mean   :   893.8  
##  3rd Qu.:0.6310   3rd Qu.:-41.40   3rd Qu.: -8.489   3rd Qu.:   628.9  
##  Max.   :0.8250   Max.   :-32.44   Max.   :  4.585   Max.   :874579.0  
##  NA's   :7                                           NA's   :8         
##       AREA           RURAL_URBAN         GVA_AGROPEC       GVA_INDUSTRY     
##  Min.   :     3.57   Length:5572        Min.   :      0   Min.   :       1  
##  1st Qu.:   204.43   Class :character   1st Qu.:   4189   1st Qu.:    1726  
##  Median :   415.92   Mode  :character   Median :  20426   Median :    7424  
##  Mean   :  1515.89                      Mean   :  47271   Mean   :  175928  
##  3rd Qu.:  1026.38                      3rd Qu.:  51227   3rd Qu.:   41022  
##  Max.   :159533.33                      Max.   :1402282   Max.   :63306755  
##  NA's   :3                              NA's   :2         NA's   :2         
##   GVA_SERVICES         GVA_PUBLIC         GVA_TOTAL             TAXES          
##  Min.   :        2   Min.   :       7   Min.   :       17   Min.   :   -14159  
##  1st Qu.:    10112   1st Qu.:   17267   1st Qu.:    42253   1st Qu.:     1305  
##  Median :    31211   Median :   35866   Median :   119492   Median :     5100  
##  Mean   :   489451   Mean   :  123768   Mean   :   832987   Mean   :   118864  
##  3rd Qu.:   115406   3rd Qu.:   89245   3rd Qu.:   313963   3rd Qu.:    22197  
##  Max.   :464656988   Max.   :41902893   Max.   :569910503   Max.   :117125387  
##  NA's   :2           NA's   :2          NA's   :2           NA's   :2          
##       GDP               POP_GDP           GDP_CAPITA       GVA_MAIN        
##  Min.   :       15   Min.   :     815   Min.   :  3191   Length:5572       
##  1st Qu.:    43709   1st Qu.:    5483   1st Qu.:  9058   Class :character  
##  Median :   125153   Median :   11578   Median : 15870   Mode  :character  
##  Mean   :   954584   Mean   :   36998   Mean   : 21126                     
##  3rd Qu.:   329539   3rd Qu.:   25085   3rd Qu.: 26155                     
##  Max.   :687035890   Max.   :12038175   Max.   :314638                     
##  NA's   :2           NA's   :2          NA's   :2                          
##   MUN_EXPENDIT          COMP_TOT            COMP_A            COMP_B       
##  Min.   :1.421e+06   Min.   :     6.0   Min.   :   0.00   Min.   :  0.000  
##  1st Qu.:1.573e+07   1st Qu.:    68.0   1st Qu.:   1.00   1st Qu.:  0.000  
##  Median :2.746e+07   Median :   162.0   Median :   2.00   Median :  0.000  
##  Mean   :1.043e+08   Mean   :   906.8   Mean   :  18.25   Mean   :  1.852  
##  3rd Qu.:5.666e+07   3rd Qu.:   448.0   3rd Qu.:   8.00   3rd Qu.:  2.000  
##  Max.   :4.577e+10   Max.   :530446.0   Max.   :1948.00   Max.   :274.000  
##  NA's   :1491        NA's   :2          NA's   :2         NA's   :2        
##      COMP_C             COMP_D             COMP_E            COMP_F        
##  Min.   :    0.00   Min.   :  0.0000   Min.   :  0.000   Min.   :    0.00  
##  1st Qu.:    3.00   1st Qu.:  0.0000   1st Qu.:  0.000   1st Qu.:    1.00  
##  Median :   11.00   Median :  0.0000   Median :  0.000   Median :    4.00  
##  Mean   :   73.44   Mean   :  0.4262   Mean   :  2.029   Mean   :   43.26  
##  3rd Qu.:   39.00   3rd Qu.:  0.0000   3rd Qu.:  1.000   3rd Qu.:   15.00  
##  Max.   :31566.00   Max.   :332.0000   Max.   :657.000   Max.   :25222.00  
##  NA's   :2          NA's   :2          NA's   :2         NA's   :2         
##      COMP_G             COMP_H          COMP_I             COMP_J        
##  Min.   :     1.0   Min.   :    0   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    32.0   1st Qu.:    1   1st Qu.:    2.00   1st Qu.:    0.00  
##  Median :    74.5   Median :    7   Median :    7.00   Median :    1.00  
##  Mean   :   348.0   Mean   :   41   Mean   :   55.88   Mean   :   24.74  
##  3rd Qu.:   199.0   3rd Qu.:   25   3rd Qu.:   24.00   3rd Qu.:    5.00  
##  Max.   :150633.0   Max.   :19515   Max.   :29290.00   Max.   :38720.00  
##  NA's   :2          NA's   :2       NA's   :2          NA's   :2         
##      COMP_K             COMP_L             COMP_M             COMP_N       
##  Min.   :    0.00   Min.   :    0.00   Min.   :    0.00   Min.   :    0.0  
##  1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    1.00   1st Qu.:    1.0  
##  Median :    0.00   Median :    0.00   Median :    4.00   Median :    4.0  
##  Mean   :   15.55   Mean   :   15.14   Mean   :   51.29   Mean   :   83.7  
##  3rd Qu.:    2.00   3rd Qu.:    3.00   3rd Qu.:   13.00   3rd Qu.:   14.0  
##  Max.   :23738.00   Max.   :14003.00   Max.   :49181.00   Max.   :76757.0  
##  NA's   :2          NA's   :2          NA's   :2          NA's   :2        
##      COMP_O            COMP_P             COMP_Q             COMP_R       
##  Min.   :  0.000   Min.   :    0.00   Min.   :    0.00   Min.   :   0.00  
##  1st Qu.:  2.000   1st Qu.:    2.00   1st Qu.:    1.00   1st Qu.:   0.00  
##  Median :  2.000   Median :    6.00   Median :    3.00   Median :   2.00  
##  Mean   :  3.269   Mean   :   30.96   Mean   :   34.15   Mean   :  12.18  
##  3rd Qu.:  3.000   3rd Qu.:   17.00   3rd Qu.:   12.00   3rd Qu.:   6.00  
##  Max.   :204.000   Max.   :16030.00   Max.   :22248.00   Max.   :6687.00  
##  NA's   :2         NA's   :2          NA's   :2          NA's   :2        
##      COMP_S             COMP_T      COMP_U           E_ACTIVE_SC    
##  Min.   :    0.00   Min.   :0   Min.   :  0.00000   Min.   :0.4716  
##  1st Qu.:    5.00   1st Qu.:0   1st Qu.:  0.00000   1st Qu.:0.6087  
##  Median :   12.00   Median :0   Median :  0.00000   Median :0.6325  
##  Mean   :   51.61   Mean   :0   Mean   :  0.05027   Mean   :0.6308  
##  3rd Qu.:   31.00   3rd Qu.:0   3rd Qu.:  0.00000   3rd Qu.:0.6543  
##  Max.   :24832.00   Max.   :0   Max.   :123.00000   Max.   :0.7448  
##  NA's   :2          NA's   :2   NA's   :2           NA's   :7       
##  GVA_AGROPEC_SC     GVA_INDUSTRY_SC    GVA_SERVICES_SC     GVA_PUBLIC_SC      
##  Min.   :  0.0000   Min.   :  0.0001   Min.   :  0.00049   Min.   :  0.00049  
##  1st Qu.:  0.3538   1st Qu.:  0.2608   1st Qu.:  1.61880   1st Qu.:  1.61880  
##  Median :  1.4255   Median :  0.7902   Median :  3.52067   Median :  3.52067  
##  Mean   :  3.7525   Mean   :  3.3071   Mean   :  5.60526   Mean   :  5.60526  
##  3rd Qu.:  4.7315   3rd Qu.:  2.5900   3rd Qu.:  7.73028   3rd Qu.:  7.73028  
##  Max.   :121.0575   Max.   :254.9198   Max.   :108.80127   Max.   :108.80127  
##  NA's   :2          NA's   :2          NA's   :2           NA's   :2          
##     POP_DENS           TTG_RATIO       
##  Min.   :    0.166   Min.   : -2.0920  
##  1st Qu.:   11.945   1st Qu.:  0.0298  
##  Median :   25.306   Median :  0.0527  
##  Mean   :  117.225   Mean   :  8.5545  
##  3rd Qu.:   55.446   3rd Qu.:  0.0972  
##  Max.   :13533.497   Max.   :324.6684  
##  NA's   :3           NA's   :2

NA data is observed. Next we will have to handle the NA values in our independent variables. We will remove the NA rows.

For our independent data, we observe unique rows with NA data in our POP_DENS, IDHM and E_ACTIVE, removing the NA using these 3 rows are sufficient in removing all the NA data in the other independent variables.

Brazil_Cities_Filt[is.na(Brazil_Cities_Filt$E_Active),]

## # A tibble: 7 x 66
##   CITY  STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
##   <chr> <chr>   <dbl>        <dbl>            <dbl>            <dbl>   <dbl>
## 1 Baln~ SC          0           NA               NA               NA      NA
## 2 Moju~ PA          0           NA               NA               NA      NA
## 3 Para~ MS          0           NA               NA               NA      NA
## 4 Pesc~ SC          0           NA               NA               NA      NA
## 5 Pint~ RS          0           NA               NA               NA      NA
## 6 Sant~ BA          0           NA               NA               NA      NA
## 7 São ~ PE          0           NA               NA               NA      NA
## # ... with 59 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## #   IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## #   `IBGE_10-14` <dbl>, E_Active <dbl>, `IBGE_60+` <dbl>, `IDHM Ranking
## #   2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## #   IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, AREA <dbl>,
## #   RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## #   GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## #   GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## #   MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## #   COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## #   COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## #   COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## #   COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, E_ACTIVE_SC <dbl>,
## #   GVA_AGROPEC_SC <dbl>, GVA_INDUSTRY_SC <dbl>, GVA_SERVICES_SC <dbl>,
## #   GVA_PUBLIC_SC <dbl>, POP_DENS <dbl>, TTG_RATIO <dbl>

Brazil_Cities_Filt[is.na(Brazil_Cities_Filt$IDHM_Renda),]

## # A tibble: 7 x 66
##   CITY  STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
##   <chr> <chr>   <dbl>        <dbl>            <dbl>            <dbl>   <dbl>
## 1 Baln~ SC          0           NA               NA               NA      NA
## 2 Moju~ PA          0           NA               NA               NA      NA
## 3 Para~ MS          0           NA               NA               NA      NA
## 4 Pesc~ SC          0           NA               NA               NA      NA
## 5 Pint~ RS          0           NA               NA               NA      NA
## 6 Sant~ BA          0         9648             9648                0    2891
## 7 São ~ PE          0           NA               NA               NA      NA
## # ... with 59 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## #   IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## #   `IBGE_10-14` <dbl>, E_Active <dbl>, `IBGE_60+` <dbl>, `IDHM Ranking
## #   2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## #   IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, AREA <dbl>,
## #   RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## #   GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## #   GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## #   MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## #   COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## #   COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## #   COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## #   COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, E_ACTIVE_SC <dbl>,
## #   GVA_AGROPEC_SC <dbl>, GVA_INDUSTRY_SC <dbl>, GVA_SERVICES_SC <dbl>,
## #   GVA_PUBLIC_SC <dbl>, POP_DENS <dbl>, TTG_RATIO <dbl>

Brazil_Cities_Filt[is.na(Brazil_Cities_Filt$POP_DENS),]

## # A tibble: 3 x 66
##   CITY  STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR~ IBGE_RES_POP_ES~ IBGE_DU
##   <chr> <chr>   <dbl>        <dbl>            <dbl>            <dbl>   <dbl>
## 1 Japu~ AM          0         7326             7318                8    1043
## 2 Sant~ BA          0           NA               NA               NA      NA
## 3 São ~ PE          0           NA               NA               NA      NA
## # ... with 59 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## #   IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## #   `IBGE_10-14` <dbl>, E_Active <dbl>, `IBGE_60+` <dbl>, `IDHM Ranking
## #   2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## #   IDHM_Educacao <dbl>, LONG <dbl>, LAT <dbl>, ALT <dbl>, AREA <dbl>,
## #   RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## #   GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## #   GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## #   MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## #   COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## #   COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## #   COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## #   COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, E_ACTIVE_SC <dbl>,
## #   GVA_AGROPEC_SC <dbl>, GVA_INDUSTRY_SC <dbl>, GVA_SERVICES_SC <dbl>,
## #   GVA_PUBLIC_SC <dbl>, POP_DENS <dbl>, TTG_RATIO <dbl>

The code chunk below removes the NA values from our independent variables

Brazil_Cities_Filt <- Brazil_Cities_Filt %>%
  filter(!is.na(`E_Active`)) %>% 
  filter(!is.na(`POP_DENS`)) %>%
  filter(!is.na(`IDHM_Renda`))

Ensuring that our independent variables do not have any NA values

colnames(Brazil_Cities_Filt)[colSums(is.na(Brazil_Cities_Filt)) > 0]

## [1] "IBGE_DU"       "IBGE_DU_URBAN" "IBGE_DU_RURAL" "ALT"          
## [5] "MUN_EXPENDIT"

As we observe, our independent variables do not have any NA values remaining.

We will next select the only columns that we will need for the analysis.

BC_IndepVar <- subset(Brazil_Cities_Filt, select = c(1:3,19:23,35,60:66))

BC_IndepVar_sf <- st_as_sf(BC_IndepVar, coords = c("LONG","LAT"), crs=4326) %>%
  st_transform(crs=4674)

1.5 Joining Aspatial to Geospatial Data

We will join our newly created BC_IndepVar_sf simple feature data frame with our Municipal2016 simple feature data frame

Municipal_BC <- st_join(Municipal2016, BC_IndepVar_sf)

## although coordinates are longitude/latitude, st_intersects assumes that they are planar

Now we’ll check for any NA values present

Municipal_BC[rowSums(is.na(Municipal_BC)) > 0,]

## Simple feature collection with 9 features and 18 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: -70.05811 ymin: -33.61653 xmax: -39.43499 ymax: 0.06830957
## geographic CRS: SIRGAS 2000
##      code_muni         name_muni code_state abbrev_state CITY STATE CAPITAL
## 106    1302108            Japurá         13           AM <NA>  <NA>      NA
## 225    1504752  Mojuí Dos Campos         15           PA <NA>  <NA>      NA
## 2175   2928505   Santa Terezinha         29           BA <NA>  <NA>      NA
## 4504   4212650    Pescaria Brava         42           SC <NA>  <NA>      NA
## 4606   4220000  Balneário Rincão         42           SC <NA>  <NA>      NA
## 4607   4300001       Lagoa Mirim         43           RS <NA>  <NA>      NA
## 4608   4300002   Lagoa Dos Patos         43           RS <NA>  <NA>      NA
## 4926   4314548    Pinto Bandeira         43           RS <NA>  <NA>      NA
## 5163   5006275 Paraíso Das Águas         50           MS <NA>  <NA>      NA
##      IDHM_Renda IDHM_Longevidade IDHM_Educacao GDP_CAPITA E_ACTIVE_SC
## 106          NA               NA            NA         NA          NA
## 225          NA               NA            NA         NA          NA
## 2175         NA               NA            NA         NA          NA
## 4504         NA               NA            NA         NA          NA
## 4606         NA               NA            NA         NA          NA
## 4607         NA               NA            NA         NA          NA
## 4608         NA               NA            NA         NA          NA
## 4926         NA               NA            NA         NA          NA
## 5163         NA               NA            NA         NA          NA
##      GVA_AGROPEC_SC GVA_INDUSTRY_SC GVA_SERVICES_SC GVA_PUBLIC_SC POP_DENS
## 106              NA              NA              NA            NA       NA
## 225              NA              NA              NA            NA       NA
## 2175             NA              NA              NA            NA       NA
## 4504             NA              NA              NA            NA       NA
## 4606             NA              NA              NA            NA       NA
## 4607             NA              NA              NA            NA       NA
## 4608             NA              NA              NA            NA       NA
## 4926             NA              NA              NA            NA       NA
## 5163             NA              NA              NA            NA       NA
##      TTG_RATIO                           geom
## 106         NA POLYGON ((-69.89499 -0.1310...
## 225         NA MULTIPOLYGON (((-54.58777 -...
## 2175        NA MULTIPOLYGON (((-39.56748 -...
## 4504        NA MULTIPOLYGON (((-48.92608 -...
## 4606        NA MULTIPOLYGON (((-49.20933 -...
## 4607        NA POLYGON ((-52.62241 -32.146...
## 4608        NA POLYGON ((-51.2719 -30.0389...
## 4926        NA POLYGON ((-51.45725 -29.029...
## 5163        NA POLYGON ((-52.90027 -18.782...

We observe that there are 9 rows with NA values. We will proceed to remove them.

Municipal_BC <- Municipal_BC %>%
  filter(!is.na(`CITY`))

Ensuring no NA rows remain.

Municipal_BC[rowSums(is.na(Municipal_BC)) > 0,]

## Simple feature collection with 0 features and 18 fields
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## geographic CRS: SIRGAS 2000
##  [1] code_muni        name_muni        code_state       abbrev_state    
##  [5] CITY             STATE            CAPITAL          IDHM_Renda      
##  [9] IDHM_Longevidade IDHM_Educacao    GDP_CAPITA       E_ACTIVE_SC     
## [13] GVA_AGROPEC_SC   GVA_INDUSTRY_SC  GVA_SERVICES_SC  GVA_PUBLIC_SC   
## [17] POP_DENS         TTG_RATIO        geom            
## <0 rows> (or 0-length row.names)

From the output we can see no NA rows remain.

Next we will check if there are any invalid polygons in our joined simple feature data frame.

unique(st_is_valid(Municipal_BC))

## [1] TRUE

As we can see there are no invalid polygons.

1.6 Visualising Distribution of GDP per capita 2016

We will plot a choropleth map showing the distribution of GDP per capita, 2016 at Municipality Level. We will increase the number of classes to have a finer distribution of the GDP_CAPITA.

We will also print two maps with inverse palette so that the macro details are not lost to contrast as well.

GDP2016_CMAP <- tm_shape(Municipal_BC) +
  tm_polygons(col = "GDP_CAPITA",n = 15, border.col = 0.15, palette = "Reds") +
  tm_layout(title = "Brazil Municipal 2016 GDP Per Capita",
    title.size = 1,
    title.position = c("center", "top"),
    inner.margins = c(0.06, 0.10, 0.10, 0.08),
    legend.width = 0.3,
    legend.height = 0.3)

GDP2016_INV_CMAP <- tm_shape(Municipal_BC) +
  tm_polygons(col = "GDP_CAPITA",n = 15, border.col = 0.15, palette = "-Reds") +
  tm_layout(title = "Brazil Municipal 2016 GDP Per Capita",
    title.size = 1,
    title.position = c("center", "top"),
    inner.margins = c(0.06, 0.10, 0.10, 0.08),
    legend.width = 0.3,
    legend.height = 0.3)

tmap_arrange(GDP2016_CMAP,GDP2016_INV_CMAP)

We can observe a cluster of municipalities with higher GDP_CAPITA in the central region. The area around the central cluster also generally shows a higher GDP_CAPITA relative to the pheripheral (Western and Eastern) regions. The highest GDP_CAPITA is recorded south of that cluster with high GDP_CAPITA small municipals scattered around. Lowest GDP_CAPITAs are observeede around the Western and Eastern municipality regions of Brazil.

2 Exploratory Data Analysis

2.1 Visualising GDP_CAPITA Distribution

We will take a look at the distribution of GDP per capita in a histogram. To make the scales more readable, we will divide GDP_CAPITA by 1000.

ggplot(data=Municipal_BC, aes(x=GDP_CAPITA/1000)) +
  geom_histogram(bins=20, color="white", fill="red") +
  ggtitle("Brazil Municipality GDP per capita Distribution, 2016") +
  labs(x="Brazil Municipality GDP per capita, 2016 ($ 1,000 reais) ", y="count")

We observe a clear right-skewed distribution of GDP per capita or a higher number of municipals with lower GDP_CAPITA. This is to be expected as income is usually skewed.

2.2 Visualising the Distribution of Independent Variables

First we will create the histograms for each independent variable

E_ACTIVE_HIST <- ggplot(data=Municipal_BC, aes(x=E_ACTIVE_SC)) +
  geom_histogram(bins=20, color="black", fill="light blue")

RENDA_HIST <- ggplot(data=Municipal_BC, aes(x=IDHM_Renda)) +
  geom_histogram(bins=20, color="black", fill="light blue")

LONG_HIST <- ggplot(data=Municipal_BC, aes(x=IDHM_Longevidade)) +
  geom_histogram(bins=20, color="black", fill="light blue")

EDU_HIST <- ggplot(data=Municipal_BC, aes(x=IDHM_Educacao)) +
  geom_histogram(bins=20, color="black", fill="light blue")

GVA_AGRO_HIST <- ggplot(data=Municipal_BC, aes(x=GVA_AGROPEC_SC)) +
  geom_histogram(bins=20, color="black", fill="light blue")

GVA_IND_HIST <- ggplot(data=Municipal_BC, aes(x=GVA_INDUSTRY_SC)) +
  geom_histogram(bins=20, color="black", fill="light blue")

GVA_SVC_HIST <- ggplot(data=Municipal_BC, aes(x=GVA_SERVICES_SC)) +
  geom_histogram(bins=20, color="black", fill="light blue")

GVA_PUB_HIST <- ggplot(data=Municipal_BC, aes(x=GVA_PUBLIC_SC)) +
  geom_histogram(bins=20, color="black", fill="light blue")

POP_DENS_HIST <- ggplot(data=Municipal_BC, aes(x=POP_DENS)) +
  geom_histogram(bins=20, color="black", fill="light blue")

TTG_RATIO_HIST <- ggplot(data=Municipal_BC, aes(x=TTG_RATIO)) +
  geom_histogram(bins=20, color="black", fill="light blue")

The code chunk below prints the independent variables that we will use

ggarrange(E_ACTIVE_HIST, RENDA_HIST, LONG_HIST, EDU_HIST, GVA_AGRO_HIST, GVA_IND_HIST, GVA_SVC_HIST, GVA_PUB_HIST, POP_DENS_HIST, TTG_RATIO_HIST, ncol = 3, nrow = 4)

We observe that the our E_ACTIVE_SC and IDHM variables are normally distributed and within the 0-1 value range. However our GVA, POP_DENS and TTG_RATIO are all right-skewed with large variable ranges especially with POP_DENS and thus will require normalisation. Note that our dependent variable GDP_CAPITA is not within the range of 0-1 and will require normalisation as well.

Since the data for the variables we are normalising are not normally distributed, we will use the Min-Max method.

Municipal_BC$GVA_AGROPEC_SC <- normalize(Municipal_BC$GVA_AGROPEC_SC)
Municipal_BC$GVA_INDUSTRY_SC <- normalize(Municipal_BC$GVA_INDUSTRY_SC)
Municipal_BC$GVA_SERVICES_SC <- normalize(Municipal_BC$GVA_SERVICES_SC)
Municipal_BC$GVA_PUBLIC_SC <- normalize(Municipal_BC$GVA_PUBLIC_SC)
Municipal_BC$POP_DENS <- normalize(Municipal_BC$POP_DENS)
Municipal_BC$TTG_RATIO <- normalize(Municipal_BC$TTG_RATIO)
Municipal_BC$GDP_CAPITA <- normalize(Municipal_BC$GDP_CAPITA)

GVA_AGRO_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GVA_AGROPEC_SC)) +
  geom_histogram(bins=20, color="black", fill="light blue")

GVA_IND_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GVA_INDUSTRY_SC)) +
  geom_histogram(bins=20, color="black", fill="light blue")

GVA_SVC_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GVA_SERVICES_SC)) +
  geom_histogram(bins=20, color="black", fill="light blue")

GVA_PUB_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GVA_PUBLIC_SC)) +
  geom_histogram(bins=20, color="black", fill="light blue")

POP_DENS_HIST_NM <- ggplot(data=Municipal_BC, aes(x=POP_DENS)) +
  geom_histogram(bins=20, color="black", fill="light blue")

TTG_RATIO_HIST_NM <- ggplot(data=Municipal_BC, aes(x=TTG_RATIO)) +
  geom_histogram(bins=20, color="black", fill="light blue")

GDP_CAPITA_HIST_NM <- ggplot(data=Municipal_BC, aes(x=GDP_CAPITA)) +
  geom_histogram(bins=20, color="black", fill="light blue")

ggarrange(GVA_AGRO_HIST_NM, GVA_IND_HIST_NM, GVA_SVC_HIST_NM, GVA_PUB_HIST_NM, POP_DENS_HIST_NM, TTG_RATIO_HIST_NM, GDP_CAPITA_HIST_NM, ncol = 3, nrow = 3)

We see that the data ranges are now betwee 0 and 1

3 Multiple Linear Regression

3.1 Correlation Matrix

We will first plot the correlation matrix to find high correlation coefficient magnitudes between our independent variables.

The code chunk below is to create a dataframe with our Municipal_BC Simple Feature Data Frame

Municipal_BC_df <- as.data.frame(Municipal_BC)

Next we will use the columns of the newly created data frame to create our correlation matrix

corrplot(cor(Municipal_BC_df[, 7:18]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

We observe that GVA_SERVICES_SC and GVA_PUBLIC_SC have a strong positive correlation coefficient at 1 and thus we will omit GVA_PUBLIC_SC from our future analysis.

Next we see that IDHM_Renda has a high correlation coefficient magnitude (>0.8) with IDHM_Longevidade at 0.83 and IDHM_Educaco at 0.82. Hence we will omit IDHM_Renda from our future analysis.

3.2 Calibrating our Multiple Linear Regression Model

With our highly correlated independent variables removed, we can proceed with calibrating our multiple linear regression model.

For our analysis we will be assuming 95% confidence interval or 0.05 alpha value

Municipal_BC.mlr <- lm(GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + POP_DENS + TTG_RATIO + CAPITAL, data=Municipal_BC)
summary(Municipal_BC.mlr)

## 
## Call:
## lm(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + 
##     GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + POP_DENS + 
##     TTG_RATIO + CAPITAL, data = Municipal_BC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10127 -0.01036 -0.00407  0.00328  0.83669 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.110319   0.009951 -11.087  < 2e-16 ***
## E_ACTIVE_SC       0.092282   0.016697   5.527 3.41e-08 ***
## IDHM_Longevidade  0.060388   0.012449   4.851 1.26e-06 ***
## IDHM_Educacao     0.033109   0.006210   5.332 1.01e-07 ***
## GVA_AGROPEC_SC    0.376736   0.007214  52.222  < 2e-16 ***
## GVA_INDUSTRY_SC   0.925389   0.010191  90.802  < 2e-16 ***
## GVA_SERVICES_SC   0.360013   0.007123  50.541  < 2e-16 ***
## POP_DENS          0.015377   0.008679   1.772  0.07649 .  
## TTG_RATIO         0.013037   0.004636   2.812  0.00493 ** 
## CAPITAL          -0.001007   0.005513  -0.183  0.85505    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02713 on 5553 degrees of freedom
## Multiple R-squared:  0.8273, Adjusted R-squared:  0.8271 
## F-statistic:  2957 on 9 and 5553 DF,  p-value: < 2.2e-16

At our confidence interval of 95%, we observe that only cAPITAL and POP_DENS are not statistically significant and hence will be dropped from the multiple linear regression model.

3.3 Calibrating our Revised MLR Model

This code chunk will calibrate a revised MLR model with the prior statistically insignificant variables dropped.

Municipal_BC.mlr1 <- lm(GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC)
ols_regress(Municipal_BC.mlr1)

##                         Model Summary                          
## --------------------------------------------------------------
## R                       0.910       RMSE                0.027 
## R-Squared               0.827       Coef. Var          47.129 
## Adj. R-Squared          0.827       MSE                 0.001 
## Pred R-Squared          0.826       MAE                 0.012 
## --------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                  ANOVA                                  
## -----------------------------------------------------------------------
##                Sum of                                                  
##               Squares          DF    Mean Square       F          Sig. 
## -----------------------------------------------------------------------
## Regression     19.587           7          2.798     3799.94    0.0000 
## Residual        4.090        5555          0.001                       
## Total          23.677        5562                                      
## -----------------------------------------------------------------------
## 
##                                      Parameter Estimates                                       
## ----------------------------------------------------------------------------------------------
##            model      Beta    Std. Error    Std. Beta       t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------------
##      (Intercept)    -0.111         0.010                 -11.212    0.000    -0.131    -0.092 
##      E_ACTIVE_SC     0.094         0.017        0.046      5.667    0.000     0.062     0.127 
## IDHM_Longevidade     0.060         0.012        0.041      4.800    0.000     0.035     0.084 
##    IDHM_Educacao     0.034         0.006        0.048      5.472    0.000     0.022     0.046 
##   GVA_AGROPEC_SC     0.375         0.007        0.304     52.403    0.000     0.361     0.389 
##  GVA_INDUSTRY_SC     0.924         0.010        0.549     90.909    0.000     0.904     0.944 
##  GVA_SERVICES_SC     0.362         0.007        0.352     51.289    0.000     0.348     0.375 
##        TTG_RATIO     0.013         0.005        0.016      2.819    0.005     0.004     0.022 
## ----------------------------------------------------------------------------------------------

It is observed that our adjusted R^2 value is at 0.827 which tells us that 82.7% of the variation in our dependent variable is explained by the independent variables.

3.4 Checking for Multicolinearity

We will use ols_vif_tol() to test for multicolinearity between our independent variables.

ols_vif_tol(Municipal_BC.mlr1)

##          Variables Tolerance      VIF
## 1      E_ACTIVE_SC 0.4641728 2.154370
## 2 IDHM_Longevidade 0.4290432 2.330768
## 3    IDHM_Educacao 0.3971184 2.518141
## 4   GVA_AGROPEC_SC 0.9265720 1.079247
## 5  GVA_INDUSTRY_SC 0.8513631 1.174587
## 6  GVA_SERVICES_SC 0.6589498 1.517566
## 7        TTG_RATIO 0.9810219 1.019345

We observe that none of our variables have a VIF of >=10, hence we can conclude that there are no signs of multicolinearity between our independent variables.

3.5 Test for Non-Linearity

The below code chunk will test for the assumption of linearity between the independent and dependent variables.

ols_plot_resid_fit(Municipal_BC.mlr1)

It can be observed from the output plot that most of the data points are scattered around the 0 line. Thus we can conclude that the relationship between the independent variables and the dependent variables are linear.

3.6 Test for Normality Assumption

The code chunk below will test for normality using the ols_plot_resid_hist() function.

ols_plot_resid_hist(Municipal_BC.mlr1)

The output figure reveals that the residual of the multiple linear regression resembles normal distribution

3.7 Test for Spatial Autocorrelation

For our spatial autocorrelation testing,

Our confidence interval will be 95% with an alpha value of 0.05.

Null Hypothesis H0: The residuals of the MLR model are randomly distributed.

H1: The residuals of the MLR model are NOT randomly distributed.

We will firstly need to extract our residuals from our MLR model and save it as a dataframe

mlr.output <- as.data.frame(Municipal_BC.mlr1$residuals)

Next we’ll join our newly created residuals data frame with Municipal_BC simple feature object

Municipal_BC.sf <- cbind(Municipal_BC, Municipal_BC.mlr1$residuals) %>%
  rename(`MLR_RES` = `Municipal_BC.mlr1.residuals`)

After that we will convert our joined simple feature object into a SpatialPointsDataFrame as spdep only can process sp conformed spatial data objects.

Municipal_BC.sp <- as_Spatial(Municipal_BC.sf)
Municipal_BC.sp

## class       : SpatialPolygonsDataFrame 
## features    : 5563 
## extent      : -73.99045, -28.83594, -33.75118, 5.271841  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs 
## variables   : 19
## names       : code_muni,       name_muni, code_state, abbrev_state,            CITY, STATE, CAPITAL, IDHM_Renda, IDHM_Longevidade, IDHM_Educacao, GDP_CAPITA,       E_ACTIVE_SC, GVA_AGROPEC_SC, GVA_INDUSTRY_SC, GVA_SERVICES_SC, ... 
## min values  :   1100015, Abadia De Goiás,         11,           AC, Abadia De Goiás,    AC,       0,        0.4,            0.672,         0.207,          0, 0.471631205673759,              0,               0,               0, ... 
## max values  :   5300108,          Zortéa,         53,           TO,          Zortéa,    TO,       1,      0.891,            0.894,         0.825,          1, 0.744760130414532,              1,               1,               1, ...

3.7.1 Visualising Residuals

Next we will visualise our residuals by plotting a choropleth map together with an interactive point symbol map to enable us to have a more precise investigation.

Resid_Cmap <- tm_shape(Municipal_BC) +
  tm_polygons(border.col = 0.5) +
  tm_shape(Municipal_BC.sp) +
  tm_fill(col="MLR_RES",
          alpha = 0.6,
          style="quantile",
          midpoint = 0)+
   tm_layout(title = "GDP Per Capita Residual Distribution",
    title.size = 1,
    title.position = c("center", "top"),
    inner.margins = c(0.06, 0.10, 0.10, 0.08),
    legend.width = 0.64,
    legend.height = 0.64)

Resid_Pmap <- tm_shape(Municipal_BC)+
  tm_polygons(border.col = 0.5)+
tm_shape(Municipal_BC.sf)+
  tm_dots(col="MLR_RES", alpha=0.6, style="quantile", midpoint = 0)+
  tm_layout(title = "GDP Per Capita Residual Point Map",
    title.size = 1,
    title.position = c("center", "top"),
    inner.margins = c(0.06, 0.10, 0.10, 0.08),
    legend.width = 0.64,
    legend.height = 0.64)

tmap_arrange(Resid_Cmap, Resid_Pmap, ncol = 2)

We will perform more statistical tests to determine if there is spatial autocorrelation, however from the point map we can observe some spatial autocorrelation such as in the south eastern regions of brazil by the shorelines.

3.7.2 Moran’s I Test

Our next steps will involve performing the Moran’s I Test.

But before we do so, we will find the distance summary for computing our distance-based matrix.

coords <- coordinates(Municipal_BC.sp)
k <- knn2nb(knearneigh(coords))
kdists <- unlist(nbdists(k, coords, longlat=FALSE))
summary(kdists)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01289 0.10908 0.14688 0.19638 0.22004 3.33432

The output tells us 3.33432 is the max, therefore we will adopt 3.4 as our upper threshold distance to ensure all units will have at least one neighbour.

The code chunk below will compute the distance-based weight matrix and print out the summary data of our matrix.

nb <- dnearneigh(coords, 0, 3.4, longlat=FALSE)
nb_lw <- nb2listw(nb, style='B')

summary(nb_lw)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 5563 
## Number of nonzero links: 2983304 
## Percentage nonzero weights: 9.640052 
## Average number of links: 536.2761 
## Link number distribution:
## 
##   3   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26 
##   2   1   2   4   1   2   7  13   4   7   6   4  10   9   2   3   7   4   6   5 
##  27  28  29  30  31  32  33  34  36  37  38  39  40  41  42  43  44  45  46  47 
##   8   6   6   4   5   3   2   2   1   2   1   2   5   3   5   1   3   2   5   6 
##  48  49  50  51  52  53  54  55  56  57  58  59  60  61  63  64  65  66  67  69 
##   5   5   8   6   5   3   5  12  11   8   8   2  11   3   1   1   1   6   2   5 
##  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  90 
##   3   3   7   1   3   2   5   3   2   4   2   3   1   3   4   3   4   1   5   4 
##  91  92  96  97  98 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 
##   3   1   1   1   3   3   1   2   3   3   3   1   1   1   5   1   6   1   1   2 
## 115 116 117 118 119 120 121 122 123 124 125 127 128 130 131 132 133 134 135 136 
##   1   3   1  10   3   1   1   3   2   1   1   3   3   4   3   2   5   2   4   2 
## 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 
##   2   2   4   2   2   3   7   2   4   8   7   7   5   3   5   9   6   5   9   7 
## 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 
##   5   7   4   2   6   5   3   7   7   1   7   8   8   4   3   4   3   2   9   7 
## 177 178 179 180 181 182 183 184 185 186 187 188 190 191 192 193 194 195 196 197 
##   4   2   3   1   3   2   1   3   1   2   2   3   5   1   3   4   5   5   4   2 
## 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 
##   4   1   2   4   2   1   2   2   2   1   1   4   3   3   2   4   1   4   6   2 
## 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 
##   6   4   4   8   1   2   9   6   3   7   2   4   2   2   3   5   4   5   4   5 
## 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 
##   5   6   4   3   6   3   2   7   4   2   6   8   4   6   5   7   8   6   7   7 
## 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 
##   2   5   3   7   3   7   6   3   9   5   5   5   5   5   4  11   7   3   7   4 
## 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 297 298 
##   6   9   6   9   7   6   5   8   6   2   4  11   6   8   3   5   5   8   1   4 
## 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 
##   4   7   3   3   4   1   6   7   3   7   7   3   4   7   7   5   7   4   3   7 
## 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 
##   7  12   2   4   7   5   6   6   2   7   7   7   3   3   5   5   2   1   5   5 
## 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 
##   3   8   3   6   5   7   6   7   4   6   6   5   8   6   3   3   4   4  10   2 
## 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 
##   4   6  10   5   3   6   4   4   5   5   9   7   1   6   5   7   8   6  10   6 
## 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 
##   7   9  15   8   8   7  12  11   5   6   6  14   9   7   8   4  11   6   8   9 
## 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 
##   3   6   4   5   5   6   7   6   9  16   8   3   6   4   6   6   4   6   8   5 
## 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 
##   7   6   2   7   7   2   2   3   4   7   6   2   9   7   2   7   7   4   3  10 
## 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 
##   7   5   4   7   1   4   4   2   3   4   4   3   4   7   5   3   3   3   7   5 
## 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 
##   4   7   7   2   4   3   4   3   8   2   2   4   4   4   6   7   2   2   4   2 
## 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 
##   6   3   7   1   2   4   6   2   5   3   3   5   2   3   3   7   6   8   3   2 
## 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 
##   4   7  10   6   5   5   6   9   1   3   4   7  11   4   5   8   4   3   6   5 
## 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 
##   2   2   3   5   2   5   6   4   2   9   5   2   4   3   7   7   7   8   8   3 
## 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 
##   4   4   6   4   5   2   6   2   7   5  10   6   4   7   5  11   5   4   2  11 
## 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 
##   2   5   5  11   4  11   7   8   5   6   8   5   4   4   5   6  13   7   5   4 
## 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 
##   4  13  11   8   4   5   5   9   4   8   4   7   6   5  15   7  13   7  15   5 
## 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 
##   5   6   6  10   7   6   8   7   6  12   5  10  11   6   5  10   8   7   6   8 
## 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 
##   9  11  12  10   8  10   4   9   8   9   8  10   7   7   9   8  11   6  15   9 
## 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 
##  12   7  11   9   6  11   3  12   7   8  14  11  10  14   6   7  12  10  10  10 
## 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 
##  10  11   9   8  12  13  16   3  10   8  12   8  10  13   8   7   5   7  14   9 
## 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 
##   8   6  13  11  15  13  12   3  10   9   4  20  13   7  12  12   8   8  10  10 
## 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 
##   5   4  12  10  12  16  11  15   7  10  10  12   5   9   9  14  12  11   5  11 
## 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 
##   8   9   7   8   8  13  12  12  12   9  15   5   7  12   9  14   8  15   8   5 
## 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 
##   5  16   4  11   9  11   8  14  14   9  14   9   4  11   7   9  15   5   8  12 
## 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 
##   7   7  13  12   5   9  10   9   9   7  14   9   6   2  12  11   6  12  14   9 
## 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 
##   7   6   7   8  11  11   7   6   5  12   6  10   7   8  14   8  12   9   7  16 
## 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 
##   6   8   9   8  10   8  12  10  12   5   4  13  11   5  10   9   4   8  10   6 
## 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 
##  11  10  12   6   8  11  17   9  10   7   8   4   6   8  10   7   5  11   6  11 
## 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 
##  11  10  16   8   4  12   8   8   8  11  12   7  10  10   5  12   7   9   5   4 
## 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 
##  10   8   5  15   6   9  10  11   6   8   8   7   8   2   8   8   6   5   7   7 
## 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 
##   3   7   8   5   9   3   3   8   5   8   6   7   7   2   7   6   7   8   6   6 
## 899 900 901 902 903 904 905 906 907 908 910 911 912 913 914 917 918 920 921 922 
##   5   2   7   5   2   4   5   1   3   1   3   1   1   2   1   1   3   2   2   1 
## 923 925 928 931 932 934 935 944 948 949 954 955 956 
##   2   1   2   1   2   1   1   2   2   1   1   1   1 
## 2 least connected regions:
## 125 1524 with 3 links
## 1 most connected region:
## 4032 with 956 links
## 
## Weights style: B 
## Weights constants summary:
##      n       nn      S0      S1         S2
## B 5563 30946969 2983304 5966608 7764659904

Now we will compute our Moran’s I with the code chunk below.

lm.morantest(Municipal_BC.mlr1,nb_lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade +
## IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC +
## TTG_RATIO, data = Municipal_BC)
## weights: nb_lw
## 
## Moran I statistic standard deviate = 8.6329, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     5.550393e-03    -3.663159e-04     4.697237e-07

It is observed here that the p value is less than our alpha value of 0.05, and therefore we reject the null hypothesis that the residual distribution is random.

Also, our Observed Moran’s I is > 0 at 0.005550393. Thus we can conclude that residuals resemble cluster distribution.

4 Geographically Weighted Regression

4.1 Fixed Bandwidth GWR Model

4.1.1 Building the Fixed Bandwidth GWR Model

The code chunk below will help us find our optimal bandwidth recommendation for our fixed bandwidth gwr model.

bw.fixed <- bw.gwr(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC.sp, approach="CV", kernel="gaussian", adaptive=FALSE, longlat=FALSE)

## Take a cup of tea and have a break, it will take a few minutes.
##           -----A kind suggestion from GWmodel development group
## Fixed bandwidth: 34.70094 CV score: 4.121122 
## Fixed bandwidth: 21.45065 CV score: 4.11316 
## Fixed bandwidth: 13.26152 CV score: 4.096541 
## Fixed bandwidth: 8.200357 CV score: 4.071068 
## Fixed bandwidth: 5.072388 CV score: 4.044393 
## Fixed bandwidth: 3.139197 CV score: 4.036085 
## Fixed bandwidth: 1.944419 CV score: 4.063393 
## Fixed bandwidth: 3.87761 CV score: 4.036465 
## Fixed bandwidth: 2.682833 CV score: 4.040249 
## Fixed bandwidth: 3.421246 CV score: 4.035577 
## Fixed bandwidth: 3.595562 CV score: 4.035705 
## Fixed bandwidth: 3.313513 CV score: 4.035652

We see the output bandwidth recommendation of 3.313513 metres.

4.1.2 Constructing the fixed bandwidth gwr model

gwr.fixed <- gwr.basic(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC.sp, bw=bw.fixed, kernel = 'gaussian', longlat = FALSE)

gwr.fixed

##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2020-05-31 23:10:32 
##    Call:
##    gwr.basic(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + 
##     IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + 
##     TTG_RATIO, data = Municipal_BC.sp, bw = bw.fixed, kernel = "gaussian", 
##     longlat = FALSE)
## 
##    Dependent (y) variable:  GDP_CAPITA
##    Independent variables:  E_ACTIVE_SC IDHM_Longevidade IDHM_Educacao GVA_AGROPEC_SC GVA_INDUSTRY_SC GVA_SERVICES_SC TTG_RATIO
##    Number of data points: 5563
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10000 -0.01043 -0.00407  0.00335  0.83660 
## 
##    Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)      -0.111384   0.009934 -11.212  < 2e-16 ***
##    E_ACTIVE_SC       0.094369   0.016652   5.667 1.52e-08 ***
##    IDHM_Longevidade  0.059676   0.012432   4.800 1.63e-06 ***
##    IDHM_Educacao     0.033865   0.006189   5.472 4.65e-08 ***
##    GVA_AGROPEC_SC    0.375172   0.007159  52.403  < 2e-16 ***
##    GVA_INDUSTRY_SC   0.924359   0.010168  90.909  < 2e-16 ***
##    GVA_SERVICES_SC   0.361674   0.007052  51.289  < 2e-16 ***
##    TTG_RATIO         0.013063   0.004634   2.819  0.00483 ** 
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 0.02714 on 5555 degrees of freedom
##    Multiple R-squared: 0.8272
##    Adjusted R-squared: 0.827 
##    F-statistic:  3800 on 7 and 5555 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 4.090433
##    Sigma(hat): 0.02712116
##    AIC:  -24333.28
##    AICc:  -24333.25
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: gaussian 
##    Fixed bandwidth: 3.421246 
##    Regression points: the same locations as observations are used.
##    Distance metric: Euclidean distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                           Min.    1st Qu.     Median    3rd Qu.   Max.
##    Intercept        -0.2445780 -0.1547615 -0.1166513 -0.0439943 0.0776
##    E_ACTIVE_SC      -0.1762191  0.0501316  0.1093239  0.1828372 0.2774
##    IDHM_Longevidade -0.0413518  0.0118118  0.0227798  0.0590263 0.2692
##    IDHM_Educacao    -0.0639098  0.0128900  0.0345071  0.0473316 0.0933
##    GVA_AGROPEC_SC    0.2532614  0.3469561  0.3626922  0.3711437 0.4089
##    GVA_INDUSTRY_SC   0.8059552  0.8955244  0.9467388  1.0010579 1.7183
##    GVA_SERVICES_SC   0.1636195  0.3265849  0.3492343  0.3638618 0.6414
##    TTG_RATIO        -0.0238476  0.0087139  0.0115743  0.0154589 0.0537
##    ************************Diagnostic information*************************
##    Number of data points: 5563 
##    Effective number of parameters (2trace(S) - trace(S'S)): 113.7207 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5449.279 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -24569.17 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -24655.64 
##    Residual sum of squares: 3.816032 
##    R-square value:  0.8388301 
##    Adjusted R-square value:  0.8354661 
## 
##    ***********************************************************************
##    Program stops at: 2020-05-31 23:10:39

We observe that the adjusted r square value improved from 0.827 to 0.835, thus explaining the variation better than our multiple linear regression model. We also observe a lower GWR AIC value at -24655.64 than the GR AIC at -24333.28 which indicates that the GWR model has a better fit.

We will compare it with adaptive bandwidth to see which is better.

4.2 Adaptive Bandwidth GWR Model

4.2.1 Building the Adaptive Bandwidth GWR Model

The code chunk below will help us find our optimal number of points for our adaptive bandwidth gwr model.

dMat <- gw.dist(dp.locat = coordinates(Municipal_BC.sp))

bw.adaptive <- bw.gwr(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC.sp, approach="CV", kernel="gaussian",
adaptive=TRUE, longlat=FALSE, dMat=dMat)

## Take a cup of tea and have a break, it will take a few minutes.
##           -----A kind suggestion from GWmodel development group
## Adaptive bandwidth: 3445 CV score: 4.101626 
## Adaptive bandwidth: 2137 CV score: 4.078878 
## Adaptive bandwidth: 1327 CV score: 4.047649 
## Adaptive bandwidth: 828 CV score: 4.020536 
## Adaptive bandwidth: 518 CV score: 4.006486 
## Adaptive bandwidth: 328 CV score: 4.010149 
## Adaptive bandwidth: 637 CV score: 4.011214 
## Adaptive bandwidth: 446 CV score: 4.006151 
## Adaptive bandwidth: 400 CV score: 4.006689 
## Adaptive bandwidth: 473 CV score: 4.005898 
## Adaptive bandwidth: 491 CV score: 4.006018 
## Adaptive bandwidth: 463 CV score: 4.005963

From the output we can see that we are recommended 463 points for our adaptive bandwidth.

4.2.2 Constructing the adaptive bandwidth gwr model

gwr.adaptive <- gwr.basic(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + TTG_RATIO, data=Municipal_BC.sp, bw=bw.adaptive, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)

Displaying the output

gwr.adaptive

##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2020-05-31 23:11:38 
##    Call:
##    gwr.basic(formula = GDP_CAPITA ~ E_ACTIVE_SC + IDHM_Longevidade + 
##     IDHM_Educacao + GVA_AGROPEC_SC + GVA_INDUSTRY_SC + GVA_SERVICES_SC + 
##     TTG_RATIO, data = Municipal_BC.sp, bw = bw.adaptive, kernel = "gaussian", 
##     adaptive = TRUE, longlat = FALSE)
## 
##    Dependent (y) variable:  GDP_CAPITA
##    Independent variables:  E_ACTIVE_SC IDHM_Longevidade IDHM_Educacao GVA_AGROPEC_SC GVA_INDUSTRY_SC GVA_SERVICES_SC TTG_RATIO
##    Number of data points: 5563
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10000 -0.01043 -0.00407  0.00335  0.83660 
## 
##    Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)      -0.111384   0.009934 -11.212  < 2e-16 ***
##    E_ACTIVE_SC       0.094369   0.016652   5.667 1.52e-08 ***
##    IDHM_Longevidade  0.059676   0.012432   4.800 1.63e-06 ***
##    IDHM_Educacao     0.033865   0.006189   5.472 4.65e-08 ***
##    GVA_AGROPEC_SC    0.375172   0.007159  52.403  < 2e-16 ***
##    GVA_INDUSTRY_SC   0.924359   0.010168  90.909  < 2e-16 ***
##    GVA_SERVICES_SC   0.361674   0.007052  51.289  < 2e-16 ***
##    TTG_RATIO         0.013063   0.004634   2.819  0.00483 ** 
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 0.02714 on 5555 degrees of freedom
##    Multiple R-squared: 0.8272
##    Adjusted R-squared: 0.827 
##    F-statistic:  3800 on 7 and 5555 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 4.090433
##    Sigma(hat): 0.02712116
##    AIC:  -24333.28
##    AICc:  -24333.25
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: gaussian 
##    Adaptive bandwidth: 473 (number of nearest neighbours)
##    Regression points: the same locations as observations are used.
##    Distance metric: Euclidean distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                           Min.    1st Qu.     Median    3rd Qu.    Max.
##    Intercept        -0.2190601 -0.1533961 -0.1187402 -0.0506453 -0.0256
##    E_ACTIVE_SC       0.0219149  0.0543199  0.1058008  0.1819362  0.3436
##    IDHM_Longevidade -0.0081273  0.0121306  0.0260083  0.0531487  0.0759
##    IDHM_Educacao     0.0038680  0.0149151  0.0324215  0.0483241  0.0983
##    GVA_AGROPEC_SC    0.2919661  0.3413579  0.3670434  0.3739130  0.4019
##    GVA_INDUSTRY_SC   0.8457256  0.8943751  0.9434476  1.0019872  1.0782
##    GVA_SERVICES_SC   0.2304597  0.3165152  0.3468584  0.3673190  0.4012
##    TTG_RATIO        -0.0011856  0.0084792  0.0115244  0.0157431  0.0223
##    ************************Diagnostic information*************************
##    Number of data points: 5563 
##    Effective number of parameters (2trace(S) - trace(S'S)): 71.5749 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5491.425 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -24590.57 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -24645.5 
##    Residual sum of squares: 3.843693 
##    R-square value:  0.8376618 
##    Adjusted R-square value:  0.8355456 
## 
##    ***********************************************************************
##    Program stops at: 2020-05-31 23:11:50

For the output for adaptive matrix, we observe that the adjusted r^2 value is at 0.3855456, or marginally better than fixed bandwidth. AIC is the same. We also observe a lower GWR AIC value at -24645.5 than the GR AIC at -24333.28 which indicates that the GWR model has a better fit. However we it is observed that fixed GWR has a lower AIC value at -24655.64.

Since the adjusted r square values between fixed and adaptive bandwidth gwrs or essentially on par, we will use Fixed bandwidth moving foward as it has the better AIC value.

4.3 Visualising GWR Output

4.3.1 Converting SDF into sf data.frame

The code chunk below will convert our SDF data into a sf data.frame

Municipal_BC_Fixed <- st_as_sf(gwr.fixed$SDF) %>%
  st_transform(crs=4674)

gwr.fixed.output <- as.data.frame(gwr.fixed$SDF)

Municipal_BC_Fixed <- cbind(Municipal_BC_Fixed, as.matrix(gwr.fixed.output))
Municipal_BC_Fixed

## Simple feature collection with 5563 features and 60 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -73.99045 ymin: -33.75118 xmax: -28.83594 ymax: 5.271841
## geographic CRS: SIRGAS 2000
## First 10 features:
##      Intercept   E_ACTIVE_SC IDHM_Longevidade IDHM_Educacao GVA_AGROPEC_SC
## 1   0.04238358 -1.095105e-01       0.07861155   -0.05096840      0.3241375
## 2  -0.02399894  8.716528e-05       0.06225939   -0.03157196      0.3259902
## 3   0.07584267 -1.733096e-01       0.09611157   -0.06096790      0.3257255
## 4   0.01377714 -7.553315e-02       0.08720081   -0.04963263      0.3262039
## 5   0.06912431 -1.583624e-01       0.09065771   -0.05945466      0.3248705
## 6   0.06865024 -1.641142e-01       0.09672861   -0.05930521      0.3259579
## 7   0.06285221 -1.505709e-01       0.09157375   -0.05812038      0.3252057
## 8   0.01051681 -5.851647e-02       0.06640669   -0.03355619      0.3241697
## 9   0.01451844 -8.528723e-02       0.09567998   -0.05127515      0.3272012
## 10 -0.01139459 -2.501714e-02       0.06142085   -0.02359192      0.3252393
##    GVA_INDUSTRY_SC GVA_SERVICES_SC    TTG_RATIO          y       yhat
## 1         1.543784       0.5990802 -0.019444635 0.04990125 0.04599972
## 2         1.573399       0.5482027 -0.008369576 0.05595688 0.03155272
## 3         1.321921       0.6259474 -0.020850347 0.05783450 0.05323487
## 4         1.539014       0.5780136 -0.022612264 0.06081357 0.07573962
## 5         1.412513       0.6205307 -0.022503581 0.06270891 0.07270296
## 6         1.346432       0.6183183 -0.021200888 0.04244695 0.04144850
## 7         1.422878       0.6140240 -0.022660226 0.07657810 0.07341621
## 8         1.634248       0.5671677 -0.002533554 0.02861561 0.03342865
## 9         1.505531       0.5776164 -0.023742653 0.04321665 0.05454719
## 10        1.647379       0.5398315  0.004548765 0.04026828 0.01825436
##         residual CV_Score Stud_residual Intercept_SE E_ACTIVE_SC_SE
## 1   0.0039015286        0     0.1529909   0.08744519     0.10628095
## 2   0.0244041622        0     0.9921369   0.07965353     0.09213324
## 3   0.0045996394        0     0.1837896   0.08810145     0.10666639
## 4  -0.0149260521        0    -0.5844040   0.08027562     0.09563057
## 5  -0.0099940523        0    -0.3889078   0.08864302     0.10801825
## 6   0.0009984495        0     0.0383953   0.08654187     0.10468834
## 7   0.0031618971        0     0.1253911   0.08695399     0.10580955
## 8  -0.0048130424        0    -0.1988655   0.09028950     0.10712896
## 9  -0.0113305412        0    -0.4369966   0.07931918     0.09386379
## 10  0.0220139183        0     0.8884765   0.08833568     0.10452726
##    IDHM_Longevidade_SE IDHM_Educacao_SE GVA_AGROPEC_SC_SE GVA_INDUSTRY_SC_SE
## 1            0.1110368       0.04588061        0.02652382          0.1579919
## 2            0.1119694       0.04740315        0.03257650          0.2150466
## 3            0.1020531       0.03981730        0.02201388          0.1005255
## 4            0.1058184       0.04277850        0.02465380          0.1544803
## 5            0.1055777       0.04211908        0.02334130          0.1209299
## 6            0.1016634       0.03955925        0.02197706          0.1045009
## 7            0.1046616       0.04156298        0.02311131          0.1219210
## 8            0.1202339       0.05172279        0.03458566          0.2100146
## 9            0.1032246       0.04088969        0.02331786          0.1417038
## 10           0.1209990       0.05267237        0.03912466          0.2400011
##    GVA_SERVICES_SC_SE TTG_RATIO_SE Intercept_TV E_ACTIVE_SC_TV
## 1          0.04484955   0.02876817    0.4846874  -1.0303865742
## 2          0.05455090   0.03203154   -0.3012916   0.0009460785
## 3          0.03805070   0.02730878    0.8608561  -1.6247814079
## 4          0.04243829   0.02822399    0.1716229  -0.7898431015
## 5          0.04010442   0.02771053    0.7798055  -1.4660709286
## 6          0.03797883   0.02708284    0.7932604  -1.5676451309
## 7          0.03978546   0.02749665    0.7228215  -1.4230371567
## 8          0.05564012   0.03190093    0.1164787  -0.5462245150
## 9          0.04021954   0.02750499    0.1830382  -0.9086276383
## 10         0.06256852   0.03351537   -0.1289919  -0.2393360792
##    IDHM_Longevidade_TV IDHM_Educacao_TV GVA_AGROPEC_SC_TV GVA_INDUSTRY_SC_TV
## 1            0.7079772       -1.1108919         12.220621           9.771289
## 2            0.5560394       -0.6660309         10.006914           7.316547
## 3            0.9417800       -1.5311911         14.796375          13.150111
## 4            0.8240608       -1.1602237         13.231385           9.962527
## 5            0.8586825       -1.4115848         13.918270          11.680424
## 6            0.9514596       -1.4991493         14.831733          12.884406
## 7            0.8749508       -1.3983691         14.071279          11.670487
## 8            0.5523124       -0.6487699          9.372950           7.781592
## 9            0.9269104       -1.2539872         14.032212          10.624491
## 10           0.5076143       -0.4478994          8.312898           6.864047
##    GVA_SERVICES_SC_TV TTG_RATIO_TV  Local_R2 Intercept.1 E_ACTIVE_SC.1
## 1           13.357552  -0.67590792 0.9231928  0.04238358 -1.095105e-01
## 2           10.049381  -0.26129173 0.9106030 -0.02399894  8.716528e-05
## 3           16.450355  -0.76350348 0.9299215  0.07584267 -1.733096e-01
## 4           13.620094  -0.80117180 0.9240774  0.01377714 -7.553315e-02
## 5           15.472876  -0.81209492 0.9282309  0.06912431 -1.583624e-01
## 6           16.280601  -0.78281617 0.9299548  0.06865024 -1.641142e-01
## 7           15.433375  -0.82410846 0.9283817  0.06285221 -1.505709e-01
## 8           10.193502  -0.07941943 0.9120981  0.01051681 -5.851647e-02
## 9           14.361586  -0.86321250 0.9262483  0.01451844 -8.528723e-02
## 10           8.627846   0.13572177 0.9052905 -0.01139459 -2.501714e-02
##    IDHM_Longevidade.1 IDHM_Educacao.1 GVA_AGROPEC_SC.1 GVA_INDUSTRY_SC.1
## 1          0.07861155     -0.05096840        0.3241375          1.543784
## 2          0.06225939     -0.03157196        0.3259902          1.573399
## 3          0.09611157     -0.06096790        0.3257255          1.321921
## 4          0.08720081     -0.04963263        0.3262039          1.539014
## 5          0.09065771     -0.05945466        0.3248705          1.412513
## 6          0.09672861     -0.05930521        0.3259579          1.346432
## 7          0.09157375     -0.05812038        0.3252057          1.422878
## 8          0.06640669     -0.03355619        0.3241697          1.634248
## 9          0.09567998     -0.05127515        0.3272012          1.505531
## 10         0.06142085     -0.02359192        0.3252393          1.647379
##    GVA_SERVICES_SC.1  TTG_RATIO.1        y.1     yhat.1    residual.1
## 1          0.5990802 -0.019444635 0.04990125 0.04599972  0.0039015286
## 2          0.5482027 -0.008369576 0.05595688 0.03155272  0.0244041622
## 3          0.6259474 -0.020850347 0.05783450 0.05323487  0.0045996394
## 4          0.5780136 -0.022612264 0.06081357 0.07573962 -0.0149260521
## 5          0.6205307 -0.022503581 0.06270891 0.07270296 -0.0099940523
## 6          0.6183183 -0.021200888 0.04244695 0.04144850  0.0009984495
## 7          0.6140240 -0.022660226 0.07657810 0.07341621  0.0031618971
## 8          0.5671677 -0.002533554 0.02861561 0.03342865 -0.0048130424
## 9          0.5776164 -0.023742653 0.04321665 0.05454719 -0.0113305412
## 10         0.5398315  0.004548765 0.04026828 0.01825436  0.0220139183
##    CV_Score.1 Stud_residual.1 Intercept_SE.1 E_ACTIVE_SC_SE.1
## 1           0       0.1529909     0.08744519       0.10628095
## 2           0       0.9921369     0.07965353       0.09213324
## 3           0       0.1837896     0.08810145       0.10666639
## 4           0      -0.5844040     0.08027562       0.09563057
## 5           0      -0.3889078     0.08864302       0.10801825
## 6           0       0.0383953     0.08654187       0.10468834
## 7           0       0.1253911     0.08695399       0.10580955
## 8           0      -0.1988655     0.09028950       0.10712896
## 9           0      -0.4369966     0.07931918       0.09386379
## 10          0       0.8884765     0.08833568       0.10452726
##    IDHM_Longevidade_SE.1 IDHM_Educacao_SE.1 GVA_AGROPEC_SC_SE.1
## 1              0.1110368         0.04588061          0.02652382
## 2              0.1119694         0.04740315          0.03257650
## 3              0.1020531         0.03981730          0.02201388
## 4              0.1058184         0.04277850          0.02465380
## 5              0.1055777         0.04211908          0.02334130
## 6              0.1016634         0.03955925          0.02197706
## 7              0.1046616         0.04156298          0.02311131
## 8              0.1202339         0.05172279          0.03458566
## 9              0.1032246         0.04088969          0.02331786
## 10             0.1209990         0.05267237          0.03912466
##    GVA_INDUSTRY_SC_SE.1 GVA_SERVICES_SC_SE.1 TTG_RATIO_SE.1 Intercept_TV.1
## 1             0.1579919           0.04484955     0.02876817      0.4846874
## 2             0.2150466           0.05455090     0.03203154     -0.3012916
## 3             0.1005255           0.03805070     0.02730878      0.8608561
## 4             0.1544803           0.04243829     0.02822399      0.1716229
## 5             0.1209299           0.04010442     0.02771053      0.7798055
## 6             0.1045009           0.03797883     0.02708284      0.7932604
## 7             0.1219210           0.03978546     0.02749665      0.7228215
## 8             0.2100146           0.05564012     0.03190093      0.1164787
## 9             0.1417038           0.04021954     0.02750499      0.1830382
## 10            0.2400011           0.06256852     0.03351537     -0.1289919
##    E_ACTIVE_SC_TV.1 IDHM_Longevidade_TV.1 IDHM_Educacao_TV.1
## 1     -1.0303865742             0.7079772         -1.1108919
## 2      0.0009460785             0.5560394         -0.6660309
## 3     -1.6247814079             0.9417800         -1.5311911
## 4     -0.7898431015             0.8240608         -1.1602237
## 5     -1.4660709286             0.8586825         -1.4115848
## 6     -1.5676451309             0.9514596         -1.4991493
## 7     -1.4230371567             0.8749508         -1.3983691
## 8     -0.5462245150             0.5523124         -0.6487699
## 9     -0.9086276383             0.9269104         -1.2539872
## 10    -0.2393360792             0.5076143         -0.4478994
##    GVA_AGROPEC_SC_TV.1 GVA_INDUSTRY_SC_TV.1 GVA_SERVICES_SC_TV.1 TTG_RATIO_TV.1
## 1            12.220621             9.771289            13.357552    -0.67590792
## 2            10.006914             7.316547            10.049381    -0.26129173
## 3            14.796375            13.150111            16.450355    -0.76350348
## 4            13.231385             9.962527            13.620094    -0.80117180
## 5            13.918270            11.680424            15.472876    -0.81209492
## 6            14.831733            12.884406            16.280601    -0.78281617
## 7            14.071279            11.670487            15.433375    -0.82410846
## 8             9.372950             7.781592            10.193502    -0.07941943
## 9            14.032212            10.624491            14.361586    -0.86321250
## 10            8.312898             6.864047             8.627846     0.13572177
##    Local_R2.1                       geometry
## 1   0.9231928 MULTIPOLYGON (((-62.19465 -...
## 2   0.9106030 MULTIPOLYGON (((-62.53648 -...
## 3   0.9299215 MULTIPOLYGON (((-60.37075 -...
## 4   0.9240774 MULTIPOLYGON (((-61.0008 -1...
## 5   0.9282309 MULTIPOLYGON (((-61.49976 -...
## 6   0.9299548 MULTIPOLYGON (((-60.50475 -...
## 7   0.9283817 MULTIPOLYGON (((-61.34273 -...
## 8   0.9120981 MULTIPOLYGON (((-63.71199 -...
## 9   0.9262483 MULTIPOLYGON (((-60.94827 -...
## 10  0.9052905 MULTIPOLYGON (((-65.37724 -...

4.3.2 Visualising local R2

LocalR2_Cmap <- tm_shape(Municipal_BC) +
  tm_polygons(border.col = 0.5) +
  tm_shape(Municipal_BC_Fixed) +
  tm_fill(col="Local_R2",
          alpha = 0.6,
          border.col="gray60",
          border.lwd=1,
          style="quantile")+
   tm_layout(title = "Local R Squared Distribution",
    title.size = 1,
    title.position = c("center", "top"),
    inner.margins = c(0.06, 0.10, 0.10, 0.08),
    legend.width = 0.64,
    legend.height = 0.64)


LocalR2_Pmap <- tm_shape(Municipal_BC)+
  tm_polygons() +
tm_shape(Municipal_BC_Fixed) +  
  tm_dots(col = "Local_R2",
           alpha = 0.6,
           border.col = "gray60",
           border.lwd = 1)+
  tm_layout(title = "Local R Squared Point Map",
    title.size = 1,
    title.position = c("center", "top"),
    inner.margins = c(0.06, 0.10, 0.10, 0.08),
    legend.width = 0.64,
    legend.height = 0.64)

tmap_arrange(LocalR2_Cmap, LocalR2_Pmap, ncol = 2)

With a local R2 range between 0.7 to 1 (0.749 - 0.986 to be exact), this indicates to us that at least 74.9% of the variation in our dependent variable is explained by the independent variables, up to 98.6%. Hence, it indicates to us that the local regression model fits the observed y -values well.

The areas with the highest Local R^2 values are in the north west and eastern regions of brazil. This tells us that our independent variables fit the local regression model better relative to the other regions and the relationship between our dependent and indepdent variables (GPD per capita and factors affecting it) are well captured in these regions of brazil.

The central southern region shows the lowest local R^2 values which tells us that the relationship between our dependent and independent variables are not as well captured relative to the other regions although still pretty well at 0.749.

4.3.3 Visualising yhat

tm_shape(Municipal_BC) +
  tm_polygons(border.col = 0.5) +
  tm_shape(Municipal_BC_Fixed) +
  tm_fill(col="yhat",
          alpha = 0.6,)+
   tm_layout(title = "yhat Distribution",
    title.size = 1,
    title.position = c("center", "top"),
    inner.margins = c(0.06, 0.10, 0.10, 0.08),
    legend.width = 0.64,
    legend.height = 0.64)

The values of yhat represent the predicted equation for the best fit line in regression; it is used to differentiate between the predicted data and observed data y (GDP_CAPITA).

It is observed that majority of the areas in brazil have a yhat range of 0.0 to 0.2, with a small cluster in the central and central south region displaying areas with slightly higher yhat.

4.3.4 Visualising Residual

tm_shape(Municipal_BC) +
  tm_polygons(border.col = 0.5) +
  tm_shape(Municipal_BC_Fixed) +
  tm_fill(col="residual",
          alpha = 0.6,
          midpoint = 0)+
   tm_layout(title = "Residual Distribution",
    title.size = 1,
    title.position = c("center", "top"),
    inner.margins = c(0.06, 0.10, 0.10, 0.08),
    legend.width = 0.64,
    legend.height = 0.64)

The residual values are obtained from the observed GDP_CAPITA subtracted from the predicted GDP_CAPITA. It is observed from this output that majority of the areas in brazil have a residual value of between -0.2 to 0.0 or 0.0 to 0.2. This indicates to us that the model explains the variation in our dependent variable by the independent variables well.

4.3.5 Visualising Intercept_SE

tm_shape(Municipal_BC) +
  tm_polygons() +
  tm_shape(Municipal_BC_Fixed) +
  tm_fill(col="Intercept_SE",
          alpha = 0.6,
          border.col="gray60",
          border.lwd=1)+
   tm_layout(title = "Intercept Std Error Distribution",
    title.size = 1,
    title.position = c("center", "top"),
    inner.margins = c(0.06, 0.10, 0.10, 0.08),
    legend.width = 0.64,
    legend.height = 0.64)

Intercept SE is a measures the reliability of each coefficient estimate. Low standard errors are preferred as high standard errors may signal that the model has problems where local regression coefficients are potentially collinear and thus has lower confidence in the overall estimates.

From the plot output we see that the north western regions of brazil has higher standard errors relative to the other regions (especially the central eastern and southern regions wherethe standard error is lower.) As mentioned above, a higher standard error may indicate that the local model has problems where local regression coefficients are potentially collinear.

Take-home_Ex04

Tan Yan Kiat, Jireh

5/28/2020