Take-home

1 Introduction

1.1 Brazil, an unequal country

Brazil is the fifth-largest country by area and the sixth-largest country by population. Although it has the ninth largest GDP by nominal measures, it is a highly unequal country.

A main cause of Brazil’s unequal economic growth is high transportation costs due to its large hinterland. The country Brazil may seem little known to us. We may have heard of the Amazon Rainforest which is the world’s largest tropical rainforest. It stretches from the northeastern region of Brazil to some of the southern countries in Latin America. Despite the lush rainforest, the poor soil quality prevented locals from settling there. As the transportation system was first built in the more accessible south, economic development has been more prosperous there compared to the north.

In this analysis, we will be studying the factors affecting GDP per capita using a multiple linear regression model and a geographically weighted regression for a more local analysis.

1.2 Data used

Dependent variable

The dependent variable in this analysis is Brazil GDP per capita in 2016(GDP_CAPITA). This data is retrieved from the Kaggle dataset Brazilian Cities.

Independent variable

There are 13 independent variables used in this analysis.

IDHM_Renda is the Human Development Index (HDI) GNI Index 2010.
IDHM_Longevidade is the HDI Life Expectancy Index 2010.
IDHM_Educacao is the HDI Education Index 2010.
AREA is the municipality area in square kilometer.
RURAL_URBAN is the rural or urban typology.
GVA_TOTAL is the total gross value-added (GVA) 2016.
GVA_MAIN is the activity with the higher GVA contribution 2016.
RES_POP_PROP_15-59 is the proportion of resident population who are economically active 2010.
RES_POP_PROP_ESTR is the proportion of resident population who are foreigners 2010.
DU_PROP_RURAL is the proportion of domestic units who are rural 2010.
CROPPX_PER_HEC is the crop production price per hectare 2017.
GVA_PROP_AGROPEC is the proportion of GVA contribution from Agropecuary 2016.
GVA_PROP_INDUSTRY is the proportion of GVA contribution from Industry 2016.

A more detailed explanation of the variables is done below.

1.3 Methodology

We will be building two models to understand Brazil’s GDP per capita:

Multiple linear regression (MLR)
Geographically-weighted regression (GWR)

The MLR enables us to simply understand the significant factors that contribute to our target response variable, GDP per capita. However, due to its generalization, it is unable to fit every observation as good as possible. The GWR model covers this shortfall.

The GWR is an extension of the ordinary least square approach. It embeds the geographic location of observations into the model to produce a regression model for each observation. In particular, it takes into account the geographical proximity of observations. According to Tobler’s first law of Geography, nearer things are more alike than distant things. Therefore, a larger weight is placed on the neighbouring observations in calculating the GWR model.

2 Data Import and Preparation

Data required:

Brazilian Cities (Source: Kaggle)
Brazilian Spatial Boundary (Source: geobr())

2.1 Importing required libraries

tidyverse for data cleaning
sf to import spatial data
tmap to visualise spatial data
ggplot to visualise data
corrplot to identify correlation
GWModel to build GWR model
olsrr to check linearity assumptions
spdep to calculate spatial autocorrelation statistics

Two new packages are used:

geobr to get Brazil spatial boundary
Hmisc to deal with missing values

packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse', 'geobr', 'Hmisc')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2.2 Importing aspatial data as Dataframe

We use read_delim instead of read_csv since the separator used is “;” and not comma.

Brazilian municipality

The term “city” and “municipality” refers to the same entity.

brazil_cities <- read_delim('data/aspatial/BRAZIL_CITIES.csv', delim=';')

Dimensions of dataframe:

dim(brazil_cities)

## [1] 5573   81

Data structure of dataframe columns:

glimpse(brazil_cities)

## Rows: 5,573
## Columns: 81
## $ CITY                     <chr> "Abadia De Goiás", "Abadia Dos Dourados", ...
## $ STATE                    <chr> "GO", "MG", "GO", "MG", "PA", "CE", "BA", ...
## $ CAPITAL                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ IBGE_RES_POP             <dbl> 6876, 6704, 15757, 22690, 141100, 10496, 8...
## $ IBGE_RES_POP_BRAS        <dbl> 6876, 6704, 15609, 22690, 141040, 10496, 8...
## $ IBGE_RES_POP_ESTR        <dbl> 0, 0, 148, 0, 60, 0, 0, 0, 0, 0, 0, 16, 17...
## $ IBGE_DU                  <dbl> 2137, 2328, 4655, 7694, 31061, 2791, 2572,...
## $ IBGE_DU_URBAN            <dbl> 1546, 1481, 3233, 6667, 19057, 1251, 1193,...
## $ IBGE_DU_RURAL            <dbl> 591, 847, 1422, 1027, 12004, 1540, 1379, 1...
## $ IBGE_POP                 <dbl> 5300, 4154, 10656, 18464, 82956, 4538, 372...
## $ IBGE_1                   <dbl> 69, 38, 139, 176, 1354, 98, 37, 167, 69, 1...
## $ `IBGE_1-4`               <dbl> 318, 207, 650, 856, 5567, 323, 156, 733, 3...
## $ `IBGE_5-9`               <dbl> 438, 260, 894, 1233, 7618, 421, 263, 978, ...
## $ `IBGE_10-14`             <dbl> 517, 351, 1087, 1539, 8905, 483, 277, 927,...
## $ `IBGE_15-59`             <dbl> 3542, 2709, 6896, 11979, 53516, 2631, 2319...
## $ `IBGE_60+`               <dbl> 416, 589, 990, 2681, 5996, 582, 673, 803, ...
## $ IBGE_PLANTED_AREA        <dbl> 319, 4479, 10307, 1862, 25200, 2598, 895, ...
## $ `IBGE_CROP_PRODUCTION_$` <dbl> 1843, 18017, 33085, 7502, 700872, 5234, 39...
## $ `IDHM Ranking 2010`      <dbl> 1689, 2207, 2202, 1994, 3530, 3522, 4086, ...
## $ IDHM                     <dbl> 0.708, 0.690, 0.690, 0.698, 0.628, 0.628, ...
## $ IDHM_Renda               <dbl> 0.687, 0.693, 0.671, 0.720, 0.579, 0.540, ...
## $ IDHM_Longevidade         <dbl> 0.830, 0.839, 0.841, 0.848, 0.798, 0.748, ...
## $ IDHM_Educacao            <dbl> 0.622, 0.563, 0.579, 0.556, 0.537, 0.612, ...
## $ LONG                     <dbl> -49.44055, -47.39683, -48.71881, -45.44619...
## $ LAT                      <dbl> -16.758812, -18.487565, -16.182672, -19.15...
## $ ALT                      <dbl> 893.60, 753.12, 1017.55, 644.74, 10.12, 40...
## $ PAY_TV                   <dbl> 360, 77, 227, 1230, 3389, 29, 952, 51, 55,...
## $ FIXED_PHONES             <dbl> 842, 296, 720, 1716, 1218, 34, 335, 222, 3...
## $ AREA                     <dbl> 147.26, 881.06, 1045.13, 1817.07, 1610.65,...
## $ REGIAO_TUR               <chr> NA, "Caminhos Do Cerrado", "Região Turísti...
## $ CATEGORIA_TUR            <chr> NA, "D", "C", "D", "D", NA, "D", NA, NA, "...
## $ ESTIMATED_POP            <dbl> 8583, 6972, 19614, 23223, 156292, 11663, 8...
## $ RURAL_URBAN              <chr> "Urbano", "Rural Adjacente", "Rural Adjace...
## $ GVA_AGROPEC              <dbl> 6.20, 50524.57, 42.84, 113824.60, 140463.7...
## $ GVA_INDUSTRY             <dbl> 27991.25, 25917.70, 16728.30, 31002.62, 58...
## $ GVA_SERVICES             <dbl> 74750.32, 62689.23, 138198.58, 172.33, 468...
## $ GVA_PUBLIC               <dbl> 36915.04, 28083.79, 63396.20, 86081.41, 48...
## $ ` GVA_TOTAL `            <dbl> 145857.60, 167215.28, 261161.91, 403241.27...
## $ TAXES                    <dbl> 20554.20, 12873.50, 26822.58, 26994.09, 95...
## $ GDP                      <dbl> 166.41, 180.09, 287984.49, 430235.36, 1249...
## $ POP_GDP                  <dbl> 8053, 7037, 18427, 23574, 151934, 11483, 9...
## $ GDP_CAPITA               <dbl> 20664.57, 25591.70, 15628.40, 18250.42, 82...
## $ GVA_MAIN                 <chr> "Demais serviços", "Demais serviços", "Dem...
## $ MUN_EXPENDIT             <dbl> 28227691, 17909274, 37513019, NA, NA, NA, ...
## $ COMP_TOT                 <dbl> 284, 476, 288, 621, 931, 86, 191, 87, 285,...
## $ COMP_A                   <dbl> 5, 6, 5, 18, 4, 1, 6, 2, 5, 2, 0, 8, 3, 1,...
## $ COMP_B                   <dbl> 1, 6, 9, 1, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, ...
## $ COMP_C                   <dbl> 56, 30, 26, 40, 43, 4, 8, 3, 20, 4, 9, 40,...
## $ COMP_D                   <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
## $ COMP_E                   <dbl> 2, 2, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 2, ...
## $ COMP_F                   <dbl> 29, 34, 7, 20, 27, 6, 4, 0, 10, 2, 0, 25, ...
## $ COMP_G                   <dbl> 110, 190, 117, 303, 500, 48, 97, 71, 133, ...
## $ COMP_H                   <dbl> 26, 70, 12, 62, 16, 2, 5, 0, 18, 8, 1, 67,...
## $ COMP_I                   <dbl> 4, 28, 57, 30, 31, 10, 5, 1, 14, 3, 0, 25,...
## $ COMP_J                   <dbl> 5, 11, 2, 9, 6, 2, 3, 1, 8, 1, 1, 9, 5, 14...
## $ COMP_K                   <dbl> 0, 0, 1, 6, 1, 0, 1, 0, 0, 1, 0, 4, 3, 3, ...
## $ COMP_L                   <dbl> 2, 4, 0, 4, 1, 0, 0, 0, 4, 0, 0, 7, 4, 4, ...
## $ COMP_M                   <dbl> 10, 15, 7, 28, 22, 2, 5, 0, 11, 4, 2, 26, ...
## $ COMP_N                   <dbl> 12, 29, 15, 27, 16, 3, 5, 1, 26, 0, 1, 16,...
## $ COMP_O                   <dbl> 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 6, 2, 4, 2, ...
## $ COMP_P                   <dbl> 6, 9, 11, 15, 155, 0, 8, 0, 8, 1, 6, 14, 1...
## $ COMP_Q                   <dbl> 6, 14, 5, 19, 33, 2, 1, 2, 9, 3, 0, 13, 22...
## $ COMP_R                   <dbl> 1, 6, 1, 9, 15, 0, 2, 0, 4, 0, 0, 4, 6, 6,...
## $ COMP_S                   <dbl> 5, 19, 8, 27, 56, 4, 38, 4, 12, 3, 4, 23, ...
## $ COMP_T                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COMP_U                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ HOTELS                   <dbl> NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, ...
## $ BEDS                     <dbl> NA, NA, 34, NA, NA, NA, 24, NA, NA, NA, NA...
## $ Pr_Agencies              <dbl> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0,...
## $ Pu_Agencies              <dbl> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1,...
## $ Pr_Bank                  <dbl> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0,...
## $ Pu_Bank                  <dbl> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1,...
## $ Pr_Assets                <dbl> NA, NA, 33724584, 44974716, 76181384, NA, ...
## $ Pu_Assets                <dbl> NA, NA, 67091904, 371922572, 800078483, NA...
## $ Cars                     <dbl> 2158, 2227, 2838, 6928, 5277, 553, 896, 61...
## $ Motorcycles              <dbl> 1246, 1142, 1426, 2953, 25661, 1674, 696, ...
## $ Wheeled_tractor          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, ...
## $ UBER                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ MAC                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `WAL-MART`               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ POST_OFFICES             <dbl> 1, 1, 3, 4, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, ...

We can tell that there are many missing values, especially in the last few columns. Therefore in our data wrangling process, we will need to deal with these variables.

Brazil spatial boundary

The geobr() package provides spatial datasets for Brazil at different geographical level. Here, we are interested in the municipality level data so we will use read_municipality().

muni <- read_municipality(code_muni = 'all', year = 2016)

## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%

Class type:

class(muni)

## [1] "sf"         "data.frame"

Dimensions of dataframe:

dim(muni)

## [1] 5572    5

2.3 Data wrangling

Steps:

Convert name of cities to uppercase
Assign CRS=5614 to municipalities sf dataframe
Join two dataframes together
Select variables
Transform variables
Deal with NA values
Convert dataframe to sf object

2.3.1 Convert name of cities to uppercase

brazil_cities <- brazil_cities %>%
  mutate_at(.vars = vars(CITY), .funs = toupper) 
muni_clean <- muni %>%
  mutate_at(.vars = vars(name_muni), .funs = funs(toupper))

2.3.2 Assign CRS=5614 to municipalities sf dataframe

muni_clean <- st_transform(muni_clean, 5641)

2.3.3 Join two dataframes together

Since some municipalities have the same name (e.g. CACHOEIRINHA), we join by two columns - city name and state.

brazil_all <- inner_join(brazil_cities, muni_clean, by = c('CITY' = 'name_muni', 'STATE' = 'abbrev_state'))

class(brazil_all)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

2.3.4 Select variables

For the purpose of this study, we need not take into consideration the year of the data source (i.e. we do not need to remove variables that were after 2016 although we are predicting GDP per capita at 2016). However, some variables do not contribute to explaining factors affecting GDP per capita.

Variables to remove and corresponding reason:

Descriptive data about the city: CITY, STATE, CAPITAL, code_state, code_muni
Geographical data: LONG, LAT, ALT
Variables that are a result of economic activity rather than a cause of it: PAY_TV, FIXED_PHONES, HOTELS, BEDS, Pr_Agencies, Pu_Agencies, Pr_Bank, Pu_Bank, Pr_Assets, Pu_Assets, Cars, Motorcycles, Wheeled_tractor, UBER, MAC, WAL-MART, POST_OFFICES
Use gross value-added and not the number of companies: COMP_A, COMP_B, COMP_C, COMP_D, COMP_E, COMP_F, COMP_G,COMP_H, COMP_I, COMP_J,COMP_K, COMP_L, COMP_M, COMP_N, COMP_O, COMP_P, COMP_Q, COMP_R, COMP_S, COMP_T, COMP_U, COMP_TOT
Using actual population instead of an estimate: ESTIMATED_POP
Economically inactive population: IBGE_1, IBGE_1-4, IBGE_5-9, IBGE_10-14, IBGE_60+
Components of GDP: TAXES, GDP, POP_GDP

brazil <- brazil_all %>% 
  subset(select=-c(`CITY`, `STATE`, `CAPITAL`, `code_state`, `code_muni`, 
                   `LONG`, `LAT`, `ALT`, `PAY_TV`, `FIXED_PHONES`,
                   `HOTELS`, `BEDS`, `Pr_Agencies`, `Pu_Agencies`, `Pr_Bank`, `Pu_Bank`,
                   `Pr_Assets`, `Pu_Assets`, `Cars`, `Motorcycles`, `Wheeled_tractor`, `UBER`,
                   `MAC`, `WAL-MART`, `POST_OFFICES`, `COMP_A`, `COMP_B`, `COMP_C`, `COMP_D`,
                   `COMP_E`, `COMP_F`, `COMP_G`,`COMP_H`, `COMP_I`, `COMP_J`,`COMP_K`, `COMP_L`,
                   `COMP_M`, `COMP_N`, `COMP_O`, `COMP_P`, `COMP_Q`, `COMP_R`, `COMP_S`, `COMP_T`,
                   `COMP_U`, `COMP_TOT`, `ESTIMATED_POP`, `IBGE_1`, `IBGE_1-4`, `IBGE_5-9`,
                   `IBGE_10-14`, `IBGE_60+`, `TAXES`, `GDP`, `POP_GDP`))

2.3.5 Transform variables

Some variables cannot be used as it is since we want a comparison between the states. We can transform them to make it useful.

New variables to create:

Demographic

RES_POP_PROP_15.59 (proportion of resident population who are 15-59 years old): this is the closest age group to the economically active population. Since only the economically active population contribute to the economic activities of a population, we can find the proportion of economically active population out of the whole population (i.e. IBGE_15-59 / IBGE_POP)
RES_POP_PROP_ESTR (proportion of resident population who are Brazilians): in the early stage of development, foreign immigrants were an important source of productivity growth. Since different municipalities have different immigration policies, we can find the percentage of resident population who are foreigners (i.e. IBGE_RES_POP_ESTR / IBGE_RES_POP).
Remove IBGE_RES_POP_BRAS since its proportion is indirectly calculated by 1-RES_POP_PROP_ESTR.
DU_PROP_RURAL (proportion of domestic units contribution by rural): domestic units are the number of households. Since rural and urban families can differ in size, we can find the percentage of domestic units who are rural (i.e. IBGE_DU_RURAL / IBGE_DU)
Remove IBGE_DU_URBAN since its proportion is indirectly calculated by 1-IBGE_DU_RURAL.

Economic activity

CROPPX_PER_HEC (crop price per hectare of planted area): we can find the productivity of agriculture by IBGE_CROP_PRODUCTION_$ / IBGE_PLANTED_AREA.
GVA_PROP_AGROPEC (proportion of Gross Added Value contribution by Agropecuary): in the early stages of early development, export of primary commodities was a key contribution to growth. Instead of using the absolute size of agropecuary which varies by size of municipality, we can find the percentage contribution of agropecuary to GAV (i.e. GVA_AGROPEC / GVA_TOTAL).
GVA_PROP_INDUSTRY (proportion of GAV contribution by Industry): in the second stage of growth after agriculture, industrialisation was a key contribution to growth. The same rationale holds for finding the percentage contribution (i.e. GVA_INDUSTRY / GVA_TOTAL).
Remove GVA_SERVICES and GVA_PUBLIC since their proportion are indirectly calculated by 1-GVA_PROP_AGROPEC-GVA_PROP_INDUSTRY.

brazil <- brazil %>% 
  mutate(`RES_POP_PROP_15.59` = `IBGE_15-59` / `IBGE_POP`) %>%
  mutate(`RES_POP_PROP_ESTR` = `IBGE_RES_POP_ESTR` / `IBGE_RES_POP`) %>% 
  mutate(`DU_PROP_RURAL` = `IBGE_DU_RURAL` / `IBGE_DU`) %>%
  mutate(`CROPPX_PER_HEC` = `IBGE_CROP_PRODUCTION_$` / `IBGE_PLANTED_AREA`) %>%
  rename(`GVA_TOTAL` = ` GVA_TOTAL `) %>%
  mutate(`GVA_PROP_AGROPEC` = `GVA_AGROPEC` / `GVA_TOTAL`) %>%
  mutate(`GVA_PROP_INDUSTRY` = `GVA_INDUSTRY` / `GVA_TOTAL`) %>%
      subset(select=-c(`IBGE_15-59`, `IBGE_POP`, `IBGE_RES_POP_BRAS`, `IBGE_RES_POP_ESTR`,
                       `IBGE_RES_POP`, `IBGE_DU_RURAL`, `IBGE_DU_URBAN`, `IBGE_DU`,
                       `GVA_AGROPEC`, `GVA_INDUSTRY`, `GVA_SERVICES`, `GVA_PUBLIC`, 
                       `IBGE_CROP_PRODUCTION_$`, `IBGE_PLANTED_AREA`))

Rename column values for easier understanding:

GVA_MAIN: 10 classes in long Portuguese phrases. Convert to shorter English phrase.
RURAL_URBAN: 6 classes in Portuguese. Convert to English phrase.

Column values for GVA_MAIN:

table(brazil$GVA_MAIN)

## 
##                 Administração, defesa, educação e saúde públicas e seguridade social 
##                                                                                 2725 
##                          Agricultura, inclusive apoio à agricultura e a pós colheita 
##                                                                                  735 
##                          Comércio e reparação de veículos automotores e motocicletas 
##                                                                                   46 
##                                                                           Construção 
##                                                                                    7 
##                                                                      Demais serviços 
##                                                                                 1477 
## Eletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação 
##                                                                                   98 
##                                                          Indústrias de transformação 
##                                                                                  261 
##                                                                Indústrias extrativas 
##                                                                                   35 
##                                                 Pecuária, inclusive apoio à pecuária 
##                                                                                  161 
##                                              Produção florestal, pesca e aquicultura 
##                                                                                   25

Column values for RURAL_URBAN:

table(brazil$RURAL_URBAN)

## 
## Intermediário Adjacente    Intermediário Remoto         Rural Adjacente 
##                     686                      60                    3040 
##            Rural Remoto       Sem classificação                  Urbano 
##                     323                       5                    1456

brazil <- brazil %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Administração, defesa, educação e saúde públicas e seguridade social", "Public services")) %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Agricultura, inclusive apoio à agricultura e a pós colheita", "Agriculture")) %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Comércio e reparação de veículos automotores e motocicletas", "Automobile services")) %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Construção", "Construction")) %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Demais serviços", "Other services")) %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Eletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação", "Energy services")) %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Indústrias de transformação", "Manufacturing industries")) %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Indústrias extrativas", "Extractive industries")) %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Pecuária, inclusive apoio à pecuária", "Livestock")) %>%
  mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Produção florestal, pesca e aquicultura", "Fisheries")) %>%
  mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Intermediário Adjacente", "Adjacent intermediate")) %>%
  mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Intermediário Remoto", "Remote intermediate")) %>%
  mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Rural Adjacente", "Adjacent rural")) %>%
  mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Rural Remoto", "Remote rural")) %>%
  mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Sem classificação", "Without classification")) %>%
  mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Urbano", "Urban"))

Check dimensions of dataframe:

dim(brazil)

## [1] 5571   20

2.3.6 Deal with NA values

Dealing with missing values for GDP per capita

Since GDP per capita is our dependent variable of interest, we cannot have missing values.

sum(is.na(brazil$GDP_CAPITA))

## [1] 1

We will be dropping one row of observation.

brazil <- brazil %>%
  filter(!is.na(`GDP_CAPITA`))

Confirm the number of missing values:

sum(is.na(brazil$GDP_CAPITA))

## [1] 0

Dealing with missing values for other columns

The lm() function has an na.action which deals with missing values. Since the default action is na.omit, we may end up missing many observations. Therefore, we will approach it in a more systematic way.

Steps:

For each column, count the number of rows with missing values
If the number of observations with missing values is large, we can delete these variables.
If the number of observations with missing values is small, we can use data imputation techniques (e.g. median) to replace these missing values.

Counting the number of rows with missing values:

sapply(brazil, function(x) sum(is.na(x)))

##  IDHM Ranking 2010               IDHM         IDHM_Renda   IDHM_Longevidade 
##                  6                  6                  6                  6 
##      IDHM_Educacao               AREA         REGIAO_TUR      CATEGORIA_TUR 
##                  6                  1               2286               2286 
##        RURAL_URBAN          GVA_TOTAL         GDP_CAPITA           GVA_MAIN 
##                  0                  0                  0                  0 
##       MUN_EXPENDIT               geom RES_POP_PROP_15.59  RES_POP_PROP_ESTR 
##               1491                  0                  5                  5 
##      DU_PROP_RURAL     CROPPX_PER_HEC   GVA_PROP_AGROPEC  GVA_PROP_INDUSTRY 
##                 78                 67                  0                  0

Some 27%-41% of the observations in MUN_EXPENDIT, REGIAO_TUR and CATEGORIA_TUR columns are missing. Therefore, we will remove these variables.

brazil <- brazil %>%
  subset(select=-c(`MUN_EXPENDIT`, `REGIAO_TUR`, `CATEGORIA_TUR`))

sapply(brazil, function(x) sum(is.na(x)))

##  IDHM Ranking 2010               IDHM         IDHM_Renda   IDHM_Longevidade 
##                  6                  6                  6                  6 
##      IDHM_Educacao               AREA        RURAL_URBAN          GVA_TOTAL 
##                  6                  1                  0                  0 
##         GDP_CAPITA           GVA_MAIN               geom RES_POP_PROP_15.59 
##                  0                  0                  0                  5 
##  RES_POP_PROP_ESTR      DU_PROP_RURAL     CROPPX_PER_HEC   GVA_PROP_AGROPEC 
##                  5                 78                 67                  0 
##  GVA_PROP_INDUSTRY 
##                  0

Use impute() function from Hmisc package to impute the mean/median/mode value of non-missing observations into the observations with missing values. The default impute is median. Since some of our variables are highly skewed, it is good to use the median over the mean.

Boxplots showing distribution of data with missing values:

p1 <- ggplot(brazil, aes(y=`IDHM Ranking 2010`)) + 
  geom_boxplot() + coord_flip() 
p2 <- ggplot(brazil, aes(y=`IDHM`)) + 
  geom_boxplot() + coord_flip() 
p3 <- ggplot(brazil, aes(y=`IDHM_Renda`)) + 
  geom_boxplot() + coord_flip() 
p4 <- ggplot(brazil, aes(y=`IDHM_Longevidade`)) + 
  geom_boxplot() + coord_flip() 
p5 <- ggplot(brazil, aes(y=`IDHM_Educacao`)) + 
  geom_boxplot() + coord_flip() 
p6 <- ggplot(brazil, aes(y=`AREA`)) + 
  geom_boxplot() + coord_flip() 
p7 <- ggplot(brazil, aes(y=`RES_POP_PROP_15.59`)) + 
  geom_boxplot() + coord_flip() 
p8 <- ggplot(brazil, aes(y=`RES_POP_PROP_ESTR`)) + 
  geom_boxplot() + coord_flip() 
p9 <- ggplot(brazil, aes(y=`DU_PROP_RURAL`)) + 
  geom_boxplot() + coord_flip() 
p10 <- ggplot(brazil, aes(y=`CROPPX_PER_HEC`)) + 
  geom_boxplot() + coord_flip() 
ggarrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10,
          ncol = 5, nrow = 2)

Impute doubles type columns:

brazil[, !names(brazil) %in% c("RURAL_URBAN","GVA_TOTAL","GDP_CAPITA","GVA_MAIN","geom",
                               "GVA_PROP_AGROPEC", "GVA_PROP_INDUSTRY")] <- impute(brazil[, !names(brazil) %in% c("RURAL_URBAN","GVA_TOTAL","GDP_CAPITA","GVA_MAIN","geom",
                               "GVA_PROP_AGROPEC", "GVA_PROP_INDUSTRY")])

Check for missing values:

sapply(brazil, function(x) sum(is.na(x)))

##  IDHM Ranking 2010               IDHM         IDHM_Renda   IDHM_Longevidade 
##                  0                  0                  0                  0 
##      IDHM_Educacao               AREA        RURAL_URBAN          GVA_TOTAL 
##                  0                  0                  0                  0 
##         GDP_CAPITA           GVA_MAIN               geom RES_POP_PROP_15.59 
##                  0                  0                  0                  0 
##  RES_POP_PROP_ESTR      DU_PROP_RURAL     CROPPX_PER_HEC   GVA_PROP_AGROPEC 
##                  0                  0                  0                  0 
##  GVA_PROP_INDUSTRY 
##                  0

No more missing values!

2.3.7 Convert dataframe to sf object

Check class type:

class(brazil)

## [1] "tbl_df"     "tbl"        "data.frame"

Convert to sf object:

brazil <- st_as_sf(brazil)

Check class type again:

class(brazil)

## [1] "sf"         "tbl_df"     "tbl"        "data.frame"

3 Exploratory data analysis

3.1 EDA using statistical graphics

Since GDP per capita is our dependent variable of interest, we will plota histogram and boxplot to understand its distribution.

p1 <- ggplot(data=brazil, aes(x= GDP_CAPITA)) +
  geom_histogram(bins=20, color="black", fill="light blue")
p2 <- ggplot(brazil, aes(y=GDP_CAPITA)) +
  geom_boxplot() +
  coord_flip()
ggarrange(p1, p2,
          ncol = 2, nrow = 1)

describe(brazil$GDP_CAPITA)

## brazil$GDP_CAPITA 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     5570        0     5567        1    21126    16863     6330     6967 
##      .25      .50      .75      .90      .95 
##     9058    15870    26155    39681    50515 
## 
## lowest :   3190.57   4282.65   4530.24   4533.95   4585.72
## highest: 274572.12 289932.05 296459.35 306138.63 314637.69

GDP_CAPITA is highly skewed, with the median at R$15,870 and the municipal with highest GDP per capita at R$314,638.

3.2 EDA using choropleth map

The choropleth map helps us to understand the geographical distribution of our variable of interest, GDP per capita.

3.2.1 Mapping GDP per capita

We use the quantile breaks, and n=4 to split the data into groups of bottom 25%, 25%-50%, 50%-75% and top 75%.

tm_shape(brazil)+
  tm_polygons() +
tm_shape(brazil) +  
  tm_fill(col = "GDP_CAPITA",
          alpha = 0.6,
          n = 4,
          style="quantile",
          title = "GDP per capita")

From the choropleth map, we can see that the the richer municipalities are located in center and southern Brazil. The bottom 25% municipalities are mostly located in northeastern and northern Brazil.

This is similar to how the transporation networks are more densely connected in the south, enabling trade and thus economic growth. Understandably, the northern area where the Amazon Rainforest is located, economic growth and hence GDP per capita is low in that area. On the other hand, the northeastern region’s economy is mainly contributed by agriculture. The constant droughts has brought losses to producers, which has caused over 40% of its population to be trapped in poverty.

3.3 Correlation Analysis

Correlation analysis helps us to identify highly correlated variables. These variables cause the problem of multicollinearity and will affect our modelling.

We will use corrplot.mixed() to identify variables with >=0.85 correlation coefficient.We will retain the variables that are least correlated with other variables. In addition, we will remove the non-numeric columns since correlation can only be done for numeric columns.

to_corr = brazil[, !names(brazil) %in% c("RURAL_URBAN","GVA_MAIN","geom")] %>%
  st_set_geometry(NULL)
corrplot.mixed(cor(to_corr),
         lower = "ellipse", upper = "number",
         tl.pos = "lt", diag = "l", 
         tl.col = "black", number.cex=0.75)

IDHM Ranking 2010 is highly correlated with 3 variables:

IDHM (-0.99)
IDHM_Renda (-0.94)
IDHM_Educacao (-0.94)

IDHM is highly correlated with 4 variables:

IDHM Ranking 2010 (-0.99)
IDHM_Renda (0.95)
IDHM_Longevidade (0.85)
IDHM_Educacao (0.95)

Therefore, between IDHM Ranking 2010 and IDHM, we will retain IDHM Ranking 2010.

Between IDHM Ranking 2010 and IDHM_Renda, we retain IDHM_Renda which is not highly correlated with any other variables.
The remaining variables are not highly correlated with each other.

Checking correlation plot again:

to_corr = brazil[, !names(brazil) %in% c("RURAL_URBAN","GVA_MAIN","geom", "IDHM" ,
                                         "IDHM Ranking 2010")] %>%
  st_set_geometry(NULL)
corrplot.mixed(cor(to_corr),
         lower = "ellipse", upper = "number",
         tl.pos = "lt", diag = "l", 
         tl.col = "black", number.cex=0.75)

Remove two highly correlated variables from the brazil dataframe:

brazil <- brazil %>%
  subset(select=-c(`IDHM`, `IDHM Ranking 2010`))

Dataframe dimensions:

dim(brazil)

## [1] 5570   15

15 variables left!

4 Multiple linear regression model of GDP per capita

We want to know if GDP per capita can be explained by a muliple linear regression (MLR) model. The hypothesis is:

H0: There is no relationship between GDP per capita and the independent variables in our model.
H1: There is a relationship between GDP per capita and the independent variables in our model.

Confidence interval is 95%.

4.1 Building a multiple linear regression model

4.1.1 Modelling with all variables

mlr <- lm(`GDP_CAPITA` ~ `IDHM_Renda` + `IDHM_Longevidade` + `IDHM_Educacao` + `AREA` +
            `RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `RES_POP_PROP_ESTR` +
            `DU_PROP_RURAL` + `CROPPX_PER_HEC` + `GVA_PROP_AGROPEC` + `GVA_PROP_INDUSTRY`, 
          data=brazil)
summary(mlr)

## 
## Call:
## lm(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade + IDHM_Educacao + 
##     AREA + RURAL_URBAN + GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 + 
##     RES_POP_PROP_ESTR + DU_PROP_RURAL + CROPPX_PER_HEC + GVA_PROP_AGROPEC + 
##     GVA_PROP_INDUSTRY, data = brazil)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52931  -5069   -836   2663 252743 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -4.392e+04  6.633e+03  -6.621 3.90e-11 ***
## IDHM_Renda                         8.430e+04  6.306e+03  13.368  < 2e-16 ***
## IDHM_Longevidade                  -1.762e+04  8.226e+03  -2.141  0.03228 *  
## IDHM_Educacao                      9.406e+03  4.038e+03   2.330  0.01986 *  
## AREA                               6.210e-02  3.943e-02   1.575  0.11537    
## RURAL_URBANAdjacent rural          1.609e+03  6.855e+02   2.347  0.01898 *  
## RURAL_URBANRemote intermediate     5.566e+03  2.130e+03   2.613  0.00899 ** 
## RURAL_URBANRemote rural            6.457e+03  1.069e+03   6.042 1.63e-09 ***
## RURAL_URBANUrban                   2.062e+03  7.237e+02   2.849  0.00440 ** 
## RURAL_URBANWithout classification -2.784e+04  2.611e+04  -1.066  0.28631    
## GVA_TOTAL                          1.274e-04  2.177e-05   5.853 5.11e-09 ***
## GVA_MAINAutomobile services        1.989e+04  2.279e+03   8.728  < 2e-16 ***
## GVA_MAINConstruction              -2.700e+03  5.653e+03  -0.478  0.63290    
## GVA_MAINEnergy services            2.356e+04  1.627e+03  14.482  < 2e-16 ***
## GVA_MAINExtractive industries      1.973e+04  2.592e+03   7.612 3.14e-14 ***
## GVA_MAINFisheries                 -1.029e+03  3.038e+03  -0.339  0.73483    
## GVA_MAINLivestock                 -9.312e+03  1.300e+03  -7.160 9.09e-13 ***
## GVA_MAINManufacturing industries   1.443e+04  1.134e+03  12.722  < 2e-16 ***
## GVA_MAINOther services            -1.008e+04  7.592e+02 -13.275  < 2e-16 ***
## GVA_MAINPublic services           -1.340e+04  7.088e+02 -18.908  < 2e-16 ***
## RES_POP_PROP_15.59                 3.932e+04  9.975e+03   3.942 8.17e-05 ***
## RES_POP_PROP_ESTR                  4.127e+04  3.716e+04   1.110  0.26687    
## DU_PROP_RURAL                      2.454e+03  1.295e+03   1.895  0.05817 .  
## CROPPX_PER_HEC                    -7.233e+01  3.307e+01  -2.187  0.02878 *  
## GVA_PROP_AGROPEC                  -1.675e+00  2.269e+00  -0.738  0.46033    
## GVA_PROP_INDUSTRY                  2.806e+01  3.174e+00   8.840  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14850 on 5544 degrees of freedom
## Multiple R-squared:  0.4687, Adjusted R-squared:  0.4663 
## F-statistic: 195.6 on 25 and 5544 DF,  p-value: < 2.2e-16

4.1.2 Modelling with statistically significant variables

We look at the variables that are statistically significant at the 95% confidence interval (***, **, or *). Remove variables that are not statistically significant at the 95% confidence interval and build MLR model.

mlr2 <- lm(`GDP_CAPITA` ~ `IDHM_Renda` + `IDHM_Longevidade` + `RURAL_URBAN` + `GVA_TOTAL` + 
             `GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` + `GVA_PROP_INDUSTRY`, 
          data=brazil)
summary(mlr2)

## 
## Call:
## lm(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade + RURAL_URBAN + 
##     GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 + CROPPX_PER_HEC + 
##     GVA_PROP_INDUSTRY, data = brazil)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54161  -5188   -874   2693 252508 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -4.387e+04  6.552e+03  -6.696 2.35e-11 ***
## IDHM_Renda                         8.879e+04  5.597e+03  15.864  < 2e-16 ***
## IDHM_Longevidade                  -1.615e+04  8.199e+03  -1.970  0.04886 *  
## RURAL_URBANAdjacent rural          1.964e+03  6.458e+02   3.042  0.00236 ** 
## RURAL_URBANRemote intermediate     6.620e+03  2.012e+03   3.289  0.00101 ** 
## RURAL_URBANRemote rural            6.905e+03  1.038e+03   6.651 3.18e-11 ***
## RURAL_URBANUrban                   2.191e+03  7.196e+02   3.044  0.00234 ** 
## RURAL_URBANWithout classification  2.168e+03  6.785e+03   0.319  0.74940    
## GVA_TOTAL                          1.303e-04  2.175e-05   5.992 2.20e-09 ***
## GVA_MAINAutomobile services        1.988e+04  2.277e+03   8.730  < 2e-16 ***
## GVA_MAINConstruction              -2.393e+03  5.651e+03  -0.423  0.67197    
## GVA_MAINEnergy services            2.358e+04  1.619e+03  14.565  < 2e-16 ***
## GVA_MAINExtractive industries      1.996e+04  2.586e+03   7.717 1.40e-14 ***
## GVA_MAINFisheries                 -1.178e+03  3.034e+03  -0.388  0.69777    
## GVA_MAINLivestock                 -9.242e+03  1.298e+03  -7.121 1.21e-12 ***
## GVA_MAINManufacturing industries   1.446e+04  1.121e+03  12.897  < 2e-16 ***
## GVA_MAINOther services            -1.003e+04  7.448e+02 -13.465  < 2e-16 ***
## GVA_MAINPublic services           -1.344e+04  7.033e+02 -19.111  < 2e-16 ***
## RES_POP_PROP_15.59                 4.227e+04  9.794e+03   4.315 1.62e-05 ***
## CROPPX_PER_HEC                    -7.329e+01  3.299e+01  -2.222  0.02633 *  
## GVA_PROP_INDUSTRY                  2.728e+01  3.034e+00   8.990  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14860 on 5549 degrees of freedom
## Multiple R-squared:  0.4676, Adjusted R-squared:  0.4657 
## F-statistic: 243.7 on 20 and 5549 DF,  p-value: < 2.2e-16

Multiple linear regression model

GDP_CAPITA = - 43870 + 88790(IDHM_Renda) - 16150(IDHM_Longevidade) + 1964(RURAL_URBANAdjacent rural) + 6620(RURAL_URBANRemote intermediate) + 6905(RURAL_URBANRemote rural) + 2191(RURAL_URBANUrban) + 2168(RURAL_URBANWithout classification) + 13030(GVA_TOTAL) + 19880(GVA_MAINAutomobile services) - 2393(GVA_MAINConstruction) + 23580(GVA_MAINEnergy services) + 19960(GVA_MAINExtractive industries) - 1178(GVA_MAINFisheries) - 9242(GVA_MAINLivestock) + 14460(GVA_MAINManufacturing industries) - 10040(GVA_MAINOther services) - 13440(GVA_MAINPublic services) + 42270(RES_POP_PROP_15.59) - 73.29(CROPPX_PER_HEC) + 27.28(GVA_PROP_INDUSTRY)

Model interpretation

The MLR is interpreted as such:

For one unit increase in IDHM_Renda, GDP_CAPITA increases by 88790 units (while other variables remain constant).
For one unit increase in RURAL_URBANRemote rural, GDP_CAPITA increases by 6905 units.
For one unit increase in GVA_TOTAL, GDP_CAPITA decreases by 13030 units.
For one unit increase in GVA_MAINAutomobile services, GDP_CAPITA increases by 19880 units.

In addition, the most significant variables at the 95% confidence interval affecting GDP per capita are IDHM_Renda, IDHM_Longevidade, RURAL_URBANAdjacent rural, RURAL_URBANRemote intermediate, RURAL_URBANRemote rural, RURAL_URBANUrban, GVA_TOTAL, GVA_MAINAutomobile services, GVA_MAINEnergy services, GVA_MAINExtractive industries, GVA_MAINLivestock, GVA_MAINManufacturing industries, GVA_MAINOther services, GVA_MAINPublic services, RES_POP_PROP_15.59, CROPPX_PER_HEC, GVA_PROP_INDUSTRY.

Model evaluation

Since the Adjusted R-squared = 0.4657, this MLR model can explain about 46.57% of variation in GDP per capita.

Since F-statistic = 243.7 >> 1, we can reject the null hypothesis that there is no relationship between our response variable GDP per capita and the independent variables in our model.

4.2 Regression assumption checking

After modelling, we can check if our linear regression assumptions holds. If they do not, a transformed model may be used to explain the relationship between GDP per capita and the independent variables better.

The linear regression assumptions:

Linear relationship
Multivariate normality
No or little multicollinearity
No autocorrelation - for spatial analysis, we can test for spatial autocorrelation

It is important to remember that we are using the residuals of the model to check against our assumptions and not the actual values. Luckily, the oslrr package takes care of this for us.

4.2.1 Linearity assumption

ols_plot_resid_fit(mlr2)

Since most of the data poitns are scattered around the 0 line, we can safely conclude that the relationships between the dependent variable and independent variables are linear.

4.2.2 Normality assumption

ols_plot_resid_hist(mlr2)

From the above plot, we can tell that the residuals of the multiple linear regression model resemble normal distribution. We can safely conclude that the residuals have a normal distribution.

4.2.3 No multicollinearity assumption

ols_vif_tol(mlr2)

##                            Variables Tolerance      VIF
## 1                         IDHM_Renda 0.1947953 5.133595
## 2                   IDHM_Longevidade 0.2933923 3.408405
## 3          RURAL_URBANAdjacent rural 0.3835697 2.607088
## 4     RURAL_URBANRemote intermediate 0.9188681 1.088295
## 5            RURAL_URBANRemote rural 0.6735732 1.484620
## 6                   RURAL_URBANUrban 0.3966336 2.521218
## 7  RURAL_URBANWithout classification 0.9603577 1.041279
## 8                          GVA_TOTAL 0.9706129 1.030277
## 9        GVA_MAINAutomobile services 0.9335483 1.071182
## 10              GVA_MAINConstruction 0.9892562 1.010860
## 11           GVA_MAINEnergy services 0.8755305 1.142165
## 12     GVA_MAINExtractive industries 0.9495395 1.053142
## 13                 GVA_MAINFisheries 0.9642609 1.037064
## 14                 GVA_MAINLivestock 0.8386443 1.192401
## 15  GVA_MAINManufacturing industries 0.7063135 1.415802
## 16            GVA_MAINOther services 0.3668958 2.725569
## 17           GVA_MAINPublic services 0.3207995 3.117211
## 18                RES_POP_PROP_15.59 0.4010996 2.493146
## 19                    CROPPX_PER_HEC 0.9347895 1.069760
## 20                 GVA_PROP_INDUSTRY 0.9484295 1.054375

Since the VIF of the independent variables are less than 10. We can safely conclude that there are no sign of multicollinearity among the independent variables.

4.2.4 Spatial autocorrelation assumption

Combine residuals with sf dataframe:

brazil.res.sf <- cbind(brazil, mlr2$residuals) %>% 
  rename(`MLR_RES` = `mlr2.residuals`)

Convert to SpatialPointsDataframe:

brazil.sp <- as_Spatial(brazil.res.sf)
brazil.sp

## class       : SpatialPolygonsDataFrame 
## features    : 5570 
## extent      : 1552246, 6575781, 6030702, 10583412  (xmin, xmax, ymin, ymax)
## crs         : +proj=merc +lon_0=-43 +lat_ts=-2 +x_0=5000000 +y_0=10000000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## variables   : 15
## names       : IDHM_Renda, IDHM_Longevidade, IDHM_Educacao,      AREA,            RURAL_URBAN,    GVA_TOTAL, GDP_CAPITA,        GVA_MAIN, RES_POP_PROP_15.59, RES_POP_PROP_ESTR,        DU_PROP_RURAL,   CROPPX_PER_HEC, GVA_PROP_AGROPEC,    GVA_PROP_INDUSTRY,           MLR_RES 
## min values  :        0.4,            0.672,         0.207,     0.679,  Adjacent intermediate,        17.48,    3190.57,     Agriculture,  0.471631205673759,                 0, 0.000129388931365053,                0,                0, 1.42261767428109e-05, -54161.1593263509 
## max values  :      0.891,            0.894,         0.825, 159533.33, Without classification, 569910502.87,  314637.69, Public services,  0.744760130414532,             0.679,    0.954473822447908, 106.960227272727, 760.885597874933,     874.802651309903,  252507.656201713

tm_shape(brazil.sp)+
  tm_polygons() +
tm_shape(brazil.sp) +  
  tm_fill(col = "MLR_RES",
          alpha = 0.6,
          n = 4,
          style="quantile")

The figure above reveal that the residuals display signs of spatial autocorrelation. Positive spatial autocorrelation, or clustering, means that neighbours are more similar to each other rather than randomly distributed.

To proof that our observation is indeed true, the Moran’s I test will be performed.

Moran’s I test

Calculate the distance between polygons and its neighbours:

longlat=FALSE since coordinates is taken from a SpatialPoints object and the value is taken from it (source: rdocumentation)

coords <- coordinates(brazil.sp)
k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1, coords, longlat = FALSE))
summary(k1dists)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1467   12473   16794   22290   24987  370668

Calculate distance-based weight matrix:

Use max distance (i.e. longest distance) and add 1 for buffer

nb <- dnearneigh(coords, 0, 370669, longlat = FALSE)

Convert neighbours list to spatial weights:

nb_lw <- nb2listw(nb, style = 'W')

Perform the Moran’s I test for residual spatial autocorrelation:

lm.morantest(mlr2, nb_lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade +
## RURAL_URBAN + GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 +
## CROPPX_PER_HEC + GVA_PROP_INDUSTRY, data = brazil)
## weights: nb_lw
## 
## Moran I statistic standard deviate = 13.581, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.511154e-02    -4.834299e-04     1.318550e-06

The Global Moran’s I test for residual spatial autocorrelation shows that it’s p-value < 2.2e-16 which is less than the alpha value of 0.05. Hence, we will reject the null hypothesis that the residuals are randomly distributed.

Since the Observed Global Moran I = 1.511154e-02 which is greater than 0, we can infer than the residuals resemble cluster distribution.

5 Geographically weighted regression model of GDP per capita using adaptive bandwidth

Once again, the Geographically Weighted Regression allows us to analyse spatial non-stationarity which is the measurement of relationship among variables that differs between locations. It is done by computing a regression model for each observation.

Since the size of Brazil municipalities vary, we should use an adaptive spatial kernel instead of a fixed one. The adaptive bandwidth kernel allows the bandwidth to become a function of the number of nearest neighbours such that each observation is based on the same number of cities. Suppose a fixed bandwidth is used, the bandwidth distance is longer for larger cities and shorter for smaller cities. This makes it hard for us to determine an optimal bandwidth to use.

There are two approaches for calculating the bandwidth:

Cross-validation
Akaike information criterion corrected (AICc)

We will use both and determine the best model using three evaluation metrices:

AICc
Adjusted R-squared
Residual sum of squares

5.1 Cross-validation approach

5.1.1 Computing the adaptive bandwidth

Calculate distance matrix using gw.dist(). Data points used are Brazil municipalities from the SpatialPointsDataframe.

distmat <- gw.dist(dp.locat = coordinates(brazil.sp))

The bw.gwr() function computes the automatic bandwidth selection for a GWR model.

adaptive=TRUE for an adaptive bandwidth
approach="CV" for Cross-Validation approach
kernel="gaussian" for a gaussian function
longlat=FALSE for great circle distance to be used since the unit of projection is in m

The variables used are those significant variables identified earlier in MLR model.

bw.adaptive <- bw.gwr(`GDP_CAPITA`~`IDHM_Renda` + `IDHM_Longevidade` +
            `RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` +
             `GVA_PROP_INDUSTRY`,
            data=brazil.sp, approach="CV", kernel="gaussian",
            adaptive=TRUE, longlat=FALSE, dMat = distmat)

## Take a cup of tea and have a break, it will take a few minutes.
##           -----A kind suggestion from GWmodel development group
## Adaptive bandwidth: 3450 CV score: 1.26322e+12 
## Adaptive bandwidth: 2140 CV score: 1.247214e+12 
## Adaptive bandwidth: 1330 CV score: 1.228808e+12 
## Adaptive bandwidth: 829 CV score: 1.214902e+12 
## Adaptive bandwidth: 520 CV score: 1.207128e+12 
## Adaptive bandwidth: 328 CV score: 1.226663e+12 
## Adaptive bandwidth: 637 CV score: 1.209545e+12 
## Adaptive bandwidth: 445 CV score: 1.20858e+12 
## Adaptive bandwidth: 563 CV score: 1.207588e+12 
## Adaptive bandwidth: 489 CV score: 1.207671e+12 
## Adaptive bandwidth: 534 CV score: 1.207536e+12 
## Adaptive bandwidth: 506 CV score: 1.207316e+12 
## Adaptive bandwidth: 523 CV score: 1.207281e+12 
## Adaptive bandwidth: 512 CV score: 1.207233e+12 
## Adaptive bandwidth: 519 CV score: 1.207168e+12 
## Adaptive bandwidth: 514 CV score: 1.207089e+12 
## Adaptive bandwidth: 517 CV score: 1.207113e+12 
## Adaptive bandwidth: 519 CV score: 1.207168e+12 
## Adaptive bandwidth: 518 CV score: 1.207172e+12 
## Adaptive bandwidth: 519 CV score: 1.207168e+12 
## Adaptive bandwidth: 518 CV score: 1.207172e+12 
## Adaptive bandwidth: 518 CV score: 1.207172e+12 
## Adaptive bandwidth: 517 CV score: 1.207113e+12 
## Adaptive bandwidth: 517 CV score: 1.207113e+12 
## Adaptive bandwidth: 516 CV score: 1.207115e+12 
## Adaptive bandwidth: 516 CV score: 1.207115e+12 
## Adaptive bandwidth: 515 CV score: 1.207099e+12 
## Adaptive bandwidth: 515 CV score: 1.207099e+12 
## Adaptive bandwidth: 514 CV score: 1.207089e+12

With the CV approach, 514 data points are recommended.

5.1.2 Constructing the adaptive bandwidth gwr model

The gwr.basic() function implements the GWR model.

adaptive=TRUE for an adaptive bandwidth
bw=bw.adaptive to specify the bandwidth calculated earlier.

gwr.adaptive <- gwr.basic(`GDP_CAPITA`~`IDHM_Renda` + `IDHM_Longevidade` +
            `RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` +
             `GVA_PROP_INDUSTRY`,
            data=brazil.sp, bw=bw.adaptive, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)
gwr.adaptive

##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2020-05-31 22:17:59 
##    Call:
##    gwr.basic(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade + 
##     RURAL_URBAN + GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 + 
##     CROPPX_PER_HEC + GVA_PROP_INDUSTRY, data = brazil.sp, bw = bw.adaptive, 
##     kernel = "gaussian", adaptive = TRUE, longlat = FALSE)
## 
##    Dependent (y) variable:  GDP_CAPITA
##    Independent variables:  IDHM_Renda IDHM_Longevidade RURAL_URBAN GVA_TOTAL GVA_MAIN RES_POP_PROP_15.59 CROPPX_PER_HEC GVA_PROP_INDUSTRY
##    Number of data points: 5570
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##    Min     1Q Median     3Q    Max 
## -54161  -5188   -874   2693 252508 
## 
##    Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)                       -4.387e+04  6.552e+03  -6.696 2.35e-11 ***
##    IDHM_Renda                         8.879e+04  5.597e+03  15.864  < 2e-16 ***
##    IDHM_Longevidade                  -1.615e+04  8.199e+03  -1.970  0.04886 *  
##    RURAL_URBANAdjacent rural          1.964e+03  6.458e+02   3.042  0.00236 ** 
##    RURAL_URBANRemote intermediate     6.620e+03  2.012e+03   3.289  0.00101 ** 
##    RURAL_URBANRemote rural            6.905e+03  1.038e+03   6.651 3.18e-11 ***
##    RURAL_URBANUrban                   2.191e+03  7.196e+02   3.044  0.00234 ** 
##    RURAL_URBANWithout classification  2.168e+03  6.785e+03   0.319  0.74940    
##    GVA_TOTAL                          1.303e-04  2.175e-05   5.992 2.20e-09 ***
##    GVA_MAINAutomobile services        1.988e+04  2.277e+03   8.730  < 2e-16 ***
##    GVA_MAINConstruction              -2.393e+03  5.651e+03  -0.423  0.67197    
##    GVA_MAINEnergy services            2.358e+04  1.619e+03  14.565  < 2e-16 ***
##    GVA_MAINExtractive industries      1.996e+04  2.586e+03   7.717 1.40e-14 ***
##    GVA_MAINFisheries                 -1.178e+03  3.034e+03  -0.388  0.69777    
##    GVA_MAINLivestock                 -9.242e+03  1.298e+03  -7.121 1.21e-12 ***
##    GVA_MAINManufacturing industries   1.446e+04  1.121e+03  12.897  < 2e-16 ***
##    GVA_MAINOther services            -1.003e+04  7.448e+02 -13.465  < 2e-16 ***
##    GVA_MAINPublic services           -1.344e+04  7.033e+02 -19.111  < 2e-16 ***
##    RES_POP_PROP_15.59                 4.227e+04  9.794e+03   4.315 1.62e-05 ***
##    CROPPX_PER_HEC                    -7.329e+01  3.299e+01  -2.222  0.02633 *  
##    GVA_PROP_INDUSTRY                  2.728e+01  3.034e+00   8.990  < 2e-16 ***
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 14860 on 5549 degrees of freedom
##    Multiple R-squared: 0.4676
##    Adjusted R-squared: 0.4657 
##    F-statistic: 243.7 on 20 and 5549 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 1.225655e+12
##    Sigma(hat): 14836.6
##    AIC:  122847
##    AICc:  122847.2
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: gaussian 
##    Adaptive bandwidth: 514 (number of nearest neighbours)
##    Regression points: the same locations as observations are used.
##    Distance metric: Euclidean distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                                             Min.     1st Qu.      Median
##    Intercept                         -1.5706e+05 -7.1522e+04 -4.7143e+04
##    IDHM_Renda                         8.5105e+03  4.3618e+04  8.7402e+04
##    IDHM_Longevidade                  -9.9974e+04 -6.1629e+04 -5.1612e+03
##    RURAL_URBANAdjacent rural         -4.5147e+02  7.5420e+02  1.8534e+03
##    RURAL_URBANRemote intermediate     6.3517e+02  2.3047e+03  6.1966e+03
##    RURAL_URBANRemote rural           -1.4197e+03  1.8151e+03  5.0593e+03
##    RURAL_URBANUrban                  -4.6890e+02  1.5340e+03  2.2445e+03
##    RURAL_URBANWithout classification -1.9804e+04 -2.2142e+03  5.0892e+03
##    GVA_TOTAL                          6.7580e-05  1.2539e-04  1.6448e-04
##    GVA_MAINAutomobile services        3.3516e+03  1.0040e+04  1.6044e+04
##    GVA_MAINConstruction              -9.5004e+03 -5.9814e+03 -1.6002e+03
##    GVA_MAINEnergy services            8.4164e+03  1.4175e+04  2.0356e+04
##    GVA_MAINExtractive industries     -1.7468e+03  1.3497e+04  1.9613e+04
##    GVA_MAINFisheries                 -1.1084e+04 -3.6825e+03 -6.8811e+01
##    GVA_MAINLivestock                 -1.3486e+04 -9.5680e+03 -5.6149e+03
##    GVA_MAINManufacturing industries   6.1271e+03  1.2953e+04  1.8595e+04
##    GVA_MAINOther services            -1.4539e+04 -8.7634e+03 -7.4939e+03
##    GVA_MAINPublic services           -1.6544e+04 -1.3866e+04 -1.0873e+04
##    RES_POP_PROP_15.59                -3.5925e+03  1.9419e+04  3.8345e+04
##    CROPPX_PER_HEC                    -6.9442e+02 -1.0301e+02  3.2483e+01
##    GVA_PROP_INDUSTRY                  7.0969e+00  1.8020e+01  2.4047e+01
##                                          3rd Qu.        Max.
##    Intercept                         -2.2031e+04  -3648.5309
##    IDHM_Renda                         1.1091e+05 178842.8382
##    IDHM_Longevidade                   1.0048e+04  51710.7515
##    RURAL_URBANAdjacent rural          2.7296e+03   6263.1210
##    RURAL_URBANRemote intermediate     1.7698e+04  65592.5477
##    RURAL_URBANRemote rural            8.9983e+03  14671.2002
##    RURAL_URBANUrban                   2.9990e+03   4412.2162
##    RURAL_URBANWithout classification  3.9848e+04  63036.7410
##    GVA_TOTAL                          2.8841e-04      0.0007
##    GVA_MAINAutomobile services        3.2978e+04  63867.6426
##    GVA_MAINConstruction               2.9826e+03  10034.7902
##    GVA_MAINEnergy services            3.2181e+04  46050.6621
##    GVA_MAINExtractive industries      2.8758e+04  41146.5630
##    GVA_MAINFisheries                  9.4667e+03  17266.1776
##    GVA_MAINLivestock                 -3.8938e+03   -945.8665
##    GVA_MAINManufacturing industries   2.2352e+04  32989.5989
##    GVA_MAINOther services            -6.2534e+03  -3024.1746
##    GVA_MAINPublic services           -8.9900e+03  -7042.4360
##    RES_POP_PROP_15.59                 9.8460e+04 180990.3580
##    CROPPX_PER_HEC                     8.6959e+01    158.4598
##    GVA_PROP_INDUSTRY                  2.9918e+01     56.7262
##    ************************Diagnostic information*************************
##    Number of data points: 5570 
##    Effective number of parameters (2trace(S) - trace(S'S)): 154.8007 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5415.199 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 122121.3 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 121999.5 
##    Residual sum of squares: 1.039356e+12 
##    R-square value:  0.548561 
##    Adjusted R-square value:  0.5356537 
## 
##    ***********************************************************************
##    Program stops at: 2020-05-31 22:19:43

5.2 AIC corrected approach

5.2.1 Computing the adaptive bandwidth

For the AICc approach, the same parameters are used except approach="AICc" to specify the AICc approach.

bw.adaptive2 <- bw.gwr(`GDP_CAPITA`~`IDHM_Renda` + `IDHM_Longevidade` +
            `RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` +
             `GVA_PROP_INDUSTRY`,
            data=brazil.sp, approach="AICc", kernel="gaussian",
            adaptive=TRUE, longlat=FALSE, dMat = distmat)

## Take a cup of tea and have a break, it will take a few minutes.
##           -----A kind suggestion from GWmodel development group
## Adaptive bandwidth (number of nearest neighbours): 3450 AICc value: 122731.8 
## Adaptive bandwidth (number of nearest neighbours): 2140 AICc value: 122613.8 
## Adaptive bandwidth (number of nearest neighbours): 1330 AICc value: 122465.1 
## Adaptive bandwidth (number of nearest neighbours): 829 AICc value: 122302.2 
## Adaptive bandwidth (number of nearest neighbours): 520 AICc value: 122125.3 
## Adaptive bandwidth (number of nearest neighbours): 328 AICc value: 121957.2 
## Adaptive bandwidth (number of nearest neighbours): 210 AICc value: 121816.4 
## Adaptive bandwidth (number of nearest neighbours): 136 AICc value: 121653.7 
## Adaptive bandwidth (number of nearest neighbours): 91 AICc value: 121516.2 
## Error in gw_reg(X, Y, W.i, TRUE, i) : inv(): matrix seems singular
## Adaptive bandwidth (number of nearest neighbours): 62 AICc value: Inf 
## Adaptive bandwidth (number of nearest neighbours): 107 AICc value: 121567.6 
## Adaptive bandwidth (number of nearest neighbours): 79 AICc value: 121465.6 
## Adaptive bandwidth (number of nearest neighbours): 73 AICc value: 121449.6 
## Adaptive bandwidth (number of nearest neighbours): 68 AICc value: Inf 
## Adaptive bandwidth (number of nearest neighbours): 75 AICc value: 121456.3 
## Adaptive bandwidth (number of nearest neighbours): 70 AICc value: Inf 
## Adaptive bandwidth (number of nearest neighbours): 73 AICc value: 121449.6

With the AICc approach, a much smaller number of data points - 73 - is recommended.

5.2.2 Constructing the adaptive bandwidth gwr model

gwr.adaptive2 <- gwr.basic(`GDP_CAPITA`~`IDHM_Renda` + `IDHM_Longevidade` +
            `RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` +
             `GVA_PROP_INDUSTRY`,
            data=brazil.sp, bw=bw.adaptive2, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)
gwr.adaptive2

##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2020-05-31 22:34:57 
##    Call:
##    gwr.basic(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade + 
##     RURAL_URBAN + GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 + 
##     CROPPX_PER_HEC + GVA_PROP_INDUSTRY, data = brazil.sp, bw = bw.adaptive2, 
##     kernel = "gaussian", adaptive = TRUE, longlat = FALSE)
## 
##    Dependent (y) variable:  GDP_CAPITA
##    Independent variables:  IDHM_Renda IDHM_Longevidade RURAL_URBAN GVA_TOTAL GVA_MAIN RES_POP_PROP_15.59 CROPPX_PER_HEC GVA_PROP_INDUSTRY
##    Number of data points: 5570
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##    Min     1Q Median     3Q    Max 
## -54161  -5188   -874   2693 252508 
## 
##    Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)                       -4.387e+04  6.552e+03  -6.696 2.35e-11 ***
##    IDHM_Renda                         8.879e+04  5.597e+03  15.864  < 2e-16 ***
##    IDHM_Longevidade                  -1.615e+04  8.199e+03  -1.970  0.04886 *  
##    RURAL_URBANAdjacent rural          1.964e+03  6.458e+02   3.042  0.00236 ** 
##    RURAL_URBANRemote intermediate     6.620e+03  2.012e+03   3.289  0.00101 ** 
##    RURAL_URBANRemote rural            6.905e+03  1.038e+03   6.651 3.18e-11 ***
##    RURAL_URBANUrban                   2.191e+03  7.196e+02   3.044  0.00234 ** 
##    RURAL_URBANWithout classification  2.168e+03  6.785e+03   0.319  0.74940    
##    GVA_TOTAL                          1.303e-04  2.175e-05   5.992 2.20e-09 ***
##    GVA_MAINAutomobile services        1.988e+04  2.277e+03   8.730  < 2e-16 ***
##    GVA_MAINConstruction              -2.393e+03  5.651e+03  -0.423  0.67197    
##    GVA_MAINEnergy services            2.358e+04  1.619e+03  14.565  < 2e-16 ***
##    GVA_MAINExtractive industries      1.996e+04  2.586e+03   7.717 1.40e-14 ***
##    GVA_MAINFisheries                 -1.178e+03  3.034e+03  -0.388  0.69777    
##    GVA_MAINLivestock                 -9.242e+03  1.298e+03  -7.121 1.21e-12 ***
##    GVA_MAINManufacturing industries   1.446e+04  1.121e+03  12.897  < 2e-16 ***
##    GVA_MAINOther services            -1.003e+04  7.448e+02 -13.465  < 2e-16 ***
##    GVA_MAINPublic services           -1.344e+04  7.033e+02 -19.111  < 2e-16 ***
##    RES_POP_PROP_15.59                 4.227e+04  9.794e+03   4.315 1.62e-05 ***
##    CROPPX_PER_HEC                    -7.329e+01  3.299e+01  -2.222  0.02633 *  
##    GVA_PROP_INDUSTRY                  2.728e+01  3.034e+00   8.990  < 2e-16 ***
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 14860 on 5549 degrees of freedom
##    Multiple R-squared: 0.4676
##    Adjusted R-squared: 0.4657 
##    F-statistic: 243.7 on 20 and 5549 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 1.225655e+12
##    Sigma(hat): 14836.6
##    AIC:  122847
##    AICc:  122847.2
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: gaussian 
##    Adaptive bandwidth: 73 (number of nearest neighbours)
##    Regression points: the same locations as observations are used.
##    Distance metric: Euclidean distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                                             Min.     1st Qu.      Median
##    Intercept                         -5.8415e+05 -8.2176e+04 -2.4675e+04
##    IDHM_Renda                        -4.7526e+04  2.0540e+04  5.3026e+04
##    IDHM_Longevidade                  -1.7100e+05 -2.4293e+04 -1.7001e+02
##    RURAL_URBANAdjacent rural         -9.8291e+03 -2.7287e+02  1.2075e+03
##    RURAL_URBANRemote intermediate    -5.7906e+03  4.2024e+02  2.6362e+03
##    RURAL_URBANRemote rural           -2.3799e+04 -2.1229e+03  5.5992e+02
##    RURAL_URBANUrban                  -1.5131e+04 -1.2073e+02  1.0832e+03
##    RURAL_URBANWithout classification -4.0165e+04 -2.8647e+03  9.6314e+03
##    GVA_TOTAL                         -1.1763e-03  1.9422e-04  4.5728e-04
##    GVA_MAINAutomobile services       -2.5264e+04  3.8383e+03  1.2744e+04
##    GVA_MAINConstruction              -2.5941e+04 -7.6904e+03 -1.2297e+03
##    GVA_MAINEnergy services           -2.4483e+04  6.7263e+03  1.9336e+04
##    GVA_MAINExtractive industries     -1.4134e+04  1.0860e+04  1.9199e+04
##    GVA_MAINFisheries                 -2.1858e+04 -5.4305e+03  7.0074e+02
##    GVA_MAINLivestock                 -2.9123e+04 -1.0278e+04 -3.3315e+03
##    GVA_MAINManufacturing industries  -2.3264e+04  6.4007e+03  1.2687e+04
##    GVA_MAINOther services            -3.9024e+04 -9.4419e+03 -5.5550e+03
##    GVA_MAINPublic services           -3.2173e+04 -1.3585e+04 -9.4485e+03
##    RES_POP_PROP_15.59                -7.3438e+04  1.1545e+04  3.1123e+04
##    CROPPX_PER_HEC                    -1.9938e+03 -3.1155e+02 -6.9616e+00
##    GVA_PROP_INDUSTRY                 -3.0009e+01  3.8901e+00  1.3568e+01
##                                          3rd Qu.       Max.
##    Intercept                         -5.3588e+03 4.8046e+04
##    IDHM_Renda                         9.6294e+04 3.4428e+05
##    IDHM_Longevidade                   1.4606e+04 1.6333e+05
##    RURAL_URBANAdjacent rural          3.2162e+03 1.3293e+04
##    RURAL_URBANRemote intermediate     1.5386e+04 9.9550e+04
##    RURAL_URBANRemote rural            5.7434e+03 2.9267e+04
##    RURAL_URBANUrban                   2.6505e+03 1.0235e+04
##    RURAL_URBANWithout classification  5.9914e+04 7.8985e+04
##    GVA_TOTAL                          8.9064e-04 5.4000e-03
##    GVA_MAINAutomobile services        2.6056e+04 8.0397e+04
##    GVA_MAINConstruction               7.5216e+03 3.2031e+04
##    GVA_MAINEnergy services            3.4841e+04 7.3178e+04
##    GVA_MAINExtractive industries      3.0719e+04 9.0605e+04
##    GVA_MAINFisheries                  9.7290e+03 3.0713e+04
##    GVA_MAINLivestock                  1.1565e+03 9.4133e+03
##    GVA_MAINManufacturing industries   2.2383e+04 7.6055e+04
##    GVA_MAINOther services            -2.3132e+03 5.0042e+03
##    GVA_MAINPublic services           -5.9701e+03 1.8841e+03
##    RES_POP_PROP_15.59                 9.1626e+04 5.8977e+05
##    CROPPX_PER_HEC                     1.3027e+02 6.7404e+02
##    GVA_PROP_INDUSTRY                  3.3779e+01 1.1475e+02
##    ************************Diagnostic information*************************
##    Number of data points: 5570 
##    Effective number of parameters (2trace(S) - trace(S'S)): 716.8634 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 4853.137 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 121449.6 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 120804.5 
##    Residual sum of squares: 778337760180 
##    R-square value:  0.6619329 
##    Adjusted R-square value:  0.6119863 
## 
##    ***********************************************************************
##    Program stops at: 2020-05-31 22:36:28

5.3 Comparison of GWR models

We can use the AICc, adjusted R-squared and residual sum of squares to evaluate the performance between the two GWR models.

AICc: Based on Fotheringham, for the same sample data, if the difference in the AICc value between two models is greater than 3, then the model with a lower value has a better ﬁt for the observed data.
Adjusted R-squared and residual sum of squares: A higher adjusted R2 and lower residual squares also indicate a better model.

Evaluation metrics	Model 1 (gwr.adaptive)	Model 2 (gwr.adaptive2)
AICc	122121.3	121449.6
Adjusted R-squared	0.5356537	0.6119863
Residual sum of squares	1,039,356,000,000	778,337,760,180

The two model’s AICc is different by at least 3. Thus Model 2 with the lower AICc is the better model. Since Model 2’s adjusted R-squared 0.6119863 > Model 1’s 0.5356537 and Model 2’s residual sum of squares 78,337,760,180 < Model 1’s 1,039,356,000,000, Model 2 is a better model. Thus, gwr.adaptive2 with Guassian kernel and AICc approach is a better fit model. We will use this model in our subsequent visualization of GWR output.

5.4 Visualising GWR Output

5.4.1 Converting SDF into sf data.frame

brazil.sf.adaptive <- st_as_sf(gwr.adaptive2$SDF) %>%
  st_transform(crs=5641)

5.4.2 Visualizing residuals

Residuals are found by observed y (y) - fitted y (yhat).

Standardise residuals so that mean=0, standard deviation=1.

brazil.sf.adaptive$residual_scaled <- as.numeric(round(scale(brazil.sf.adaptive$residual), digits = 4))

tm_shape(brazil.sf.adaptive)+
  tm_polygons() +
tm_shape(brazil.sf.adaptive) +  
  tm_fill(col = "residual_scaled",
          alpha = 0.6, 
          palette = "RdBu",
          breaks = c(-Inf, -0.5, 0, 0.5, Inf))

The predicted GDP per capita is very close to the actual value for cities in the northeastern region, where most of the scaled residuals are between -0.5 and 0.5. From the center to the southern region, the GWR model is unable to predict very accurately the GDP per capita. This may suggest that more variables are needed in the model.

5.4.3 Visualizing Local R2

Local R2 measures how well the local regression model fits the observed y values. A very low value indicates poor fitting which may be due to the absence of important variables from the regression model.

tm_shape(brazil.sf.adaptive)+
  tm_polygons() +
tm_shape(brazil.sf.adaptive) +  
  tm_fill(col = "Local_R2",
          alpha = 0.6,
          n = 4,
          style="quantile")

From the above plot, we can see that the top 75% Local R2 values are higher than the MLR model adjusted R-squared.

In fact, the lowest Local R2 value of 0.40 is only slightly worse off than mlr2 with adjusted R-squared 0.4657. Cities with the bottom 25% Local R2 values are mostly situated in the south, and some parts of the northeast. This may indicate that some variables are missing in the GWR model.

5.4.4 Visualizing coefficient estimates

The coefficient estimates of the significant continuous variables at 95% confidence of the MLR model are plotted.

m1 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
  tm_shape(brazil.sf.adaptive) +
  tm_fill(col = "IDHM_Renda", breaks = c(-Inf, 0, 30000, 60000, 90000, Inf), alpha = 0.6, palette = "RdBu")
m2 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
  tm_shape(brazil.sf.adaptive) +
  tm_fill(col = "GVA_TOTAL", breaks = c(-Inf, 0, 0.0005, 0.001, 0.0015, Inf), alpha = 0.6, palette = "RdBu")
m3 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
  tm_shape(brazil.sf.adaptive) +
  tm_fill(col = "RES_POP_PROP_15.59", breaks = c(-Inf, 0, 20000, 40000, 60000, Inf), alpha = 0.6, palette = "RdBu")
m4 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
  tm_shape(brazil.sf.adaptive) +
  tm_fill(col = "GVA_PROP_INDUSTRY", breaks = c(-Inf, 0, 10, 20, 30, Inf), alpha = 0.6, palette = "RdBu")
tmap_arrange(m1, m2, m3, m4,
             asp=1, ncol=2, nrow=2)

Interpretaion of plots:

Positive contribution of HDI GNI Index to GDP per capita is high in northern, northwestern and some parts of southern Brazil. Notably, the contribution of HDI GNI Index to GDP per capita is negative in the northeast.
The effect of total gross value-added to GDP per capita is marginal but positive in most parts of Brazil.
Proportion of resident population who are economically active contributed to GDP per capita positively in central Brazil. In constrast, its effect is negative in northern and northwestern Brazil.
Proportion of GVA contributed by Industry services is positive in the southern Brazil. In contrast, its contribution is negative in central Brazil.

However, it should be noted that these values do not signify the variable significance. An observation with a high parameter estimate does not equate to the strength of relationship between the independent variable and the response variable, GDP per capita.

5.4.5 Visualizing coefficient standard errors

The coefficient standard error measures the realiability of each coefficient estimate. Small values indicate higher confidence while large values may indicate problems with local collinearity.

The coefficient estimates of the significant continuous variables at 95% confidence of the MLR model are plotted.

m1 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
  tm_shape(brazil.sf.adaptive) +
  tm_fill(col = "IDHM_Renda_SE", style = "equal", alpha = 0.6, palette = "RdBu")
m2 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
  tm_shape(brazil.sf.adaptive) +
  tm_fill(col = "GVA_TOTAL_SE", style = "equal", alpha = 0.6, palette = "RdBu")
m3 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
  tm_shape(brazil.sf.adaptive) +
  tm_fill(col = "RES_POP_PROP_15.59_SE", style = "equal", alpha = 0.6, palette = "RdBu")
m4 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
  tm_shape(brazil.sf.adaptive) +
  tm_fill(col = "GVA_PROP_INDUSTRY_SE",style = "equal", alpha = 0.6, palette = "RdBu")
tmap_arrange(m1, m2, m3, m4,
             asp=1, ncol=2, nrow=2)

Interpretation of plots:

The standard error for the HDI GNI Index is high. This indicates problems with local collinearity. It is significantly higher for some parts of northeastern and southern Brazil. Comparing to the choropleth map of HDI GNI Index coefficient, these areas coincide with the areas with negative parameter estimates.
The standard error for total GVA coefficients is very small. This indicates high reliability. However, if we refer back to the choropleth map of total GVA coefficients, we realise that the change in this variable does not affect GDP per capita much.
Interestingly, the standard error for the proportion of economically active resident population displays a similar pattern to the standard error plot for HDI GNI Index.
The standard error for proportion of GVA contributed by industry is small. This indicates high reliability, except for some parts of the northest region where the errors are higher.

6 Conclusion

6.1 Comparison of MLR and GWR models

Once again, we can use the three evaluation metrics to compare the OLS and GWR models.

Evaluation metrics	OLS model	GWR model
AICc	122847.2	121449.6
Adjusted R-squared	0.4657	0.6119863
Residual sum of squares	1,225,655,000,000	778,337,760,180

The two model’s AICc is different by at least 3. Thus GWR model with the lower AICc is the better model. Since GWR model’s adjusted R-squared 0.6119863 > OLS models’s 0.4657 and GWR model’s residual sum of squares 78,337,760,180 < OLS model’s 1,225,655,000,000, GWR is a better model.

6.2 Limitations

From the Local R2 choropleth map of the GWR model, we can tell that the fit for the bottom 25% cities (0.400 - 0.616) is much lower than the top 25% cities (0.867, 0.992). This means that there are may be some missing independent variables in the GWR model to explain the GDP per capita of cities with low Local R2 values.

Comparing with the GDP per capita choropleth map at the EDA, the southern region with low Local R2 correspond to high GDP per capita cities while the northeastern region with low Local R2 correspond to low GDP per capita cities.

Lastly, the data used in this analysis taps on different years. Although we are examining the factors that affect GDP per capita 2016 values, we also used data that are beyond 2016.

Take-home_Ex4

Ong Wei Ching

5/26/2020

1 Introduction

1.1 Brazil, an unequal country

1.2 Data used

Dependent variable

Independent variable

1.3 Methodology

2 Data Import and Preparation

2.2 Importing aspatial data as Dataframe

Brazilian municipality

Brazil spatial boundary

2.3 Data wrangling

2.3.1 Convert name of cities to uppercase

2.3.2 Assign CRS=5614 to municipalities sf dataframe

2.3.3 Join two dataframes together

2.3.4 Select variables

2.3.5 Transform variables

Demographic

Economic activity

Rename column values for easier understanding:

2.3.6 Deal with NA values

Dealing with missing values for GDP per capita

Dealing with missing values for other columns

2.3.7 Convert dataframe to sf object

3 Exploratory data analysis

3.1 EDA using statistical graphics

3.2 EDA using choropleth map

3.2.1 Mapping GDP per capita

3.3 Correlation Analysis

4 Multiple linear regression model of GDP per capita

4.1 Building a multiple linear regression model

4.1.1 Modelling with all variables

4.1.2 Modelling with statistically significant variables

Multiple linear regression model

Model interpretation

Model evaluation

4.2 Regression assumption checking

4.2.1 Linearity assumption

4.2.2 Normality assumption

4.2.3 No multicollinearity assumption

4.2.4 Spatial autocorrelation assumption

Moran’s I test

5 Geographically weighted regression model of GDP per capita using adaptive bandwidth

5.1 Cross-validation approach

5.1.1 Computing the adaptive bandwidth

5.1.2 Constructing the adaptive bandwidth gwr model

5.2 AIC corrected approach

5.2.1 Computing the adaptive bandwidth

5.2.2 Constructing the adaptive bandwidth gwr model

5.3 Comparison of GWR models

5.4 Visualising GWR Output

5.4.1 Converting SDF into sf data.frame

5.4.2 Visualizing residuals

5.4.3 Visualizing Local R2

5.4.4 Visualizing coefficient estimates

5.4.5 Visualizing coefficient standard errors

6 Conclusion

6.1 Comparison of MLR and GWR models

6.2 Limitations