1 Introduction

1.1 Overview

In this exercise, we will determine factors affecting the unequal development of Brazil at the municipality level by using the data provided. The specific task of the analysis are as follows:

  • Prepare a choropleth map showing the distribution of GDP per capita, 2016 at municipality level.

  • Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using multiple linear regression method.

  • Prepare a choropleth map showing the distribution of the residual of the GDP per capita.

  • Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using geographically weighted regression method.

  • Prepare a series of choropleth maps showing the outputs of the geographically weighted regression model.

1.2 The Data

The following data sets will be used for this study:

  1. BRAZIL_CITIES.csv in csv format obtained from Kaggle

  2. 2016 Brazil Municipality Boundary File obtained using ‘geobr’ package

1.3 Installing & Loading Relevant Packages

We will install and launch the relevant R packages required for this exercise in R environment.

packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse', 'geobr')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2 Data Wrangling

2.1 Geospatial Data

The 2016 Municipality Boundary Layer will be the geospatial data used for this exercise. It will be downloaded using the ‘geobr’ package.

2.1.1 Viewing Datasets Available in ‘geobr’

We wil obtain a list of the geospatial datasets available in ‘geobr’ to select which one we need.

datasets <- list_geobr()
print(datasets)
## # A tibble: 22 x 4
##    `function`       geography             years                           source
##    <chr>            <chr>                 <chr>                           <chr> 
##  1 `read_country`   Country               1872, 1900, 1911, 1920, 1933, ~ IBGE  
##  2 `read_region`    Region                2000, 2001, 2010, 2013, 2014, ~ IBGE  
##  3 `read_state`     States                1872, 1900, 1911, 1920, 1933, ~ IBGE  
##  4 `read_meso_regi~ Meso region           2000, 2001, 2010, 2013, 2014, ~ IBGE  
##  5 `read_micro_reg~ Micro region          2000, 2001, 2010, 2013, 2014, ~ IBGE  
##  6 `read_intermedi~ Intermediate region   2017                            IBGE  
##  7 `read_immediate~ Immediate region      2017                            IBGE  
##  8 `read_municipal~ Municipality          1872, 1900, 1911, 1920, 1933, ~ IBGE  
##  9 `read_weighting~ Census weighting are~ 2010                            IBGE  
## 10 `read_census_tr~ Census tract (setor ~ 2000, 2010                      IBGE  
## # ... with 12 more rows

2.1.2 Downloading Municipality Boundaries for 2016

We will use the read_minicipality() function to download the shape files of Brazil municipalities as ‘sf’ objects. code_muni is set to “all” to obtain municipalities of all states in Brazil and year is set to ‘2016’ to get data from 2016. The data is in the Geodetic Reference System “SIRGAS2000” and CRS 4674.

mun_2016 <- read_municipality(code_muni = "all", year = 2016, simplified = TRUE)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |======================================================================| 100%

2.1.3 Checking & Rectifying Invalid/NA/Duplicated Values

2.1.3.1 Invalid Polygons Check

test <- st_is_valid(mun_2016)
length(which(test==FALSE))
## [1] 1

There is one invalid polygon as observed from the st_is_valid test. Hence, we need to make this polygon valid.

mun_2016 <- st_make_valid(mun_2016)
test <- st_is_valid(mun_2016)
length(which(test==FALSE))
## [1] 0

The invalid polygon is now valid.

2.1.3.2 NA Values Check

any(is.na(mun_2016))
## [1] FALSE

There are no NA values.

2.1.3.3 Duplicated Values Check

any(duplicated(mun_2016$code_muni))
## [1] FALSE

There are no duplicated values.

2.1.4 Extracting Relevant Columns

Only some of the columns in the dataframe are relevant to our analysis hence we will extract them.

mun_boundary <- mun_2016 %>%
  select("name_muni", "abbrev_state", "code_muni", "geom")

2.1.5 Viewing Municipalities

# Remove plot axis

  no_axis <- theme(axis.title=element_blank(),
                   axis.text=element_blank(),
                   axis.ticks=element_blank())

# Plot all Brazilian municipalities

  ggplot() +
    geom_sf(data= mun_2016, fill="#2D3E50", color="#FEBF57", size=.15, show.legend = FALSE) +
    labs(subtitle="Municipalities", size=8) +
    theme_minimal() +
    no_axis

2.2 Aspatial Data

2.2.1 Importing Aspatial Data

BRAZIL_CITIES.csv is in csv file format, separated by ‘;’. Hence, we will use read.delim() function to import it into R to ensure that decimal points are retained in the data and it is ONLY separated by ‘;’,

brazil_info <- read.delim("data/aspatial/BRAZIL_CITIES.csv", sep = ";")

Viewing Data Type and Fields:

glimpse(brazil_info)
## Rows: 5,573
## Columns: 81
## $ CITY                   <fct> Abadia De Goiás, Abadia Dos Dourados, Abadi...
## $ STATE                  <fct> GO, MG, GO, MG, PA, CE, BA, BA, PR, SC, PA, ...
## $ CAPITAL                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ IBGE_RES_POP           <int> 6876, 6704, 15757, 22690, 141100, 10496, 831...
## $ IBGE_RES_POP_BRAS      <int> 6876, 6704, 15609, 22690, 141040, 10496, 831...
## $ IBGE_RES_POP_ESTR      <int> 0, 0, 148, 0, 60, 0, 0, 0, 0, 0, 0, 16, 17, ...
## $ IBGE_DU                <int> 2137, 2328, 4655, 7694, 31061, 2791, 2572, 4...
## $ IBGE_DU_URBAN          <int> 1546, 1481, 3233, 6667, 19057, 1251, 1193, 2...
## $ IBGE_DU_RURAL          <int> 591, 847, 1422, 1027, 12004, 1540, 1379, 195...
## $ IBGE_POP               <int> 5300, 4154, 10656, 18464, 82956, 4538, 3725,...
## $ IBGE_1                 <int> 69, 38, 139, 176, 1354, 98, 37, 167, 69, 12,...
## $ IBGE_1.4               <int> 318, 207, 650, 856, 5567, 323, 156, 733, 302...
## $ IBGE_5.9               <int> 438, 260, 894, 1233, 7618, 421, 263, 978, 37...
## $ IBGE_10.14             <int> 517, 351, 1087, 1539, 8905, 483, 277, 927, 4...
## $ IBGE_15.59             <int> 3542, 2709, 6896, 11979, 53516, 2631, 2319, ...
## $ IBGE_60.               <int> 416, 589, 990, 2681, 5996, 582, 673, 803, 81...
## $ IBGE_PLANTED_AREA      <int> 319, 4479, 10307, 1862, 25200, 2598, 895, 20...
## $ IBGE_CROP_PRODUCTION_. <int> 1843, 18017, 33085, 7502, 700872, 5234, 3999...
## $ IDHM.Ranking.2010      <int> 1689, 2207, 2202, 1994, 3530, 3522, 4086, 47...
## $ IDHM                   <dbl> 0.708, 0.690, 0.690, 0.698, 0.628, 0.628, 0....
## $ IDHM_Renda             <dbl> 0.687, 0.693, 0.671, 0.720, 0.579, 0.540, 0....
## $ IDHM_Longevidade       <dbl> 0.830, 0.839, 0.841, 0.848, 0.798, 0.748, 0....
## $ IDHM_Educacao          <dbl> 0.622, 0.563, 0.579, 0.556, 0.537, 0.612, 0....
## $ LONG                   <dbl> -49.44055, -47.39683, -48.71881, -45.44619, ...
## $ LAT                    <dbl> -16.758812, -18.487565, -16.182672, -19.1558...
## $ ALT                    <dbl> 893.60, 753.12, 1017.55, 644.74, 10.12, 403....
## $ PAY_TV                 <int> 360, 77, 227, 1230, 3389, 29, 952, 51, 55, 1...
## $ FIXED_PHONES           <int> 842, 296, 720, 1716, 1218, 34, 335, 222, 392...
## $ AREA                   <fct> "147.26", "881.06", "1,045.13", "1,817.07", ...
## $ REGIAO_TUR             <fct> , Caminhos Do Cerrado, Região Turística Do...
## $ CATEGORIA_TUR          <fct> , D, C, D, D, , D, , , D, E, D, D, D, , E, ,...
## $ ESTIMATED_POP          <int> 8583, 6972, 19614, 23223, 156292, 11663, 876...
## $ RURAL_URBAN            <fct> Urbano, Rural Adjacente, Rural Adjacente, Ur...
## $ GVA_AGROPEC            <dbl> 6.20, 50524.57, 42.84, 113824.60, 140463.72,...
## $ GVA_INDUSTRY           <dbl> 27991.25, 25917.70, 16728.30, 31002.62, 5861...
## $ GVA_SERVICES           <dbl> 74750.32, 62689.23, 138198.58, 172.33, 46812...
## $ GVA_PUBLIC             <dbl> 36915.04, 28083.79, 63396.20, 86081.41, 4868...
## $ GVA_TOTAL              <dbl> 145857.60, 167215.28, 261161.91, 403241.27, ...
## $ TAXES                  <dbl> 20554.20, 12873.50, 26822.58, 26994.09, 9518...
## $ GDP                    <dbl> 166.41, 180.09, 287984.49, 430235.36, 124925...
## $ POP_GDP                <int> 8053, 7037, 18427, 23574, 151934, 11483, 921...
## $ GDP_CAPITA             <dbl> 20664.57, 25591.70, 15628.40, 18250.42, 8222...
## $ GVA_MAIN               <fct> "Demais serviços", "Demais serviços", "Dem...
## $ MUN_EXPENDIT           <dbl> 28227691, 17909274, 37513019, NA, NA, NA, NA...
## $ COMP_TOT               <int> 284, 476, 288, 621, 931, 86, 191, 87, 285, 6...
## $ COMP_A                 <int> 5, 6, 5, 18, 4, 1, 6, 2, 5, 2, 0, 8, 3, 1, 4...
## $ COMP_B                 <int> 1, 6, 9, 1, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0,...
## $ COMP_C                 <int> 56, 30, 26, 40, 43, 4, 8, 3, 20, 4, 9, 40, 3...
## $ COMP_D                 <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ COMP_E                 <int> 2, 2, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0,...
## $ COMP_F                 <int> 29, 34, 7, 20, 27, 6, 4, 0, 10, 2, 0, 25, 12...
## $ COMP_G                 <int> 110, 190, 117, 303, 500, 48, 97, 71, 133, 35...
## $ COMP_H                 <int> 26, 70, 12, 62, 16, 2, 5, 0, 18, 8, 1, 67, 3...
## $ COMP_I                 <int> 4, 28, 57, 30, 31, 10, 5, 1, 14, 3, 0, 25, 1...
## $ COMP_J                 <int> 5, 11, 2, 9, 6, 2, 3, 1, 8, 1, 1, 9, 5, 14, ...
## $ COMP_K                 <int> 0, 0, 1, 6, 1, 0, 1, 0, 0, 1, 0, 4, 3, 3, 0,...
## $ COMP_L                 <int> 2, 4, 0, 4, 1, 0, 0, 0, 4, 0, 0, 7, 4, 4, 0,...
## $ COMP_M                 <int> 10, 15, 7, 28, 22, 2, 5, 0, 11, 4, 2, 26, 17...
## $ COMP_N                 <int> 12, 29, 15, 27, 16, 3, 5, 1, 26, 0, 1, 16, 1...
## $ COMP_O                 <int> 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 6, 2, 4, 2, 4,...
## $ COMP_P                 <int> 6, 9, 11, 15, 155, 0, 8, 0, 8, 1, 6, 14, 11,...
## $ COMP_Q                 <int> 6, 14, 5, 19, 33, 2, 1, 2, 9, 3, 0, 13, 22, ...
## $ COMP_R                 <int> 1, 6, 1, 9, 15, 0, 2, 0, 4, 0, 0, 4, 6, 6, 0...
## $ COMP_S                 <int> 5, 19, 8, 27, 56, 4, 38, 4, 12, 3, 4, 23, 38...
## $ COMP_T                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COMP_U                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HOTELS                 <int> NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA...
## $ BEDS                   <int> NA, NA, 34, NA, NA, NA, 24, NA, NA, NA, NA, ...
## $ Pr_Agencies            <int> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0, 2...
## $ Pu_Agencies            <int> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1, 3...
## $ Pr_Bank                <int> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0, 2...
## $ Pu_Bank                <int> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1, 3...
## $ Pr_Assets              <dbl> NA, NA, 33724584, 44974716, 76181384, NA, NA...
## $ Pu_Assets              <dbl> NA, NA, 67091904, 371922572, 800078483, NA, ...
## $ Cars                   <int> 2158, 2227, 2838, 6928, 5277, 553, 896, 613,...
## $ Motorcycles            <int> 1246, 1142, 1426, 2953, 25661, 1674, 696, 15...
## $ Wheeled_tractor        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0,...
## $ UBER                   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ MAC                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ WAL.MART               <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ POST_OFFICES           <int> 1, 1, 3, 4, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,...

2.2.2 Extracting Relevant Variables

Since we are determining the factors affecting the unequal development of Brazil based on the GDP in 2016, we will only extract variables that are relevant to our analysis. We will exclude variables that are not required and do not have a significant impact on GDP per capita. For example, we have the data for resident population regular urban planning for people aged 0-60+. We will only require the data of residents aged 15-59 who are active in the economy.

brazil_variables <- brazil_info %>%
  select(1:4, 8:9, 15, 17:18, 20, 24:25, 27:29, 38:39, 42, 44:45, 67, 71:77)
summary(brazil_info)
##              CITY          STATE         CAPITAL          IBGE_RES_POP     
##  Bom Jesus     :   5   MG     : 853   Min.   :0.000000   Min.   :     805  
##  São Domingos :   5   SP     : 645   1st Qu.:0.000000   1st Qu.:    5235  
##  Bonito        :   4   RS     : 498   Median :0.000000   Median :   10934  
##  Planalto      :   4   BA     : 418   Mean   :0.004845   Mean   :   34278  
##  São Francisco:   4   PR     : 399   3rd Qu.:0.000000   3rd Qu.:   23424  
##  Santa Helena  :   4   SC     : 295   Max.   :1.000000   Max.   :11253503  
##  (Other)       :5547   (Other):2465                      NA's   :8         
##  IBGE_RES_POP_BRAS  IBGE_RES_POP_ESTR     IBGE_DU        IBGE_DU_URBAN    
##  Min.   :     805   Min.   :     0.0   Min.   :    239   Min.   :     60  
##  1st Qu.:    5230   1st Qu.:     0.0   1st Qu.:   1572   1st Qu.:    874  
##  Median :   10926   Median :     0.0   Median :   3174   Median :   1846  
##  Mean   :   34200   Mean   :    77.5   Mean   :  10303   Mean   :   8859  
##  3rd Qu.:   23390   3rd Qu.:    10.0   3rd Qu.:   6726   3rd Qu.:   4624  
##  Max.   :11133776   Max.   :119727.0   Max.   :3576148   Max.   :3548433  
##  NA's   :8          NA's   :8          NA's   :10        NA's   :10       
##  IBGE_DU_RURAL      IBGE_POP            IBGE_1            IBGE_1.4     
##  Min.   :    3   Min.   :     174   Min.   :     0.0   Min.   :     5  
##  1st Qu.:  487   1st Qu.:    2801   1st Qu.:    38.0   1st Qu.:   158  
##  Median :  931   Median :    6170   Median :    92.0   Median :   376  
##  Mean   : 1463   Mean   :   27595   Mean   :   383.3   Mean   :  1544  
##  3rd Qu.: 1832   3rd Qu.:   15302   3rd Qu.:   232.0   3rd Qu.:   951  
##  Max.   :33809   Max.   :10463636   Max.   :129464.0   Max.   :514794  
##  NA's   :81      NA's   :8          NA's   :8          NA's   :8       
##     IBGE_5.9        IBGE_10.14       IBGE_15.59         IBGE_60.      
##  Min.   :     7   Min.   :    12   Min.   :     94   Min.   :     29  
##  1st Qu.:   220   1st Qu.:   259   1st Qu.:   1734   1st Qu.:    341  
##  Median :   516   Median :   588   Median :   3841   Median :    722  
##  Mean   :  2069   Mean   :  2381   Mean   :  18212   Mean   :   3004  
##  3rd Qu.:  1300   3rd Qu.:  1478   3rd Qu.:   9628   3rd Qu.:   1724  
##  Max.   :684443   Max.   :783702   Max.   :7058221   Max.   :1293012  
##  NA's   :8        NA's   :8        NA's   :8         NA's   :8        
##  IBGE_PLANTED_AREA   IBGE_CROP_PRODUCTION_. IDHM.Ranking.2010      IDHM       
##  Min.   :      0.0   Min.   :      0        Min.   :   1      Min.   :0.4180  
##  1st Qu.:    910.2   1st Qu.:   2326        1st Qu.:1392      1st Qu.:0.5990  
##  Median :   3471.5   Median :  13846        Median :2783      Median :0.6650  
##  Mean   :  14179.9   Mean   :  57384        Mean   :2783      Mean   :0.6592  
##  3rd Qu.:  11194.2   3rd Qu.:  55619        3rd Qu.:4174      3rd Qu.:0.7180  
##  Max.   :1205669.0   Max.   :3274885        Max.   :5565      Max.   :0.8620  
##  NA's   :3           NA's   :3              NA's   :8         NA's   :8       
##    IDHM_Renda     IDHM_Longevidade IDHM_Educacao         LONG       
##  Min.   :0.4000   Min.   :0.6720   Min.   :0.2070   Min.   :-72.92  
##  1st Qu.:0.5720   1st Qu.:0.7690   1st Qu.:0.4900   1st Qu.:-50.87  
##  Median :0.6540   Median :0.8080   Median :0.5600   Median :-46.52  
##  Mean   :0.6429   Mean   :0.8016   Mean   :0.5591   Mean   :-46.23  
##  3rd Qu.:0.7070   3rd Qu.:0.8360   3rd Qu.:0.6310   3rd Qu.:-41.40  
##  Max.   :0.8910   Max.   :0.8940   Max.   :0.8250   Max.   :-32.44  
##  NA's   :8        NA's   :8        NA's   :8        NA's   :9       
##       LAT               ALT               PAY_TV         FIXED_PHONES    
##  Min.   :-33.688   Min.   :     0.0   Min.   :      1   Min.   :      3  
##  1st Qu.:-22.838   1st Qu.:   169.8   1st Qu.:     88   1st Qu.:    119  
##  Median :-18.089   Median :   406.5   Median :    247   Median :    327  
##  Mean   :-16.444   Mean   :   893.8   Mean   :   3094   Mean   :   6567  
##  3rd Qu.: -8.489   3rd Qu.:   628.9   3rd Qu.:    815   3rd Qu.:   1151  
##  Max.   :  4.585   Max.   :874579.0   Max.   :2047668   Max.   :5543127  
##  NA's   :9         NA's   :9          NA's   :3         NA's   :3        
##        AREA                      REGIAO_TUR   CATEGORIA_TUR ESTIMATED_POP     
##          :   3                        :2288    :2288        Min.   :     786  
##  1,227.93:   2   Corredores Das Ã\201guas:  59   A:  51        1st Qu.:    5454  
##  100.07  :   2   Vale Do Contestado   :  45   B: 168        Median :   11590  
##  101.89  :   2   Amazônia Atlântica :  40   C: 521        Mean   :   37432  
##  102.00  :   2   Araguaia-Tocantins   :  39   D:1892        3rd Qu.:   25296  
##  106.85  :   2   Cariri               :  37   E: 653        Max.   :12176866  
##  (Other) :5560   (Other)              :3065                 NA's   :3         
##                    RURAL_URBAN    GVA_AGROPEC       GVA_INDUSTRY     
##                          :   3   Min.   :      0   Min.   :       1  
##  Intermediário Adjacente: 686   1st Qu.:   4189   1st Qu.:    1726  
##  Intermediário Remoto   :  60   Median :  20426   Median :    7424  
##  Rural Adjacente         :3040   Mean   :  47271   Mean   :  175928  
##  Rural Remoto            : 323   3rd Qu.:  51227   3rd Qu.:   41022  
##  Sem classificação     :   5   Max.   :1402282   Max.   :63306755  
##  Urbano                  :1456   NA's   :3         NA's   :3         
##   GVA_SERVICES         GVA_PUBLIC         GVA_TOTAL             TAXES          
##  Min.   :        2   Min.   :       7   Min.   :       17   Min.   :   -14159  
##  1st Qu.:    10112   1st Qu.:   17267   1st Qu.:    42253   1st Qu.:     1305  
##  Median :    31211   Median :   35866   Median :   119492   Median :     5100  
##  Mean   :   489451   Mean   :  123768   Mean   :   832987   Mean   :   118864  
##  3rd Qu.:   115406   3rd Qu.:   89245   3rd Qu.:   313963   3rd Qu.:    22197  
##  Max.   :464656988   Max.   :41902893   Max.   :569910503   Max.   :117125387  
##  NA's   :3           NA's   :3          NA's   :3           NA's   :3          
##       GDP               POP_GDP           GDP_CAPITA    
##  Min.   :       15   Min.   :     815   Min.   :  3191  
##  1st Qu.:    43709   1st Qu.:    5483   1st Qu.:  9058  
##  Median :   125153   Median :   11578   Median : 15870  
##  Mean   :   954584   Mean   :   36998   Mean   : 21126  
##  3rd Qu.:   329539   3rd Qu.:   25085   3rd Qu.: 26155  
##  Max.   :687035890   Max.   :12038175   Max.   :314638  
##  NA's   :3           NA's   :3          NA's   :3       
##                                                                                        GVA_MAIN   
##  Administração, defesa, educação e saúde públicas e seguridade social                :2725  
##  Demais serviços                                                                          :1477  
##  Agricultura, inclusive apoio à agricultura e a pós colheita                             : 735  
##  Indústrias de transformação                                                            : 261  
##  Pecuária, inclusive apoio à pecuária                                                   : 161  
##  Eletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação:  98  
##  (Other)                                                                                   : 116  
##   MUN_EXPENDIT          COMP_TOT            COMP_A            COMP_B       
##  Min.   :1.421e+06   Min.   :     6.0   Min.   :   0.00   Min.   :  0.000  
##  1st Qu.:1.573e+07   1st Qu.:    68.0   1st Qu.:   1.00   1st Qu.:  0.000  
##  Median :2.746e+07   Median :   162.0   Median :   2.00   Median :  0.000  
##  Mean   :1.043e+08   Mean   :   906.8   Mean   :  18.25   Mean   :  1.852  
##  3rd Qu.:5.666e+07   3rd Qu.:   448.0   3rd Qu.:   8.00   3rd Qu.:  2.000  
##  Max.   :4.577e+10   Max.   :530446.0   Max.   :1948.00   Max.   :274.000  
##  NA's   :1492        NA's   :3          NA's   :3         NA's   :3        
##      COMP_C             COMP_D             COMP_E            COMP_F        
##  Min.   :    0.00   Min.   :  0.0000   Min.   :  0.000   Min.   :    0.00  
##  1st Qu.:    3.00   1st Qu.:  0.0000   1st Qu.:  0.000   1st Qu.:    1.00  
##  Median :   11.00   Median :  0.0000   Median :  0.000   Median :    4.00  
##  Mean   :   73.44   Mean   :  0.4262   Mean   :  2.029   Mean   :   43.26  
##  3rd Qu.:   39.00   3rd Qu.:  0.0000   3rd Qu.:  1.000   3rd Qu.:   15.00  
##  Max.   :31566.00   Max.   :332.0000   Max.   :657.000   Max.   :25222.00  
##  NA's   :3          NA's   :3          NA's   :3         NA's   :3         
##      COMP_G             COMP_H          COMP_I             COMP_J        
##  Min.   :     1.0   Min.   :    0   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    32.0   1st Qu.:    1   1st Qu.:    2.00   1st Qu.:    0.00  
##  Median :    74.5   Median :    7   Median :    7.00   Median :    1.00  
##  Mean   :   348.0   Mean   :   41   Mean   :   55.88   Mean   :   24.74  
##  3rd Qu.:   199.0   3rd Qu.:   25   3rd Qu.:   24.00   3rd Qu.:    5.00  
##  Max.   :150633.0   Max.   :19515   Max.   :29290.00   Max.   :38720.00  
##  NA's   :3          NA's   :3       NA's   :3          NA's   :3         
##      COMP_K             COMP_L             COMP_M             COMP_N       
##  Min.   :    0.00   Min.   :    0.00   Min.   :    0.00   Min.   :    0.0  
##  1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    1.00   1st Qu.:    1.0  
##  Median :    0.00   Median :    0.00   Median :    4.00   Median :    4.0  
##  Mean   :   15.55   Mean   :   15.14   Mean   :   51.29   Mean   :   83.7  
##  3rd Qu.:    2.00   3rd Qu.:    3.00   3rd Qu.:   13.00   3rd Qu.:   14.0  
##  Max.   :23738.00   Max.   :14003.00   Max.   :49181.00   Max.   :76757.0  
##  NA's   :3          NA's   :3          NA's   :3          NA's   :3        
##      COMP_O            COMP_P             COMP_Q             COMP_R       
##  Min.   :  0.000   Min.   :    0.00   Min.   :    0.00   Min.   :   0.00  
##  1st Qu.:  2.000   1st Qu.:    2.00   1st Qu.:    1.00   1st Qu.:   0.00  
##  Median :  2.000   Median :    6.00   Median :    3.00   Median :   2.00  
##  Mean   :  3.269   Mean   :   30.96   Mean   :   34.15   Mean   :  12.18  
##  3rd Qu.:  3.000   3rd Qu.:   17.00   3rd Qu.:   12.00   3rd Qu.:   6.00  
##  Max.   :204.000   Max.   :16030.00   Max.   :22248.00   Max.   :6687.00  
##  NA's   :3         NA's   :3          NA's   :3          NA's   :3        
##      COMP_S             COMP_T      COMP_U              HOTELS      
##  Min.   :    0.00   Min.   :0   Min.   :  0.00000   Min.   : 1.000  
##  1st Qu.:    5.00   1st Qu.:0   1st Qu.:  0.00000   1st Qu.: 1.000  
##  Median :   12.00   Median :0   Median :  0.00000   Median : 1.000  
##  Mean   :   51.61   Mean   :0   Mean   :  0.05027   Mean   : 3.131  
##  3rd Qu.:   31.00   3rd Qu.:0   3rd Qu.:  0.00000   3rd Qu.: 3.000  
##  Max.   :24832.00   Max.   :0   Max.   :123.00000   Max.   :97.000  
##  NA's   :3          NA's   :3   NA's   :3           NA's   :4686    
##       BEDS          Pr_Agencies        Pu_Agencies         Pr_Bank      
##  Min.   :    2.0   Min.   :   0.000   Min.   :  0.000   Min.   : 0.000  
##  1st Qu.:   40.0   1st Qu.:   0.000   1st Qu.:  1.000   1st Qu.: 0.000  
##  Median :   82.0   Median :   1.000   Median :  2.000   Median : 1.000  
##  Mean   :  257.5   Mean   :   3.383   Mean   :  2.829   Mean   : 1.312  
##  3rd Qu.:  200.0   3rd Qu.:   2.000   3rd Qu.:  2.000   3rd Qu.: 2.000  
##  Max.   :13247.0   Max.   :1693.000   Max.   :626.000   Max.   :83.000  
##  NA's   :4686      NA's   :2231       NA's   :2231      NA's   :2231    
##     Pu_Bank       Pr_Assets           Pu_Assets              Cars        
##  Min.   :0.00   Min.   :0.000e+00   Min.   :0.000e+00   Min.   :      2  
##  1st Qu.:1.00   1st Qu.:0.000e+00   1st Qu.:4.047e+07   1st Qu.:    602  
##  Median :2.00   Median :3.231e+07   Median :1.339e+08   Median :   1438  
##  Mean   :1.58   Mean   :9.180e+09   Mean   :6.005e+09   Mean   :   9859  
##  3rd Qu.:2.00   3rd Qu.:1.148e+08   3rd Qu.:4.970e+08   3rd Qu.:   4086  
##  Max.   :8.00   Max.   :1.947e+13   Max.   :8.016e+12   Max.   :5740995  
##  NA's   :2231   NA's   :2231        NA's   :2231        NA's   :11       
##   Motorcycles      Wheeled_tractor         UBER           MAC         
##  Min.   :      4   Min.   :   0.000   Min.   :1      Min.   :  1.000  
##  1st Qu.:    591   1st Qu.:   0.000   1st Qu.:1      1st Qu.:  1.000  
##  Median :   1285   Median :   0.000   Median :1      Median :  2.000  
##  Mean   :   4879   Mean   :   5.754   Mean   :1      Mean   :  4.277  
##  3rd Qu.:   3294   3rd Qu.:   1.000   3rd Qu.:1      3rd Qu.:  3.000  
##  Max.   :1134570   Max.   :3236.000   Max.   :1      Max.   :130.000  
##  NA's   :11        NA's   :11         NA's   :5448   NA's   :5407     
##     WAL.MART       POST_OFFICES    
##  Min.   : 1.000   Min.   :  1.000  
##  1st Qu.: 1.000   1st Qu.:  1.000  
##  Median : 1.000   Median :  1.000  
##  Mean   : 2.059   Mean   :  2.081  
##  3rd Qu.: 1.750   3rd Qu.:  2.000  
##  Max.   :26.000   Max.   :225.000  
##  NA's   :5471     NA's   :120

Viewing Extracted Dataset:

summary(brazil_variables)
##              CITY          STATE         CAPITAL          IBGE_RES_POP     
##  Bom Jesus     :   5   MG     : 853   Min.   :0.000000   Min.   :     805  
##  São Domingos :   5   SP     : 645   1st Qu.:0.000000   1st Qu.:    5235  
##  Bonito        :   4   RS     : 498   Median :0.000000   Median :   10934  
##  Planalto      :   4   BA     : 418   Mean   :0.004845   Mean   :   34278  
##  São Francisco:   4   PR     : 399   3rd Qu.:0.000000   3rd Qu.:   23424  
##  Santa Helena  :   4   SC     : 295   Max.   :1.000000   Max.   :11253503  
##  (Other)       :5547   (Other):2465                      NA's   :8         
##  IBGE_DU_URBAN     IBGE_DU_RURAL     IBGE_15.59      IBGE_PLANTED_AREA  
##  Min.   :     60   Min.   :    3   Min.   :     94   Min.   :      0.0  
##  1st Qu.:    874   1st Qu.:  487   1st Qu.:   1734   1st Qu.:    910.2  
##  Median :   1846   Median :  931   Median :   3841   Median :   3471.5  
##  Mean   :   8859   Mean   : 1463   Mean   :  18212   Mean   :  14179.9  
##  3rd Qu.:   4624   3rd Qu.: 1832   3rd Qu.:   9628   3rd Qu.:  11194.2  
##  Max.   :3548433   Max.   :33809   Max.   :7058221   Max.   :1205669.0  
##  NA's   :10        NA's   :81      NA's   :8         NA's   :3          
##  IBGE_CROP_PRODUCTION_.      IDHM             LONG             LAT         
##  Min.   :      0        Min.   :0.4180   Min.   :-72.92   Min.   :-33.688  
##  1st Qu.:   2326        1st Qu.:0.5990   1st Qu.:-50.87   1st Qu.:-22.838  
##  Median :  13846        Median :0.6650   Median :-46.52   Median :-18.089  
##  Mean   :  57384        Mean   :0.6592   Mean   :-46.23   Mean   :-16.444  
##  3rd Qu.:  55619        3rd Qu.:0.7180   3rd Qu.:-41.40   3rd Qu.: -8.489  
##  Max.   :3274885        Max.   :0.8620   Max.   :-32.44   Max.   :  4.585  
##  NA's   :3              NA's   :8        NA's   :9        NA's   :9        
##      PAY_TV         FIXED_PHONES           AREA        GVA_TOTAL        
##  Min.   :      1   Min.   :      3           :   3   Min.   :       17  
##  1st Qu.:     88   1st Qu.:    119   1,227.93:   2   1st Qu.:    42253  
##  Median :    247   Median :    327   100.07  :   2   Median :   119492  
##  Mean   :   3094   Mean   :   6567   101.89  :   2   Mean   :   832987  
##  3rd Qu.:    815   3rd Qu.:   1151   102.00  :   2   3rd Qu.:   313963  
##  Max.   :2047668   Max.   :5543127   106.85  :   2   Max.   :569910503  
##  NA's   :3         NA's   :3         (Other) :5560   NA's   :3          
##      TAXES             GDP_CAPITA      MUN_EXPENDIT          COMP_TOT       
##  Min.   :   -14159   Min.   :  3191   Min.   :1.421e+06   Min.   :     6.0  
##  1st Qu.:     1305   1st Qu.:  9058   1st Qu.:1.573e+07   1st Qu.:    68.0  
##  Median :     5100   Median : 15870   Median :2.746e+07   Median :   162.0  
##  Mean   :   118864   Mean   : 21126   Mean   :1.043e+08   Mean   :   906.8  
##  3rd Qu.:    22197   3rd Qu.: 26155   3rd Qu.:5.666e+07   3rd Qu.:   448.0  
##  Max.   :117125387   Max.   :314638   Max.   :4.577e+10   Max.   :530446.0  
##  NA's   :3           NA's   :3        NA's   :1492        NA's   :3         
##      HOTELS          Pr_Bank          Pu_Bank       Pr_Assets        
##  Min.   : 1.000   Min.   : 0.000   Min.   :0.00   Min.   :0.000e+00  
##  1st Qu.: 1.000   1st Qu.: 0.000   1st Qu.:1.00   1st Qu.:0.000e+00  
##  Median : 1.000   Median : 1.000   Median :2.00   Median :3.231e+07  
##  Mean   : 3.131   Mean   : 1.312   Mean   :1.58   Mean   :9.180e+09  
##  3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.:2.00   3rd Qu.:1.148e+08  
##  Max.   :97.000   Max.   :83.000   Max.   :8.00   Max.   :1.947e+13  
##  NA's   :4686     NA's   :2231     NA's   :2231   NA's   :2231       
##    Pu_Assets              Cars          Motorcycles      Wheeled_tractor   
##  Min.   :0.000e+00   Min.   :      2   Min.   :      4   Min.   :   0.000  
##  1st Qu.:4.047e+07   1st Qu.:    602   1st Qu.:    591   1st Qu.:   0.000  
##  Median :1.339e+08   Median :   1438   Median :   1285   Median :   0.000  
##  Mean   :6.005e+09   Mean   :   9859   Mean   :   4879   Mean   :   5.754  
##  3rd Qu.:4.970e+08   3rd Qu.:   4086   3rd Qu.:   3294   3rd Qu.:   1.000  
##  Max.   :8.016e+12   Max.   :5740995   Max.   :1134570   Max.   :3236.000  
##  NA's   :2231        NA's   :11        NA's   :11        NA's   :11

As seen from the summary, there are many variables with NA values. Some municipalities have NA values for Longitude and Latitude too. All these will need to be rectified before proceeding.

2.2.3 Checking & Rectifying NA/Duplicated Values

2.2.4 Rectifying “AREA” Column

The values in the area column are factors instead of being numeric. For example, 1100.5 is documented as 1,100.5. We will rectify this by first removing the “,” from the strings and replacing it with "" using gsub(). Next, we will use as.numeric() to convert the string to a numeric variable.

brazil_variables$AREA = as.numeric(gsub(",", "", brazil_variables$AREA))

Rechecking Area Values:

We will use head() to view the first 6 rows in the AREA column.

head(brazil_variables$AREA)
## [1]  147.26  881.06 1045.13 1817.07 1610.65  180.08

The values have been rectified.

2.2.4.1 Rectifying NA values for Longtitude/ Latitude

Identifying Position of NA Values:

which(is.na(brazil_variables$LONG))
## [1]  472 2702 3117 3581 3761 3806 3821 4490 4606
which(is.na(brazil_variables$LAT))
## [1]  472 2702 3117 3581 3761 3806 3821 4490 4606

Identifying Name of Municipalities with NA values:

brazil_variables[c(472,2702,3117,3581,3761,3806,3821,4490,4606), 1]
## [1] Balneário Rincão  Lagoa Dos Patos     Mojuí Dos Campos  
## [4] Paraíso Das Ã\201guas Pescaria Brava      Pinhal Da Serra    
## [7] Pinto Bandeira      Santa Terezinha     São Caetano       
## 5299 Levels: Ângulo Óbidos Óleo Érico Cardoso ... Zortéa

Replacing NA Values with Coordinates:

We will replace the NA values with the correct coordinates taken from https://pt.db-city.com/

brazil_variables$LONG[472] <- -49.2361
brazil_variables$LAT[472] <- -28.8344

brazil_variables$LONG[2702] <- -51.4725
brazil_variables$LAT[2702] <- -31.0697

brazil_variables$LONG[3117] <- -54.6431
brazil_variables$LAT[3117] <- -2.68472

brazil_variables$LONG[3581] <- -53.0102
brazil_variables$LAT[3581] <- -19.0257

brazil_variables$LONG[3761] <- -48.8956
brazil_variables$LAT[3761] <- -28.4247

brazil_variables$LONG[3806] <- -51.1733
brazil_variables$LAT[3806] <- -27.8747

brazil_variables$LONG[3821] <- -51.4503
brazil_variables$LAT[3821] <- -29.0978

brazil_variables$LONG[4490] <- -39.5184
brazil_variables$LAT[4490] <- -12.7498

brazil_variables$LONG[4606] <- -36.1459
brazil_variables$LAT[4606] <- -8.33

Rechecking for NA Longitude/Latitude Values:

any(is.na(brazil_variables$LONG))
## [1] FALSE
any(is.na(brazil_variables$LAT))
## [1] FALSE

The coordinates for all municipalities have been set.

2.2.4.2 Rectifying NA Values for GDP/Capita

Since there are NA values present for the GDP per capita for some municipalities, we will drop these observations as it is unfair to assume that their GDP per capita is 0.

brazil_variables <- brazil_variables %>% filter(!is.na(GDP_CAPITA))

2.2.4.3 Rectifying NA values for Other Variables

Replacing NA values with 0

The other variables with NA values can be replaced with 0 to indicate that the data is not present.

brazil_variables[is.na(brazil_variables)] <- 0

Rechecking for NA values

any(is.na(brazil_variables))
## [1] FALSE

All the NA values have been rectified.

2.2.4.4 Duplicated Values Check

any(duplicated(brazil_variables$LAT))
## [1] FALSE

There are no duplicated rows. The dataset has been cleaned and is ready to be used.

2.2.5 Converting Aspatial Data to an ‘sf’ object

Currently, the brazil_variables data frame is aspatial. We will convert it to a sf object for ease of handling later.

brazil_variables.sf <- st_as_sf(brazil_variables, coords = c("LONG", "LAT"), crs = 4674)

3 Exploratory Data Analysis (EDA)

3.1 EDA using Choropleth Mapping

3.1.1 Joining Geospatial Data and Aspatial Data

Before we can prepare the choropleth map, we need to combine both the geospatial ‘sf’ object (i.e. mun_boundary) and aspatial ‘sf’ object (i.e. brazil_variables.sf) into one. This will be performed by using the st_join function of dplyr package since both are ‘sf’ objects.

brazil_sf <- st_join(mun_boundary, brazil_variables.sf)

Checking for Rows with Na Values:

The number of observations has increased hence we suspect the presence of NA values.

any(is.na(brazil_sf))
## [1] TRUE

Due to the joining of data sets, NA values may have arised from mismatched data. We will proceed to remove them.

Removing Rows with NA values:

brazil_sf <- brazil_sf %>% filter(!is.na(GDP_CAPITA))

Rechecking for NA values:

any(is.na(brazil_sf))  
## [1] FALSE

3.1.2 Preparing Choropleth Map to Visualise GDP/Capita in 2016 at the Municipality Level

tm_shape(brazil_sf)+
  tm_fill("GDP_CAPITA") +
  tm_borders(lwd = 0.1,  alpha = 1)

Observations:

  • The west-central region of Brazil (State of Mato Grosso) has relatively higher GDP per capita as compared to other regions. This may be because it is mostly covered with Amazon rainforest, wetlands and savanna plains and the capital is a travel hub.

  • The capital of Brazil, Brasilia, has slightly higher GDP per capita as compared to many other municipalities.

  • The Selviria municipality has very high GDP per capita and the municipalities around it have higher GDP per capita compared to majority of the municipalities in Brazil.

  • There are some small municipalities with high GDP per capita towards Southern Brazil.

3.2 EDA using Statistical Graphing

Distribution of GDP per capita, 2016 in Brazil:

ggplot(data=brazil_variables.sf, aes(x=`GDP_CAPITA`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

The figure above reveals a right skewed distribution. This means that most municipalities have low GDP per capita.

4 Explanatory Model (Multiple Linear Regression Method)

4.1 Visualising Relationships between Variables

Before building a multiple regression model, we need ensure that the independent variables used are not highly correlated to each other. If highly correlated independent variables are used in building a regression model, the quality of the model will be compromised.

We will plot a Correlation matrix to visualise the relationships between the independent variables using the corrplot package.

Selecting Independent Variables:

corr_plot_variables <- brazil_variables[, -c(11,12,18) ]  

Plotting Correlation Matrix:

We will order the variables alphabetically to mine any hidden patterns.

corrplot(cor(corr_plot_variables[, 3:25]), diag = FALSE, order = "alphabet",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

From the correlogram, we can see that the following variables (IBGE_DU_URBAN, CARS, IBGE_15.59, PAY_TV, MUN_EXPENDIT, FIXED_PHONES, GVA_TOTAL, IBGE_RES_POP, TAXES & COMP_TOT) are highly correlated to one another. Therefore, we will only retain MUN_EXPENDIT out of the highly correlated variables.

4.2 Building Multiple Linear Regressions Model

We will use lm() to calibrate the multiple linear regression model.

variables.mlr <- lm(formula = GDP_CAPITA ~ CAPITAL + IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM + AREA + MUN_EXPENDIT + HOTELS + Pr_Bank + Pu_Bank + Pr_Assets + Pu_Assets + Motorcycles + Wheeled_tractor, data = brazil_variables.sf)

summary(variables.mlr)
## 
## Call:
## lm(formula = GDP_CAPITA ~ CAPITAL + IBGE_DU_RURAL + IBGE_PLANTED_AREA + 
##     IBGE_CROP_PRODUCTION_. + IDHM + AREA + MUN_EXPENDIT + HOTELS + 
##     Pr_Bank + Pu_Bank + Pr_Assets + Pu_Assets + Motorcycles + 
##     Wheeled_tractor, data = brazil_variables.sf)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -77386  -6808  -2949   1852 283561 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -5.356e+04  2.574e+03 -20.810  < 2e-16 ***
## CAPITAL                 1.888e+03  4.703e+03   0.401 0.688117    
## IBGE_DU_RURAL          -7.817e-01  1.781e-01  -4.388 1.16e-05 ***
## IBGE_PLANTED_AREA       7.588e-02  1.400e-02   5.418 6.28e-08 ***
## IBGE_CROP_PRODUCTION_.  8.155e-03  4.255e-03   1.916 0.055369 .  
## IDHM                    1.117e+05  3.936e+03  28.368  < 2e-16 ***
## AREA                    5.302e-02  4.292e-02   1.235 0.216784    
## MUN_EXPENDIT            3.624e-06  1.026e-06   3.531 0.000417 ***
## HOTELS                  7.202e+01  1.032e+02   0.698 0.485503    
## Pr_Bank                 3.993e+02  3.133e+02   1.274 0.202554    
## Pu_Bank                 8.999e+02  3.016e+02   2.983 0.002862 ** 
## Pr_Assets              -6.310e-10  1.774e-09  -0.356 0.722029    
## Pu_Assets               5.040e-09  2.346e-09   2.149 0.031708 *  
## Motorcycles            -1.998e-01  3.398e-02  -5.881 4.31e-09 ***
## Wheeled_tractor         2.379e+01  8.018e+00   2.967 0.003024 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17200 on 5555 degrees of freedom
## Multiple R-squared:  0.2861, Adjusted R-squared:  0.2843 
## F-statistic:   159 on 14 and 5555 DF,  p-value: < 2.2e-16

From the report, we can see that CAPITAL, AREA, HOTELS, Pr_Bank and Pr_Assets are not statistically significant independent variables as compared to the rest. Hence, we will revise our model by removing AREA from our multiple linear regression model.

Calibrating Revised Model:

variables.mlr1 <- lm(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM +  MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = brazil_variables.sf)

summary(variables.mlr1)
## 
## Call:
## lm(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + 
##     IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank + 
##     Pu_Assets + Motorcycles + Wheeled_tractor, data = brazil_variables.sf)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -77118  -6860  -2975   1848 283569 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -5.382e+04  2.468e+03 -21.812  < 2e-16 ***
## IBGE_DU_RURAL          -8.005e-01  1.763e-01  -4.541 5.71e-06 ***
## IBGE_PLANTED_AREA       7.495e-02  1.393e-02   5.380 7.76e-08 ***
## IBGE_CROP_PRODUCTION_.  8.814e-03  4.225e-03   2.086 0.037002 *  
## IDHM                    1.122e+05  3.753e+03  29.909  < 2e-16 ***
## MUN_EXPENDIT            3.822e-06  8.094e-07   4.722 2.40e-06 ***
## Pu_Bank                 1.093e+03  2.821e+02   3.872 0.000109 ***
## Pu_Assets               5.431e-09  2.305e-09   2.357 0.018476 *  
## Motorcycles            -1.760e-01  2.856e-02  -6.163 7.66e-10 ***
## Wheeled_tractor         2.284e+01  7.500e+00   3.045 0.002337 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17200 on 5560 degrees of freedom
## Multiple R-squared:  0.2855, Adjusted R-squared:  0.2843 
## F-statistic: 246.8 on 9 and 5560 DF,  p-value: < 2.2e-16

All the independent variables are statistically significant.

4.3 Testing Key Multiple Linear Regression Assumptions

It is important for us to test the key assumptions when building our model before proceeding to finalise it.

4.3.1 Test for Multicollinearity

Multicollinearity is a situation where collinearity exists between three or more variables even if no pair of variables have particularly high correlation. This may lead to redundancy between independent variables.In the presence of multicollinearity, the regression model becomes unstable.

We will use ols_vif_tol() of olsrr package to test if there are sign of multicollinearity.

ols_vif_tol(variables.mlr1)
##                Variables Tolerance      VIF
## 1          IBGE_DU_RURAL 0.5980051 1.672227
## 2      IBGE_PLANTED_AREA 0.1409950 7.092452
## 3 IBGE_CROP_PRODUCTION_. 0.1355743 7.376030
## 4                   IDHM 0.6687730 1.495276
## 5           MUN_EXPENDIT 0.1475242 6.778548
## 6                Pu_Bank 0.5836669 1.713306
## 7              Pu_Assets 0.7368104 1.357201
## 8            Motorcycles 0.1486427 6.727543
## 9        Wheeled_tractor 0.3083798 3.242755

Since the VIF of the independent variables are less than 10, we conclude that there are no signs of strong multicollinearity amongst the independent variables.

4.3.2 Test for Non-Linearity

It is important for us to test the linearity and additivity of the relationship between the dependent and independent variables.

We will use ols_plot_resid_fit() function of olsrr package is used to perform the linearity assumption test.

ols_plot_resid_fit(variables.mlr1)

From the figure, we can see that majority of the data points are scattered near the line passing through 0. This enables us to conclude that the relationships between the dependent and independent variables are generally linear.

4.3.3 Test for Normality

We will also test to detect any possible violation of the normality assumption on the residuals of our model.

We will use the ols_plot_resid_hist() function of olsrr package for the normality assumption test. This was chosen over ols_test_normality() as it does not a allow a sample size greater than 5000.

ols_plot_resid_hist(variables.mlr1)

The figure above explicitly shows us that the residuals of the multiple linear regression model resemble a normal distribution.

4.3.4 Test for Homoscedasticity

The Homoscedasticity assumption assumes that the variance of error terms are similar across the values of the independent variables.

We will plot our residuals against predicted values to ascertain whether the points are equally distributed across all the values of the independent variables.

plot(variables.mlr1, 1)

As seen from the plot, the line in red is generally flat, signalling that the disturbances are homoscedastic in nature. There does not seem to be a pattern.

4.3.5 Test for Spatial Autocorrelation

Since the data we have used is geospatial in nature, it can be subjected to spatial autocorrelation, leading to misspecification errors. The presence of spatial autocorrelation will challenge the independence of variables hence it is important to perform this test.

4.3.5.1 Visualising Distribution of Residuals of Regression Model

We will first take a look at how the residuals are distributed to detect potential signs of clustering.

Extracting Residuals of Model and Merging them with Data:

We will join the mun_2016 dataset with brazil_variables and convert it to an ‘sf’ object to act as the polygon layer.

We will then extract the residuals and join them to the polygon layer as well as the spatial points layer (brazil_variables.sf). The polygons are for mapping purposes while the points will be used for analysis subsequently.

# Extract Residuals
residuals <- as.data.frame(variables.mlr1$residuals)

# Joining Residuals with Data & Renaming Residuals

point_res_sf <- cbind(brazil_sf, variables.mlr1$residuals) %>%
  rename(`residuals` = `variables.mlr1.residuals`)

Choropleth Map to Visualise the Distribution of the Residuals of GDP per capita, 2016 in Brazil:

tm_shape(point_res_sf) +
  tm_fill("residuals") 
## Variable(s) "residuals" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

From the map that have been plotted, the residuals appear to be randomly distributed across Brazil. We will conduct the Global Moran’s I test to further investigate.

4.3.5.2 Performing Global Moran’s I test

Converting ‘sf’ to ‘sp’ Dataframe:

First, we need to convert our ‘sf’ dataframe to a SpatialPointDataFrame as ‘spdep’ package requires us to do so.

point_res_sp <- as_Spatial(point_res_sf)

Computing Distanced Based Neighbours:

Next, we will determine the upper limit of the distance band for our distance based weight matrix.

coords <- coordinates(point_res_sp)
k <- knn2nb(knearneigh(coords))
kdists <- unlist(nbdists(k, coords, longlat=FALSE))
summary(kdists)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01289 0.10906 0.14690 0.19642 0.22010 3.33432

The largest distance is approximately 3.4. Hence, we will use this distance so that every point has >= 1 neighbour.

Computing Distance Based Weights Matrix:

nb <- dnearneigh(coords, 0, 3.4, longlat=FALSE)
nb_lw <- nb2listw(nb, style='B')

Computing Global Moran’s I:

Null Hypothesis: The residuals for the multiple linear regression model are randomly distributed.

Alternative Hypothesis: The residuals for the multiple linear regression model are not randomly distributed.

Confidence Interval: 0.95

lm.morantest(variables.mlr1,nb_lw)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA +
## IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank + Pu_Assets +
## Motorcycles + Wheeled_tractor, data = brazil_variables.sf)
## weights: nb_lw
## 
## Moran I statistic standard deviate = -1.3091, p-value = 0.9048
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##    -1.174246e-03    -1.799330e-04     5.768867e-07

Conclusion:

– The p-value is greater than than 0.05 (alpha value at 95% confidence interval)

– We do not reject the null hypothesis at the 95% confidence interval.

–> The residuals for the multiple linear regression model are randomly distributed.

5 Explanatory Model (Geographically Weighted Regression Method)

We will be modelling the GWR using both fixed and adaptive bandwidth schemes. Amongst the two approaches that can be used (AIC & Cross-Validation Approach), we will use Cross-Validation(CV).

5.1 Fixed Bandwidth GWR Model

5.1.1 Computing Fixed Bandwidth

bw.fixed <- bw.gwr(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM +  MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp, approach="cv", kernel="gaussian", adaptive=FALSE, longlat=FALSE)
## Take a cup of tea and have a break, it will take a few minutes.
##           -----A kind suggestion from GWmodel development group
## Fixed bandwidth: 34.70094 CV score: 2.120404e+12 
## Fixed bandwidth: 21.45065 CV score: 2.107555e+12 
## Fixed bandwidth: 13.26152 CV score: 2.079318e+12 
## Fixed bandwidth: 8.200357 CV score: 2.02493e+12 
## Fixed bandwidth: 5.072388 CV score: 1.949693e+12 
## Fixed bandwidth: 3.139197 CV score: 1.905993e+12 
## Fixed bandwidth: 1.944419 CV score: 2.204673e+12 
## Fixed bandwidth: 3.87761 CV score: 1.907414e+12 
## Fixed bandwidth: 2.682833 CV score: 1.979863e+12 
## Fixed bandwidth: 3.421246 CV score: 1.894095e+12 
## Fixed bandwidth: 3.595562 CV score: 1.897478e+12 
## Fixed bandwidth: 3.313513 CV score: 1.894926e+12 
## Fixed bandwidth: 3.487829 CV score: 1.894912e+12 
## Fixed bandwidth: 3.380096 CV score: 1.894033e+12 
## Fixed bandwidth: 3.354663 CV score: 1.894212e+12 
## Fixed bandwidth: 3.395814 CV score: 1.894009e+12 
## Fixed bandwidth: 3.405528 CV score: 1.894024e+12 
## Fixed bandwidth: 3.38981 CV score: 1.894011e+12 
## Fixed bandwidth: 3.399524 CV score: 1.894012e+12 
## Fixed bandwidth: 3.39352 CV score: 1.894009e+12 
## Fixed bandwidth: 3.392103 CV score: 1.894009e+12 
## Fixed bandwidth: 3.394396 CV score: 1.894009e+12 
## Fixed bandwidth: 3.394938 CV score: 1.894009e+12 
## Fixed bandwidth: 3.394062 CV score: 1.894009e+12 
## Fixed bandwidth: 3.393855 CV score: 1.894009e+12 
## Fixed bandwidth: 3.394189 CV score: 1.894009e+12 
## Fixed bandwidth: 3.393983 CV score: 1.894009e+12 
## Fixed bandwidth: 3.393934 CV score: 1.894009e+12 
## Fixed bandwidth: 3.394013 CV score: 1.894009e+12

The result shows that the recommended bandwidth is 3.402438 kilometres.

5.1.2 Calibrating GWR Model

gwr.fixed <- gwr.basic(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM +  MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp, bw=bw.fixed, kernel = 'gaussian', longlat = FALSE)
gwr.fixed
##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2020-05-31 18:26:36 
##    Call:
##    gwr.basic(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + 
##     IBGE_CROP_PRODUCTION_. + IDHM + MUN_EXPENDIT + Pu_Bank + 
##     Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp, 
##     bw = bw.fixed, kernel = "gaussian", longlat = FALSE)
## 
##    Dependent (y) variable:  GDP_CAPITA
##    Independent variables:  IBGE_DU_RURAL IBGE_PLANTED_AREA IBGE_CROP_PRODUCTION_. IDHM MUN_EXPENDIT Pu_Bank Pu_Assets Motorcycles Wheeled_tractor
##    Number of data points: 5570
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##    Min     1Q Median     3Q    Max 
## -77118  -6860  -2975   1848 283569 
## 
##    Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)            -5.382e+04  2.468e+03 -21.812  < 2e-16 ***
##    IBGE_DU_RURAL          -8.005e-01  1.763e-01  -4.541 5.71e-06 ***
##    IBGE_PLANTED_AREA       7.495e-02  1.393e-02   5.380 7.76e-08 ***
##    IBGE_CROP_PRODUCTION_.  8.814e-03  4.225e-03   2.086 0.037002 *  
##    IDHM                    1.122e+05  3.753e+03  29.909  < 2e-16 ***
##    MUN_EXPENDIT            3.822e-06  8.094e-07   4.722 2.40e-06 ***
##    Pu_Bank                 1.093e+03  2.821e+02   3.872 0.000109 ***
##    Pu_Assets               5.431e-09  2.305e-09   2.357 0.018476 *  
##    Motorcycles            -1.760e-01  2.856e-02  -6.163 7.66e-10 ***
##    Wheeled_tractor         2.284e+01  7.500e+00   3.045 0.002337 ** 
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 17200 on 5560 degrees of freedom
##    Multiple R-squared: 0.2855
##    Adjusted R-squared: 0.2843 
##    F-statistic: 246.8 on 9 and 5560 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 1.64508e+12
##    Sigma(hat): 17188.74
##    AIC:  124464.4
##    AICc:  124464.4
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: gaussian 
##    Fixed bandwidth: 3.394013 
##    Regression points: the same locations as observations are used.
##    Distance metric: Euclidean distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                                  Min.     1st Qu.      Median     3rd Qu.
##    Intercept              -9.1424e+04 -5.4536e+04 -2.6593e+04 -1.5657e+04
##    IBGE_DU_RURAL          -5.8976e+00 -1.1036e+00 -6.1339e-01 -3.6570e-01
##    IBGE_PLANTED_AREA      -5.4649e-02  3.0439e-02  7.4042e-02  1.2122e-01
##    IBGE_CROP_PRODUCTION_. -3.2018e-02 -4.0013e-03  5.7125e-03  1.4905e-02
##    IDHM                   -1.0090e+04  4.3097e+04  7.6575e+04  1.1362e+05
##    MUN_EXPENDIT           -2.1700e-05  4.1576e-06  6.6813e-06  1.7404e-05
##    Pu_Bank                -1.0231e+03  7.9581e+02  1.2374e+03  1.8688e+03
##    Pu_Assets              -4.6589e-07 -8.6199e-08 -3.1347e-09  5.2917e-09
##    Motorcycles            -9.8492e-01 -2.8808e-01 -2.3099e-01 -1.4970e-01
##    Wheeled_tractor        -7.5831e+02 -1.9104e+01  2.6848e+01  2.9990e+01
##                                 Max.
##    Intercept              3.6888e+04
##    IBGE_DU_RURAL          9.7750e-01
##    IBGE_PLANTED_AREA      6.3350e-01
##    IBGE_CROP_PRODUCTION_. 4.9000e-02
##    IDHM                   1.6482e+05
##    MUN_EXPENDIT           0.0000e+00
##    Pu_Bank                2.7312e+03
##    Pu_Assets              0.0000e+00
##    Motorcycles            2.8650e-01
##    Wheeled_tractor        1.3618e+03
##    ************************Diagnostic information*************************
##    Number of data points: 5570 
##    Effective number of parameters (2trace(S) - trace(S'S)): 116.6109 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5453.389 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 124177.5 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 124086.9 
##    Residual sum of squares: 1.519795e+12 
##    R-square value:  0.3398847 
##    Adjusted R-square value:  0.3257668 
## 
##    ***********************************************************************
##    Program stops at: 2020-05-31 18:26:57

5.2 Adaptive Bandwidth GWR Model

5.2.1 Computing Adaptive Bandwidth

(Screenshots of the computation and calibration are used for ease of knitting)

DM <- gw.dist(dp.locat = coordinates(point_res_sp))

bw.adaptive <- bw.gwr(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM +  MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp, approach="CV", kernel="gaussian", adaptive=TRUE, longlat=FALSE, dMat=DM)

The result shows that the recommended data points to be used is 350.

5.2.2 Calibrating GWR Model

gwr.adaptive <- gwr.basic(formula = GDP_CAPITA ~ IBGE_DU_RURAL + IBGE_PLANTED_AREA + IBGE_CROP_PRODUCTION_. + IDHM +  MUN_EXPENDIT + Pu_Bank + Pu_Assets + Motorcycles + Wheeled_tractor, data = point_res_sp, bw=bw.adaptive, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)
gwr.adaptive

Comparing Fixed vs Adaptive Bandwidth:

  • The model using adaptive bandwidth has a higher adjusted R-square value than the model using the fixed bandwidth model.

  • Therefore, we will be using the model with adaptive bandwidth for visualisation purposes.

5.3 Visualising GWR Output (Adaptive Bandwidth)

5.3.1 Converting SDF into ‘sf’ Data Frame

To visualise the fields in SDF, we need to first convert it into a ‘sf’ data frame.

brazil_adaptive_Sf <- st_as_sf(gwr.adaptive$SDF) %>%
  st_transform(crs=4674)

gwr.adaptive.output <- as.data.frame(gwr.adaptive$SDF)

brazil_adaptive_sf <- cbind(point_res_sf, as.matrix(gwr.adaptive.output))

5.3.2 Visualising Local R Square Values

tm_shape(brazil_adaptive_sf) + 
  tm_fill(col = "Local_R2") 

Analysis of Local R Square Values Map:

  • The local R square values ranging between 0-1 indicate how well the geographic weighted regression model fits the observed GDP per capita values. Very high local R square values suggest that the model is performing very well and vice versa.

  • The local R square values are very low towards the South of Brazil and very high towards North Eastern Brazil.

  • This means that our regression model accurately describes the North Eastern regions of Brazil best and is not very indicate of regions in the South.

5.3.3 Plotting Regression Coefficients of Variables

Renaming Coefficients:

brazil_adaptive_sf <- brazil_adaptive_sf %>%
  rename(coeff_IBGE_DU_RURAL = IBGE_DU_RURAL.1) %>%
  rename(coeff_IBGE_PLANTED_AREA = IBGE_PLANTED_AREA.1) %>%
  rename(coeff_IBGE_CROP_PRODUCTION = IBGE_CROP_PRODUCTION_..1) %>%
  rename(coeff_IDHM = IDHM.1) %>%
  rename(coeff_MUN_EXPENDIT = MUN_EXPENDIT.1) %>%
  rename(coeff_Pu_Bank = Pu_Bank.1) %>%
  rename(coeff_Pu_Assets = Pu_Assets.1) %>%
  rename(coeff_Motorcycles = Motorcycles.1) %>%
  rename(coeff_Wheeled_tractor = Wheeled_tractor.1)

Plotting Coefficients of Independent Variables:

Maps with negative coefficients will be plotted with the “RdBu” palette. maps with only positive coefficients will be plotted with the “Reds” palette and maps with only negative coefficients will be plotted with the “Blues” palette.

coeff_rural_map <- tm_shape(brazil_adaptive_sf)+ 
  tm_fill("coeff_IBGE_DU_RURAL" , palette = "-RdBu")

coeff_plantedarea_map <- tm_shape(brazil_adaptive_sf)+ 
  tm_fill("coeff_IBGE_PLANTED_AREA", palette = "-RdBu")

coeff_crops_map <- tm_shape(brazil_adaptive_sf)+ 
  tm_fill("coeff_IBGE_CROP_PRODUCTION", palette = "-RdBu")

coeff_idhm_map <- tm_shape(brazil_adaptive_sf)+ 
  tm_fill("coeff_IDHM", palette = "Reds")

coeff_mun_exp_map <- tm_shape(brazil_adaptive_sf)+ 
  tm_fill("coeff_MUN_EXPENDIT", palette = "Reds")

coeff_pubank_map <- tm_shape(brazil_adaptive_sf)+ 
  tm_fill("coeff_Pu_Bank", palette = "-RdBu")

coeff_puassets_map <- tm_shape(brazil_adaptive_sf)+ 
  tm_fill("coeff_Pu_Assets", palette = "-RdBu")

coeff_motorcycles_map <- tm_shape(brazil_adaptive_sf)+ 
  tm_fill("coeff_Motorcycles", palette = "Blues")

coeff_tractors_map <- tm_shape(brazil_adaptive_sf)+ 
  tm_fill("coeff_Wheeled_tractor", palette = "-RdBu")

tmap_arrange(coeff_rural_map, coeff_plantedarea_map, coeff_crops_map, coeff_idhm_map, coeff_mun_exp_map, coeff_pubank_map, coeff_puassets_map, coeff_motorcycles_map, coeff_tractors_map, ncol=3, nrow = 3)

Analysis of Coefficients:

  • For the first plot of IBGE domestic rural units coefficients, most of the areas in Brazil are blue. This means that there is a negative relationship between GDP per Capita and the number of domestic rural units in a municipality. As the number of domestic rural units increases, the GDP per Capita of the municipality decreases. This is an expected observation as more rural units would mean that the area is less developed and urbanised hence the GDP per capita will be lower.

  • The second plot of IBGE planted area coefficients shows that most of the areas in Brazil are red and the east region has some blue areas. This means that there is a positive relationship between GDP per Capita and the planted area in a municipality. As the area of plantation increases, the GDP per Capita of the municipality increases. This is also an expected observation as a greater planted area would mean that there is higher crop production, leading to an increase in revenue and GDP, causing GDP per capita to increase.

  • The third plot of IBGE crop production coefficients shows that most of the areas in Brazil are slightly red and the south region is blue. This means that there is a general positive relationship between GDP per Capita and crop production in a municipality. As crop production increases, the GDP per Capita of the municipality increases slightly. The area in the South may be more urbanised than other areas hence an increased crop production may cause a fall in GDP per capita.

  • The fourth plot of the IDHM coefficients shows that all of the areas in Brazil are red. This means that there is strong positive relationship between GDP per Capita and IDHM value in each municipality. IDHM refers to the Human Development Index(HDI). As HDI increases, the GDP per Capita of the municipality increases significantly. This is an expected observation as the HDI index reflects the development of each municipality. With a higher HDI index, the municipality is deemed to be more developed which causes the GDP per capita of that municipality to be higher too.

  • The fifth plot of the Mun Expenditure coefficients shows that all of the areas in Brazil are red. This means that there is positive relationship between GDP per Capita and expenditure of municipalities. As the expenditure increases, the GDP per Capita of the municipality. This is true because expenditure is part of the calculation to obtain GDP, hence it will have a positive relationship with GDP per capita as well.

  • The sixth plot of public bank coefficients shows that most of the areas in Brazil are red and the central-south region is blue. This means that there is a general positive relationship between GDP per Capita and the number of public banks in a municipality. As the number of public banks increases, the GDP per Capita of the municipality increases slightly. This may be because a higher number of banks may lead to more investments being made and investments are a part of GDP calculation hence it affects the GDP per capita positively too.

  • The seventh plot of public bank assets shows that most of the areas in Brazil are slightly red and regions towards the east and south of Brazil are blue. This means that there is a general positive relationship between GDP per Capita and the public bank assets in a municipality. As the amount of assets increase, the GDP per Capita of the municipality increases slightly. Bank assets are considered to be the money supply of banks. With a higher money supply, more money will be loaned to investors, causing GDP to increase and hence this may cause the positive relationship with GDP per capita.

  • The eighth plot of the motorcycle coefficients shows that all of the areas in Brazil are blue. This means that there is negative relationship between GDP per Capita and the number of motorcycles in each municipality. As the number of motorcycle increases, the GDP per Capita of the municipality decreases. This may be because motorcycles are a sign of being less developed. More developed municipalities will have more people using cars instead of motorcycles. Hence, if there is a higher number of motorcycles, it may mean that the municipality is less developed, causing the GDP per capita to be lower.

  • The ninth plot of the number of wheeled tractors shows that most of the areas in Brazil are slightly red with blue regions towards the east. This means that there is general positive relationship between GDP per Capita and number of wheeled tractors in each municipality. As the number of wheeled tractors increases, the GDP per Capita of the municipality increases slightly. Tractors are used for farming purposes hence a higher number of tractors may mean that there is a higher amount of agricultural production. This may lead to increased revenue, causing the GDP per capita to increase.

Conclusion:

In this exercise, we analysed the factors affecting the GDP per capita of Brazil at the municipality level by using the data provided. From our analysis, we can tell that there is no clustering of GDP per capita. Amongst our 9 independent analysis variables, the HDI index seems to be the most important factor affecting GDP per capita of Brazil. This was supported by the very large geographic weighted regression coefficients.