Objective of this report

Brazil is one of the largest countries which is growing rapidly. However, the wealth is unequally distributed across Brazil. To put it in a simple perspective, half of the municipalities with GDP per capita less than R$16000 and the top 25% municipalities earn R$26155 and above. This report will determine the factors affecting unequal development across Brazil at a muncipality level through a geographically weighed regression model.

1. Importing libraries

packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse','geobr',"lmtest","ggpubr")
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2. Importing the data

2.1 Aspatial data

There are two datasets which are imported for this report.

say some more bull shit.

#brazil_cities <- read.csv2("data/aspatial/BRAZIL_CITIES.csv")
brazil_cities <- read_delim(file = "data/aspatial/BRAZIL_CITIES.csv", delim = ";")
metadata <- read.csv2("data/aspatial/Data_Dictionary.csv")

2.2 Geospatial Data

In order to acquire the municipality map of Brazil, the read_municipality function is used from the “geobr” package. This package is built by the Brazil government, and provides quick and easy access to official spatial data sets of Brazil. As most of the aspatial data is collected in 2016, we will also be using the municipality map from the year 2016 in order to be consistent with the aspatial data.

#municipality <- read_municipality(year=2016)
#municipality <- brazil_sp

3. Data Wrangling

3.1 Spatial Data

Firstly, we will check if any Coordinate Reference System is ai tssigned to the municipality data.

#st_crs(municipality)

As seen in the output above, the CRS assigned is 4674 From epsg.io, the CRS with EPSG code 29101 is one of the most accurate projection system for Brazil. Hence we will convert our municipality data into EPSG 29101 format.

#municipality <- st_transform(municipality,4674)
municipality <- st_read(dsn = "data/muni_sf", layer = "muni_sf")

## Reading layer `muni_sf' from data source `/Users/Amey/Desktop/Y2S2/IS415/assignment4/data/muni_sf' using driver `ESRI Shapefile'
## Simple feature collection with 5572 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -73.99045 ymin: -33.75118 xmax: -28.83594 ymax: 5.271841
## CRS:            4674

We can examine the spatial data from the code below.

tm_shape(municipality29101)+
  tm_borders()

In the next section, we will perform data wrangling on aspatial data.

3.2 Aspatial data

We will first have a look at the main dataset brazil_cities through the summary function.

summary(brazil_cities)

##      CITY              STATE              CAPITAL          IBGE_RES_POP     
##  Length:5573        Length:5573        Min.   :0.000000   Min.   :     805  
##  Class :character   Class :character   1st Qu.:0.000000   1st Qu.:    5235  
##  Mode  :character   Mode  :character   Median :0.000000   Median :   10934  
##                                        Mean   :0.004845   Mean   :   34278  
##                                        3rd Qu.:0.000000   3rd Qu.:   23424  
##                                        Max.   :1.000000   Max.   :11253503  
##                                                           NA's   :8         
##  IBGE_RES_POP_BRAS  IBGE_RES_POP_ESTR     IBGE_DU        IBGE_DU_URBAN    
##  Min.   :     805   Min.   :     0.0   Min.   :    239   Min.   :     60  
##  1st Qu.:    5230   1st Qu.:     0.0   1st Qu.:   1572   1st Qu.:    874  
##  Median :   10926   Median :     0.0   Median :   3174   Median :   1846  
##  Mean   :   34200   Mean   :    77.5   Mean   :  10303   Mean   :   8859  
##  3rd Qu.:   23390   3rd Qu.:    10.0   3rd Qu.:   6726   3rd Qu.:   4624  
##  Max.   :11133776   Max.   :119727.0   Max.   :3576148   Max.   :3548433  
##  NA's   :8          NA's   :8          NA's   :10        NA's   :10       
##  IBGE_DU_RURAL      IBGE_POP            IBGE_1            IBGE_1-4     
##  Min.   :    3   Min.   :     174   Min.   :     0.0   Min.   :     5  
##  1st Qu.:  487   1st Qu.:    2801   1st Qu.:    38.0   1st Qu.:   158  
##  Median :  931   Median :    6170   Median :    92.0   Median :   376  
##  Mean   : 1463   Mean   :   27595   Mean   :   383.3   Mean   :  1544  
##  3rd Qu.: 1832   3rd Qu.:   15302   3rd Qu.:   232.0   3rd Qu.:   951  
##  Max.   :33809   Max.   :10463636   Max.   :129464.0   Max.   :514794  
##  NA's   :81      NA's   :8          NA's   :8          NA's   :8       
##     IBGE_5-9        IBGE_10-14       IBGE_15-59         IBGE_60+      
##  Min.   :     7   Min.   :    12   Min.   :     94   Min.   :     29  
##  1st Qu.:   220   1st Qu.:   259   1st Qu.:   1734   1st Qu.:    341  
##  Median :   516   Median :   588   Median :   3841   Median :    722  
##  Mean   :  2069   Mean   :  2381   Mean   :  18212   Mean   :   3004  
##  3rd Qu.:  1300   3rd Qu.:  1478   3rd Qu.:   9628   3rd Qu.:   1724  
##  Max.   :684443   Max.   :783702   Max.   :7058221   Max.   :1293012  
##  NA's   :8        NA's   :8        NA's   :8         NA's   :8        
##  IBGE_PLANTED_AREA   IBGE_CROP_PRODUCTION_$ IDHM Ranking 2010      IDHM       
##  Min.   :      0.0   Min.   :      0        Min.   :   1      Min.   :0.4180  
##  1st Qu.:    910.2   1st Qu.:   2326        1st Qu.:1392      1st Qu.:0.5990  
##  Median :   3471.5   Median :  13846        Median :2783      Median :0.6650  
##  Mean   :  14179.9   Mean   :  57384        Mean   :2783      Mean   :0.6592  
##  3rd Qu.:  11194.2   3rd Qu.:  55619        3rd Qu.:4174      3rd Qu.:0.7180  
##  Max.   :1205669.0   Max.   :3274885        Max.   :5565      Max.   :0.8620  
##  NA's   :3           NA's   :3              NA's   :8         NA's   :8       
##    IDHM_Renda     IDHM_Longevidade IDHM_Educacao         LONG       
##  Min.   :0.4000   Min.   :0.6720   Min.   :0.2070   Min.   :-72.92  
##  1st Qu.:0.5720   1st Qu.:0.7690   1st Qu.:0.4900   1st Qu.:-50.87  
##  Median :0.6540   Median :0.8080   Median :0.5600   Median :-46.52  
##  Mean   :0.6429   Mean   :0.8016   Mean   :0.5591   Mean   :-46.23  
##  3rd Qu.:0.7070   3rd Qu.:0.8360   3rd Qu.:0.6310   3rd Qu.:-41.40  
##  Max.   :0.8910   Max.   :0.8940   Max.   :0.8250   Max.   :-32.44  
##  NA's   :8        NA's   :8        NA's   :8        NA's   :9       
##       LAT               ALT               PAY_TV         FIXED_PHONES    
##  Min.   :-33.688   Min.   :     0.0   Min.   :      1   Min.   :      3  
##  1st Qu.:-22.838   1st Qu.:   169.8   1st Qu.:     88   1st Qu.:    119  
##  Median :-18.089   Median :   406.5   Median :    247   Median :    327  
##  Mean   :-16.444   Mean   :   893.8   Mean   :   3094   Mean   :   6567  
##  3rd Qu.: -8.489   3rd Qu.:   628.9   3rd Qu.:    815   3rd Qu.:   1151  
##  Max.   :  4.585   Max.   :874579.0   Max.   :2047668   Max.   :5543127  
##  NA's   :9         NA's   :9          NA's   :3         NA's   :3        
##       AREA            REGIAO_TUR        CATEGORIA_TUR      ESTIMATED_POP     
##  Min.   :     3.57   Length:5573        Length:5573        Min.   :     786  
##  1st Qu.:   204.44   Class :character   Class :character   1st Qu.:    5454  
##  Median :   416.59   Mode  :character   Mode  :character   Median :   11590  
##  Mean   :  1517.44                                         Mean   :   37432  
##  3rd Qu.:  1026.57                                         3rd Qu.:   25296  
##  Max.   :159533.33                                         Max.   :12176866  
##  NA's   :3                                                 NA's   :3         
##  RURAL_URBAN         GVA_AGROPEC       GVA_INDUSTRY       GVA_SERVICES      
##  Length:5573        Min.   :      0   Min.   :       1   Min.   :        2  
##  Class :character   1st Qu.:   4189   1st Qu.:    1726   1st Qu.:    10112  
##  Mode  :character   Median :  20426   Median :    7424   Median :    31211  
##                     Mean   :  47271   Mean   :  175928   Mean   :   489451  
##                     3rd Qu.:  51227   3rd Qu.:   41022   3rd Qu.:   115406  
##                     Max.   :1402282   Max.   :63306755   Max.   :464656988  
##                     NA's   :3         NA's   :3          NA's   :3          
##    GVA_PUBLIC         GVA_TOTAL             TAXES                GDP           
##  Min.   :       7   Min.   :       17   Min.   :   -14159   Min.   :       15  
##  1st Qu.:   17267   1st Qu.:    42253   1st Qu.:     1305   1st Qu.:    43709  
##  Median :   35866   Median :   119492   Median :     5100   Median :   125153  
##  Mean   :  123768   Mean   :   832987   Mean   :   118864   Mean   :   954584  
##  3rd Qu.:   89245   3rd Qu.:   313963   3rd Qu.:    22197   3rd Qu.:   329539  
##  Max.   :41902893   Max.   :569910503   Max.   :117125387   Max.   :687035890  
##  NA's   :3          NA's   :3           NA's   :3           NA's   :3          
##     POP_GDP           GDP_CAPITA       GVA_MAIN          MUN_EXPENDIT      
##  Min.   :     815   Min.   :  3191   Length:5573        Min.   :1.421e+06  
##  1st Qu.:    5483   1st Qu.:  9058   Class :character   1st Qu.:1.573e+07  
##  Median :   11578   Median : 15870   Mode  :character   Median :2.746e+07  
##  Mean   :   36998   Mean   : 21126                      Mean   :1.043e+08  
##  3rd Qu.:   25085   3rd Qu.: 26155                      3rd Qu.:5.666e+07  
##  Max.   :12038175   Max.   :314638                      Max.   :4.577e+10  
##  NA's   :3          NA's   :3                           NA's   :1492       
##     COMP_TOT            COMP_A            COMP_B            COMP_C        
##  Min.   :     6.0   Min.   :   0.00   Min.   :  0.000   Min.   :    0.00  
##  1st Qu.:    68.0   1st Qu.:   1.00   1st Qu.:  0.000   1st Qu.:    3.00  
##  Median :   162.0   Median :   2.00   Median :  0.000   Median :   11.00  
##  Mean   :   906.8   Mean   :  18.25   Mean   :  1.852   Mean   :   73.44  
##  3rd Qu.:   448.0   3rd Qu.:   8.00   3rd Qu.:  2.000   3rd Qu.:   39.00  
##  Max.   :530446.0   Max.   :1948.00   Max.   :274.000   Max.   :31566.00  
##  NA's   :3          NA's   :3         NA's   :3         NA's   :3         
##      COMP_D             COMP_E            COMP_F             COMP_G        
##  Min.   :  0.0000   Min.   :  0.000   Min.   :    0.00   Min.   :     1.0  
##  1st Qu.:  0.0000   1st Qu.:  0.000   1st Qu.:    1.00   1st Qu.:    32.0  
##  Median :  0.0000   Median :  0.000   Median :    4.00   Median :    74.5  
##  Mean   :  0.4262   Mean   :  2.029   Mean   :   43.26   Mean   :   348.0  
##  3rd Qu.:  0.0000   3rd Qu.:  1.000   3rd Qu.:   15.00   3rd Qu.:   199.0  
##  Max.   :332.0000   Max.   :657.000   Max.   :25222.00   Max.   :150633.0  
##  NA's   :3          NA's   :3         NA's   :3          NA's   :3         
##      COMP_H          COMP_I             COMP_J             COMP_K        
##  Min.   :    0   Min.   :    0.00   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    1   1st Qu.:    2.00   1st Qu.:    0.00   1st Qu.:    0.00  
##  Median :    7   Median :    7.00   Median :    1.00   Median :    0.00  
##  Mean   :   41   Mean   :   55.88   Mean   :   24.74   Mean   :   15.55  
##  3rd Qu.:   25   3rd Qu.:   24.00   3rd Qu.:    5.00   3rd Qu.:    2.00  
##  Max.   :19515   Max.   :29290.00   Max.   :38720.00   Max.   :23738.00  
##  NA's   :3       NA's   :3          NA's   :3          NA's   :3         
##      COMP_L             COMP_M             COMP_N            COMP_O       
##  Min.   :    0.00   Min.   :    0.00   Min.   :    0.0   Min.   :  0.000  
##  1st Qu.:    0.00   1st Qu.:    1.00   1st Qu.:    1.0   1st Qu.:  2.000  
##  Median :    0.00   Median :    4.00   Median :    4.0   Median :  2.000  
##  Mean   :   15.14   Mean   :   51.29   Mean   :   83.7   Mean   :  3.269  
##  3rd Qu.:    3.00   3rd Qu.:   13.00   3rd Qu.:   14.0   3rd Qu.:  3.000  
##  Max.   :14003.00   Max.   :49181.00   Max.   :76757.0   Max.   :204.000  
##  NA's   :3          NA's   :3          NA's   :3         NA's   :3        
##      COMP_P             COMP_Q             COMP_R            COMP_S        
##  Min.   :    0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.00  
##  1st Qu.:    2.00   1st Qu.:    1.00   1st Qu.:   0.00   1st Qu.:    5.00  
##  Median :    6.00   Median :    3.00   Median :   2.00   Median :   12.00  
##  Mean   :   30.96   Mean   :   34.15   Mean   :  12.18   Mean   :   51.61  
##  3rd Qu.:   17.00   3rd Qu.:   12.00   3rd Qu.:   6.00   3rd Qu.:   31.00  
##  Max.   :16030.00   Max.   :22248.00   Max.   :6687.00   Max.   :24832.00  
##  NA's   :3          NA's   :3          NA's   :3         NA's   :3         
##      COMP_T      COMP_U              HOTELS            BEDS        
##  Min.   :0   Min.   :  0.00000   Min.   : 1.000   Min.   :    2.0  
##  1st Qu.:0   1st Qu.:  0.00000   1st Qu.: 1.000   1st Qu.:   40.0  
##  Median :0   Median :  0.00000   Median : 1.000   Median :   82.0  
##  Mean   :0   Mean   :  0.05027   Mean   : 3.131   Mean   :  257.5  
##  3rd Qu.:0   3rd Qu.:  0.00000   3rd Qu.: 3.000   3rd Qu.:  200.0  
##  Max.   :0   Max.   :123.00000   Max.   :97.000   Max.   :13247.0  
##  NA's   :3   NA's   :3           NA's   :4686     NA's   :4686     
##   Pr_Agencies        Pu_Agencies         Pr_Bank          Pu_Bank    
##  Min.   :   0.000   Min.   :  0.000   Min.   : 0.000   Min.   :0.00  
##  1st Qu.:   0.000   1st Qu.:  1.000   1st Qu.: 0.000   1st Qu.:1.00  
##  Median :   1.000   Median :  2.000   Median : 1.000   Median :2.00  
##  Mean   :   3.383   Mean   :  2.829   Mean   : 1.312   Mean   :1.58  
##  3rd Qu.:   2.000   3rd Qu.:  2.000   3rd Qu.: 2.000   3rd Qu.:2.00  
##  Max.   :1693.000   Max.   :626.000   Max.   :83.000   Max.   :8.00  
##  NA's   :2231       NA's   :2231      NA's   :2231     NA's   :2231  
##    Pr_Assets           Pu_Assets              Cars          Motorcycles     
##  Min.   :0.000e+00   Min.   :0.000e+00   Min.   :      2   Min.   :      4  
##  1st Qu.:0.000e+00   1st Qu.:4.047e+07   1st Qu.:    602   1st Qu.:    591  
##  Median :3.231e+07   Median :1.339e+08   Median :   1438   Median :   1285  
##  Mean   :9.180e+09   Mean   :6.005e+09   Mean   :   9859   Mean   :   4879  
##  3rd Qu.:1.148e+08   3rd Qu.:4.970e+08   3rd Qu.:   4086   3rd Qu.:   3294  
##  Max.   :1.947e+13   Max.   :8.016e+12   Max.   :5740995   Max.   :1134570  
##  NA's   :2231        NA's   :2231        NA's   :11        NA's   :11       
##  Wheeled_tractor         UBER           MAC             WAL-MART     
##  Min.   :   0.000   Min.   :1      Min.   :  1.000   Min.   : 1.000  
##  1st Qu.:   0.000   1st Qu.:1      1st Qu.:  1.000   1st Qu.: 1.000  
##  Median :   0.000   Median :1      Median :  2.000   Median : 1.000  
##  Mean   :   5.754   Mean   :1      Mean   :  4.277   Mean   : 2.059  
##  3rd Qu.:   1.000   3rd Qu.:1      3rd Qu.:  3.000   3rd Qu.: 1.750  
##  Max.   :3236.000   Max.   :1      Max.   :130.000   Max.   :26.000  
##  NA's   :11         NA's   :5448   NA's   :5407      NA's   :5471    
##   POST_OFFICES    
##  Min.   :  1.000  
##  1st Qu.:  1.000  
##  Median :  1.000  
##  Mean   :  2.081  
##  3rd Qu.:  2.000  
##  Max.   :225.000  
##  NA's   :120

The first column “CITY” refers to the municipality. As seen in the above summary, there are 79 variables in relation to each municipality. There are several pressing issues regarding this dataset which needs to be fixed before analysis is performed. Firstly, it is very evident that all the variable names are not self explainatory from their variable names, and hence we will be renaming them so that they can be identified easily while performing analysis. The description of the data can be found in the second file imported(Data_dictionary.csv) which briefly describes what each variable refers to along with the source and year of data. Secondly, there are several variables which consists of NA values, which need to be dealt with. As we will be creating an explanatory model to explain factors affecting the GDP per capita at the municipality level, we will be removing variables which are not related to GDP per capita. After that, we will further perform data cleaning in order to deal with NA values.

Before we perform spatial analysis, we will convert the dataframe object into a sf object from the

brazil_cities <- brazil_cities %>% drop_na(LONG)

brazil.sf <- st_as_sf(brazil_cities,
                            coords = c("LONG", "LAT"),
                      crs=4326)%>%
  st_transform(4674)
st_crs(brazil.sf)

## Coordinate Reference System:
##   User input: EPSG:4674 
##   wkt:
## GEOGCS["SIRGAS 2000",
##     DATUM["Sistema_de_Referencia_Geocentrico_para_las_AmericaS_2000",
##         SPHEROID["GRS 1980",6378137,298.257222101,
##             AUTHORITY["EPSG","7019"]],
##         TOWGS84[0,0,0,0,0,0,0],
##         AUTHORITY["EPSG","6674"]],
##     PRIMEM["Greenwich",0,
##         AUTHORITY["EPSG","8901"]],
##     UNIT["degree",0.0174532925199433,
##         AUTHORITY["EPSG","9122"]],
##     AUTHORITY["EPSG","4674"]]

head(brazil.sf)

## Simple feature collection with 6 features and 79 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -49.44055 ymin: -19.15585 xmax: -39.04755 ymax: -1.72347
## CRS:            EPSG:4674
## # A tibble: 6 x 80
##   CITY  STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BR… IBGE_RES_POP_ES… IBGE_DU
##   <chr> <chr>   <dbl>        <dbl>            <dbl>            <dbl>   <dbl>
## 1 Abad… GO          0         6876             6876                0    2137
## 2 Abad… MG          0         6704             6704                0    2328
## 3 Abad… GO          0        15757            15609              148    4655
## 4 Abae… MG          0        22690            22690                0    7694
## 5 Abae… PA          0       141100           141040               60   31061
## 6 Abai… CE          0        10496            10496                0    2791
## # … with 73 more variables: IBGE_DU_URBAN <dbl>, IBGE_DU_RURAL <dbl>,
## #   IBGE_POP <dbl>, IBGE_1 <dbl>, `IBGE_1-4` <dbl>, `IBGE_5-9` <dbl>,
## #   `IBGE_10-14` <dbl>, `IBGE_15-59` <dbl>, `IBGE_60+` <dbl>,
## #   IBGE_PLANTED_AREA <dbl>, `IBGE_CROP_PRODUCTION_$` <dbl>, `IDHM Ranking
## #   2010` <dbl>, IDHM <dbl>, IDHM_Renda <dbl>, IDHM_Longevidade <dbl>,
## #   IDHM_Educacao <dbl>, ALT <dbl>, PAY_TV <dbl>, FIXED_PHONES <dbl>,
## #   AREA <dbl>, REGIAO_TUR <chr>, CATEGORIA_TUR <chr>, ESTIMATED_POP <dbl>,
## #   RURAL_URBAN <chr>, GVA_AGROPEC <dbl>, GVA_INDUSTRY <dbl>,
## #   GVA_SERVICES <dbl>, GVA_PUBLIC <dbl>, ` GVA_TOTAL ` <dbl>, TAXES <dbl>,
## #   GDP <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>, GVA_MAIN <chr>,
## #   MUN_EXPENDIT <dbl>, COMP_TOT <dbl>, COMP_A <dbl>, COMP_B <dbl>,
## #   COMP_C <dbl>, COMP_D <dbl>, COMP_E <dbl>, COMP_F <dbl>, COMP_G <dbl>,
## #   COMP_H <dbl>, COMP_I <dbl>, COMP_J <dbl>, COMP_K <dbl>, COMP_L <dbl>,
## #   COMP_M <dbl>, COMP_N <dbl>, COMP_O <dbl>, COMP_P <dbl>, COMP_Q <dbl>,
## #   COMP_R <dbl>, COMP_S <dbl>, COMP_T <dbl>, COMP_U <dbl>, HOTELS <dbl>,
## #   BEDS <dbl>, Pr_Agencies <dbl>, Pu_Agencies <dbl>, Pr_Bank <dbl>,
## #   Pu_Bank <dbl>, Pr_Assets <dbl>, Pu_Assets <dbl>, Cars <dbl>,
## #   Motorcycles <dbl>, Wheeled_tractor <dbl>, UBER <dbl>, MAC <dbl>,
## #   `WAL-MART` <dbl>, POST_OFFICES <dbl>, geometry <POINT [°]>

plot(brazil.sf$geometry)

data <- st_join(municipality,brazil.sf)

Choropleth map of GDP Per Capita in Brazil

tmap_mode("plot")
tm_shape(data)+
  tm_polygons("GDP_CAPITA",border.alpha = 0.7,legend.hist=TRUE)+
  tm_layout(legend.outside = TRUE)

## 3.3 Data wrangling

Firstly, we will remove the columns with a lot of NA values as these data will not be a good fit for our model. We will re-examine the data from the summary function through which we understand the range of the data as well as the number of NA values in each column.

summary(data)

##     code_mn                name_mn        cod_stt        abbrv_s    
##  Min.   :1100015   Bom Jesus   :   5   31     : 853   MG     : 853  
##  1st Qu.:2512175   São Domingos:   5   35     : 645   SP     : 645  
##  Median :3146354   Bonito      :   4   43     : 499   RS     : 499  
##  Mean   :3253966   Planalto    :   4   29     : 417   BA     : 417  
##  3rd Qu.:4119264   Santa Helena:   4   41     : 399   PR     : 399  
##  Max.   :5300108   Santa Inês  :   4   42     : 295   SC     : 295  
##                    (Other)     :5546   (Other):2464   (Other):2464  
##      CITY              STATE              CAPITAL          IBGE_RES_POP     
##  Length:5572        Length:5572        Min.   :0.000000   Min.   :     805  
##  Class :character   Class :character   1st Qu.:0.000000   1st Qu.:    5236  
##  Mode  :character   Mode  :character   Median :0.000000   Median :   10936  
##                                        Mean   :0.004853   Mean   :   34288  
##                                        3rd Qu.:0.000000   3rd Qu.:   23468  
##                                        Max.   :1.000000   Max.   :11253503  
##                                        NA's   :8          NA's   :9         
##  IBGE_RES_POP_BRAS  IBGE_RES_POP_ESTR      IBGE_DU        IBGE_DU_URBAN    
##  Min.   :     805   Min.   :     0.00   Min.   :    239   Min.   :     60  
##  1st Qu.:    5230   1st Qu.:     0.00   1st Qu.:   1573   1st Qu.:    874  
##  Median :   10934   Median :     0.00   Median :   3176   Median :   1850  
##  Mean   :   34210   Mean   :    77.53   Mean   :  10306   Mean   :   8862  
##  3rd Qu.:   23394   3rd Qu.:    10.00   3rd Qu.:   6727   3rd Qu.:   4627  
##  Max.   :11133776   Max.   :119727.00   Max.   :3576148   Max.   :3548433  
##  NA's   :9          NA's   :9           NA's   :11        NA's   :11       
##  IBGE_DU_RURAL      IBGE_POP            IBGE_1            IBGE_1-4       
##  Min.   :    3   Min.   :     174   Min.   :     0.0   Min.   :     5.0  
##  1st Qu.:  487   1st Qu.:    2802   1st Qu.:    38.0   1st Qu.:   158.0  
##  Median :  931   Median :    6177   Median :    92.0   Median :   377.0  
##  Mean   : 1463   Mean   :   27604   Mean   :   383.4   Mean   :  1545.1  
##  3rd Qu.: 1832   3rd Qu.:   15304   3rd Qu.:   232.0   3rd Qu.:   951.5  
##  Max.   :33809   Max.   :10463636   Max.   :129464.0   Max.   :514794.0  
##  NA's   :82      NA's   :9          NA's   :9          NA's   :9         
##     IBGE_5-9          IBGE_10-14       IBGE_15-59         IBGE_60+      
##  Min.   :     7.0   Min.   :    12   Min.   :     94   Min.   :     29  
##  1st Qu.:   220.5   1st Qu.:   260   1st Qu.:   1735   1st Qu.:    341  
##  Median :   516.0   Median :   589   Median :   3842   Median :    723  
##  Mean   :  2070.0   Mean   :  2382   Mean   :  18218   Mean   :   3005  
##  3rd Qu.:  1300.5   3rd Qu.:  1478   3rd Qu.:   9630   3rd Qu.:   1724  
##  Max.   :684443.0   Max.   :783702   Max.   :7058221   Max.   :1293012  
##  NA's   :9          NA's   :9        NA's   :9         NA's   :9        
##  IBGE_PLANTED_AREA   IBGE_CROP_PRODUCTION_$ IDHM Ranking 2010      IDHM       
##  Min.   :      0.0   Min.   :      0        Min.   :   1      Min.   :0.4180  
##  1st Qu.:    911.5   1st Qu.:   2330        1st Qu.:1392      1st Qu.:0.5990  
##  Median :   3473.0   Median :  13845        Median :2782      Median :0.6650  
##  Mean   :  14172.6   Mean   :  57362        Mean   :2783      Mean   :0.6592  
##  3rd Qu.:  11172.5   3rd Qu.:  55579        3rd Qu.:4174      3rd Qu.:0.7180  
##  Max.   :1205669.0   Max.   :3274885        Max.   :5565      Max.   :0.8620  
##  NA's   :9           NA's   :9              NA's   :8         NA's   :8       
##    IDHM_Renda     IDHM_Longevidade IDHM_Educacao         ALT          
##  Min.   :0.4000   Min.   :0.6720   Min.   :0.2070   Min.   :     0.0  
##  1st Qu.:0.5720   1st Qu.:0.7690   1st Qu.:0.4900   1st Qu.:   169.8  
##  Median :0.6540   Median :0.8080   Median :0.5600   Median :   406.5  
##  Mean   :0.6429   Mean   :0.8016   Mean   :0.5591   Mean   :   893.8  
##  3rd Qu.:0.7070   3rd Qu.:0.8360   3rd Qu.:0.6310   3rd Qu.:   628.9  
##  Max.   :0.8910   Max.   :0.8940   Max.   :0.8250   Max.   :874579.0  
##  NA's   :8        NA's   :8        NA's   :8        NA's   :8         
##      PAY_TV         FIXED_PHONES          AREA            REGIAO_TUR       
##  Min.   :      1   Min.   :      3   Min.   :     3.57   Length:5572       
##  1st Qu.:     88   1st Qu.:    119   1st Qu.:   204.49   Class :character  
##  Median :    247   Median :    328   Median :   415.87   Mode  :character  
##  Mean   :   3097   Mean   :   6574   Mean   :  1515.73                     
##  3rd Qu.:    816   3rd Qu.:   1151   3rd Qu.:  1026.16                     
##  Max.   :2047668   Max.   :5543127   Max.   :159533.33                     
##  NA's   :8         NA's   :8         NA's   :10                            
##  CATEGORIA_TUR      ESTIMATED_POP      RURAL_URBAN         GVA_AGROPEC     
##  Length:5572        Min.   :     786   Length:5572        Min.   :      0  
##  Class :character   1st Qu.:    5454   Class :character   1st Qu.:   4190  
##  Mode  :character   Median :   11591   Mode  :character   Median :  20430  
##                     Mean   :   37463                      Mean   :  47268  
##                     3rd Qu.:   25306                      3rd Qu.:  51216  
##                     Max.   :12176866                      Max.   :1402282  
##                     NA's   :8                             NA's   :9        
##   GVA_INDUSTRY       GVA_SERVICES         GVA_PUBLIC         GVA_TOTAL        
##  Min.   :       1   Min.   :        2   Min.   :       7   Min.   :       17  
##  1st Qu.:    1725   1st Qu.:    10117   1st Qu.:   17256   1st Qu.:    42345  
##  Median :    7425   Median :    31216   Median :   35865   Median :   119504  
##  Mean   :  176063   Mean   :   490028   Mean   :  123879   Mean   :   833879  
##  3rd Qu.:   40995   3rd Qu.:   115583   3rd Qu.:   89339   3rd Qu.:   314089  
##  Max.   :63306755   Max.   :464656988   Max.   :41902893   Max.   :569910503  
##  NA's   :9          NA's   :9           NA's   :9          NA's   :9          
##      TAXES                GDP               POP_GDP           GDP_CAPITA    
##  Min.   :   -14159   Min.   :       15   Min.   :     815   Min.   :  3191  
##  1st Qu.:     1302   1st Qu.:    43676   1st Qu.:    5488   1st Qu.:  9062  
##  Median :     5107   Median :   125111   Median :   11584   Median : 15866  
##  Mean   :   119000   Mean   :   955528   Mean   :   37035   Mean   : 21093  
##  3rd Qu.:    22208   3rd Qu.:   329361   3rd Qu.:   25108   3rd Qu.: 26155  
##  Max.   :117125387   Max.   :687035890   Max.   :12038175   Max.   :314638  
##  NA's   :9           NA's   :9           NA's   :9          NA's   :9       
##    GVA_MAIN          MUN_EXPENDIT          COMP_TOT            COMP_A       
##  Length:5572        Min.   :1.421e+06   Min.   :     6.0   Min.   :   0.00  
##  Class :character   1st Qu.:1.574e+07   1st Qu.:    68.0   1st Qu.:   1.00  
##  Mode  :character   Median :2.746e+07   Median :   162.0   Median :   2.00  
##                     Mean   :1.044e+08   Mean   :   907.8   Mean   :  18.27  
##                     3rd Qu.:5.679e+07   3rd Qu.:   449.5   3rd Qu.:   8.00  
##                     Max.   :4.577e+10   Max.   :530446.0   Max.   :1948.00  
##                     NA's   :1497        NA's   :9          NA's   :9        
##      COMP_B            COMP_C             COMP_D             COMP_E       
##  Min.   :  0.000   Min.   :    0.00   Min.   :  0.0000   Min.   :  0.000  
##  1st Qu.:  0.000   1st Qu.:    3.00   1st Qu.:  0.0000   1st Qu.:  0.000  
##  Median :  0.000   Median :   11.00   Median :  0.0000   Median :  0.000  
##  Mean   :  1.853   Mean   :   73.53   Mean   :  0.4264   Mean   :  2.031  
##  3rd Qu.:  2.000   3rd Qu.:   39.00   3rd Qu.:  0.0000   3rd Qu.:  1.000  
##  Max.   :274.000   Max.   :31566.00   Max.   :332.0000   Max.   :657.000  
##  NA's   :9         NA's   :9          NA's   :9          NA's   :9        
##      COMP_F            COMP_G             COMP_H             COMP_I        
##  Min.   :    0.0   Min.   :     1.0   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    1.0   1st Qu.:    32.0   1st Qu.:    1.00   1st Qu.:    2.00  
##  Median :    4.0   Median :    75.0   Median :    7.00   Median :    7.00  
##  Mean   :   43.3   Mean   :   348.4   Mean   :   41.04   Mean   :   55.94  
##  3rd Qu.:   15.0   3rd Qu.:   200.0   3rd Qu.:   25.00   3rd Qu.:   24.00  
##  Max.   :25222.0   Max.   :150633.0   Max.   :19515.00   Max.   :29290.00  
##  NA's   :9         NA's   :9          NA's   :9          NA's   :9         
##      COMP_J             COMP_K             COMP_L             COMP_M        
##  Min.   :    0.00   Min.   :    0.00   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    1.00  
##  Median :    1.00   Median :    0.00   Median :    0.00   Median :    4.00  
##  Mean   :   24.77   Mean   :   15.57   Mean   :   15.16   Mean   :   51.35  
##  3rd Qu.:    5.00   3rd Qu.:    2.00   3rd Qu.:    3.00   3rd Qu.:   13.00  
##  Max.   :38720.00   Max.   :23738.00   Max.   :14003.00   Max.   :49181.00  
##  NA's   :9          NA's   :9          NA's   :9          NA's   :9         
##      COMP_N            COMP_O            COMP_P             COMP_Q        
##  Min.   :    0.0   Min.   :  1.000   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    1.0   1st Qu.:  2.000   1st Qu.:    2.00   1st Qu.:    1.00  
##  Median :    4.0   Median :  2.000   Median :    6.00   Median :    3.00  
##  Mean   :   83.8   Mean   :  3.271   Mean   :   30.99   Mean   :   34.19  
##  3rd Qu.:   14.0   3rd Qu.:  3.000   3rd Qu.:   17.00   3rd Qu.:   12.00  
##  Max.   :76757.0   Max.   :204.000   Max.   :16030.00   Max.   :22248.00  
##  NA's   :9         NA's   :9         NA's   :9          NA's   :9         
##      COMP_R            COMP_S             COMP_T      COMP_U         
##  Min.   :   0.00   Min.   :    0.00   Min.   :0   Min.   :  0.00000  
##  1st Qu.:   0.00   1st Qu.:    5.00   1st Qu.:0   1st Qu.:  0.00000  
##  Median :   2.00   Median :   12.00   Median :0   Median :  0.00000  
##  Mean   :  12.19   Mean   :   51.66   Mean   :0   Mean   :  0.05033  
##  3rd Qu.:   6.00   3rd Qu.:   31.00   3rd Qu.:0   3rd Qu.:  0.00000  
##  Max.   :6687.00   Max.   :24832.00   Max.   :0   Max.   :123.00000  
##  NA's   :9         NA's   :9          NA's   :9   NA's   :9          
##      HOTELS            BEDS          Pr_Agencies        Pu_Agencies    
##  Min.   : 1.000   Min.   :    2.0   Min.   :   0.000   Min.   :  0.00  
##  1st Qu.: 1.000   1st Qu.:   40.0   1st Qu.:   0.000   1st Qu.:  1.00  
##  Median : 1.000   Median :   82.0   Median :   1.000   Median :  2.00  
##  Mean   : 3.131   Mean   :  257.5   Mean   :   3.384   Mean   :  2.83  
##  3rd Qu.: 3.000   3rd Qu.:  200.0   3rd Qu.:   2.000   3rd Qu.:  2.00  
##  Max.   :97.000   Max.   :13247.0   Max.   :1693.000   Max.   :626.00  
##  NA's   :4685     NA's   :4685      NA's   :2231       NA's   :2231    
##     Pr_Bank          Pu_Bank       Pr_Assets           Pu_Assets        
##  Min.   : 0.000   Min.   :0.00   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.: 0.000   1st Qu.:1.00   1st Qu.:0.000e+00   1st Qu.:4.048e+07  
##  Median : 1.000   Median :2.00   Median :3.234e+07   Median :1.339e+08  
##  Mean   : 1.312   Mean   :1.58   Mean   :9.183e+09   Mean   :6.007e+09  
##  3rd Qu.: 2.000   3rd Qu.:2.00   3rd Qu.:1.149e+08   3rd Qu.:4.976e+08  
##  Max.   :83.000   Max.   :8.00   Max.   :1.947e+13   Max.   :8.016e+12  
##  NA's   :2231     NA's   :2231   NA's   :2231        NA's   :2231       
##       Cars          Motorcycles      Wheeled_tractor         UBER     
##  Min.   :      2   Min.   :      4   Min.   :   0.000   Min.   :1     
##  1st Qu.:    602   1st Qu.:    591   1st Qu.:   0.000   1st Qu.:1     
##  Median :   1440   Median :   1286   Median :   0.000   Median :1     
##  Mean   :   9869   Mean   :   4883   Mean   :   5.759   Mean   :1     
##  3rd Qu.:   4091   3rd Qu.:   3299   3rd Qu.:   1.000   3rd Qu.:1     
##  Max.   :5740995   Max.   :1134570   Max.   :3236.000   Max.   :1     
##  NA's   :16        NA's   :16        NA's   :16         NA's   :5447  
##       MAC             WAL-MART       POST_OFFICES              geometry   
##  Min.   :  1.000   Min.   : 1.000   Min.   :  1.000   MULTIPOLYGON :5572  
##  1st Qu.:  1.000   1st Qu.: 1.000   1st Qu.:  1.000   epsg:4674    :   0  
##  Median :  2.000   Median : 1.000   Median :  1.000   +proj=long...:   0  
##  Mean   :  4.277   Mean   : 2.059   Mean   :  2.081                       
##  3rd Qu.:  3.000   3rd Qu.: 1.750   3rd Qu.:  2.000                       
##  Max.   :130.000   Max.   :26.000   Max.   :225.000                       
##  NA's   :5406      NA's   :5470     NA's   :126

As seen above, there are various variables which have majority of its values as NA. We will be removing these variables from our dataset from the code below.

data$HOTELS<-NULL
data$`WAL-MART`<-NULL
data$MAC<-NULL
data$UBER<-NULL
data$Pu_Assets<-NULL
data$Pr_Assets<-NULL
data$Pu_Bank<-NULL
data$Pr_Bank<-NULL
data$Pu_Agencies<-NULL
data$BEDS <- NULL
data$Pr_Agencies <- NULL

Our dependent variable for the analysis will be GDP per capita. Hence, we should keep variables which are potential factors of the GDP per capita in order to treat them as independent variables. Hence, we will now be eliminating variables which are not related to GDP per capita. Furthermore, there are various columns which are derived from other columns. For example, GDP per capita is found by diving GDP by the population. This makes GDP and population data directly correlated with our dependent variable. If one column is directly related to other columns, there will be a problem of multicollinearity. To avoid this problem, we will be eliminating such data.

IBGE_RES_POP (Total Resident Population) = IBGE_RES_POP_BRAS (Resident population who are Brazilian) + IBGE_RES_POP_ESTR (Resident population who are Foreigners). All of the municipalities have a huge majority of Brazilians. Hence, we are interested in finding out whether the proportion of foreigners affect the GDP_CAPITA in Brazil. Therefore, we will be eliminating IBGE_POP (Total Resident Population) and IBGE_RES_POP_BRAS (Resident population who are Brazilian), and we will be converting IBGE_RES_POP_ESTR (Resident population who are Foreigners) into a perecentage.

data <- data %>%
  mutate(FOREIGNERS_PERCENTAGE=(IBGE_RES_POP_ESTR/IBGE_RES_POP)*100)

data$IBGE_RES_POP_ESTR <- NULL
data$IBGE_RES_POP <- NULL
data$IBGE_RES_POP_BRAS <- NULL

IBGE_DU (Total Domestic Units) = IBGE_DU_URBAN (Urban Domestic Units) + IBGE_DU_RURAL(Rural Domestic Units). As this is a count data, which varies by municipality population, area and other factors, we will change this data into percentage data. Between urban units and rural units, we are interested in whether having more urban domestic units impact the GDP_CAPITA. Hence, we will create a new variable DU_URBAN_PERCENTAGE which will correspond to the percentage of urban household units calculated by (IBGE_DU_URBAN/IBGE_DU)*100.

data <- data %>%
  mutate(DU_URBAN_PERCENTAGE = (IBGE_DU_URBAN/IBGE_DU)*100)


data$IBGE_DU_RURAL <- NULL
data$IBGE_DU_URBAN <- NULL
data$IBGE_DU <- NULL
data$DU_RURAL_PERCENTAGE<-NULL

IBGE_POP (Resident Population Regular Urban Planning) = IBGE_1 + IBGE_1-4 + IBGE_5-9 + IBGE_10-14 + IBGE_15-59 + IBGE_60+. The resident population has been divided into multiple age groups. It is very obvious that all the age groups do not contribute to the GDP/Capita. Only the working population contributes to the GDP per capita. Hence, we will be only considering keeping the IBGE_15-59 which consists of number of people aged between 15 and 59. One limitation is that the active population is usually above the age of 22 in Brazil, however given the dataset, we will use IBGE_15-59 due to the lack of option. As the GDP per capita is sensitive to the number of people, we will use percentage of active population instead of using the direct count. Do note that this population is only consists of the urban population.

data <- data %>%
  mutate(ACTIVE_PERCENTAGE=(`IBGE_15-59`/IBGE_POP)*100)
data$IBGE_1 <- NULL
data$`IBGE_1-4`<-NULL
data$`IBGE_5-9` <- NULL
data$`IBGE_10-14` <- NULL
data$`IBGE_60+` <- NULL

The next two variables (IBGE_PLANTED_AREA, CROP_PRODUCTION) are regarding the agricultural output in the municipality. We will be eliminating CROP_PRODUCTION, which refers to the monetary agricultural output as we have variables (descrbed later) which consist of monetary output from agriculture related companies. We will be transforming the IBGE_PLANTED_AREA such that it potrays the amount of land used for agricultural purposes. This will be done by PLANTED_AREA_PERCENTAGE = (IBGE_PLANTED_AREA/AREA)*100. The variable AREA refers to the total area of the municipality.

data$IBGE_PLANTED_AREA <- data$IBGE_PLANTED_AREA * 10000
data$AREA <- data$AREA * 1000000
data <- data %>%
  mutate(PLANTED_AREA_PERCENTAGE=(IBGE_PLANTED_AREA/AREA)*100)

data$IBGE_PLANTED_AREA <- NULL

There are 5 variables linked to the human development index. They are IDHM Ranking (HDI Ranking), IDHM (HDI Index), IDHM_Renda (GNI index), IDHM_Longevidade (Life Expectancy index), and IDHM_Educacao (Education index).The variable IDHM is a mixure of IDHM_Renda, IDHM_Longevidade, and IDHM_Educacao. Therefore, we will eliminate the three subfactors of IDHM as correlation exists between the three variables. We will be eliminating IDHM Ranking as well as it will be highly correlated to IDHM which will create a problem of multicorrelation.

data$`IDHM Ranking 2010`<-NULL
data$IDHM_Longevidade <- NULL
data$IDHM_Educacao <- NULL
data$IDHM_Renda <- NULL

The variables Pay TV and Fixed phone refer to pay tv users and fixed phone users respectively. These are not relevant while factoring in GDP per capita, hence we will be removing it. REGIAO_TUR and CATEGORIA_TUR are two categorical variables which we will be eliminating as it does not contribute to our regression model.

data$PAY_TV <- NULL
data$FIXED_PHONES <- NULL

data$REGIAO_TUR <- NULL
data$CATEGORIA_TUR <- NULL

The variable RURAL_URBAN contains of various categories which correspond to the extent of urban/rural the municipalities are. The categories can be found from the code below.

unique(data$RURAL_URBAN)

## [1] "Intermediário Adjacente" "Urbano"                 
## [3] "Rural Adjacente"         "Rural Remoto"           
## [5] "Intermediário Remoto"    NA

As seen above, there are 5 categories. We will be recoding the data into four columns which determines if the municipality belongs to that category. Note that 0 refers to FALSE and 1 refers to TRUE. There are only 4 columns as the 5th category can be recognised if all the other categories contain 0. This will reduce the chance of multicollinearity.

data$Intermediário_Adjacente<-ifelse(data$RURAL_URBAN == "Intermediário Adjacente", 
c(1), c(0)) 

data$Intermediário_Remoto<-ifelse(data$RURAL_URBAN == "Intermediário Remoto", 
c(1), c(0)) 

data$Rural_Adjacente<-ifelse(data$RURAL_URBAN == "Rural Adjacente", 
c(1), c(0)) 

data$Rural_Remoto<-ifelse(data$RURAL_URBAN == "Rural Remoto", 
c(1), c(0)) 

data$RURAL_URBAN <- NULL

The next set of variables contain information regarding the GVA (Gross Value Added) which is distributed industrywise. We will transform the data into percentage form which will indicate how much percentage each industry contributes.

data <- data%>%
  mutate(GVA_AGROPEC_Percentage=(GVA_AGROPEC/` GVA_TOTAL `)*100) %>%
  mutate(GVA_INDUSTRY_Percentage=(GVA_INDUSTRY/` GVA_TOTAL `)*100) %>%
  mutate(GVA_SERVICES_Percentage=(GVA_SERVICES/` GVA_TOTAL `)*100) %>%
  mutate(GVA_PUBLIC_Percentage=(GVA_PUBLIC/` GVA_TOTAL `)*100)

data$GVA_AGROPEC <- NULL
data$GVA_INDUSTRY <- NULL
data$GVA_SERVICES <- NULL
data$GVA_PUBLIC <- NULL
data$` GVA_TOTAL ` <- NULL

The variable TAXES is dated from 2016. The tax can be either negative or positive as seen in the data. Negative taxes indicate subsidies for unemployed individuals. We will transform this variable to Tax per capita. As the tax data is from 2016, we will divide it by the 2016 population data. Furthermore, the variable MUN_EXPENDIT consists of the municipal expenditures. This variable has a lot of NA values and is also not significantly related to GDP_CAPITA, hence we will eliminate it as well.

data <- data %>%
  mutate(TAX_CAPITA=TAXES/POP_GDP)

data$TAXES <- NULL
data$MUN_EXPENDIT <- NULL

The next set of data consists of various company types. Like GVA, we will classify the companies into four sub categories – Agriculture, Industry, Services, Public Services. The percentage of companies for each category will then be calculated.

data <- data %>%
  mutate(COMP_AGRICULTURE=(COMP_A/COMP_TOT)*100)%>%
  mutate(COMP_INDUSTRY=((COMP_A+COMP_B+COMP_F+COMP_G)/COMP_TOT)*100)%>%
  mutate(COMP_SERVICES=((COMP_I+COMP_J+COMP_K+COMP_L+COMP_M+COMP_N+COMP_T+COMP_U)/COMP_TOT)*100)%>%
  mutate(COMP_PUBLIC=((COMP_D+COMP_E+COMP_H+COMP_O+COMP_P+COMP_Q+COMP_R+COMP_S)/COMP_TOT)*100)

data$COMP_A <- NULL
data$COMP_B <- NULL
data$COMP_C <- NULL
data$COMP_D <- NULL
data$COMP_E <- NULL
data$COMP_F <- NULL
data$COMP_G <- NULL
data$COMP_H <- NULL
data$COMP_I <- NULL
data$COMP_J <- NULL
data$COMP_K <- NULL
data$COMP_L <- NULL
data$COMP_M <- NULL
data$COMP_N <- NULL
data$COMP_O <- NULL
data$COMP_P <- NULL
data$COMP_Q <- NULL
data$COMP_R <- NULL
data$COMP_S <- NULL
data$COMP_T <- NULL
data$COMP_U <- NULL
data$COMP_TOT <- NULL

Both car and motorcycles represent the number of cars and two wheelers respectively. We will transform this data by dividing by the population. As the automobil data is dated from 2019, we will divide it by the 2016 population as that is the latest population data we currently have.

data <- data %>%
  mutate(CAR_RATIO=Cars/POP_GDP) %>%
  mutate(MOTORCYCLE_RATIO=Motorcycles/POP_GDP) 

data$POP_GDP <- NULL
data$Cars <- NULL
data$Motorcycles <- NULL

Lastly, the values for tractors are very low in most municipalities and also does not reflect anything related to GDP_CAPITA. Hence, we will be eliminating it. Similarly, we will be eliminating Post office as well as the count is not useful for us.

data$Wheeled_tractor <- NULL
data$POST_OFFICES<- NULL
data$AREA<- NULL
data$GVA_MAIN<- NULL
data$`IBGE_15-59`<-NULL
data$`IBGE_CROP_PRODUCTION_$`<-NULL

There are two columns for both municipality name and state abbreviation. We will remove the duplicated columns.
The variable ESTIMATED_POP is dated as of 2016. As we do not require population as our independent variable, we will be removing it as well.
Lastly, GDP is directly related to GDP_CAPITA, and hence we cannot keep it as our independent variable.

data$CITY <- NULL
data$STATE<- NULL
data$code_mn<- NULL
data$cod_stt<- NULL
data$abbrv_s<- NULL
data$ESTIMATED_POP <- NULL
data$IBGE_POP <- NULL
data$GDP <- NULL

glimpse(data)

## Rows: 5,572
## Columns: 25
## $ name_mn                 <fct> Alta Floresta D'oeste, Ariquemes, Cabixi, Cac…
## $ CAPITAL                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ IDHM                    <dbl> 0.640, 0.700, 0.650, 0.718, 0.690, 0.685, 0.6…
## $ ALT                     <dbl> 337.74, 138.69, 236.06, 177.45, 262.81, 419.0…
## $ GDP_CAPITA              <dbl> 18732.17, 20618.18, 21202.96, 22130.78, 22721…
## $ geometry                <MULTIPOLYGON [°]> MULTIPOLYGON (((-62.19465 -..., …
## $ FOREIGNERS_PERCENTAGE   <dbl> 0.00000000, 0.12506502, 0.04752099, 0.0483620…
## $ DU_URBAN_PERCENTAGE     <dbl> 59.18087, 84.34058, 44.82062, 80.72780, 85.31…
## $ ACTIVE_PERCENTAGE       <dbl> 65.36511, 66.60103, 64.78245, 68.11618, 65.25…
## $ PLANTED_AREA_PERCENTAGE <dbl> 2.58779148, 1.60733932, 31.25955796, 4.793178…
## $ Intermediário_Adjacente <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ Intermediário_Remoto    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Rural_Adjacente         <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ Rural_Remoto            <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ GVA_AGROPEC_Percentage  <dbl> 3.654743e+01, 7.374052e+00, 4.621590e+01, 1.0…
## $ GVA_INDUSTRY_Percentage <dbl> 6.878626e-03, 1.795178e+01, 3.616531e+00, 1.3…
## $ GVA_SERVICES_Percentage <dbl> 2.517734e+01, 4.473012e-02, 1.884524e+01, 4.8…
## $ GVA_PUBLIC_Percentage   <dbl> 3.139653e+01, 2.994422e+01, 3.132234e+01, 2.7…
## $ TAX_CAPITA              <dbl> 0.9090476751, 2.0406429893, 0.0008761329, 2.2…
## $ COMP_AGRICULTURE        <dbl> 0.5905512, 1.0805943, 0.0000000, 0.7042254, 1…
## $ COMP_INDUSTRY           <dbl> 59.05512, 53.35434, 55.00000, 50.97508, 52.24…
## $ COMP_SERVICES           <dbl> 10.629921, 18.640252, 15.000000, 19.609967, 1…
## $ COMP_PUBLIC             <dbl> 22.83465, 17.96488, 18.33333, 19.66414, 24.01…
## $ CAR_RATIO               <dbl> 0.09660472, 0.16537924, 0.14708221, 0.1998361…
## $ MOTORCYCLE_RATIO        <dbl> 0.3633655, 0.4028859, 0.2922563, 0.4332533, 0…

We will first check if our data has any NA values.

sum(is.na(data))

## [1] 223

As there are NA values present, we will drop the rows with NA values.

data <- data %>% drop_na()

We have created our final dataset. Now, we will extract the dependent variable and plot a correlation plot of all the indepedent variables in order to investigate for any multicollinearity.

GDP_CAPITA <- data$GDP_CAPITA
data$GDP_CAPITA <- NULL
data$GDP_CAPITA <- GDP_CAPITA
dependent_variables <- data
st_geometry(dependent_variables)<-NULL
corrplot(cor(dependent_variables[,2:23]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper",
         number.cex = 0.7)

There are multiple arguments passed through the corrplot() function in order to plot the correlation matrix. We have chosen “AOE” as the ordering method for our correlation matrix which orders the variables in their respective order of the eigenvectors. We have passed “number” as our argument for method, which allows us to visualise the numerical vlaue of the correlation coefficients.
We will classify two variables to be significantly correlated if their correlation constant is above 0.80. As seen in the diagram above, there is only one pair of variables which is significantly correlated – Car ratio and IDHM with R value of 0.86. Between the two variables, we will eliminate car ratio as IHDM consists information about life expectancy, income, and education which is more crucial for our model. We will now perform exploratory data analysis on both our depedent and indepedent variables.

4. Exploratory Data Analysis

In this section, each variable will be analysed by plotting histograms and box plots. Through these diagrams, we can understand the distribution of the data as well as understand the distribution of outliers.

4.1 Dependent Variable

The dependent variable for the regression model is going to be GDP per capita. The unit of GDP per capita is measured in 1000 real$. Real$ is the official currency of Brazil.

ggplot(data=dependent_variables, 
             aes_string(x= "GDP_CAPITA")) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

The distribution above reveals a right skewed distribution. This means that most of the municipalities in Brazil have a low GDP per capita, with a few exceptions which have extremely high GDP per capita as compared to others. Statistically, the skewed dsitribution can be normalised by using log transformation. The code chunk below is used to derive a new variable called GDP_CAPITA.log by using a log transformation on the variable GDP_CAPITA.

dependent_variables$GDP_CAPITA.log <- log(dependent_variables$GDP_CAPITA)
data$GDP_CAPITA.log <- log(dependent_variables$GDP_CAPITA)

We will now plot a histogram with the transformed data.

ggplot(data=dependent_variables, 
             aes_string(x= "GDP_CAPITA.log")) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

With the transformation, the distribution is less skewed and hence will help us in the regression analysis. Moving on, we will perform exploratory data analysis on the independent variables. Note that both GDP_CAPITA and GDP_CAPITA.log will be fitted in the multiple regression model and will be analysed in order to find out which model performs better.

4.2 Dependent Variable

Firsly, we will create two functions which will help us plot histograms and bar plots of the independent variables. Note that we will need to visualise dummy variables through bar plots.

plot_data <- function(maindata,attribute){
  return(ggplot(data=maindata, 
             aes_string(x= attribute)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue"))
}

bar_plot <- function(maindata,attribute){
  return(ggplot(data=maindata, aes_string(x=attribute))+
  geom_bar(fill="light blue"))
}

All the plots are stored in variables through the code below. This is done by using the functions made in the code above.

CAPITAL <- bar_plot(dependent_variables,"CAPITAL")
IDHM <- plot_data(dependent_variables,"IDHM")
ALT <- plot_data(dependent_variables,"ALT")
FOREIGNERS_PERCENTAGE <- plot_data(dependent_variables,"FOREIGNERS_PERCENTAGE")
DU_URBAN_PERCENTAGE <- plot_data(dependent_variables,"DU_URBAN_PERCENTAGE")
ACTIVE_PERCENTAGE <- plot_data(dependent_variables,"ACTIVE_PERCENTAGE")
PLANTED_AREA_PERCENTAGE <- plot_data(dependent_variables,"PLANTED_AREA_PERCENTAGE")
Intermediário_Adjacente <- bar_plot(dependent_variables,"Intermediário_Adjacente")
Intermediário_Remoto <- bar_plot(dependent_variables,"Intermediário_Remoto")
Rural_Adjacente <- bar_plot(dependent_variables,"Rural_Adjacente")
Rural_Remoto <- bar_plot(dependent_variables,"Rural_Remoto")
GVA_AGROPEC_Percentage <- plot_data(dependent_variables,"GVA_AGROPEC_Percentage")
GVA_INDUSTRY_Percentage <- plot_data(dependent_variables,"GVA_INDUSTRY_Percentage")
GVA_SERVICES_Percentage <- plot_data(dependent_variables,"GVA_SERVICES_Percentage")
GVA_PUBLIC_Percentage <- plot_data(dependent_variables,"GVA_PUBLIC_Percentage")
TAX_CAPITA <- plot_data(dependent_variables,"TAX_CAPITA")
COMP_AGRICULTURE <- plot_data(dependent_variables,"COMP_AGRICULTURE")
COMP_INDUSTRY <- plot_data(dependent_variables,"COMP_INDUSTRY")
COMP_SERVICES <- plot_data(dependent_variables,"COMP_SERVICES")
COMP_PUBLIC <- plot_data(dependent_variables,"COMP_PUBLIC")
MOTORCYCLE_RATIO <- plot_data(dependent_variables,"MOTORCYCLE_RATIO")

Now, we will be plotting all the histograms in a tabular form so that it is more presentable.

ggarrange(CAPITAL, IDHM, ALT, FOREIGNERS_PERCENTAGE,
          DU_URBAN_PERCENTAGE, ACTIVE_PERCENTAGE, PLANTED_AREA_PERCENTAGE, Intermediário_Remoto, Rural_Adjacente, Rural_Remoto, GVA_AGROPEC_Percentage, GVA_INDUSTRY_Percentage,GVA_SERVICES_Percentage,GVA_PUBLIC_Percentage,TAX_CAPITA,COMP_AGRICULTURE,COMP_INDUSTRY,COMP_SERVICES,COMP_PUBLIC,
          MOTORCYCLE_RATIO,
          ncol = 3, 
          nrow = 3)

## $`1`

## 
## $`2`

## 
## $`3`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

We can divide the analysis into two parts:
(1) Dummy variables: Bar graphs are plotted which show the proportion of data which have 0 (FALSE) and 1 (TRUE). We do not need to transform this dataset as it is prepared to be fit in a regression model.
(2) Normal variables: Many of these variables are right skewed while others are normally distributed. As we have seen in the indepedent variable exploratory analysis and in the general scene of Brazil, there is prominent inequality. Hence, we will not standardise any of the data so that we do not lose any information.

5. Building a multiple linear regression model

In order to build a regression model, we will be using the lm function which is used to fit linear models. Through this function, we will also find out the dependent variables which are statistically significant. We will fit the model according to both GDP_CAPITA and GDP_CAPITA.log in order to analyse the differences between the two models.

Choosing appropriate model

Creating model based on raw GDP_CAPITA values.

gdp_capita.lmr <- lm(formula = GDP_CAPITA ~ CAPITAL + IDHM + ALT+ FOREIGNERS_PERCENTAGE +
          DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Intermediário_Remoto+ Rural_Adjacente + Rural_Remoto + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC + MOTORCYCLE_RATIO, data = dependent_variables)
summary(gdp_capita.lmr)

## 
## Call:
## lm(formula = GDP_CAPITA ~ CAPITAL + IDHM + ALT + FOREIGNERS_PERCENTAGE + 
##     DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + 
##     Intermediário_Adjacente + Intermediário_Remoto + Rural_Adjacente + 
##     Rural_Remoto + GVA_AGROPEC_Percentage + GVA_INDUSTRY_Percentage + 
##     GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA + 
##     COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC + 
##     MOTORCYCLE_RATIO, data = dependent_variables)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66845  -5008  -1709   1936 290381 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -6.954e+04  6.346e+03 -10.958  < 2e-16 ***
## CAPITAL                 -3.177e+03  2.545e+03  -1.248 0.212025    
## IDHM                     6.528e+04  4.757e+03  13.724  < 2e-16 ***
## ALT                     -3.649e-03  1.006e-02  -0.363 0.716880    
## FOREIGNERS_PERCENTAGE   -3.672e+01  3.226e+02  -0.114 0.909397    
## DU_URBAN_PERCENTAGE     -5.992e+01  1.250e+01  -4.793 1.69e-06 ***
## ACTIVE_PERCENTAGE        4.590e+02  8.733e+01   5.256 1.53e-07 ***
## PLANTED_AREA_PERCENTAGE  1.213e+02  7.074e+00  17.144  < 2e-16 ***
## Intermediário_Adjacente  1.623e+03  6.316e+02   2.570 0.010209 *  
## Intermediário_Remoto     8.797e+03  1.742e+03   5.051 4.54e-07 ***
## Rural_Adjacente          3.197e+03  5.646e+02   5.661 1.58e-08 ***
## Rural_Remoto             7.116e+03  9.084e+02   7.834 5.63e-15 ***
## GVA_AGROPEC_Percentage   1.089e-01  2.240e-02   4.860 1.21e-06 ***
## GVA_INDUSTRY_Percentage  4.130e-01  3.081e-02  13.404  < 2e-16 ***
## GVA_SERVICES_Percentage -1.218e-01  2.006e-02  -6.071 1.36e-09 ***
## GVA_PUBLIC_Percentage   -7.486e-02  1.751e-02  -4.275 1.94e-05 ***
## TAX_CAPITA               4.065e+03  6.925e+01  58.709  < 2e-16 ***
## COMP_AGRICULTURE         1.186e+02  2.149e+01   5.522 3.51e-08 ***
## COMP_INDUSTRY            1.152e+02  3.233e+01   3.564 0.000369 ***
## COMP_SERVICES            2.602e+02  3.977e+01   6.543 6.58e-11 ***
## COMP_PUBLIC              1.200e+02  3.255e+01   3.687 0.000229 ***
## MOTORCYCLE_RATIO        -7.288e+03  2.820e+03  -2.584 0.009793 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12880 on 5530 degrees of freedom
## Multiple R-squared:  0.596,  Adjusted R-squared:  0.5944 
## F-statistic: 388.4 on 21 and 5530 DF,  p-value: < 2.2e-16

The above model suggests we need to eliminate three variables when working with confidence interval of 95%. This model has an R-Square value of 0.5944 which signifies that 59.44% of the variation in the dependent variable is due to the variation in independent variables.

Creating model based on logarithmic GDP_CAPITA values.

Similar to the ethod used above, we will now use logarithmic GDP_CAPITA values to build our regression model.

gdp_capita.lmr <- lm(formula = GDP_CAPITA.log ~ CAPITAL + IDHM + ALT+ FOREIGNERS_PERCENTAGE +
          DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Intermediário_Remoto+ Rural_Adjacente + Rural_Remoto + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC + MOTORCYCLE_RATIO, data = dependent_variables)
summary(gdp_capita.lmr)

## 
## Call:
## lm(formula = GDP_CAPITA.log ~ CAPITAL + IDHM + ALT + FOREIGNERS_PERCENTAGE + 
##     DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + 
##     Intermediário_Adjacente + Intermediário_Remoto + Rural_Adjacente + 
##     Rural_Remoto + GVA_AGROPEC_Percentage + GVA_INDUSTRY_Percentage + 
##     GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA + 
##     COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC + 
##     MOTORCYCLE_RATIO, data = dependent_variables)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.87518 -0.21594 -0.04313  0.16704  2.98407 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.635e+00  1.761e-01  32.005  < 2e-16 ***
## CAPITAL                 -1.164e-01  7.061e-02  -1.648  0.09943 .  
## IDHM                     4.486e+00  1.320e-01  33.994  < 2e-16 ***
## ALT                     -8.491e-08  2.792e-07  -0.304  0.76105    
## FOREIGNERS_PERCENTAGE    8.219e-03  8.952e-03   0.918  0.35857    
## DU_URBAN_PERCENTAGE     -2.398e-03  3.469e-04  -6.913 5.26e-12 ***
## ACTIVE_PERCENTAGE        1.847e-02  2.423e-03   7.625 2.86e-14 ***
## PLANTED_AREA_PERCENTAGE  5.311e-03  1.963e-04  27.059  < 2e-16 ***
## Intermediário_Adjacente  5.147e-02  1.752e-02   2.937  0.00332 ** 
## Intermediário_Remoto     3.605e-01  4.832e-02   7.460 9.97e-14 ***
## Rural_Adjacente          6.986e-02  1.567e-02   4.460 8.37e-06 ***
## Rural_Remoto             2.603e-01  2.520e-02  10.327  < 2e-16 ***
## GVA_AGROPEC_Percentage   5.990e-06  6.215e-07   9.638  < 2e-16 ***
## GVA_INDUSTRY_Percentage  1.024e-05  8.548e-07  11.981  < 2e-16 ***
## GVA_SERVICES_Percentage -3.256e-06  5.567e-07  -5.850 5.21e-09 ***
## GVA_PUBLIC_Percentage   -4.120e-06  4.858e-07  -8.481  < 2e-16 ***
## TAX_CAPITA               7.148e-02  1.921e-03  37.205  < 2e-16 ***
## COMP_AGRICULTURE         3.451e-03  5.961e-04   5.790 7.45e-09 ***
## COMP_INDUSTRY           -4.033e-03  8.970e-04  -4.496 7.07e-06 ***
## COMP_SERVICES            8.580e-03  1.103e-03   7.775 8.91e-15 ***
## COMP_PUBLIC             -2.845e-03  9.032e-04  -3.149  0.00164 ** 
## MOTORCYCLE_RATIO        -1.471e-02  7.825e-02  -0.188  0.85092    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3573 on 5530 degrees of freedom
## Multiple R-squared:  0.7271, Adjusted R-squared:  0.7261 
## F-statistic: 701.7 on 21 and 5530 DF,  p-value: < 2.2e-16

The above model suggests we need to eliminate three variables when working with confidence interval of 95%. This model has an R-Square value of 0.7261 which signifies that 72.61% of the variation in the dependent variable is due to the variation in independent variables. As the R-square value of the lorithmic model is significantly better than that of the model used with raw scores, we will use the logarithmic model. Note that this model is also chosen as the linearity assumptions will be more accurately met as compared to other model.

We will be using confidence interval of 95%. Therefore, with reference to the summary above, it is clear that not all the independent variables are statistically significant. The insignificant variables are:
(1) CAPITAL
(2) ALT
(3) FOREIGNERS_PERCENTAGE
(5) MOTORCYCLE_RATIO

We will revise the model by removing those variables which are not statistically significant.

gdp_capita.lmr <- lm(formula = GDP_CAPITA.log ~   IDHM +
          DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Intermediário_Remoto+ Rural_Adjacente + Rural_Remoto + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC , data = dependent_variables)
summary(gdp_capita.lmr)

## 
## Call:
## lm(formula = GDP_CAPITA.log ~ IDHM + DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + 
##     PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente + Intermediário_Remoto + 
##     Rural_Adjacente + Rural_Remoto + GVA_AGROPEC_Percentage + 
##     GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + 
##     TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + 
##     COMP_PUBLIC, data = dependent_variables)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.87458 -0.21720 -0.04254  0.16651  2.98463 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.655e+00  1.749e-01  32.334  < 2e-16 ***
## IDHM                     4.481e+00  1.297e-01  34.557  < 2e-16 ***
## DU_URBAN_PERCENTAGE     -2.397e-03  3.457e-04  -6.933 4.59e-12 ***
## ACTIVE_PERCENTAGE        1.828e-02  2.402e-03   7.610 3.21e-14 ***
## PLANTED_AREA_PERCENTAGE  5.338e-03  1.951e-04  27.356  < 2e-16 ***
## Intermediário_Adjacente  5.203e-02  1.750e-02   2.973  0.00296 ** 
## Intermediário_Remoto     3.670e-01  4.787e-02   7.666 2.08e-14 ***
## Rural_Adjacente          7.049e-02  1.563e-02   4.510 6.63e-06 ***
## Rural_Remoto             2.612e-01  2.517e-02  10.377  < 2e-16 ***
## GVA_AGROPEC_Percentage   5.998e-06  6.212e-07   9.656  < 2e-16 ***
## GVA_INDUSTRY_Percentage  1.029e-05  8.539e-07  12.047  < 2e-16 ***
## GVA_SERVICES_Percentage -3.279e-06  5.559e-07  -5.898 3.90e-09 ***
## GVA_PUBLIC_Percentage   -4.118e-06  4.857e-07  -8.478  < 2e-16 ***
## TAX_CAPITA               7.152e-02  1.919e-03  37.275  < 2e-16 ***
## COMP_AGRICULTURE         3.478e-03  5.843e-04   5.953 2.80e-09 ***
## COMP_INDUSTRY           -4.097e-03  8.951e-04  -4.577 4.81e-06 ***
## COMP_SERVICES            8.438e-03  1.080e-03   7.812 6.66e-15 ***
## COMP_PUBLIC             -2.925e-03  8.982e-04  -3.256  0.00114 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3573 on 5534 degrees of freedom
## Multiple R-squared:  0.7269, Adjusted R-squared:  0.7261 
## F-statistic: 866.6 on 17 and 5534 DF,  p-value: < 2.2e-16

The adjusted R-square values remain the same at 0.7261 after corecting the model.
There are multiple assumptions made while fitting a linear model. These assumptions are:
(1) Linearity assumption
(2) Normality assumption
(3) Homoscedasticity
(4) No correlation between residuals

We will be performing tests on each of these assumptions to ensure that our model is accuarte.

5.1 Checking assumptions

We will first be checking the linearity assumption. The assumption is that the relationship between the depedent variable and indepedent variables is approximately linear. We can check this assumption through two plots.

5.1.1 Tests for Linearity

The first graph we are plotting is the Actual vs Fitted plot.

ols_plot_obs_fit(gdp_capita.lmr, print_plot = TRUE)

It is evident that almost all the points are close to the regressed diagonal line. The exception lies with points representing higher GDP_CAPITA as most of them are further away from the regressed line. However, as our sample size is more than 5000, the proportion of data which is away from the regressed line, indicating that an approximate linear relationship exists between GDP_CAPITA and the indepedent variables.

To be sure of our assumption, we can plot another graph, which is the Residual vs Fitted plot.

ols_plot_resid_fit(gdp_capita.lmr, print_plot = TRUE)

From this graph we can derive few things:
(1) The relationship between dependent variable and indepedent variables represent a linear relationship as most of the residual values are spread around the horizontal line.
(2) The residuals form an approximate band around the 0 line indicating homogenity of error variance.
(3) There are few data points which are away from the horizontal line, indicating that there are few outliers present in the data.

5.1.2 Tests for multicolinearity

Multicollinearity occurs when independent variables in a regression model are correlated. If multicolinearity exists between our variables, we are failing to the assumption that all our independent variables are truly independent and not related to one another. Multicollinearity can reduce the precision of the estimated coefficients, which weakens the statistical power of our regression model. Therefore, we will check if multicolinearity exists from VIF (Variance Inflation Factor) scores. We will keep the threshold for the VIF scores as 10, i.e. if any VIF score is more than 10, we will eliminate that variable.

ols_vif_tol(gdp_capita.lmr)

##                  Variables Tolerance      VIF
## 1                     IDHM 0.2644779 3.781035
## 2      DU_URBAN_PERCENTAGE 0.4190134 2.386558
## 3        ACTIVE_PERCENTAGE 0.3881498 2.576325
## 4  PLANTED_AREA_PERCENTAGE 0.8090001 1.236094
## 5  Intermediário_Adjacente 0.6929264 1.443155
## 6     Intermediário_Remoto 0.9384841 1.065548
## 7          Rural_Adjacente 0.3796542 2.633976
## 8             Rural_Remoto 0.6663301 1.500758
## 9   GVA_AGROPEC_Percentage 0.6817505 1.466812
## 10 GVA_INDUSTRY_Percentage 0.7070687 1.414290
## 11 GVA_SERVICES_Percentage 0.4817096 2.075940
## 12   GVA_PUBLIC_Percentage 0.5765119 1.734570
## 13              TAX_CAPITA 0.8012259 1.248087
## 14        COMP_AGRICULTURE 0.7250452 1.379224
## 15           COMP_INDUSTRY 0.1971123 5.073251
## 16           COMP_SERVICES 0.3543089 2.822396
## 17             COMP_PUBLIC 0.2439240 4.099638

Since the VIF of the independent variables are less than 10, we can safely conclude that there are no sign of multicollinearity among the independent variables.

5.1.3 Test for Normality Assumption

ols_plot_resid_hist(gdp_capita.lmr)

As seen above, the residual data resembles a normal distribution. Most of the statistical tests which check for normality in one-sample data do not allow a sample size more than 5000. The only test which allows a sample size more than 5000 is the Kolmogorov-Smirnov test, however, this test is not designed for such a big dataset. To confirm that our data does not violdate the normality assumption, we will be plotting a density plot and QQ plot to determine normality in the residual data.

# Density plot
ggdensity(gdp_capita.lmr$residuals, fill = "lightgray")

From the diagram above, we are able to notice a grpah which resembles a normal distribution for our residual data. To double confirm our findings, we will also be plotting the QQ plot from the code below.

# QQ plot
ggqqplot(gdp_capita.lmr$residuals)

The slope of the line is roughly 45 degrees and almost all the points are located in close proximity of the slope. The exception is for larger data. Given the large dataset which contains more than 5000 points, having little variation in the outliers is absolutely understandable. Hence, from the graph we can conclude that our residual data does not violate the normality assumption.

5.1.4 Test for Spatial Autocorrelation

In this section, we will be checking if our residual data from the linear regression is spatially correlated. Firstly, we will add the residual data in our sf data and then convert the sf data into spatial polygon data frame format because spdep functions can only process spatial data objects.

data$residuals <- gdp_capita.lmr$residuals
data.sp <- as_Spatial(data)
data.sp

## class       : SpatialPolygonsDataFrame 
## features    : 5552 
## extent      : -73.99045, -28.83594, -33.75118, 4.884623  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs 
## variables   : 26
## names       :         name_mn, CAPITAL,  IDHM,    ALT, FOREIGNERS_PERCENTAGE, DU_URBAN_PERCENTAGE, ACTIVE_PERCENTAGE, PLANTED_AREA_PERCENTAGE, Intermediário_Adjacente, Intermediário_Remoto, Rural_Adjacente, Rural_Remoto, GVA_AGROPEC_Percentage, GVA_INDUSTRY_Percentage, GVA_SERVICES_Percentage, ... 
## min values  : Abadia De Goiás,       0, 0.418,      0,                     0,    4.55261775520925,  47.1631205673759,                       0,                       0,                    0,               0,            0,                      0,     0.00142261767428109,     0.00670618854896865, ... 
## max values  :          Zortéa,       1, 0.862, 874579,      37.7218184890992,                 100,  74.4760130414532,        185.942941120092,                       1,                    1,               1,            1,       76088.5597874933,        87480.2651309903,        83553.5525749495, ...

To visualise the residuals, we will use tmap package to display the distribution of the residuals.

tm_shape(data)+
  tm_polygons("residuals", border.col = "black",border.alpha = 0.2)

From the figure above, we can infer that spatial autocorrelation of residuals exist. In order to find out if the autocorrelation is statistically significant, we will perform global moran’s I test. To perform this test, we need a weight matrix.

We will be creating a distance based weight matrix. Firstly, we will calculate the distance for which all the municipalities have at least one neighbour.

coords <- coordinates(data.sp)
k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1,coords,longlat = TRUE))
summary(k1dists)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.386  11.778  15.881  21.366  23.963 369.663

As seen above, 370 km is the cut off distance such that all the municipalities have at least one neighbour. We will compute the distance-based weight matrix by using dnearneigh() function of spdep.

nb <- dnearneigh(coordinates(data.sp), 0, 370, longlat = TRUE)
summary(nb)

## Neighbour list object:
## Number of regions: 5552 
## Number of nonzero links: 3021758 
## Percentage nonzero weights: 9.80304 
## Average number of links: 544.2648 
## Link number distribution:
## 
##   1   3   7   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25 
##   1   1   1   5   1   3   4  10   9   4   7   5   8   8   6   3   2   7   6   8 
##  26  27  28  29  30  31  32  33  34  36  37  38  39  40  41  42  43  44  45  46 
##   5   5   7   6   3   4   4   1   2   1   1   2   3   4   5   3   2   2   5   3 
##  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66 
##   8   4   4   6   7   2   8   8   9   8   5  13   6   4   2   1   2   2   4   2 
##  67  70  71  72  73  74  75  76  77  78  80  81  82  83  84  85  86  87  88  89 
##   4   3   5   4   4   2   4   5   5   1   5   3   3   1   2   4   1   3   2   3 
##  90  91  92  94  95  97  98 100 101 102 103 104 105 106 107 108 109 110 111 112 
##   5   1   1   2   3   3   1   3   3   2   1   2   1   1   6   2   1   1   5   3 
## 113 114 115 116 117 118 119 120 121 123 124 125 126 128 129 130 131 132 133 134 
##   2   5   5   4   1   1   5   2   2   1   3   5   1   1   2   3   6   1   4   4 
## 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 
##   1   4   3   4   2   2   6   3   8   7  10   5   4   4   6   7   6   8   4   7 
## 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 
##   4   7   3   7   3   2   6  14   3   5   5   5   8   5   4   4   5   3   7   2 
## 175 177 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 198 
##   5   2   2   4   2   3   1   1   1   4   1   2   3   5   8   2   5   3   2   3 
## 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 218 219 220 
##   3   3   4   1   2   1   2   2   5   3   2   4   3   8   4   4   6   4   8   1 
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   3   7   3   4   4   5   3   7   4   4   2   5   1   2   6   7   5   2   7   5 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##   2   4   3   9   6  10   6   3   6   6   1   9   3   7   8   6   3   2   4   7 
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
##   2   5   8   9   2   5   8   4   5   7   5  10   6   9   7   6   6   5   9   9 
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   5   7   4   7   5   4   5   5   9   5   5   6   5   1   7   3   2   3   7   4 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
##   7   4   6   4   4   7   5   4   6   2   7   7   5   2   4   8   8   2   7   5 
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 
##   9   8   5   6   6   6   5   2   4  10   4   3   9   3   8   6   5   6   6   4 
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   5   5   3   5   7   5   2   7   6   7   6   6   2   2   3   4   5   7   6   9 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 
##   4   3   8   4   8   7   4   7   8   4   4   7   6   6   6  10   8   4   7   9 
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 
##   9  10   7   9   9  14  14   7   4  14   3  10   3   7   6   3   9   5  10   7 
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 
##   4   9   6   7   3   4   6   3   1   8   3   6   3   6   7   1   5   6   5   5 
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 
##   7   6  11   3   2   1   3   7   3   6   4   3   3   3   4   4   4   3   2   3 
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 
##   3   6   7   3   1   3   3   6   3   1   2   3   6   6   4   4   5   5   5   5 
## 461 462 463 464 465 466 467 468 469 470 471 473 474 475 476 477 478 479 481 482 
##   5   1   3   3   5   3   6   1   2   4   3   2   4   4   2   5   4   6   3   4 
## 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 503 
##   4   2   6   1   7   7   6   3   5   3   5   4   2   3   6   3   8   3   4   2 
## 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 
##   5   4   2   6   5   3   3  10   2   8   5   2   7   6   6   3   4   4   6   3 
## 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 
##   6   4   2   5   5   3   1   6   6   8   7   2   8   5   3   6   1   5   5   5 
## 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 
##   3   9   4   5   4   5   7   3   8   6   8   7  15   4   6   5   4  13   3   6 
## 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 
##  10   7   8   5   9   5   5   7   5  11   7   9  11  10   5   9  11   6   4   4 
## 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 
##   8   8   8   9   7   7   7   5   6   7   6   7   5   7   9   8   8   9  10   9 
## 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 
##  11   8   7   3   5   9   4  10   4   3   6   6   5   6   7   4  10  10   6   8 
## 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 
##   9  12   9   7   9  11   8   4   6   8  10   9   5   9  13  10   9   8   7  11 
## 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 
##   8   8  10   5   3  12  11  12   3  11  11   9   9   9  11  11  10   6   6  13 
## 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 
##  14   7   6   8   9   8   7   6   7   7   6   9  13  14  10  10  14   9  12  13 
## 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 
##   7   6   7   7  16   6  15   8   9   7  11   8   8  11  12   5  13   9  13   9 
## 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 
##  16   8   7   9   4  11  11   7   8  12  10   6  16  15   7  15   6   9   8   8 
## 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 
##  10   8   7  10  10   8   8   7  14  13  13  12   6  12   9   5   4  12   8   6 
## 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 
##   7  16   7   7  11  15  11   8  11   6   8  13  11   6   7   9   5   7   6  10 
## 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 
##  16   8   9   7  10  12   5  14   4   5  13  10  11   6   4   7   6  14   5   5 
## 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 
##  12   4  12   6   8  12   3  10   9   7   9   8   2   8   9   4   6   8   8   6 
## 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 
##  13   9   7   9   3   8  12  10  10   4  10   6   4   9  10   7   8   4  20   6 
## 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 
##   8  12  11   7   7  12   7   9   6   6   7   7   7   8   5   5   6   4   8  10 
## 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 
##  11  10   6   4   7   6   9  13  13   4  11  12   7   6  11   7   3   9   7   9 
## 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 
##   8   5   8   7  13  12   6   9   2   7   6   5  10   6  12  11   9   4   5   4 
## 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 
##   8   8  12   8   4   4   5   3   8   6   7   6   6   6   5   6   4   7   6   9 
## 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 
##   7   4  12   9   5   6   5   5   9   5   7   7   6   5  10   5   3   4   4   2 
## 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 944 
##  10   6   5   8   4   9   3   5   5   7   5   5   3   5   5   3   2   1   2   2 
## 945 946 947 948 951 954 955 959 963 968 970 971 
##   1   1   1   1   4   2   1   1   1   1   1   1 
## 1 least connected region:
## 1517 with 1 link
## 1 most connected region:
## 4148 with 971 links

Next, nb2listw() of spdep packge will be used to convert the output neighbours lists (i.e. nb) into a spatial weights.

nb_lw <- nb2listw(nb, style = 'W')
summary(nb_lw)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 5552 
## Number of nonzero links: 3021758 
## Percentage nonzero weights: 9.80304 
## Average number of links: 544.2648 
## Link number distribution:
## 
##   1   3   7   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25 
##   1   1   1   5   1   3   4  10   9   4   7   5   8   8   6   3   2   7   6   8 
##  26  27  28  29  30  31  32  33  34  36  37  38  39  40  41  42  43  44  45  46 
##   5   5   7   6   3   4   4   1   2   1   1   2   3   4   5   3   2   2   5   3 
##  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66 
##   8   4   4   6   7   2   8   8   9   8   5  13   6   4   2   1   2   2   4   2 
##  67  70  71  72  73  74  75  76  77  78  80  81  82  83  84  85  86  87  88  89 
##   4   3   5   4   4   2   4   5   5   1   5   3   3   1   2   4   1   3   2   3 
##  90  91  92  94  95  97  98 100 101 102 103 104 105 106 107 108 109 110 111 112 
##   5   1   1   2   3   3   1   3   3   2   1   2   1   1   6   2   1   1   5   3 
## 113 114 115 116 117 118 119 120 121 123 124 125 126 128 129 130 131 132 133 134 
##   2   5   5   4   1   1   5   2   2   1   3   5   1   1   2   3   6   1   4   4 
## 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 
##   1   4   3   4   2   2   6   3   8   7  10   5   4   4   6   7   6   8   4   7 
## 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 
##   4   7   3   7   3   2   6  14   3   5   5   5   8   5   4   4   5   3   7   2 
## 175 177 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 198 
##   5   2   2   4   2   3   1   1   1   4   1   2   3   5   8   2   5   3   2   3 
## 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 218 219 220 
##   3   3   4   1   2   1   2   2   5   3   2   4   3   8   4   4   6   4   8   1 
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   3   7   3   4   4   5   3   7   4   4   2   5   1   2   6   7   5   2   7   5 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 
##   2   4   3   9   6  10   6   3   6   6   1   9   3   7   8   6   3   2   4   7 
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 
##   2   5   8   9   2   5   8   4   5   7   5  10   6   9   7   6   6   5   9   9 
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   5   7   4   7   5   4   5   5   9   5   5   6   5   1   7   3   2   3   7   4 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 
##   7   4   6   4   4   7   5   4   6   2   7   7   5   2   4   8   8   2   7   5 
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 
##   9   8   5   6   6   6   5   2   4  10   4   3   9   3   8   6   5   6   6   4 
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   5   5   3   5   7   5   2   7   6   7   6   6   2   2   3   4   5   7   6   9 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 
##   4   3   8   4   8   7   4   7   8   4   4   7   6   6   6  10   8   4   7   9 
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 
##   9  10   7   9   9  14  14   7   4  14   3  10   3   7   6   3   9   5  10   7 
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 
##   4   9   6   7   3   4   6   3   1   8   3   6   3   6   7   1   5   6   5   5 
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 
##   7   6  11   3   2   1   3   7   3   6   4   3   3   3   4   4   4   3   2   3 
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 
##   3   6   7   3   1   3   3   6   3   1   2   3   6   6   4   4   5   5   5   5 
## 461 462 463 464 465 466 467 468 469 470 471 473 474 475 476 477 478 479 481 482 
##   5   1   3   3   5   3   6   1   2   4   3   2   4   4   2   5   4   6   3   4 
## 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 503 
##   4   2   6   1   7   7   6   3   5   3   5   4   2   3   6   3   8   3   4   2 
## 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 
##   5   4   2   6   5   3   3  10   2   8   5   2   7   6   6   3   4   4   6   3 
## 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 
##   6   4   2   5   5   3   1   6   6   8   7   2   8   5   3   6   1   5   5   5 
## 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 
##   3   9   4   5   4   5   7   3   8   6   8   7  15   4   6   5   4  13   3   6 
## 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 
##  10   7   8   5   9   5   5   7   5  11   7   9  11  10   5   9  11   6   4   4 
## 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 
##   8   8   8   9   7   7   7   5   6   7   6   7   5   7   9   8   8   9  10   9 
## 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 
##  11   8   7   3   5   9   4  10   4   3   6   6   5   6   7   4  10  10   6   8 
## 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 
##   9  12   9   7   9  11   8   4   6   8  10   9   5   9  13  10   9   8   7  11 
## 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 
##   8   8  10   5   3  12  11  12   3  11  11   9   9   9  11  11  10   6   6  13 
## 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 
##  14   7   6   8   9   8   7   6   7   7   6   9  13  14  10  10  14   9  12  13 
## 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 
##   7   6   7   7  16   6  15   8   9   7  11   8   8  11  12   5  13   9  13   9 
## 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 
##  16   8   7   9   4  11  11   7   8  12  10   6  16  15   7  15   6   9   8   8 
## 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 
##  10   8   7  10  10   8   8   7  14  13  13  12   6  12   9   5   4  12   8   6 
## 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 
##   7  16   7   7  11  15  11   8  11   6   8  13  11   6   7   9   5   7   6  10 
## 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 
##  16   8   9   7  10  12   5  14   4   5  13  10  11   6   4   7   6  14   5   5 
## 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 
##  12   4  12   6   8  12   3  10   9   7   9   8   2   8   9   4   6   8   8   6 
## 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 
##  13   9   7   9   3   8  12  10  10   4  10   6   4   9  10   7   8   4  20   6 
## 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 
##   8  12  11   7   7  12   7   9   6   6   7   7   7   8   5   5   6   4   8  10 
## 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 
##  11  10   6   4   7   6   9  13  13   4  11  12   7   6  11   7   3   9   7   9 
## 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 
##   8   5   8   7  13  12   6   9   2   7   6   5  10   6  12  11   9   4   5   4 
## 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 
##   8   8  12   8   4   4   5   3   8   6   7   6   6   6   5   6   4   7   6   9 
## 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 
##   7   4  12   9   5   6   5   5   9   5   7   7   6   5  10   5   3   4   4   2 
## 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 944 
##  10   6   5   8   4   9   3   5   5   7   5   5   3   5   5   3   2   1   2   2 
## 945 946 947 948 951 954 955 959 963 968 970 971 
##   1   1   1   1   4   2   1   1   1   1   1   1 
## 1 least connected region:
## 1517 with 1 link
## 1 most connected region:
## 4148 with 971 links
## 
## Weights style: W 
## Weights constants summary:
##      n       nn   S0       S1       S2
## W 5552 30824704 5552 45.51765 22335.81

Next, lm.morantest() of spdep package will be used to perform Moran’s I test for residual spatial autocorrelation. The null hypothesis for this test is that the residuals are randomly distributed among the municipalities of Brazil. This test will be conducted at a 95% confidence interval.

lm.morantest(gdp_capita.lmr, nb_lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = GDP_CAPITA.log ~ IDHM + DU_URBAN_PERCENTAGE +
## ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente +
## Intermediário_Remoto + Rural_Adjacente + Rural_Remoto +
## GVA_AGROPEC_Percentage + GVA_INDUSTRY_Percentage +
## GVA_SERVICES_Percentage + GVA_PUBLIC_Percentage + TAX_CAPITA +
## COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC, data =
## dependent_variables)
## weights: nb_lw
## 
## Moran I statistic standard deviate = 98.893, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.095106e-01    -5.332931e-04     1.238219e-06

The p-value is 2.2e-16, which is less than 0.05. Hence, we reject the null hypothesis. We can conclude that the residual data are not randomly distributed . The moran’s I value is 0.02, which is very close to 0 indicating that the magnitude of clustering is low, even though we reject the null hypothesis. The findings from this test imply that there are confounding variables which affect the GDP per ceapita in the municipalities in Brazil. Even though the test indicates that the relationships in the model is stationary, this is acceptable as there are numerous more reasons which are culturally, politically, and socially linked factors which we do not have in this dataset.

5.1.4 Test for Heteroscedacity

One of the important assumptions of linear regression is that, there should be no heteroscedasticity of residuals. In simpler terms, this means that the variance of residuals should not increase with fitted values of independent variable. It is customary to check for heteroscedasticity of residuals once you build the linear regression model. The reason is, we want to check if the model thus built is unable to explain some pattern in the response variable (Y), that eventually shows up in the residuals.

We will perform Breusch Pagan Test for Heteroskedasticity.

Null hypothesis: The variance is constant, i.e. the data is there is no heteroscedasticity.
Alternate hypothesis: The variance is not constant, i.e. the data is there is heteroscedasticity.
Significance level: 0.05

ols_test_breusch_pagan(gdp_capita.lmr)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                    Data                    
##  ------------------------------------------
##  Response : GDP_CAPITA.log 
##  Variables: fitted values of GDP_CAPITA.log 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    243.6201 
##  Prob > Chi2   =    6.38839e-55

As the p-value for this test is less than 0.05, we reject the null hypothesis. We can conclude that there is heteroscedasticity in our data with 95% confidence. This test suggests that there is a lot of noise in our data which leads to inconsistent variation of data. This is not good for our regression model, however, given the number of variables and observations, building a perfect model is very hard. Therefore, this is one of the drawbacks of the regression model.

5.1.5 Conclusion from tests

Most of the assumptions of linear regression models have been met as seen above. From the tests above, we were not able to prove if the data is non-stationary as autocorrelation was found when Moran’s test was conducted. Furthermore, through Breusch Pagan Test, we were also able to find out that heterescedacity is present in our regression model which is a drawback for our model.
The drawbacks of this regression model will be overcomed by making a Geographically Weighted Regression Model. This model will account for the spatial location for each municipality when factoring the role each independent variable play.

5.2 Analysis of Multiple Linear Regression Method

To analyse our multiple regression model, we will be analysing how each independent variable affects GDP per capita. This relationship can be found by analysing the coefficients of each of the independent variables. These coefficients can be obtained from the code below.

coefficients <- as.data.frame(gdp_capita.lmr$coefficients)
colnames(coefficients) <- c("Value")
print(coefficients)

##                                 Value
## (Intercept)              5.655193e+00
## IDHM                     4.481006e+00
## DU_URBAN_PERCENTAGE     -2.396971e-03
## ACTIVE_PERCENTAGE        1.827565e-02
## PLANTED_AREA_PERCENTAGE  5.337640e-03
## Intermediário_Adjacente  5.203362e-02
## Intermediário_Remoto     3.669998e-01
## Rural_Adjacente          7.048606e-02
## Rural_Remoto             2.611680e-01
## GVA_AGROPEC_Percentage   5.998346e-06
## GVA_INDUSTRY_Percentage  1.028729e-05
## GVA_SERVICES_Percentage -3.278772e-06
## GVA_PUBLIC_Percentage   -4.117840e-06
## TAX_CAPITA               7.152464e-02
## COMP_AGRICULTURE         3.477837e-03
## COMP_INDUSTRY           -4.097323e-03
## COMP_SERVICES            8.437904e-03
## COMP_PUBLIC             -2.924581e-03

There are a total of 17 variables in our model. Out of the seventeen variable, 5 variables are negative. Negative coefficients tell us that there is an negative relationship between the independent variable and the dependent variables. Note that the dependent variable, GDP_CAPITA has the unit of 1000 real (~S$265) which is why all our coefficients are in decimal place. For easier interpretation, we will multiply our dependent variable by 1000 to make the unit of GDP_CAPITA as 1 real.

coefficients$Value <- coefficients$Value *1000
coefficients

##                                 Value
## (Intercept)              5.655193e+03
## IDHM                     4.481006e+03
## DU_URBAN_PERCENTAGE     -2.396971e+00
## ACTIVE_PERCENTAGE        1.827565e+01
## PLANTED_AREA_PERCENTAGE  5.337640e+00
## Intermediário_Adjacente  5.203362e+01
## Intermediário_Remoto     3.669998e+02
## Rural_Adjacente          7.048606e+01
## Rural_Remoto             2.611680e+02
## GVA_AGROPEC_Percentage   5.998346e-03
## GVA_INDUSTRY_Percentage  1.028729e-02
## GVA_SERVICES_Percentage -3.278772e-03
## GVA_PUBLIC_Percentage   -4.117840e-03
## TAX_CAPITA               7.152464e+01
## COMP_AGRICULTURE         3.477837e+00
## COMP_INDUSTRY           -4.097323e+00
## COMP_SERVICES            8.437904e+00
## COMP_PUBLIC             -2.924581e+00

The five variables which are negative are:
(1) COMP_INDUSTRY -4.097323
(2) COMP_PUBLIC -2.924581
(3) DU_URBAN_PERCENTAGE -2.396971
(4) GVA_PUBLIC_Percentage -0.004117840
(5) GVA_SERVICES_Percentage -0.003278772

COMP_INDUSTRY (Percentage of Company in Industrial Services), COMP_PUBLIC (Percentage of Company in Public Services) and DU_URBAN_PERCENTAGE (Percentage of domestic units which are urban) have a strong negative relationship with these GDP_CAPITA. With increase in proportion for each of these variables, there is a decrease in GDP per capita. As this is a comparitive study done for all the municipalities, these coefficents suggest that municipalities which are heavily focused in industrial services and public services have lower GDP per capita than other municipalities which have higher proprotion of other type of companies. GVA_PUBLIC_Percentage (Percentage of Gross Value Add By Public Services) and GVA_SERVICES_Percentage (Percentage of Gross Value Add By Commercial Services) are negative but very close to zero suggesting that they do not have a significant relationship on GDP_CAPITA.

The twelve variables which are positive are:
(1) IDHM 4481.006
(2) Intermediário_Remoto 366.9998
(3) Rural_Remoto 261.168
(4) TAX_CAPITA 71.52464
(5) Rural_Adjacente 70.48606
(6) Intermediário_Adjacente 52.03362
(7) ACTIVE_PERCENTAGE 18.27565
(8) COMP_SERVICES 8.437904
(9) PLANTED_AREA_PERCENTAGE 5.337640
(10) COMP_AGRICULTURE 3.477837
(11) GVA_INDUSTRY_Percentage 0.01028729
(12) GVA_AGROPEC_Percentage 0.005998346

The variables displayed above are the independent variables which have a positive relationship with GDP_CAPITA. Other than for the dummy variables marked above, this means that the single unit increase in the variable leads to an increase in GDP_CAPITA with an amount of the value of the coefficient. For example, IDHM (Human Development Index) has a coefficient of 4481.006. This means that for every one unit of increase in IDHM, the GDP_CAPITA increases by roughly 4481 real. As IDHM ranges from 0-1, a better understanding of this coefficient will be that for every 0.1 increase in IDHM, the GDP_CAPITA increases by roughly 448.1 real. IDHM is a composite index consisting of education, life expectancy, and income. this indicates that public infrastructure such as schools and hospitals significantly contribute to the increase in GDP per capita, and are essential to every municipality.

The dummy variable is analysed in a different way. For example, Intermediário_Remoto has a coefficient of 366.9998. This means that if a municipality is Intermediário_Remoto then it is likely to have 367 real more in their GDP per capita than those municipalities which dont belong to this category. Similar conclusions can be drawn from the other dummy variables. To be very honest, it is very surprising to notice that Rural and remote municipalities tend to have more GDP_CAPITA. One of the major reasons for this is due to lower population as compared to urban areas.

6. Building Hedonic GDP_CAPITA Models using GWmodel

A geographically weighted regression is a local form of linear regression used to model spatially varying relationships. We have already created an exploratory multiple linear regression model for GDP_CAPITA. GWR will increase the accuracy of our model as it will construct a separate equation for every municipality in the dataset incorporating the dependent and explanatory variables of features falling within the bandwidth of each target feauture. This means that each equation is calibrated using a different weighting of the observations contained in the data set. There are many variations of a geographically weighted regression. The parameters chosen for our regression model are described below.

6.1 Building Fixed Bandwidth GWR Model

Rationale for using Gausian Kernel

The assumption in a GWR model is that observations nearby one another have a greater influence on one another’s parameter estimates than observations farther apart. The weight assigned to each observation is based on a distance decay function from the centroid of each municipality.The weights are determined using a kernel, which is a distance decay function that determines how quickly weights decrease as distances increase. We have chosen our kernel as gausian as it assigns a weight of one to the regression feature, and weights for the surrounding features smoothly and gradually decrease as the distance from the regression feature increases. As we have more than 5000 municipalities in our regression analysis, the gausian kernel is the most appropriate weighting scheme as municipalities which are further away from the regression feauture are also considered but are given a very low weightage. This is because the gaussian kernel never reaches zero, hence each municipality have weights for all the other municipalities based on distance.

Rationale for using Fixed bandwidth

There are two types of bandwidths we can use: Fixed or Adapative. Fixed bandwidth will consider all the neighbours in a given distance whereas adaptive bandwidth will consider optimal number of neighbours. Adaptive bandwidths require the calculation of a distance matrix, which will require more than 5000 * 5000 calculations to build the matrix. As we will be publishing this report on RPubs, we will avoid using adaptive bandwidth as the matrix data file itself takes up 250 MB.
For adaptive bandwidth, GWR will compute the optimal distance (for a fixed kernel) while minimising the Cross validation score.

Rationale for using Cross validation Approach

The approach used in calculating neighbour is Cross Validation. In geographically weighted regression, one must determine a window size which will be used to subset the data locally. The cross-validation procedure is used to determine a globally optimal window size. The CV score is better than the alternative approach (AIC score) for this case, as the degrees of freedom change in the AIC approach while adjusting the bandwidth, which is not required in this model.

6.1.1 Computing fixed bandwith

The function bw.gwr() will be used to compute the fixed bandwidth. As described above, we will be using a fixed bandwidth on a gausian kernel which aims to minimise cross validation scores.

bw.fixed <- bw.gwr(formula = GDP_CAPITA.log ~ IDHM  +
          DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Rural_Adjacente + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + Rural_Remoto+ GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC , data = data.sp, approach = "CV", kernel="gaussian", adaptive=FALSE, longlat=TRUE)

6.1.2 GWModel method - fixed bandwith

As we have found the bandwidth, we will create the geogrpahically weighted model and determine if our model has improved by considering the spatial relationships.

gwr.fixed <- gwr.basic(formula = GDP_CAPITA.log ~ IDHM  + DU_URBAN_PERCENTAGE + ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente+ Rural_Adjacente + GVA_AGROPEC_Percentage+ GVA_INDUSTRY_Percentage + GVA_SERVICES_Percentage + Rural_Remoto+ GVA_PUBLIC_Percentage + TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + COMP_PUBLIC , data = data.sp, bw = 520.5, kernel = 'gaussian', longlat = TRUE)
gwr.fixed

##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2020-06-01 03:09:22 
##    Call:
##    gwr.basic(formula = GDP_CAPITA.log ~ IDHM + DU_URBAN_PERCENTAGE + 
##     ACTIVE_PERCENTAGE + PLANTED_AREA_PERCENTAGE + Intermediário_Adjacente + 
##     Rural_Adjacente + GVA_AGROPEC_Percentage + GVA_INDUSTRY_Percentage + 
##     GVA_SERVICES_Percentage + Rural_Remoto + GVA_PUBLIC_Percentage + 
##     TAX_CAPITA + COMP_AGRICULTURE + COMP_INDUSTRY + COMP_SERVICES + 
##     COMP_PUBLIC, data = data.sp, bw = 520.5, kernel = "gaussian", 
##     longlat = TRUE)
## 
##    Dependent (y) variable:  GDP_CAPITA.log
##    Independent variables:  IDHM DU_URBAN_PERCENTAGE ACTIVE_PERCENTAGE PLANTED_AREA_PERCENTAGE Intermediário_Adjacente Rural_Adjacente GVA_AGROPEC_Percentage GVA_INDUSTRY_Percentage GVA_SERVICES_Percentage Rural_Remoto GVA_PUBLIC_Percentage TAX_CAPITA COMP_AGRICULTURE COMP_INDUSTRY COMP_SERVICES COMP_PUBLIC
##    Number of data points: 5552
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.88482 -0.21926 -0.04419  0.16588  2.99060 
## 
##    Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)              5.768e+00  1.752e-01  32.925  < 2e-16 ***
##    IDHM                     4.466e+00  1.303e-01  34.264  < 2e-16 ***
##    DU_URBAN_PERCENTAGE     -2.494e-03  3.473e-04  -7.181 7.82e-13 ***
##    ACTIVE_PERCENTAGE        1.707e-02  2.409e-03   7.086 1.56e-12 ***
##    PLANTED_AREA_PERCENTAGE  5.319e-03  1.961e-04  27.123  < 2e-16 ***
##    Intermediário_Adjacente  3.133e-02  1.738e-02   1.802  0.07161 .  
##    Rural_Adjacente          4.590e-02  1.538e-02   2.985  0.00285 ** 
##    GVA_AGROPEC_Percentage   6.218e-06  6.237e-07   9.969  < 2e-16 ***
##    GVA_INDUSTRY_Percentage  1.030e-05  8.583e-07  11.998  < 2e-16 ***
##    GVA_SERVICES_Percentage -3.347e-06  5.587e-07  -5.990 2.23e-09 ***
##    Rural_Remoto             2.318e-01  2.500e-02   9.272  < 2e-16 ***
##    GVA_PUBLIC_Percentage   -4.187e-06  4.882e-07  -8.578  < 2e-16 ***
##    TAX_CAPITA               7.163e-02  1.929e-03  37.138  < 2e-16 ***
##    COMP_AGRICULTURE         3.388e-03  5.872e-04   5.770 8.38e-09 ***
##    COMP_INDUSTRY           -3.976e-03  8.997e-04  -4.420 1.01e-05 ***
##    COMP_SERVICES            8.147e-03  1.085e-03   7.509 6.90e-14 ***
##    COMP_PUBLIC             -2.965e-03  9.029e-04  -3.284  0.00103 ** 
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 0.3591 on 5535 degrees of freedom
##    Multiple R-squared: 0.724
##    Adjusted R-squared: 0.7232 
##    F-statistic: 907.6 on 16 and 5535 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 713.8964
##    Sigma(hat): 0.3586502
##    AIC:  4403.766
##    AICc:  4403.89
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: gaussian 
##    Fixed bandwidth: 520.5 
##    Regression points: the same locations as observations are used.
##    Distance metric: Great Circle distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                                   Min.     1st Qu.      Median     3rd Qu.
##    Intercept                3.5918e+00  4.5969e+00  6.4049e+00  7.1170e+00
##    IDHM                     7.0023e-01  2.3241e+00  3.0105e+00  3.7890e+00
##    DU_URBAN_PERCENTAGE     -1.2126e-02 -2.9675e-03 -1.5261e-03  1.2966e-03
##    ACTIVE_PERCENTAGE       -4.0936e-02  1.6111e-02  2.0225e-02  3.6364e-02
##    PLANTED_AREA_PERCENTAGE  5.1165e-04  3.7827e-03  4.2643e-03  4.8486e-03
##    Intermediário_Adjacente -8.7305e-02  1.0146e-02  3.1268e-02  5.6163e-02
##    Rural_Adjacente         -9.7066e-02 -1.7413e-02  6.0540e-02  8.0944e-02
##    GVA_AGROPEC_Percentage   4.6438e-07  5.2591e-06  5.5373e-06  6.0705e-06
##    GVA_INDUSTRY_Percentage -8.6256e-06  9.4129e-06  1.0547e-05  1.1588e-05
##    GVA_SERVICES_Percentage -1.0004e-05 -3.0248e-06 -2.6983e-06 -1.6041e-06
##    Rural_Remoto            -1.5925e-01 -3.7663e-04  1.1742e-01  2.7789e-01
##    GVA_PUBLIC_Percentage   -1.3185e-05 -1.0805e-05 -5.8453e-06 -1.8408e-06
##    TAX_CAPITA               5.8351e-02  6.1810e-02  6.8618e-02  1.1677e-01
##    COMP_AGRICULTURE         3.4866e-04  9.5571e-04  2.4816e-03  1.7337e-02
##    COMP_INDUSTRY           -1.2177e-02 -4.1437e-03  7.1787e-05  1.1738e-03
##    COMP_SERVICES            3.0606e-03  9.7136e-03  1.0167e-02  1.0535e-02
##    COMP_PUBLIC             -1.2587e-02 -2.3025e-03  1.5551e-03  3.3808e-03
##                               Max.
##    Intercept               10.2177
##    IDHM                     4.2075
##    DU_URBAN_PERCENTAGE      0.0020
##    ACTIVE_PERCENTAGE        0.0551
##    PLANTED_AREA_PERCENTAGE  0.0204
##    Intermediário_Adjacente  0.1019
##    Rural_Adjacente          0.1641
##    GVA_AGROPEC_Percentage   0.0000
##    GVA_INDUSTRY_Percentage  0.0000
##    GVA_SERVICES_Percentage  0.0000
##    Rural_Remoto             0.4139
##    GVA_PUBLIC_Percentage    0.0000
##    TAX_CAPITA               0.1970
##    COMP_AGRICULTURE         0.0298
##    COMP_INDUSTRY            0.0176
##    COMP_SERVICES            0.0235
##    COMP_PUBLIC              0.0121
##    ************************Diagnostic information*************************
##    Number of data points: 5552 
##    Effective number of parameters (2trace(S) - trace(S'S)): 137.9105 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5414.09 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 2985.55 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 2879.408 
##    Residual sum of squares: 536.2443 
##    R-square value:  0.7927114 
##    Adjusted R-square value:  0.7874303 
## 
##    ***********************************************************************
##    Program stops at: 2020-06-01 03:10:14

By incorporating the GWR model in our multiple regression model, our Adjusted R-Square value has increased from 0.7232 to 0.787427. This means that the spatial weights have justified for 6.42% of the variation in GDP_CAPITA. Due to the spatial weights, our model has significantly improved as nearly 80% of the variation in GDP_CAPITA is justified.

7 Visualisation through Choropleth Maps

We will first obtain the data from the GWR model through the code below.

gdp_capita.fixed <- st_as_sf(gwr.fixed$SDF) %>%
  st_transform(crs=4674)

7.1 Plotting Local R-square values

one <- tm_shape(gdp_capita.fixed)+
  tm_polygons("Local_R2",border.col = "black", border.alpha = 0.2, style="quantile")
two <- tm_shape(data)+
  tm_polygons("GDP_CAPITA",border.col = "black", border.alpha = 0.2, style="quantile")
tmap_arrange(one,two,ncol = 2,nrow = 1)

The above diagram shows the Local R-square value and the GDP_CAPITA for municipalities in Brazil. The point of this comparison is not to analyse each municipality independently but rather to understand the patterns discovered. Local R-square value indicate the amount of variation in the GDP_CAPITA which is due to the dependent variables for each municipality. We can draw strong observations from the map above. The eastern region of Brazil has low GDP_CAPITA, and has the highest R-square values. This means that our GWR model works better for municipalities which have lesser GDP_CAPITA as it is explained from the independent variables chosen. It should also be noted that the similar inverse relationship is observed in the central region of Brazil(which has the highest GDP_CAPITA, but lowest R-square value). Another thing to note is that we have used quartile breaks in GDP_CAPITA, and it is very evident that the last break consists majority of the data. This was done to understand the relationship of between Local R-square and GDP_CAPITA, which would not be possible otherwise due to the extremely skewed nature of the data. The exception to this analysis is the southern region of Brazil which has higher GDP_CAPITA and a relatively high local r-square value. We can infer that there are many other variables which need to be included in order to understand the richer part of Brazil as this model severely lacks that understanding.

7.2 Plotting significant parameter estimates

We will first manupulate the data in order to sort out regions which are statistically significant when comparing the parameter estimates for each of the dependent variables.

gdp_capita.fixed <- gdp_capita.fixed %>%
  mutate(IDHM_TV=abs(IDHM_TV)) %>%
  mutate(DU_URBAN_PERCENTAGE_TV=abs(DU_URBAN_PERCENTAGE_TV)) %>%
  mutate(ACTIVE_PERCENTAGE_TV=abs(ACTIVE_PERCENTAGE_TV)) %>%
  mutate(PLANTED_AREA_PERCENTAGE_TV=abs(PLANTED_AREA_PERCENTAGE_TV)) %>%
  mutate(Intermediário_Adjacente=abs(Intermediário_Adjacente_TV)) %>%
  mutate(Rural_Adjacente=abs(Rural_Adjacente_TV)) %>%
  mutate(GVA_AGROPEC_Percentage=abs(GVA_AGROPEC_Percentage_TV)) %>%
  mutate(GVA_SERVICES_Percentage=abs(GVA_SERVICES_Percentage_TV)) %>%
  mutate(Rural_Remoto_TV=abs(Rural_Remoto_TV)) %>%
  mutate(GVA_PUBLIC_Percentage_TV=abs(GVA_PUBLIC_Percentage_TV)) %>%
  mutate(TAX_CAPITA_TV=abs(TAX_CAPITA_TV)) %>%
  mutate(COMP_AGRICULTURE_TV=abs(COMP_AGRICULTURE_TV)) %>%
  mutate(COMP_INDUSTRY_TV=abs(COMP_INDUSTRY_TV)) %>%
  mutate(COMP_SERVICES_TV=abs(COMP_SERVICES_TV))

The objective of this section is to map out parameter estimates which are significant. It will be plotted along side the GDP_CAPITA raw values in order to draw understanding from the GWR model. Note that the confidence interval used in this report is 95%, hence the municipalities will be considered to have a significant value for a particular independent if the absolute value of the t-statistic is more than or equal to 1.96. In order to perfrom this analysis, the code below creates functions which will help us arrange the choropleth maps in neat manner.

makeMap <- function(variable,tscore){
  one <- tm_shape(gdp_capita.fixed)+
    tm_polygons(variable,border.col = "black", border.alpha = 0.2)
  two <- tm_shape(gdp_capita.fixed)+
    tm_polygons(tscore,border.col = "black", border.alpha = 0.2, breaks=c(0,1.96,Inf))
  three <- tm_shape(data)+
    tm_polygons("GDP_CAPITA",border.col = "black", border.alpha = 0.2, style="quantile")
  
  return(tmap_arrange(one,two,three,ncol=3,nrow = 1))
}

IDHM <- makeMap("IDHM","IDHM_TV")
DU_URBAN_PERCENTAGE <- makeMap("DU_URBAN_PERCENTAGE","DU_URBAN_PERCENTAGE_TV")
ACTIVE_PERCENTAGE <- makeMap("ACTIVE_PERCENTAGE","ACTIVE_PERCENTAGE_TV")
PLANTED_AREA_PERCENTAGE <- makeMap("PLANTED_AREA_PERCENTAGE","PLANTED_AREA_PERCENTAGE_TV")
Intermediário_Adjacente <- makeMap("Intermediário_Adjacente","Intermediário_Adjacente_TV")
Rural_Adjacente <- makeMap("Rural_Adjacente","Rural_Adjacente_TV")
GVA_AGROPEC_Percentage <- makeMap("GVA_AGROPEC_Percentage","GVA_AGROPEC_Percentage_TV")
GVA_INDUSTRY_Percentage <- makeMap("GVA_AGROPEC_Percentage","GVA_INDUSTRY_Percentage_TV")
GVA_SERVICES_Percentage <- makeMap("GVA_SERVICES_Percentage","GVA_SERVICES_Percentage_TV")
Rural_Remoto <- makeMap("Rural_Remoto","Rural_Remoto_TV")
GVA_PUBLIC_Percentage <- makeMap("GVA_PUBLIC_Percentage","GVA_PUBLIC_Percentage_TV")
TAX_CAPITA <- makeMap("TAX_CAPITA","TAX_CAPITA_TV")
COMP_AGRICULTURE <- makeMap("COMP_AGRICULTURE","COMP_AGRICULTURE_TV")
COMP_INDUSTRY <- makeMap("COMP_INDUSTRY","COMP_INDUSTRY_TV")
COMP_SERVICES <- makeMap("COMP_SERVICES","COMP_SERVICES_TV")

Parameter estimates for independent variables

IDHM

It is very prominent that IDHM is significant in almost all the municipalities in Brazil. This is because this index captures education, life expectancy and income, which is key to GDP_CAPITA. This variable was also found to have the highest correlation coefficient with GDP_CAPITA when compared in the multiple linear regression model as well. When we account for spatial variation, IDHM tops the list as the most significant independent variable, indicating that it is one of the most important methods through which GDP_CAPITA can be estimated.

DU_URBAN_PERCENTAGE

The percentage of urban units is found to be significant in the western, eastern and sothern regions of Brazil. It can be inferred that this independent variable is significant only when its is either too low or too high, indicating that it does not play a big factor in determining GDP_CAPITA in municipalities which have average urban housing rates.

ACTIVE_PERCENTAGE

Active percentage reflects the percentage of population which is performing economically active work. Almost all the municipalities consider this variable as a significant variable regardless of their active percentage indicating that it one of the most important factor when determining the GDP_CAPITA.

PLANTED_AREA_PERCENTAGE

Planted_area_percentage reflects the amount of land dedicated to farming. As seen that it is a significant factor in almost all the municipalities in Brazil indicating Brazil’s dependence on the agricultural economy. We can also observe that this proportion is lesser in economically richer areas.

Intermediário_Adjacente

This index determines semi-rural areas. It is evident that these regions have a higher GDP_CAPITA as compared to the other types of regions.

Rural_Adjacente

GVA_AGROPEC_Percentage

This index tells us the percentage of value addition by the agricultural industry. It is significant in almost all the municipalities in Brazil except the northern regions. Agriculture is one of the biggest factors contributing to the GDP_CAPITA, as we have noticed from the other variables analysed above.

GVA_SERVICES_Percentage

This variable is insignificant in almost all the municipalities in Brazil.

Conclusion

The model was significantly improved when the GWR model was adopted. Through this method, we could explain how each independent variable played a role in different regions which was key to our understanding of the different factors which affect the GDP_CAPITA. Local statistics help us justify the findings from global statistics and provide us insights on a deeper level through which we can uncover findings which were not visible before. We understand how spatial relationships play an important role and dig deeper to find out variables which do not affect GDP_CAPITA at all!

assignment4

Amey Rathi

5/27/2020