1 Assignment Overview:

In this take-home exercise, you are tasked with determining the factors that affect the unequal development of Brazil at the municipality level, using the data provided. The specific tasks of the analysis are as follows:

Prepare a choropleth map showing the distribution of GDP per capita, 2016 at municipality level.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using multiple linear regression method.
Prepare a choropleth map showing the distribution of the residual of the GDP per capita.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using geographically weighted regression method.
Prepare a series of choropleth maps showing the outputs of the geographically weighted regression model.

2 Setup

2.1 Loading in the necessary packages

The R packages needed for this exercise are as follows:

Geospatial statistical modelling: GWmodel, heatmaply, spatstat
Spatial data handling: sf, geobr
Attribute data handling: tidyverse, readr, ggplot2 and dplyr
Choropleth mapping: tmap
Saving and loading geospatial data: rgdal (for easier loading of data)

The code chunk below installs any missing packages and loads all of them into the R environment.

packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse', 'geobr','rgdal', 'heatmaply', "spatstat")
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

2.2 Creating testing functions for validity and NA values

# Retrieves a quick breakdown of the number of NA rows and invalid polygons/points
Validity_NA_Check <- function(target_st) {
  validity <- st_is_valid(target_st)
  NA_rows <- target_st[rowSums(is.na(target_st))!=0,]
  Invalid_rows <- which(validity==FALSE)
  print(paste("For:", deparse(substitute(target_st))))
  print(paste("Number of Invalid polygons/points is:", length(Invalid_rows)))
  print(paste("Number of NA rows is:", nrow((NA_rows))))
}

# Retrieves the exact polygon which is invalid
get_invalid <- function(target_st) {
  validity <- st_is_valid(target_st)
  Invalid_rows <- which(validity==FALSE)
  return(Invalid_rows)
}

# Retrieves the exact rows which contain NA values for you to check the columns
get_NA_rows <- function(target_st) {
  NA_rows <- target_st[rowSums(is.na(target_st))!=0,]
  return(NA_rows)
}

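# A cleaning function that replaces NA values in the given column with 0 so that calculations can still be done.
## This function is a little unnecessary as we will not be using the data attached to the geospatial points.
replace_NA_with_zero <- function(x, column_name){
  # column_name is a string, so index with [[ ]]; x$column_name would look for a column literally named "column_name"
  x[[column_name]][is.na(x[[column_name]])] <- 0
  # return the modified object so that the caller can reassign it
  return(x)
}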
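
The NA checks above rely on the `rowSums(is.na(...)) != 0` idiom. A minimal sketch of how it behaves, on a toy data frame that is made up for illustration and is not part of the Brazil data:

```r
# Toy data frame, made up for illustration (not from the Brazil dataset)
toy <- data.frame(
  city = c("A", "B", "C", "D"),
  pop  = c(100, NA, 300, 400),
  gdp  = c(1.5, 2.0, NA, 4.0),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; rowSums() counts the NAs in each row
na_per_row <- rowSums(is.na(toy))

# Keep only the rows that contain at least one NA
na_rows <- toy[na_per_row != 0, ]

print(na_per_row)
print(na_rows$city)
```

Rows B and C are returned because each has one missing value; the columns in which the NAs occur can then be inspected by eye.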

3 Data Wrangling and Formatting

3.1 Aspatial Data Wrangling

3.1.1 Importing the aspatial data

BRAZIL_CITIES.csv is a semicolon-delimited text file. The code chunk below uses the read_delim() function of the readr package to import it into R as a tibble data frame called Brazil_cities_raw.

Brazil_cities_raw = read_delim("data/aspatial/BRAZIL_CITIES.csv", ";")

3.1.2 Importing Data Dictionary as a dataframe for reference

Reference = read_delim("data/aspatial/Data_Dictionary.csv", ";")

3.1.3 Checking input data

summary(Brazil_cities_raw)

##      CITY              STATE              CAPITAL          IBGE_RES_POP     
##  Length:5573        Length:5573        Min.   :0.000000   Min.   :     805  
##  Class :character   Class :character   1st Qu.:0.000000   1st Qu.:    5235  
##  Mode  :character   Mode  :character   Median :0.000000   Median :   10934  
##                                        Mean   :0.004845   Mean   :   34278  
##                                        3rd Qu.:0.000000   3rd Qu.:   23424  
##                                        Max.   :1.000000   Max.   :11253503  
##                                                           NA's   :8         
##  IBGE_RES_POP_BRAS  IBGE_RES_POP_ESTR     IBGE_DU        IBGE_DU_URBAN    
##  Min.   :     805   Min.   :     0.0   Min.   :    239   Min.   :     60  
##  1st Qu.:    5230   1st Qu.:     0.0   1st Qu.:   1572   1st Qu.:    874  
##  Median :   10926   Median :     0.0   Median :   3174   Median :   1846  
##  Mean   :   34200   Mean   :    77.5   Mean   :  10303   Mean   :   8859  
##  3rd Qu.:   23390   3rd Qu.:    10.0   3rd Qu.:   6726   3rd Qu.:   4624  
##  Max.   :11133776   Max.   :119727.0   Max.   :3576148   Max.   :3548433  
##  NA's   :8          NA's   :8          NA's   :10        NA's   :10       
##  IBGE_DU_RURAL      IBGE_POP            IBGE_1            IBGE_1-4     
##  Min.   :    3   Min.   :     174   Min.   :     0.0   Min.   :     5  
##  1st Qu.:  487   1st Qu.:    2801   1st Qu.:    38.0   1st Qu.:   158  
##  Median :  931   Median :    6170   Median :    92.0   Median :   376  
##  Mean   : 1463   Mean   :   27595   Mean   :   383.3   Mean   :  1544  
##  3rd Qu.: 1832   3rd Qu.:   15302   3rd Qu.:   232.0   3rd Qu.:   951  
##  Max.   :33809   Max.   :10463636   Max.   :129464.0   Max.   :514794  
##  NA's   :81      NA's   :8          NA's   :8          NA's   :8       
##     IBGE_5-9        IBGE_10-14       IBGE_15-59         IBGE_60+      
##  Min.   :     7   Min.   :    12   Min.   :     94   Min.   :     29  
##  1st Qu.:   220   1st Qu.:   259   1st Qu.:   1734   1st Qu.:    341  
##  Median :   516   Median :   588   Median :   3841   Median :    722  
##  Mean   :  2069   Mean   :  2381   Mean   :  18212   Mean   :   3004  
##  3rd Qu.:  1300   3rd Qu.:  1478   3rd Qu.:   9628   3rd Qu.:   1724  
##  Max.   :684443   Max.   :783702   Max.   :7058221   Max.   :1293012  
##  NA's   :8        NA's   :8        NA's   :8         NA's   :8        
##  IBGE_PLANTED_AREA   IBGE_CROP_PRODUCTION_$ IDHM Ranking 2010      IDHM       
##  Min.   :      0.0   Min.   :      0        Min.   :   1      Min.   :0.4180  
##  1st Qu.:    910.2   1st Qu.:   2326        1st Qu.:1392      1st Qu.:0.5990  
##  Median :   3471.5   Median :  13846        Median :2783      Median :0.6650  
##  Mean   :  14179.9   Mean   :  57384        Mean   :2783      Mean   :0.6592  
##  3rd Qu.:  11194.2   3rd Qu.:  55619        3rd Qu.:4174      3rd Qu.:0.7180  
##  Max.   :1205669.0   Max.   :3274885        Max.   :5565      Max.   :0.8620  
##  NA's   :3           NA's   :3              NA's   :8         NA's   :8       
##    IDHM_Renda     IDHM_Longevidade IDHM_Educacao         LONG       
##  Min.   :0.4000   Min.   :0.6720   Min.   :0.2070   Min.   :-72.92  
##  1st Qu.:0.5720   1st Qu.:0.7690   1st Qu.:0.4900   1st Qu.:-50.87  
##  Median :0.6540   Median :0.8080   Median :0.5600   Median :-46.52  
##  Mean   :0.6429   Mean   :0.8016   Mean   :0.5591   Mean   :-46.23  
##  3rd Qu.:0.7070   3rd Qu.:0.8360   3rd Qu.:0.6310   3rd Qu.:-41.40  
##  Max.   :0.8910   Max.   :0.8940   Max.   :0.8250   Max.   :-32.44  
##  NA's   :8        NA's   :8        NA's   :8        NA's   :9       
##       LAT               ALT               PAY_TV         FIXED_PHONES    
##  Min.   :-33.688   Min.   :     0.0   Min.   :      1   Min.   :      3  
##  1st Qu.:-22.838   1st Qu.:   169.8   1st Qu.:     88   1st Qu.:    119  
##  Median :-18.089   Median :   406.5   Median :    247   Median :    327  
##  Mean   :-16.444   Mean   :   893.8   Mean   :   3094   Mean   :   6567  
##  3rd Qu.: -8.489   3rd Qu.:   628.9   3rd Qu.:    815   3rd Qu.:   1151  
##  Max.   :  4.585   Max.   :874579.0   Max.   :2047668   Max.   :5543127  
##  NA's   :9         NA's   :9          NA's   :3         NA's   :3        
##       AREA            REGIAO_TUR        CATEGORIA_TUR      ESTIMATED_POP     
##  Min.   :     3.57   Length:5573        Length:5573        Min.   :     786  
##  1st Qu.:   204.44   Class :character   Class :character   1st Qu.:    5454  
##  Median :   416.59   Mode  :character   Mode  :character   Median :   11590  
##  Mean   :  1517.44                                         Mean   :   37432  
##  3rd Qu.:  1026.57                                         3rd Qu.:   25296  
##  Max.   :159533.33                                         Max.   :12176866  
##  NA's   :3                                                 NA's   :3         
##  RURAL_URBAN         GVA_AGROPEC       GVA_INDUSTRY       GVA_SERVICES      
##  Length:5573        Min.   :      0   Min.   :       1   Min.   :        2  
##  Class :character   1st Qu.:   4189   1st Qu.:    1726   1st Qu.:    10112  
##  Mode  :character   Median :  20426   Median :    7424   Median :    31211  
##                     Mean   :  47271   Mean   :  175928   Mean   :   489451  
##                     3rd Qu.:  51227   3rd Qu.:   41022   3rd Qu.:   115406  
##                     Max.   :1402282   Max.   :63306755   Max.   :464656988  
##                     NA's   :3         NA's   :3          NA's   :3          
##    GVA_PUBLIC         GVA_TOTAL             TAXES                GDP           
##  Min.   :       7   Min.   :       17   Min.   :   -14159   Min.   :       15  
##  1st Qu.:   17267   1st Qu.:    42253   1st Qu.:     1305   1st Qu.:    43709  
##  Median :   35866   Median :   119492   Median :     5100   Median :   125153  
##  Mean   :  123768   Mean   :   832987   Mean   :   118864   Mean   :   954584  
##  3rd Qu.:   89245   3rd Qu.:   313963   3rd Qu.:    22197   3rd Qu.:   329539  
##  Max.   :41902893   Max.   :569910503   Max.   :117125387   Max.   :687035890  
##  NA's   :3          NA's   :3           NA's   :3           NA's   :3          
##     POP_GDP           GDP_CAPITA       GVA_MAIN          MUN_EXPENDIT      
##  Min.   :     815   Min.   :  3191   Length:5573        Min.   :1.421e+06  
##  1st Qu.:    5483   1st Qu.:  9058   Class :character   1st Qu.:1.573e+07  
##  Median :   11578   Median : 15870   Mode  :character   Median :2.746e+07  
##  Mean   :   36998   Mean   : 21126                      Mean   :1.043e+08  
##  3rd Qu.:   25085   3rd Qu.: 26155                      3rd Qu.:5.666e+07  
##  Max.   :12038175   Max.   :314638                      Max.   :4.577e+10  
##  NA's   :3          NA's   :3                           NA's   :1492       
##     COMP_TOT            COMP_A            COMP_B            COMP_C        
##  Min.   :     6.0   Min.   :   0.00   Min.   :  0.000   Min.   :    0.00  
##  1st Qu.:    68.0   1st Qu.:   1.00   1st Qu.:  0.000   1st Qu.:    3.00  
##  Median :   162.0   Median :   2.00   Median :  0.000   Median :   11.00  
##  Mean   :   906.8   Mean   :  18.25   Mean   :  1.852   Mean   :   73.44  
##  3rd Qu.:   448.0   3rd Qu.:   8.00   3rd Qu.:  2.000   3rd Qu.:   39.00  
##  Max.   :530446.0   Max.   :1948.00   Max.   :274.000   Max.   :31566.00  
##  NA's   :3          NA's   :3         NA's   :3         NA's   :3         
##      COMP_D             COMP_E            COMP_F             COMP_G        
##  Min.   :  0.0000   Min.   :  0.000   Min.   :    0.00   Min.   :     1.0  
##  1st Qu.:  0.0000   1st Qu.:  0.000   1st Qu.:    1.00   1st Qu.:    32.0  
##  Median :  0.0000   Median :  0.000   Median :    4.00   Median :    74.5  
##  Mean   :  0.4262   Mean   :  2.029   Mean   :   43.26   Mean   :   348.0  
##  3rd Qu.:  0.0000   3rd Qu.:  1.000   3rd Qu.:   15.00   3rd Qu.:   199.0  
##  Max.   :332.0000   Max.   :657.000   Max.   :25222.00   Max.   :150633.0  
##  NA's   :3          NA's   :3         NA's   :3          NA's   :3         
##      COMP_H          COMP_I             COMP_J             COMP_K        
##  Min.   :    0   Min.   :    0.00   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    1   1st Qu.:    2.00   1st Qu.:    0.00   1st Qu.:    0.00  
##  Median :    7   Median :    7.00   Median :    1.00   Median :    0.00  
##  Mean   :   41   Mean   :   55.88   Mean   :   24.74   Mean   :   15.55  
##  3rd Qu.:   25   3rd Qu.:   24.00   3rd Qu.:    5.00   3rd Qu.:    2.00  
##  Max.   :19515   Max.   :29290.00   Max.   :38720.00   Max.   :23738.00  
##  NA's   :3       NA's   :3          NA's   :3          NA's   :3         
##      COMP_L             COMP_M             COMP_N            COMP_O       
##  Min.   :    0.00   Min.   :    0.00   Min.   :    0.0   Min.   :  0.000  
##  1st Qu.:    0.00   1st Qu.:    1.00   1st Qu.:    1.0   1st Qu.:  2.000  
##  Median :    0.00   Median :    4.00   Median :    4.0   Median :  2.000  
##  Mean   :   15.14   Mean   :   51.29   Mean   :   83.7   Mean   :  3.269  
##  3rd Qu.:    3.00   3rd Qu.:   13.00   3rd Qu.:   14.0   3rd Qu.:  3.000  
##  Max.   :14003.00   Max.   :49181.00   Max.   :76757.0   Max.   :204.000  
##  NA's   :3          NA's   :3          NA's   :3         NA's   :3        
##      COMP_P             COMP_Q             COMP_R            COMP_S        
##  Min.   :    0.00   Min.   :    0.00   Min.   :   0.00   Min.   :    0.00  
##  1st Qu.:    2.00   1st Qu.:    1.00   1st Qu.:   0.00   1st Qu.:    5.00  
##  Median :    6.00   Median :    3.00   Median :   2.00   Median :   12.00  
##  Mean   :   30.96   Mean   :   34.15   Mean   :  12.18   Mean   :   51.61  
##  3rd Qu.:   17.00   3rd Qu.:   12.00   3rd Qu.:   6.00   3rd Qu.:   31.00  
##  Max.   :16030.00   Max.   :22248.00   Max.   :6687.00   Max.   :24832.00  
##  NA's   :3          NA's   :3          NA's   :3         NA's   :3         
##      COMP_T      COMP_U              HOTELS            BEDS        
##  Min.   :0   Min.   :  0.00000   Min.   : 1.000   Min.   :    2.0  
##  1st Qu.:0   1st Qu.:  0.00000   1st Qu.: 1.000   1st Qu.:   40.0  
##  Median :0   Median :  0.00000   Median : 1.000   Median :   82.0  
##  Mean   :0   Mean   :  0.05027   Mean   : 3.131   Mean   :  257.5  
##  3rd Qu.:0   3rd Qu.:  0.00000   3rd Qu.: 3.000   3rd Qu.:  200.0  
##  Max.   :0   Max.   :123.00000   Max.   :97.000   Max.   :13247.0  
##  NA's   :3   NA's   :3           NA's   :4686     NA's   :4686     
##   Pr_Agencies        Pu_Agencies         Pr_Bank          Pu_Bank    
##  Min.   :   0.000   Min.   :  0.000   Min.   : 0.000   Min.   :0.00  
##  1st Qu.:   0.000   1st Qu.:  1.000   1st Qu.: 0.000   1st Qu.:1.00  
##  Median :   1.000   Median :  2.000   Median : 1.000   Median :2.00  
##  Mean   :   3.383   Mean   :  2.829   Mean   : 1.312   Mean   :1.58  
##  3rd Qu.:   2.000   3rd Qu.:  2.000   3rd Qu.: 2.000   3rd Qu.:2.00  
##  Max.   :1693.000   Max.   :626.000   Max.   :83.000   Max.   :8.00  
##  NA's   :2231       NA's   :2231      NA's   :2231     NA's   :2231  
##    Pr_Assets           Pu_Assets              Cars          Motorcycles     
##  Min.   :0.000e+00   Min.   :0.000e+00   Min.   :      2   Min.   :      4  
##  1st Qu.:0.000e+00   1st Qu.:4.047e+07   1st Qu.:    602   1st Qu.:    591  
##  Median :3.231e+07   Median :1.339e+08   Median :   1438   Median :   1285  
##  Mean   :9.180e+09   Mean   :6.005e+09   Mean   :   9859   Mean   :   4879  
##  3rd Qu.:1.148e+08   3rd Qu.:4.970e+08   3rd Qu.:   4086   3rd Qu.:   3294  
##  Max.   :1.947e+13   Max.   :8.016e+12   Max.   :5740995   Max.   :1134570  
##  NA's   :2231        NA's   :2231        NA's   :11        NA's   :11       
##  Wheeled_tractor         UBER           MAC             WAL-MART     
##  Min.   :   0.000   Min.   :1      Min.   :  1.000   Min.   : 1.000  
##  1st Qu.:   0.000   1st Qu.:1      1st Qu.:  1.000   1st Qu.: 1.000  
##  Median :   0.000   Median :1      Median :  2.000   Median : 1.000  
##  Mean   :   5.754   Mean   :1      Mean   :  4.277   Mean   : 2.059  
##  3rd Qu.:   1.000   3rd Qu.:1      3rd Qu.:  3.000   3rd Qu.: 1.750  
##  Max.   :3236.000   Max.   :1      Max.   :130.000   Max.   :26.000  
##  NA's   :11         NA's   :5448   NA's   :5407      NA's   :5471    
##   POST_OFFICES    
##  Min.   :  1.000  
##  1st Qu.:  1.000  
##  Median :  1.000  
##  Mean   :  2.081  
##  3rd Qu.:  2.000  
##  Max.   :225.000  
##  NA's   :120

The summary shows NA values in almost every column, so extensive data cleaning is required before the data is useful and regressions can be formulated.

3.1.4 Data Cleaning

3.1.4.1 Observing quality of data

Unfortunately, many rows have missing values, and almost every column is missing some. We will clean the dataset as best we can so that we can formulate the indicators needed to test which variables affect GDP per capita.

3.1.4.2 Checking for duplicates

which(duplicated(Brazil_cities_raw[,1]))

##   [1]   48   50   51   91  142  143  159  179  207  226  261  270  318  352  370
##  [16]  418  434  484  497  508  517  539  551  563  582  583  591  634  635  644
##  [31]  657  670  671  676  677  678  679  693  703  704  709  715  716  717  730
##  [46]  766  813  851  856  857  877  885  939  957  973 1007 1009 1015 1041 1042
##  [61] 1049 1058 1089 1102 1162 1184 1210 1212 1217 1306 1317 1351 1353 1485 1486
##  [76] 1535 1620 1646 1673 1699 1723 1748 1762 1790 1805 1827 1901 1982 2004 2006
##  [91] 2062 2072 2163 2189 2195 2198 2253 2258 2273 2285 2327 2343 2344 2375 2381
## [106] 2393 2465 2489 2514 2531 2539 2547 2557 2640 2652 2661 2662 2702 2707 2713
## [121] 2724 2744 2935 2992 3053 3062 3082 3135 3151 3182 3213 3216 3217 3245 3251
## [136] 3298 3324 3354 3356 3357 3378 3387 3390 3405 3406 3422 3483 3484 3490 3502
## [151] 3521 3533 3536 3552 3580 3625 3635 3659 3670 3693 3702 3764 3785 3789 3811
## [166] 3813 3845 3868 3880 3881 3882 4003 4008 4015 4019 4025 4027 4031 4040 4073
## [181] 4092 4116 4141 4148 4152 4158 4195 4201 4232 4296 4312 4324 4351 4363 4369
## [196] 4370 4397 4401 4402 4403 4407 4408 4409 4411 4419 4422 4423 4424 4433 4454
## [211] 4473 4482 4488 4489 4490 4499 4538 4590 4611 4617 4618 4619 4620 4643 4644
## [226] 4645 4651 4663 4674 4686 4688 4724 4776 4829 4862 4891 4912 4917 4924 4937
## [241] 4941 5027 5038 5074 5077 5085 5115 5145 5156 5159 5162 5164 5191 5207 5222
## [256] 5226 5258 5302 5305 5306 5340 5346 5425 5435 5439 5450 5457 5471 5472 5473
## [271] 5491 5498 5499 5556

which(duplicated(Brazil_cities_raw[,1:2]))

## integer(0)

A large number of city names are repeated in the data, although no combination of CITY and STATE is. The repeated names could cause problems in later joining operations, so we will create unique identifiers by combining CITY with the STATE column before performing any joins.
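
A minimal sketch of the difference between the two duplicate checks, on toy data made up for illustration (the city and state values are placeholders, not rows taken from the dataset):

```r
# Toy data, made up for illustration: the same city name in two states
toy <- data.frame(
  CITY  = c("Bonito", "Bonito", "Natal"),
  STATE = c("MS", "PE", "RN"),
  stringsAsFactors = FALSE
)

# Duplicates on the name alone: row 2 repeats row 1
dup_city <- which(duplicated(toy$CITY))

# Duplicates on the (CITY, STATE) pair: none
dup_pair <- which(duplicated(toy[, c("CITY", "STATE")]))

# A key built from both columns is therefore unique
key <- paste(toy$CITY, toy$STATE, sep = "_")

print(dup_city)
print(dup_pair)
print(anyDuplicated(key))
```

`anyDuplicated()` returns 0 when no element is repeated, which makes it a compact check before a join.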

3.1.4.3 Creating unique identifiers for each row

Brazil_cities_uniques <- cbind(CITY_STATE = paste(Brazil_cities_raw$CITY, Brazil_cities_raw$STATE, sep="_"), Brazil_cities_raw)

which(duplicated(Brazil_cities_uniques[,1]))

## integer(0)

3.1.4.4 Removing columns that contain post-2016 data

For the purpose of our analysis, since we’re looking at contributing factors that might lead to differences in GDP per capita, we will remove the variables that, according to the Data_Dictionary, were recorded after 2016.

NOTE: This is important because building an explanatory model on factors measured after the event would be a logical fallacy and invites reverse causation. Variables such as AREA are unaffected, as they stay constant over time. We will still have enough original and derived variables to perform our analysis.

We will also remove MUN_EXPENDIT because of its large number of missing values (1,492 NA's in the summary above) and our inability to estimate them properly from external sources. With this many rows affected, it is better to drop the whole column than to drop the rows.

Lastly, we will also remove COMP_T, as the summary above shows that it contains only zeros and therefore carries no information.

drops <- c("IBGE_PLANTED_AREA","IBGE_CROP_PRODUCTION_$", "PAY_TV", "FIXED_PHONES", "ESTIMATED_POP", "REGIAO_TUR", "CATEGORIA_TUR", "HOTELS", "BEDS", "Pr_Agencies", "Pu_Agencies", "Pr_Bank", "Pu_Bank", "Pr_Assets", "Pu_Assets", "Cars", "Motorcycles", "Wheeled_tractor", "UBER", "MAC", "WAL-MART", "POST_OFFICES", "MUN_EXPENDIT", "COMP_T")

Brazil_cities_2016 <- Brazil_cities_uniques[ , !(names(Brazil_cities_uniques) %in% drops)]
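
The drop list above was chosen by hand from the data dictionary and the summary output. As a complementary check, the share of missing values per column can be computed directly. This is a minimal sketch on toy data; the 0.5 cut-off is an arbitrary assumption for illustration, not a rule used in this analysis:

```r
# Toy data, made up for illustration (not from the Brazil dataset)
toy <- data.frame(
  GDP_CAPITA   = c(10, 20, 30, 40),
  MUN_EXPENDIT = c(NA, 5, NA, NA),
  HOTELS       = c(NA, NA, NA, 1)
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Hypothetical cut-off: flag columns with more than half of their values missing
threshold <- 0.5
too_sparse <- names(na_share)[na_share > threshold]

# Drop by name, using the same %in% pattern as above
toy_kept <- toy[, !(names(toy) %in% too_sparse), drop = FALSE]

print(na_share)
print(names(toy_kept))
```

`drop = FALSE` keeps the result a data frame even when only one column survives.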

3.1.4.5 Looking for missing dependent variable

If the dependent variable is missing for a city, that city unfortunately cannot be used in our analysis.

Missing_GDP_PC <- Brazil_cities_2016[(is.na(Brazil_cities_2016$GDP_CAPITA))!=0,]
Missing_GDP_PC

##              CITY_STATE            CITY STATE CAPITAL IBGE_RES_POP
## 2702 Lagoa Dos Patos_RS Lagoa Dos Patos    RS       0           NA
## 4482 Santa Teresinha_BA Santa Teresinha    BA       0           NA
## 4606     São Caetano_PE     São Caetano    PE       0           NA
##      IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL
## 2702                NA                NA      NA            NA            NA
## 4482                NA                NA      NA            NA            NA
## 4606                NA                NA      NA            NA            NA
##      IBGE_POP IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## 2702       NA     NA       NA       NA         NA         NA       NA
## 4482       NA     NA       NA       NA         NA         NA       NA
## 4606       NA     NA       NA       NA         NA         NA       NA
##      IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao      LONG
## 2702                NA   NA         NA               NA            NA        NA
## 4482              4493 0.59      0.549            0.804         0.459 -39.52114
## 4606                NA   NA         NA               NA            NA        NA
##            LAT    ALT     AREA RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY
## 2702        NA     NA 10158.75        <NA>          NA           NA
## 4482 -12.77285 222.51       NA        <NA>          NA           NA
## 4606        NA     NA       NA        <NA>          NA           NA
##      GVA_SERVICES GVA_PUBLIC  GVA_TOTAL  TAXES GDP POP_GDP GDP_CAPITA GVA_MAIN
## 2702           NA         NA          NA    NA  NA      NA         NA     <NA>
## 4482           NA         NA          NA    NA  NA      NA         NA     <NA>
## 4606           NA         NA          NA    NA  NA      NA         NA     <NA>
##      COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 2702       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4482       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4606       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##      COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 2702     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4482     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4606     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##      COMP_U
## 2702     NA
## 4482     NA
## 4606     NA

According to Wikipedia, Brazil has 5,570 municipalities, whereas our dataset contains 5,573 rows. The 3 cities with missing GDP per capita are therefore probably surplus entries that are not accounted for in the official count in some way. We will remove them on the assumption that they are irrelevant to our study. Source: https://en.wikipedia.org/wiki/Municipalities_of_Brazil

Brazil_cities_allGDPC <- Brazil_cities_2016[(is.na(Brazil_cities_2016$GDP_CAPITA))==0,]
#summary((Brazil_cities_allGDPC))

3.1.4.6 Checking places with missing Residential Population Data.

Brazil_cities_allGDPC[(is.na(Brazil_cities_allGDPC$IBGE_RES_POP_ESTR))!=0,]

##                CITY_STATE              CITY STATE CAPITAL IBGE_RES_POP
## 472   Balneário Rincão_SC  Balneário Rincão    SC       0           NA
## 3117  Mojuí Dos Campos_PA  Mojuí Dos Campos    PA       0           NA
## 3581 Paraíso Das Águas_MS Paraíso Das Águas    MS       0           NA
## 3761    Pescaria Brava_SC    Pescaria Brava    SC       0           NA
## 3821    Pinto Bandeira_RS    Pinto Bandeira    RS       0           NA
##      IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL
## 472                 NA                NA      NA            NA            NA
## 3117                NA                NA      NA            NA            NA
## 3581                NA                NA      NA            NA            NA
## 3761                NA                NA      NA            NA            NA
## 3821                NA                NA      NA            NA            NA
##      IBGE_POP IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## 472        NA     NA       NA       NA         NA         NA       NA
## 3117       NA     NA       NA       NA         NA         NA       NA
## 3581       NA     NA       NA       NA         NA         NA       NA
## 3761       NA     NA       NA       NA         NA         NA       NA
## 3821       NA     NA       NA       NA         NA         NA       NA
##      IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG LAT
## 472                 NA   NA         NA               NA            NA   NA  NA
## 3117                NA   NA         NA               NA            NA   NA  NA
## 3581                NA   NA         NA               NA            NA   NA  NA
## 3761                NA   NA         NA               NA            NA   NA  NA
## 3821                NA   NA         NA               NA            NA   NA  NA
##      ALT    AREA       RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## 472   NA   63.43 Sem classificação     2045.03     51257.53     96248.50
## 3117  NA 4988.24 Sem classificação    42123.35         7.20     28168.56
## 3581  NA 5061.43 Sem classificação   210844.60    146514.00     68393.39
## 3761  NA  106.85 Sem classificação     3167.11      5812.35        29.46
## 3821  NA  104.86 Sem classificação    19067.89      4366.36      9652.04
##      GVA_PUBLIC  GVA_TOTAL     TAXES       GDP POP_GDP GDP_CAPITA
## 472    52820.64   202371.69 14863.05 217234.75   12212   17788.63
## 3117   55645.41   133135.10  4177.94 137313.05   15548    8831.56
## 3581   36606.37   462358.36 21594.41 483952.77    5251   92163.92
## 3761   39700.00       78.14  4505.77  82645.86    9908    8341.33
## 3821   14620.12       47.71  4064.74  51771.14    2847   18184.45
##                                                                  GVA_MAIN
## 472                                                       Demais serviços
## 3117 Administração, defesa, educação e saúde públicas e seguridade social
## 3581          Agricultura, inclusive apoio à agricultura e a pós colheita
## 3761 Administração, defesa, educação e saúde públicas e seguridade social
## 3821          Agricultura, inclusive apoio à agricultura e a pós colheita
##      COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 472       270      1      1     16      0      2     47    112      8     13
## 3117       78      0      0      3      0      0      2     14      6      0
## 3581      129      5      1      0      1      2      9     57     21      7
## 3761      105      1      1     22      0      2      6     36      7      3
## 3821       63      1      0     12      0      0      4     18      7      5
##      COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 472       3      6     11     10     23      2      3      6      1      5
## 3117      0      0      0      2      2      0     41      2      0      6
## 3581      1      0      0      4      9      2      3      2      0      5
## 3761      1      0      1      1      1      2     14      0      1      6
## 3821      0      0      2      2      2      1      1      1      3      4
##      COMP_U
## 472       0
## 3117      0
## 3581      0
## 3761      0
## 3821      0

3.1.4.7 Removing unusable data rows

Because so much data is missing for these cities, we will remove them: we cannot properly estimate their population for the relevant dates unless that data is provided to us. As there are only 5 such cities, the remaining 5,565 are more than sufficient for the purposes of our analysis.

Brazil_cities_allpop <- Brazil_cities_allGDPC[(is.na(Brazil_cities_allGDPC$IBGE_RES_POP_ESTR))==0,]
#summary(Brazil_cities_allpop)
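
The expected row count after the two removals can be worked out from the figures reported above. A small sketch of the arithmetic; the commented-out last line shows how it could be asserted against the real object, assuming the chunks above have been run:

```r
# Counts taken from the outputs above
n_raw         <- 5573  # rows in Brazil_cities_raw
n_missing_gdp <- 3     # rows without GDP_CAPITA
n_missing_pop <- 5     # rows without residential population data

n_remaining <- n_raw - n_missing_gdp - n_missing_pop
print(n_remaining)

# Assumes Brazil_cities_allpop exists in the session:
# stopifnot(nrow(Brazil_cities_allpop) == n_remaining)
```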

3.1.4.8 Cleaning IBGE_DU_RURAL values

# Brazil_cities_allpop[(is.na(Brazil_cities_allpop$IBGE_DU_RURAL))!=0,]

Comparing the IBGE_DU and IBGE_DU_URBAN values for these rows shows that the NA values stand in for missing 0s, as all of the IBGE_DU are classified as urban. We will therefore fill every missing value in this column with 0.

3.1.4.9 Replacing Missing Values with 0 as per observation

Brazil_cities_allpop$IBGE_DU_RURAL[is.na(Brazil_cities_allpop$IBGE_DU_RURAL)] <- 0
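
The zero-fill rests on the observation that urban units equal total units wherever the rural count is missing. A more guarded variant, sketched on toy data made up for illustration, fills only the rows where that can actually be verified and leaves the rest as NA:

```r
# Toy data, made up for illustration (not from the Brazil dataset)
toy <- data.frame(
  IBGE_DU       = c(100, 250, 80, NA),
  IBGE_DU_URBAN = c(100, 200, 80, NA),
  IBGE_DU_RURAL = c(NA, 50, NA, NA)
)

missing_rural <- is.na(toy$IBGE_DU_RURAL)

# Verified: rural is missing, total is known, and urban already equals the total
verified <- missing_rural &
  !is.na(toy$IBGE_DU) &
  toy$IBGE_DU == toy$IBGE_DU_URBAN

# Fill only the verified rows; row 4 cannot be checked and stays NA
toy$IBGE_DU_RURAL[verified] <- 0

print(toy$IBGE_DU_RURAL)
```

Rows that remain NA after this step are the ones that need values from another source.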

#summary(Brazil_cities_allpop)

3.1.4.10 Dealing with missing IBGE_DU

In the reference, IBGE_DU refers to “Domestic Units”. Upon further investigation, this refers to permanent private housing units. We determined this by viewing the source of the data and reading the additional description at the top of the webpage. Source: https://sidra.ibge.gov.br/tabela/3495

Unfortunately, the source data does not provide the values we need. However, an alternate data source, the IBGE website report, gives a good estimate of them. The values are not exact because of corrections made later on, but after checking against cities where the IBGE_DU values are known, such as Petrolina and Sao Paulo, we can confirm that the data is at least reasonably accurate.

Source: https://cidades.ibge.gov.br/brasil/pb/marcacao/pesquisa/23/25124?tipo=ranking&indicador=29522

From this, we can make a reasonable estimate for the municipalities listed below, which are missing IBGE_DU.

Brazil_cities_allpop[(is.na(Brazil_cities_allpop$IBGE_DU))!=0,]

##       CITY_STATE     CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BRAS
## 2937 Marcação_PB Marcação    PB       0         7609              7609
## 5367 Uiramutã_RR Uiramutã    RR       0         8375              8375
##      IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL IBGE_POP IBGE_1
## 2937                 0      NA            NA             0     2838     45
## 5367                 0      NA            NA             0      794     19
##      IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+ IDHM Ranking 2010  IDHM
## 2937      211      277        266       1701      338              5404 0.529
## 5367       83      129        110        424       29              5561 0.453
##      IDHM_Renda IDHM_Longevidade IDHM_Educacao      LONG       LAT    ALT
## 2937      0.525            0.691         0.408 -35.01392 -6.770054  92.93
## 5367      0.439            0.766         0.276 -60.19572  4.585440 605.80
##         AREA     RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC
## 2937  123.83 Rural Adjacente    23738.38      1724.29     11192.86   37551.82
## 5367 8065.56    Rural Remoto     9864.83      1189.55         4.75      87.28
##       GVA_TOTAL    TAXES      GDP POP_GDP GDP_CAPITA
## 2937    74207.34 1436.36  75643.7    8475    8925.51
## 5367   103089.25    0.59 103680.3    9664   10728.51
##                                                                  GVA_MAIN
## 2937 Administração, defesa, educação e saúde públicas e seguridade social
## 5367 Administração, defesa, educação e saúde públicas e seguridade social
##      COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 2937       36      2      0      2      0      0      0     15      1      1
## 5367        8      0      0      0      0      0      0      7      0      0
##      COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 2937      0      0      0      1      1      2      5      0      0      6
## 5367      0      0      0      0      0      1      0      0      0      0
##      COMP_U
## 2937      0
## 5367      0

3.1.4.11 Replacing missing values with externally sourced data

Brazil_cities_allpop$IBGE_DU[which(Brazil_cities_allpop$CITY_STATE == "Marcação_PB")] <- 2040
Brazil_cities_allpop$IBGE_DU_URBAN[which(Brazil_cities_allpop$CITY_STATE == "Marcação_PB")] <- 824
Brazil_cities_allpop$IBGE_DU_RURAL[which(Brazil_cities_allpop$CITY_STATE == "Marcação_PB")] <- 1216

Brazil_cities_allpop$IBGE_DU[which(Brazil_cities_allpop$CITY_STATE == "Uiramutã_RR")] <- 1444
Brazil_cities_allpop$IBGE_DU_URBAN[which(Brazil_cities_allpop$CITY_STATE == "Uiramutã_RR")] <- 219
Brazil_cities_allpop$IBGE_DU_RURAL[which(Brazil_cities_allpop$CITY_STATE == "Uiramutã_RR")] <- 1225

#summary(Brazil_cities_allpop)
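The manual replacements above repeat the same `which()` lookup for every column. A minimal sketch of how that pattern could be wrapped in a helper is shown below; `set_values_by_key()` and the toy data frame are hypothetical names introduced only for illustration and are not part of the original analysis.

```r
# Hypothetical helper (not from the original analysis): overwrite selected
# columns for the single row whose key column equals `key`.
set_values_by_key <- function(df, key, values, key_col = "CITY_STATE") {
  idx <- which(df[[key_col]] == key)
  # Fail loudly if the key is missing or ambiguous instead of silently doing nothing
  stopifnot(length(idx) == 1)
  for (col in names(values)) {
    df[idx, col] <- values[[col]]
  }
  df
}

# Toy data standing in for Brazil_cities_allpop
toy <- data.frame(
  CITY_STATE    = c("A_XX", "B_YY"),
  IBGE_DU       = c(100, NA),
  IBGE_DU_URBAN = c(60, NA),
  IBGE_DU_RURAL = c(40, NA),
  stringsAsFactors = FALSE
)

toy <- set_values_by_key(
  toy, "B_YY",
  list(IBGE_DU = 2040, IBGE_DU_URBAN = 824, IBGE_DU_RURAL = 1216)
)
```

The `stopifnot()` guard is a design choice for this sketch: a misspelled key stops the script rather than leaving the NA values in place unnoticed.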

3.1.4.12 Investigating missing LONG, LAT and ALT values

Brazil_cities_allpop[(is.na(Brazil_cities_allpop$LONG))!=0,]

##              CITY_STATE            CITY STATE CAPITAL IBGE_RES_POP
## 3806 Pinhal Da Serra_RS Pinhal Da Serra    RS       0         2130
## 4490 Santa Terezinha_BA Santa Terezinha    BA       0         9648
##      IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL
## 3806              2130                 0     745           180           565
## 4490              9648                 0    2891           734          2157
##      IBGE_POP IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## 3806      478     11       22       34         32        312       67
## 4490     2332     40      126      191        217       1419      339
##      IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao LONG LAT
## 3806              3121 0.65      0.641            0.835         0.513   NA  NA
## 4490                NA   NA         NA               NA            NA   NA  NA
##      ALT   AREA     RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## 3806  NA 438.11 Rural Adjacente     56030.9    267670.32        15.85
## 4490  NA 719.26 Rural Adjacente     13235.2      5398.61     17754.37
##      GVA_PUBLIC  GVA_TOTAL     TAXES       GDP POP_GDP GDP_CAPITA
## 3806   19831.52      359.38 25222.60 384602.56    2115  181845.18
## 4490   32630.97    69019.14  3149.33  72168.48   10619    6796.16
##                                                                                  GVA_MAIN
## 3806 Eletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação
## 4490                 Administração, defesa, educação e saúde públicas e seguridade social
##      COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 3806       45      1      0      2      1      1      3     23      2      4
## 4490       74      2      1      4      0      0      3     37      0      3
##      COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 3806      0      0      0      0      1      2      1      1      0      3
## 4490      1      0      0      1      2      2     12      2      0      4
##      COMP_U
## 3806      0
## 4490      0

3.1.4.13 Replacing Missing Latitude and Longitude values

Source: https://www.latlong.net/
Source: https://www.freemaptools.com/elevation-finder.htm

Brazil_cities_allpop$LONG[which(Brazil_cities_allpop$CITY_STATE == "Pinhal Da Serra_RS")] <- -51.171909
Brazil_cities_allpop$LAT[which(Brazil_cities_allpop$CITY_STATE == "Pinhal Da Serra_RS")] <- -27.874420
Brazil_cities_allpop$ALT[which(Brazil_cities_allpop$CITY_STATE == "Pinhal Da Serra_RS")] <- 918

Brazil_cities_allpop$LONG[which(Brazil_cities_allpop$CITY_STATE == "Santa Terezinha_BA")] <- -39.5184
Brazil_cities_allpop$LAT[which(Brazil_cities_allpop$CITY_STATE == "Santa Terezinha_BA")] <- -12.7498
Brazil_cities_allpop$ALT[which(Brazil_cities_allpop$CITY_STATE == "Santa Terezinha_BA")] <- 210

3.1.4.14 Replacing Missing Area Values

Source: https://en.wikipedia.org/wiki/Japur%C3%A1

Brazil_cities_allpop[(is.na(Brazil_cities_allpop$AREA))!=0,]

##      CITY_STATE   CITY STATE CAPITAL IBGE_RES_POP IBGE_RES_POP_BRAS
## 2531  Japurá_AM Japurá    AM       0         7326              7318
##      IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL IBGE_POP IBGE_1
## 2531                 8    1043           583           460     3235     92
##      IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+ IDHM Ranking 2010  IDHM
## 2531      369      435        478       1764       97              5451 0.522
##      IDHM_Renda IDHM_Longevidade IDHM_Educacao     LONG       LAT   ALT AREA
## 2531      0.552            0.748         0.345 -66.9969 -1.880845 69.84   NA
##       RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES GVA_PUBLIC  GVA_TOTAL 
## 2531 Rural Remoto    16398.64       2146.9      9908.92    29244.3        57.7
##        TAXES   GDP POP_GDP GDP_CAPITA
## 2531 1489.89 59.19    4660   12701.43
##                                                                  GVA_MAIN
## 2531 Administração, defesa, educação e saúde públicas e seguridade social
##      COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 2531       16      0      0      0      0      0      0     13      0      0
##      COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 2531      0      0      0      0      1      2      0      0      0      0
##      COMP_U
## 2531      0

Brazil_cities_allpop$AREA[which(Brazil_cities_allpop$CITY_STATE == "Japurá_AM")] <- 55791

#summary(Brazil_cities_allpop)

3.1.4.15 Finding missing Santa Terezinha_BA Values

Brazil_cities_allpop[(is.na(Brazil_cities_allpop$IDHM))!=0,]

##              CITY_STATE            CITY STATE CAPITAL IBGE_RES_POP
## 4490 Santa Terezinha_BA Santa Terezinha    BA       0         9648
##      IBGE_RES_POP_BRAS IBGE_RES_POP_ESTR IBGE_DU IBGE_DU_URBAN IBGE_DU_RURAL
## 4490              9648                 0    2891           734          2157
##      IBGE_POP IBGE_1 IBGE_1-4 IBGE_5-9 IBGE_10-14 IBGE_15-59 IBGE_60+
## 4490     2332     40      126      191        217       1419      339
##      IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao     LONG
## 4490                NA   NA         NA               NA            NA -39.5184
##           LAT ALT   AREA     RURAL_URBAN GVA_AGROPEC GVA_INDUSTRY GVA_SERVICES
## 4490 -12.7498 210 719.26 Rural Adjacente     13235.2      5398.61     17754.37
##      GVA_PUBLIC  GVA_TOTAL    TAXES      GDP POP_GDP GDP_CAPITA
## 4490   32630.97    69019.14 3149.33 72168.48   10619    6796.16
##                                                                  GVA_MAIN
## 4490 Administração, defesa, educação e saúde públicas e seguridade social
##      COMP_TOT COMP_A COMP_B COMP_C COMP_D COMP_E COMP_F COMP_G COMP_H COMP_I
## 4490       74      2      1      4      0      0      3     37      0      3
##      COMP_J COMP_K COMP_L COMP_M COMP_N COMP_O COMP_P COMP_Q COMP_R COMP_S
## 4490      1      0      0      1      2      2     12      2      0      4
##      COMP_U
## 4490      0

Unfortunately, we cannot use this data point, as we are unable to replace the remaining missing values for the Human Development Index variables. For the purposes of this study, this data point will also be excluded.

Brazil_cities_cleaned<- Brazil_cities_allpop[(is.na(Brazil_cities_allpop$IDHM))==0,]

summary(Brazil_cities_cleaned)

##                   CITY_STATE       CITY              STATE          
##  Abadia De Goiás_GO    :   1   Length:5564        Length:5564       
##  Abadia Dos Dourados_MG:   1   Class :character   Class :character  
##  Abadiânia_GO          :   1   Mode  :character   Mode  :character  
##  Abaeté_MG             :   1                                        
##  Abaetetuba_PA         :   1                                        
##  Abaiara_CE            :   1                                        
##  (Other)               :5558                                        
##     CAPITAL          IBGE_RES_POP      IBGE_RES_POP_BRAS  IBGE_RES_POP_ESTR  
##  Min.   :0.000000   Min.   :     805   Min.   :     805   Min.   :     0.00  
##  1st Qu.:0.000000   1st Qu.:    5234   1st Qu.:    5228   1st Qu.:     0.00  
##  Median :0.000000   Median :   10935   Median :   10930   Median :     0.00  
##  Mean   :0.004853   Mean   :   34282   Mean   :   34205   Mean   :    77.52  
##  3rd Qu.:0.000000   3rd Qu.:   23446   3rd Qu.:   23392   3rd Qu.:    10.00  
##  Max.   :1.000000   Max.   :11253503   Max.   :11133776   Max.   :119727.00  
##                                                                              
##     IBGE_DU        IBGE_DU_URBAN     IBGE_DU_RURAL        IBGE_POP       
##  Min.   :    239   Min.   :     60   Min.   :    0.0   Min.   :     174  
##  1st Qu.:   1572   1st Qu.:    874   1st Qu.:  471.8   1st Qu.:    2802  
##  Median :   3174   Median :   1845   Median :  918.5   Median :    6174  
##  Mean   :  10301   Mean   :   8857   Mean   : 1443.8   Mean   :   27599  
##  3rd Qu.:   6726   3rd Qu.:   4622   3rd Qu.: 1813.0   3rd Qu.:   15303  
##  Max.   :3576148   Max.   :3548433   Max.   :33809.0   Max.   :10463636  
##                                                                          
##      IBGE_1            IBGE_1-4           IBGE_5-9        IBGE_10-14      
##  Min.   :     0.0   Min.   :     5.0   Min.   :     7   Min.   :    12.0  
##  1st Qu.:    38.0   1st Qu.:   158.0   1st Qu.:   220   1st Qu.:   259.8  
##  Median :    92.0   Median :   376.5   Median :   516   Median :   588.5  
##  Mean   :   383.3   Mean   :  1544.8   Mean   :  2070   Mean   :  2381.8  
##  3rd Qu.:   232.0   3rd Qu.:   951.2   3rd Qu.:  1300   3rd Qu.:  1478.2  
##  Max.   :129464.0   Max.   :514794.0   Max.   :684443   Max.   :783702.0  
##                                                                           
##    IBGE_15-59         IBGE_60+         IDHM Ranking 2010      IDHM       
##  Min.   :     94   Min.   :     29.0   Min.   :   1      Min.   :0.4180  
##  1st Qu.:   1735   1st Qu.:    341.0   1st Qu.:1392      1st Qu.:0.5990  
##  Median :   3842   Median :    722.5   Median :2782      Median :0.6650  
##  Mean   :  18215   Mean   :   3004.7   Mean   :2783      Mean   :0.6592  
##  3rd Qu.:   9629   3rd Qu.:   1724.2   3rd Qu.:4173      3rd Qu.:0.7180  
##  Max.   :7058221   Max.   :1293012.0   Max.   :5565      Max.   :0.8620  
##                                                                          
##    IDHM_Renda     IDHM_Longevidade IDHM_Educacao         LONG       
##  Min.   :0.4000   Min.   :0.6720   Min.   :0.2070   Min.   :-72.92  
##  1st Qu.:0.5720   1st Qu.:0.7690   1st Qu.:0.4900   1st Qu.:-50.87  
##  Median :0.6540   Median :0.8080   Median :0.5600   Median :-46.52  
##  Mean   :0.6429   Mean   :0.8016   Mean   :0.5591   Mean   :-46.23  
##  3rd Qu.:0.7070   3rd Qu.:0.8360   3rd Qu.:0.6310   3rd Qu.:-41.41  
##  Max.   :0.8910   Max.   :0.8940   Max.   :0.8250   Max.   :-32.44  
##                                                                     
##       LAT               ALT                AREA           RURAL_URBAN       
##  Min.   :-33.688   Min.   :     0.0   Min.   :     3.57   Length:5564       
##  1st Qu.:-22.839   1st Qu.:   169.8   1st Qu.:   204.53   Class :character  
##  Median :-18.091   Median :   406.5   Median :   416.59   Mode  :character  
##  Mean   :-16.447   Mean   :   894.0   Mean   :  1525.29                     
##  3rd Qu.: -8.489   3rd Qu.:   629.1   3rd Qu.:  1026.44                     
##  Max.   :  4.585   Max.   :874579.0   Max.   :159533.33                     
##                                                                             
##   GVA_AGROPEC       GVA_INDUSTRY       GVA_SERVICES         GVA_PUBLIC      
##  Min.   :      0   Min.   :       1   Min.   :        2   Min.   :       7  
##  1st Qu.:   4192   1st Qu.:    1725   1st Qu.:    10113   1st Qu.:   17258  
##  Median :  20432   Median :    7428   Median :    31214   Median :   35837  
##  Mean   :  47270   Mean   :  176080   Mean   :   489940   Mean   :  123860  
##  3rd Qu.:  51239   3rd Qu.:   41015   3rd Qu.:   115552   3rd Qu.:   89328  
##  Max.   :1402282   Max.   :63306755   Max.   :464656988   Max.   :41902893  
##                                                                             
##    GVA_TOTAL             TAXES                GDP               POP_GDP        
##  Min.   :       17   Min.   :   -14159   Min.   :       15   Min.   :     815  
##  1st Qu.:    42254   1st Qu.:     1302   1st Qu.:    43691   1st Qu.:    5486  
##  Median :   119492   Median :     5108   Median :   125153   Median :   11584  
##  Mean   :   833729   Mean   :   118983   Mean   :   955425   Mean   :   37028  
##  3rd Qu.:   314039   3rd Qu.:    22219   3rd Qu.:   329733   3rd Qu.:   25105  
##  Max.   :569910503   Max.   :117125387   Max.   :687035890   Max.   :12038175  
##                                                                                
##    GDP_CAPITA       GVA_MAIN            COMP_TOT            COMP_A       
##  Min.   :  3191   Length:5564        Min.   :     6.0   Min.   :   0.00  
##  1st Qu.:  9062   Class :character   1st Qu.:    68.0   1st Qu.:   1.00  
##  Median : 15870   Mode  :character   Median :   162.0   Median :   2.00  
##  Mean   : 21122                      Mean   :   907.6   Mean   :  18.27  
##  3rd Qu.: 26155                      3rd Qu.:   449.2   3rd Qu.:   8.00  
##  Max.   :314638                      Max.   :530446.0   Max.   :1948.00  
##                                                                          
##      COMP_B            COMP_C             COMP_D             COMP_E       
##  Min.   :  0.000   Min.   :    0.00   Min.   :  0.0000   Min.   :  0.000  
##  1st Qu.:  0.000   1st Qu.:    3.00   1st Qu.:  0.0000   1st Qu.:  0.000  
##  Median :  0.000   Median :   11.00   Median :  0.0000   Median :  0.000  
##  Mean   :  1.853   Mean   :   73.51   Mean   :  0.4265   Mean   :  2.031  
##  3rd Qu.:  2.000   3rd Qu.:   39.00   3rd Qu.:  0.0000   3rd Qu.:  1.000  
##  Max.   :274.000   Max.   :31566.00   Max.   :332.0000   Max.   :657.000  
##                                                                           
##      COMP_F             COMP_G             COMP_H             COMP_I        
##  Min.   :    0.00   Min.   :     1.0   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    1.00   1st Qu.:    32.0   1st Qu.:    1.00   1st Qu.:    2.00  
##  Median :    4.00   Median :    75.0   Median :    7.00   Median :    7.00  
##  Mean   :   43.29   Mean   :   348.3   Mean   :   41.03   Mean   :   55.93  
##  3rd Qu.:   15.00   3rd Qu.:   200.0   3rd Qu.:   25.00   3rd Qu.:   24.00  
##  Max.   :25222.00   Max.   :150633.0   Max.   :19515.00   Max.   :29290.00  
##                                                                             
##      COMP_J             COMP_K             COMP_L             COMP_M        
##  Min.   :    0.00   Min.   :    0.00   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    0.00   1st Qu.:    1.00  
##  Median :    1.00   Median :    0.00   Median :    0.00   Median :    4.00  
##  Mean   :   24.77   Mean   :   15.57   Mean   :   15.15   Mean   :   51.34  
##  3rd Qu.:    5.00   3rd Qu.:    2.00   3rd Qu.:    3.00   3rd Qu.:   13.00  
##  Max.   :38720.00   Max.   :23738.00   Max.   :14003.00   Max.   :49181.00  
##                                                                             
##      COMP_N             COMP_O            COMP_P             COMP_Q        
##  Min.   :    0.00   Min.   :  1.000   Min.   :    0.00   Min.   :    0.00  
##  1st Qu.:    1.00   1st Qu.:  2.000   1st Qu.:    2.00   1st Qu.:    1.00  
##  Median :    4.00   Median :  2.000   Median :    6.00   Median :    3.00  
##  Mean   :   83.78   Mean   :  3.271   Mean   :   30.98   Mean   :   34.18  
##  3rd Qu.:   14.00   3rd Qu.:  3.000   3rd Qu.:   17.00   3rd Qu.:   12.00  
##  Max.   :76757.00   Max.   :204.000   Max.   :16030.00   Max.   :22248.00  
##                                                                            
##      COMP_R            COMP_S             COMP_U         
##  Min.   :   0.00   Min.   :    0.00   Min.   :  0.00000  
##  1st Qu.:   0.00   1st Qu.:    5.00   1st Qu.:  0.00000  
##  Median :   2.00   Median :   12.00   Median :  0.00000  
##  Mean   :  12.19   Mean   :   51.66   Mean   :  0.05032  
##  3rd Qu.:   6.00   3rd Qu.:   31.00   3rd Qu.:  0.00000  
##  Max.   :6687.00   Max.   :24832.00   Max.   :123.00000  
##

3.1.4.16 Summary of Data Cleaning

Overall, we lost a total of 9 rows during data cleaning: 3 were missing the dependent variable (GDP per capita), 5 were missing a large number of variables, and 1 was missing IDHM values.

Overall, we reduced the number of variables from 81 to 58. We added 1 variable as a unique identifier for each municipality, removed 22 variables because their data was collected after the year of our dependent variable (2016), removed 1 because all of its values were 0, and removed 1 because a large portion of its values were missing.

3.1.5 Data Processing

In order to formulate our indicators, we need to create some derived variables so that the indicators in our explanatory model are not correlated with one another, or with the dependent variable, through some underlying factor. Since our dependent variable is divided by population, we need to process the values that depend on population in some way.

We will take the following approaches:

Using Ratios rather than counts for metrics where we have totals. E.g. (foreign resident population / total resident population)
Using the values divided by POP_GDP which is the population scale used to formulate GDP Per capita.

3.1.5.1 Categorical Data Handling

We can derive more variables for our analysis by converting categorical variables into binary arrays. This will allow us to retain our categorical variables during our regression by making them into dummy variables.

Examining GVA_MAIN

unique(Brazil_cities_cleaned[,37])

##  [1] "Demais serviços"                                                                     
##  [2] "Administração, defesa, educação e saúde públicas e seguridade social"                
##  [3] "Agricultura, inclusive apoio à agricultura e a pós colheita"                         
##  [4] "Indústrias de transformação"                                                         
##  [5] "Pecuária, inclusive apoio à pecuária"                                                
##  [6] "Eletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação"
##  [7] "Comércio e reparação de veículos automotores e motocicletas"                         
##  [8] "Indústrias extrativas"                                                               
##  [9] "Construção"                                                                          
## [10] "Produção florestal, pesca e aquicultura"

Examining RURAL_URBAN

unique(Brazil_cities_cleaned[,27])

## [1] "Urbano"                  "Rural Adjacente"        
## [3] "Rural Remoto"            "Intermediário Adjacente"
## [5] "Intermediário Remoto"

Creating Dummy Variable Arrays

Brazil_cities_CAT <- cbind(Brazil_cities_cleaned, as.data.frame(with(Brazil_cities_cleaned, model.matrix(~ RURAL_URBAN + 0))))
Brazil_cities_CAT <- cbind(Brazil_cities_CAT, as.data.frame(with(Brazil_cities_cleaned, model.matrix(~ GVA_MAIN + 0))))
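As a minimal sketch of what `model.matrix(~ x + 0)` produces, the toy example below (with a hypothetical data frame `toy`, not the real dataset) builds one 0/1 column per category level.

```r
# Toy stand-in for the RURAL_URBAN column
toy <- data.frame(
  RURAL_URBAN = c("Urbano", "Rural Remoto", "Urbano", "Rural Adjacente"),
  stringsAsFactors = FALSE
)

# "+ 0" drops the intercept, so every level gets its own indicator column
dummies <- as.data.frame(model.matrix(~ RURAL_URBAN + 0, data = toy))

print(dummies)
```

Because `+ 0` keeps a column for every level, each row's indicators sum to 1. If all of them enter a regression together with an intercept they are perfectly collinear, so one level per categorical variable is usually left out at the modelling stage.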

Dropping Categorical Columns

dropCategorical <- c("GVA_MAIN", "RURAL_URBAN")

Brazil_cities_withDummy <- Brazil_cities_CAT[ , !(names(Brazil_cities_CAT) %in% dropCategorical)]

3.1.5.2 Building Multiple Ratios

In order to control for population differences, we can take ratios instead of raw counts to get a better understanding of the makeup of each town.

3.1.5.2.1 Reworking GVA Totals

After examining the data and its source, there appears to be an error in the GVA totals. This would greatly affect our GVA ratios; on inspection of the source data, all other GVA values are correct except the totals. It is not clear where the total values come from, so we will replace them by summing the values of each GVA category to form new GVA totals.

Brazil_cities_withDummy <-  Brazil_cities_withDummy %>%
   mutate(` GVA_TOTAL ` = as.numeric(rowSums(.[27:30])))

Brazil_cities_Derived <- Brazil_cities_withDummy %>%
  # Foreign vs Local population
  mutate(RES_BRAZ_POP_RATIO = ifelse((IBGE_RES_POP_BRAS == 0), 0, (IBGE_RES_POP_BRAS/IBGE_RES_POP))) %>%
  mutate(RES_FOREIGN_POP_RATIO = ifelse((IBGE_RES_POP_ESTR == 0), 0, (IBGE_RES_POP_ESTR/IBGE_RES_POP))) %>%
  # Rural vs Urban Domestic Units
  mutate(DOM_URBAN_RATIO = ifelse((IBGE_DU_URBAN == 0), 0, (IBGE_DU_URBAN/IBGE_DU)))%>%
  mutate(DOM_RURAL_RATIO = ifelse((IBGE_DU_RURAL == 0), 0, (IBGE_DU_RURAL/IBGE_DU)))%>%
  # Residential Population Age Ratios
  mutate(POP_BEL_ONE_RATIO = ifelse((IBGE_1 == 0), 0, (IBGE_1/IBGE_POP)))%>%
  mutate(POP_ONE_to_FOUR_RATIO = ifelse((`IBGE_1-4` == 0), 0, (`IBGE_1-4`/IBGE_POP)))%>%
  mutate(POP_FIVE_to_NINE_RATIO = ifelse((`IBGE_5-9` == 0), 0, (`IBGE_5-9`/IBGE_POP)))%>%
  mutate(POP_TEN_to_FOURTEEN_RATIO = ifelse((`IBGE_10-14` == 0), 0, (`IBGE_10-14`/IBGE_POP)))%>%
  mutate(POP_WORKING_RATIO = ifelse((`IBGE_15-59` == 0), 0, (`IBGE_15-59`/IBGE_POP))) %>%
  mutate(POP_ELDERLY_RATIO = ifelse((`IBGE_60+` == 0), 0, (`IBGE_60+`/IBGE_POP)))%>%
  # Gross Added Value Ratios
  mutate(GVA_AGROPEC_RATIO = ifelse((GVA_AGROPEC == 0), 0, (GVA_AGROPEC/as.numeric(` GVA_TOTAL `))))%>%
  mutate(GVA_INDUSTRY_RATIO = ifelse((GVA_INDUSTRY == 0), 0, (GVA_INDUSTRY/as.numeric(` GVA_TOTAL `))))%>%
  mutate(GVA_SERVICES_RATIO = ifelse((GVA_SERVICES == 0), 0, (GVA_SERVICES/as.numeric(` GVA_TOTAL `))))%>%
  mutate(GVA_PUBLIC_RATIO = ifelse((GVA_PUBLIC == 0), 0, (GVA_PUBLIC/as.numeric(` GVA_TOTAL `))))%>%
  # Company Ratios
  mutate(COM_A_RATIO = ifelse((COMP_A == 0), 0, (COMP_A/COMP_TOT)))%>%
  mutate(COM_B_RATIO = ifelse((COMP_B == 0), 0, (COMP_B/COMP_TOT)))%>%
  mutate(COM_C_RATIO = ifelse((COMP_C == 0), 0, (COMP_C/COMP_TOT)))%>%
  mutate(COM_D_RATIO = ifelse((COMP_D == 0), 0, (COMP_D/COMP_TOT)))%>%
  mutate(COM_E_RATIO = ifelse((COMP_E == 0), 0, (COMP_E/COMP_TOT)))%>%
  mutate(COM_F_RATIO = ifelse((COMP_F == 0), 0, (COMP_F/COMP_TOT)))%>%
  mutate(COM_G_RATIO = ifelse((COMP_G == 0), 0, (COMP_G/COMP_TOT)))%>%
  mutate(COM_H_RATIO = ifelse((COMP_H == 0), 0, (COMP_H/COMP_TOT)))%>%
  mutate(COM_I_RATIO = ifelse((COMP_I == 0), 0, (COMP_I/COMP_TOT)))%>%
  mutate(COM_J_RATIO = ifelse((COMP_J == 0), 0, (COMP_J/COMP_TOT)))%>%
  mutate(COM_K_RATIO = ifelse((COMP_K == 0), 0, (COMP_K/COMP_TOT)))%>%
  mutate(COM_L_RATIO = ifelse((COMP_L == 0), 0, (COMP_L/COMP_TOT)))%>%
  mutate(COM_M_RATIO = ifelse((COMP_M == 0), 0, (COMP_M/COMP_TOT)))%>%
  mutate(COM_N_RATIO = ifelse((COMP_N == 0), 0, (COMP_N/COMP_TOT)))%>%
  mutate(COM_O_RATIO = ifelse((COMP_O == 0), 0, (COMP_O/COMP_TOT)))%>%
  mutate(COM_P_RATIO = ifelse((COMP_P == 0), 0, (COMP_P/COMP_TOT)))%>%
  mutate(COM_Q_RATIO = ifelse((COMP_Q == 0), 0, (COMP_Q/COMP_TOT)))%>%
  mutate(COM_R_RATIO = ifelse((COMP_R == 0), 0, (COMP_R/COMP_TOT)))%>%
  mutate(COM_S_RATIO = ifelse((COMP_S == 0), 0, (COMP_S/COMP_TOT)))%>%
  mutate(COM_U_RATIO = ifelse((COMP_U == 0), 0, (COMP_U/COMP_TOT)))
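The total-then-ratio logic above can be sketched compactly on toy data. This is an illustrative variation rather than the code used in the analysis: `safe_ratio()`, `GVA_TOTAL_NEW` and the toy values are hypothetical, and the guard here tests the denominator for zero (the chunk above tests the numerator).

```r
# Toy stand-in for the four GVA component columns
toy <- data.frame(
  GVA_AGROPEC  = c(10, 0, 0),
  GVA_INDUSTRY = c(30, 5, 0),
  GVA_SERVICES = c(40, 5, 0),
  GVA_PUBLIC   = c(20, 10, 0)
)
parts <- c("GVA_AGROPEC", "GVA_INDUSTRY", "GVA_SERVICES", "GVA_PUBLIC")

# Rebuild the total from its components
toy$GVA_TOTAL_NEW <- rowSums(toy[, parts])

# Return 0 instead of NaN/Inf when the denominator is zero
safe_ratio <- function(num, den) ifelse(den == 0, 0, num / den)

# One ratio column per component
for (p in parts) {
  toy[[paste0(p, "_RATIO")]] <- safe_ratio(toy[[p]], toy$GVA_TOTAL_NEW)
}
```

Looping over the component names keeps the ratio definition in one place, which may be easier to audit than one `mutate()` line per column.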

3.1.5.3 Creating Population Density Indicator

Brazil_cities_Derived <-  Brazil_cities_Derived %>%
   mutate(POP_DENSITY = POP_GDP/AREA)

#summary(Brazil_cities_Derived)

The data looks good. Note that the ratios already lie between 0 and 1 (some with a maximum below 1), so it would be prudent not to normalize them.

3.1.6 Plotting Derived Indicators

3.1.6.1 Plotting Foreign vs Local Residents

Brazil_cities_Derived[73:74]%>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

3.1.6.2 Plotting Domestic Rural versus Urban

Brazil_cities_Derived[75:76] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

3.1.6.3 Plotting Age Ratios

Brazil_cities_Derived[77:82] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

3.1.6.4 GVA Ratios

Brazil_cities_Derived[83:86] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

3.1.6.5 Company Type Ratios

Brazil_cities_Derived[87:106] %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

3.2 Geospatial Data Wrangling

3.2.1 Converting Aspatial Data into Geospatial Point Dataframe

Brazil_cities.sf <- st_as_sf(Brazil_cities_Derived,
                            coords = c("LONG", "LAT"),
                            crs=4326) %>%
  st_transform(crs=4674)
#head(Brazil_cities.sf)

We transform the CRS to EPSG:4674 (SIRGAS 2000), as per the geobr documentation, in order to accurately map the data points to the municipality map of Brazil.

3.2.1.1 Validity checking data

Validity_NA_Check(Brazil_cities.sf)

## [1] "For: Brazil_cities.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 0"

3.2.2 Importing Municipal Geospatial Data

#muni.sf <- read_municipality(year=2010)

We load the 2010 municipalities so that the boundaries align with the latitude and longitude data in our aspatial dataset, which is dated 2010. The call is commented out because we save the data locally after cleaning to reduce processing time.

3.2.3 Inspecting Geospatial Data

#Validity_NA_Check(muni.sf)

#muni.sf <- st_make_valid(muni.sf)
#Validity_NA_Check(muni.sf)

#muni.sp <- as_Spatial(muni.sf)
#writeOGR(muni.sp, "./data/geospatial", "Brazil_Muni", driver="ESRI Shapefile")

The above were commented out to reduce loading times. We will load in the file locally and check the validity.

tmap_mode("plot")
muni_loaded.sf <- st_read(dsn="data/geospatial", layer="Brazil_Muni")

## Reading layer `Brazil_Muni' from data source `D:\GSA\Take_Home_EX04\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 5567 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -73.99045 ymin: -33.75208 xmax: -28.83609 ymax: 5.271841
## geographic CRS: GRS 1980(IUGG, 1980)

st_crs(muni_loaded.sf) <- 4674
qtm(muni_loaded.sf)

Validity_NA_Check(muni_loaded.sf)

## [1] "For: muni_loaded.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 0"

3.2.3.1 Creating unique identifier

muni_loaded_w_unique.sf <- cbind(CITY_STATE_M = paste(muni_loaded.sf$name_mn, muni_loaded.sf$abbrv_s, sep="_"), muni_loaded.sf)

3.2.4 Mapping points on map

tm_shape(muni_loaded_w_unique.sf)+
  tm_fill(col= "code_mn")+
  tm_shape(Brazil_cities.sf)+
  tm_dots(size = 0.01)

Based on the map above, we can observe that the points are accurately mapped to their respective municipalities in Brazil. We will create a combined dataframe to allow us to perform the next phase, choropleth mapping.

#Brazil_cities.sf <- Brazil_cities.sf[!(Brazil_cities.sf$CITY_STATE =="Fernando De Noronha_PE"), ]

tmap_mode("plot")

3.2.5 Building SuperFrame

Brazil_super.sf <- st_join(muni_loaded_w_unique.sf, Brazil_cities.sf, join=st_intersects)

3.2.5.1 Validating and cleaning superframe

Validity_NA_Check(Brazil_super.sf)

## [1] "For: Brazil_super.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 3"

Checking NA Row locations

temp_NA <- Brazil_super.sf[rowSums(is.na(Brazil_super.sf))!=0,]
as.character(temp_NA$name_mn)

## [1] "Santa Teresinha" "Lagoa Mirim"     "Lagoa Dos Patos"

Based on the data above, we can see that 2 of the polygons with NA values are lakes and the last one is Santa Teresinha, which we removed during data cleaning because of missing values. This means that the rest of the polygons should have the data mapped to them correctly, unless a polygon contains more than one point.

Removing NA rows

Brazil_super_cleaned.sf<- Brazil_super.sf[rowSums(is.na(Brazil_super.sf))==0,]
Validity_NA_Check(Brazil_super_cleaned.sf)

## [1] "For: Brazil_super_cleaned.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 0"

Checking for duplicates

dim(Brazil_super_cleaned.sf[duplicated(Brazil_super_cleaned.sf$CITY_STATE.x),])

## [1]   0 110

There appear to be no duplicate rows, which means that each polygon has only one data point attached to it.
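A check of this kind depends on the column passed to `duplicated()` actually existing: in base R, `$` on a data frame returns `NULL` for a name that does not match any column, and the filter then yields zero rows whatever the data contains. The sketch below uses toy data (the names `toy` and `POLY_ID` are hypothetical, not from the analysis) to show that pitfall and a simple guard, so it may be worth confirming the column name before relying on an empty result.

```r
# Toy joined table: polygon "p2" received two points
toy <- data.frame(
  POLY_ID    = c("p1", "p2", "p2", "p3"),
  CITY_STATE = c("A_XX", "B_YY", "C_ZZ", "D_WW"),
  stringsAsFactors = FALSE
)

# A column name that does not exist gives NULL, so this is empty by construction
vacuous <- toy[duplicated(toy$CITY_STATE.x), ]

# Guard: assert the column exists, then test the polygon identifier itself
stopifnot("POLY_ID" %in% names(toy))
dup_polys <- toy[duplicated(toy$POLY_ID), ]

print(nrow(vacuous))
print(dup_polys)
```

Testing the polygon identifier rather than the point identifier is the choice made in this sketch, since a polygon that received two points appears as two rows with the same polygon key.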

4 Choropleth Map plotting

tmap_mode("plot")
tm_shape(Brazil_super_cleaned.sf)+
  tm_fill(col= "GDP_CAPITA",
          style="jenks",
          title = "GDP per Capita",
          palette ="Greens")+
  tm_layout(main.title = "Distribution of GDP per Capita by Municipality \n(Jenks classification)",
            main.title.position = "center",
            main.title.size = 1,
            legend.height = 0.45, 
            legend.width = 0.35,
            legend.outside = FALSE,
            legend.position = c("right", "bottom"),
            frame = FALSE) +
  tm_borders(alpha = 0.1)

Based on the map above, we can see a surprising result for GDP per capita. The highest values appear in the satellite cities around Sao Paulo rather than in the main city itself. Additionally, very far inland, in areas like Selviria and Campos De Júlio, we can also see concentrations of higher GDP per capita. This could be due to a lower population in regions that still generate a large amount of production. This is surprising given the larger areas of these polygons.

What is even more surprising is that the two main cities in Brazil, Rio De Janeiro and Sao Paulo, only have GDP per capita of 50,690 and 57,071 respectively. This is most likely due to a much larger population concentrated in these smaller areas, which is concerning from a social development standpoint.

5 Multiple Linear Regression

5.1 Data Preparation for Regression

5.1.1 Removal of unnecessary columns

dropsAbrev <- c("CITY_STATE_M", "code_mn", "name_mn", "cod_stt", "abbrv_s", "CITY", "STATE")

Brazil_reg.sf <- Brazil_super_cleaned.sf[ , !(names(Brazil_super_cleaned.sf) %in% dropsAbrev)]

5.1.2 Separating Categorical from Numerical

Brazil_numeric_vars <- cbind(Brazil_reg.sf[,3:28]%>%
  st_set_geometry(NULL), Brazil_reg.sf[,32:52]%>%
  st_set_geometry(NULL), Brazil_reg.sf[,102]%>%
  st_set_geometry(NULL))
  
Brazil_numeric_vars.norm <- normalize(Brazil_numeric_vars)

Brazil_Ratios_vars <- Brazil_reg.sf[,68:101] %>%
  st_set_geometry(NULL)

Brazil_Categorical_vars <- cbind(Brazil_reg.sf[,2]%>%
  st_set_geometry(NULL), Brazil_reg.sf[,53:67]%>%
  st_set_geometry(NULL))
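The numeric variables above are passed through `normalize()`. Assuming this is the 0-1 rescaling provided by heatmaply (an assumption, since the call is not namespaced), the transformation can be sketched in base R as follows; `min_max()` and the toy values are hypothetical, and the handling of constant columns is a choice made for this sketch only.

```r
# Rescale a numeric vector to the [0, 1] range
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Constant column: return zeros rather than dividing by zero (sketch-only choice)
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy stand-in for two numeric variables on very different scales
toy <- data.frame(ALT = c(0, 50, 100), AREA = c(10, 20, 40))

# Apply the rescaling column by column
toy_norm <- as.data.frame(lapply(toy, min_max))

print(toy_norm)
```

After rescaling, every column spans 0 to 1, so variables measured in metres and square kilometres become comparable in magnitude.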

5.1.2.1 Creating an “All variables” dataframe for correlational plot checking.

dropsReg <- c("CITY_STATE", "GDP", "GDP_CAPITA", "POP_GDP")

Brazil_All_vars <- Brazil_reg.sf[ , !(names(Brazil_reg.sf) %in% dropsReg)] %>%
  st_set_geometry(NULL)

5.1.3 Perform Correlational analysis of variables

corrplot(cor(Brazil_numeric_vars.norm, use = "complete.obs"), diag = FALSE, order = "AOE",
        tl.pos = "td", tl.cex = 0.5, method = "square", type = "upper")

corrplot(cor(Brazil_Ratios_vars, use = "complete.obs"), diag = FALSE, order = "AOE",
        tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

corrplot(cor(Brazil_Categorical_vars, use = "complete.obs"), diag = FALSE, order = "AOE",
        tl.pos = "td", tl.cex = 0.5, method = "square", type = "upper")

# The all-variables plot is omitted for display reasons, although it was checked during the analysis to ensure that no variables are too strongly correlated
# corrplot(cor(Brazil_All_vars, use = "complete.obs"), diag = FALSE, order = "AOE",
#         tl.pos = "td", tl.cex = 0.5, method = "square", type = "upper")

As expected, a number of indicators in our numeric dataset are clearly highly correlated with one another, notably the IBGE, GVA, TAXES and COMP figures. Because of their correlation with COMP_TOT, we will use it as a single metric to capture them, as it is the likely driver of those variables (particularly taxes). We will also use IDHM as a measure for all the IDHM indicators, although there will be some loss of information.

Within the ratios, we can see that the youth population ratios are very highly correlated with one another. As these are ratios, we can sum them to give a new youth metric instead. Additionally, because DOM_RURAL_RATIO and DOM_URBAN_RATIO, and likewise RES_BRAZ_POP_RATIO and RES_FOREIGN_POP_RATIO, are complements of each other, we need only one from each pair as an indicator. In our case, we will choose the foreign population ratio and the domestic urban units ratio.

5.1.4 Extracting and combining variables to be useful

5.1.4.1 Extracting useful numeric

Brazil_numeric_vars_pro <- Brazil_numeric_vars.norm %>% select("ALT", "AREA", "IDHM", "POP_DENSITY", "COMP_TOT")

5.1.4.2 Remodelling and extracting ratios

Brazil_Ratios_vars_pro <-  Brazil_Ratios_vars %>%
   mutate( POP_YOUTH_RATIO = as.numeric((POP_BEL_ONE_RATIO + POP_ONE_to_FOUR_RATIO + POP_FIVE_to_NINE_RATIO + POP_TEN_to_FOURTEEN_RATIO)))

dropsRatios <- c("POP_BEL_ONE_RATIO", "POP_ONE_to_FOUR_RATIO", "POP_FIVE_to_NINE_RATIO", "POP_TEN_to_FOURTEEN_RATIO", "RES_BRAZ_POP_RATIO", "DOM_RURAL_RATIO")

Brazil_Ratios_vars_pro <- Brazil_Ratios_vars_pro[ , !(names(Brazil_Ratios_vars_pro) %in% dropsRatios)]

5.1.4.3 Combining Variables for mapping

Brazil_indicators <- cbind(Brazil_Ratios_vars_pro, Brazil_Categorical_vars, Brazil_numeric_vars_pro)

5.1.4.4 Performing Correlational matrix plot once more

corrplot(cor(Brazil_indicators, use = "complete.obs"), diag = FALSE, order = "AOE",
        tl.pos = "td", tl.cex = 0.4, number.cex= 0.3, method = "number", type = "upper")

Based on our correlation plot, no pair of variables is correlated beyond 0.75. As such, these are the variables we will use in our regression.

5.1.5 Forming Final Simple Feature Dataframe

polygon_frame <- Brazil_reg.sf %>% select("CITY_STATE")
joining_frame <- Brazil_reg.sf %>% select("CITY_STATE", "GDP_CAPITA") %>% st_set_geometry(NULL)
joining_frame_states <- cbind(joining_frame, Brazil_indicators)
## Usually you would join on an index, but after checking the data we find that the rows align with those of Brazil_reg.sf, so we can assume the indicators were joined to the correct polygons of the original sf object
Brazil_Indicators.sf <- left_join(polygon_frame, joining_frame_states, by="CITY_STATE")
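The alignment that the cbind() and left_join() above rely on can also be checked in code rather than by inspection. Below is a minimal sketch on toy data frames; the helper name check_join_key and the toy values are hypothetical, and only base R is used.

```r
# Hypothetical helper: confirm a key column is safe to bind and join on
check_join_key <- function(left, right, key) {
  list(
    left_unique  = !anyDuplicated(left[[key]]),   # no repeated keys on the left
    right_unique = !anyDuplicated(right[[key]]),  # no repeated keys on the right
    same_keys    = setequal(left[[key]], right[[key]]),
    same_order   = identical(as.character(left[[key]]),
                             as.character(right[[key]]))
  )
}

# Toy stand-ins for polygon_frame and joining_frame
polys <- data.frame(CITY_STATE = c("A_GO", "B_MG", "C_PA"))
attrs <- data.frame(CITY_STATE = c("A_GO", "B_MG", "C_PA"),
                    GDP_CAPITA = c(10, 20, 30))

res <- check_join_key(polys, attrs, "CITY_STATE")
str(res)

# A shuffled copy keeps the same keys but loses the row order cbind() depends on
res_shuffled <- check_join_key(polys, attrs[c(3, 1, 2), ], "CITY_STATE")
```

If same_order were FALSE on the real data, the cbind() step would attach indicators to the wrong municipalities, while unique and matching keys are what make the left_join() one-to-one.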

5.1.5.1 Validating and summarizing variables

Validity_NA_Check(Brazil_Indicators.sf)

## [1] "For: Brazil_Indicators.sf"
## [1] "Number of Invalid polygons/points is: 0"
## [1] "Number of NA rows is: 0"

summary(Brazil_Indicators.sf)

##                   CITY_STATE     GDP_CAPITA     RES_FOREIGN_POP_RATIO
##  Abadia De Goiás_GO    :   1   Min.   :  3191   Min.   :0.0000000    
##  Abadia Dos Dourados_MG:   1   1st Qu.:  9062   1st Qu.:0.0000000    
##  Abadiânia_GO          :   1   Median : 15870   Median :0.0000000    
##  Abaeté_MG             :   1   Mean   : 21122   Mean   :0.0007593    
##  Abaetetuba_PA         :   1   3rd Qu.: 26155   3rd Qu.:0.0006992    
##  Abaiara_CE            :   1   Max.   :314638   Max.   :0.3772182    
##  (Other)               :5558                                         
##  DOM_URBAN_RATIO   POP_WORKING_RATIO POP_ELDERLY_RATIO GVA_AGROPEC_RATIO
##  Min.   :0.04553   Min.   :0.4716    Min.   :0.02255   Min.   :0.00000  
##  1st Qu.:0.49148   1st Qu.:0.6087    1st Qu.:0.09799   1st Qu.:0.03364  
##  Median :0.66263   Median :0.6325    Median :0.11921   Median :0.15062  
##  Mean   :0.65205   Mean   :0.6308    Mean   :0.12009   Mean   :0.21034  
##  3rd Qu.:0.83040   3rd Qu.:0.6543    3rd Qu.:0.14103   3rd Qu.:0.34094  
##  Max.   :1.00000   Max.   :0.7448    Max.   :0.42199   Max.   :0.99877  
##                                                                         
##  GVA_INDUSTRY_RATIO  GVA_SERVICES_RATIO  GVA_PUBLIC_RATIO     COM_A_RATIO      
##  Min.   :0.0000157   Min.   :0.0000461   Min.   :0.0000433   Min.   :0.000000  
##  1st Qu.:0.0368730   1st Qu.:0.1985910   1st Qu.:0.1448472   1st Qu.:0.001569  
##  Median :0.0714602   Median :0.3117002   Median :0.2948082   Median :0.011803  
##  Mean   :0.1377745   Mean   :0.3260963   Mean   :0.3257928   Mean   :0.039408  
##  3rd Qu.:0.1795132   3rd Qu.:0.4600063   3rd Qu.:0.4966551   3rd Qu.:0.031915  
##  Max.   :0.9991868   Max.   :0.9995977   Max.   :0.9996029   Max.   :0.917085  
##                                                                                
##   COM_B_RATIO        COM_C_RATIO       COM_D_RATIO         COM_E_RATIO      
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.0000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.03636   1st Qu.:0.0000000   1st Qu.:0.000000  
##  Median :0.000000   Median :0.06590   Median :0.0000000   Median :0.000000  
##  Mean   :0.006019   Mean   :0.07967   Mean   :0.0007847   Mean   :0.002508  
##  3rd Qu.:0.005188   3rd Qu.:0.10593   3rd Qu.:0.0000000   3rd Qu.:0.003226  
##  Max.   :0.333333   Max.   :0.54518   Max.   :0.4444444   Max.   :0.083333  
##                                                                             
##   COM_F_RATIO       COM_G_RATIO       COM_H_RATIO       COM_I_RATIO     
##  Min.   :0.00000   Min.   :0.01789   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.01389   1st Qu.:0.38980   1st Qu.:0.01562   1st Qu.:0.02128  
##  Median :0.02778   Median :0.46396   Median :0.03757   Median :0.04167  
##  Mean   :0.03130   Mean   :0.47234   Mean   :0.04955   Mean   :0.04567  
##  3rd Qu.:0.04348   3rd Qu.:0.55263   3rd Qu.:0.07052   3rd Qu.:0.06202  
##  Max.   :0.29213   Max.   :0.89091   Max.   :0.43689   Max.   :0.52542  
##                                                                         
##   COM_J_RATIO        COM_K_RATIO        COM_L_RATIO        COM_M_RATIO     
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.01144  
##  Median :0.007299   Median :0.000000   Median :0.000000   Median :0.02362  
##  Mean   :0.009054   Mean   :0.003933   Mean   :0.005450   Mean   :0.02536  
##  3rd Qu.:0.013982   3rd Qu.:0.006112   3rd Qu.:0.008601   3rd Qu.:0.03659  
##  Max.   :0.417249   Max.   :0.087912   Max.   :0.156863   Max.   :0.24444  
##                                                                            
##   COM_N_RATIO       COM_O_RATIO         COM_P_RATIO       COM_Q_RATIO      
##  Min.   :0.00000   Min.   :0.0001764   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.01802   1st Qu.:0.0058954   1st Qu.:0.01786   1st Qu.:0.006615  
##  Median :0.02924   Median :0.0153846   Median :0.02985   Median :0.019946  
##  Mean   :0.03553   Mean   :0.0277867   Mean   :0.04350   Mean   :0.022028  
##  3rd Qu.:0.04496   3rd Qu.:0.0361664   3rd Qu.:0.04878   3rd Qu.:0.033033  
##  Max.   :0.33527   Max.   :0.3636364   Max.   :0.83673   Max.   :0.214286  
##                                                                            
##   COM_R_RATIO        COM_S_RATIO       COM_U_RATIO        POP_YOUTH_RATIO 
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000e+00   Min.   :0.1064  
##  1st Qu.:0.000000   1st Qu.:0.04116   1st Qu.:0.000e+00   1st Qu.:0.2153  
##  Median :0.009091   Median :0.06395   Median :0.000e+00   Median :0.2452  
##  Mean   :0.010772   Mean   :0.08933   Mean   :2.036e-06   Mean   :0.2491  
##  3rd Qu.:0.015310   3rd Qu.:0.11147   3rd Qu.:0.000e+00   3rd Qu.:0.2771  
##  Max.   :0.166667   Max.   :0.56716   Max.   :2.985e-03   Max.   :0.4408  
##                                                                           
##     CAPITAL         RURAL_URBANIntermediário Adjacente
##  Min.   :0.000000   Min.   :0.0000                    
##  1st Qu.:0.000000   1st Qu.:0.0000                    
##  Median :0.000000   Median :0.0000                    
##  Mean   :0.004853   Mean   :0.1233                    
##  3rd Qu.:0.000000   3rd Qu.:0.0000                    
##  Max.   :1.000000   Max.   :1.0000                    
##                                                       
##  RURAL_URBANIntermediário Remoto RURAL_URBANRural Adjacente
##  Min.   :0.00000                 Min.   :0.0000            
##  1st Qu.:0.00000                 1st Qu.:0.0000            
##  Median :0.00000                 Median :1.0000            
##  Mean   :0.01078                 Mean   :0.5462            
##  3rd Qu.:0.00000                 3rd Qu.:1.0000            
##  Max.   :1.00000                 Max.   :1.0000            
##                                                            
##  RURAL_URBANRural Remoto RURAL_URBANUrbano
##  Min.   :0.00000         Min.   :0.0000   
##  1st Qu.:0.00000         1st Qu.:0.0000   
##  Median :0.00000         Median :0.0000   
##  Mean   :0.05805         Mean   :0.2617   
##  3rd Qu.:0.00000         3rd Qu.:1.0000   
##  Max.   :1.00000         Max.   :1.0000   
##                                           
##  GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social
##  Min.   :0.0000                                                              
##  1st Qu.:0.0000                                                              
##  Median :0.0000                                                              
##  Mean   :0.4892                                                              
##  3rd Qu.:1.0000                                                              
##  Max.   :1.0000                                                              
##                                                                              
##  GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita
##  Min.   :0.0000                                                     
##  1st Qu.:0.0000                                                     
##  Median :0.0000                                                     
##  Mean   :0.1317                                                     
##  3rd Qu.:0.0000                                                     
##  Max.   :1.0000                                                     
##                                                                     
##  GVA_MAINComércio e reparação de veículos automotores e motocicletas
##  Min.   :0.000000                                                   
##  1st Qu.:0.000000                                                   
##  Median :0.000000                                                   
##  Mean   :0.008267                                                   
##  3rd Qu.:0.000000                                                   
##  Max.   :1.000000                                                   
##                                                                     
##  GVA_MAINConstrução GVA_MAINDemais serviços
##  Min.   :0.000000   Min.   :0.0000         
##  1st Qu.:0.000000   1st Qu.:0.0000         
##  Median :0.000000   Median :0.0000         
##  Mean   :0.001258   Mean   :0.2653         
##  3rd Qu.:0.000000   3rd Qu.:1.0000         
##  Max.   :1.000000   Max.   :1.0000         
##                                            
##  GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação
##  Min.   :0.00000                                                                             
##  1st Qu.:0.00000                                                                             
##  Median :0.00000                                                                             
##  Mean   :0.01761                                                                             
##  3rd Qu.:0.00000                                                                             
##  Max.   :1.00000                                                                             
##                                                                                              
##  GVA_MAINIndústrias de transformação GVA_MAINIndústrias extrativas
##  Min.   :0.00000                     Min.   :0.00000              
##  1st Qu.:0.00000                     1st Qu.:0.00000              
##  Median :0.00000                     Median :0.00000              
##  Mean   :0.04691                     Mean   :0.00629              
##  3rd Qu.:0.00000                     3rd Qu.:0.00000              
##  Max.   :1.00000                     Max.   :1.00000              
##                                                                   
##  GVA_MAINPecuária, inclusive apoio à pecuária
##  Min.   :0.00000                             
##  1st Qu.:0.00000                             
##  Median :0.00000                             
##  Mean   :0.02894                             
##  3rd Qu.:0.00000                             
##  Max.   :1.00000                             
##                                              
##  GVA_MAINProdução florestal, pesca e aquicultura      ALT           
##  Min.   :0.000000                                Min.   :0.0000000  
##  1st Qu.:0.000000                                1st Qu.:0.0001941  
##  Median :0.000000                                Median :0.0004648  
##  Mean   :0.004493                                Mean   :0.0010222  
##  3rd Qu.:0.000000                                3rd Qu.:0.0007193  
##  Max.   :1.000000                                Max.   :1.0000000  
##                                                                     
##       AREA               IDHM         POP_DENSITY          COMP_TOT        
##  Min.   :0.000000   Min.   :0.0000   Min.   :0.000000   Min.   :0.0000000  
##  1st Qu.:0.001260   1st Qu.:0.4077   1st Qu.:0.000876   1st Qu.:0.0001169  
##  Median :0.002589   Median :0.5563   Median :0.001864   Median :0.0002941  
##  Mean   :0.009539   Mean   :0.5432   Mean   :0.008659   Mean   :0.0016997  
##  3rd Qu.:0.006412   3rd Qu.:0.6757   3rd Qu.:0.004091   3rd Qu.:0.0008356  
##  Max.   :1.000000   Max.   :1.0000   Max.   :1.000000   Max.   :1.0000000  
##                                                                            
##           geometry   
##  MULTIPOLYGON :5564  
##  epsg:4674    :   0  
##  +proj=long...:   0  
##                      
##                      
##                      
##

5.2 Building Multi-linear regression model for contributory factors to GDP per capita

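When performing a multiple linear regression, we need to define our hypotheses for each explanatory variable:

- Null hypothesis: the data is randomly distributed, i.e. the variable has no linear relationship with GDP per capita (its coefficient is zero)
- Alternative hypothesis: the data is not randomly distributed, i.e. the variable's coefficient is not zero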

We will use a confidence level of 95% for this analysis, meaning we need a p-value below the alpha value of 0.05 in order to reject the null hypothesis.
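To make this decision rule concrete, here is a minimal sketch of reading the p-values off a fitted model and comparing them against alpha. It uses the built-in mtcars data purely for illustration; the names fit, coefs and sig are our own and not part of the analysis.

```r
# Fit a small illustrative model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

alpha <- 0.05
sig <- data.frame(term        = rownames(coefs),
                  p_value     = coefs[, "Pr(>|t|)"],
                  significant = coefs[, "Pr(>|t|)"] < alpha,
                  row.names   = NULL)
print(sig)
```

The same pattern could be applied to the regression model fitted in the next section. Note that coef(summary()) leaves out any coefficients that lm() could not estimate, so rows reported as NA in the printed summary would not appear in this table.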

5.2.1 Performing Linear Regression

Because we have categorical data and groups of ratios that each sum to 1, we need to choose one baseline per group:

Population Age ratios:
- POP_YOUTH_RATIO (the youth ratio)
GVA Ratios:
- GVA_PUBLIC_RATIO (public services)
GVA Main categories:
- GVA_MAINProdução florestal, pesca e aquicultura
Company Type Ratios:
- COM_U_RATIO
Rural or Urban Classifications:
- RURAL_URBANUrbano
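The reason a baseline is needed is that a group of columns summing to 1 is perfectly collinear with the intercept, so one coefficient in each group cannot be estimated. The sketch below illustrates this on simulated data; the variables share_a, share_b, share_c and y are hypothetical and only base R is used.

```r
set.seed(42)
n <- 200

# Three hypothetical shares that sum to 1 in every row
raw <- matrix(runif(n * 3), ncol = 3)
shares <- as.data.frame(raw / rowSums(raw))
names(shares) <- c("share_a", "share_b", "share_c")
shares$y <- 5 + 3 * shares$share_a - 2 * shares$share_b + rnorm(n, sd = 0.1)

# With all three shares and an intercept the design matrix is rank deficient,
# so lm() reports one coefficient as NA
fit_all <- lm(y ~ share_a + share_b + share_c, data = shares)
print(coef(fit_all))

# Leaving out the chosen baseline (share_c here) up front gives the same fit
fit_base <- lm(y ~ share_a + share_b, data = shares)
print(coef(fit_base))
```

In this toy case the remaining coefficients are read relative to the omitted share. The NA rows and the "not defined because of singularities" note in the regression summary that follows are the same effect, with one NA per baseline listed above.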

GDPPC.mlr<- lm(GDP_CAPITA ~ ., data=Brazil_Indicators.sf[2:53] %>% st_set_geometry(NULL))
summary(GDPPC.mlr)

## 
## Call:
## lm(formula = GDP_CAPITA ~ ., data = Brazil_Indicators.sf[2:53] %>% 
##     st_set_geometry(NULL))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40924  -5198   -713   3256 246925 
## 
## Coefficients: (5 not defined because of singularities)
##                                                                                                  Estimate
## (Intercept)                                                                                     5457565.3
## RES_FOREIGN_POP_RATIO                                                                             -5490.9
## DOM_URBAN_RATIO                                                                                   -1953.4
## POP_WORKING_RATIO                                                                                 37961.8
## POP_ELDERLY_RATIO                                                                                -34416.2
## GVA_AGROPEC_RATIO                                                                                  8195.2
## GVA_INDUSTRY_RATIO                                                                                22856.5
## GVA_SERVICES_RATIO                                                                                 4935.0
## GVA_PUBLIC_RATIO                                                                                       NA
## COM_A_RATIO                                                                                    -5474832.5
## COM_B_RATIO                                                                                    -5510719.9
## COM_C_RATIO                                                                                    -5504450.0
## COM_D_RATIO                                                                                    -5436381.1
## COM_E_RATIO                                                                                    -5436230.7
## COM_F_RATIO                                                                                    -5477177.6
## COM_G_RATIO                                                                                    -5479871.4
## COM_H_RATIO                                                                                    -5458575.7
## COM_I_RATIO                                                                                    -5467005.8
## COM_J_RATIO                                                                                    -5467998.6
## COM_K_RATIO                                                                                    -5364519.1
## COM_L_RATIO                                                                                    -5373763.3
## COM_M_RATIO                                                                                    -5460390.8
## COM_N_RATIO                                                                                    -5466424.3
## COM_O_RATIO                                                                                    -5461544.0
## COM_P_RATIO                                                                                    -5478092.3
## COM_Q_RATIO                                                                                    -5487459.3
## COM_R_RATIO                                                                                    -5479679.6
## COM_S_RATIO                                                                                    -5477920.3
## COM_U_RATIO                                                                                            NA
## POP_YOUTH_RATIO                                                                                        NA
## CAPITAL                                                                                           -8485.0
## `RURAL_URBANIntermediário Adjacente`                                                               -370.4
## `RURAL_URBANIntermediário Remoto`                                                                  4839.7
## `RURAL_URBANRural Adjacente`                                                                       1555.1
## `RURAL_URBANRural Remoto`                                                                          4727.9
## RURAL_URBANUrbano                                                                                      NA
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                    -9635.6
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita`                              2708.8
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                             21939.3
## GVA_MAINConstrução                                                                                -7020.6
## `GVA_MAINDemais serviços`                                                                         -8217.5
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação`    19968.9
## `GVA_MAINIndústrias de transformação`                                                             13099.6
## `GVA_MAINIndústrias extrativas`                                                                   14549.4
## `GVA_MAINPecuária, inclusive apoio à pecuária`                                                    -5565.2
## `GVA_MAINProdução florestal, pesca e aquicultura`                                                      NA
## ALT                                                                                               -5481.8
## AREA                                                                                               8959.2
## IDHM                                                                                              36695.3
## POP_DENSITY                                                                                        7831.5
## COMP_TOT                                                                                          36690.6
##                                                                                                Std. Error
## (Intercept)                                                                                     5758288.1
## RES_FOREIGN_POP_RATIO                                                                             56651.7
## DOM_URBAN_RATIO                                                                                    1462.6
## POP_WORKING_RATIO                                                                                 10630.0
## POP_ELDERLY_RATIO                                                                                  7766.5
## GVA_AGROPEC_RATIO                                                                                  1309.1
## GVA_INDUSTRY_RATIO                                                                                 1634.1
## GVA_SERVICES_RATIO                                                                                 1161.4
## GVA_PUBLIC_RATIO                                                                                       NA
## COM_A_RATIO                                                                                     5758215.0
## COM_B_RATIO                                                                                     5758386.0
## COM_C_RATIO                                                                                     5758212.1
## COM_D_RATIO                                                                                     5758474.6
## COM_E_RATIO                                                                                     5758512.8
## COM_F_RATIO                                                                                     5758246.2
## COM_G_RATIO                                                                                     5758265.9
## COM_H_RATIO                                                                                     5758277.5
## COM_I_RATIO                                                                                     5757930.3
## COM_J_RATIO                                                                                     5758076.2
## COM_K_RATIO                                                                                     5757653.4
## COM_L_RATIO                                                                                     5757834.4
## COM_M_RATIO                                                                                     5758161.2
## COM_N_RATIO                                                                                     5757990.4
## COM_O_RATIO                                                                                     5758278.8
## COM_P_RATIO                                                                                     5758244.6
## COM_Q_RATIO                                                                                     5758228.4
## COM_R_RATIO                                                                                     5758306.1
## COM_S_RATIO                                                                                     5758244.3
## COM_U_RATIO                                                                                            NA
## POP_YOUTH_RATIO                                                                                        NA
## CAPITAL                                                                                            3355.8
## `RURAL_URBANIntermediário Adjacente`                                                                738.4
## `RURAL_URBANIntermediário Remoto`                                                                  2083.3
## `RURAL_URBANRural Adjacente`                                                                        695.9
## `RURAL_URBANRural Remoto`                                                                          1085.4
## RURAL_URBANUrbano                                                                                      NA
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                     2983.5
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita`                              2991.4
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                              3698.9
## GVA_MAINConstrução                                                                                 6278.1
## `GVA_MAINDemais serviços`                                                                          3025.8
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação`     3389.8
## `GVA_MAINIndústrias de transformação`                                                              3168.8
## `GVA_MAINIndústrias extrativas`                                                                    3923.8
## `GVA_MAINPecuária, inclusive apoio à pecuária`                                                     3170.5
## `GVA_MAINProdução florestal, pesca e aquicultura`                                                      NA
## ALT                                                                                                9989.1
## AREA                                                                                               6234.3
## IDHM                                                                                               2842.5
## POP_DENSITY                                                                                        4947.6
## COMP_TOT                                                                                          14889.5
##                                                                                                t value
## (Intercept)                                                                                      0.948
## RES_FOREIGN_POP_RATIO                                                                           -0.097
## DOM_URBAN_RATIO                                                                                 -1.336
## POP_WORKING_RATIO                                                                                3.571
## POP_ELDERLY_RATIO                                                                               -4.431
## GVA_AGROPEC_RATIO                                                                                6.260
## GVA_INDUSTRY_RATIO                                                                              13.987
## GVA_SERVICES_RATIO                                                                               4.249
## GVA_PUBLIC_RATIO                                                                                    NA
## COM_A_RATIO                                                                                     -0.951
## COM_B_RATIO                                                                                     -0.957
## COM_C_RATIO                                                                                     -0.956
## COM_D_RATIO                                                                                     -0.944
## COM_E_RATIO                                                                                     -0.944
## COM_F_RATIO                                                                                     -0.951
## COM_G_RATIO                                                                                     -0.952
## COM_H_RATIO                                                                                     -0.948
## COM_I_RATIO                                                                                     -0.949
## COM_J_RATIO                                                                                     -0.950
## COM_K_RATIO                                                                                     -0.932
## COM_L_RATIO                                                                                     -0.933
## COM_M_RATIO                                                                                     -0.948
## COM_N_RATIO                                                                                     -0.949
## COM_O_RATIO                                                                                     -0.948
## COM_P_RATIO                                                                                     -0.951
## COM_Q_RATIO                                                                                     -0.953
## COM_R_RATIO                                                                                     -0.952
## COM_S_RATIO                                                                                     -0.951
## COM_U_RATIO                                                                                         NA
## POP_YOUTH_RATIO                                                                                     NA
## CAPITAL                                                                                         -2.528
## `RURAL_URBANIntermediário Adjacente`                                                            -0.502
## `RURAL_URBANIntermediário Remoto`                                                                2.323
## `RURAL_URBANRural Adjacente`                                                                     2.235
## `RURAL_URBANRural Remoto`                                                                        4.356
## RURAL_URBANUrbano                                                                                   NA
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                  -3.230
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita`                            0.906
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                            5.931
## GVA_MAINConstrução                                                                              -1.118
## `GVA_MAINDemais serviços`                                                                       -2.716
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação`   5.891
## `GVA_MAINIndústrias de transformação`                                                            4.134
## `GVA_MAINIndústrias extrativas`                                                                  3.708
## `GVA_MAINPecuária, inclusive apoio à pecuária`                                                  -1.755
## `GVA_MAINProdução florestal, pesca e aquicultura`                                                   NA
## ALT                                                                                             -0.549
## AREA                                                                                             1.437
## IDHM                                                                                            12.910
## POP_DENSITY                                                                                      1.583
## COMP_TOT                                                                                         2.464
##                                                                                                Pr(>|t|)
## (Intercept)                                                                                    0.343285
## RES_FOREIGN_POP_RATIO                                                                          0.922791
## DOM_URBAN_RATIO                                                                                0.181731
## POP_WORKING_RATIO                                                                              0.000358
## POP_ELDERLY_RATIO                                                                              9.55e-06
## GVA_AGROPEC_RATIO                                                                              4.14e-10
## GVA_INDUSTRY_RATIO                                                                              < 2e-16
## GVA_SERVICES_RATIO                                                                             2.18e-05
## GVA_PUBLIC_RATIO                                                                                     NA
## COM_A_RATIO                                                                                    0.341754
## COM_B_RATIO                                                                                    0.338614
## COM_C_RATIO                                                                                    0.339149
## COM_D_RATIO                                                                                    0.345177
## COM_E_RATIO                                                                                    0.345194
## COM_F_RATIO                                                                                    0.341550
## COM_G_RATIO                                                                                    0.341315
## COM_H_RATIO                                                                                    0.343195
## COM_I_RATIO                                                                                    0.342421
## COM_J_RATIO                                                                                    0.342346
## COM_K_RATIO                                                                                    0.351522
## COM_L_RATIO                                                                                    0.350708
## COM_M_RATIO                                                                                    0.343025
## COM_N_RATIO                                                                                    0.342477
## COM_O_RATIO                                                                                    0.342933
## COM_P_RATIO                                                                                    0.341470
## COM_Q_RATIO                                                                                    0.340643
## COM_R_RATIO                                                                                    0.341335
## COM_S_RATIO                                                                                    0.341485
## COM_U_RATIO                                                                                          NA
## POP_YOUTH_RATIO                                                                                      NA
## CAPITAL                                                                                        0.011484
## `RURAL_URBANIntermediário Adjacente`                                                           0.615964
## `RURAL_URBANIntermediário Remoto`                                                              0.020211
## `RURAL_URBANRural Adjacente`                                                                   0.025484
## `RURAL_URBANRural Remoto`                                                                      1.35e-05
## RURAL_URBANUrbano                                                                                    NA
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                 0.001247
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita`                          0.365227
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                          3.19e-09
## GVA_MAINConstrução                                                                             0.263506
## `GVA_MAINDemais serviços`                                                                      0.006632
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` 4.07e-09
## `GVA_MAINIndústrias de transformação`                                                          3.62e-05
## `GVA_MAINIndústrias extrativas`                                                                0.000211
## `GVA_MAINPecuária, inclusive apoio à pecuária`                                                 0.079260
## `GVA_MAINProdução florestal, pesca e aquicultura`                                                    NA
## ALT                                                                                            0.583182
## AREA                                                                                           0.150749
## IDHM                                                                                            < 2e-16
## POP_DENSITY                                                                                    0.113508
## COMP_TOT                                                                                       0.013763
##                                                                                                   
## (Intercept)                                                                                       
## RES_FOREIGN_POP_RATIO                                                                             
## DOM_URBAN_RATIO                                                                                   
## POP_WORKING_RATIO                                                                              ***
## POP_ELDERLY_RATIO                                                                              ***
## GVA_AGROPEC_RATIO                                                                              ***
## GVA_INDUSTRY_RATIO                                                                             ***
## GVA_SERVICES_RATIO                                                                             ***
## GVA_PUBLIC_RATIO                                                                                  
## COM_A_RATIO                                                                                       
## COM_B_RATIO                                                                                       
## COM_C_RATIO                                                                                       
## COM_D_RATIO                                                                                       
## COM_E_RATIO                                                                                       
## COM_F_RATIO                                                                                       
## COM_G_RATIO                                                                                       
## COM_H_RATIO                                                                                       
## COM_I_RATIO                                                                                       
## COM_J_RATIO                                                                                       
## COM_K_RATIO                                                                                       
## COM_L_RATIO                                                                                       
## COM_M_RATIO                                                                                       
## COM_N_RATIO                                                                                       
## COM_O_RATIO                                                                                       
## COM_P_RATIO                                                                                       
## COM_Q_RATIO                                                                                       
## COM_R_RATIO                                                                                       
## COM_S_RATIO                                                                                       
## COM_U_RATIO                                                                                       
## POP_YOUTH_RATIO                                                                                   
## CAPITAL                                                                                        *  
## `RURAL_URBANIntermediário Adjacente`                                                              
## `RURAL_URBANIntermediário Remoto`                                                              *  
## `RURAL_URBANRural Adjacente`                                                                   *  
## `RURAL_URBANRural Remoto`                                                                      ***
## RURAL_URBANUrbano                                                                                 
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                 ** 
## `GVA_MAINAgricultura, inclusive apoio à agricultura e a pós colheita`                             
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                          ***
## GVA_MAINConstrução                                                                                
## `GVA_MAINDemais serviços`                                                                      ** 
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` ***
## `GVA_MAINIndústrias de transformação`                                                          ***
## `GVA_MAINIndústrias extrativas`                                                                ***
## `GVA_MAINPecuária, inclusive apoio à pecuária`                                                 .  
## `GVA_MAINProdução florestal, pesca e aquicultura`                                                 
## ALT                                                                                               
## AREA                                                                                              
## IDHM                                                                                           ***
## POP_DENSITY                                                                                       
## COMP_TOT                                                                                       *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14600 on 5518 degrees of freedom
## Multiple R-squared:  0.488,  Adjusted R-squared:  0.4838 
## F-statistic: 116.9 on 45 and 5518 DF,  p-value: < 2.2e-16

5.2.2 Interpretation of Regression

The F-statistic has a p-value below 0.05, so we reject the null hypothesis that all slope coefficients are zero, i.e. that an intercept-only model (the mean of GDP per capita) fits the data as well as our model.
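The overall p-value printed on the last line of a regression summary can be recovered from the F-statistic and its two degrees of freedom. The sketch below is only an illustration on the built-in mtcars data, not the Brazil data, and the model formula is an arbitrary choice.

```r
# Illustration on the built-in mtcars data (NOT the Brazil data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# summary()$fstatistic is a named vector: value, numdf, dendf
fstat <- summary(fit)$fstatistic

# Upper-tail probability of the F distribution = overall model p-value
p_value <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"],
                     lower.tail = FALSE))

# TRUE means we reject "all slopes are zero" at the 5% level
p_value < 0.05
```

The same three numbers appear in the output above as `F-statistic: 116.9 on 45 and 5518 DF`.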

The company type ratios do not contribute significantly to GDP per capita. Additionally, the altitude and area of the municipality show no significance. The same holds for population density, the ratio of foreigners in the population and the percentage of urbanized households. Some of the GVA main categories are also not statistically significant, and we will remove them. Lastly, the urban or rural classifications are mostly significant, except for Intermediário Adjacente (p = 0.616), which is likely because its definition sits in between many of the others.

5.2.3 Selecting Significant Indicators

Brazil_sig_Indic.sf <- Brazil_Indicators.sf %>% select("CITY_STATE", "GDP_CAPITA", "POP_WORKING_RATIO", "POP_ELDERLY_RATIO","GVA_AGROPEC_RATIO", "GVA_INDUSTRY_RATIO", "GVA_SERVICES_RATIO", "CAPITAL", "RURAL_URBANIntermediário Adjacente", "RURAL_URBANIntermediário Remoto", "RURAL_URBANRural Adjacente", "RURAL_URBANRural Remoto", "GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social", "GVA_MAINComércio e reparação de veículos automotores e motocicletas", "GVA_MAINDemais serviços", "GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação", "GVA_MAINIndústrias de transformação", "GVA_MAINIndústrias extrativas", "IDHM", "COMP_TOT")
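The column list above was typed by hand from the summary table. As a cross-check, the significant terms can also be read from the coefficient matrix programmatically. This is a minimal sketch on the built-in mtcars data with an arbitrary formula. The 0.05 cut-off is the same one used in this report, and `sig_terms` is a name introduced only for this illustration.

```r
# Illustration on the built-in mtcars data (NOT the Brazil data).
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# coef(summary(fit)) is a matrix with one row per estimated coefficient;
# aliased coefficients (shown as NA in the printed summary) are omitted.
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|t|)"]

# Names of the terms significant at the 5% level, excluding the intercept
sig_terms <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
sig_terms
```

For dummy-coded factors the names returned are the dummy column names, which is how the `RURAL_URBAN...` and `GVA_MAIN...` columns are named in this report.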

5.2.4 Running Regression on Significant Indicators

GDPPC_sig.mlr<- lm(GDP_CAPITA ~ ., data=Brazil_sig_Indic.sf[2:21] %>% st_set_geometry(NULL))
summary(GDPPC_sig.mlr)

## 
## Call:
## lm(formula = GDP_CAPITA ~ ., data = Brazil_sig_Indic.sf[2:21] %>% 
##     st_set_geometry(NULL))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42585  -5379   -942   3078 252473 
## 
## Coefficients:
##                                                                                                Estimate
## (Intercept)                                                                                    -13666.5
## POP_WORKING_RATIO                                                                               28125.7
## POP_ELDERLY_RATIO                                                                              -43172.8
## GVA_AGROPEC_RATIO                                                                                8705.5
## GVA_INDUSTRY_RATIO                                                                              22762.2
## GVA_SERVICES_RATIO                                                                               5337.2
## CAPITAL                                                                                         -4722.8
## `RURAL_URBANIntermediário Adjacente`                                                            -1085.7
## `RURAL_URBANIntermediário Remoto`                                                                4783.1
## `RURAL_URBANRural Adjacente`                                                                     1129.0
## `RURAL_URBANRural Remoto`                                                                        4452.0
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                 -11208.5
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                           21417.8
## `GVA_MAINDemais serviços`                                                                       -9095.2
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação`  19564.9
## `GVA_MAINIndústrias de transformação`                                                           11273.0
## `GVA_MAINIndústrias extrativas`                                                                 15435.8
## IDHM                                                                                            39405.4
## COMP_TOT                                                                                        55830.7
##                                                                                                Std. Error
## (Intercept)                                                                                        6122.4
## POP_WORKING_RATIO                                                                                 10242.9
## POP_ELDERLY_RATIO                                                                                  7379.9
## GVA_AGROPEC_RATIO                                                                                  1296.0
## GVA_INDUSTRY_RATIO                                                                                 1636.9
## GVA_SERVICES_RATIO                                                                                 1153.6
## CAPITAL                                                                                            3279.7
## `RURAL_URBANIntermediário Adjacente`                                                                732.7
## `RURAL_URBANIntermediário Remoto`                                                                  2012.5
## `RURAL_URBANRural Adjacente`                                                                        627.3
## `RURAL_URBANRural Remoto`                                                                          1037.5
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                      707.0
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                              2295.3
## `GVA_MAINDemais serviços`                                                                           774.5
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação`     1733.3
## `GVA_MAINIndústrias de transformação`                                                              1222.3
## `GVA_MAINIndústrias extrativas`                                                                    2652.6
## IDHM                                                                                               2416.5
## COMP_TOT                                                                                          14553.1
##                                                                                                t value
## (Intercept)                                                                                     -2.232
## POP_WORKING_RATIO                                                                                2.746
## POP_ELDERLY_RATIO                                                                               -5.850
## GVA_AGROPEC_RATIO                                                                                6.717
## GVA_INDUSTRY_RATIO                                                                              13.906
## GVA_SERVICES_RATIO                                                                               4.627
## CAPITAL                                                                                         -1.440
## `RURAL_URBANIntermediário Adjacente`                                                            -1.482
## `RURAL_URBANIntermediário Remoto`                                                                2.377
## `RURAL_URBANRural Adjacente`                                                                     1.800
## `RURAL_URBANRural Remoto`                                                                        4.291
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                 -15.853
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                            9.331
## `GVA_MAINDemais serviços`                                                                      -11.743
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação`  11.288
## `GVA_MAINIndústrias de transformação`                                                            9.223
## `GVA_MAINIndústrias extrativas`                                                                  5.819
## IDHM                                                                                            16.307
## COMP_TOT                                                                                         3.836
##                                                                                                Pr(>|t|)
## (Intercept)                                                                                    0.025639
## POP_WORKING_RATIO                                                                              0.006054
## POP_ELDERLY_RATIO                                                                              5.19e-09
## GVA_AGROPEC_RATIO                                                                              2.04e-11
## GVA_INDUSTRY_RATIO                                                                              < 2e-16
## GVA_SERVICES_RATIO                                                                             3.80e-06
## CAPITAL                                                                                        0.149924
## `RURAL_URBANIntermediário Adjacente`                                                           0.138455
## `RURAL_URBANIntermediário Remoto`                                                              0.017506
## `RURAL_URBANRural Adjacente`                                                                   0.071941
## `RURAL_URBANRural Remoto`                                                                      1.81e-05
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                  < 2e-16
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                           < 2e-16
## `GVA_MAINDemais serviços`                                                                       < 2e-16
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação`  < 2e-16
## `GVA_MAINIndústrias de transformação`                                                           < 2e-16
## `GVA_MAINIndústrias extrativas`                                                                6.25e-09
## IDHM                                                                                            < 2e-16
## COMP_TOT                                                                                       0.000126
##                                                                                                   
## (Intercept)                                                                                    *  
## POP_WORKING_RATIO                                                                              ** 
## POP_ELDERLY_RATIO                                                                              ***
## GVA_AGROPEC_RATIO                                                                              ***
## GVA_INDUSTRY_RATIO                                                                             ***
## GVA_SERVICES_RATIO                                                                             ***
## CAPITAL                                                                                           
## `RURAL_URBANIntermediário Adjacente`                                                              
## `RURAL_URBANIntermediário Remoto`                                                              *  
## `RURAL_URBANRural Adjacente`                                                                   .  
## `RURAL_URBANRural Remoto`                                                                      ***
## `GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social`                 ***
## `GVA_MAINComércio e reparação de veículos automotores e motocicletas`                          ***
## `GVA_MAINDemais serviços`                                                                      ***
## `GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação` ***
## `GVA_MAINIndústrias de transformação`                                                          ***
## `GVA_MAINIndústrias extrativas`                                                                ***
## IDHM                                                                                           ***
## COMP_TOT                                                                                       ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14840 on 5545 degrees of freedom
## Multiple R-squared:  0.4685, Adjusted R-squared:  0.4668 
## F-statistic: 271.5 on 18 and 5545 DF,  p-value: < 2.2e-16

Based on the new regression, some variables are no longer significant at the 5% level: the CAPITAL indicator (p = 0.150) and the Intermediário Adjacente (p = 0.138) and Rural Adjacente (p = 0.072) classifications. We will run the regression again without them.

5.2.4.1 Removing newly non-significant variables

dropsInsig <- c("CAPITAL", "RURAL_URBANIntermediário Adjacente", "RURAL_URBANRural Adjacente")

Brazil_sig_Indic.sf <- Brazil_sig_Indic.sf[ , !(names(Brazil_sig_Indic.sf) %in% dropsInsig)]

5.2.4.2 Rerunning the regression

5.2.4.3 Renaming Variables for further processing

names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'RURAL_URBANIntermediário Remoto'] <- 'CAT_INTERMEDIATE_REMOTE'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'RURAL_URBANRural Remoto'] <- 'CAT_RURAL_REMOTE'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINAdministração, defesa, educação e saúde públicas e seguridade social'] <- 'GVA_MAIN_Public_Sector'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINComércio e reparação de veículos automotores e motocicletas'] <- 'GVA_MAIN_Commercial'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINDemais serviços'] <- 'GVA_MAIN_Other_services'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINEletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação'] <- 'GVA_MAIN_Public_Utilities'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINIndústrias de transformação'] <- 'GVA_MAIN_Industry_transformation'
names(Brazil_sig_Indic.sf)[names(Brazil_sig_Indic.sf) == 'GVA_MAINIndústrias extrativas'] <- 'GVA_MAIN_Industrial'

GDPPC_sig2.mlr<- lm(GDP_CAPITA ~ ., data=Brazil_sig_Indic.sf[2:18] %>% st_set_geometry(NULL))
summary(GDPPC_sig2.mlr)

## 
## Call:
## lm(formula = GDP_CAPITA ~ ., data = Brazil_sig_Indic.sf[2:18] %>% 
##     st_set_geometry(NULL))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42091  -5367   -884   3055 252671 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -13922.8     6105.4  -2.280 0.022622 *  
## POP_WORKING_RATIO                 29840.1    10241.3   2.914 0.003586 ** 
## POP_ELDERLY_RATIO                -37947.9     6994.2  -5.426 6.02e-08 ***
## GVA_AGROPEC_RATIO                  9008.2     1287.0   6.999 2.88e-12 ***
## GVA_INDUSTRY_RATIO                22507.7     1633.3  13.780  < 2e-16 ***
## GVA_SERVICES_RATIO                 4989.7     1148.6   4.344 1.42e-05 ***
## CAT_INTERMEDIATE_REMOTE            4304.2     1967.3   2.188 0.028725 *  
## CAT_RURAL_REMOTE                   3815.4      908.6   4.199 2.72e-05 ***
## GVA_MAIN_Public_Sector           -11314.6      706.3 -16.019  < 2e-16 ***
## GVA_MAIN_Commercial               21167.5     2293.4   9.230  < 2e-16 ***
## GVA_MAIN_Other_services           -9509.9      751.3 -12.657  < 2e-16 ***
## GVA_MAIN_Public_Utilities         19421.8     1734.1  11.200  < 2e-16 ***
## GVA_MAIN_Industry_transformation  11100.7     1220.7   9.094  < 2e-16 ***
## GVA_MAIN_Industrial               15348.7     2654.9   5.781 7.82e-09 ***
## IDHM                              38163.3     2351.1  16.232  < 2e-16 ***
## COMP_TOT                          46378.4    12894.6   3.597 0.000325 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14850 on 5548 degrees of freedom
## Multiple R-squared:  0.4671, Adjusted R-squared:  0.4657 
## F-statistic: 324.2 on 15 and 5548 DF,  p-value: < 2.2e-16

In the final regression, the adjusted R-squared is 0.4657, which is fairly low: more than half of the variation in GDP per capita is still unexplained by our model. The adjusted R-squared has also decreased slightly as we refined the model (0.4838, then 0.4668, then 0.4657). The F-statistic still lets us reject the null hypothesis that an intercept-only model (the mean) explains GDP per capita as well as our model.
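The adjusted R-squared values reported in the summaries follow from the multiple R-squared, the number of observations n and the number of estimated slopes p. The sketch below recomputes them from the printed figures. `adj_r_squared` is a helper defined here for illustration, and n = 5564 is taken as the residual degrees of freedom plus the number of estimated coefficients.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 5564  # 5548 residual df + 16 coefficients in the final model

# First model: R^2 = 0.488 with 45 slopes
adj_first <- adj_r_squared(0.488, n, 45)
# Final model: R^2 = 0.4671 with 15 slopes
adj_final <- adj_r_squared(0.4671, n, 15)

round(c(first = adj_first, final = adj_final), 4)
```

Because n is large relative to p, the penalty is small, so the drop in adjusted R-squared mostly reflects the drop in R-squared itself.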

5.2.5 Interpretation of results

As per the regression above, which we validate below, each retained variable has a statistically significant association with GDP per capita. Unsurprisingly, the total number of companies is positively and significantly associated with GDP per capita. This is most probably because more companies mean more jobs and therefore more people in employment. To investigate further, we could examine whether the ratio of companies to population has an effect on GDP per capita.

The working population ratio has a positive coefficient while the elderly ratio has a negative one. This is in line with the logic that a larger economically active share of the population raises GDP per capita, whereas a larger share of elderly dependents lowers it. All three Gross Value Added ratios by sector have positive coefficients, but the industry ratio (22,507.7) is far larger than the agriculture (9,008.2) and services (4,989.7) ratios. This is most probably due to the way GDP per capita is calculated, with manufacturing sectors contributing more to it than others.
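To make the coefficient sizes concrete, the sketch below multiplies a few of the final-model estimates by a change of 0.01 in the corresponding ratio. It assumes the ratios are stored on a 0 to 1 scale, so 0.01 is one percentage point. That scale is an assumption and should be checked against the data preparation steps. The result is in the units of GDP_CAPITA, holding the other variables fixed.

```r
# Estimates copied from the final regression summary above
coefs <- c(POP_WORKING_RATIO  = 29840.1,
           POP_ELDERLY_RATIO  = -37947.9,
           GVA_INDUSTRY_RATIO = 22507.7)

# Assumed: ratios are on a 0-1 scale, so 0.01 = one percentage point
delta <- 0.01

# Expected change in GDP_CAPITA for a one-point change in each ratio
effect <- coefs * delta
round(effect, 1)
```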

The IDHM, which is our Human Development Index, is also positively associated with GDP per capita. However, it is not certain that this is a causal relationship, as there might be reverse causality: higher GDP per capita often leads to better outcomes in life. But because the IDHM was recorded in 2010 and GDP per capita in 2016, we can more safely say that a higher HDI might lead to greater GDP per capita.

In terms of our categorical variables, being classified as a Rural Remote or Intermediate Remote region is positively associated with GDP per capita. This broadly matches our choropleth map, which showed inland areas with higher GDP per capita than areas one would expect to be more urbanized. This could be due to smaller populations in these remote areas and a greater focus on industrial or manufacturing jobs.

Interestingly, the main-sector labels for Gross Value Added show that municipalities whose main sector is public services (public administration, defense, education, health and social security) have a negative coefficient, i.e. lower GDP per capita. This may be due to municipalities being specialized for certain government functions. Other services shows the same negative association, although it is not clear why. As expected, municipalities whose main economic activity is commercial have the largest coefficient, but surprisingly public utilities (electricity and gas, water, sewage, waste management and decontamination) come in a close second, ahead of municipalities labelled as industrial (extractive) or industrial transformation.

5.2.6 Clearing Redundant Explanatory Variables

VIF <- ols_vif_tol(GDPPC_sig2.mlr)
VIF

##                           Variables Tolerance      VIF
## 1                 POP_WORKING_RATIO 0.3671431 2.723734
## 2                 POP_ELDERLY_RATIO 0.7191852 1.390463
## 3                 GVA_AGROPEC_RATIO 0.5633825 1.774993
## 4                GVA_INDUSTRY_RATIO 0.5075373 1.970299
## 5                GVA_SERVICES_RATIO 0.6316755 1.583091
## 6           CAT_INTERMEDIATE_REMOTE 0.9602515 1.041394
## 7                  CAT_RURAL_REMOTE 0.8783178 1.138540
## 8            GVA_MAIN_Public_Sector 0.3180169 3.144487
## 9               GVA_MAIN_Commercial 0.9193246 1.087755
## 10          GVA_MAIN_Other_services 0.3603482 2.775094
## 11        GVA_MAIN_Public_Utilities 0.7619520 1.312419
## 12 GVA_MAIN_Industry_transformation 0.5950853 1.680431
## 13              GVA_MAIN_Industrial 0.8998277 1.111324
## 14                             IDHM 0.2730812 3.661915
## 15                         COMP_TOT 0.9651462 1.036112

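As we can see from the VIF analysis, every variable has a VIF well below the commonly used thresholds of 5 or 10 (the largest is 3.66 for IDHM), so none of them is redundant, which is consistent with the correlation analysis done earlier.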
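For reference, the two columns in the table are linked by VIF = 1 / Tolerance, where Tolerance = 1 - R² from regressing one explanatory variable on all the others. The sketch below reproduces that definition by hand on three columns of the built-in mtcars data. It illustrates the formula only and is not a replacement for ols_vif_tol().

```r
# Illustration on the built-in mtcars data (NOT the Brazil data).
predictors <- mtcars[, c("wt", "hp", "disp")]

# For each predictor, regress it on the remaining predictors
tolerance <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 - summary(aux)$r.squared          # Tolerance = 1 - R^2
})

vif <- 1 / tolerance                  # VIF = 1 / Tolerance
round(cbind(Tolerance = tolerance, VIF = vif), 3)
```

In the table above, for example, IDHM has Tolerance 0.2730812 and 1 / 0.2730812 ≈ 3.66.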

5.2.7 Testing for Non-Linearity in model

ols_plot_resid_fit(GDPPC_sig2.mlr)

From the plot above, we can see that the residuals are scattered fairly evenly around the zero line. This suggests the model meets the linearity assumption required by multiple linear regression. Additionally, there are no obvious signs of heteroscedasticity in the plot.
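A visual check for heteroscedasticity can be complemented by a formal test. The sketch below hand-computes the studentized (Koenker) form of the Breusch-Pagan test on the built-in mtcars data: the squared residuals are regressed on the predictors, and n times the R² of that auxiliary regression is compared with a chi-squared distribution. It was not run on the Brazil model and is shown only to illustrate the idea.

```r
# Illustration on the built-in mtcars data (NOT the Brazil data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- data.frame(sq_res = residuals(fit)^2,
                       wt = mtcars$wt,
                       hp = mtcars$hp)
aux <- lm(sq_res ~ wt + hp, data = aux_data)

# LM statistic = n * R^2, chi-squared with (number of predictors) df
lm_stat <- nrow(aux_data) * summary(aux)$r.squared
df <- length(coef(aux)) - 1
p_value <- pchisq(lm_stat, df = df, lower.tail = FALSE)

c(statistic = lm_stat, df = df, p.value = p_value)
```

A small p-value would indicate that the residual variance changes with the predictors.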

5.2.8 Test for Normality Assumption

ols_plot_resid_hist(GDPPC_sig2.mlr)

The figure shows that the residuals of the multiple linear regression model resemble a normal distribution, which supports the normality assumption. We would normally use ols_test_normality() to test this assumption further, but the function is limited to sample sizes between 3 and 5000 and we have 5564 observations. We therefore skip this step and rely on the plot as evidence that the residuals are approximately normal.
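The sample-size limit mentioned above can be reproduced with base R's shapiro.test(), which only accepts between 3 and 5000 observations. The sketch below uses simulated values as a stand-in for the model residuals and shows one possible workaround, testing a random subsample of 5000 values. The workaround is an assumption for illustration and was not part of the original analysis.

```r
set.seed(1)                  # reproducible illustration
res <- rnorm(5564)           # simulated stand-in for the 5564 model residuals

# shapiro.test() raises an error for samples larger than 5000
too_big <- tryCatch(shapiro.test(res),
                    error = function(e) conditionMessage(e))
print(too_big)

# Possible workaround (assumption): test a random subsample of 5000 values
sub_test <- shapiro.test(sample(res, 5000))
print(sub_test$p.value)
```

With samples this large, a formal test tends to flag even small departures from normality, so a histogram or a normal Q-Q plot (qqnorm() followed by qqline()) remains a useful complement.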

5.3 Testing for Spatial Autocorrelation

The model we built uses geographically referenced attributes, hence it is also important for us to visualize the residuals of the model and test them for spatial autocorrelation.

mlr.output <- as.data.frame(GDPPC_sig2.mlr$residuals)

Brazil_residual.sf <- cbind(Brazil_sig_Indic.sf, 
                        GDPPC_sig2.mlr$residuals) %>%
rename(`MLR_RES` = `GDPPC_sig2.mlr.residuals`)

5.3.1 Plotting Choropleth Map of GDP per Capita Residuals

tmap_mode("plot")
tm_shape(Brazil_residual.sf)+
  tm_fill("MLR_RES",
          n = 6,
          style = "quantile",
          palette = "RdYlBu" ) +
  tm_borders(alpha = 0.5)

The residual map shows no clear sign of clustering or of any other geospatial pattern in the distribution. However, we can test this formally using the Moran’s I test.

5.3.2 Building Nearest Neighbours matrix

For this, we will use the point locations of the municipalities, since we already have them. We assume that the row order of the points matches that of the regression data, as we have not sorted either dataset.

Brazil_cities.sp <- as_Spatial(Brazil_cities.sf)
#st_crs(Brazil_cities.sf)
proj4string(Brazil_cities.sp)

## [1] "+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs"
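The assumption that the points and the regression data share the same row order can be checked directly rather than taken on trust. The sketch below uses two small toy tables with hypothetical values, and assumes that both real tables carry a common key column such as CITY_STATE (as the join in Section 6.1 suggests). It compares the key columns and shows how match() could realign the rows if the order differed.

```r
# Toy stand-ins for the two tables; names and values are hypothetical.
points_df <- data.frame(CITY_STATE = c("A_SP", "B_RJ", "C_MG"), x = 1:3)
model_df  <- data.frame(CITY_STATE = c("A_SP", "B_RJ", "C_MG"),
                        res = c(0.2, -0.1, 0.4))

# TRUE only if both tables list the same keys in the same order
same_order <- identical(as.character(points_df$CITY_STATE),
                        as.character(model_df$CITY_STATE))
print(same_order)

# If the order differed, match() realigns the rows by key
shuffled  <- model_df[c(3, 1, 2), ]
realigned <- shuffled[match(points_df$CITY_STATE, shuffled$CITY_STATE), ]
print(realigned)
```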

5.3.2.1 Calculating maximum distance between points

coords <- coordinates(Brazil_cities.sp)
k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1, coords, longlat = TRUE))
summary(k1dists)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.6029   9.1046  13.1276  17.0081  19.7337 363.0083
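The summary above gives a maximum nearest-neighbour distance of 363.0083 km, and the distance threshold of 364 km used in the next step sits just above it, so that every municipality has at least one neighbour. The sketch below illustrates that logic on a few toy planar points (hypothetical coordinates, plain Euclidean distance instead of great-circle distance).

```r
# Toy planar coordinates (hypothetical), not the municipality points.
pts <- cbind(x = c(0, 1, 1.5, 10), y = c(0, 0, 0.5, 0))

d <- as.matrix(dist(pts))
diag(d) <- Inf                        # ignore each point's distance to itself
nn_dist <- apply(d, 1, min)           # distance to each point's nearest neighbour
threshold <- ceiling(max(nn_dist))    # round up, as 363.0083 was rounded to 364

# With this threshold every point has at least one neighbour
n_neighbours <- rowSums(d <= threshold)
print(n_neighbours)
```

A threshold driven by the most isolated point can give points in dense areas a very large number of neighbours, which is worth keeping in mind when reading the test result.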

nb <- dnearneigh(coordinates(Brazil_cities.sp), 0, 364, longlat = TRUE)

nb_lw <- nb2listw(nb, style = 'W')

lm.morantest(GDPPC_sig2.mlr, nb_lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = GDP_CAPITA ~ ., data = Brazil_sig_Indic.sf[2:18]
## %>% st_set_geometry(NULL))
## weights: nb_lw
## 
## Moran I statistic standard deviate = 0.69146, p-value = 0.2446
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     6.391678e-04    -1.799828e-04     1.403444e-06

The p-value of our global Moran’s I test (0.2446) is above 0.05, which means we are unable to reject the null hypothesis that the residuals are randomly distributed in space. In other words, there is no evidence of spatial autocorrelation in the regression residuals, which allows us to trust the relationships in our model a little better.
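For readers unfamiliar with the statistic, the sketch below computes a plain global Moran's I by hand on four toy locations with hypothetical values and row-standardised weights, as style = 'W' produces. It only illustrates the basic formula. lm.morantest() additionally adjusts the expectation and variance for the fact that the values are regression residuals, so its output will not match this simplified calculation exactly.

```r
# Toy example: four locations on a line; neighbours are adjacent locations.
z <- c(2, 1, -1, -2)                  # hypothetical residuals
z <- z - mean(z)                      # Moran's I uses deviations from the mean

W <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
W <- W / rowSums(W)                   # row-standardise the weights

n  <- length(z)
S0 <- sum(W)                          # sum of all weights
moran_I    <- (n / S0) * sum(W * outer(z, z)) / sum(z^2)
expected_I <- -1 / (n - 1)            # expectation under spatial randomness

print(c(moran_I = moran_I, expected_I = expected_I))
```

Values of I well above the expectation indicate that similar values cluster together, while values close to the expectation, as in the test above, indicate no detectable clustering.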

6 Building an Explanatory Model for GDP per Capita using GWmodel

We will try to refine our regression using the GWmodel package.

6.1 Joining Variables to spatial data points

Joint_sf <- left_join(Brazil_cities.sf[,1], Brazil_sig_Indic.sf %>% st_set_geometry(NULL))

Joint_sp <- as_Spatial(Joint_sf)
summary(Joint_sp@data)

##                   CITY_STATE     GDP_CAPITA     POP_WORKING_RATIO
##  Abadia De Goiás_GO    :   1   Min.   :  3191   Min.   :0.4716   
##  Abadia Dos Dourados_MG:   1   1st Qu.:  9062   1st Qu.:0.6087   
##  Abadiânia_GO          :   1   Median : 15870   Median :0.6325   
##  Abaeté_MG             :   1   Mean   : 21122   Mean   :0.6308   
##  Abaetetuba_PA         :   1   3rd Qu.: 26155   3rd Qu.:0.6543   
##  Abaiara_CE            :   1   Max.   :314638   Max.   :0.7448   
##  (Other)               :5558                                     
##  POP_ELDERLY_RATIO GVA_AGROPEC_RATIO GVA_INDUSTRY_RATIO  GVA_SERVICES_RATIO 
##  Min.   :0.02255   Min.   :0.00000   Min.   :0.0000157   Min.   :0.0000461  
##  1st Qu.:0.09799   1st Qu.:0.03364   1st Qu.:0.0368730   1st Qu.:0.1985910  
##  Median :0.11921   Median :0.15062   Median :0.0714602   Median :0.3117002  
##  Mean   :0.12009   Mean   :0.21034   Mean   :0.1377745   Mean   :0.3260963  
##  3rd Qu.:0.14103   3rd Qu.:0.34094   3rd Qu.:0.1795132   3rd Qu.:0.4600063  
##  Max.   :0.42199   Max.   :0.99877   Max.   :0.9991868   Max.   :0.9995977  
##                                                                             
##  CAT_INTERMEDIATE_REMOTE CAT_RURAL_REMOTE  GVA_MAIN_Public_Sector
##  Min.   :0.00000         Min.   :0.00000   Min.   :0.0000        
##  1st Qu.:0.00000         1st Qu.:0.00000   1st Qu.:0.0000        
##  Median :0.00000         Median :0.00000   Median :0.0000        
##  Mean   :0.01078         Mean   :0.05805   Mean   :0.4892        
##  3rd Qu.:0.00000         3rd Qu.:0.00000   3rd Qu.:1.0000        
##  Max.   :1.00000         Max.   :1.00000   Max.   :1.0000        
##                                                                  
##  GVA_MAIN_Commercial GVA_MAIN_Other_services GVA_MAIN_Public_Utilities
##  Min.   :0.000000    Min.   :0.0000          Min.   :0.00000          
##  1st Qu.:0.000000    1st Qu.:0.0000          1st Qu.:0.00000          
##  Median :0.000000    Median :0.0000          Median :0.00000          
##  Mean   :0.008267    Mean   :0.2653          Mean   :0.01761          
##  3rd Qu.:0.000000    3rd Qu.:1.0000          3rd Qu.:0.00000          
##  Max.   :1.000000    Max.   :1.0000          Max.   :1.00000          
##                                                                       
##  GVA_MAIN_Industry_transformation GVA_MAIN_Industrial      IDHM       
##  Min.   :0.00000                  Min.   :0.00000     Min.   :0.0000  
##  1st Qu.:0.00000                  1st Qu.:0.00000     1st Qu.:0.4077  
##  Median :0.00000                  Median :0.00000     Median :0.5563  
##  Mean   :0.04691                  Mean   :0.00629     Mean   :0.5432  
##  3rd Qu.:0.00000                  3rd Qu.:0.00000     3rd Qu.:0.6757  
##  Max.   :1.00000                  Max.   :1.00000     Max.   :1.0000  
##                                                                       
##     COMP_TOT        
##  Min.   :0.0000000  
##  1st Qu.:0.0001169  
##  Median :0.0002941  
##  Mean   :0.0016997  
##  3rd Qu.:0.0008356  
##  Max.   :1.0000000  
##

Building Fixed Bandwidth GWR Model

We will be using a fixed bandwidth here due to the varying nature of the polygons in Brazil.

#bw.fixed <- bw.gwr(formula = GDP_CAPITA ~  POP_WORKING_RATIO + POP_ELDERLY_RATIO + GVA_AGROPEC_RATIO + GVA_INDUSTRY_RATIO + GVA_SERVICES_RATIO + CAT_INTERMEDIATE_REMOTE + CAT_RURAL_REMOTE + GVA_MAIN_Public_Sector + GVA_MAIN_Commercial + GVA_MAIN_Other_services+ GVA_MAIN_Public_Utilities + GVA_MAIN_Industry_transformation +  GVA_MAIN_Industrial + IDHM + COMP_TOT, data=Joint_sp, approach= "AIC", kernel="gaussian", adaptive=FALSE, longlat=TRUE)

# Could not resolve the issue

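Since the bandwidth search above could not be completed, we take the 364 km distance established earlier in Section 5.3.2.1 as the fixed bandwidth.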

gwr.fixed <- gwr.basic(formula = GDP_CAPITA ~  POP_WORKING_RATIO + POP_ELDERLY_RATIO + GVA_AGROPEC_RATIO + GVA_INDUSTRY_RATIO + GVA_SERVICES_RATIO + CAT_INTERMEDIATE_REMOTE + CAT_RURAL_REMOTE + GVA_MAIN_Public_Sector + GVA_MAIN_Commercial + GVA_MAIN_Other_services+ GVA_MAIN_Public_Utilities + GVA_MAIN_Industry_transformation +  GVA_MAIN_Industrial + IDHM + COMP_TOT, data=Joint_sp, bw=364, kernel = 'gaussian', longlat = TRUE)

gwr.fixed

##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2020-06-01 00:49:36 
##    Call:
##    gwr.basic(formula = GDP_CAPITA ~ POP_WORKING_RATIO + POP_ELDERLY_RATIO + 
##     GVA_AGROPEC_RATIO + GVA_INDUSTRY_RATIO + GVA_SERVICES_RATIO + 
##     CAT_INTERMEDIATE_REMOTE + CAT_RURAL_REMOTE + GVA_MAIN_Public_Sector + 
##     GVA_MAIN_Commercial + GVA_MAIN_Other_services + GVA_MAIN_Public_Utilities + 
##     GVA_MAIN_Industry_transformation + GVA_MAIN_Industrial + 
##     IDHM + COMP_TOT, data = Joint_sp, bw = 364, kernel = "gaussian", 
##     longlat = TRUE)
## 
##    Dependent (y) variable:  GDP_CAPITA
##    Independent variables:  POP_WORKING_RATIO POP_ELDERLY_RATIO GVA_AGROPEC_RATIO GVA_INDUSTRY_RATIO GVA_SERVICES_RATIO CAT_INTERMEDIATE_REMOTE CAT_RURAL_REMOTE GVA_MAIN_Public_Sector GVA_MAIN_Commercial GVA_MAIN_Other_services GVA_MAIN_Public_Utilities GVA_MAIN_Industry_transformation GVA_MAIN_Industrial IDHM COMP_TOT
##    Number of data points: 5564
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##    Min     1Q Median     3Q    Max 
## -42091  -5367   -884   3055 252671 
## 
##    Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)                      -13922.8     6105.4  -2.280 0.022622 *  
##    POP_WORKING_RATIO                 29840.1    10241.3   2.914 0.003586 ** 
##    POP_ELDERLY_RATIO                -37947.9     6994.2  -5.426 6.02e-08 ***
##    GVA_AGROPEC_RATIO                  9008.2     1287.0   6.999 2.88e-12 ***
##    GVA_INDUSTRY_RATIO                22507.7     1633.3  13.780  < 2e-16 ***
##    GVA_SERVICES_RATIO                 4989.7     1148.6   4.344 1.42e-05 ***
##    CAT_INTERMEDIATE_REMOTE            4304.2     1967.3   2.188 0.028725 *  
##    CAT_RURAL_REMOTE                   3815.4      908.6   4.199 2.72e-05 ***
##    GVA_MAIN_Public_Sector           -11314.6      706.3 -16.019  < 2e-16 ***
##    GVA_MAIN_Commercial               21167.5     2293.4   9.230  < 2e-16 ***
##    GVA_MAIN_Other_services           -9509.9      751.3 -12.657  < 2e-16 ***
##    GVA_MAIN_Public_Utilities         19421.8     1734.1  11.200  < 2e-16 ***
##    GVA_MAIN_Industry_transformation  11100.7     1220.7   9.094  < 2e-16 ***
##    GVA_MAIN_Industrial               15348.7     2654.9   5.781 7.82e-09 ***
##    IDHM                              38163.3     2351.1  16.232  < 2e-16 ***
##    COMP_TOT                          46378.4    12894.6   3.597 0.000325 ***
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 14850 on 5548 degrees of freedom
##    Multiple R-squared: 0.4671
##    Adjusted R-squared: 0.4657 
##    F-statistic: 324.2 on 15 and 5548 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 1.223844e+12
##    Sigma(hat): 14833.64
##    AIC:  122702.5
##    AICc:  122702.6
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: gaussian 
##    Fixed bandwidth: 364 
##    Regression points: the same locations as observations are used.
##    Distance metric: Great Circle distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                                           Min.    1st Qu.     Median    3rd Qu.
##    Intercept                         -92329.16  -40715.76  -13121.18    6452.30
##    POP_WORKING_RATIO                 -50961.49    9286.58   18723.07   58946.25
##    POP_ELDERLY_RATIO                -300167.48  -53694.90  -40170.12  -24763.62
##    GVA_AGROPEC_RATIO                  -1419.48    2350.16    8997.05   20989.29
##    GVA_INDUSTRY_RATIO                 -4759.82   10065.73   29567.73   33829.25
##    GVA_SERVICES_RATIO                -10466.24    1004.91    9619.26   14927.33
##    CAT_INTERMEDIATE_REMOTE            -2499.98    1220.93    5936.90   19238.03
##    CAT_RURAL_REMOTE                   -2115.10     255.87    2455.88    5966.25
##    GVA_MAIN_Public_Sector            -18916.73  -10431.07   -8711.67   -8063.23
##    GVA_MAIN_Commercial               -14577.52   11193.58   16074.03   33082.35
##    GVA_MAIN_Other_services           -32806.76   -7849.04   -7358.40   -5494.94
##    GVA_MAIN_Public_Utilities         -21453.80   11024.04   18877.86   28535.35
##    GVA_MAIN_Industry_transformation  -62782.35    7649.03   13818.96   18882.11
##    GVA_MAIN_Industrial               -12371.02    5276.13   19584.56   23744.54
##    IDHM                                1910.09   15644.84   37478.87   48643.06
##    COMP_TOT                         -601785.36   38835.86   57729.70  102521.38
##                                          Max.
##    Intercept                          26461.8
##    POP_WORKING_RATIO                 111318.4
##    POP_ELDERLY_RATIO                  53442.4
##    GVA_AGROPEC_RATIO                  31812.6
##    GVA_INDUSTRY_RATIO                 45073.8
##    GVA_SERVICES_RATIO                 26536.2
##    CAT_INTERMEDIATE_REMOTE            40003.4
##    CAT_RURAL_REMOTE                   15040.3
##    GVA_MAIN_Public_Sector              1064.7
##    GVA_MAIN_Commercial                56781.7
##    GVA_MAIN_Other_services             4793.0
##    GVA_MAIN_Public_Utilities          38054.1
##    GVA_MAIN_Industry_transformation   27756.8
##    GVA_MAIN_Industrial                34343.3
##    IDHM                              132568.6
##    COMP_TOT                         1189196.5
##    ************************Diagnostic information*************************
##    Number of data points: 5564 
##    Effective number of parameters (2trace(S) - trace(S'S)): 215.9051 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5348.095 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 122049.3 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 121879.2 
##    Residual sum of squares: 1.032151e+12 
##    R-square value:  0.5506 
##    Adjusted R-square value:  0.5324541 
## 
##    ***********************************************************************
##    Program stops at: 2020-06-01 00:50:22
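To give a feel for what a fixed Gaussian bandwidth of 364 km means, the sketch below evaluates the Gaussian kernel weight exp(-0.5 * (d / bw)^2) at a few distances. The helper name gauss_weight and the chosen distances are hypothetical, and the formula is the Gaussian kernel as described in the GWmodel documentation.

```r
# Gaussian kernel weight for distance d (km) and bandwidth bw (km)
gauss_weight <- function(d, bw) exp(-0.5 * (d / bw)^2)

d <- c(0, 100, 364, 728, 1500)        # illustrative distances in km
w <- gauss_weight(d, bw = 364)
print(round(data.frame(distance_km = d, weight = w), 4))
```

The weight is 1 at the regression point, falls to exp(-0.5) (about 0.61) at one bandwidth and to exp(-2) (about 0.14) at two bandwidths, so observations several hundred kilometres away still contribute to each local regression.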

6.2 Interpretation of Results

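Using the 364 km bandwidth established earlier, the adjusted R-square rises from 0.4657 for the global regression to 0.5325 for the GWR model, and the AICc falls from 122702.6 to 122049.3. This means that the geographically weighted method has resulted in a better-fitting model overall. However, we still need to check how the local R-square is distributed geographically, which we do in the next section.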

7 Visualising GWR Output

7.1 Converting SDF into sf data.frame

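# Convert the GWR output (SDF) to an sf object in SIRGAS 2000 (EPSG:4674)
GWR.sf <- st_as_sf(gwr.fixed$SDF) %>%
  st_transform(4674)

# GWR.sf is already in EPSG:4674 after the step above, so this changes nothing
GWR.sf.transformed <- st_transform(GWR.sf, 4674)

# Bind the GWR output columns to the municipality polygons (assumes matching row order)
gwr.fixed.output <- as.data.frame(gwr.fixed$SDF)

Brazil_sig_Indic.sf.fixed <- cbind(Brazil_sig_Indic.sf, as.matrix(gwr.fixed.output))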

range(Brazil_sig_Indic.sf.fixed$Local_R2)

## [1] 0.4524265 0.9703265

summary(gwr.fixed$SDF$yhat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -15511    8943   17909   21388   29558  104354

7.2 Visualising local R2

tm_shape(Brazil_sig_Indic.sf.fixed) +
  tm_fill(col = "Local_R2",
          style = "jenks",
          palette = "Greens",
          title = "R-squared Values")

7.2.1 Interpretation

The local R-squared ranges from about 0.45 to 0.97, but the map does not show any obvious pattern in its distribution. Although the model does seem to explain some areas better than others, it is not clear why this is the case.

Take-Home_EX04

Lee Yi De

5/29/2020