IS415-Take-home

1.0 Overview

Brazil is the world’s fifth-largest country by area and the sixth most populous. Brazil is classified as an upper-middle income economy by the World Bank and a developing country, with the largest share of global wealth in Latin America. It is considered an advanced emerging economy. It has the ninth largest GDP in the world by nominal, and eighth by PPP measures. Behind all this impressive figures, the spatial development of Brazil is highly unequal as shown in the figure below.

2.0 Task

In this take-home exercise, you are tasked to determine factors affecting the unequal development of Brazil at the municipality level by using the data provided. The specific task of the analysis are as follows:

Prepare a choropleth map showing the distribution of GDP per capita, 2016 at municipality level.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using multiple linear regression method.
Prepare a choropleth map showing the distribution of the residual of the GDP per capita.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using geographically weighted regression method.
Prepare a series of choropleth maps showing the outputs of the geographically weighted regression model.

3.0 Datasets

1. Brazil Cities: Data
1. Data Dictionary which provides meta data of each column in Brazil Cities: Data
1. 2016 municipality boundary file

4.0 Data Wrangling

4.1 Importing of Packages

packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse','geobr','dplyr','heatmaply')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

4.2 Importing Aspatial Data

cities <- read_delim("data/aspatial/BRAZIL_CITIES.csv", delim = ";")
data_dict <- read_delim("data/aspatial/Data_Dictionary.csv", delim = ";")

Reading the Aspatial Data:

glimpse(cities)

Before carrying out this analysis, I have identified the following:

Indepedent Variable (X-axis):

Gross Domestic Product Per Capita

Dependent Variables (Y-axis):

Resident Population Regular Urban Planning - from 15 to 59 years old (Economy Active)
HDI Human Development Index
HDI GNI Index
HDI Life Expectancy Index
HDI Education Index
City area (squared kilometers)
Gross Added Value - Agropecuary
Gross Added Value - Industry
Gross Added Value - Services
Gross Added Value - Public Services
Taxes
Population

These variables are chosen because they can either affect the GDP per capita of each city either directly or indirectly. There are several factors that can affect the GDP, including the quality of life, poverty and economic inequality and more. Therefore, the above variables are selected as they fall under the category of the different factors that can be related to determining the GDP per capita of each city.

With these variables identified, we have to check if there are NA values in these columns:

colSums(is.na(cities[, c(15, 20:23, 29, 34:37, 39, 41, 42)]))

##       IBGE_15-59             IDHM       IDHM_Renda IDHM_Longevidade 
##                8                8                8                8 
##    IDHM_Educacao             AREA      GVA_AGROPEC     GVA_INDUSTRY 
##                8                3             1476             1476 
##     GVA_SERVICES       GVA_PUBLIC            TAXES          POP_GDP 
##             1476             1476             1476             1476 
##       GDP_CAPITA 
##             1476

From the above results, we can tell that there are variables that have N.A. values. With the derived N.A. values, it is necessary to replace these N.A. values with 0 for an accurate analysis:

cities[, c(15, 20:23, 29, 34:37, 39, 41, 42)][is.na(cities[, c(15, 20:23, 29, 34:37, 39, 41, 42)])] <- 0

Verify again if the N.A. values are being replaced with 0:

colSums(is.na(cities[, c(15, 20:23, 29, 34:37, 39, 41, 42)]))

##       IBGE_15-59             IDHM       IDHM_Renda IDHM_Longevidade 
##                0                0                0                0 
##    IDHM_Educacao             AREA      GVA_AGROPEC     GVA_INDUSTRY 
##                0                0                0                0 
##     GVA_SERVICES       GVA_PUBLIC            TAXES          POP_GDP 
##                0                0                0                0 
##       GDP_CAPITA 
##                0

Creating a new data frame for the independent, dependent variables for study later, including LONG and LAT for geographical analysis:

cities_variable <- cities[, c(15, 20:25, 29, 34:37, 39, 41, 42)]

4.3 Importing Geospatial Data

# municipal <- read_municipality(year=2016) -> Error in running as the package "geobr" has expired. 
municipal <- st_read(dsn="data/geospatial", layer = "muni_sf")

Check CRS of Municipal:

st_crs (municipal)

## Coordinate Reference System:
##   User input: 4674 
##   wkt:
## GEOGCS["SIRGAS 2000",
##     DATUM["Sistema_de_Referencia_Geocentrico_para_las_AmericaS_2000",
##         SPHEROID["GRS 1980",6378137,298.257222101,
##             AUTHORITY["EPSG","7019"]],
##         TOWGS84[0,0,0,0,0,0,0],
##         AUTHORITY["EPSG","6674"]],
##     PRIMEM["Greenwich",0,
##         AUTHORITY["EPSG","8901"]],
##     UNIT["degree",0.0174532925199433,
##         AUTHORITY["EPSG","9122"]],
##     AUTHORITY["EPSG","4674"]]

4.4 Converting aspatial data frame into a sf object

Currently, the data frame is aspatial, thus it is necessary to convert it to a sf object. The code chunk below converts cities data frame into a simple feature data frame by using st_as_sf() of sf packages.

cities.sf <- st_as_sf(cities_variable, coords = c("LONG", "LAT"), crs=4326) %>% st_transform(crs=4674)
head(cities.sf)

## Simple feature collection with 6 features and 13 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -49.44055 ymin: -19.15585 xmax: -39.04755 ymax: -1.72347
## CRS:            EPSG:4674
## # A tibble: 6 x 14
##   `IBGE_15-59`  IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao  AREA GVA_AGROPEC
##          <dbl> <dbl>      <dbl>            <dbl>         <dbl> <dbl>       <dbl>
## 1         3542 0.708      0.687            0.83          0.622  147.         6.2
## 2         2709 0.69       0.693            0.839         0.563  881.     50525. 
## 3         6896 0.69       0.671            0.841         0.579    1          0  
## 4        11979 0.698      0.72             0.848         0.556    1          0  
## 5        53516 0.628      0.579            0.798         0.537    1          0  
## 6         2631 0.628      0.54             0.748         0.612  180.      4435. 
## # … with 7 more variables: GVA_INDUSTRY <dbl>, GVA_SERVICES <dbl>,
## #   GVA_PUBLIC <dbl>, TAXES <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>,
## #   geometry <POINT [°]>

Join both the municipal and cities.sf datasets together:

brazil_cities <- st_join(municipal, cities.sf, join=st_contains)

## although coordinates are longitude/latitude, st_contains assumes that they are planar

5.0 Exploratory Data Analysis

5.1 EDA using statistical graphics

economyactive <- ggplot(data=cities.sf, 
             aes(x= `IBGE_15-59`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

human_dev_index <- ggplot(data=cities.sf, 
             aes(x= `IDHM`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gni_index <- ggplot(data=cities.sf, 
             aes(x= `IDHM_Renda`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

life_expectancy <- ggplot(data=cities.sf, 
             aes(x= `IDHM_Longevidade`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

education_index <- ggplot(data=cities.sf, 
             aes(x= `IDHM_Educacao`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

city_area <- ggplot(data=cities.sf, 
             aes(x= `AREA`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gav_agropecuary <- ggplot(data=cities.sf, 
             aes(x= `GVA_AGROPEC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gav_industry <- ggplot(data=cities.sf, 
             aes(x= `GVA_INDUSTRY`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gav_services <- ggplot(data=cities.sf, 
             aes(x= `GVA_SERVICES`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gav_public<- ggplot(data=cities.sf, 
             aes(x= `GVA_PUBLIC`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

taxes <- ggplot(data=cities.sf, 
             aes(x= `TAXES`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

population <- ggplot(data=cities.sf, 
             aes(x= `POP_GDP`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gdp_capita <- ggplot(data=cities.sf, 
             aes(x= `GDP_CAPITA`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ggarrange(economyactive, human_dev_index, gni_index, life_expectancy, education_index, city_area, gav_agropecuary, gav_industry, gav_services, gav_public, taxes, population, gdp_capita, ncol = 3, nrow = 2)

## $`1`

## 
## $`2`

## 
## $`3`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

Based on the above plots, we can tell that the value range between the different variables vary quite significantly, for instance, AREA size in 100 range and Education Index in 0.1 range. Therefore, a standardisation will be necessary on the following selected variables to ensure that data accuracy is obtained:

Economy Active (IBGE_15-59)
City Area (AREA)
Gross Value Added Agropecuary (GVA_AGROPEC)
Gross Value Added Industry (GVA_INDUSTRY)
Gross Value Added Services (GVA_SERVICES)
Gross Value Added Public (GVA_PUBLIC)
Taxes (TAXES)
Population (POP_GDP)
Gdp per capita (GDP_CAPITA)

5.2 Standardisation

cities_stan <- as.data.frame(brazil_cities[, c(5, 10:18)])

brazil_cities.std <- normalize(cities_stan)
summary(brazil_cities.std)

##    IBGE_15-59             AREA          GVA_AGROPEC        GVA_INDUSTRY      
##  Min.   :0.0000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.0000000  
##  1st Qu.:0.0002448   1st Qu.:0.02501   1st Qu.:0.000000   1st Qu.:0.0000000  
##  Median :0.0005433   Median :0.20159   Median :0.009347   Median :0.0001618  
##  Mean   :0.0025765   Mean   :0.26605   Mean   :0.035082   Mean   :0.0073698  
##  3rd Qu.:0.0013588   3rd Qu.:0.41106   3rd Qu.:0.043115   3rd Qu.:0.0011137  
##  Max.   :1.0000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.0000000  
##  NA's   :2           NA's   :2         NA's   :2          NA's   :2          
##   GVA_SERVICES         GVA_PUBLIC           TAXES              POP_GDP        
##  Min.   :0.0000000   Min.   :0.000000   Min.   :0.0000000   Min.   :0.000000  
##  1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.0009052   1st Qu.:0.000000  
##  Median :0.0002356   Median :0.001817   Median :0.0010144   Median :0.001951  
##  Mean   :0.0050727   Mean   :0.007349   Mean   :0.0052787   Mean   :0.007490  
##  3rd Qu.:0.0010696   3rd Qu.:0.004520   3rd Qu.:0.0016128   3rd Qu.:0.005111  
##  Max.   :1.0000000   Max.   :1.000000   Max.   :1.0000000   Max.   :1.000000  
##  NA's   :2           NA's   :2          NA's   :2           NA's   :2         
##    GDP_CAPITA               geometry   
##  Min.   :0.00000   MULTIPOLYGON :5575  
##  1st Qu.:0.00000   epsg:4674    :   0  
##  Median :0.03329   +proj=long...:   0  
##  Mean   :0.04978                       
##  3rd Qu.:0.06982                       
##  Max.   :1.00000                       
##  NA's   :2

After standardizing the required variables that are different in the range values, it is necessary to add the variables that are standardized with the ones that do not require to be standardized for further analysis:

Keeping the columns needed in cities.sf for merge purpose with the standardized variables:

new_cities = brazil_cities[c(6:9)]

To ensure that I am able to join the two data, I have to check the crs for brazil_cities.std first:

st_crs(brazil_cities.std)

## Coordinate Reference System: NA

Set the CRS of the brazil_cities.std to be the same as the cities.sf in order to merge them together:

brazil_cities.std <- st_as_sf(brazil_cities.std) %>% st_set_crs(4674)

Verify the newly changed CRS of brazil_cities.std:

st_crs(brazil_cities.std)

## Coordinate Reference System:
##   User input: EPSG:4674 
##   wkt:
## GEOGCS["SIRGAS 2000",
##     DATUM["Sistema_de_Referencia_Geocentrico_para_las_AmericaS_2000",
##         SPHEROID["GRS 1980",6378137,298.257222101,
##             AUTHORITY["EPSG","7019"]],
##         TOWGS84[0,0,0,0,0,0,0],
##         AUTHORITY["EPSG","6674"]],
##     PRIMEM["Greenwich",0,
##         AUTHORITY["EPSG","8901"]],
##     UNIT["degree",0.0174532925199433,
##         AUTHORITY["EPSG","9122"]],
##     AUTHORITY["EPSG","4674"]]

Create a new variable “cities_final” as a newly merged data for the standardized variables and the initial ones that need not be standardized:

cities_final <- st_join(new_cities, brazil_cities.std, join=st_contains)

Once the variable is created, check for NA values in the merge table:

colSums(is.na(cities_final))

##             IDHM       IDHM_Renda IDHM_Longevidade    IDHM_Educacao 
##                2                2                2                2 
##       IBGE_15-59             AREA      GVA_AGROPEC     GVA_INDUSTRY 
##                2                2                2                2 
##     GVA_SERVICES       GVA_PUBLIC            TAXES          POP_GDP 
##                2                2                2                2 
##       GDP_CAPITA         geometry 
##                2                0

Since there are N.A. values, it is necessary to replace them with the value 0 instead:

cities_final[is.na(cities_final)] <- 0

colSums(is.na(cities_final))

##             IDHM       IDHM_Renda IDHM_Longevidade    IDHM_Educacao 
##                0                0                0                0 
##       IBGE_15-59             AREA      GVA_AGROPEC     GVA_INDUSTRY 
##                0                0                0                0 
##     GVA_SERVICES       GVA_PUBLIC            TAXES          POP_GDP 
##                0                0                0                0 
##       GDP_CAPITA         geometry 
##                0                0

Part I: Prepare a choropleth map showing the distribution of GDP per capita, 2016 at municipality level.

Since the cities_final has more than 3000 observations, as understood from the cran package information online, style = “fisher” is used for more than 3000 observations and this style should thus be preferred to “jenks” as it uses the original Fortran code and runs nested for-loops much faster.

tm_shape(cities_final) + 
  tm_fill(col = "GDP_CAPITA",
          n = 5,
          style = "fisher", 
          title = "GDP Per capita") + 
  tm_borders(alpha = 0.5)

Analysis: As seen from the choropleth map above, the municipals with the higher GDP per capita are along the southern part of Brazil, bordering Argentina, Paraguay and Uruguay. An assumption for this could be due to the fact that the area is near to the other borders, which could make it a tourist-spot for many foreigners, thus driving higher GDP per capita as the tourism industry prosper.

Part II: Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using multiple linear regression method.

Ho: Rate of change of dependent variable can be explained with mean.
H1: Mean is not a good estimator of explanatory variable.
Confidence interval is at 95%, thus the alpha value = 0.05.

Visualising the relationships of the independent variables

Using multilcollinearity in statistics to determine the relationships of the variables:

Drop the geometry column in order to compute the correlation plot:

cities_without_geometry <- st_set_geometry(cities_final, NULL)

Rename the column name for IBGE_15-59 between 15-59:

names(cities_without_geometry)[names(cities_without_geometry) == "IBGE_15-59"] <- "active_economy"

Visualize the plot for the correlation between the variables used:

corrplot(cor(cities_without_geometry[,1:12]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

From the scatterplot matrix above, correlation values (>0.80) should be eliminated, and the following variables are highly correlated to some of the other variables:

GVA_SERVICES
POP_GDP
IDHM
IDHM_Renda

The newly updated correlation plot can be seen below:

corrplot(cor(cities_without_geometry[, c(3:8,10,11)]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

Building a hedonic pricing model using multiple linear regression method

cities.mlr <- lm(formula = GDP_CAPITA ~ IDHM_Longevidade + IDHM_Educacao + active_economy + AREA + GVA_AGROPEC  + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data=cities_without_geometry)
summary(cities.mlr)

## 
## Call:
## lm(formula = GDP_CAPITA ~ IDHM_Longevidade + IDHM_Educacao + 
##     active_economy + AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + 
##     TAXES, data = cities_without_geometry)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39563 -0.02232 -0.00614  0.01094  0.82134 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.094176   0.009848  -9.563  < 2e-16 ***
## IDHM_Longevidade  0.018348   0.015847   1.158    0.247    
## IDHM_Educacao     0.199987   0.009903  20.195  < 2e-16 ***
## active_economy   -0.316642   0.041980  -7.543 5.35e-14 ***
## AREA              0.032287   0.002865  11.269  < 2e-16 ***
## GVA_AGROPEC       0.204075   0.011480  17.777  < 2e-16 ***
## GVA_INDUSTRY      0.550390   0.022013  25.004  < 2e-16 ***
## GVA_PUBLIC       -0.421944   0.028473 -14.819  < 2e-16 ***
## TAXES             0.344407   0.027198  12.663  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05024 on 5572 degrees of freedom
## Multiple R-squared:  0.3833, Adjusted R-squared:  0.3824 
## F-statistic: 432.9 on 8 and 5572 DF,  p-value: < 2.2e-16

With reference to the report above, it is clear that not all the independent variables are statistically significant. Therefore, we will revise the model by removing the variable (“IDHM_Longevidade”) which is not statistically significant:

cities.mlr1 <- lm(formula = GDP_CAPITA ~IDHM_Educacao + active_economy + AREA + GVA_AGROPEC  + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data=cities_without_geometry)
ols_regress(cities.mlr1)

##                          Model Summary                          
## ---------------------------------------------------------------
## R                       0.619       RMSE                 0.050 
## R-Squared               0.383       Coef. Var          101.009 
## Adj. R-Squared          0.382       MSE                  0.003 
## Pred R-Squared          0.344       MAE                  0.027 
## ---------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                 ANOVA                                  
## ----------------------------------------------------------------------
##                Sum of                                                 
##               Squares          DF    Mean Square       F         Sig. 
## ----------------------------------------------------------------------
## Regression      8.737           7          1.248    494.474    0.0000 
## Residual       14.067        5573          0.003                      
## Total          22.803        5580                                     
## ----------------------------------------------------------------------
## 
##                                     Parameter Estimates                                      
## --------------------------------------------------------------------------------------------
##          model      Beta    Std. Error    Std. Beta       t        Sig      lower     upper 
## --------------------------------------------------------------------------------------------
##    (Intercept)    -0.084         0.004                 -20.315    0.000    -0.092    -0.076 
##  IDHM_Educacao     0.208         0.007        0.315     28.505    0.000     0.193     0.222 
## active_economy    -0.318         0.042       -0.088     -7.566    0.000    -0.400    -0.235 
##           AREA     0.032         0.003        0.130     11.246    0.000     0.027     0.038 
##    GVA_AGROPEC     0.205         0.011        0.210     17.836    0.000     0.182     0.227 
##   GVA_INDUSTRY     0.550         0.022        0.367     25.003    0.000     0.507     0.594 
##     GVA_PUBLIC    -0.423         0.028       -0.232    -14.852    0.000    -0.479    -0.367 
##          TAXES     0.344         0.027        0.172     12.654    0.000     0.291     0.397 
## --------------------------------------------------------------------------------------------

summary(cities.mlr1)

## 
## Call:
## lm(formula = GDP_CAPITA ~ IDHM_Educacao + active_economy + AREA + 
##     GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data = cities_without_geometry)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39539 -0.02239 -0.00610  0.01105  0.82135 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.083823   0.004126 -20.315  < 2e-16 ***
## IDHM_Educacao   0.207750   0.007288  28.505  < 2e-16 ***
## active_economy -0.317559   0.041974  -7.566 4.49e-14 ***
## AREA            0.032216   0.002865  11.246  < 2e-16 ***
## GVA_AGROPEC     0.204601   0.011471  17.836  < 2e-16 ***
## GVA_INDUSTRY    0.550400   0.022013  25.003  < 2e-16 ***
## GVA_PUBLIC     -0.422754   0.028465 -14.852  < 2e-16 ***
## TAXES           0.344176   0.027198  12.654  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05024 on 5573 degrees of freedom
## Multiple R-squared:  0.3831, Adjusted R-squared:  0.3824 
## F-statistic: 494.5 on 7 and 5573 DF,  p-value: < 2.2e-16

Analysis: With the p-value of < 2.2e-16, it is evident it is smaller than the alpha value of 0.05, thus, we will reject the null hypothesis that the rate of change of dependent variable can be explained with mean. With the result, we can infer that the model constructed depicts a better explanatory factor of the y value.

Checking for multicolinearity

ols_vif_tol(cities.mlr1)

##        Variables Tolerance      VIF
## 1  IDHM_Educacao 0.9044930 1.105592
## 2 active_economy 0.8182043 1.222189
## 3           AREA 0.8285247 1.206965
## 4    GVA_AGROPEC 0.8003989 1.249377
## 5   GVA_INDUSTRY 0.5144873 1.943683
## 6     GVA_PUBLIC 0.4545631 2.199915
## 7          TAXES 0.6010482 1.663760

Analysis: From the results above, the VIF of the independent variables is less than 10. Therefore, it is safe to conclude that there are no sign of multicollinearity among the independent variables.**

Testing for Non-Linearity

In multiple linear regression, it is important for us to test the assumption that linearity and additivity of the relationship between dependent and independent variables:

ols_plot_resid_fit(cities.mlr1)

Analysis: From the figure above, it reveals that the data points are mainly scattered around the 0 line, thus, we can safely conclude that the relationships between the variables (dependent and independent) are linear.

Test for Normality Assumption

ols_plot_resid_hist(cities.mlr1)

The figure above has shown that the residual of the multiple linear regression model (cities.mlr1) does resemble normal distribution.

Testing for Spatial Autocorrelation

First, export the residual of the hedonic pricing model and save it in a dataframe:

mlr.output <- as.data.frame(cities.mlr1$residuals)

Second, join the newly created data frame with cities_final object:

cities_final.res.sf <- cbind(cities_final, cities.mlr1$residuals) %>%
dplyr::rename(`MLR_RES` = `cities.mlr1.residuals`)

Third, convert the cities_final.res.sf simple feature object into a SpatialPointsDataFrame because spdep package can only process sp conformed spatial data objects:

cities_final.sp <- as_Spatial(cities_final.res.sf)
cities_final.sp

## class       : SpatialPolygonsDataFrame 
## features    : 5581 
## extent      : -73.99045, -28.83594, -33.75118, 5.271841  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs 
## variables   : 14
## names       :  IDHM, IDHM_Renda, IDHM_Longevidade, IDHM_Educacao, IBGE_15.59, AREA, GVA_AGROPEC, GVA_INDUSTRY, GVA_SERVICES, GVA_PUBLIC, TAXES, POP_GDP, GDP_CAPITA,            MLR_RES 
## min values  :     0,          0,                0,             0,          0,    0,           0,            0,            0,          0,     0,       0,          0, -0.395386037884408 
## max values  : 0.862,      0.891,            0.894,         0.825,          1,    1,           1,            1,            1,          1,     1,       1,          1,  0.821346601818066

Lastly, use the tmap package to display the distribution of the residuals on an interactive map:

PART III: Prepare a choropleth map showing the distribution of the residual of the GDP per capita.

tm_shape(municipal)+
  tm_polygons(alpha = 0.4) +
tm_shape(cities_final.res.sf) +  
  tm_fill(col = "MLR_RES",
          alpha = 0.6,
          style="quantile")

tmap_mode("plot")

Analysis: The plot above revealed that there is a sign of spatial autocorrelation. However, to further support that the sign is true, we will perform the following Moran’s I test to validate.

Before performing the Moran’s I test, we have to first determine the cut-off distance:

coords <- coordinates(cities_final.sp)
k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1, coords, longlat=TRUE))
summary(k1dists)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.71   15.86   21.32   23.96  369.66

With the given Max = 369.66, we can set the cut-off as 370 for the following analysis.

Moran’s I test has to be performed:

nb <- dnearneigh(coords, 0, 370, longlat = TRUE)
#summary(nb)  #Checks for the overall information for nb

nb_lw <- nb2listw(nb, style = 'W')
#summary(nb_lw)

lm.morantest(cities.mlr1, nb_lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = GDP_CAPITA ~ IDHM_Educacao + active_economy + AREA
## + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data =
## cities_without_geometry)
## weights: nb_lw
## 
## Moran I statistic standard deviate = 72.048, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     8.308517e-02    -2.939627e-04     1.339280e-06

Analysis: The Global Moran’s I test for residual spatial autocorrelation shows that its p-value is at 2.2e-16, which is less than the alpha value set beforehand at 0.05. With the results, we will thus reject the null hypothesis that the residuals are randomly distributed. Since the Observed Global Moran I is at 0.08308517, which is greater than 0, we can infer that the residuals resemble cluster distribution.

Part IV: Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using geographically weighted regression method.

Building Fixed Bandwidth GWR Model

Computing Fixed Bandwidth

bw.fixed <- bw.gwr(formula = GDP_CAPITA ~ IDHM_Educacao + IBGE_15.59 + AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES , data=cities_final.sp, approach="CV", kernel="gaussian", adaptive=FALSE, longlat=FALSE)

## Take a cup of tea and have a break, it will take a few minutes.
##           -----A kind suggestion from GWmodel development group
## Fixed bandwidth: 34.70094 CV score: 14.86153 
## Fixed bandwidth: 21.45065 CV score: 14.71739 
## Fixed bandwidth: 13.26152 CV score: 14.39483 
## Fixed bandwidth: 8.200357 CV score: 13.85115 
## Fixed bandwidth: 5.072388 CV score: 13.42549 
## Fixed bandwidth: 3.139197 CV score: 13.26919 
## Fixed bandwidth: 1.944419 CV score: 16.13392 
## Fixed bandwidth: 3.87761 CV score: 13.29446 
## Fixed bandwidth: 2.682833 CV score: 13.48077 
## Fixed bandwidth: 3.421246 CV score: 13.25667 
## Fixed bandwidth: 3.595562 CV score: 13.26687 
## Fixed bandwidth: 3.313513 CV score: 13.25578 
## Fixed bandwidth: 3.24693 CV score: 13.25831 
## Fixed bandwidth: 3.354663 CV score: 13.25547 
## Fixed bandwidth: 3.380096 CV score: 13.2557 
## Fixed bandwidth: 3.338945 CV score: 13.25549

Analysis: From the above results, we can tell that the recommended bandwidth is 3.338945 metres. Hence, we will then calculate the local regression using all the data points within this recommended bandwidth.

Calibrate the gwr model using fixed bandwidth and gaussian kernel:

gwr.fixed <- gwr.basic(formula = GDP_CAPITA ~ IDHM_Educacao + IBGE_15.59 + AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES , data=cities_final.sp, bw=bw.fixed, kernel = 'gaussian', longlat = FALSE)

Next, display the model output:

gwr.fixed

##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2020-05-31 16:51:57 
##    Call:
##    gwr.basic(formula = GDP_CAPITA ~ IDHM_Educacao + IBGE_15.59 + 
##     AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data = cities_final.sp, 
##     bw = bw.fixed, kernel = "gaussian", longlat = FALSE)
## 
##    Dependent (y) variable:  GDP_CAPITA
##    Independent variables:  IDHM_Educacao IBGE_15.59 AREA GVA_AGROPEC GVA_INDUSTRY GVA_PUBLIC TAXES
##    Number of data points: 5581
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39539 -0.02239 -0.00610  0.01105  0.82135 
## 
##    Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
##    (Intercept)   -0.083823   0.004126 -20.315  < 2e-16 ***
##    IDHM_Educacao  0.207750   0.007288  28.505  < 2e-16 ***
##    IBGE_15.59    -0.317559   0.041974  -7.566 4.49e-14 ***
##    AREA           0.032216   0.002865  11.246  < 2e-16 ***
##    GVA_AGROPEC    0.204601   0.011471  17.836  < 2e-16 ***
##    GVA_INDUSTRY   0.550400   0.022013  25.003  < 2e-16 ***
##    GVA_PUBLIC    -0.422754   0.028465 -14.852  < 2e-16 ***
##    TAXES          0.344176   0.027198  12.654  < 2e-16 ***
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 0.05024 on 5573 degrees of freedom
##    Multiple R-squared: 0.3831
##    Adjusted R-squared: 0.3824 
##    F-statistic: 494.5 on 7 and 5573 DF,  p-value: < 2.2e-16 
##    ***Extra Diagnostic information
##    Residual sum of squares: 14.06651
##    Sigma(hat): 0.05021286
##    AIC:  -17536.75
##    AICc:  -17536.72
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: gaussian 
##    Fixed bandwidth: 3.354663 
##    Regression points: the same locations as observations are used.
##    Distance metric: Euclidean distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                         Min.     1st Qu.      Median     3rd Qu.    Max.
##    Intercept      -0.1085416  -0.0696238  -0.0405163  -0.0188119  0.0025
##    IDHM_Educacao  -0.0016779   0.0623688   0.1693377   0.1951273  0.2436
##    IBGE_15.59     -3.6570485  -1.2219962  -0.3738440  -0.2340178  0.0038
##    AREA           -0.0094569   0.0185517   0.0265992   0.0339449  0.0788
##    GVA_AGROPEC    -0.3460700   0.1115879   0.1325180   0.1795846  1.1304
##    GVA_INDUSTRY  -53.7690819   0.5041928   0.6028873   1.0037581  1.4931
##    GVA_PUBLIC     -0.8559779  -0.6080538  -0.3613010  -0.1912521 19.6113
##    TAXES          -2.0113579   0.2582793   0.3860744   0.4419321 32.3941
##    ************************Diagnostic information*************************
##    Number of data points: 5581 
##    Effective number of parameters (2trace(S) - trace(S'S)): 87.27933 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5493.721 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -18561.59 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -18630.11 
##    Residual sum of squares: 11.46709 
##    R-square value:  0.4971245 
##    Adjusted R-square value:  0.4891338 
## 
##    ***********************************************************************
##    Program stops at: 2020-05-31 16:52:20

Analysis: With the results above, we can observe that the adjusted r-square value is at 0.4891338. This value is used to compare to see if the local regression model is better than the global regression model. As the adjusted R-square value for the local regression model is higher than that of global regression model, at 0.4891338 and 0.3824 respectively, it is safe to conclude that the local regression model is better than the global regression model.

Hence, we then visualise the GWR output in addition to the regression residuals.

Part V: Prepare a series of choropleth maps showing the outputs of the geographically weighted regression model.

Visualising GWR Output

Converting the SDF into sf data.frame

cities_nogeom.fixed <- st_as_sf(gwr.fixed$SDF) %>%
  st_transform(crs=4674)

cities_nogeom.fixed.svy21 <- st_transform(cities_nogeom.fixed, 4674)

gwr.fixed.output <- as.data.frame(gwr.fixed$SDF)
cities_nogeom.fixed <- cbind(cities_final.res.sf, as.matrix(gwr.fixed.output))

#cities_nogeom.fixed

Outputs of the geographically weighted regression model:

Local R2
Predicted Values (y hat)
Coeefficient of Standard Error (Intercept_SE)

1: Choropleth map for Local R2 values

library(RColorBrewer)
tm_shape(municipal) + 
  tm_polygons() + 
tm_shape(cities_nogeom.fixed) +
  tm_polygons(col="Local_R2",
          palette = "seq",
          border.col="gray80",
          border.lwd=1)

Analysis: The local R2 values range between 0.0 to 1.0 and they indicate how well the local regression model fits observed y-values, which are the factors affecting the GDP per capita. Very low value for the local R2 indicates that the local model is performing poorly. Hence, those with higher R squared values are of better prediction than those that are lower. From the plot above, we can find that the local R2 values range from 0.3 to 1.0.

As seen from the above plot, we can tell that the strongest intensity in orange tone (indicating higher local R2) happens in the North-West region of Brazil. Thus, we can infer that the region has areas that show that the factors selected in the analysis (y-values) fit the local regression model very well as compared to other regions. This illustrates that the relationships between the factors affecting GDP per capita and GDP per capita were much better captured by the regression model in these areas.

As for the remaining regions, we can also observe that the intensity of the colour fades from the North-Eastern region down to the Southern region of Brazil. The intensity of the area colour stays pretty constant throughout the rest of the map, with similar tone intensity belonging to the same area. However, in these areas, it illustrates that the relationships between the factors affecting GDP per capita and GDP per capita were not captured as well by the regression model in these areas.

2: Choropleth map for predicted values (yhat)

tm_shape(municipal) + 
  tm_polygons() + 
tm_shape(cities_nogeom.fixed) +
  tm_fill(col="yhat", 
          border.col="gray60",
          border.lwd=1)

Analysis: The Y hat values represent the predicted equation for a line of best fit in regression. It is used to differentiate between the predicted data and the observed data y.

As observed from above, most of the regions are at a value range of 0.0 to 0.2, while a sparse distribution of values ranging from -0.2 to 0.0 can be found in the North to West Region of Brazil.

3: Choropleth map for Coefficient Standard Error (Intercept_SE)

tm_shape(municipal) + 
  tm_polygons() + 
tm_shape(cities_nogeom.fixed) +
  tm_fill(col="Intercept_SE", 
          border.col="gray60",
          border.lwd=1)

Analysis: These values measure the reliability of each coefficient estimate. Confidence in those estimates are higher when standard errors are small in relation to the actual coefficient values. Large standard errors may indicate problems with local collinearity.

The coefficient standard error is higher at the Western to North-west region of Brazil, while the rest remains pretty constant throughout the other regions. Hence, we can assume that the Western to North-west region of Brazil may indicate problems with local collinearity while the other regions do not. Further tests should be carried out to validate this observation above.

IS415-Take-home_Ex04

Danston Goh

5/28/2020