Brazil is the world’s fifth-largest country by area and the sixth most populous. Brazil is classified as an upper-middle income economy by the World Bank and a developing country, with the largest share of global wealth in Latin America. It is considered an advanced emerging economy. It has the ninth largest GDP in the world by nominal, and eighth by PPP measures. Behind all this impressive figures, the spatial development of Brazil is highly unequal as shown in the figure below.
In this take-home exercise, you are tasked to determine factors affecting the unequal development of Brazil at the municipality level by using the data provided. The specific task of the analysis are as follows:
Prepare a choropleth map showing the distribution of GDP per capita, 2016 at municipality level.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using multiple linear regression method.
Prepare a choropleth map showing the distribution of the residual of the GDP per capita.
Calibrate an explanatory model to explain factors affecting the GDP per capita at the municipality level by using geographically weighted regression method.
Prepare a series of choropleth maps showing the outputs of the geographically weighted regression model.
packages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse','geobr','dplyr','heatmaply')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
cities <- read_delim("data/aspatial/BRAZIL_CITIES.csv", delim = ";")
data_dict <- read_delim("data/aspatial/Data_Dictionary.csv", delim = ";")
Reading the Aspatial Data:
glimpse(cities)
Before carrying out this analysis, I have identified the following:
Indepedent Variable (X-axis):
Dependent Variables (Y-axis):
These variables are chosen because they can either affect the GDP per capita of each city either directly or indirectly. There are several factors that can affect the GDP, including the quality of life, poverty and economic inequality and more. Therefore, the above variables are selected as they fall under the category of the different factors that can be related to determining the GDP per capita of each city.
With these variables identified, we have to check if there are NA values in these columns:
colSums(is.na(cities[, c(15, 20:23, 29, 34:37, 39, 41, 42)]))
## IBGE_15-59 IDHM IDHM_Renda IDHM_Longevidade
## 8 8 8 8
## IDHM_Educacao AREA GVA_AGROPEC GVA_INDUSTRY
## 8 3 1476 1476
## GVA_SERVICES GVA_PUBLIC TAXES POP_GDP
## 1476 1476 1476 1476
## GDP_CAPITA
## 1476
From the above results, we can tell that there are variables that have N.A. values. With the derived N.A. values, it is necessary to replace these N.A. values with 0 for an accurate analysis:
cities[, c(15, 20:23, 29, 34:37, 39, 41, 42)][is.na(cities[, c(15, 20:23, 29, 34:37, 39, 41, 42)])] <- 0
Verify again if the N.A. values are being replaced with 0:
colSums(is.na(cities[, c(15, 20:23, 29, 34:37, 39, 41, 42)]))
## IBGE_15-59 IDHM IDHM_Renda IDHM_Longevidade
## 0 0 0 0
## IDHM_Educacao AREA GVA_AGROPEC GVA_INDUSTRY
## 0 0 0 0
## GVA_SERVICES GVA_PUBLIC TAXES POP_GDP
## 0 0 0 0
## GDP_CAPITA
## 0
Creating a new data frame for the independent, dependent variables for study later, including LONG and LAT for geographical analysis:
cities_variable <- cities[, c(15, 20:25, 29, 34:37, 39, 41, 42)]
# municipal <- read_municipality(year=2016) -> Error in running as the package "geobr" has expired.
municipal <- st_read(dsn="data/geospatial", layer = "muni_sf")
Check CRS of Municipal:
st_crs (municipal)
## Coordinate Reference System:
## User input: 4674
## wkt:
## GEOGCS["SIRGAS 2000",
## DATUM["Sistema_de_Referencia_Geocentrico_para_las_AmericaS_2000",
## SPHEROID["GRS 1980",6378137,298.257222101,
## AUTHORITY["EPSG","7019"]],
## TOWGS84[0,0,0,0,0,0,0],
## AUTHORITY["EPSG","6674"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4674"]]
Currently, the data frame is aspatial, thus it is necessary to convert it to a sf object. The code chunk below converts cities data frame into a simple feature data frame by using st_as_sf() of sf packages.
cities.sf <- st_as_sf(cities_variable, coords = c("LONG", "LAT"), crs=4326) %>% st_transform(crs=4674)
head(cities.sf)
## Simple feature collection with 6 features and 13 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -49.44055 ymin: -19.15585 xmax: -39.04755 ymax: -1.72347
## CRS: EPSG:4674
## # A tibble: 6 x 14
## `IBGE_15-59` IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao AREA GVA_AGROPEC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3542 0.708 0.687 0.83 0.622 147. 6.2
## 2 2709 0.69 0.693 0.839 0.563 881. 50525.
## 3 6896 0.69 0.671 0.841 0.579 1 0
## 4 11979 0.698 0.72 0.848 0.556 1 0
## 5 53516 0.628 0.579 0.798 0.537 1 0
## 6 2631 0.628 0.54 0.748 0.612 180. 4435.
## # … with 7 more variables: GVA_INDUSTRY <dbl>, GVA_SERVICES <dbl>,
## # GVA_PUBLIC <dbl>, TAXES <dbl>, POP_GDP <dbl>, GDP_CAPITA <dbl>,
## # geometry <POINT [°]>
Join both the municipal and cities.sf datasets together:
brazil_cities <- st_join(municipal, cities.sf, join=st_contains)
## although coordinates are longitude/latitude, st_contains assumes that they are planar
economyactive <- ggplot(data=cities.sf,
aes(x= `IBGE_15-59`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
human_dev_index <- ggplot(data=cities.sf,
aes(x= `IDHM`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gni_index <- ggplot(data=cities.sf,
aes(x= `IDHM_Renda`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
life_expectancy <- ggplot(data=cities.sf,
aes(x= `IDHM_Longevidade`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
education_index <- ggplot(data=cities.sf,
aes(x= `IDHM_Educacao`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
city_area <- ggplot(data=cities.sf,
aes(x= `AREA`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gav_agropecuary <- ggplot(data=cities.sf,
aes(x= `GVA_AGROPEC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gav_industry <- ggplot(data=cities.sf,
aes(x= `GVA_INDUSTRY`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gav_services <- ggplot(data=cities.sf,
aes(x= `GVA_SERVICES`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gav_public<- ggplot(data=cities.sf,
aes(x= `GVA_PUBLIC`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
taxes <- ggplot(data=cities.sf,
aes(x= `TAXES`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
population <- ggplot(data=cities.sf,
aes(x= `POP_GDP`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
gdp_capita <- ggplot(data=cities.sf,
aes(x= `GDP_CAPITA`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
ggarrange(economyactive, human_dev_index, gni_index, life_expectancy, education_index, city_area, gav_agropecuary, gav_industry, gav_services, gav_public, taxes, population, gdp_capita, ncol = 3, nrow = 2)
## $`1`
##
## $`2`
##
## $`3`
##
## attr(,"class")
## [1] "list" "ggarrange"
Based on the above plots, we can tell that the value range between the different variables vary quite significantly, for instance, AREA size in 100 range and Education Index in 0.1 range. Therefore, a standardisation will be necessary on the following selected variables to ensure that data accuracy is obtained:
cities_stan <- as.data.frame(brazil_cities[, c(5, 10:18)])
brazil_cities.std <- normalize(cities_stan)
summary(brazil_cities.std)
## IBGE_15-59 AREA GVA_AGROPEC GVA_INDUSTRY
## Min. :0.0000000 Min. :0.00000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.0002448 1st Qu.:0.02501 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.0005433 Median :0.20159 Median :0.009347 Median :0.0001618
## Mean :0.0025765 Mean :0.26605 Mean :0.035082 Mean :0.0073698
## 3rd Qu.:0.0013588 3rd Qu.:0.41106 3rd Qu.:0.043115 3rd Qu.:0.0011137
## Max. :1.0000000 Max. :1.00000 Max. :1.000000 Max. :1.0000000
## NA's :2 NA's :2 NA's :2 NA's :2
## GVA_SERVICES GVA_PUBLIC TAXES POP_GDP
## Min. :0.0000000 Min. :0.000000 Min. :0.0000000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.0009052 1st Qu.:0.000000
## Median :0.0002356 Median :0.001817 Median :0.0010144 Median :0.001951
## Mean :0.0050727 Mean :0.007349 Mean :0.0052787 Mean :0.007490
## 3rd Qu.:0.0010696 3rd Qu.:0.004520 3rd Qu.:0.0016128 3rd Qu.:0.005111
## Max. :1.0000000 Max. :1.000000 Max. :1.0000000 Max. :1.000000
## NA's :2 NA's :2 NA's :2 NA's :2
## GDP_CAPITA geometry
## Min. :0.00000 MULTIPOLYGON :5575
## 1st Qu.:0.00000 epsg:4674 : 0
## Median :0.03329 +proj=long...: 0
## Mean :0.04978
## 3rd Qu.:0.06982
## Max. :1.00000
## NA's :2
After standardizing the required variables that are different in the range values, it is necessary to add the variables that are standardized with the ones that do not require to be standardized for further analysis:
Keeping the columns needed in cities.sf for merge purpose with the standardized variables:
new_cities = brazil_cities[c(6:9)]
To ensure that I am able to join the two data, I have to check the crs for brazil_cities.std first:
st_crs(brazil_cities.std)
## Coordinate Reference System: NA
Set the CRS of the brazil_cities.std to be the same as the cities.sf in order to merge them together:
brazil_cities.std <- st_as_sf(brazil_cities.std) %>% st_set_crs(4674)
Verify the newly changed CRS of brazil_cities.std:
st_crs(brazil_cities.std)
## Coordinate Reference System:
## User input: EPSG:4674
## wkt:
## GEOGCS["SIRGAS 2000",
## DATUM["Sistema_de_Referencia_Geocentrico_para_las_AmericaS_2000",
## SPHEROID["GRS 1980",6378137,298.257222101,
## AUTHORITY["EPSG","7019"]],
## TOWGS84[0,0,0,0,0,0,0],
## AUTHORITY["EPSG","6674"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4674"]]
Create a new variable “cities_final” as a newly merged data for the standardized variables and the initial ones that need not be standardized:
cities_final <- st_join(new_cities, brazil_cities.std, join=st_contains)
Once the variable is created, check for NA values in the merge table:
colSums(is.na(cities_final))
## IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao
## 2 2 2 2
## IBGE_15-59 AREA GVA_AGROPEC GVA_INDUSTRY
## 2 2 2 2
## GVA_SERVICES GVA_PUBLIC TAXES POP_GDP
## 2 2 2 2
## GDP_CAPITA geometry
## 2 0
Since there are N.A. values, it is necessary to replace them with the value 0 instead:
cities_final[is.na(cities_final)] <- 0
colSums(is.na(cities_final))
## IDHM IDHM_Renda IDHM_Longevidade IDHM_Educacao
## 0 0 0 0
## IBGE_15-59 AREA GVA_AGROPEC GVA_INDUSTRY
## 0 0 0 0
## GVA_SERVICES GVA_PUBLIC TAXES POP_GDP
## 0 0 0 0
## GDP_CAPITA geometry
## 0 0
Since the cities_final has more than 3000 observations, as understood from the cran package information online, style = “fisher” is used for more than 3000 observations and this style should thus be preferred to “jenks” as it uses the original Fortran code and runs nested for-loops much faster.
tm_shape(cities_final) +
tm_fill(col = "GDP_CAPITA",
n = 5,
style = "fisher",
title = "GDP Per capita") +
tm_borders(alpha = 0.5)
Analysis: As seen from the choropleth map above, the municipals with the higher GDP per capita are along the southern part of Brazil, bordering Argentina, Paraguay and Uruguay. An assumption for this could be due to the fact that the area is near to the other borders, which could make it a tourist-spot for many foreigners, thus driving higher GDP per capita as the tourism industry prosper.
Drop the geometry column in order to compute the correlation plot:
cities_without_geometry <- st_set_geometry(cities_final, NULL)
Rename the column name for IBGE_15-59 between 15-59:
names(cities_without_geometry)[names(cities_without_geometry) == "IBGE_15-59"] <- "active_economy"
corrplot(cor(cities_without_geometry[,1:12]), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")
From the scatterplot matrix above, correlation values (>0.80) should be eliminated, and the following variables are highly correlated to some of the other variables:
The newly updated correlation plot can be seen below:
corrplot(cor(cities_without_geometry[, c(3:8,10,11)]), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")
cities.mlr <- lm(formula = GDP_CAPITA ~ IDHM_Longevidade + IDHM_Educacao + active_economy + AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data=cities_without_geometry)
summary(cities.mlr)
##
## Call:
## lm(formula = GDP_CAPITA ~ IDHM_Longevidade + IDHM_Educacao +
## active_economy + AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC +
## TAXES, data = cities_without_geometry)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39563 -0.02232 -0.00614 0.01094 0.82134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.094176 0.009848 -9.563 < 2e-16 ***
## IDHM_Longevidade 0.018348 0.015847 1.158 0.247
## IDHM_Educacao 0.199987 0.009903 20.195 < 2e-16 ***
## active_economy -0.316642 0.041980 -7.543 5.35e-14 ***
## AREA 0.032287 0.002865 11.269 < 2e-16 ***
## GVA_AGROPEC 0.204075 0.011480 17.777 < 2e-16 ***
## GVA_INDUSTRY 0.550390 0.022013 25.004 < 2e-16 ***
## GVA_PUBLIC -0.421944 0.028473 -14.819 < 2e-16 ***
## TAXES 0.344407 0.027198 12.663 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05024 on 5572 degrees of freedom
## Multiple R-squared: 0.3833, Adjusted R-squared: 0.3824
## F-statistic: 432.9 on 8 and 5572 DF, p-value: < 2.2e-16
With reference to the report above, it is clear that not all the independent variables are statistically significant. Therefore, we will revise the model by removing the variable (“IDHM_Longevidade”) which is not statistically significant:
cities.mlr1 <- lm(formula = GDP_CAPITA ~IDHM_Educacao + active_economy + AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data=cities_without_geometry)
ols_regress(cities.mlr1)
## Model Summary
## ---------------------------------------------------------------
## R 0.619 RMSE 0.050
## R-Squared 0.383 Coef. Var 101.009
## Adj. R-Squared 0.382 MSE 0.003
## Pred R-Squared 0.344 MAE 0.027
## ---------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------
## Regression 8.737 7 1.248 494.474 0.0000
## Residual 14.067 5573 0.003
## Total 22.803 5580
## ----------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------
## (Intercept) -0.084 0.004 -20.315 0.000 -0.092 -0.076
## IDHM_Educacao 0.208 0.007 0.315 28.505 0.000 0.193 0.222
## active_economy -0.318 0.042 -0.088 -7.566 0.000 -0.400 -0.235
## AREA 0.032 0.003 0.130 11.246 0.000 0.027 0.038
## GVA_AGROPEC 0.205 0.011 0.210 17.836 0.000 0.182 0.227
## GVA_INDUSTRY 0.550 0.022 0.367 25.003 0.000 0.507 0.594
## GVA_PUBLIC -0.423 0.028 -0.232 -14.852 0.000 -0.479 -0.367
## TAXES 0.344 0.027 0.172 12.654 0.000 0.291 0.397
## --------------------------------------------------------------------------------------------
summary(cities.mlr1)
##
## Call:
## lm(formula = GDP_CAPITA ~ IDHM_Educacao + active_economy + AREA +
## GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data = cities_without_geometry)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39539 -0.02239 -0.00610 0.01105 0.82135
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.083823 0.004126 -20.315 < 2e-16 ***
## IDHM_Educacao 0.207750 0.007288 28.505 < 2e-16 ***
## active_economy -0.317559 0.041974 -7.566 4.49e-14 ***
## AREA 0.032216 0.002865 11.246 < 2e-16 ***
## GVA_AGROPEC 0.204601 0.011471 17.836 < 2e-16 ***
## GVA_INDUSTRY 0.550400 0.022013 25.003 < 2e-16 ***
## GVA_PUBLIC -0.422754 0.028465 -14.852 < 2e-16 ***
## TAXES 0.344176 0.027198 12.654 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05024 on 5573 degrees of freedom
## Multiple R-squared: 0.3831, Adjusted R-squared: 0.3824
## F-statistic: 494.5 on 7 and 5573 DF, p-value: < 2.2e-16
Analysis: With the p-value of < 2.2e-16, it is evident it is smaller than the alpha value of 0.05, thus, we will reject the null hypothesis that the rate of change of dependent variable can be explained with mean. With the result, we can infer that the model constructed depicts a better explanatory factor of the y value.
ols_vif_tol(cities.mlr1)
## Variables Tolerance VIF
## 1 IDHM_Educacao 0.9044930 1.105592
## 2 active_economy 0.8182043 1.222189
## 3 AREA 0.8285247 1.206965
## 4 GVA_AGROPEC 0.8003989 1.249377
## 5 GVA_INDUSTRY 0.5144873 1.943683
## 6 GVA_PUBLIC 0.4545631 2.199915
## 7 TAXES 0.6010482 1.663760
Analysis: From the results above, the VIF of the independent variables is less than 10. Therefore, it is safe to conclude that there are no sign of multicollinearity among the independent variables.**
In multiple linear regression, it is important for us to test the assumption that linearity and additivity of the relationship between dependent and independent variables:
ols_plot_resid_fit(cities.mlr1)
Analysis: From the figure above, it reveals that the data points are mainly scattered around the 0 line, thus, we can safely conclude that the relationships between the variables (dependent and independent) are linear.
ols_plot_resid_hist(cities.mlr1)
The figure above has shown that the residual of the multiple linear regression model (cities.mlr1) does resemble normal distribution.
First, export the residual of the hedonic pricing model and save it in a dataframe:
mlr.output <- as.data.frame(cities.mlr1$residuals)
Second, join the newly created data frame with cities_final object:
cities_final.res.sf <- cbind(cities_final, cities.mlr1$residuals) %>%
dplyr::rename(`MLR_RES` = `cities.mlr1.residuals`)
Third, convert the cities_final.res.sf simple feature object into a SpatialPointsDataFrame because spdep package can only process sp conformed spatial data objects:
cities_final.sp <- as_Spatial(cities_final.res.sf)
cities_final.sp
## class : SpatialPolygonsDataFrame
## features : 5581
## extent : -73.99045, -28.83594, -33.75118, 5.271841 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## variables : 14
## names : IDHM, IDHM_Renda, IDHM_Longevidade, IDHM_Educacao, IBGE_15.59, AREA, GVA_AGROPEC, GVA_INDUSTRY, GVA_SERVICES, GVA_PUBLIC, TAXES, POP_GDP, GDP_CAPITA, MLR_RES
## min values : 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.395386037884408
## max values : 0.862, 0.891, 0.894, 0.825, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.821346601818066
Lastly, use the tmap package to display the distribution of the residuals on an interactive map:
tm_shape(municipal)+
tm_polygons(alpha = 0.4) +
tm_shape(cities_final.res.sf) +
tm_fill(col = "MLR_RES",
alpha = 0.6,
style="quantile")
tmap_mode("plot")
Analysis: The plot above revealed that there is a sign of spatial autocorrelation. However, to further support that the sign is true, we will perform the following Moran’s I test to validate.
Before performing the Moran’s I test, we have to first determine the cut-off distance:
coords <- coordinates(cities_final.sp)
k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1, coords, longlat=TRUE))
summary(k1dists)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.71 15.86 21.32 23.96 369.66
With the given Max = 369.66, we can set the cut-off as 370 for the following analysis.
Moran’s I test has to be performed:
nb <- dnearneigh(coords, 0, 370, longlat = TRUE)
#summary(nb) #Checks for the overall information for nb
nb_lw <- nb2listw(nb, style = 'W')
#summary(nb_lw)
lm.morantest(cities.mlr1, nb_lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = GDP_CAPITA ~ IDHM_Educacao + active_economy + AREA
## + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data =
## cities_without_geometry)
## weights: nb_lw
##
## Moran I statistic standard deviate = 72.048, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 8.308517e-02 -2.939627e-04 1.339280e-06
Analysis: The Global Moran’s I test for residual spatial autocorrelation shows that its p-value is at 2.2e-16, which is less than the alpha value set beforehand at 0.05. With the results, we will thus reject the null hypothesis that the residuals are randomly distributed. Since the Observed Global Moran I is at 0.08308517, which is greater than 0, we can infer that the residuals resemble cluster distribution.
bw.fixed <- bw.gwr(formula = GDP_CAPITA ~ IDHM_Educacao + IBGE_15.59 + AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES , data=cities_final.sp, approach="CV", kernel="gaussian", adaptive=FALSE, longlat=FALSE)
## Take a cup of tea and have a break, it will take a few minutes.
## -----A kind suggestion from GWmodel development group
## Fixed bandwidth: 34.70094 CV score: 14.86153
## Fixed bandwidth: 21.45065 CV score: 14.71739
## Fixed bandwidth: 13.26152 CV score: 14.39483
## Fixed bandwidth: 8.200357 CV score: 13.85115
## Fixed bandwidth: 5.072388 CV score: 13.42549
## Fixed bandwidth: 3.139197 CV score: 13.26919
## Fixed bandwidth: 1.944419 CV score: 16.13392
## Fixed bandwidth: 3.87761 CV score: 13.29446
## Fixed bandwidth: 2.682833 CV score: 13.48077
## Fixed bandwidth: 3.421246 CV score: 13.25667
## Fixed bandwidth: 3.595562 CV score: 13.26687
## Fixed bandwidth: 3.313513 CV score: 13.25578
## Fixed bandwidth: 3.24693 CV score: 13.25831
## Fixed bandwidth: 3.354663 CV score: 13.25547
## Fixed bandwidth: 3.380096 CV score: 13.2557
## Fixed bandwidth: 3.338945 CV score: 13.25549
Analysis: From the above results, we can tell that the recommended bandwidth is 3.338945 metres. Hence, we will then calculate the local regression using all the data points within this recommended bandwidth.
Calibrate the gwr model using fixed bandwidth and gaussian kernel:
gwr.fixed <- gwr.basic(formula = GDP_CAPITA ~ IDHM_Educacao + IBGE_15.59 + AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES , data=cities_final.sp, bw=bw.fixed, kernel = 'gaussian', longlat = FALSE)
Next, display the model output:
gwr.fixed
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2020-05-31 16:51:57
## Call:
## gwr.basic(formula = GDP_CAPITA ~ IDHM_Educacao + IBGE_15.59 +
## AREA + GVA_AGROPEC + GVA_INDUSTRY + GVA_PUBLIC + TAXES, data = cities_final.sp,
## bw = bw.fixed, kernel = "gaussian", longlat = FALSE)
##
## Dependent (y) variable: GDP_CAPITA
## Independent variables: IDHM_Educacao IBGE_15.59 AREA GVA_AGROPEC GVA_INDUSTRY GVA_PUBLIC TAXES
## Number of data points: 5581
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39539 -0.02239 -0.00610 0.01105 0.82135
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.083823 0.004126 -20.315 < 2e-16 ***
## IDHM_Educacao 0.207750 0.007288 28.505 < 2e-16 ***
## IBGE_15.59 -0.317559 0.041974 -7.566 4.49e-14 ***
## AREA 0.032216 0.002865 11.246 < 2e-16 ***
## GVA_AGROPEC 0.204601 0.011471 17.836 < 2e-16 ***
## GVA_INDUSTRY 0.550400 0.022013 25.003 < 2e-16 ***
## GVA_PUBLIC -0.422754 0.028465 -14.852 < 2e-16 ***
## TAXES 0.344176 0.027198 12.654 < 2e-16 ***
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.05024 on 5573 degrees of freedom
## Multiple R-squared: 0.3831
## Adjusted R-squared: 0.3824
## F-statistic: 494.5 on 7 and 5573 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 14.06651
## Sigma(hat): 0.05021286
## AIC: -17536.75
## AICc: -17536.72
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: gaussian
## Fixed bandwidth: 3.354663
## Regression points: the same locations as observations are used.
## Distance metric: Euclidean distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median 3rd Qu. Max.
## Intercept -0.1085416 -0.0696238 -0.0405163 -0.0188119 0.0025
## IDHM_Educacao -0.0016779 0.0623688 0.1693377 0.1951273 0.2436
## IBGE_15.59 -3.6570485 -1.2219962 -0.3738440 -0.2340178 0.0038
## AREA -0.0094569 0.0185517 0.0265992 0.0339449 0.0788
## GVA_AGROPEC -0.3460700 0.1115879 0.1325180 0.1795846 1.1304
## GVA_INDUSTRY -53.7690819 0.5041928 0.6028873 1.0037581 1.4931
## GVA_PUBLIC -0.8559779 -0.6080538 -0.3613010 -0.1912521 19.6113
## TAXES -2.0113579 0.2582793 0.3860744 0.4419321 32.3941
## ************************Diagnostic information*************************
## Number of data points: 5581
## Effective number of parameters (2trace(S) - trace(S'S)): 87.27933
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5493.721
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -18561.59
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -18630.11
## Residual sum of squares: 11.46709
## R-square value: 0.4971245
## Adjusted R-square value: 0.4891338
##
## ***********************************************************************
## Program stops at: 2020-05-31 16:52:20
Analysis: With the results above, we can observe that the adjusted r-square value is at 0.4891338. This value is used to compare to see if the local regression model is better than the global regression model. As the adjusted R-square value for the local regression model is higher than that of global regression model, at 0.4891338 and 0.3824 respectively, it is safe to conclude that the local regression model is better than the global regression model.
Hence, we then visualise the GWR output in addition to the regression residuals.
cities_nogeom.fixed <- st_as_sf(gwr.fixed$SDF) %>%
st_transform(crs=4674)
cities_nogeom.fixed.svy21 <- st_transform(cities_nogeom.fixed, 4674)
gwr.fixed.output <- as.data.frame(gwr.fixed$SDF)
cities_nogeom.fixed <- cbind(cities_final.res.sf, as.matrix(gwr.fixed.output))
#cities_nogeom.fixed
Outputs of the geographically weighted regression model:
library(RColorBrewer)
tm_shape(municipal) +
tm_polygons() +
tm_shape(cities_nogeom.fixed) +
tm_polygons(col="Local_R2",
palette = "seq",
border.col="gray80",
border.lwd=1)
Analysis: The local R2 values range between 0.0 to 1.0 and they indicate how well the local regression model fits observed y-values, which are the factors affecting the GDP per capita. Very low value for the local R2 indicates that the local model is performing poorly. Hence, those with higher R squared values are of better prediction than those that are lower. From the plot above, we can find that the local R2 values range from 0.3 to 1.0.
As seen from the above plot, we can tell that the strongest intensity in orange tone (indicating higher local R2) happens in the North-West region of Brazil. Thus, we can infer that the region has areas that show that the factors selected in the analysis (y-values) fit the local regression model very well as compared to other regions. This illustrates that the relationships between the factors affecting GDP per capita and GDP per capita were much better captured by the regression model in these areas.
As for the remaining regions, we can also observe that the intensity of the colour fades from the North-Eastern region down to the Southern region of Brazil. The intensity of the area colour stays pretty constant throughout the rest of the map, with similar tone intensity belonging to the same area. However, in these areas, it illustrates that the relationships between the factors affecting GDP per capita and GDP per capita were not captured as well by the regression model in these areas.
tm_shape(municipal) +
tm_polygons() +
tm_shape(cities_nogeom.fixed) +
tm_fill(col="yhat",
border.col="gray60",
border.lwd=1)
Analysis: The Y hat values represent the predicted equation for a line of best fit in regression. It is used to differentiate between the predicted data and the observed data y.
As observed from above, most of the regions are at a value range of 0.0 to 0.2, while a sparse distribution of values ranging from -0.2 to 0.0 can be found in the North to West Region of Brazil.
tm_shape(municipal) +
tm_polygons() +
tm_shape(cities_nogeom.fixed) +
tm_fill(col="Intercept_SE",
border.col="gray60",
border.lwd=1)
Analysis: These values measure the reliability of each coefficient estimate. Confidence in those estimates are higher when standard errors are small in relation to the actual coefficient values. Large standard errors may indicate problems with local collinearity.
The coefficient standard error is higher at the Western to North-west region of Brazil, while the rest remains pretty constant throughout the other regions. Hence, we can assume that the Western to North-west region of Brazil may indicate problems with local collinearity while the other regions do not. Further tests should be carried out to validate this observation above.