Brazil is the fifth-largest country by area and the sixth-largest country by population. Although it has the ninth largest GDP by nominal measures, it is a highly unequal country.
A main cause of Brazil’s unequal economic growth is high transportation costs due to its large hinterland. The country Brazil may seem little known to us. We may have heard of the Amazon Rainforest which is the world’s largest tropical rainforest. It stretches from the northeastern region of Brazil to some of the southern countries in Latin America. Despite the lush rainforest, the poor soil quality prevented locals from settling there. As the transportation system was first built in the more accessible south, economic development has been more prosperous there compared to the north.
In this analysis, we will be studying the factors affecting GDP per capita using a multiple linear regression model and a geographically weighted regression for a more local analysis.
The dependent variable in this analysis is Brazil GDP per capita in 2016(GDP_CAPITA). This data is retrieved from the Kaggle dataset Brazilian Cities.
There are 13 independent variables used in this analysis.
IDHM_Renda is the Human Development Index (HDI) GNI Index 2010.IDHM_Longevidade is the HDI Life Expectancy Index 2010.IDHM_Educacao is the HDI Education Index 2010.AREA is the municipality area in square kilometer.RURAL_URBAN is the rural or urban typology.GVA_TOTAL is the total gross value-added (GVA) 2016.GVA_MAIN is the activity with the higher GVA contribution 2016.RES_POP_PROP_15-59 is the proportion of resident population who are economically active 2010.RES_POP_PROP_ESTR is the proportion of resident population who are foreigners 2010.DU_PROP_RURAL is the proportion of domestic units who are rural 2010.CROPPX_PER_HEC is the crop production price per hectare 2017.GVA_PROP_AGROPEC is the proportion of GVA contribution from Agropecuary 2016.GVA_PROP_INDUSTRY is the proportion of GVA contribution from Industry 2016.A more detailed explanation of the variables is done below.
We will be building two models to understand Brazil’s GDP per capita:
The MLR enables us to simply understand the significant factors that contribute to our target response variable, GDP per capita. However, due to its generalization, it is unable to fit every observation as good as possible. The GWR model covers this shortfall.
The GWR is an extension of the ordinary least square approach. It embeds the geographic location of observations into the model to produce a regression model for each observation. In particular, it takes into account the geographical proximity of observations. According to Tobler’s first law of Geography, nearer things are more alike than distant things. Therefore, a larger weight is placed on the neighbouring observations in calculating the GWR model.
Data required:
2.1 Importing required libraries
tidyverse for data cleaningsf to import spatial datatmap to visualise spatial dataggplot to visualise datacorrplot to identify correlationGWModel to build GWR modelolsrr to check linearity assumptionsspdep to calculate spatial autocorrelation statisticsTwo new packages are used:
geobr to get Brazil spatial boundaryHmisc to deal with missing valuespackages = c('olsrr', 'corrplot', 'ggpubr', 'sf', 'spdep', 'GWmodel', 'tmap', 'tidyverse', 'geobr', 'Hmisc')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
We use read_delim instead of read_csv since the separator used is “;” and not comma.
The term “city” and “municipality” refers to the same entity.
brazil_cities <- read_delim('data/aspatial/BRAZIL_CITIES.csv', delim=';')
Dimensions of dataframe:
dim(brazil_cities)
## [1] 5573 81
Data structure of dataframe columns:
glimpse(brazil_cities)
## Rows: 5,573
## Columns: 81
## $ CITY <chr> "Abadia De Goiás", "Abadia Dos Dourados", ...
## $ STATE <chr> "GO", "MG", "GO", "MG", "PA", "CE", "BA", ...
## $ CAPITAL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ IBGE_RES_POP <dbl> 6876, 6704, 15757, 22690, 141100, 10496, 8...
## $ IBGE_RES_POP_BRAS <dbl> 6876, 6704, 15609, 22690, 141040, 10496, 8...
## $ IBGE_RES_POP_ESTR <dbl> 0, 0, 148, 0, 60, 0, 0, 0, 0, 0, 0, 16, 17...
## $ IBGE_DU <dbl> 2137, 2328, 4655, 7694, 31061, 2791, 2572,...
## $ IBGE_DU_URBAN <dbl> 1546, 1481, 3233, 6667, 19057, 1251, 1193,...
## $ IBGE_DU_RURAL <dbl> 591, 847, 1422, 1027, 12004, 1540, 1379, 1...
## $ IBGE_POP <dbl> 5300, 4154, 10656, 18464, 82956, 4538, 372...
## $ IBGE_1 <dbl> 69, 38, 139, 176, 1354, 98, 37, 167, 69, 1...
## $ `IBGE_1-4` <dbl> 318, 207, 650, 856, 5567, 323, 156, 733, 3...
## $ `IBGE_5-9` <dbl> 438, 260, 894, 1233, 7618, 421, 263, 978, ...
## $ `IBGE_10-14` <dbl> 517, 351, 1087, 1539, 8905, 483, 277, 927,...
## $ `IBGE_15-59` <dbl> 3542, 2709, 6896, 11979, 53516, 2631, 2319...
## $ `IBGE_60+` <dbl> 416, 589, 990, 2681, 5996, 582, 673, 803, ...
## $ IBGE_PLANTED_AREA <dbl> 319, 4479, 10307, 1862, 25200, 2598, 895, ...
## $ `IBGE_CROP_PRODUCTION_$` <dbl> 1843, 18017, 33085, 7502, 700872, 5234, 39...
## $ `IDHM Ranking 2010` <dbl> 1689, 2207, 2202, 1994, 3530, 3522, 4086, ...
## $ IDHM <dbl> 0.708, 0.690, 0.690, 0.698, 0.628, 0.628, ...
## $ IDHM_Renda <dbl> 0.687, 0.693, 0.671, 0.720, 0.579, 0.540, ...
## $ IDHM_Longevidade <dbl> 0.830, 0.839, 0.841, 0.848, 0.798, 0.748, ...
## $ IDHM_Educacao <dbl> 0.622, 0.563, 0.579, 0.556, 0.537, 0.612, ...
## $ LONG <dbl> -49.44055, -47.39683, -48.71881, -45.44619...
## $ LAT <dbl> -16.758812, -18.487565, -16.182672, -19.15...
## $ ALT <dbl> 893.60, 753.12, 1017.55, 644.74, 10.12, 40...
## $ PAY_TV <dbl> 360, 77, 227, 1230, 3389, 29, 952, 51, 55,...
## $ FIXED_PHONES <dbl> 842, 296, 720, 1716, 1218, 34, 335, 222, 3...
## $ AREA <dbl> 147.26, 881.06, 1045.13, 1817.07, 1610.65,...
## $ REGIAO_TUR <chr> NA, "Caminhos Do Cerrado", "Região Turísti...
## $ CATEGORIA_TUR <chr> NA, "D", "C", "D", "D", NA, "D", NA, NA, "...
## $ ESTIMATED_POP <dbl> 8583, 6972, 19614, 23223, 156292, 11663, 8...
## $ RURAL_URBAN <chr> "Urbano", "Rural Adjacente", "Rural Adjace...
## $ GVA_AGROPEC <dbl> 6.20, 50524.57, 42.84, 113824.60, 140463.7...
## $ GVA_INDUSTRY <dbl> 27991.25, 25917.70, 16728.30, 31002.62, 58...
## $ GVA_SERVICES <dbl> 74750.32, 62689.23, 138198.58, 172.33, 468...
## $ GVA_PUBLIC <dbl> 36915.04, 28083.79, 63396.20, 86081.41, 48...
## $ ` GVA_TOTAL ` <dbl> 145857.60, 167215.28, 261161.91, 403241.27...
## $ TAXES <dbl> 20554.20, 12873.50, 26822.58, 26994.09, 95...
## $ GDP <dbl> 166.41, 180.09, 287984.49, 430235.36, 1249...
## $ POP_GDP <dbl> 8053, 7037, 18427, 23574, 151934, 11483, 9...
## $ GDP_CAPITA <dbl> 20664.57, 25591.70, 15628.40, 18250.42, 82...
## $ GVA_MAIN <chr> "Demais serviços", "Demais serviços", "Dem...
## $ MUN_EXPENDIT <dbl> 28227691, 17909274, 37513019, NA, NA, NA, ...
## $ COMP_TOT <dbl> 284, 476, 288, 621, 931, 86, 191, 87, 285,...
## $ COMP_A <dbl> 5, 6, 5, 18, 4, 1, 6, 2, 5, 2, 0, 8, 3, 1,...
## $ COMP_B <dbl> 1, 6, 9, 1, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, ...
## $ COMP_C <dbl> 56, 30, 26, 40, 43, 4, 8, 3, 20, 4, 9, 40,...
## $ COMP_D <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
## $ COMP_E <dbl> 2, 2, 2, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 2, ...
## $ COMP_F <dbl> 29, 34, 7, 20, 27, 6, 4, 0, 10, 2, 0, 25, ...
## $ COMP_G <dbl> 110, 190, 117, 303, 500, 48, 97, 71, 133, ...
## $ COMP_H <dbl> 26, 70, 12, 62, 16, 2, 5, 0, 18, 8, 1, 67,...
## $ COMP_I <dbl> 4, 28, 57, 30, 31, 10, 5, 1, 14, 3, 0, 25,...
## $ COMP_J <dbl> 5, 11, 2, 9, 6, 2, 3, 1, 8, 1, 1, 9, 5, 14...
## $ COMP_K <dbl> 0, 0, 1, 6, 1, 0, 1, 0, 0, 1, 0, 4, 3, 3, ...
## $ COMP_L <dbl> 2, 4, 0, 4, 1, 0, 0, 0, 4, 0, 0, 7, 4, 4, ...
## $ COMP_M <dbl> 10, 15, 7, 28, 22, 2, 5, 0, 11, 4, 2, 26, ...
## $ COMP_N <dbl> 12, 29, 15, 27, 16, 3, 5, 1, 26, 0, 1, 16,...
## $ COMP_O <dbl> 4, 2, 3, 2, 2, 2, 2, 2, 2, 2, 6, 2, 4, 2, ...
## $ COMP_P <dbl> 6, 9, 11, 15, 155, 0, 8, 0, 8, 1, 6, 14, 1...
## $ COMP_Q <dbl> 6, 14, 5, 19, 33, 2, 1, 2, 9, 3, 0, 13, 22...
## $ COMP_R <dbl> 1, 6, 1, 9, 15, 0, 2, 0, 4, 0, 0, 4, 6, 6,...
## $ COMP_S <dbl> 5, 19, 8, 27, 56, 4, 38, 4, 12, 3, 4, 23, ...
## $ COMP_T <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COMP_U <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ HOTELS <dbl> NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, ...
## $ BEDS <dbl> NA, NA, 34, NA, NA, NA, 24, NA, NA, NA, NA...
## $ Pr_Agencies <dbl> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0,...
## $ Pu_Agencies <dbl> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1,...
## $ Pr_Bank <dbl> NA, NA, 1, 2, 2, NA, NA, 1, 0, 0, 0, 1, 0,...
## $ Pu_Bank <dbl> NA, NA, 1, 2, 4, NA, NA, 0, 1, 1, 1, 2, 1,...
## $ Pr_Assets <dbl> NA, NA, 33724584, 44974716, 76181384, NA, ...
## $ Pu_Assets <dbl> NA, NA, 67091904, 371922572, 800078483, NA...
## $ Cars <dbl> 2158, 2227, 2838, 6928, 5277, 553, 896, 61...
## $ Motorcycles <dbl> 1246, 1142, 1426, 2953, 25661, 1674, 696, ...
## $ Wheeled_tractor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, ...
## $ UBER <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ MAC <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ `WAL-MART` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ POST_OFFICES <dbl> 1, 1, 3, 4, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, ...
We can tell that there are many missing values, especially in the last few columns. Therefore in our data wrangling process, we will need to deal with these variables.
The geobr() package provides spatial datasets for Brazil at different geographical level. Here, we are interested in the municipality level data so we will use read_municipality().
muni <- read_municipality(code_muni = 'all', year = 2016)
##
|
| | 0%
|
|=== | 4%
|
|===== | 7%
|
|======== | 11%
|
|========== | 15%
|
|============= | 19%
|
|================ | 22%
|
|================== | 26%
|
|===================== | 30%
|
|======================= | 33%
|
|========================== | 37%
|
|============================= | 41%
|
|=============================== | 44%
|
|================================== | 48%
|
|==================================== | 52%
|
|======================================= | 56%
|
|========================================= | 59%
|
|============================================ | 63%
|
|=============================================== | 67%
|
|================================================= | 70%
|
|==================================================== | 74%
|
|====================================================== | 78%
|
|========================================================= | 81%
|
|============================================================ | 85%
|
|============================================================== | 89%
|
|================================================================= | 93%
|
|=================================================================== | 96%
|
|======================================================================| 100%
Class type:
class(muni)
## [1] "sf" "data.frame"
Dimensions of dataframe:
dim(muni)
## [1] 5572 5
Steps:
brazil_cities <- brazil_cities %>%
mutate_at(.vars = vars(CITY), .funs = toupper)
muni_clean <- muni %>%
mutate_at(.vars = vars(name_muni), .funs = funs(toupper))
muni_clean <- st_transform(muni_clean, 5641)
Since some municipalities have the same name (e.g. CACHOEIRINHA), we join by two columns - city name and state.
brazil_all <- inner_join(brazil_cities, muni_clean, by = c('CITY' = 'name_muni', 'STATE' = 'abbrev_state'))
class(brazil_all)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
For the purpose of this study, we need not take into consideration the year of the data source (i.e. we do not need to remove variables that were after 2016 although we are predicting GDP per capita at 2016). However, some variables do not contribute to explaining factors affecting GDP per capita.
Variables to remove and corresponding reason:
CITY, STATE, CAPITAL, code_state, code_muniLONG, LAT, ALTPAY_TV, FIXED_PHONES, HOTELS, BEDS, Pr_Agencies, Pu_Agencies, Pr_Bank, Pu_Bank, Pr_Assets, Pu_Assets, Cars, Motorcycles, Wheeled_tractor, UBER, MAC, WAL-MART, POST_OFFICESCOMP_A, COMP_B, COMP_C, COMP_D, COMP_E, COMP_F, COMP_G,COMP_H, COMP_I, COMP_J,COMP_K, COMP_L, COMP_M, COMP_N, COMP_O, COMP_P, COMP_Q, COMP_R, COMP_S, COMP_T, COMP_U, COMP_TOTESTIMATED_POPIBGE_1, IBGE_1-4, IBGE_5-9, IBGE_10-14, IBGE_60+TAXES, GDP, POP_GDPbrazil <- brazil_all %>%
subset(select=-c(`CITY`, `STATE`, `CAPITAL`, `code_state`, `code_muni`,
`LONG`, `LAT`, `ALT`, `PAY_TV`, `FIXED_PHONES`,
`HOTELS`, `BEDS`, `Pr_Agencies`, `Pu_Agencies`, `Pr_Bank`, `Pu_Bank`,
`Pr_Assets`, `Pu_Assets`, `Cars`, `Motorcycles`, `Wheeled_tractor`, `UBER`,
`MAC`, `WAL-MART`, `POST_OFFICES`, `COMP_A`, `COMP_B`, `COMP_C`, `COMP_D`,
`COMP_E`, `COMP_F`, `COMP_G`,`COMP_H`, `COMP_I`, `COMP_J`,`COMP_K`, `COMP_L`,
`COMP_M`, `COMP_N`, `COMP_O`, `COMP_P`, `COMP_Q`, `COMP_R`, `COMP_S`, `COMP_T`,
`COMP_U`, `COMP_TOT`, `ESTIMATED_POP`, `IBGE_1`, `IBGE_1-4`, `IBGE_5-9`,
`IBGE_10-14`, `IBGE_60+`, `TAXES`, `GDP`, `POP_GDP`))
Some variables cannot be used as it is since we want a comparison between the states. We can transform them to make it useful.
New variables to create:
RES_POP_PROP_15.59 (proportion of resident population who are 15-59 years old): this is the closest age group to the economically active population. Since only the economically active population contribute to the economic activities of a population, we can find the proportion of economically active population out of the whole population (i.e. IBGE_15-59 / IBGE_POP)RES_POP_PROP_ESTR (proportion of resident population who are Brazilians): in the early stage of development, foreign immigrants were an important source of productivity growth. Since different municipalities have different immigration policies, we can find the percentage of resident population who are foreigners (i.e. IBGE_RES_POP_ESTR / IBGE_RES_POP).IBGE_RES_POP_BRAS since its proportion is indirectly calculated by 1-RES_POP_PROP_ESTR.DU_PROP_RURAL (proportion of domestic units contribution by rural): domestic units are the number of households. Since rural and urban families can differ in size, we can find the percentage of domestic units who are rural (i.e. IBGE_DU_RURAL / IBGE_DU)IBGE_DU_URBAN since its proportion is indirectly calculated by 1-IBGE_DU_RURAL.CROPPX_PER_HEC (crop price per hectare of planted area): we can find the productivity of agriculture by IBGE_CROP_PRODUCTION_$ / IBGE_PLANTED_AREA.GVA_PROP_AGROPEC (proportion of Gross Added Value contribution by Agropecuary): in the early stages of early development, export of primary commodities was a key contribution to growth. Instead of using the absolute size of agropecuary which varies by size of municipality, we can find the percentage contribution of agropecuary to GAV (i.e. GVA_AGROPEC / GVA_TOTAL).GVA_PROP_INDUSTRY (proportion of GAV contribution by Industry): in the second stage of growth after agriculture, industrialisation was a key contribution to growth. The same rationale holds for finding the percentage contribution (i.e. GVA_INDUSTRY / GVA_TOTAL).GVA_SERVICES and GVA_PUBLIC since their proportion are indirectly calculated by 1-GVA_PROP_AGROPEC-GVA_PROP_INDUSTRY.brazil <- brazil %>%
mutate(`RES_POP_PROP_15.59` = `IBGE_15-59` / `IBGE_POP`) %>%
mutate(`RES_POP_PROP_ESTR` = `IBGE_RES_POP_ESTR` / `IBGE_RES_POP`) %>%
mutate(`DU_PROP_RURAL` = `IBGE_DU_RURAL` / `IBGE_DU`) %>%
mutate(`CROPPX_PER_HEC` = `IBGE_CROP_PRODUCTION_$` / `IBGE_PLANTED_AREA`) %>%
rename(`GVA_TOTAL` = ` GVA_TOTAL `) %>%
mutate(`GVA_PROP_AGROPEC` = `GVA_AGROPEC` / `GVA_TOTAL`) %>%
mutate(`GVA_PROP_INDUSTRY` = `GVA_INDUSTRY` / `GVA_TOTAL`) %>%
subset(select=-c(`IBGE_15-59`, `IBGE_POP`, `IBGE_RES_POP_BRAS`, `IBGE_RES_POP_ESTR`,
`IBGE_RES_POP`, `IBGE_DU_RURAL`, `IBGE_DU_URBAN`, `IBGE_DU`,
`GVA_AGROPEC`, `GVA_INDUSTRY`, `GVA_SERVICES`, `GVA_PUBLIC`,
`IBGE_CROP_PRODUCTION_$`, `IBGE_PLANTED_AREA`))
Column values for GVA_MAIN:
table(brazil$GVA_MAIN)
##
## Administração, defesa, educação e saúde públicas e seguridade social
## 2725
## Agricultura, inclusive apoio à agricultura e a pós colheita
## 735
## Comércio e reparação de veículos automotores e motocicletas
## 46
## Construção
## 7
## Demais serviços
## 1477
## Eletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação
## 98
## Indústrias de transformação
## 261
## Indústrias extrativas
## 35
## Pecuária, inclusive apoio à pecuária
## 161
## Produção florestal, pesca e aquicultura
## 25
Column values for RURAL_URBAN:
table(brazil$RURAL_URBAN)
##
## Intermediário Adjacente Intermediário Remoto Rural Adjacente
## 686 60 3040
## Rural Remoto Sem classificação Urbano
## 323 5 1456
brazil <- brazil %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Administração, defesa, educação e saúde públicas e seguridade social", "Public services")) %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Agricultura, inclusive apoio à agricultura e a pós colheita", "Agriculture")) %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Comércio e reparação de veículos automotores e motocicletas", "Automobile services")) %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Construção", "Construction")) %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Demais serviços", "Other services")) %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Eletricidade e gás, água, esgoto, atividades de gestão de resíduos e descontaminação", "Energy services")) %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Indústrias de transformação", "Manufacturing industries")) %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Indústrias extrativas", "Extractive industries")) %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Pecuária, inclusive apoio à pecuária", "Livestock")) %>%
mutate(GVA_MAIN=replace(GVA_MAIN, GVA_MAIN=="Produção florestal, pesca e aquicultura", "Fisheries")) %>%
mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Intermediário Adjacente", "Adjacent intermediate")) %>%
mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Intermediário Remoto", "Remote intermediate")) %>%
mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Rural Adjacente", "Adjacent rural")) %>%
mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Rural Remoto", "Remote rural")) %>%
mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Sem classificação", "Without classification")) %>%
mutate(RURAL_URBAN=replace(RURAL_URBAN, RURAL_URBAN=="Urbano", "Urban"))
Check dimensions of dataframe:
dim(brazil)
## [1] 5571 20
Since GDP per capita is our dependent variable of interest, we cannot have missing values.
sum(is.na(brazil$GDP_CAPITA))
## [1] 1
We will be dropping one row of observation.
brazil <- brazil %>%
filter(!is.na(`GDP_CAPITA`))
Confirm the number of missing values:
sum(is.na(brazil$GDP_CAPITA))
## [1] 0
The lm() function has an na.action which deals with missing values. Since the default action is na.omit, we may end up missing many observations. Therefore, we will approach it in a more systematic way.
Steps:
Counting the number of rows with missing values:
sapply(brazil, function(x) sum(is.na(x)))
## IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade
## 6 6 6 6
## IDHM_Educacao AREA REGIAO_TUR CATEGORIA_TUR
## 6 1 2286 2286
## RURAL_URBAN GVA_TOTAL GDP_CAPITA GVA_MAIN
## 0 0 0 0
## MUN_EXPENDIT geom RES_POP_PROP_15.59 RES_POP_PROP_ESTR
## 1491 0 5 5
## DU_PROP_RURAL CROPPX_PER_HEC GVA_PROP_AGROPEC GVA_PROP_INDUSTRY
## 78 67 0 0
Some 27%-41% of the observations in MUN_EXPENDIT, REGIAO_TUR and CATEGORIA_TUR columns are missing. Therefore, we will remove these variables.
brazil <- brazil %>%
subset(select=-c(`MUN_EXPENDIT`, `REGIAO_TUR`, `CATEGORIA_TUR`))
sapply(brazil, function(x) sum(is.na(x)))
## IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade
## 6 6 6 6
## IDHM_Educacao AREA RURAL_URBAN GVA_TOTAL
## 6 1 0 0
## GDP_CAPITA GVA_MAIN geom RES_POP_PROP_15.59
## 0 0 0 5
## RES_POP_PROP_ESTR DU_PROP_RURAL CROPPX_PER_HEC GVA_PROP_AGROPEC
## 5 78 67 0
## GVA_PROP_INDUSTRY
## 0
Use impute() function from Hmisc package to impute the mean/median/mode value of non-missing observations into the observations with missing values. The default impute is median. Since some of our variables are highly skewed, it is good to use the median over the mean.
Boxplots showing distribution of data with missing values:
p1 <- ggplot(brazil, aes(y=`IDHM Ranking 2010`)) +
geom_boxplot() + coord_flip()
p2 <- ggplot(brazil, aes(y=`IDHM`)) +
geom_boxplot() + coord_flip()
p3 <- ggplot(brazil, aes(y=`IDHM_Renda`)) +
geom_boxplot() + coord_flip()
p4 <- ggplot(brazil, aes(y=`IDHM_Longevidade`)) +
geom_boxplot() + coord_flip()
p5 <- ggplot(brazil, aes(y=`IDHM_Educacao`)) +
geom_boxplot() + coord_flip()
p6 <- ggplot(brazil, aes(y=`AREA`)) +
geom_boxplot() + coord_flip()
p7 <- ggplot(brazil, aes(y=`RES_POP_PROP_15.59`)) +
geom_boxplot() + coord_flip()
p8 <- ggplot(brazil, aes(y=`RES_POP_PROP_ESTR`)) +
geom_boxplot() + coord_flip()
p9 <- ggplot(brazil, aes(y=`DU_PROP_RURAL`)) +
geom_boxplot() + coord_flip()
p10 <- ggplot(brazil, aes(y=`CROPPX_PER_HEC`)) +
geom_boxplot() + coord_flip()
ggarrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10,
ncol = 5, nrow = 2)
Impute doubles type columns:
brazil[, !names(brazil) %in% c("RURAL_URBAN","GVA_TOTAL","GDP_CAPITA","GVA_MAIN","geom",
"GVA_PROP_AGROPEC", "GVA_PROP_INDUSTRY")] <- impute(brazil[, !names(brazil) %in% c("RURAL_URBAN","GVA_TOTAL","GDP_CAPITA","GVA_MAIN","geom",
"GVA_PROP_AGROPEC", "GVA_PROP_INDUSTRY")])
Check for missing values:
sapply(brazil, function(x) sum(is.na(x)))
## IDHM Ranking 2010 IDHM IDHM_Renda IDHM_Longevidade
## 0 0 0 0
## IDHM_Educacao AREA RURAL_URBAN GVA_TOTAL
## 0 0 0 0
## GDP_CAPITA GVA_MAIN geom RES_POP_PROP_15.59
## 0 0 0 0
## RES_POP_PROP_ESTR DU_PROP_RURAL CROPPX_PER_HEC GVA_PROP_AGROPEC
## 0 0 0 0
## GVA_PROP_INDUSTRY
## 0
No more missing values!
Check class type:
class(brazil)
## [1] "tbl_df" "tbl" "data.frame"
Convert to sf object:
brazil <- st_as_sf(brazil)
Check class type again:
class(brazil)
## [1] "sf" "tbl_df" "tbl" "data.frame"
Since GDP per capita is our dependent variable of interest, we will plota histogram and boxplot to understand its distribution.
p1 <- ggplot(data=brazil, aes(x= GDP_CAPITA)) +
geom_histogram(bins=20, color="black", fill="light blue")
p2 <- ggplot(brazil, aes(y=GDP_CAPITA)) +
geom_boxplot() +
coord_flip()
ggarrange(p1, p2,
ncol = 2, nrow = 1)
describe(brazil$GDP_CAPITA)
## brazil$GDP_CAPITA
## n missing distinct Info Mean Gmd .05 .10
## 5570 0 5567 1 21126 16863 6330 6967
## .25 .50 .75 .90 .95
## 9058 15870 26155 39681 50515
##
## lowest : 3190.57 4282.65 4530.24 4533.95 4585.72
## highest: 274572.12 289932.05 296459.35 306138.63 314637.69
GDP_CAPITA is highly skewed, with the median at R$15,870 and the municipal with highest GDP per capita at R$314,638.
The choropleth map helps us to understand the geographical distribution of our variable of interest, GDP per capita.
We use the quantile breaks, and n=4 to split the data into groups of bottom 25%, 25%-50%, 50%-75% and top 75%.
tm_shape(brazil)+
tm_polygons() +
tm_shape(brazil) +
tm_fill(col = "GDP_CAPITA",
alpha = 0.6,
n = 4,
style="quantile",
title = "GDP per capita")
From the choropleth map, we can see that the the richer municipalities are located in center and southern Brazil. The bottom 25% municipalities are mostly located in northeastern and northern Brazil.
This is similar to how the transporation networks are more densely connected in the south, enabling trade and thus economic growth. Understandably, the northern area where the Amazon Rainforest is located, economic growth and hence GDP per capita is low in that area. On the other hand, the northeastern region’s economy is mainly contributed by agriculture. The constant droughts has brought losses to producers, which has caused over 40% of its population to be trapped in poverty.
Correlation analysis helps us to identify highly correlated variables. These variables cause the problem of multicollinearity and will affect our modelling.
We will use corrplot.mixed() to identify variables with >=0.85 correlation coefficient.We will retain the variables that are least correlated with other variables. In addition, we will remove the non-numeric columns since correlation can only be done for numeric columns.
to_corr = brazil[, !names(brazil) %in% c("RURAL_URBAN","GVA_MAIN","geom")] %>%
st_set_geometry(NULL)
corrplot.mixed(cor(to_corr),
lower = "ellipse", upper = "number",
tl.pos = "lt", diag = "l",
tl.col = "black", number.cex=0.75)
IDHM Ranking 2010 is highly correlated with 3 variables:
IDHM (-0.99)IDHM_Renda (-0.94)IDHM_Educacao (-0.94)IDHM is highly correlated with 4 variables:
IDHM Ranking 2010 (-0.99)IDHM_Renda (0.95)IDHM_Longevidade (0.85)IDHM_Educacao (0.95)Therefore, between IDHM Ranking 2010 and IDHM, we will retain IDHM Ranking 2010.
IDHM Ranking 2010 and IDHM_Renda, we retain IDHM_Renda which is not highly correlated with any other variables.Checking correlation plot again:
to_corr = brazil[, !names(brazil) %in% c("RURAL_URBAN","GVA_MAIN","geom", "IDHM" ,
"IDHM Ranking 2010")] %>%
st_set_geometry(NULL)
corrplot.mixed(cor(to_corr),
lower = "ellipse", upper = "number",
tl.pos = "lt", diag = "l",
tl.col = "black", number.cex=0.75)
Remove two highly correlated variables from the brazil dataframe:
brazil <- brazil %>%
subset(select=-c(`IDHM`, `IDHM Ranking 2010`))
Dataframe dimensions:
dim(brazil)
## [1] 5570 15
15 variables left!
We want to know if GDP per capita can be explained by a muliple linear regression (MLR) model. The hypothesis is:
Confidence interval is 95%.
mlr <- lm(`GDP_CAPITA` ~ `IDHM_Renda` + `IDHM_Longevidade` + `IDHM_Educacao` + `AREA` +
`RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `RES_POP_PROP_ESTR` +
`DU_PROP_RURAL` + `CROPPX_PER_HEC` + `GVA_PROP_AGROPEC` + `GVA_PROP_INDUSTRY`,
data=brazil)
summary(mlr)
##
## Call:
## lm(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade + IDHM_Educacao +
## AREA + RURAL_URBAN + GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 +
## RES_POP_PROP_ESTR + DU_PROP_RURAL + CROPPX_PER_HEC + GVA_PROP_AGROPEC +
## GVA_PROP_INDUSTRY, data = brazil)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52931 -5069 -836 2663 252743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.392e+04 6.633e+03 -6.621 3.90e-11 ***
## IDHM_Renda 8.430e+04 6.306e+03 13.368 < 2e-16 ***
## IDHM_Longevidade -1.762e+04 8.226e+03 -2.141 0.03228 *
## IDHM_Educacao 9.406e+03 4.038e+03 2.330 0.01986 *
## AREA 6.210e-02 3.943e-02 1.575 0.11537
## RURAL_URBANAdjacent rural 1.609e+03 6.855e+02 2.347 0.01898 *
## RURAL_URBANRemote intermediate 5.566e+03 2.130e+03 2.613 0.00899 **
## RURAL_URBANRemote rural 6.457e+03 1.069e+03 6.042 1.63e-09 ***
## RURAL_URBANUrban 2.062e+03 7.237e+02 2.849 0.00440 **
## RURAL_URBANWithout classification -2.784e+04 2.611e+04 -1.066 0.28631
## GVA_TOTAL 1.274e-04 2.177e-05 5.853 5.11e-09 ***
## GVA_MAINAutomobile services 1.989e+04 2.279e+03 8.728 < 2e-16 ***
## GVA_MAINConstruction -2.700e+03 5.653e+03 -0.478 0.63290
## GVA_MAINEnergy services 2.356e+04 1.627e+03 14.482 < 2e-16 ***
## GVA_MAINExtractive industries 1.973e+04 2.592e+03 7.612 3.14e-14 ***
## GVA_MAINFisheries -1.029e+03 3.038e+03 -0.339 0.73483
## GVA_MAINLivestock -9.312e+03 1.300e+03 -7.160 9.09e-13 ***
## GVA_MAINManufacturing industries 1.443e+04 1.134e+03 12.722 < 2e-16 ***
## GVA_MAINOther services -1.008e+04 7.592e+02 -13.275 < 2e-16 ***
## GVA_MAINPublic services -1.340e+04 7.088e+02 -18.908 < 2e-16 ***
## RES_POP_PROP_15.59 3.932e+04 9.975e+03 3.942 8.17e-05 ***
## RES_POP_PROP_ESTR 4.127e+04 3.716e+04 1.110 0.26687
## DU_PROP_RURAL 2.454e+03 1.295e+03 1.895 0.05817 .
## CROPPX_PER_HEC -7.233e+01 3.307e+01 -2.187 0.02878 *
## GVA_PROP_AGROPEC -1.675e+00 2.269e+00 -0.738 0.46033
## GVA_PROP_INDUSTRY 2.806e+01 3.174e+00 8.840 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14850 on 5544 degrees of freedom
## Multiple R-squared: 0.4687, Adjusted R-squared: 0.4663
## F-statistic: 195.6 on 25 and 5544 DF, p-value: < 2.2e-16
We look at the variables that are statistically significant at the 95% confidence interval (***, **, or *). Remove variables that are not statistically significant at the 95% confidence interval and build MLR model.
mlr2 <- lm(`GDP_CAPITA` ~ `IDHM_Renda` + `IDHM_Longevidade` + `RURAL_URBAN` + `GVA_TOTAL` +
`GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` + `GVA_PROP_INDUSTRY`,
data=brazil)
summary(mlr2)
##
## Call:
## lm(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade + RURAL_URBAN +
## GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 + CROPPX_PER_HEC +
## GVA_PROP_INDUSTRY, data = brazil)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54161 -5188 -874 2693 252508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.387e+04 6.552e+03 -6.696 2.35e-11 ***
## IDHM_Renda 8.879e+04 5.597e+03 15.864 < 2e-16 ***
## IDHM_Longevidade -1.615e+04 8.199e+03 -1.970 0.04886 *
## RURAL_URBANAdjacent rural 1.964e+03 6.458e+02 3.042 0.00236 **
## RURAL_URBANRemote intermediate 6.620e+03 2.012e+03 3.289 0.00101 **
## RURAL_URBANRemote rural 6.905e+03 1.038e+03 6.651 3.18e-11 ***
## RURAL_URBANUrban 2.191e+03 7.196e+02 3.044 0.00234 **
## RURAL_URBANWithout classification 2.168e+03 6.785e+03 0.319 0.74940
## GVA_TOTAL 1.303e-04 2.175e-05 5.992 2.20e-09 ***
## GVA_MAINAutomobile services 1.988e+04 2.277e+03 8.730 < 2e-16 ***
## GVA_MAINConstruction -2.393e+03 5.651e+03 -0.423 0.67197
## GVA_MAINEnergy services 2.358e+04 1.619e+03 14.565 < 2e-16 ***
## GVA_MAINExtractive industries 1.996e+04 2.586e+03 7.717 1.40e-14 ***
## GVA_MAINFisheries -1.178e+03 3.034e+03 -0.388 0.69777
## GVA_MAINLivestock -9.242e+03 1.298e+03 -7.121 1.21e-12 ***
## GVA_MAINManufacturing industries 1.446e+04 1.121e+03 12.897 < 2e-16 ***
## GVA_MAINOther services -1.003e+04 7.448e+02 -13.465 < 2e-16 ***
## GVA_MAINPublic services -1.344e+04 7.033e+02 -19.111 < 2e-16 ***
## RES_POP_PROP_15.59 4.227e+04 9.794e+03 4.315 1.62e-05 ***
## CROPPX_PER_HEC -7.329e+01 3.299e+01 -2.222 0.02633 *
## GVA_PROP_INDUSTRY 2.728e+01 3.034e+00 8.990 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14860 on 5549 degrees of freedom
## Multiple R-squared: 0.4676, Adjusted R-squared: 0.4657
## F-statistic: 243.7 on 20 and 5549 DF, p-value: < 2.2e-16
GDP_CAPITA = - 43870 + 88790(IDHM_Renda) - 16150(IDHM_Longevidade) + 1964(RURAL_URBANAdjacent rural) + 6620(RURAL_URBANRemote intermediate) + 6905(RURAL_URBANRemote rural) + 2191(RURAL_URBANUrban) + 2168(RURAL_URBANWithout classification) + 13030(GVA_TOTAL) + 19880(GVA_MAINAutomobile services) - 2393(GVA_MAINConstruction) + 23580(GVA_MAINEnergy services) + 19960(GVA_MAINExtractive industries) - 1178(GVA_MAINFisheries) - 9242(GVA_MAINLivestock) + 14460(GVA_MAINManufacturing industries) - 10040(GVA_MAINOther services) - 13440(GVA_MAINPublic services) + 42270(RES_POP_PROP_15.59) - 73.29(CROPPX_PER_HEC) + 27.28(GVA_PROP_INDUSTRY)
The MLR is interpreted as such:
IDHM_Renda, GDP_CAPITA increases by 88790 units (while other variables remain constant).RURAL_URBANRemote rural, GDP_CAPITA increases by 6905 units.GVA_TOTAL, GDP_CAPITA decreases by 13030 units.GVA_MAINAutomobile services, GDP_CAPITA increases by 19880 units.In addition, the most significant variables at the 95% confidence interval affecting GDP per capita are IDHM_Renda, IDHM_Longevidade, RURAL_URBANAdjacent rural, RURAL_URBANRemote intermediate, RURAL_URBANRemote rural, RURAL_URBANUrban, GVA_TOTAL, GVA_MAINAutomobile services, GVA_MAINEnergy services, GVA_MAINExtractive industries, GVA_MAINLivestock, GVA_MAINManufacturing industries, GVA_MAINOther services, GVA_MAINPublic services, RES_POP_PROP_15.59, CROPPX_PER_HEC, GVA_PROP_INDUSTRY.
Since the Adjusted R-squared = 0.4657, this MLR model can explain about 46.57% of variation in GDP per capita.
Since F-statistic = 243.7 >> 1, we can reject the null hypothesis that there is no relationship between our response variable GDP per capita and the independent variables in our model.
After modelling, we can check if our linear regression assumptions holds. If they do not, a transformed model may be used to explain the relationship between GDP per capita and the independent variables better.
The linear regression assumptions:
It is important to remember that we are using the residuals of the model to check against our assumptions and not the actual values. Luckily, the oslrr package takes care of this for us.
ols_plot_resid_fit(mlr2)
Since most of the data poitns are scattered around the 0 line, we can safely conclude that the relationships between the dependent variable and independent variables are linear.
ols_plot_resid_hist(mlr2)
From the above plot, we can tell that the residuals of the multiple linear regression model resemble normal distribution. We can safely conclude that the residuals have a normal distribution.
ols_vif_tol(mlr2)
## Variables Tolerance VIF
## 1 IDHM_Renda 0.1947953 5.133595
## 2 IDHM_Longevidade 0.2933923 3.408405
## 3 RURAL_URBANAdjacent rural 0.3835697 2.607088
## 4 RURAL_URBANRemote intermediate 0.9188681 1.088295
## 5 RURAL_URBANRemote rural 0.6735732 1.484620
## 6 RURAL_URBANUrban 0.3966336 2.521218
## 7 RURAL_URBANWithout classification 0.9603577 1.041279
## 8 GVA_TOTAL 0.9706129 1.030277
## 9 GVA_MAINAutomobile services 0.9335483 1.071182
## 10 GVA_MAINConstruction 0.9892562 1.010860
## 11 GVA_MAINEnergy services 0.8755305 1.142165
## 12 GVA_MAINExtractive industries 0.9495395 1.053142
## 13 GVA_MAINFisheries 0.9642609 1.037064
## 14 GVA_MAINLivestock 0.8386443 1.192401
## 15 GVA_MAINManufacturing industries 0.7063135 1.415802
## 16 GVA_MAINOther services 0.3668958 2.725569
## 17 GVA_MAINPublic services 0.3207995 3.117211
## 18 RES_POP_PROP_15.59 0.4010996 2.493146
## 19 CROPPX_PER_HEC 0.9347895 1.069760
## 20 GVA_PROP_INDUSTRY 0.9484295 1.054375
Since the VIF of the independent variables are less than 10. We can safely conclude that there are no sign of multicollinearity among the independent variables.
Combine residuals with sf dataframe:
brazil.res.sf <- cbind(brazil, mlr2$residuals) %>%
rename(`MLR_RES` = `mlr2.residuals`)
Convert to SpatialPointsDataframe:
brazil.sp <- as_Spatial(brazil.res.sf)
brazil.sp
## class : SpatialPolygonsDataFrame
## features : 5570
## extent : 1552246, 6575781, 6030702, 10583412 (xmin, xmax, ymin, ymax)
## crs : +proj=merc +lon_0=-43 +lat_ts=-2 +x_0=5000000 +y_0=10000000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## variables : 15
## names : IDHM_Renda, IDHM_Longevidade, IDHM_Educacao, AREA, RURAL_URBAN, GVA_TOTAL, GDP_CAPITA, GVA_MAIN, RES_POP_PROP_15.59, RES_POP_PROP_ESTR, DU_PROP_RURAL, CROPPX_PER_HEC, GVA_PROP_AGROPEC, GVA_PROP_INDUSTRY, MLR_RES
## min values : 0.4, 0.672, 0.207, 0.679, Adjacent intermediate, 17.48, 3190.57, Agriculture, 0.471631205673759, 0, 0.000129388931365053, 0, 0, 1.42261767428109e-05, -54161.1593263509
## max values : 0.891, 0.894, 0.825, 159533.33, Without classification, 569910502.87, 314637.69, Public services, 0.744760130414532, 0.679, 0.954473822447908, 106.960227272727, 760.885597874933, 874.802651309903, 252507.656201713
tm_shape(brazil.sp)+
tm_polygons() +
tm_shape(brazil.sp) +
tm_fill(col = "MLR_RES",
alpha = 0.6,
n = 4,
style="quantile")
The figure above reveal that the residuals display signs of spatial autocorrelation. Positive spatial autocorrelation, or clustering, means that neighbours are more similar to each other rather than randomly distributed.
To proof that our observation is indeed true, the Moran’s I test will be performed.
Calculate the distance between polygons and its neighbours:
coords <- coordinates(brazil.sp)
k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1, coords, longlat = FALSE))
summary(k1dists)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1467 12473 16794 22290 24987 370668
Calculate distance-based weight matrix:
nb <- dnearneigh(coords, 0, 370669, longlat = FALSE)
Convert neighbours list to spatial weights:
nb_lw <- nb2listw(nb, style = 'W')
Perform the Moran’s I test for residual spatial autocorrelation:
lm.morantest(mlr2, nb_lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade +
## RURAL_URBAN + GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 +
## CROPPX_PER_HEC + GVA_PROP_INDUSTRY, data = brazil)
## weights: nb_lw
##
## Moran I statistic standard deviate = 13.581, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 1.511154e-02 -4.834299e-04 1.318550e-06
The Global Moran’s I test for residual spatial autocorrelation shows that it’s p-value < 2.2e-16 which is less than the alpha value of 0.05. Hence, we will reject the null hypothesis that the residuals are randomly distributed.
Since the Observed Global Moran I = 1.511154e-02 which is greater than 0, we can infer than the residuals resemble cluster distribution.
Once again, the Geographically Weighted Regression allows us to analyse spatial non-stationarity which is the measurement of relationship among variables that differs between locations. It is done by computing a regression model for each observation.
Since the size of Brazil municipalities vary, we should use an adaptive spatial kernel instead of a fixed one. The adaptive bandwidth kernel allows the bandwidth to become a function of the number of nearest neighbours such that each observation is based on the same number of cities. Suppose a fixed bandwidth is used, the bandwidth distance is longer for larger cities and shorter for smaller cities. This makes it hard for us to determine an optimal bandwidth to use.
There are two approaches for calculating the bandwidth:
We will use both and determine the best model using three evaluation metrices:
Calculate distance matrix using gw.dist(). Data points used are Brazil municipalities from the SpatialPointsDataframe.
distmat <- gw.dist(dp.locat = coordinates(brazil.sp))
The bw.gwr() function computes the automatic bandwidth selection for a GWR model.
adaptive=TRUE for an adaptive bandwidthapproach="CV" for Cross-Validation approachkernel="gaussian" for a gaussian functionlonglat=FALSE for great circle distance to be used since the unit of projection is in mThe variables used are those significant variables identified earlier in MLR model.
bw.adaptive <- bw.gwr(`GDP_CAPITA`~`IDHM_Renda` + `IDHM_Longevidade` +
`RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` +
`GVA_PROP_INDUSTRY`,
data=brazil.sp, approach="CV", kernel="gaussian",
adaptive=TRUE, longlat=FALSE, dMat = distmat)
## Take a cup of tea and have a break, it will take a few minutes.
## -----A kind suggestion from GWmodel development group
## Adaptive bandwidth: 3450 CV score: 1.26322e+12
## Adaptive bandwidth: 2140 CV score: 1.247214e+12
## Adaptive bandwidth: 1330 CV score: 1.228808e+12
## Adaptive bandwidth: 829 CV score: 1.214902e+12
## Adaptive bandwidth: 520 CV score: 1.207128e+12
## Adaptive bandwidth: 328 CV score: 1.226663e+12
## Adaptive bandwidth: 637 CV score: 1.209545e+12
## Adaptive bandwidth: 445 CV score: 1.20858e+12
## Adaptive bandwidth: 563 CV score: 1.207588e+12
## Adaptive bandwidth: 489 CV score: 1.207671e+12
## Adaptive bandwidth: 534 CV score: 1.207536e+12
## Adaptive bandwidth: 506 CV score: 1.207316e+12
## Adaptive bandwidth: 523 CV score: 1.207281e+12
## Adaptive bandwidth: 512 CV score: 1.207233e+12
## Adaptive bandwidth: 519 CV score: 1.207168e+12
## Adaptive bandwidth: 514 CV score: 1.207089e+12
## Adaptive bandwidth: 517 CV score: 1.207113e+12
## Adaptive bandwidth: 519 CV score: 1.207168e+12
## Adaptive bandwidth: 518 CV score: 1.207172e+12
## Adaptive bandwidth: 519 CV score: 1.207168e+12
## Adaptive bandwidth: 518 CV score: 1.207172e+12
## Adaptive bandwidth: 518 CV score: 1.207172e+12
## Adaptive bandwidth: 517 CV score: 1.207113e+12
## Adaptive bandwidth: 517 CV score: 1.207113e+12
## Adaptive bandwidth: 516 CV score: 1.207115e+12
## Adaptive bandwidth: 516 CV score: 1.207115e+12
## Adaptive bandwidth: 515 CV score: 1.207099e+12
## Adaptive bandwidth: 515 CV score: 1.207099e+12
## Adaptive bandwidth: 514 CV score: 1.207089e+12
With the CV approach, 514 data points are recommended.
The gwr.basic() function implements the GWR model.
adaptive=TRUE for an adaptive bandwidthbw=bw.adaptive to specify the bandwidth calculated earlier.gwr.adaptive <- gwr.basic(`GDP_CAPITA`~`IDHM_Renda` + `IDHM_Longevidade` +
`RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` +
`GVA_PROP_INDUSTRY`,
data=brazil.sp, bw=bw.adaptive, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)
gwr.adaptive
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2020-05-31 22:17:59
## Call:
## gwr.basic(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade +
## RURAL_URBAN + GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 +
## CROPPX_PER_HEC + GVA_PROP_INDUSTRY, data = brazil.sp, bw = bw.adaptive,
## kernel = "gaussian", adaptive = TRUE, longlat = FALSE)
##
## Dependent (y) variable: GDP_CAPITA
## Independent variables: IDHM_Renda IDHM_Longevidade RURAL_URBAN GVA_TOTAL GVA_MAIN RES_POP_PROP_15.59 CROPPX_PER_HEC GVA_PROP_INDUSTRY
## Number of data points: 5570
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54161 -5188 -874 2693 252508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.387e+04 6.552e+03 -6.696 2.35e-11 ***
## IDHM_Renda 8.879e+04 5.597e+03 15.864 < 2e-16 ***
## IDHM_Longevidade -1.615e+04 8.199e+03 -1.970 0.04886 *
## RURAL_URBANAdjacent rural 1.964e+03 6.458e+02 3.042 0.00236 **
## RURAL_URBANRemote intermediate 6.620e+03 2.012e+03 3.289 0.00101 **
## RURAL_URBANRemote rural 6.905e+03 1.038e+03 6.651 3.18e-11 ***
## RURAL_URBANUrban 2.191e+03 7.196e+02 3.044 0.00234 **
## RURAL_URBANWithout classification 2.168e+03 6.785e+03 0.319 0.74940
## GVA_TOTAL 1.303e-04 2.175e-05 5.992 2.20e-09 ***
## GVA_MAINAutomobile services 1.988e+04 2.277e+03 8.730 < 2e-16 ***
## GVA_MAINConstruction -2.393e+03 5.651e+03 -0.423 0.67197
## GVA_MAINEnergy services 2.358e+04 1.619e+03 14.565 < 2e-16 ***
## GVA_MAINExtractive industries 1.996e+04 2.586e+03 7.717 1.40e-14 ***
## GVA_MAINFisheries -1.178e+03 3.034e+03 -0.388 0.69777
## GVA_MAINLivestock -9.242e+03 1.298e+03 -7.121 1.21e-12 ***
## GVA_MAINManufacturing industries 1.446e+04 1.121e+03 12.897 < 2e-16 ***
## GVA_MAINOther services -1.003e+04 7.448e+02 -13.465 < 2e-16 ***
## GVA_MAINPublic services -1.344e+04 7.033e+02 -19.111 < 2e-16 ***
## RES_POP_PROP_15.59 4.227e+04 9.794e+03 4.315 1.62e-05 ***
## CROPPX_PER_HEC -7.329e+01 3.299e+01 -2.222 0.02633 *
## GVA_PROP_INDUSTRY 2.728e+01 3.034e+00 8.990 < 2e-16 ***
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 14860 on 5549 degrees of freedom
## Multiple R-squared: 0.4676
## Adjusted R-squared: 0.4657
## F-statistic: 243.7 on 20 and 5549 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 1.225655e+12
## Sigma(hat): 14836.6
## AIC: 122847
## AICc: 122847.2
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: gaussian
## Adaptive bandwidth: 514 (number of nearest neighbours)
## Regression points: the same locations as observations are used.
## Distance metric: Euclidean distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median
## Intercept -1.5706e+05 -7.1522e+04 -4.7143e+04
## IDHM_Renda 8.5105e+03 4.3618e+04 8.7402e+04
## IDHM_Longevidade -9.9974e+04 -6.1629e+04 -5.1612e+03
## RURAL_URBANAdjacent rural -4.5147e+02 7.5420e+02 1.8534e+03
## RURAL_URBANRemote intermediate 6.3517e+02 2.3047e+03 6.1966e+03
## RURAL_URBANRemote rural -1.4197e+03 1.8151e+03 5.0593e+03
## RURAL_URBANUrban -4.6890e+02 1.5340e+03 2.2445e+03
## RURAL_URBANWithout classification -1.9804e+04 -2.2142e+03 5.0892e+03
## GVA_TOTAL 6.7580e-05 1.2539e-04 1.6448e-04
## GVA_MAINAutomobile services 3.3516e+03 1.0040e+04 1.6044e+04
## GVA_MAINConstruction -9.5004e+03 -5.9814e+03 -1.6002e+03
## GVA_MAINEnergy services 8.4164e+03 1.4175e+04 2.0356e+04
## GVA_MAINExtractive industries -1.7468e+03 1.3497e+04 1.9613e+04
## GVA_MAINFisheries -1.1084e+04 -3.6825e+03 -6.8811e+01
## GVA_MAINLivestock -1.3486e+04 -9.5680e+03 -5.6149e+03
## GVA_MAINManufacturing industries 6.1271e+03 1.2953e+04 1.8595e+04
## GVA_MAINOther services -1.4539e+04 -8.7634e+03 -7.4939e+03
## GVA_MAINPublic services -1.6544e+04 -1.3866e+04 -1.0873e+04
## RES_POP_PROP_15.59 -3.5925e+03 1.9419e+04 3.8345e+04
## CROPPX_PER_HEC -6.9442e+02 -1.0301e+02 3.2483e+01
## GVA_PROP_INDUSTRY 7.0969e+00 1.8020e+01 2.4047e+01
## 3rd Qu. Max.
## Intercept -2.2031e+04 -3648.5309
## IDHM_Renda 1.1091e+05 178842.8382
## IDHM_Longevidade 1.0048e+04 51710.7515
## RURAL_URBANAdjacent rural 2.7296e+03 6263.1210
## RURAL_URBANRemote intermediate 1.7698e+04 65592.5477
## RURAL_URBANRemote rural 8.9983e+03 14671.2002
## RURAL_URBANUrban 2.9990e+03 4412.2162
## RURAL_URBANWithout classification 3.9848e+04 63036.7410
## GVA_TOTAL 2.8841e-04 0.0007
## GVA_MAINAutomobile services 3.2978e+04 63867.6426
## GVA_MAINConstruction 2.9826e+03 10034.7902
## GVA_MAINEnergy services 3.2181e+04 46050.6621
## GVA_MAINExtractive industries 2.8758e+04 41146.5630
## GVA_MAINFisheries 9.4667e+03 17266.1776
## GVA_MAINLivestock -3.8938e+03 -945.8665
## GVA_MAINManufacturing industries 2.2352e+04 32989.5989
## GVA_MAINOther services -6.2534e+03 -3024.1746
## GVA_MAINPublic services -8.9900e+03 -7042.4360
## RES_POP_PROP_15.59 9.8460e+04 180990.3580
## CROPPX_PER_HEC 8.6959e+01 158.4598
## GVA_PROP_INDUSTRY 2.9918e+01 56.7262
## ************************Diagnostic information*************************
## Number of data points: 5570
## Effective number of parameters (2trace(S) - trace(S'S)): 154.8007
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 5415.199
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 122121.3
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 121999.5
## Residual sum of squares: 1.039356e+12
## R-square value: 0.548561
## Adjusted R-square value: 0.5356537
##
## ***********************************************************************
## Program stops at: 2020-05-31 22:19:43
For the AICc approach, the same parameters are used except approach="AICc" to specify the AICc approach.
bw.adaptive2 <- bw.gwr(`GDP_CAPITA`~`IDHM_Renda` + `IDHM_Longevidade` +
`RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` +
`GVA_PROP_INDUSTRY`,
data=brazil.sp, approach="AICc", kernel="gaussian",
adaptive=TRUE, longlat=FALSE, dMat = distmat)
## Take a cup of tea and have a break, it will take a few minutes.
## -----A kind suggestion from GWmodel development group
## Adaptive bandwidth (number of nearest neighbours): 3450 AICc value: 122731.8
## Adaptive bandwidth (number of nearest neighbours): 2140 AICc value: 122613.8
## Adaptive bandwidth (number of nearest neighbours): 1330 AICc value: 122465.1
## Adaptive bandwidth (number of nearest neighbours): 829 AICc value: 122302.2
## Adaptive bandwidth (number of nearest neighbours): 520 AICc value: 122125.3
## Adaptive bandwidth (number of nearest neighbours): 328 AICc value: 121957.2
## Adaptive bandwidth (number of nearest neighbours): 210 AICc value: 121816.4
## Adaptive bandwidth (number of nearest neighbours): 136 AICc value: 121653.7
## Adaptive bandwidth (number of nearest neighbours): 91 AICc value: 121516.2
## Error in gw_reg(X, Y, W.i, TRUE, i) : inv(): matrix seems singular
## Adaptive bandwidth (number of nearest neighbours): 62 AICc value: Inf
## Adaptive bandwidth (number of nearest neighbours): 107 AICc value: 121567.6
## Adaptive bandwidth (number of nearest neighbours): 79 AICc value: 121465.6
## Adaptive bandwidth (number of nearest neighbours): 73 AICc value: 121449.6
## Adaptive bandwidth (number of nearest neighbours): 68 AICc value: Inf
## Adaptive bandwidth (number of nearest neighbours): 75 AICc value: 121456.3
## Adaptive bandwidth (number of nearest neighbours): 70 AICc value: Inf
## Adaptive bandwidth (number of nearest neighbours): 73 AICc value: 121449.6
With the AICc approach, a much smaller number of data points - 73 - is recommended.
gwr.adaptive2 <- gwr.basic(`GDP_CAPITA`~`IDHM_Renda` + `IDHM_Longevidade` +
`RURAL_URBAN` + `GVA_TOTAL` + `GVA_MAIN` + `RES_POP_PROP_15.59` + `CROPPX_PER_HEC` +
`GVA_PROP_INDUSTRY`,
data=brazil.sp, bw=bw.adaptive2, kernel = 'gaussian', adaptive=TRUE, longlat = FALSE)
gwr.adaptive2
## ***********************************************************************
## * Package GWmodel *
## ***********************************************************************
## Program starts at: 2020-05-31 22:34:57
## Call:
## gwr.basic(formula = GDP_CAPITA ~ IDHM_Renda + IDHM_Longevidade +
## RURAL_URBAN + GVA_TOTAL + GVA_MAIN + RES_POP_PROP_15.59 +
## CROPPX_PER_HEC + GVA_PROP_INDUSTRY, data = brazil.sp, bw = bw.adaptive2,
## kernel = "gaussian", adaptive = TRUE, longlat = FALSE)
##
## Dependent (y) variable: GDP_CAPITA
## Independent variables: IDHM_Renda IDHM_Longevidade RURAL_URBAN GVA_TOTAL GVA_MAIN RES_POP_PROP_15.59 CROPPX_PER_HEC GVA_PROP_INDUSTRY
## Number of data points: 5570
## ***********************************************************************
## * Results of Global Regression *
## ***********************************************************************
##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54161 -5188 -874 2693 252508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.387e+04 6.552e+03 -6.696 2.35e-11 ***
## IDHM_Renda 8.879e+04 5.597e+03 15.864 < 2e-16 ***
## IDHM_Longevidade -1.615e+04 8.199e+03 -1.970 0.04886 *
## RURAL_URBANAdjacent rural 1.964e+03 6.458e+02 3.042 0.00236 **
## RURAL_URBANRemote intermediate 6.620e+03 2.012e+03 3.289 0.00101 **
## RURAL_URBANRemote rural 6.905e+03 1.038e+03 6.651 3.18e-11 ***
## RURAL_URBANUrban 2.191e+03 7.196e+02 3.044 0.00234 **
## RURAL_URBANWithout classification 2.168e+03 6.785e+03 0.319 0.74940
## GVA_TOTAL 1.303e-04 2.175e-05 5.992 2.20e-09 ***
## GVA_MAINAutomobile services 1.988e+04 2.277e+03 8.730 < 2e-16 ***
## GVA_MAINConstruction -2.393e+03 5.651e+03 -0.423 0.67197
## GVA_MAINEnergy services 2.358e+04 1.619e+03 14.565 < 2e-16 ***
## GVA_MAINExtractive industries 1.996e+04 2.586e+03 7.717 1.40e-14 ***
## GVA_MAINFisheries -1.178e+03 3.034e+03 -0.388 0.69777
## GVA_MAINLivestock -9.242e+03 1.298e+03 -7.121 1.21e-12 ***
## GVA_MAINManufacturing industries 1.446e+04 1.121e+03 12.897 < 2e-16 ***
## GVA_MAINOther services -1.003e+04 7.448e+02 -13.465 < 2e-16 ***
## GVA_MAINPublic services -1.344e+04 7.033e+02 -19.111 < 2e-16 ***
## RES_POP_PROP_15.59 4.227e+04 9.794e+03 4.315 1.62e-05 ***
## CROPPX_PER_HEC -7.329e+01 3.299e+01 -2.222 0.02633 *
## GVA_PROP_INDUSTRY 2.728e+01 3.034e+00 8.990 < 2e-16 ***
##
## ---Significance stars
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 14860 on 5549 degrees of freedom
## Multiple R-squared: 0.4676
## Adjusted R-squared: 0.4657
## F-statistic: 243.7 on 20 and 5549 DF, p-value: < 2.2e-16
## ***Extra Diagnostic information
## Residual sum of squares: 1.225655e+12
## Sigma(hat): 14836.6
## AIC: 122847
## AICc: 122847.2
## ***********************************************************************
## * Results of Geographically Weighted Regression *
## ***********************************************************************
##
## *********************Model calibration information*********************
## Kernel function: gaussian
## Adaptive bandwidth: 73 (number of nearest neighbours)
## Regression points: the same locations as observations are used.
## Distance metric: Euclidean distance metric is used.
##
## ****************Summary of GWR coefficient estimates:******************
## Min. 1st Qu. Median
## Intercept -5.8415e+05 -8.2176e+04 -2.4675e+04
## IDHM_Renda -4.7526e+04 2.0540e+04 5.3026e+04
## IDHM_Longevidade -1.7100e+05 -2.4293e+04 -1.7001e+02
## RURAL_URBANAdjacent rural -9.8291e+03 -2.7287e+02 1.2075e+03
## RURAL_URBANRemote intermediate -5.7906e+03 4.2024e+02 2.6362e+03
## RURAL_URBANRemote rural -2.3799e+04 -2.1229e+03 5.5992e+02
## RURAL_URBANUrban -1.5131e+04 -1.2073e+02 1.0832e+03
## RURAL_URBANWithout classification -4.0165e+04 -2.8647e+03 9.6314e+03
## GVA_TOTAL -1.1763e-03 1.9422e-04 4.5728e-04
## GVA_MAINAutomobile services -2.5264e+04 3.8383e+03 1.2744e+04
## GVA_MAINConstruction -2.5941e+04 -7.6904e+03 -1.2297e+03
## GVA_MAINEnergy services -2.4483e+04 6.7263e+03 1.9336e+04
## GVA_MAINExtractive industries -1.4134e+04 1.0860e+04 1.9199e+04
## GVA_MAINFisheries -2.1858e+04 -5.4305e+03 7.0074e+02
## GVA_MAINLivestock -2.9123e+04 -1.0278e+04 -3.3315e+03
## GVA_MAINManufacturing industries -2.3264e+04 6.4007e+03 1.2687e+04
## GVA_MAINOther services -3.9024e+04 -9.4419e+03 -5.5550e+03
## GVA_MAINPublic services -3.2173e+04 -1.3585e+04 -9.4485e+03
## RES_POP_PROP_15.59 -7.3438e+04 1.1545e+04 3.1123e+04
## CROPPX_PER_HEC -1.9938e+03 -3.1155e+02 -6.9616e+00
## GVA_PROP_INDUSTRY -3.0009e+01 3.8901e+00 1.3568e+01
## 3rd Qu. Max.
## Intercept -5.3588e+03 4.8046e+04
## IDHM_Renda 9.6294e+04 3.4428e+05
## IDHM_Longevidade 1.4606e+04 1.6333e+05
## RURAL_URBANAdjacent rural 3.2162e+03 1.3293e+04
## RURAL_URBANRemote intermediate 1.5386e+04 9.9550e+04
## RURAL_URBANRemote rural 5.7434e+03 2.9267e+04
## RURAL_URBANUrban 2.6505e+03 1.0235e+04
## RURAL_URBANWithout classification 5.9914e+04 7.8985e+04
## GVA_TOTAL 8.9064e-04 5.4000e-03
## GVA_MAINAutomobile services 2.6056e+04 8.0397e+04
## GVA_MAINConstruction 7.5216e+03 3.2031e+04
## GVA_MAINEnergy services 3.4841e+04 7.3178e+04
## GVA_MAINExtractive industries 3.0719e+04 9.0605e+04
## GVA_MAINFisheries 9.7290e+03 3.0713e+04
## GVA_MAINLivestock 1.1565e+03 9.4133e+03
## GVA_MAINManufacturing industries 2.2383e+04 7.6055e+04
## GVA_MAINOther services -2.3132e+03 5.0042e+03
## GVA_MAINPublic services -5.9701e+03 1.8841e+03
## RES_POP_PROP_15.59 9.1626e+04 5.8977e+05
## CROPPX_PER_HEC 1.3027e+02 6.7404e+02
## GVA_PROP_INDUSTRY 3.3779e+01 1.1475e+02
## ************************Diagnostic information*************************
## Number of data points: 5570
## Effective number of parameters (2trace(S) - trace(S'S)): 716.8634
## Effective degrees of freedom (n-2trace(S) + trace(S'S)): 4853.137
## AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 121449.6
## AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 120804.5
## Residual sum of squares: 778337760180
## R-square value: 0.6619329
## Adjusted R-square value: 0.6119863
##
## ***********************************************************************
## Program stops at: 2020-05-31 22:36:28
We can use the AICc, adjusted R-squared and residual sum of squares to evaluate the performance between the two GWR models.
| Evaluation metrics | Model 1 (gwr.adaptive) | Model 2 (gwr.adaptive2) |
|---|---|---|
| AICc | 122121.3 | 121449.6 |
| Adjusted R-squared | 0.5356537 | 0.6119863 |
| Residual sum of squares | 1,039,356,000,000 | 778,337,760,180 |
The two model’s AICc is different by at least 3. Thus Model 2 with the lower AICc is the better model. Since Model 2’s adjusted R-squared 0.6119863 > Model 1’s 0.5356537 and Model 2’s residual sum of squares 78,337,760,180 < Model 1’s 1,039,356,000,000, Model 2 is a better model. Thus, gwr.adaptive2 with Guassian kernel and AICc approach is a better fit model. We will use this model in our subsequent visualization of GWR output.
brazil.sf.adaptive <- st_as_sf(gwr.adaptive2$SDF) %>%
st_transform(crs=5641)
Residuals are found by observed y (y) - fitted y (yhat).
Standardise residuals so that mean=0, standard deviation=1.
brazil.sf.adaptive$residual_scaled <- as.numeric(round(scale(brazil.sf.adaptive$residual), digits = 4))
tm_shape(brazil.sf.adaptive)+
tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "residual_scaled",
alpha = 0.6,
palette = "RdBu",
breaks = c(-Inf, -0.5, 0, 0.5, Inf))
The predicted GDP per capita is very close to the actual value for cities in the northeastern region, where most of the scaled residuals are between -0.5 and 0.5. From the center to the southern region, the GWR model is unable to predict very accurately the GDP per capita. This may suggest that more variables are needed in the model.
Local R2 measures how well the local regression model fits the observed y values. A very low value indicates poor fitting which may be due to the absence of important variables from the regression model.
tm_shape(brazil.sf.adaptive)+
tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "Local_R2",
alpha = 0.6,
n = 4,
style="quantile")
From the above plot, we can see that the top 75% Local R2 values are higher than the MLR model adjusted R-squared.
In fact, the lowest Local R2 value of 0.40 is only slightly worse off than mlr2 with adjusted R-squared 0.4657. Cities with the bottom 25% Local R2 values are mostly situated in the south, and some parts of the northeast. This may indicate that some variables are missing in the GWR model.
The coefficient estimates of the significant continuous variables at 95% confidence of the MLR model are plotted.
m1 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "IDHM_Renda", breaks = c(-Inf, 0, 30000, 60000, 90000, Inf), alpha = 0.6, palette = "RdBu")
m2 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "GVA_TOTAL", breaks = c(-Inf, 0, 0.0005, 0.001, 0.0015, Inf), alpha = 0.6, palette = "RdBu")
m3 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "RES_POP_PROP_15.59", breaks = c(-Inf, 0, 20000, 40000, 60000, Inf), alpha = 0.6, palette = "RdBu")
m4 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "GVA_PROP_INDUSTRY", breaks = c(-Inf, 0, 10, 20, 30, Inf), alpha = 0.6, palette = "RdBu")
tmap_arrange(m1, m2, m3, m4,
asp=1, ncol=2, nrow=2)
Interpretaion of plots:
However, it should be noted that these values do not signify the variable significance. An observation with a high parameter estimate does not equate to the strength of relationship between the independent variable and the response variable, GDP per capita.
The coefficient standard error measures the realiability of each coefficient estimate. Small values indicate higher confidence while large values may indicate problems with local collinearity.
The coefficient estimates of the significant continuous variables at 95% confidence of the MLR model are plotted.
m1 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "IDHM_Renda_SE", style = "equal", alpha = 0.6, palette = "RdBu")
m2 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "GVA_TOTAL_SE", style = "equal", alpha = 0.6, palette = "RdBu")
m3 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "RES_POP_PROP_15.59_SE", style = "equal", alpha = 0.6, palette = "RdBu")
m4 <- tm_shape(brazil.sf.adaptive) + tm_polygons() +
tm_shape(brazil.sf.adaptive) +
tm_fill(col = "GVA_PROP_INDUSTRY_SE",style = "equal", alpha = 0.6, palette = "RdBu")
tmap_arrange(m1, m2, m3, m4,
asp=1, ncol=2, nrow=2)
Interpretation of plots:
Once again, we can use the three evaluation metrics to compare the OLS and GWR models.
| Evaluation metrics | OLS model | GWR model |
|---|---|---|
| AICc | 122847.2 | 121449.6 |
| Adjusted R-squared | 0.4657 | 0.6119863 |
| Residual sum of squares | 1,225,655,000,000 | 778,337,760,180 |
The two model’s AICc is different by at least 3. Thus GWR model with the lower AICc is the better model. Since GWR model’s adjusted R-squared 0.6119863 > OLS models’s 0.4657 and GWR model’s residual sum of squares 78,337,760,180 < OLS model’s 1,225,655,000,000, GWR is a better model.
From the Local R2 choropleth map of the GWR model, we can tell that the fit for the bottom 25% cities (0.400 - 0.616) is much lower than the top 25% cities (0.867, 0.992). This means that there are may be some missing independent variables in the GWR model to explain the GDP per capita of cities with low Local R2 values.
Comparing with the GDP per capita choropleth map at the EDA, the southern region with low Local R2 correspond to high GDP per capita cities while the northeastern region with low Local R2 correspond to low GDP per capita cities.
Lastly, the data used in this analysis taps on different years. Although we are examining the factors that affect GDP per capita 2016 values, we also used data that are beyond 2016.