In this research website post we analyze the economic interdependencies between several economic factors across the United States. We have chosen single state as the spatial unit. We analyze the data between 2014 and 2018.
First of all we show what data we handle in the project. Then we map the data and convert it into single dataframe.
We exclude Alaska, Hawaii and Puerto Rico from the analysis. We focus on 49 states.
states<-readOGR("data", "cb_2021_us_state_20m", verbose=FALSE)
states <- states[!(states@data$NAME %in% c("Alaska", "Hawaii", "Puerto Rico")),]
states.df <- as.data.frame(states)
plot(states)
crds<-coordinates(states)
pointLabel(crds, as.character(states.df$NAME), cex=0.6)
In order to perform more intricate modelling procedure we read additional datasets such as unemployment rate, salary, estimation of population or income in the U.S.A. We use the website (“Iowa State University” 2022) to gather relevant data revealing economic indicators in the United States across analyzed period.
emp <- read_excel("data\\emp-unemployment.xls",
sheet = "States", skip = 5, n_max=52)
names(emp) <- paste0('emp_', emp %>% names())
salary <- read_excel("data\\emp-salary.xlsx", skip = 5, n_max=66)
names(salary) <- paste0('salary_', salary %>% names())
popest <- read_excel("data\\popest-annual.xlsx",
sheet = "States", skip = 6, n_max=63)
names(popest) <- paste0('popest_', popest %>% names())
income <- read_excel("data\\income.xlsx", skip = 5, n_max=66)
names(income) <- paste0('income_', income %>% names())
We merge the spatial polygons data with economic indicators.
data_with_indicators <- states@data %>%
select(NAME) %>%
inner_join(emp, by=c("NAME" = "emp_Area")) %>%
inner_join(salary, by=c("NAME" = "salary_GeoName")) %>%
inner_join(popest, by=c("NAME" = "popest_Area")) %>%
inner_join(income, by=c("NAME" = "income_GeoName"))
Based on the obtained data we present the choropleth graphics.
x <- data_with_indicators %>% select(emp_2015) %>% pull()
shades<-auto.shading(x)
choropleth(states, x, shades)
pointLabel(crds, as.character(states.df$NAME), cex=0.6)
choro.legend(14, 50.5, shades, bty="n")
title(main="Unemployment rate in USA across states in 2015")
choro.legend(px='bottomleft', sh=shades,fmt="%4.1f",cex=0.8)
The north of the USA seems as the region with the highest unemployment rate.
variable<-data_with_indicators %>% select(salary_2015) %>% pull()
summary(variable)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 406033 1092255 2587451 3856388 4631417 22687196
intervals<-8
colors<-brewer.pal(intervals, "BuPu") # choice of colors
classes<-classIntervals(variable, intervals, style="equal")
color.table<-findColours(classes, colors)
plot(states, col=color.table)
legend("bottomleft",
legend=names(attr(color.table, "table")),
fill=attr(color.table, "palette"), cex=0.8, bty="n")
title(main="Salary in USA across states in 2015")
The highest salary has been observed in California, Texas and New York states.
Based on the spatial weights matrix W we can compute the spatial lag. Spatial lag is the average value in the neighbourhood. If we want to measure externalities we need to include it in the process.
cont.nb<-poly2nb(as(states, "SpatialPolygons"))
cont.listw<-nb2listw(cont.nb, style="W")
states.knn<-knearneigh(crds, k=4)
states.nb<-knn2nb(states.knn)
dist<-nbdists(states.nb, crds)
dist1<-lapply(dist, function(x) 1/x)
dist.listw<-nb2listw(states.nb, glist=dist1)
states$lagg<-lag.listw(cont.listw, data_with_indicators$salary_2015)
summary(states$lagg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 636443 2033244 4033102 3822437 5229178 7375118
spplot(states, "lagg") # first order lag
We show spatial lag of the salary variable.
In the previous figure we have spotted clusters of high and low values. Hence, we calculate Moran’s statistics.
moran.test(data_with_indicators$emp_2015, cont.listw)
##
## Moran I test under randomisation
##
## data: data_with_indicators$emp_2015
## weights: cont.listw
##
## Moran I statistic standard deviate = 4.4519, p-value = 4.257e-06
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.412662247 -0.020833333 0.009481734
We reject the null hypothesis that states that there is no spatial autocorrelation, based on p-value that is smaller than 0.05.
result02<-moran.test(data_with_indicators$emp_2015, dist.listw)
result02
##
## Moran I test under randomisation
##
## data: data_with_indicators$emp_2015
## weights: dist.listw
##
## Moran I statistic standard deviate = 4.4167, p-value = 5.011e-06
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.413167971 -0.020833333 0.009655714
We obtain similar results for inverse distance matrix.
x<-data_with_indicators$emp_2015 # variable selection
zx<-as.data.frame(scale(x)) #standardization of variable
moran.plot(zx$V1, cont.listw, pch=19, labels=as.character(data_with_indicators$NAME))
round(mean(zx$V1),0)
## [1] 0
sd(zx$V1)
## [1] 1
wzx<-lag.listw(cont.listw, zx$V1) # spatial lag of x
morlm<-lm(wzx~zx$V1) # linear regression
summary(morlm)
##
## Call:
## lm(formula = wzx ~ zx$V1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2292 -0.3301 -0.0055 0.3915 1.0723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07244 0.07857 -0.922 0.361
## zx$V1 0.41266 0.07939 5.198 4.29e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.55 on 47 degrees of freedom
## Multiple R-squared: 0.365, Adjusted R-squared: 0.3515
## F-statistic: 27.02 on 1 and 47 DF, p-value: 4.291e-06
slope<-morlm$coefficients[2] # coefficient from regression
intercept<-morlm$coefficients[1] # constant term in regression
plot(zx$V1, wzx, xlab="zx",ylab="spatial lag of zx", pch="*")
abline(intercept, slope) # regression line
abline(h=0, lty=2) # supplementary horizontal line y=0
abline(v=0, lty=2) # supplementary vertical line x=0
# we have x, wx & wzx from above
cond1<-ifelse(zx>=0 & wzx>=0, 1,0) # I quarter
cond2<-ifelse(zx>=0 & wzx<0, 2,0) # II quarter
cond3<-ifelse(zx<0 & wzx<0, 3,0) # III quarter
cond4<-ifelse(zx<0 & wzx>=0, 4,0) # IV quarter
cond.all<-cond1+cond2+cond3+cond4 # all quarters in one
cond<-as.data.frame(cond.all)
brks<-c(1,2,3,4)
cols<-c("grey25", "grey60", "grey85", "grey45")
plot(states, col=cols[findInterval(cond$V1, brks)])
legend("bottomleft", legend=c("I Q - HH – high surrounded by high", "II Q - LH - low surrounded by high", "III Q - LL - low surrounded by low", "IV Q - HL - high surrounded by low"), fill=cols, bty="n", cex=0.80)
title(main="Mapping of Moranscatterplot results unemployment rate in 2015")
The clusters of observations have been observed in the figure above. It is an indicator of the high regional interdependencies.
Unemployment rate in 2014-2018 in the United States.
moran<-matrix(0, ncol=5, nrow=1) # 12 years
colnames(moran)<-2014:2018
rownames(moran)<-"Moran’s I"
for(i in 2014:2018){ # loop uses natural numbers
x <- data_with_indicators %>%
pull(paste0('emp_', i))
result01<-moran.test(x, cont.listw)
moran[1,i-2013]<-result01$estimate[1]
}
moran
## 2014 2015 2016 2017 2018
## Moran’s I 0.4545975 0.4126622 0.2540824 0.2480099 0.2270632
plot(moran[1,], type="l", axes=FALSE, ylab="", xlab="", ylim=c(0.2, 0.5))
axis(1, at=1:5, labels=2014:2018)
axis(2)
points(moran[1,], bg="red", pch=21, cex=1.5)
abline(h=c(0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6), lty=3, lwd=3, col="grey80")
abline(h=0.30+(0:30)*0.01, lty=3, lwd=1, col="grey80")
text(1:5, moran[1,]+0.015, labels=round(moran[1,],2))
title(main="Moran’s I over years Unemployment rate in 2014-2018 in Poland")
We see that the spatial autocorrelation decreases across years. It means that the states where the values of unemployment rate are similar stay in contact more often than it would result from random distribution.
We begin with spatio-temporal correlation of unemployment rate.
cor.spatial(data_with_indicators$emp_2015,
data_with_indicators$emp_2014,
crds)
## [1] 0.2272781
## attr(,"variance")
## [1] 0.02063951
# full matrix of comparison of inter-temporal spatial correlations
cormat<-matrix(0, nrow=5, ncol=5)
for(i in 1:5){
for(j in 1:5){
x <- data_with_indicators %>%
pull(paste0('emp_', i+2013))
y <- data_with_indicators %>%
pull(paste0('emp_', j+2013))
cormat[i,j]<-cor.spatial(x, y, crds)[1]
}
}
diag(cormat)<-NA # for better plotting
par(mar=c(5,5,5,5))
plot(cormat, col=rev(heat.colors(8)), main=" Tjostheim's Coefficient for Unemployment rate", xlab="years 2014-2018", ylab="years 2014-2018")
Unemployment rate in following periods is correlated spatially. However, the values are rather low and the sign and magnitude changes over time.
In this section we choose a single year and estimate model for unemployment rate in 2015.
We use the variables that are reasonable enough to impact the unemployment rate as well as spatial lags of dependent variable.
sub <- data_with_indicators %>%
select(NAME, emp_Fips, emp_2015, salary_GeoFips, salary_2015, popest_2015, income_2015)
sub$y <- sub$emp_2015
sub$x1 <- sub$salary_2015
sub$x2 <- sub$popest_2015
sub$x3 <- sub$income_2015
model.lm<-lm(y~x1+x2+x3, data=sub) # option A
summary(model.lm)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.01939 -0.54681 -0.06561 0.59056 2.20225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.525e+00 1.090e+00 3.235 0.00228 **
## x1 -2.427e-06 8.058e-07 -3.012 0.00425 **
## x2 1.449e-06 4.700e-07 3.083 0.00349 **
## x3 3.055e-05 2.340e-05 1.306 0.19820
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9665 on 45 degrees of freedom
## Multiple R-squared: 0.2221, Adjusted R-squared: 0.1702
## F-statistic: 4.283 on 3 and 45 DF, p-value: 0.009641
bptest(model.lm)
##
## studentized Breusch-Pagan test
##
## data: model.lm
## BP = 9.15, df = 3, p-value = 0.02736
According to the Breusch-Pagan test, under H0, the test statistics follows the Breusch-Pagan test follows a chi-squared distribution with parameter corresponding to the degrees of freedom that reveal the number of regressors without the constant in the model. It also means that there is homoscedasticity in the data, which correspond to our case.
In case of the RESET test, which is popular diagnostic for correctness of functional form, we check whether functional form of the model is written correctly.
resettest(model.lm, power=2, type="regressor")
##
## RESET test
##
## data: model.lm
## RESET = 10.26, df1 = 3, df2 = 42, p-value = 3.414e-05
In our case we see that additional variables should be included in the model. Hence, the functional form of the suggested model is not correct.
summary(model.lm$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.01939 -0.54681 -0.06561 0.00000 0.59056 2.20225
res<-model.lm$residuals
brks<-c(min(res), mean(res)-sd(res), mean(res), mean(res)+sd(res), max(res))
cols<-c("steelblue4", "lightskyblue", "thistle1", "plum3")
plot(states, col=cols[findInterval(res,brks)])
title(main="Residuals in OLS model")
legend("bottomleft", legend=c("<mean-sd", "(mean-sd, mean)", "(mean, mean+sd)", ">mean+sd"), leglabs(brks1), fill=cols, bty="n")
Most of the residuals across regions are in the range of mean - std, mean + std. What is more we see pattern across residuals in the map.
Based on the analysis prepared up to this point we would like to use spatial models.
With Moran test we would like to justify using spatial methods in our problem.
lm.morantest(model.lm, cont.listw) # are residuals spatially random?
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = y ~ x1 + x2 + x3, data = sub)
## weights: cont.listw
##
## Moran I statistic standard deviate = 2.2408, p-value = 0.01252
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.182124609 -0.031178288 0.009061464
We reject null hypothesis that the unemployment rates follow completely random process across states. We continue modelling with spatial variables in the model. Hence, we use models of spatial dependence.
Followingly, we also check whether residuals are spatially random.
moran.test(res, cont.listw)
##
## Moran I test under randomisation
##
## data: res
## weights: cont.listw
##
## Moran I statistic standard deviate = 2.0976, p-value = 0.01797
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.18212461 -0.02083333 0.00936216
The residuals are not randomly distributed.
We begin with full specification of the model. However, we don’t include all the variables, only income.
GNS_1<-sacsarlm(y~x3, data=sub, listw=cont.listw, type="sacmixed", method="LU") # method="LU" speeds up computations
summary(GNS_1)
##
## Call:
## sacsarlm(formula = y ~ x3, data = sub, listw = cont.listw, type = "sacmixed",
## method = "LU")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.451675 -0.529835 0.045482 0.365522 1.565409
##
## Type: sacmixed
## Coefficients: (numerical Hessian approximate standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.6231e+00 1.1881e+00 1.3661 0.1719
## x3 6.2530e-06 1.7780e-05 0.3517 0.7251
## lag.x3 -1.8767e-05 2.3672e-05 -0.7928 0.4279
##
## Rho: 0.80224
## Approximate (numerical Hessian) standard error: 0.14472
## z-value: 5.5434, p-value: 2.9664e-08
## Lambda: -0.47879
## Approximate (numerical Hessian) standard error: 0.3475
## z-value: -1.3778, p-value: 0.16826
##
## LR test value: 15.537, p-value: 0.0014106
##
## Log likelihood: -63.92096 for sacmixed model
## ML residual variance (sigma squared): 0.60329, (sigma: 0.77672)
## Number of observations: 49
## Number of parameters estimated: 6
## AIC: 139.84, (AIC for lm: 149.38)
Neither income nor the spatial lag of this variable is statistically significant.
SAC_X1X3<-sacsarlm(y~x1+x3, data=sub, listw=cont.listw)
summary(SAC_X1X3)
##
## Call:sacsarlm(formula = y ~ x1 + x3, data = sub, listw = cont.listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2206552 -0.5023459 -0.0032648 0.3837180 1.6246842
##
## Type: sac
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.6407e-01 6.7342e-01 1.2831 0.19945
## x1 4.5649e-08 2.3684e-08 1.9274 0.05393
## x3 -5.2988e-06 9.5625e-06 -0.5541 0.57949
##
## Rho: 0.8485
## Asymptotic standard error: 0.083311
## z-value: 10.185, p-value: < 2.22e-16
## Lambda: -0.80129
## Asymptotic standard error: 0.23698
## z-value: -3.3813, p-value: 0.00072147
##
## LR test value: 14.538, p-value: 0.0006968
##
## Log likelihood: -63.20047 for sac model
## ML residual variance (sigma squared): 0.51053, (sigma: 0.71452)
## Number of observations: 49
## Number of parameters estimated: 6
## AIC: 138.4, (AIC for lm: 148.94)
In the model with two variables, salary is significant.
SEM_X1X2<-errorsarlm(y~x1+x2, data=sub, listw=cont.listw) # no spat-lags of X
summary(SEM_X1X2)
##
## Call:errorsarlm(formula = y ~ x1 + x2, data = sub, listw = cont.listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.524524 -0.541364 -0.056783 0.474705 2.333539
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.1622e+00 2.7978e-01 18.4512 <2e-16
## x1 -9.8630e-07 6.0746e-07 -1.6236 0.1045
## x2 5.7797e-07 3.5961e-07 1.6072 0.1080
##
## Lambda: 0.52715, LR test value: 6.2302, p-value: 0.012559
## Asymptotic standard error: 0.14061
## z-value: 3.7491, p-value: 0.0001775
## Wald statistic: 14.055, p-value: 0.0001775
##
## Log likelihood: -63.57061 for error model
## ML residual variance (sigma squared): 0.72551, (sigma: 0.85177)
## Number of observations: 49
## Number of parameters estimated: 5
## AIC: 137.14, (AIC for lm: 141.37)
Other combinations of variables where also considered but haven’t lead to an interesting outcome.
SDM_X1<-lagsarlm(y~x1, data=sub, listw=cont.listw, type="mixed") # with spatial lags of X
SDM_X3<-lagsarlm(y~x3, data=sub, listw=cont.listw, type="mixed") # with spatial lags of X
summary(SDM_X1)
##
## Call:lagsarlm(formula = y ~ x1, data = sub, listw = cont.listw, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.78339 -0.62940 -0.05425 0.45617 1.85108
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.2095e+00 7.4387e-01 2.9703 0.002975
## x1 3.0944e-08 2.8089e-08 1.1016 0.270616
## lag.x1 2.9059e-07 6.7343e-08 4.3150 1.596e-05
##
## Rho: 0.32368, LR test value: 3.3871, p-value: 0.065709
## Asymptotic standard error: 0.15961
## z-value: 2.028, p-value: 0.04256
## Wald statistic: 4.1128, p-value: 0.04256
##
## Log likelihood: -57.48939 for mixed model
## ML residual variance (sigma squared): 0.5957, (sigma: 0.77182)
## Number of observations: 49
## Number of parameters estimated: 5
## AIC: 124.98, (AIC for lm: 126.37)
## LM test for residual autocorrelation
## test value: 4.5157, p-value: 0.033586
summary(SDM_X3)
##
## Call:lagsarlm(formula = y ~ x3, data = sub, listw = cont.listw, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.646439 -0.548138 0.015253 0.442780 1.953080
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.0490e+00 1.4111e+00 2.1608 0.03071
## x3 4.9592e-06 1.8232e-05 0.2720 0.78562
## lag.x3 -2.2135e-05 2.7210e-05 -0.8135 0.41594
##
## Rho: 0.56643, LR test value: 12.497, p-value: 0.00040752
## Asymptotic standard error: 0.13294
## z-value: 4.2607, p-value: 2.0383e-05
## Wald statistic: 18.153, p-value: 2.0383e-05
##
## Log likelihood: -64.45874 for mixed model
## ML residual variance (sigma squared): 0.74173, (sigma: 0.86124)
## Number of observations: 49
## Number of parameters estimated: 5
## AIC: 138.92, (AIC for lm: 149.41)
## LM test for residual autocorrelation
## test value: 0.97567, p-value: 0.32327
The first significant variable appeared, spatial temporal lag of salary.
SAR_X1X3 <-lagsarlm(y ~ x1 + x3, data=sub, listw=cont.listw) # no spatial lags of X
summary(SAR_X1X3)
##
## Call:lagsarlm(formula = y ~ x1 + x3, data = sub, listw = cont.listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.646364 -0.532104 0.012359 0.499649 2.062333
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3224e+00 1.0281e+00 2.2588 0.02389
## x1 7.3838e-09 2.9730e-08 0.2484 0.80386
## x3 -3.2534e-06 1.5745e-05 -0.2066 0.83630
##
## Rho: 0.57393, LR test value: 11.431, p-value: 0.00072228
## Asymptotic standard error: 0.1318
## z-value: 4.3545, p-value: 1.3334e-05
## Wald statistic: 18.962, p-value: 1.3334e-05
##
## Log likelihood: -64.75397 for lag model
## ML residual variance (sigma squared): 0.74855, (sigma: 0.86519)
## Number of observations: 49
## Number of parameters estimated: 5
## AIC: 139.51, (AIC for lm: 148.94)
## LM test for residual autocorrelation
## test value: 1.4744, p-value: 0.22465
In the first model with two variables, we haven’t concluded significance of none of them.
This is the straightforward extension of linear regression model, with spatially lagged RHS variables.
SLX_1<-lmSLX(y~x1+x2+x3, data=sub, listw=cont.listw)
summary(SLX_1)
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17349 -0.50951 -0.02916 0.37220 1.59827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.906e-01 1.800e+00 0.106 0.9162
## x1 -1.145e-06 7.009e-07 -1.633 0.1099
## x2 6.805e-07 4.105e-07 1.658 0.1048
## x3 3.700e-05 2.076e-05 1.782 0.0820 .
## lag.x1 -3.289e-06 1.382e-06 -2.379 0.0220 *
## lag.x2 2.108e-06 8.067e-07 2.613 0.0124 *
## lag.x3 4.477e-05 3.721e-05 1.203 0.2357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.773 on 42 degrees of freedom
## Multiple R-squared: 0.5356, Adjusted R-squared: 0.4693
## F-statistic: 8.074 on 6 and 42 DF, p-value: 7.928e-06
At this step of the modelling, the model above has the highest number of significant variables.
OLS_1<-lm(y~x1+x2+x3, data=sub)
summary(OLS_1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.01939 -0.54681 -0.06561 0.59056 2.20225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.525e+00 1.090e+00 3.235 0.00228 **
## x1 -2.427e-06 8.058e-07 -3.012 0.00425 **
## x2 1.449e-06 4.700e-07 3.083 0.00349 **
## x3 3.055e-05 2.340e-05 1.306 0.19820
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9665 on 45 degrees of freedom
## Multiple R-squared: 0.2221, Adjusted R-squared: 0.1702
## F-statistic: 4.283 on 3 and 45 DF, p-value: 0.009641
screenreg(list(GNS_1, SAC_X1X3, SDM_X1, SDM_X3, SEM_X1X2, SLX_1, OLS_1),
custom.model.names=c("GNS_all", "SAC_salary_income", "SDM_salary", "SDM_income", "SEM_salary_incone", "SLX_all", "OLS_all"))
##
## =================================================================================================================
## GNS_all SAC_salary_income SDM_salary SDM_income SEM_salary_incone SLX_all OLS_all
## -----------------------------------------------------------------------------------------------------------------
## (Intercept) 1.62 0.86 2.21 ** 3.05 * 5.16 *** 0.19 3.53 **
## (1.19) (0.67) (0.74) (1.41) (0.28) (1.80) (1.09)
## x3 0.00 -0.00 0.00 0.00 0.00
## (0.00) (0.00) (0.00) (0.00) (0.00)
## lag.x3 -0.00 -0.00 0.00
## (0.00) (0.00) (0.00)
## rho 0.80 *** 0.85 *** 0.32 * 0.57 ***
## (0.14) (0.08) (0.16) (0.13)
## lambda -0.48 -0.80 *** 0.53 ***
## (0.35) (0.24) (0.14)
## x1 0.00 0.00 -0.00 -0.00 -0.00 **
## (0.00) (0.00) (0.00) (0.00) (0.00)
## lag.x1 0.00 *** -0.00 *
## (0.00) (0.00)
## x2 0.00 0.00 0.00 **
## (0.00) (0.00) (0.00)
## lag.x2 0.00 *
## (0.00)
## -----------------------------------------------------------------------------------------------------------------
## Num. obs. 49 49 49 49 49 49
## Parameters 6 6 5 5 5
## Log Likelihood -63.92 -63.20 -57.49 -64.46 -63.57 -53.13
## AIC (Linear model) 149.38 148.94 126.37 149.41 141.37
## AIC (Spatial model) 139.84 138.40 124.98 138.92 137.14
## LR test: statistic 15.54 14.54 3.39 12.50 6.23
## LR test: p-value 0.00 0.00 0.07 0.00 0.01
## R^2 0.54 0.22
## Adj. R^2 0.47 0.17
## Sigma 0.77
## Statistic 8.07
## P Value 0.00
## DF 6.00
## AIC 122.27
## BIC 137.40
## Deviance 25.10
## DF Resid. 42
## nobs 49
## =================================================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Generally lambda and rho are significant. Only in case of full model lambda is not significant. There exists remarkable impact of spatial lags to the outcome. The SDM model with income as explanatory variable is the best - has the lowest AIC.
The best model, Spatial Durbin Model with the distribution of total impacts has a problem with interpretation of coefficients in the model due to the inverse relation of locations. Hence, we need to interpet direct and indirect impacts in this model.
# distribution of total impact
# vector of traces of powers of a spatial weights matrix
# converting censored listw to CsparseMatrix form
W.c<-as(as_dgRMatrix_listw(cont.listw), "CsparseMatrix")
# the default values for the number of powers is 30
trMat<-trW(W.c, type="mult")
SDM_X3_imp<-impacts(SDM_X3, tr=trMat, R=2000)
summary(SDM_X3_imp, zstats=TRUE, short=TRUE)
## Impact measures (mixed, trace):
## Direct Indirect Total
## x3 1.270957e-06 -4.088685e-05 -3.961589e-05
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
## Direct Indirect Total
## x3 1.781526e-05 6.22435e-05 6.750714e-05
##
## Simulated z-values:
## Direct Indirect Total
## x3 0.05952286 -0.6947904 -0.6249084
##
## Simulated p-values:
## Direct Indirect Total
## x3 0.95254 0.48719 0.53203
# extracting direct & total impacts
a<-SDM_X3_imp$res$direct
b<-SDM_X3_imp$res$total
a/b # ratio of impacts
## x3
## -0.03208199
The explanatory variables income in location i affect y in location i - this is the direct effect. For the indirect effect, we state that that income variable affect unepmloyment rate in location j, which is its spillover. The difference between direct and indirect effects are negligible.
As soon as we add spatial lag model we need to rethink the interpretation of the parameters.
sub$y.st<-lag.listw(cont.listw, data_with_indicators$emp_2014)
sub$y.t<-data_with_indicators$emp_2014
model.lag1<-lagsarlm(y~x1+x2+x3+y.st, data=sub, cont.listw, tol.solve=3e-30)
summary(model.lag1) # spatio-temporal of y added
##
## Call:
## lagsarlm(formula = y ~ x1 + x2 + x3 + y.st, data = sub, listw = cont.listw,
## tol.solve = 3e-30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35056 -0.54896 0.04388 0.37783 1.91858
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.5694e-02 1.2513e+00 -0.0285 0.9772
## x1 -1.1304e-06 7.1550e-07 -1.5798 0.1141
## x2 6.5337e-07 4.2113e-07 1.5515 0.1208
## x3 2.2276e-05 1.9040e-05 1.1700 0.2420
## y.st 7.6096e-01 1.9130e-01 3.9779 6.953e-05
##
## Rho: -0.044165, LR test value: 0.02579, p-value: 0.87241
## Asymptotic standard error: 0.18348
## z-value: -0.24071, p-value: 0.80978
## Wald statistic: 0.057942, p-value: 0.80978
##
## Log likelihood: -57.55152 for lag model
## ML residual variance (sigma squared): 0.61306, (sigma: 0.78298)
## Number of observations: 49
## Number of parameters estimated: 7
## AIC: 129.1, (AIC for lm: 127.13)
## LM test for residual autocorrelation
## test value: 24.098, p-value: 9.1541e-07
model.lag2<-lagsarlm(y~x1+x2+x3+y.t, data=sub, cont.listw, tol.solve=3e-30)
summary(model.lag2) # temporal of y added
##
## Call:
## lagsarlm(formula = y ~ x1 + x2 + x3 + y.t, data = sub, listw = cont.listw,
## tol.solve = 3e-30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8058536 -0.1171344 -0.0086293 0.1447374 0.9481954
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.0784e-01 5.1212e-01 0.6011 0.5478
## x1 -3.7560e-07 2.9812e-07 -1.2599 0.2077
## x2 2.1109e-07 1.7532e-07 1.2040 0.2286
## x3 4.2628e-06 7.9887e-06 0.5336 0.5936
## y.t 7.6265e-01 4.8643e-02 15.6787 <2e-16
##
## Rho: 0.040094, LR test value: 0.19359, p-value: 0.65994
## Asymptotic standard error: 0.094134
## z-value: 0.42593, p-value: 0.67016
## Wald statistic: 0.18142, p-value: 0.67016
##
## Log likelihood: -14.40766 for lag model
## ML residual variance (sigma squared): 0.10538, (sigma: 0.32462)
## Number of observations: 49
## Number of parameters estimated: 7
## AIC: 42.815, (AIC for lm: 41.009)
## LM test for residual autocorrelation
## test value: 0.51329, p-value: 0.47372
model.error1<-errorsarlm(y~x1+x2+x3+y.st, data=sub, cont.listw, tol.solve=3e-30)
summary(model.error1) # spatio-temporal of y added
##
## Call:
## errorsarlm(formula = y ~ x1 + x2 + x3 + y.st, data = sub, listw = cont.listw,
## tol.solve = 3e-30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.311015 -0.249453 -0.065504 0.314045 1.698717
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.5355e-01 7.0199e-01 -0.9310 0.3519
## x1 -7.2458e-07 5.7790e-07 -1.2538 0.2099
## x2 4.2582e-07 3.3920e-07 1.2554 0.2093
## x3 1.1114e-05 1.4263e-05 0.7792 0.4358
## y.st 9.0326e-01 9.7648e-02 9.2502 <2e-16
##
## Lambda: -0.84839, LR test value: 11.579, p-value: 0.00066715
## Asymptotic standard error: 0.17573
## z-value: -4.8277, p-value: 1.3811e-06
## Wald statistic: 23.307, p-value: 1.3811e-06
##
## Log likelihood: -51.77511 for error model
## ML residual variance (sigma squared): 0.41109, (sigma: 0.64117)
## Number of observations: 49
## Number of parameters estimated: 7
## AIC: 117.55, (AIC for lm: 127.13)
model.error2<-errorsarlm(y~x1+x2+x3+y.t, data=sub, cont.listw, tol.solve=3e-30)
summary(model.error2) # temporal of y added
##
## Call:
## errorsarlm(formula = y ~ x1 + x2 + x3 + y.t, data = sub, listw = cont.listw,
## tol.solve = 3e-30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.793942 -0.128052 -0.024285 0.137532 0.960226
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.2008e-01 4.1789e-01 0.7659 0.4437
## x1 -4.1479e-07 2.8104e-07 -1.4759 0.1400
## x2 2.3312e-07 1.6491e-07 1.4136 0.1575
## x3 6.7259e-06 7.9786e-06 0.8430 0.3992
## y.t 7.7674e-01 4.2881e-02 18.1140 <2e-16
##
## Lambda: 0.22778, LR test value: 0.99121, p-value: 0.31945
## Asymptotic standard error: 0.18304
## z-value: 1.2444, p-value: 0.21336
## Wald statistic: 1.5485, p-value: 0.21336
##
## Log likelihood: -14.00885 for error model
## ML residual variance (sigma squared): 0.1024, (sigma: 0.32)
## Number of observations: 49
## Number of parameters estimated: 7
## AIC: 42.018, (AIC for lm: 41.009)
screenreg(list(model.lag1, model.lag2, model.error1, model.error2),
custom.model.names=c("dynamic lag st", "dynamic lag t", "dynamic error st", "dynamic error t"))
##
## =====================================================================================
## dynamic lag st dynamic lag t dynamic error st dynamic error t
## -------------------------------------------------------------------------------------
## (Intercept) -0.04 0.31 -0.65 0.32
## (1.25) (0.51) (0.70) (0.42)
## x1 -0.00 -0.00 -0.00 -0.00
## (0.00) (0.00) (0.00) (0.00)
## x2 0.00 0.00 0.00 0.00
## (0.00) (0.00) (0.00) (0.00)
## x3 0.00 0.00 0.00 0.00
## (0.00) (0.00) (0.00) (0.00)
## y.st 0.76 *** 0.90 ***
## (0.19) (0.10)
## rho -0.04 0.04
## (0.18) (0.09)
## y.t 0.76 *** 0.78 ***
## (0.05) (0.04)
## lambda -0.85 *** 0.23
## (0.18) (0.18)
## -------------------------------------------------------------------------------------
## Num. obs. 49 49 49 49
## Parameters 7 7 7 7
## Log Likelihood -57.55 -14.41 -51.78 -14.01
## AIC (Linear model) 127.13 41.01 127.13 41.01
## AIC (Spatial model) 129.10 42.82 117.55 42.02
## LR test: statistic 0.03 0.19 11.58 0.99
## LR test: p-value 0.87 0.66 0.00 0.32
## =====================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
Across models we see that spatial lag and spatio-temporal lag of y are significant.
moran.test(model.lag2$residuals, cont.listw)
##
## Moran I test under randomisation
##
## data: model.lag2$residuals
## weights: cont.listw
##
## Moran I statistic standard deviate = 0.88956, p-value = 0.1869
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.063885964 -0.020833333 0.009070217
moran.test(model.error2$residuals, cont.listw)
##
## Moran I test under randomisation
##
## data: model.error2$residuals
## weights: cont.listw
##
## Moran I statistic standard deviate = 0.26657, p-value = 0.3949
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.004509688 -0.020833333 0.009038157
In this step on analysis we show spatial drift models with Geographically Weighted Regression.
With Geographically Weighted Regression (GWR) method we obtain coefficients for each observation. The similarity in the space of these parameters is not considered.
We include spatial lag of dependent variable
sub$y.t <- data_with_indicators$emp_2014
We use two methods in this procedure, one relates to estimation of GWR coefficients - ggwr and the other one to the optimization of the bandwidth and simultaneously minimization of RMSE (root mean square prediction error).
eq<-y~x1+x2+x3+y.t
bw<-ggwr.sel(eq, data=sub, coords=crds, family=poisson(), longlat=TRUE, verbose=FALSE)
model.ggwr<-ggwr(eq, data=sub, coords=crds, family=poisson(), longlat=TRUE, bandwidth=bw)
model.ggwr
## Call:
## ggwr(formula = eq, data = sub, coords = crds, bandwidth = bw,
## family = poisson(), longlat = TRUE)
## Kernel function: gwr.Gauss
## Fixed bandwidth: 4860.424
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max. Global
## X.Intercept. 6.6847e-01 6.7199e-01 6.7292e-01 6.7354e-01 6.7566e-01 0.6725
## x1 -6.8026e-08 -6.1030e-08 -5.7929e-08 -5.4981e-08 -5.2413e-08 0.0000
## x2 2.9951e-08 3.1374e-08 3.3020e-08 3.4761e-08 3.8710e-08 0.0000
## x3 -9.8139e-08 -8.5443e-08 -4.2495e-08 3.3142e-08 1.5014e-07 0.0000
## y.t 1.6109e-01 1.6146e-01 1.6163e-01 1.6178e-01 1.6192e-01 0.1614
choropleth(states, model.ggwr$SDF$x1, main="GWR coefficients of salary")
choropleth(states, model.ggwr$SDF$x2, main="GWR coefficients of popest")
choropleth(states, model.ggwr$SDF$x3, main="GWR coefficients of income")
choropleth(states, model.ggwr$SDF$y.t, main="GWR coefficients of spatial lag of dependent variable")
There is no split across GWR coefficients of spatial lag of dependent variable.
boxplot(as.data.frame(model.ggwr$SDF)[,3:6])
Lack of variability is visible for coefficients is observed. There is a single coefficient y.t that is different than zero - y.t.
Based on obtained results we would like to obtain the optimal number of clusters.
fviz_nbclust(as.data.frame(model.ggwr$SDF[,2:6]), FUNcluster=kmeans)
The optimal number of clusters is two. Although this number appears at this stage of analysis very frequently in general, we decide to proceed the analysis with 2 clusters.
clusters<-eclust(as.data.frame(model.ggwr$SDF[,2:6]), "kmeans", k=2)
choropleth(states, clusters$cluster, main="5 clusters, result from eclust()")
text(clusters$centers[,6:7], labels=1:2, cex=2, col="green")
We assess well obtained clustering, there are non-overlapping clusters. The United States were split into western and eastern part.
fviz_silhouette(clusters)
## cluster size ave.sil.width
## 1 1 18 0.49
## 2 2 31 0.58
Thanks to the plot above we are sure that the quality of the clustering is good enough. The average silhouette is similar among two clusters. However, we spot significant difference in cluster size.
In the next step we estimate the model with dummy variable that to which cluster each state belongs.
sub$clust1<-rep(0, times=dim(sub)[1])
sub$clust1[clusters$cluster==1]<-1
sub$clust2<-rep(0, times=dim(sub)[1])
sub$clust2[clusters$cluster==2]<-1
In the new equation we use only the single variable that resulted from clustering.
eq1<-y~x1+x3+y.t+clust1
model.sem<-errorsarlm(eq1, data=sub, listw=cont.listw)
summary(model.sem, Nagelkerke=TRUE)
##
## Call:errorsarlm(formula = eq1, data = sub, listw = cont.listw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7051124 -0.1480112 -0.0053714 0.1450163 0.9758353
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.6908e-01 4.1562e-01 0.8880 0.3745
## x1 -1.7966e-08 1.1538e-08 -1.5571 0.1195
## x3 -1.0382e-06 6.1934e-06 -0.1676 0.8669
## y.t 8.1961e-01 4.3431e-02 18.8717 <2e-16
## clust1 1.5734e-01 1.1706e-01 1.3440 0.1789
##
## Lambda: 0.12526, LR test value: 0.29082, p-value: 0.58969
## Asymptotic standard error: 0.19281
## z-value: 0.64966, p-value: 0.51591
## Wald statistic: 0.42206, p-value: 0.51591
##
## Log likelihood: -14.15615 for error model
## ML residual variance (sigma squared): 0.10395, (sigma: 0.32242)
## Nagelkerke pseudo-R-squared: 0.90539
## Number of observations: 49
## Number of parameters estimated: 7
## AIC: 42.312, (AIC for lm: 40.603)
In the model above cluster variable is not-significant. The unemployment rate for states that belong to the first cluster is higher than in the second one. It means that based on this model the unemployment rate is higher in the west part of U.S.A. than eastern part of considered country.
In the report we have incorporated the spatial econometrics techniques in order to estimate unemployment rate in 2015 year. We have also incorporated significant number of visualizations in order to understand the problem better. Thanks to the other explanatory variables we were able to use more intricate techniques. We find the combination of clustering and spatial econometrics highly insightful.