Introduction to the problem

In this research website post we analyze the economic interdependencies between several economic factors across the United States. We have chosen single state as the spatial unit. We analyze the data between 2014 and 2018.

Visualization and mapping of data

First of all we show what data we handle in the project. Then we map the data and convert it into single dataframe.

Visualization of analyzed dataset

We exclude Alaska, Hawaii and Puerto Rico from the analysis. We focus on 49 states.

states<-readOGR("data", "cb_2021_us_state_20m", verbose=FALSE)
states <- states[!(states@data$NAME %in% c("Alaska", "Hawaii", "Puerto Rico")),]
states.df <- as.data.frame(states)
plot(states)
crds<-coordinates(states)
pointLabel(crds, as.character(states.df$NAME), cex=0.6)

Merging polygon data and regional statedata

In order to perform more intricate modelling procedure we read additional datasets such as unemployment rate, salary, estimation of population or income in the U.S.A. We use the website (“Iowa State University” 2022) to gather relevant data revealing economic indicators in the United States across analyzed period.

emp <- read_excel("data\\emp-unemployment.xls",
                  sheet = "States", skip = 5, n_max=52)
names(emp) <- paste0('emp_', emp %>% names())
salary <- read_excel("data\\emp-salary.xlsx", skip = 5, n_max=66)
names(salary) <- paste0('salary_', salary %>% names())
popest <- read_excel("data\\popest-annual.xlsx",
                     sheet = "States", skip = 6, n_max=63)
names(popest) <- paste0('popest_', popest %>% names())
income <- read_excel("data\\income.xlsx", skip = 5, n_max=66)
names(income) <- paste0('income_', income %>% names())

We merge the spatial polygons data with economic indicators.

data_with_indicators <- states@data %>%
  select(NAME) %>%
  inner_join(emp, by=c("NAME" = "emp_Area")) %>%
  inner_join(salary, by=c("NAME" = "salary_GeoName")) %>%
  inner_join(popest, by=c("NAME" = "popest_Area")) %>%
  inner_join(income, by=c("NAME" = "income_GeoName"))

Visualization of unemployment in 2015

Based on the obtained data we present the choropleth graphics.

x <- data_with_indicators %>% select(emp_2015) %>% pull()
shades<-auto.shading(x)
choropleth(states, x, shades)
pointLabel(crds, as.character(states.df$NAME), cex=0.6)
choro.legend(14, 50.5, shades, bty="n")
title(main="Unemployment rate in USA across states in 2015")
choro.legend(px='bottomleft', sh=shades,fmt="%4.1f",cex=0.8)

The north of the USA seems as the region with the highest unemployment rate.

variable<-data_with_indicators %>% select(salary_2015) %>% pull()

summary(variable)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   406033  1092255  2587451  3856388  4631417 22687196

intervals<-8

colors<-brewer.pal(intervals, "BuPu") # choice of colors

classes<-classIntervals(variable, intervals, style="equal")

color.table<-findColours(classes, colors)

plot(states, col=color.table)

legend("bottomleft",
       legend=names(attr(color.table, "table")),
       fill=attr(color.table, "palette"), cex=0.8, bty="n")

title(main="Salary in USA across states in 2015")

The highest salary has been observed in California, Texas and New York states.

Spatial weights matrix

Based on the spatial weights matrix W we can compute the spatial lag. Spatial lag is the average value in the neighbourhood. If we want to measure externalities we need to include it in the process.

cont.nb<-poly2nb(as(states, "SpatialPolygons"))
cont.listw<-nb2listw(cont.nb, style="W")
states.knn<-knearneigh(crds, k=4)
states.nb<-knn2nb(states.knn)
dist<-nbdists(states.nb, crds)
dist1<-lapply(dist, function(x) 1/x)
dist.listw<-nb2listw(states.nb, glist=dist1)

Spatial lags and higher order neighbourhood

states$lagg<-lag.listw(cont.listw, data_with_indicators$salary_2015)
summary(states$lagg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  636443 2033244 4033102 3822437 5229178 7375118

spplot(states, "lagg")          # first order lag

We show spatial lag of the salary variable.

Spatial statistics

In the previous figure we have spotted clusters of high and low values. Hence, we calculate Moran’s statistics.

moran.test(data_with_indicators$emp_2015, cont.listw)

## 
##  Moran I test under randomisation
## 
## data:  data_with_indicators$emp_2015  
## weights: cont.listw    
## 
## Moran I statistic standard deviate = 4.4519, p-value = 4.257e-06
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.412662247      -0.020833333       0.009481734

We reject the null hypothesis that states that there is no spatial autocorrelation, based on p-value that is smaller than 0.05.

result02<-moran.test(data_with_indicators$emp_2015, dist.listw)
result02

## 
##  Moran I test under randomisation
## 
## data:  data_with_indicators$emp_2015  
## weights: dist.listw    
## 
## Moran I statistic standard deviate = 4.4167, p-value = 5.011e-06
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.413167971      -0.020833333       0.009655714

We obtain similar results for inverse distance matrix.

Moran scatterplot

x<-data_with_indicators$emp_2015 # variable selection
zx<-as.data.frame(scale(x))  #standardization of variable
moran.plot(zx$V1, cont.listw, pch=19, labels=as.character(data_with_indicators$NAME))

round(mean(zx$V1),0)

## [1] 0

sd(zx$V1)

## [1] 1

wzx<-lag.listw(cont.listw, zx$V1) # spatial lag of x
morlm<-lm(wzx~zx$V1) # linear regression
summary(morlm)

## 
## Call:
## lm(formula = wzx ~ zx$V1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2292 -0.3301 -0.0055  0.3915  1.0723 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.07244    0.07857  -0.922    0.361    
## zx$V1        0.41266    0.07939   5.198 4.29e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.55 on 47 degrees of freedom
## Multiple R-squared:  0.365,  Adjusted R-squared:  0.3515 
## F-statistic: 27.02 on 1 and 47 DF,  p-value: 4.291e-06

slope<-morlm$coefficients[2] # coefficient from regression
intercept<-morlm$coefficients[1] # constant term in regression
plot(zx$V1, wzx, xlab="zx",ylab="spatial lag of zx", pch="*")
abline(intercept, slope)  # regression line
abline(h=0, lty=2) # supplementary horizontal line y=0
abline(v=0, lty=2) # supplementary vertical line x=0

Mapping of Moran’scatterplots quadrants

# we have x, wx & wzx from above
cond1<-ifelse(zx>=0 & wzx>=0, 1,0)  # I quarter
cond2<-ifelse(zx>=0 & wzx<0, 2,0)  # II quarter
cond3<-ifelse(zx<0 & wzx<0, 3,0)  # III quarter
cond4<-ifelse(zx<0 & wzx>=0, 4,0)  # IV quarter
cond.all<-cond1+cond2+cond3+cond4 # all quarters in one
cond<-as.data.frame(cond.all)
brks<-c(1,2,3,4)
cols<-c("grey25", "grey60", "grey85", "grey45")
plot(states, col=cols[findInterval(cond$V1, brks)])
legend("bottomleft", legend=c("I Q - HH – high surrounded by high", "II Q - LH - low surrounded by high", "III Q - LL - low surrounded by low", "IV Q - HL - high surrounded by low"), fill=cols, bty="n", cex=0.80)
title(main="Mapping of Moranscatterplot results unemployment rate in 2015")

The clusters of observations have been observed in the figure above. It is an indicator of the high regional interdependencies.

Moran’s I over years

Unemployment rate in 2014-2018 in the United States.

moran<-matrix(0, ncol=5, nrow=1) # 12 years

colnames(moran)<-2014:2018

rownames(moran)<-"Moran’s I"


for(i in 2014:2018){  # loop uses natural numbers
  x <- data_with_indicators %>%
    pull(paste0('emp_', i))
  result01<-moran.test(x, cont.listw)
  moran[1,i-2013]<-result01$estimate[1]
}

moran

##                2014      2015      2016      2017      2018
## Moran’s I 0.4545975 0.4126622 0.2540824 0.2480099 0.2270632

plot(moran[1,], type="l", axes=FALSE, ylab="", xlab="", ylim=c(0.2, 0.5))

axis(1, at=1:5, labels=2014:2018)

axis(2)

points(moran[1,], bg="red", pch=21, cex=1.5)

abline(h=c(0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6), lty=3, lwd=3, col="grey80")

abline(h=0.30+(0:30)*0.01, lty=3, lwd=1, col="grey80")

text(1:5, moran[1,]+0.015, labels=round(moran[1,],2))

title(main="Moran’s I over years Unemployment rate in 2014-2018 in Poland")

We see that the spatial autocorrelation decreases across years. It means that the states where the values of unemployment rate are similar stay in contact more often than it would result from random distribution.

Spatial association of two variables

We begin with spatio-temporal correlation of unemployment rate.

cor.spatial(data_with_indicators$emp_2015,
            data_with_indicators$emp_2014,
            crds)

## [1] 0.2272781
## attr(,"variance")
## [1] 0.02063951

# full matrix of comparison of inter-temporal spatial correlations

cormat<-matrix(0, nrow=5, ncol=5)

for(i in 1:5){
  for(j in 1:5){
    x <- data_with_indicators %>%
      pull(paste0('emp_', i+2013))
    y <- data_with_indicators %>%
      pull(paste0('emp_', j+2013))
    cormat[i,j]<-cor.spatial(x, y, crds)[1]
  }
}
diag(cormat)<-NA    # for better plotting

par(mar=c(5,5,5,5))

plot(cormat, col=rev(heat.colors(8)), main=" Tjostheim's Coefficient for Unemployment rate", xlab="years 2014-2018", ylab="years 2014-2018")

Unemployment rate in following periods is correlated spatially. However, the values are rather low and the sign and magnitude changes over time.

Spatial dependence

In this section we choose a single year and estimate model for unemployment rate in 2015.

Basic model

We use the variables that are reasonable enough to impact the unemployment rate as well as spatial lags of dependent variable.

sub <- data_with_indicators %>%
  select(NAME, emp_Fips, emp_2015, salary_GeoFips, salary_2015, popest_2015, income_2015)
sub$y <- sub$emp_2015
sub$x1 <- sub$salary_2015
sub$x2 <- sub$popest_2015
sub$x3 <- sub$income_2015
model.lm<-lm(y~x1+x2+x3, data=sub) # option A
summary(model.lm)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01939 -0.54681 -0.06561  0.59056  2.20225 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.525e+00  1.090e+00   3.235  0.00228 **
## x1          -2.427e-06  8.058e-07  -3.012  0.00425 **
## x2           1.449e-06  4.700e-07   3.083  0.00349 **
## x3           3.055e-05  2.340e-05   1.306  0.19820   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9665 on 45 degrees of freedom
## Multiple R-squared:  0.2221, Adjusted R-squared:  0.1702 
## F-statistic: 4.283 on 3 and 45 DF,  p-value: 0.009641

Diagnostics of linear model

bptest(model.lm)

## 
##  studentized Breusch-Pagan test
## 
## data:  model.lm
## BP = 9.15, df = 3, p-value = 0.02736

According to the Breusch-Pagan test, under H0, the test statistics follows the Breusch-Pagan test follows a chi-squared distribution with parameter corresponding to the degrees of freedom that reveal the number of regressors without the constant in the model. It also means that there is homoscedasticity in the data, which correspond to our case.

Ramsey’s test for functional form

In case of the RESET test, which is popular diagnostic for correctness of functional form, we check whether functional form of the model is written correctly.

resettest(model.lm, power=2, type="regressor")

## 
##  RESET test
## 
## data:  model.lm
## RESET = 10.26, df1 = 3, df2 = 42, p-value = 3.414e-05

In our case we see that additional variables should be included in the model. Hence, the functional form of the suggested model is not correct.

Spatial distribution of OLS residuals

summary(model.lm$residuals)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.01939 -0.54681 -0.06561  0.00000  0.59056  2.20225

res<-model.lm$residuals

brks<-c(min(res), mean(res)-sd(res), mean(res), mean(res)+sd(res), max(res))

cols<-c("steelblue4", "lightskyblue", "thistle1", "plum3")

plot(states, col=cols[findInterval(res,brks)])

title(main="Residuals in OLS model")

legend("bottomleft", legend=c("<mean-sd", "(mean-sd, mean)", "(mean, mean+sd)", ">mean+sd"), leglabs(brks1), fill=cols, bty="n")

Most of the residuals across regions are in the range of mean - std, mean + std. What is more we see pattern across residuals in the map.

Estimation of spatial models

Based on the analysis prepared up to this point we would like to use spatial models.

Moran test

With Moran test we would like to justify using spatial methods in our problem.

lm.morantest(model.lm, cont.listw) # are residuals spatially random?

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = y ~ x1 + x2 + x3, data = sub)
## weights: cont.listw
## 
## Moran I statistic standard deviate = 2.2408, p-value = 0.01252
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.182124609     -0.031178288      0.009061464

We reject null hypothesis that the unemployment rates follow completely random process across states. We continue modelling with spatial variables in the model. Hence, we use models of spatial dependence.

Followingly, we also check whether residuals are spatially random.

moran.test(res, cont.listw)

## 
##  Moran I test under randomisation
## 
## data:  res  
## weights: cont.listw    
## 
## Moran I statistic standard deviate = 2.0976, p-value = 0.01797
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.18212461       -0.02083333        0.00936216

The residuals are not randomly distributed.

Manski model (full specification)

We begin with full specification of the model. However, we don’t include all the variables, only income.

GNS_1<-sacsarlm(y~x3, data=sub, listw=cont.listw, type="sacmixed", method="LU")  # method="LU" speeds up computations

summary(GNS_1)

## 
## Call:
## sacsarlm(formula = y ~ x3, data = sub, listw = cont.listw, type = "sacmixed", 
##     method = "LU")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.451675 -0.529835  0.045482  0.365522  1.565409 
## 
## Type: sacmixed 
## Coefficients: (numerical Hessian approximate standard errors) 
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  1.6231e+00  1.1881e+00  1.3661   0.1719
## x3           6.2530e-06  1.7780e-05  0.3517   0.7251
## lag.x3      -1.8767e-05  2.3672e-05 -0.7928   0.4279
## 
## Rho: 0.80224
## Approximate (numerical Hessian) standard error: 0.14472
##     z-value: 5.5434, p-value: 2.9664e-08
## Lambda: -0.47879
## Approximate (numerical Hessian) standard error: 0.3475
##     z-value: -1.3778, p-value: 0.16826
## 
## LR test value: 15.537, p-value: 0.0014106
## 
## Log likelihood: -63.92096 for sacmixed model
## ML residual variance (sigma squared): 0.60329, (sigma: 0.77672)
## Number of observations: 49 
## Number of parameters estimated: 6 
## AIC: 139.84, (AIC for lm: 149.38)

Neither income nor the spatial lag of this variable is statistically significant.

Kelejian-Prucha model

SAC_X1X3<-sacsarlm(y~x1+x3, data=sub, listw=cont.listw)
summary(SAC_X1X3)

## 
## Call:sacsarlm(formula = y ~ x1 + x3, data = sub, listw = cont.listw)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.2206552 -0.5023459 -0.0032648  0.3837180  1.6246842 
## 
## Type: sac 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  8.6407e-01  6.7342e-01  1.2831  0.19945
## x1           4.5649e-08  2.3684e-08  1.9274  0.05393
## x3          -5.2988e-06  9.5625e-06 -0.5541  0.57949
## 
## Rho: 0.8485
## Asymptotic standard error: 0.083311
##     z-value: 10.185, p-value: < 2.22e-16
## Lambda: -0.80129
## Asymptotic standard error: 0.23698
##     z-value: -3.3813, p-value: 0.00072147
## 
## LR test value: 14.538, p-value: 0.0006968
## 
## Log likelihood: -63.20047 for sac model
## ML residual variance (sigma squared): 0.51053, (sigma: 0.71452)
## Number of observations: 49 
## Number of parameters estimated: 6 
## AIC: 138.4, (AIC for lm: 148.94)

In the model with two variables, salary is significant.

Spatial Error Model

SEM_X1X2<-errorsarlm(y~x1+x2, data=sub, listw=cont.listw) # no spat-lags of X
summary(SEM_X1X2)

## 
## Call:errorsarlm(formula = y ~ x1 + x2, data = sub, listw = cont.listw)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.524524 -0.541364 -0.056783  0.474705  2.333539 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  5.1622e+00  2.7978e-01 18.4512   <2e-16
## x1          -9.8630e-07  6.0746e-07 -1.6236   0.1045
## x2           5.7797e-07  3.5961e-07  1.6072   0.1080
## 
## Lambda: 0.52715, LR test value: 6.2302, p-value: 0.012559
## Asymptotic standard error: 0.14061
##     z-value: 3.7491, p-value: 0.0001775
## Wald statistic: 14.055, p-value: 0.0001775
## 
## Log likelihood: -63.57061 for error model
## ML residual variance (sigma squared): 0.72551, (sigma: 0.85177)
## Number of observations: 49 
## Number of parameters estimated: 5 
## AIC: 137.14, (AIC for lm: 141.37)

Other combinations of variables where also considered but haven’t lead to an interesting outcome.

Spatial Lag Model

SDM_X1<-lagsarlm(y~x1, data=sub, listw=cont.listw, type="mixed") # with spatial lags of X
SDM_X3<-lagsarlm(y~x3, data=sub, listw=cont.listw, type="mixed") # with spatial lags of X
summary(SDM_X1)

## 
## Call:lagsarlm(formula = y ~ x1, data = sub, listw = cont.listw, type = "mixed")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78339 -0.62940 -0.05425  0.45617  1.85108 
## 
## Type: mixed 
## Coefficients: (asymptotic standard errors) 
##               Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 2.2095e+00 7.4387e-01  2.9703  0.002975
## x1          3.0944e-08 2.8089e-08  1.1016  0.270616
## lag.x1      2.9059e-07 6.7343e-08  4.3150 1.596e-05
## 
## Rho: 0.32368, LR test value: 3.3871, p-value: 0.065709
## Asymptotic standard error: 0.15961
##     z-value: 2.028, p-value: 0.04256
## Wald statistic: 4.1128, p-value: 0.04256
## 
## Log likelihood: -57.48939 for mixed model
## ML residual variance (sigma squared): 0.5957, (sigma: 0.77182)
## Number of observations: 49 
## Number of parameters estimated: 5 
## AIC: 124.98, (AIC for lm: 126.37)
## LM test for residual autocorrelation
## test value: 4.5157, p-value: 0.033586

summary(SDM_X3)

## 
## Call:lagsarlm(formula = y ~ x3, data = sub, listw = cont.listw, type = "mixed")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.646439 -0.548138  0.015253  0.442780  1.953080 
## 
## Type: mixed 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  3.0490e+00  1.4111e+00  2.1608  0.03071
## x3           4.9592e-06  1.8232e-05  0.2720  0.78562
## lag.x3      -2.2135e-05  2.7210e-05 -0.8135  0.41594
## 
## Rho: 0.56643, LR test value: 12.497, p-value: 0.00040752
## Asymptotic standard error: 0.13294
##     z-value: 4.2607, p-value: 2.0383e-05
## Wald statistic: 18.153, p-value: 2.0383e-05
## 
## Log likelihood: -64.45874 for mixed model
## ML residual variance (sigma squared): 0.74173, (sigma: 0.86124)
## Number of observations: 49 
## Number of parameters estimated: 5 
## AIC: 138.92, (AIC for lm: 149.41)
## LM test for residual autocorrelation
## test value: 0.97567, p-value: 0.32327

The first significant variable appeared, spatial temporal lag of salary.

Spatial Durbin Model

SAR_X1X3 <-lagsarlm(y ~ x1 + x3, data=sub, listw=cont.listw) # no spatial lags of X
summary(SAR_X1X3)

## 
## Call:lagsarlm(formula = y ~ x1 + x3, data = sub, listw = cont.listw)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.646364 -0.532104  0.012359  0.499649  2.062333 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  2.3224e+00  1.0281e+00  2.2588  0.02389
## x1           7.3838e-09  2.9730e-08  0.2484  0.80386
## x3          -3.2534e-06  1.5745e-05 -0.2066  0.83630
## 
## Rho: 0.57393, LR test value: 11.431, p-value: 0.00072228
## Asymptotic standard error: 0.1318
##     z-value: 4.3545, p-value: 1.3334e-05
## Wald statistic: 18.962, p-value: 1.3334e-05
## 
## Log likelihood: -64.75397 for lag model
## ML residual variance (sigma squared): 0.74855, (sigma: 0.86519)
## Number of observations: 49 
## Number of parameters estimated: 5 
## AIC: 139.51, (AIC for lm: 148.94)
## LM test for residual autocorrelation
## test value: 1.4744, p-value: 0.22465

In the first model with two variables, we haven’t concluded significance of none of them.

SLX model

This is the straightforward extension of linear regression model, with spatially lagged RHS variables.

SLX_1<-lmSLX(y~x1+x2+x3, data=sub, listw=cont.listw)
summary(SLX_1)

## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17349 -0.50951 -0.02916  0.37220  1.59827 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.906e-01  1.800e+00   0.106   0.9162  
## x1          -1.145e-06  7.009e-07  -1.633   0.1099  
## x2           6.805e-07  4.105e-07   1.658   0.1048  
## x3           3.700e-05  2.076e-05   1.782   0.0820 .
## lag.x1      -3.289e-06  1.382e-06  -2.379   0.0220 *
## lag.x2       2.108e-06  8.067e-07   2.613   0.0124 *
## lag.x3       4.477e-05  3.721e-05   1.203   0.2357  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.773 on 42 degrees of freedom
## Multiple R-squared:  0.5356, Adjusted R-squared:  0.4693 
## F-statistic: 8.074 on 6 and 42 DF,  p-value: 7.928e-06

At this step of the modelling, the model above has the highest number of significant variables.

OLS

OLS_1<-lm(y~x1+x2+x3, data=sub)

summary(OLS_1)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01939 -0.54681 -0.06561  0.59056  2.20225 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.525e+00  1.090e+00   3.235  0.00228 **
## x1          -2.427e-06  8.058e-07  -3.012  0.00425 **
## x2           1.449e-06  4.700e-07   3.083  0.00349 **
## x3           3.055e-05  2.340e-05   1.306  0.19820   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9665 on 45 degrees of freedom
## Multiple R-squared:  0.2221, Adjusted R-squared:  0.1702 
## F-statistic: 4.283 on 3 and 45 DF,  p-value: 0.009641

Comparison of the models in form of the table

screenreg(list(GNS_1, SAC_X1X3, SDM_X1, SDM_X3, SEM_X1X2, SLX_1, OLS_1),
          custom.model.names=c("GNS_all", "SAC_salary_income", "SDM_salary", "SDM_income", "SEM_salary_incone", "SLX_all", "OLS_all"))

## 
## =================================================================================================================
##                      GNS_all     SAC_salary_income  SDM_salary  SDM_income  SEM_salary_incone  SLX_all   OLS_all 
## -----------------------------------------------------------------------------------------------------------------
## (Intercept)            1.62        0.86               2.21 **     3.05 *      5.16 ***           0.19     3.53 **
##                       (1.19)      (0.67)             (0.74)      (1.41)      (0.28)             (1.80)   (1.09)  
## x3                     0.00       -0.00                           0.00                           0.00     0.00   
##                       (0.00)      (0.00)                         (0.00)                         (0.00)   (0.00)  
## lag.x3                -0.00                                      -0.00                           0.00            
##                       (0.00)                                     (0.00)                         (0.00)           
## rho                    0.80 ***    0.85 ***           0.32 *      0.57 ***                                       
##                       (0.14)      (0.08)             (0.16)      (0.13)                                          
## lambda                -0.48       -0.80 ***                                   0.53 ***                           
##                       (0.35)      (0.24)                                     (0.14)                              
## x1                                 0.00               0.00                   -0.00              -0.00    -0.00 **
##                                   (0.00)             (0.00)                  (0.00)             (0.00)   (0.00)  
## lag.x1                                                0.00 ***                                  -0.00 *          
##                                                      (0.00)                                     (0.00)           
## x2                                                                            0.00               0.00     0.00 **
##                                                                              (0.00)             (0.00)   (0.00)  
## lag.x2                                                                                           0.00 *          
##                                                                                                 (0.00)           
## -----------------------------------------------------------------------------------------------------------------
## Num. obs.             49          49                 49          49          49                          49      
## Parameters             6           6                  5           5           5                                  
## Log Likelihood       -63.92      -63.20             -57.49      -64.46      -63.57             -53.13            
## AIC (Linear model)   149.38      148.94             126.37      149.41      141.37                               
## AIC (Spatial model)  139.84      138.40             124.98      138.92      137.14                               
## LR test: statistic    15.54       14.54               3.39       12.50        6.23                               
## LR test: p-value       0.00        0.00               0.07        0.00        0.01                               
## R^2                                                                                              0.54     0.22   
## Adj. R^2                                                                                         0.47     0.17   
## Sigma                                                                                            0.77            
## Statistic                                                                                        8.07            
## P Value                                                                                          0.00            
## DF                                                                                               6.00            
## AIC                                                                                            122.27            
## BIC                                                                                            137.40            
## Deviance                                                                                        25.10            
## DF Resid.                                                                                       42               
## nobs                                                                                            49               
## =================================================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Generally lambda and rho are significant. Only in case of full model lambda is not significant. There exists remarkable impact of spatial lags to the outcome. The SDM model with income as explanatory variable is the best - has the lowest AIC.

Direct and indirect impacts

The best model, Spatial Durbin Model with the distribution of total impacts has a problem with interpretation of coefficients in the model due to the inverse relation of locations. Hence, we need to interpet direct and indirect impacts in this model.

# distribution of total impact
# vector of traces of powers of a spatial weights matrix
# converting censored listw to CsparseMatrix form
W.c<-as(as_dgRMatrix_listw(cont.listw), "CsparseMatrix")
# the default values for the number of powers is 30
trMat<-trW(W.c, type="mult")
SDM_X3_imp<-impacts(SDM_X3, tr=trMat, R=2000)
summary(SDM_X3_imp, zstats=TRUE, short=TRUE)

## Impact measures (mixed, trace):
##          Direct      Indirect         Total
## x3 1.270957e-06 -4.088685e-05 -3.961589e-05
## ========================================================
## Simulation results ( variance matrix):
## ========================================================
## Simulated standard errors
##          Direct    Indirect        Total
## x3 1.781526e-05 6.22435e-05 6.750714e-05
## 
## Simulated z-values:
##        Direct   Indirect      Total
## x3 0.05952286 -0.6947904 -0.6249084
## 
## Simulated p-values:
##    Direct  Indirect Total  
## x3 0.95254 0.48719  0.53203

# extracting direct & total impacts
a<-SDM_X3_imp$res$direct
b<-SDM_X3_imp$res$total
a/b # ratio of impacts

##          x3 
## -0.03208199

The explanatory variables income in location i affect y in location i - this is the direct effect. For the indirect effect, we state that that income variable affect unepmloyment rate in location j, which is its spillover. The difference between direct and indirect effects are negligible.

Models with spatio-temporal lag

Single dynamic spatial lag model

As soon as we add spatial lag model we need to rethink the interpretation of the parameters.

sub$y.st<-lag.listw(cont.listw, data_with_indicators$emp_2014)
sub$y.t<-data_with_indicators$emp_2014

model.lag1<-lagsarlm(y~x1+x2+x3+y.st, data=sub, cont.listw, tol.solve=3e-30)
summary(model.lag1) # spatio-temporal of y added

## 
## Call:
## lagsarlm(formula = y ~ x1 + x2 + x3 + y.st, data = sub, listw = cont.listw, 
##     tol.solve = 3e-30)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35056 -0.54896  0.04388  0.37783  1.91858 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value  Pr(>|z|)
## (Intercept) -3.5694e-02  1.2513e+00 -0.0285    0.9772
## x1          -1.1304e-06  7.1550e-07 -1.5798    0.1141
## x2           6.5337e-07  4.2113e-07  1.5515    0.1208
## x3           2.2276e-05  1.9040e-05  1.1700    0.2420
## y.st         7.6096e-01  1.9130e-01  3.9779 6.953e-05
## 
## Rho: -0.044165, LR test value: 0.02579, p-value: 0.87241
## Asymptotic standard error: 0.18348
##     z-value: -0.24071, p-value: 0.80978
## Wald statistic: 0.057942, p-value: 0.80978
## 
## Log likelihood: -57.55152 for lag model
## ML residual variance (sigma squared): 0.61306, (sigma: 0.78298)
## Number of observations: 49 
## Number of parameters estimated: 7 
## AIC: 129.1, (AIC for lm: 127.13)
## LM test for residual autocorrelation
## test value: 24.098, p-value: 9.1541e-07

model.lag2<-lagsarlm(y~x1+x2+x3+y.t, data=sub, cont.listw, tol.solve=3e-30)
summary(model.lag2) # temporal of y added

## 
## Call:
## lagsarlm(formula = y ~ x1 + x2 + x3 + y.t, data = sub, listw = cont.listw, 
##     tol.solve = 3e-30)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.8058536 -0.1171344 -0.0086293  0.1447374  0.9481954 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  3.0784e-01  5.1212e-01  0.6011   0.5478
## x1          -3.7560e-07  2.9812e-07 -1.2599   0.2077
## x2           2.1109e-07  1.7532e-07  1.2040   0.2286
## x3           4.2628e-06  7.9887e-06  0.5336   0.5936
## y.t          7.6265e-01  4.8643e-02 15.6787   <2e-16
## 
## Rho: 0.040094, LR test value: 0.19359, p-value: 0.65994
## Asymptotic standard error: 0.094134
##     z-value: 0.42593, p-value: 0.67016
## Wald statistic: 0.18142, p-value: 0.67016
## 
## Log likelihood: -14.40766 for lag model
## ML residual variance (sigma squared): 0.10538, (sigma: 0.32462)
## Number of observations: 49 
## Number of parameters estimated: 7 
## AIC: 42.815, (AIC for lm: 41.009)
## LM test for residual autocorrelation
## test value: 0.51329, p-value: 0.47372

Simple dynamic spatial error model

model.error1<-errorsarlm(y~x1+x2+x3+y.st, data=sub, cont.listw, tol.solve=3e-30)
summary(model.error1) # spatio-temporal of y added

## 
## Call:
## errorsarlm(formula = y ~ x1 + x2 + x3 + y.st, data = sub, listw = cont.listw, 
##     tol.solve = 3e-30)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.311015 -0.249453 -0.065504  0.314045  1.698717 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept) -6.5355e-01  7.0199e-01 -0.9310   0.3519
## x1          -7.2458e-07  5.7790e-07 -1.2538   0.2099
## x2           4.2582e-07  3.3920e-07  1.2554   0.2093
## x3           1.1114e-05  1.4263e-05  0.7792   0.4358
## y.st         9.0326e-01  9.7648e-02  9.2502   <2e-16
## 
## Lambda: -0.84839, LR test value: 11.579, p-value: 0.00066715
## Asymptotic standard error: 0.17573
##     z-value: -4.8277, p-value: 1.3811e-06
## Wald statistic: 23.307, p-value: 1.3811e-06
## 
## Log likelihood: -51.77511 for error model
## ML residual variance (sigma squared): 0.41109, (sigma: 0.64117)
## Number of observations: 49 
## Number of parameters estimated: 7 
## AIC: 117.55, (AIC for lm: 127.13)

model.error2<-errorsarlm(y~x1+x2+x3+y.t, data=sub, cont.listw, tol.solve=3e-30)
summary(model.error2) # temporal of y added

## 
## Call:
## errorsarlm(formula = y ~ x1 + x2 + x3 + y.t, data = sub, listw = cont.listw, 
##     tol.solve = 3e-30)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.793942 -0.128052 -0.024285  0.137532  0.960226 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  3.2008e-01  4.1789e-01  0.7659   0.4437
## x1          -4.1479e-07  2.8104e-07 -1.4759   0.1400
## x2           2.3312e-07  1.6491e-07  1.4136   0.1575
## x3           6.7259e-06  7.9786e-06  0.8430   0.3992
## y.t          7.7674e-01  4.2881e-02 18.1140   <2e-16
## 
## Lambda: 0.22778, LR test value: 0.99121, p-value: 0.31945
## Asymptotic standard error: 0.18304
##     z-value: 1.2444, p-value: 0.21336
## Wald statistic: 1.5485, p-value: 0.21336
## 
## Log likelihood: -14.00885 for error model
## ML residual variance (sigma squared): 0.1024, (sigma: 0.32)
## Number of observations: 49 
## Number of parameters estimated: 7 
## AIC: 42.018, (AIC for lm: 41.009)

Summary of models

screenreg(list(model.lag1, model.lag2, model.error1, model.error2),
           custom.model.names=c("dynamic lag st", "dynamic lag t", "dynamic error st", "dynamic error t"))

## 
## =====================================================================================
##                      dynamic lag st  dynamic lag t  dynamic error st  dynamic error t
## -------------------------------------------------------------------------------------
## (Intercept)           -0.04            0.31          -0.65              0.32         
##                       (1.25)          (0.51)         (0.70)            (0.42)        
## x1                    -0.00           -0.00          -0.00             -0.00         
##                       (0.00)          (0.00)         (0.00)            (0.00)        
## x2                     0.00            0.00           0.00              0.00         
##                       (0.00)          (0.00)         (0.00)            (0.00)        
## x3                     0.00            0.00           0.00              0.00         
##                       (0.00)          (0.00)         (0.00)            (0.00)        
## y.st                   0.76 ***                       0.90 ***                       
##                       (0.19)                         (0.10)                          
## rho                   -0.04            0.04                                          
##                       (0.18)          (0.09)                                         
## y.t                                    0.76 ***                         0.78 ***     
##                                       (0.05)                           (0.04)        
## lambda                                               -0.85 ***          0.23         
##                                                      (0.18)            (0.18)        
## -------------------------------------------------------------------------------------
## Num. obs.             49              49             49                49            
## Parameters             7               7              7                 7            
## Log Likelihood       -57.55          -14.41         -51.78            -14.01         
## AIC (Linear model)   127.13           41.01         127.13             41.01         
## AIC (Spatial model)  129.10           42.82         117.55             42.02         
## LR test: statistic     0.03            0.19          11.58              0.99         
## LR test: p-value       0.87            0.66           0.00              0.32         
## =====================================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05

Across models we see that spatial lag and spatio-temporal lag of y are significant.

moran.test(model.lag2$residuals, cont.listw)

## 
##  Moran I test under randomisation
## 
## data:  model.lag2$residuals  
## weights: cont.listw    
## 
## Moran I statistic standard deviate = 0.88956, p-value = 0.1869
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.063885964      -0.020833333       0.009070217

moran.test(model.error2$residuals, cont.listw)

## 
##  Moran I test under randomisation
## 
## data:  model.error2$residuals  
## weights: cont.listw    
## 
## Moran I statistic standard deviate = 0.26657, p-value = 0.3949
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.004509688      -0.020833333       0.009038157

GWR local models

In this step on analysis we show spatial drift models with Geographically Weighted Regression.

GWR estimation

With Geographically Weighted Regression (GWR) method we obtain coefficients for each observation. The similarity in the space of these parameters is not considered.

We include spatial lag of dependent variable

sub$y.t <- data_with_indicators$emp_2014

We use two methods in this procedure, one relates to estimation of GWR coefficients - ggwr and the other one to the optimization of the bandwidth and simultaneously minimization of RMSE (root mean square prediction error).

eq<-y~x1+x2+x3+y.t
bw<-ggwr.sel(eq, data=sub, coords=crds, family=poisson(), longlat=TRUE, verbose=FALSE)
model.ggwr<-ggwr(eq, data=sub, coords=crds, family=poisson(), longlat=TRUE, bandwidth=bw)
model.ggwr

## Call:
## ggwr(formula = eq, data = sub, coords = crds, bandwidth = bw, 
##     family = poisson(), longlat = TRUE)
## Kernel function: gwr.Gauss 
## Fixed bandwidth: 4860.424 
## Summary of GWR coefficient estimates at data points:
##                     Min.     1st Qu.      Median     3rd Qu.        Max. Global
## X.Intercept.  6.6847e-01  6.7199e-01  6.7292e-01  6.7354e-01  6.7566e-01 0.6725
## x1           -6.8026e-08 -6.1030e-08 -5.7929e-08 -5.4981e-08 -5.2413e-08 0.0000
## x2            2.9951e-08  3.1374e-08  3.3020e-08  3.4761e-08  3.8710e-08 0.0000
## x3           -9.8139e-08 -8.5443e-08 -4.2495e-08  3.3142e-08  1.5014e-07 0.0000
## y.t           1.6109e-01  1.6146e-01  1.6163e-01  1.6178e-01  1.6192e-01 0.1614

Visualization of results

GGWR coefficients on map

choropleth(states, model.ggwr$SDF$x1, main="GWR coefficients of salary")

choropleth(states, model.ggwr$SDF$x2, main="GWR coefficients of popest")

choropleth(states, model.ggwr$SDF$x3, main="GWR coefficients of income")

choropleth(states, model.ggwr$SDF$y.t, main="GWR coefficients of spatial lag of dependent variable")

There is no split across GWR coefficients of spatial lag of dependent variable.

Boxplots

boxplot(as.data.frame(model.ggwr$SDF)[,3:6])

Lack of variability is visible for coefficients is observed. There is a single coefficient y.t that is different than zero - y.t.

Ex ante analysis

Based on obtained results we would like to obtain the optimal number of clusters.

fviz_nbclust(as.data.frame(model.ggwr$SDF[,2:6]), FUNcluster=kmeans)

The optimal number of clusters is two. Although this number appears at this stage of analysis very frequently in general, we decide to proceed the analysis with 2 clusters.

clusters<-eclust(as.data.frame(model.ggwr$SDF[,2:6]), "kmeans", k=2)

choropleth(states, clusters$cluster, main="5 clusters, result from eclust()")
text(clusters$centers[,6:7], labels=1:2, cex=2, col="green")

We assess well obtained clustering, there are non-overlapping clusters. The United States were split into western and eastern part.

fviz_silhouette(clusters)

##   cluster size ave.sil.width
## 1       1   18          0.49
## 2       2   31          0.58

Thanks to the plot above we are sure that the quality of the clustering is good enough. The average silhouette is similar among two clusters. However, we spot significant difference in cluster size.

General model with dummy for clusters

In the next step we estimate the model with dummy variable that to which cluster each state belongs.

sub$clust1<-rep(0, times=dim(sub)[1])
sub$clust1[clusters$cluster==1]<-1
sub$clust2<-rep(0, times=dim(sub)[1])
sub$clust2[clusters$cluster==2]<-1

In the new equation we use only the single variable that resulted from clustering.

eq1<-y~x1+x3+y.t+clust1
model.sem<-errorsarlm(eq1, data=sub, listw=cont.listw)
summary(model.sem, Nagelkerke=TRUE)

## 
## Call:errorsarlm(formula = eq1, data = sub, listw = cont.listw)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.7051124 -0.1480112 -0.0053714  0.1450163  0.9758353 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                Estimate  Std. Error z value Pr(>|z|)
## (Intercept)  3.6908e-01  4.1562e-01  0.8880   0.3745
## x1          -1.7966e-08  1.1538e-08 -1.5571   0.1195
## x3          -1.0382e-06  6.1934e-06 -0.1676   0.8669
## y.t          8.1961e-01  4.3431e-02 18.8717   <2e-16
## clust1       1.5734e-01  1.1706e-01  1.3440   0.1789
## 
## Lambda: 0.12526, LR test value: 0.29082, p-value: 0.58969
## Asymptotic standard error: 0.19281
##     z-value: 0.64966, p-value: 0.51591
## Wald statistic: 0.42206, p-value: 0.51591
## 
## Log likelihood: -14.15615 for error model
## ML residual variance (sigma squared): 0.10395, (sigma: 0.32242)
## Nagelkerke pseudo-R-squared: 0.90539 
## Number of observations: 49 
## Number of parameters estimated: 7 
## AIC: 42.312, (AIC for lm: 40.603)

In the model above cluster variable is not-significant. The unemployment rate for states that belong to the first cluster is higher than in the second one. It means that based on this model the unemployment rate is higher in the west part of U.S.A. than eastern part of considered country.

Summary

In the report we have incorporated the spatial econometrics techniques in order to estimate unemployment rate in 2015 year. We have also incorporated significant number of visualizations in order to understand the problem better. Thanks to the other explanatory variables we were able to use more intricate techniques. We find the combination of clustering and spatial econometrics highly insightful.

References

“Cartographic Boundary Files.” 2022. https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html.

“Iowa State University.” 2022. https://www.iastate.edu/.

Spatial Econometrics in R

State-level deep dive on various economic indicators in the USA. Closing the gap between the average salary and unemployment rate.

Mateusz Baryła