Basic multiple linear regression
Let’s regress log housing values (lmedhval) on log total population (ltotp), log median household income (lmedinc), median age of housing (medage), median number of rooms (medrooms), the median number of years current residents have lived in their houses (meddur), the number of parks within a 10-minute walk (parks), and the percent of 4th graders attending the nearest school who scored proficient or above on California’s English Language Arts standardized test (edppl3).
Call:
lm(formula = lmedhval ~ ltotp + lmedinc + medage + medrooms +
meddur + parks + edppl3, data = bayarea)
Residuals:
Min 1Q Median 3Q Max
-1.84833 -0.18460 -0.01965 0.16931 1.44107
Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  4.1986398   0.3034919   13.834  < 0.0000000000000002 ***
ltotp       -0.0382342   0.0176646   -2.164  0.0306 *
lmedinc      0.7675167   0.0263889   29.085  < 0.0000000000000002 ***
medage       0.0061966   0.0005512   11.241  < 0.0000000000000002 ***
medrooms    -0.0856691   0.0095095   -9.009  < 0.0000000000000002 ***
meddur       0.0117370   0.0023333    5.030  0.000000546 ***
parks        0.0102368   0.0012140    8.432  < 0.0000000000000002 ***
edppl3       0.8972933   0.0647460   13.859  < 0.0000000000000002 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3129 on 1568 degrees of freedom
Multiple R-squared: 0.6422, Adjusted R-squared: 0.6406
F-statistic: 402 on 7 and 1568 DF, p-value: < 0.00000000000000022
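Because the outcome is logged, the estimates above read as approximate proportional effects. The following is a minimal sketch of that arithmetic, using two coefficients copied from the summary. The object names are illustrative, and it assumes medage is measured in years.

```r
# Estimates copied from the summary output above
b_lmedinc <- 0.7675167   # log outcome, log predictor: an elasticity
b_medage  <- 0.0061966   # log outcome, level predictor

# A 1% increase in median income is associated with roughly a
# 0.77% higher median housing value, holding the other variables fixed
income_pct <- b_lmedinc * 1

# Percent change in housing value for a one-unit increase in medage
# (exact transformation rather than the 100 * b approximation)
medage_pct <- 100 * (exp(b_medage) - 1)

round(c(income_1pct = income_pct, medage_1unit = medage_pct), 3)
```

For small coefficients such as the one on medage, the exact transformation and the simpler 100 * b approximation give nearly the same answer.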
The linear model is a global model, i.e. it estimates an average effect and assumes that this effect applies to all places. Let’s now deal with spatial heterogeneity in the regression coefficients.
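Let’s interact median age of housing (medage) with region, which allows the medage slope to differ across regions. East Bay is the omitted reference region, so each interaction term is a difference from the East Bay slope.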
Call:
lm(formula = lmedhval ~ ltotp + lmedinc + medage * region + medrooms +
meddur + parks + edppl3, data = bayarea)
Residuals:
Min 1Q Median 3Q Max
-1.83400 -0.16412 -0.00931 0.15247 1.27117
Coefficients:
                              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                  6.1506997   0.3125779   19.677  < 0.0000000000000002 ***
ltotp                       -0.0614002   0.0165147   -3.718  0.000208 ***
lmedinc                      0.5957876   0.0267336   22.286  < 0.0000000000000002 ***
medage                       0.0040160   0.0007772    5.167  0.000000269 ***
regionNorth Bay             -0.0676059   0.0688656   -0.982  0.326396
regionPeninsula             -0.2307712   0.1273507   -1.812  0.070164 .
regionSan Francisco          0.5371517   0.1104769    4.862  0.000001278 ***
regionSouth Bay             -0.0315589   0.0690407   -0.457  0.647659
medrooms                    -0.0379107   0.0095196   -3.982  0.000071368 ***
meddur                       0.0064221   0.0022256    2.885  0.003962 **
parks                        0.0058858   0.0013373    4.401  0.000011492 ***
edppl3                       1.0073740   0.0604405   16.667  < 0.0000000000000002 ***
medage:regionNorth Bay       0.0007567   0.0014485    0.522  0.601484
medage:regionPeninsula       0.0095904   0.0023297    4.117  0.000040476 ***
medage:regionSan Francisco  -0.0027678   0.0015792   -1.753  0.079847 .
medage:regionSouth Bay       0.0056334   0.0014471    3.893  0.000103 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2878 on 1560 degrees of freedom
Multiple R-squared: 0.6988, Adjusted R-squared: 0.6959
F-statistic: 241.3 on 15 and 1560 DF, p-value: < 0.00000000000000022
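To read the interaction output, add each interaction term to the medage main effect to get the slope for that region. The sketch below does this with the coefficients copied from the summary. It assumes East Bay is the omitted reference level, as the output suggests, and the object names are illustrative.

```r
# Coefficients copied from the interaction model summary above
b_medage <- 0.0040160                      # medage slope in the reference region
b_inter  <- c("North Bay"     =  0.0007567,
              "Peninsula"     =  0.0095904,
              "San Francisco" = -0.0027678,
              "South Bay"     =  0.0056334)

# Region-specific slope = main effect + that region's interaction term
slopes <- c("East Bay" = b_medage, b_medage + b_inter)
round(slopes, 7)
```

The Peninsula and South Bay slopes come out at more than twice the East Bay slope, which matches the two interaction terms that are statistically significant in the output.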
Another approach partitions (stratifies) the data by region and fits a separate regression model for each region. We have 5 Bay Area regions: East Bay, North Bay, Peninsula, San Francisco, and South Bay.
Let’s subset the data. Example for San Francisco:
Call:
lm(formula = lmedhval ~ ltotp + lmedinc + medage + medrooms +
meddur + parks + edppl3, data = bayarea, subset = region ==
"San Francisco")
Residuals:
Min 1Q Median 3Q Max
-0.77154 -0.14630 -0.02581 0.13285 0.58939
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  9.737834    0.464642   20.958  < 0.0000000000000002 ***
ltotp       -0.025386    0.028981   -0.876  0.382194
lmedinc      0.318145    0.037911    8.392  0.0000000000000118 ***
medage       0.004164    0.001093    3.812  0.000188 ***
medrooms    -0.015036    0.020216   -0.744  0.457958
meddur      -0.006940    0.004157   -1.670  0.096696 .
parks        0.004527    0.002011    2.251  0.025548 *
edppl3       0.625570    0.121673    5.141  0.0000006871664101 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2184 on 186 degrees of freedom
Multiple R-squared: 0.5344, Adjusted R-squared: 0.5169
F-statistic: 30.49 on 7 and 186 DF, p-value: < 0.00000000000000022
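The goal of the spatial regime model (SRM) is to determine whether the regression coefficients vary across geographic space, in our case across the 5 regions. Instead of fitting five separate models, we fit a single model in which every coefficient, including the intercept, is estimated separately for each region.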
Call:
lm(formula = lmedhval ~ 0 + region/(ltotp + lmedinc + medage +
medrooms + meddur + parks + edppl3), data = bayarea)
Residuals:
Min 1Q Median 3Q Max
-1.84101 -0.14322 -0.00983 0.14132 1.22961
Coefficients:
                                Estimate  Std. Error  t value  Pr(>|t|)
regionEast Bay                 7.0667271   0.5448765   12.969  < 0.0000000000000002 ***
regionNorth Bay               -1.3558964   0.8475253   -1.600  0.109842
regionPeninsula                0.2200716   1.5807389    0.139  0.889294
regionSan Francisco            9.7378339   0.5678225   17.149  < 0.0000000000000002 ***
regionSouth Bay                6.0805756   0.6050788   10.049  < 0.0000000000000002 ***
regionEast Bay:ltotp          -0.0824204   0.0252857   -3.260  0.001140 **
regionNorth Bay:ltotp         -0.1156100   0.0402536   -2.872  0.004134 **
regionPeninsula:ltotp          0.2260217   0.0628154    3.598  0.000331 ***
regionSan Francisco:ltotp     -0.0253860   0.0354172   -0.717  0.473625
regionSouth Bay:ltotp         -0.0704088   0.0358652   -1.963  0.049809 *
regionEast Bay:lmedinc         0.5114252   0.0463247   11.040  < 0.0000000000000002 ***
regionNorth Bay:lmedinc        1.4296971   0.0846586   16.888  < 0.0000000000000002 ***
regionPeninsula:lmedinc        0.9609715   0.1296948    7.409  0.000000000000208 ***
regionSan Francisco:lmedinc    0.3181446   0.0463297    6.867  0.000000000009491 ***
regionSouth Bay:lmedinc        0.5916428   0.0551912   10.720  < 0.0000000000000002 ***
regionEast Bay:medage          0.0038393   0.0008124    4.726  0.000002498677425 ***
regionNorth Bay:medage         0.0006410   0.0014241    0.450  0.652691
regionPeninsula:medage         0.0096107   0.0023089    4.162  0.000033235702884 ***
regionSan Francisco:medage     0.0041644   0.0013352    3.119  0.001848 **
regionSouth Bay:medage         0.0085567   0.0013515    6.331  0.000000000318854 ***
regionEast Bay:medrooms       -0.0149528   0.0148162   -1.009  0.313027
regionNorth Bay:medrooms      -0.2943818   0.0334664   -8.796  < 0.0000000000000002 ***
regionPeninsula:medrooms      -0.0497104   0.0369444   -1.346  0.178648
regionSan Francisco:medrooms  -0.0150357   0.0247049   -0.609  0.542871
regionSouth Bay:medrooms      -0.0555579   0.0168478   -3.298  0.000997 ***
regionEast Bay:meddur          0.0054152   0.0033194    1.631  0.103019
regionNorth Bay:meddur         0.0251022   0.0051989    4.828  0.000001513803860 ***
regionPeninsula:meddur        -0.0019460   0.0068275   -0.285  0.775665
regionSan Francisco:meddur    -0.0069402   0.0050801   -1.366  0.172091
regionSouth Bay:meddur         0.0115730   0.0053658    2.157  0.031175 *
regionEast Bay:parks           0.0129149   0.0028795    4.485  0.000007829080748 ***
regionNorth Bay:parks          0.0014171   0.0028955    0.489  0.624608
regionPeninsula:parks         -0.0177200   0.0052184   -3.396  0.000702 ***
regionSan Francisco:parks      0.0045266   0.0024573    1.842  0.065657 .
regionSouth Bay:parks          0.0037323   0.0026900    1.387  0.165493
regionEast Bay:edppl3          1.1097366   0.0928388   11.953  < 0.0000000000000002 ***
regionNorth Bay:edppl3         0.7243787   0.1557113    4.652  0.000003568587313 ***
regionPeninsula:edppl3         0.5133467   0.1976856    2.597  0.009500 **
regionSan Francisco:edppl3     0.6255696   0.1486926    4.207  0.000027358794393 ***
regionSouth Bay:edppl3         1.3448835   0.1364490    9.856  < 0.0000000000000002 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.267 on 1536 degrees of freedom
Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
F-statistic: 9.942e+04 on 40 and 1536 DF, p-value: < 0.00000000000000022
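The San Francisco estimates in this output are identical to those from the San Francisco subset regression, which is what the nested formula is designed to produce. The sketch below illustrates the equivalence on simulated data. The three regions "A", "B", "C" and the variables x and y are hypothetical stand-ins, not the bayarea data.

```r
set.seed(1)
n <- 300
d <- data.frame(
  region = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  x      = rnorm(n)
)
# Simulated truth: the slope on x differs by region
slope <- c(A = 0.5, B = 1.5, C = -1.0)
d$y <- 2 + unname(slope[as.character(d$region)]) * d$x + rnorm(n, sd = 0.3)

# Spatial regime form: one model with region-specific intercepts and slopes
srm <- lm(y ~ 0 + region/x, data = d)

# Stratified form: a separate regression for region "B" only
sub_b <- lm(y ~ x, data = d, subset = region == "B")

# The point estimates agree
coef(srm)[c("regionB", "regionB:x")]
coef(sub_b)
```

The standard errors do not match, because the single model pools the residual variance across all regions while the subset model uses only its own region. That is why the San Francisco standard errors differ between the two outputs above. Note also that R computes R-squared differently for a model without an intercept (it uses the uncentered total sum of squares), so the 0.9996 reported here is not comparable to the R-squared values of the earlier models.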
Is the spatial regime model a better model than the non-interacted OLS? To answer this question, you can run the spatial Chow test, which assesses whether there is evidence that the relationships between the independent variables and housing values differ across regions.
[[1]]
[1] 78.49631
[[2]]
[1] 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004832431
[[3]]
[1] 8
[[4]]
[1] 1560
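The first value in the list is the test statistic, the second is its p-value, and the third and fourth are the degrees of freedom. Using a cutoff of 0.05, we reject the null hypothesis that the coefficients are the same across regions (the restricted, non-spatial-regime OLS model) in favor of the spatial regime model.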
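As a rough check, the test statistic can be reconstructed from the two model summaries above. This is a sketch under stated assumptions, not the function that produced the list: it rebuilds the residual sums of squares from the rounded residual standard errors, so the result is only approximate, and it takes the two degrees of freedom directly from the reported list.

```r
# Values read from the two model summaries above (rounded)
sigma_r <- 0.3129; df_r <- 1568   # restricted model: pooled OLS
sigma_u <- 0.267;  df_u <- 1536   # unrestricted model: spatial regimes

# Residual sum of squares = (residual standard error)^2 * residual df
rss_r <- sigma_r^2 * df_r
rss_u <- sigma_u^2 * df_u

k   <- 8      # numerator df, the 3rd element of the list
n2k <- 1560   # denominator df, the 4th element of the list

chow_F <- ((rss_r - rss_u) / k) / (rss_u / n2k)
p_val  <- pf(chow_F, k, n2k, lower.tail = FALSE)
c(F = chow_F, p = p_val)
```

The result is close to the reported 78.5, with the small gap due to rounding. The degrees of freedom 8 and 1560 correspond to the classic two-regime form of the Chow test. With five regimes and eight coefficients per regime, a nested-model F test of the pooled fit against the regime fit, as `anova()` would compute it, uses 32 and 1536 degrees of freedom instead, so it may be worth running that comparison as well.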
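Geographically Weighted Regression (GWR) treats the study area as a continuous surface. It fits a separate regression at each location, weighting nearby observations more heavily than distant ones through a kernel function (here Gaussian). The bandwidth that controls how quickly the weights decay with distance is chosen by cross-validation; the search is shown below.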
Bandwidth: 99167.94 CV score: 152.5657
Bandwidth: 160296.9 CV score: 154.6469
Bandwidth: 61388.17 CV score: 147.5673
Bandwidth: 38038.98 CV score: 137.5391
Bandwidth: 23608.39 CV score: 124.6536
Bandwidth: 14689.8 CV score: 112.6202
Bandwidth: 9177.801 CV score: 102.8401
Bandwidth: 5771.201 CV score: 102.4129
Bandwidth: 7135.327 CV score: 98.98384
Bandwidth: 7425.969 CV score: 99.20257
Bandwidth: 7090.012 CV score: 98.97257
Bandwidth: 7029.76 CV score: 98.96845
Bandwidth: 7039.899 CV score: 98.96825
Bandwidth: 7040.456 CV score: 98.96825
Bandwidth: 7040.377 CV score: 98.96825
Bandwidth: 7040.378 CV score: 98.96825
Bandwidth: 7040.378 CV score: 98.96825
Bandwidth: 7040.378 CV score: 98.96825
Bandwidth: 7040.378 CV score: 98.96825
Call:
gwr(formula = lmedhval ~ ltotp + lmedinc + medage + medrooms +
meddur + parks + edppl3, data = bayarea.sp, bandwidth = gwr.b1,
hatmatrix = TRUE)
Kernel function: gwr.Gauss
Fixed bandwidth: 7040.378
Summary of GWR coefficient estimates at data points:
                     Min.     1st Qu.      Median     3rd Qu.        Max.   Global
X.Intercept.  -14.8367664   4.7192790   6.4927353   9.0014005  37.9782702   4.1986
ltotp          -1.3257556  -0.0812652  -0.0249426   0.0083518   0.3033374  -0.0382
lmedinc        -2.6077786   0.3291706   0.5705322   0.7039981   2.6226942   0.7675
medage         -0.0351063   0.0027374   0.0051062   0.0069553   0.0493151   0.0062
medrooms       -0.8227821  -0.0525912  -0.0224669   0.0188203   1.7695621  -0.0857
meddur         -0.0664802   0.0012466   0.0056096   0.0124948   0.2522683   0.0117
parks          -0.0414050   0.0022049   0.0067550   0.0108894   0.0717151   0.0102
edppl3         -3.0989630   0.6059132   0.7902467   1.0476815   2.6368918   0.8973
Number of data points: 1576
Effective number of parameters (residual: 2traceS - traceS'S): 284.4856
Effective degrees of freedom (residual: 2traceS - traceS'S): 1291.514
Sigma (residual: 2traceS - traceS'S): 0.2262557
Effective number of parameters (model: traceS): 219.4385
Effective degrees of freedom (model: traceS): 1356.561
Sigma (model: traceS): 0.2207646
Sigma (ML): 0.2048194
AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): -12.45141
AIC (GWR p. 96, eq. 4.22): -305.9628
Residual sum of squares: 66.11474
Quasi-global R2: 0.8458944
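To make the mechanics behind this output concrete, the sketch below fits a local regression at a single location using Gaussian kernel weights and base R only. It is an illustration on simulated data, not the bayarea data and not the internals of `gwr()`. The coordinates, the variables x and y, and the helper names `gauss_w` and `local_fit` are all hypothetical, and the weight formula is the usual Gaussian form, assumed here to match the kernel reported above.

```r
set.seed(42)
n  <- 200
xy <- cbind(runif(n, 0, 50000), runif(n, 0, 50000))  # projected coordinates
x  <- rnorm(n)
# Simulated truth: the slope drifts from west to east,
# so a single global slope would be misleading
beta <- 0.5 + xy[, 1] / 50000
y    <- 1 + beta * x + rnorm(n, sd = 0.2)

# Gaussian kernel: weights decay with squared distance relative to bandwidth h
gauss_w <- function(dist2, h) exp(-0.5 * dist2 / h^2)

# Weighted least squares centered on observation i
local_fit <- function(i, h) {
  dist2 <- (xy[, 1] - xy[i, 1])^2 + (xy[, 2] - xy[i, 2])^2
  w <- gauss_w(dist2, h)
  coef(lm(y ~ x, weights = w))
}

h    <- 7040.378            # the fixed bandwidth reported above, reused for illustration
west <- which.min(xy[, 1])  # westernmost point
east <- which.max(xy[, 1])  # easternmost point
rbind(west = local_fit(west, h), east = local_fit(east, h))
```

Repeating the local fit at every data point yields one set of coefficients per location, which is what the Min., quartile, and Max. columns in the summary describe. An observation exactly one bandwidth away from the fitting location receives a weight of exp(-0.5), about 0.61.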
Map of the GWR coefficient for the number of parks within a 10-minute walk (parks), showing how its estimated effect on log housing value varies across the San Francisco Bay Area.