Assignment2.knit

Using Spatial Lag, Spatial Error, and Geographically Weighted Regression to Predict Median House Values in Philadelphia Block Groups

Brooke Porter, Dennis Espejo, Oindriza Reza Nodi MUSA 5000: Statistical and Data Mining Methods for Urban Data Analysis Eugene Brusilovskiy November 3, 2025

Introduction

Philadelphia is a highly segregated city due to historic redlining (Hillier, 2003) and rising gentrification (Hwang et al., 2020). Such policies and trends have led to an unequal distribution of resources throughout the city. Looking at census block data, the present study considers the relationship between median house values in Philadelphia (MEDHVAL) and various community attributes, including proportion of residents with at least a bachelor’s degree (PCTBACHMOR), proportion of vacant housing units (PCTVAVANT), percent of single-family housing units (PCTSINGLES), number of households living in poverty (NBELPOV100), and median household income (MEDHHINC). Census blocks are the smallest geographic unit of analysis provided by the U.S. Census Bureau (Rossitier, 2011), and provide the most granular level of geographic analysis necessary to conduct an in-depth and delineated analysis of trends across specific sectors in Philadelphia. In a previous report, we used OLS regression to examine the relationship between the dependent variable MEDHVAL and predictors PCTBACHMOR, PCTVAVANT, and PCTSINGLES. However, our analysis revealed spatial autocorrelation, a violation of the OLS assumption of independent observations. In this report, we use spatial regression models, specifically spatial lag, spatial error, and geographically weighted regression, to address concerns around spatial autocorrelation and determine if these models perform better than OLS.

Methods

A Description of the Concept of Spatial Autocorrelation

American geographer and cartographer Waldo Tobler defines the first law of geography as: “Everything is related to everything else, but near things are more related than distant things” (Tobler, 1970). His 1970 publication, A Computer Movie Simulating Urban Growth in the Detroit Region, has over 14,500 citations on Google Scholar at the time of this writing and is widely referenced by geographers and spatial statisticians. Spatial autocorrelation is a direct application of Tobler’s first law. It describes the extent to which location-based units of analysis exhibit similar characteristics to nearby units (positive spatial autocorrelation) or dissimilar characteristics (negative spatial autocorrelation). For example, if our unit of analysis is neighborhoods (as defined by the census or city government), and low-income neighborhoods (where the neighborhood poverty rate is at least 20 percent) are more likely to be surrounded by other low-income neighborhoods, we can say there is positive spatial autocorrelation in neighborhood poverty. Negative spatial autocorrelation occurs when nearby units are more different from each other than distant ones. A common example is the placement of supermarkets in a city. Rather than clustering all supermarkets on the same block, market competition and zoning often lead to a pattern where areas with a supermarket are adjacent to areas without one, creating a checkerboard-like pattern of access. To test for the presence and strength of positive or negative spatial autocorrelation, spatial statisticians frequently use Moran’s I, a global measure that summarizes the degree to which similar or dissimilar values cluster across a spatial dataset.

Moran’s I

Moran’s I is named after its founder, Patrick Alfred Pierce Moran, an Australian statistician. It is one of the most commonly used tests in spatial statistics to determine whether there is spatial dependence or spatial autocorrelation within a spatial dataset. Moran’s I is calculated using the following general equation:

\[ I = \frac{ \left( \dfrac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (X_i - \bar{X})(X_j - \bar{X})} {\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}} \right) }{ \left( \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} \right) } \]

Where:

$\bar{X}$ is the mean of the variable $X$
$X_i$ is the value of $X$ at location $i$
$X_j$ is the value of $X$ at location $j$
$w_{ij}$ is the spatial weight indexing the relationship of location $i$ to location $j$
$n$ is the number of observations (points or areal units) (Brusilovskiy, 2025)

The spatial weights $w_{ij}$ determine which locations are treated as neighbors. These neighbors are defined by the spatial weights matrix. In our case, we use a queen contiguity matrix, which treats any census block that shares either a vertex or a boundary segment with census block $i$ as a neighbor $j$. This is analogous to the movement of a queen in chess, which can move both diagonally and orthogonally.

To compute the numerator, we:

Subtract the overall mean $\bar{X}$ from the value at each location, $(X_i - \bar{X})$.
Do the same for each neighboring location, $(X_j - \bar{X})$.
Multiply these deviations together and weight them by $w_{ij}$. If two units are neighbors, $w_{ij} = 1$; if they are not neighbors, $w_{ij} = 0$, so non-neighbor pairs drop out of the sum.
Sum these weighted cross-products across all $i$ and $j$:
\[ \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (X_i - \bar{X})(X_j - \bar{X}) \]
Divide by the total number of neighbor links, $\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}$. This gives the average covariance among neighboring units.
Next, we compute the denominator, which is the variance of $X$:

\[ \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} \]

Finally, we divide the average neighbor covariance (the entire top fraction) by this variance term (the bottom fraction). Dividing by the variance is the last step. This standardizes Moran’s I so that it reflects how strongly nearby values of $X$ move together relative to the overall variability in the data, which is exactly what we are trying to capture with a measure of spatial autocorrelation (Brusilovskiy, 2025).

Statisticans typically do not rely on a single weight matrix. Instead, they test spatial autocorrelation using different matrices to assess the robustness of their results. Using multiple matrices enables verification of whether findings are consistent across different definitions of neighbors and to identify patterns that a single matrix might overlook. For example, we might initially find no spatial autocorrelation using a queen contiguity matrix. However, if we then implement a distance-based k-nearest neighbors matrix that defines neighbors based on the actual geographic distance between census blocks, we may detect spatial autocorrelation that was not visible under the queen’s rule specification. This comparison could help ensure that our conclusions are not driven solely by a particular choice of spatial weights.

A Moran’s I value that is positive and closer to 1 indicates strong positive spatial autocorrelation. In that case, similar values tend to cluster together in space. A Moran’s I value that is negative and closer to −1 indicates strong negative spatial autocorrelation, where nearby units tend to have dissimilar values. Under the null hypothesis of no spatial autocorrelation, the expected value of Moran’s I is slightly negative and is given by the following equation:

\[ E[I] = -\frac{1}{n - 1} \]

In other words, when there is no spatial dependence, Moran’s I is typically negative and close to zero, but not exactly zero. Formally, we can state the hypotheses for Moran’s I as follows:

H0: There is no spatial autocorrelation. (Moran’s I is equal to its expected value under the null, which is negative and close to zero.)
𝐻 1 : There is spatial autocorrelation. (Moran’s I is significantly different from its expected value under the null hypothesis (either positive or negative).)

Once we compute Moran’s I, we then need to test its significance to decide whether to reject or fail to reject the null hypothesis. A common approach is to use a permutation test. To do this, we randomly shuffle the values of our variable of interest across the spatial units 999 times. For each shuffled dataset, we recompute Moran’s I. Including our original, unshuffled dataset, this yields a total of 1,000 Moran’s I values. We then rank the original Moran’s I within this distribution and divide its rank by 1,000 to obtain a pseudo p-value.

For example, suppose we are examining the percentage of households with pets across census blocks and we obtain an observed Moran’s I of 0.75. This suggests that census blocks with a high percentage of households with pets tend to be surrounded by other blocks with similarly high percentages, while blocks with low percentages tend to be located near other low-percentage blocks. After shuffling the pet ownership values across the census blocks 999 times, we compute Moran’s I for each permutation. If our original Moran’s I of 0.75 is the third most extreme value (for a one-sided positive test, the third highest) among the 1,000 values, the pseudo p-value is calculated as:

\[ \text{Pseudo } p\text{-value} = \frac{\text{rank of the observed } I}{1000} \]

For example, if the observed Moran’s $I = 0.75$ is the third most extreme value (for a one-sided positive test), then:

\[ \text{Pseudo } p\text{-value} = \frac{3}{1000} = 0.003 \]

This means that there is a 3 in 1,000 chance of observing a Moran’s I as large as 0.75 if there were truly no spatial autocorrelation. Because this probability is very small, we would reject the null hypothesis and conclude that there is statistically significant positive spatial autocorrelation. If our observed Moran’s I were negative and we were testing for significant negative spatial autocorrelation, we would sort the permutation distribution in ascending order and use the rank from the lower tail instead. The logic of the permutation test remains the same.

If we are interested in local rather than global patterns, we can compute a local version of Moran’s I. A local Moran’s I provides a separate statistic for each unit of analysis, such as each census block. Local Indicators of Spatial Association (LISA) statistics are commonly used for this purpose, testing whether the values at sites in the vicinity of location 𝑖 are associated with the value at 𝑖. In other words, LISA statistics tell us which individual census blocks have values that are significantly similar or dissimilar to those of their neighbors, given the chosen weight matrix.

The significance test for local Moran’s I follows the same permutation logic as the global Moran’s I. However, instead of computing a single Moran’s I per permutation, we compute one Moran’s I for each unit 𝑖 in each permutation. This produces a distribution of 1,000 Moran’s I values for every unit. Each unit’s observed local Moran’s I is then ranked within its own permutation distribution and assigned a pseudo p-value. As a result, each census block receives both a local Moran’s I value and an associated significance level.

A key benefit of local Moran’s I is that it allows us to identify specific patterns across space. We can detect high values surrounded by other high values, low values surrounded by other low values, and potential outliers where high values are surrounded by low values or vice versa. This level of detail enables us to move beyond a single global measure and instead pinpoint the specific areas where clustering or spatial outliers are driving the overall spatial autocorrelation pattern.

Review of OLS Regression and Assumptions

A regression analysis is defined as “the part of statistics that investigates the relationship of two or more variables related in a non-deterministic fashion” (Devore, 2016). Regression analysis is an efficient method for estimating the relationship between one or more predictors (x) and an outcome (y), and is commonly used to uncover patterns between variables. A regression analysis must meet the following assumptions to ensure that the results are reliable and the model’s inferences are valid: linearity, independence of observations, normality of residuals, homoscedasticity, and absence of multicollinearity. For more information on OLS regression analysis and testing OLS assumptions, please refer to our paper titled “OLS Regression to Predict Median Household Values in Philadelphia” .

Typically, when dealing with spatial data, the OLS assumption of independence of observations is violated. Where there may be independence of the variables when added on a scatterplot without any specific patterns, yet there may be clusters of residuals with similar values when displayed on a map. Clusters indicate a relationship (spatial in this case) across the residuals, suggesting that the residuals aren’t independent of each other. To test whether these spatial clusters of residuals are statistically significant, the global Moran’s I of the residuals are collected and tested for statistical significance. If there is a significant Moran’s I far from 0, then there is a possibility that there is a presence of spatial autocorrelation, meaning the assumption of independence of residuals is violated.

To further investigate whether the assumption of independence of residuals is violated, we can regress our residuals on those of nearby observations. For example, let’s say we want to regress our residuals on their neighboring residuals using the Queen’s matrix. This is calculated by summing the neighboring residuals and dividing the result by the number of neighbors included in the analysis. When regressing residuals on lagged (neighboring) residuals, a one unit change in the lagged residuals signifies a change in the residuals by the amount listed in the B coefficient. For example, if we find that the B coefficient when we regress our residuals on lagged residuals is .75, then a 1 unit increase in our lagged residuals signifies a .75 increase in our residuals. If this coefficient and the results of our regression are significant, this further confirms that a relationship exists across residuals based on space, indicating that our OLS regression may have potential spatial biases that need to be accounted for. Additionally, platforms such as GeoDa or R, common software used in spatial statistics, have ways of testing additional OLS regression assumptions. These software have three different diagnostic tests for heteroscedasticity. Which include the following:

The Breusch-Pagan Test: A simple test for heteroscedasticity and random coefficient variation.
The Koenker-Bassett Test: Robust tests for heteroscedasticity based on regression quintiles
The White Test: A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity

Across these three tests, the null and alternative hypotheses are :

H0: There is homoscedasticity (there is no pattern in the residuals) (Need to specify this a bit more)
H1: There is heteroscedasticity (pattern across residuals)

If the p-value is less than .05 across any of the prior three tests, we reject the null hypothesis of homoscedasticity for the alternative hypothesis of heteroscedasticity. Additionally, we can test for the normality of errors. The Jarque-Bera test, named after Carlos Jarque and Anil K. Bera, in GeoDa is a goodness-of-fit test of “whether sample data have the skewness and kurtosis matching a normal distribution” (Wikipedia Contributors, 2025). The null and alternative hypotheses for the Jarque-Bera test are as follows:

H0: Residuals are normal
H1: Residuals are non-normal.

For example, if we have a JB statistic with a p-value of 0.00001, less than .05, we reject the null hypothesis of normality in favor of the alternative hypothesis of non-normality. Alternatively, if we have a p-value of 0.094, the p-value is greater than 0.05, meaning we fail to reject the null hypothesis of normality.

Spatial Lag and Spatial Error Regression

To account for spatial autocorrelation in residuals and in the outcome, researchers often turn to two commonly used spatial regression models: the Spatial Lag Model (SLM) and the Spatial Error Model (SEM). For this assignment we will use GeoDa to construct both models. Both models are designed to address violations of the OLS assumption of independence of observations, which is frequently violated when data are spatial in nature. When nearby geographic units influence each other, ignoring this structure can result in biased coefficient estimates, inefficient standard errors, and residuals that remain spatially autocorrelated. Spatial regression directly incorporates the spatial structure to correct these issues.

The spatial lag model addresses spatial dependence by incorporating a spatially lagged version of the dependent variable directly into the regression model. In practice, this means that the model gathers each unit’s neighbors based on a predefined spatial weight matrix, such as the queen contiguity matrix used in this analysis, and computes the weighted average of the dependent variable for those neighbors. This spatial lag then becomes an additional predictor in the model. Conceptually, the spatial lag model captures the idea that the outcome of a location is influenced by outcomes in nearby locations, which aligns with Tobler’s first law of geography. If we ignore this dependence and fit a standard OLS model, the estimated coefficients for our predictors may be biased because the model is forced to absorb variation that is actually driven by neighboring outcomes rather than by local characteristics alone.

The general form of the spatial lag model can be written as:

\[ y = \rho Wy + \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + \varepsilon \]

For our current study, we use the log transformed median house value (LNMEDHVAL) as a function of the spatial lag term and four additional neighborhood-level predictors: the proportion of vacant housing units (PCTVACANT), the proportion of single-family housing units (PCTSINGLES), the number of households living in poverty (NBELPOV100), and Percent Bachhelors or Higher (PCTBACHMORE). The spatial lag formula is:

\[ \begin{aligned} \text{LNMEDHVAL} =\;& \rho W(\text{LNMEDHVAL}) + \beta_0 + \beta_1 \text{PCTVACANT} + \beta_2 \text{PCTSINGLES} \\ &+ \beta_3 \text{LNBELPOV100} + \beta_4 \text{PCTBACHMORE} + \varepsilon \end{aligned} \]

Where:

LNMEDHVAL is the dependent variable that represents the natural log of median house value for each census block group.
$W(\text{LNMEDHVAL})$ is the spatially lagged dependent variable. It is calculated as the weighted average of MEDHVAL among all neighboring census block groups, with neighbors defined by the queen contiguity matrix.
$\rho$ (rho) is the spatial autoregressive parameter that measures the strength and direction of spatial dependence in the outcome. A significant and positive $\rho$ indicates that high values of MEDHVAL tend to cluster near other high values, and low values tend to cluster near low values.
$\beta_1, \beta_2, \beta_3, \beta_4$ are regression coefficients that represent the marginal effect of each predictor on MEDHVAL while holding the other predictors and the spatial lag term constant.
PCTVACANT, PCTSINGLES, LNBELPOV100, MEDHHINC are the neighborhood-level predictors used in the model.
$\varepsilon$ is the model error term that captures random variation in MEDHVAL that is not explained by the predictors or the spatial lag term.

The main distinction between the OLS model and the spatial lag model is that OLS assumes that the value of the dependent variable in one location does not depend on the values in neighboring locations. The spatial lag model explicitly relaxes this assumption and allows outcomes in neighboring block groups to influence each other through the term $\rho W(\text{MEDHVAL})$.

While the spatial lag model accounts for spatial dependence in the outcome itself, the spatial error model captures spatial dependence in the error structure. The spatial error model assumes that the residual at one location is systematically related to the residuals at nearby locations, and that this pattern reflects omitted spatially structured variables or processes that are not included in the model. In other words, the SEM is appropriate when the source of spatial dependence arises from unobserved or unmeasured factors that vary across space rather than from the dependent variable directly influencing neighboring outcomes.

The spatial error model can be described in two linked equations:

\[ y = \beta_0 + \beta_1 X_{1} + \dots + \beta_n X_{n} + \varepsilon \]

\[ \varepsilon = \lambda W \varepsilon + u \]

For our analysis, this becomes:

\[ \text{LNMEDHVAL} = \beta_1 \text{PCTVACANT} + \beta_2 \text{PCTSINGLES} + \beta_3 \text{LNBELPOV100} + \beta_4 \text{PCTBACHMOR} + \lambda W \varepsilon + u \]

Where:

$\beta_1, \beta_2, \beta_3, \beta_4$ represent the effects of PCTVACANT, PCTSINGLES, LNBELPOV100, and PCTBACHMOR on MEDHVAL, after accounting for spatial structure in the errors.
$\lambda$ (lambda) is the spatial error parameter that measures the strength and direction of spatial autocorrelation in the residuals. A positive and significant $\lambda$ suggests that locations with positive residuals tend to be near other locations with positive residuals, and similarly for negative residuals.
$W\varepsilon$ is the spatial lag of the error term, which averages the residuals of neighboring observations using the same queen contiguity matrix.
$u$ is the remaining uncorrelated (white noise) component of the error term after accounting for spatial dependence.

The spatial error model removes spatial autocorrelation from the residuals by explicitly modeling the structure of the errors. This adjustment yields more reliable coefficient estimates and standard errors, since the model no longer assumes that residuals are independent when they are not.

The assumptions required for OLS still apply to both the spatial lag and spatial error models. We still require linearity, normality of residuals, homoscedasticity, and low multicollinearity among predictors. The only assumption that is relaxed is the independence of observations. Instead of assuming independence, we now model spatial dependence directly through the spatial lag term in the SLM or the spatial error process in the SEM. The primary goal of fitting spatial lag and spatial error models is to produce regression residuals that no longer exhibit spatial autocorrelation. In practice, this means that the Moran’s I of the residuals should be close to zero and statistically insignificant. To evaluate whether the spatial models perform better than OLS, we compare several criteria:

Akaike Information Criterion (AIC)/Schwartz Criterion: A measure of goodness of fit, two models are needed for AIC. Lower AIC values indicate a better tradeoff between goodness of fit and model complexity. For example, if OLS has an AIC of 12,500 and the spatial lag model has an AIC of 11,900, the spatial lag model is preferred because it has the lower value.

Log Likelihood: The log likelihood measures how likely the observed data are under a given model. Higher log likelihood values indicate a model that fits the data better.For example, if the OLS model reports a log likelihood of -6,200 and the spatial lag model reports -5,850, the higher log likelihood (closer to zero) indicates that the spatial lag model fits the data better. Specifically, the Log likelihood method only works for nested models.So we can do a log likelihood test between SEM and OLS or SLM and OLS but not with SEM AND SLM

Likelihood Ratio (LR) Test: The LR test compares a simpler model (such as OLS) to a more complex spatial model (such as SLM or SEM). For example, if comparing OLS to the spatial lag model yields an LR statistic of 35.7 with a p-value of 0.001, this indicates that the spatial lag model significantly improves the fit over OLS. Just like the log Likelihood, the likelihood ratio only works for nested models. So we can do a likelihood ratio test between SEM and OLS or SLM and OLS but not with SEM AND SLM.

Another way to compare OLS results with spatial lag and spatial error results is to examine the Moran’s I of the regression residuals from each model. We compute Moran’s I for the OLS residuals, the spatial lag residuals, and the spatial error residuals. The preferred model is the one whose residuals have the smallest magnitude of Moran’s I and the least statistical significance. If the spatial models substantially reduce Moran’s I relative to OLS and produce non-significant spatial autocorrelation in the residuals, this is strong evidence that the spatial dependence has been appropriately modeled and that the spatial regression model is an improvement over the standard OLS specification. For example, the OLS residuals may have a Moran’s I of 0.31 (p < 0.001), indicating strong residual spatial autocorrelation. If the spatial lag model reduces Moran’s I to 0.05 (p = 0.18) and the spatial error model reduces it to 0.02 (p = 0.41), the spatial error model would be preferred because it removes the most spatial dependence from the residuals.

Geographically Weighted Regression (GWR)

For this portion of the analysis, we will conduct all Geographically Weighted Regression (GWR) models in R. GWR is a local regression technique that allows the relationship between predictors and an outcome to vary across space. The intuition behind GWR is closely tied to Simpson’s paradox and the logic of local regression. Simpson’s paradox shows that a relationship observed at the global level can reverse or disappear when data are grouped into smaller, more meaningful subsets. GWR follows that same logic. Instead of forcing one global relationship onto all locations, GWR estimates many local regressions, letting the data reveal how the strength and direction of associations shift across geographic space.

GWR can be expressed through the following general equation:

\[ y_i = \beta_{i0} + \beta_{i1} x_{i1} + \beta_{i2} x_{i2} + \cdots + \beta_{im} x_{im} + \varepsilon_i \]

In this equation, each observation $i$ has its own set of coefficients. In simpler terms, instead of estimating one slope for all areas of interest, GWR asks what the slope looks like for each area of interest, assuming its surrounding areas are most influential for estimating that relationship.

To run local regression, GWR assigns weights to all observations based on their distance from the focal location. Observations that are closer receive higher weights, and observations farther away receive lower weights. A separate regression is then estimated for each location using these distance-decayed weights. As a result, GWR produces a surface of coefficients, each capturing how the predictors operate locally rather than globally.

A core component of this process is the bandwidth, which controls how “local” the local regression actually is. A small bandwidth uses only nearby neighbors, leading to highly localized estimates, while a larger bandwidth smooths the relationships over a wider area. Bandwidths can be either fixed or adaptive. A fixed bandwidth uses the same geographic distance everywhere. Let’s say we have a fixed bandwith of five miles, the bandwith will count units within a five mile radius of each unit. This means fixed bandwiths perform best in datasets where observations are spread evenly, which is very unlikely to be present in social science data. An adaptive bandwidth, by contrast, adjusts based on the amount of observations. For example, if we use an adaptive bandwith we specifically want the n amount of closest neighbors.

For this study, we use an adaptive bandwidth. The spatial distribution of census block groups in Philadelphia is not uniform, and some areas have more densely packed units than others. Using an adaptive bandwidth allows the model to adjust to this unevenness and avoids creating unstable estimates in areas with sparse observations. Because the goal of this project is to capture meaningful local variation in economic and housing patterns, an adaptive bandwidth is the more appropriate choice.

With GWR the same OLS assumptions still apply. Linearity, independence, normality of residuals, and homoscedasticity remain important to evaluate. Multicollinearity is also a concern in GWR. Instead of relying on global variance inflation factors alone, GWR analysts typically examine the Condition Number produced for each local regression. High Condition Numbers, typically over 30 (Brusilovskiy, 2025), indicate instability in coefficient estimates and often reflect a combination of multicollinearity and spatial clustering in the predictors.

Finally, GWR outputs do not include traditional p-values. Local regressions generate hundreds or thousands of coefficient estimates, making standard hypothesis testing inappropriate. In addition, the local nature of GWR violates the assumptions required for conventional parametric p-values. Instead of p-values, interpretation focuses on the spatial patterns, magnitudes, and stability of coefficients and on whether the GWR model improves model fit compared to the global OLS model.

Results

Spatial Autocorrelation

Figure 1: Moran’s I of LNMEDHVAL with the spatially lagged LNMEDHVAL.

Figure 2: Pseudo p-value significance test based on 999 permutations.

To determine whether the dependent variable is significantly autocorrelated, we used GeoDa to examine the global Moran’s I using a scatterplot and Queens weight matrix. The global Moran I’s value of LNMEDHVAL is 0.794. Running 999 permutations revealed that the p-value of this Moran’s I is 0.001, indicating that LNMEDHVAL is significantly autocorrelated.

Figure 3: LISA cluster map for LNMEDHVAL.

Figure 4: LISA significance map for LNMEDHVAL.

Running the Local Moran’s I provides insights into how similar observations are to its neighbors. On the LISA Cluster Map, we observe high value homes surrounded by other high value homes in Northwest, Northeast, and parts of South Philly as indicated by the dark red areas, and low value homes surrounded by other low value homes in North, West, and parts of South Philly as indicated by the dark blue areas on the map. We also see incidents of low-high in the Grays Ferry area, where low home values are surrounded by high value neighbors as well as high-low in South Philly, where high value homes are surrounded by low value neighbors. Side by side comparison of the LISA Cluster Map and the LISA significance Map shows that these patterns are statistically significant, meaning there is systematic over and underprediction in the model due to spatial autocorrelation.

A Review of OLS Regression and Assumptions: Results

Table 1: OLS regression results.

OLS regression reveals that all 4 independent variables, the proportion of residents with at least a bachelor’s degree (PCTBACHMOR), the proportion of vacant housing units (PCTVAVANT), percent of single-family housing units (PCTSINGLES), and log-transformed number of households living in poverty LN(NBELPOV100) , are statistically significant predictors (with p-values <.05) of the dependent variable, the log-transformed median house value LN(MEDHVAL). The model’s adjusted R² (0.6615) indicates that approximately 66.15% of the variance in LN(MEDHVAL) is explained by these 4 predictors. Additionally, the F-statistic (840.9, p < 0.001) indicates that the overall model is statistically significant. The heteroscedasticity tests in the output, the Breusch-Pagan test, the Konker-Bassett the test, and the White Test , have p-values <.05, meaning we reject the null hypothesis of homoscedasticity in favor of the alternate hypothesis of heteroscedasticity. This conclusion is not consistent with the Residuals vs. Fitted Values scatter plot from our OLS exploratory analysis. This discrepancy may be due to these tests’ high level of sensitivity and ability to pick up on variance that is overlooked by simply looking at the distribution of points on the scatterplot. However, this conclusion does align with the evidence of heteroscedasticity present on the standardized residuals choropleth.

Figure 5: OLS scatter plot of standardized residuals and fitted values.

The test on normality of errors (Jarque-Bera test) is also statistically significant (with a p-value<.05) so we reject the null hypothesis of normal residuals in favor of the alternate hypothesis of non-normal residuals. This result indicates a problem with normality that is not consistent the histogram of residual (errors) seen below. This discrepancy may be due to the Jarque-Bera test;s high level of sensitivity and ability to pick up on variance that is overlooked by simply looking at the histogram.

Figure 6: Histogram of standardized residuals for the normality check.

Figure 7: Comparison of OLS residuals and spatially weighted residuals.

Next, we plotted the OLS residuals and the weighted residuals to see if residuals are associated with nearby observations. The scatterplot shows that Slope b is 0.733, meaning that the coefficient of WT_RESIDU is 0.733 when you regress OLS_RESIDU on WT_RESIDU. This slope is statistically significant with a p-value of 0.

Figure 8: Global Moran’s I of the OLS residuals.

Figure 9: Permutation-based significance test for the OLS residuals.

The Moran’s I scatterplot and results from the 999 permutations for OLS regression residuals reveal that a Moran’s I of 0.313 is statistically significant with a p-value of 0.001, indicating that there is significant spatial autocorrelation in the OLS residuals. Such spatial relationships are problematic because they violate the assumption of independent residuals . The Moran’s I and the beta coefficient of the spatially lagged residuals tell similar stories of spatial autocorrelation in the residuals, observations are related to their neighbors.

Spatial Lag and Spatial Error Regression Results

Table 2: Spatial lag model results for LNMEDHVAL.

The coefficient for W_LNMEDHVAL is 0.615097 with a z-value of 36.07 and p-value at 0, which indicates that the variable is highly statistically significant. This positive and significant coefficient means that median home values in one area are strongly associated with those in neighboring areas. Spatial spillover effects are present. All four predictors (LNNBELPOV, PCTBACHMOR, PCTSINGLES, PCTVACANT) have p-values at 0, meaning they are all statistically significant in the spatial lag model. The signs and magnitudes of these variables indicate the direction of their relationships. LNNBELPOV has a negative coefficient, meaning higher poverty is associated with lower median housing values. Similarly, PCTVACANT has a negative coefficient, showing that tracts with higher vacancy rates correspond to lower housing values. In contrast, PCTBACHMOR has a positive coefficient, indicating that tracts with a higher percentage of residents holding a bachelor’s degree or higher tend to have higher housing values. Lastly, the positive coefficient for PCTSINGLES suggests that areas with more single-person households have slightly higher housing values. Compared to the OLS model, the Spatial Lag model additionally includes W_LNMEDHVAL, which is highly significant, revealing strong spatial dependence. Even though all the variables remain highly significant in both the models, the coefficient magnitudes are smaller in the spatial lag model because spatial spillover effects are now explicitly modeled. For instance, the coefficient of PCTVACANT drops slightly from –0.0191 to -0.0085, and PCTBACHMORE dropped from 0.0209 to 0.00851.Since the p-value of the Breusch-Pagan test is less than 0.05, we reject the null hypothesis in favor of the alternative that the residuals are heteroskedastic.

The model improved as the AIC decreased to 523.48 in the spatial lag model compared to 1432.99 in OLS regression. The Spatial Lag model shows a much higher log-likelihood value (–255.74) compared to the OLS model (–711.493). Because higher (less negative) log-likelihood values indicate a better model fit, this suggests that the Spatial Lag model fits the data substantially better. Similarly, the Schwarz Criterion (BIC) is much lower in the Spatial Lag model (556.18) than in the OLS model (1460.24), further confirming that the Spatial Lag model provides a more efficient and parsimonious explanation of the data. Since the p-value is 0 for the log likelihood test statistic of 911.5067, we reject the null hypothesis that there is no spatial dependence in favor of the alternative that there is spatial dependence, so the spatial lag model provides a better fit.

Figure 10: Moran’s I of residuals x lagged residuals from the spatial lag model.

In the OLS model, the positive Moran’s I (0.313) indicates strong spatial autocorrelation. However, in the spatial lag model, the Moran’s I value drops sharply to –0.082, which is close to zero and slightly negative, implying that most of the spatial dependence has been removed after including the spatially lagged dependent variable (W_LNMEDHVAL). Overall, comparing all the criterion discussed above, the spatial lag model is doing much better compared to the OLS model.

Table 3: Spatial error model results.

The large positive value of λ = 0.8149 implies that strong spatial correlation exists among the error terms. In other words, unobserved factors affecting housing values in one tract are closely related to those in neighboring tracts. Since the p-value < 0.001, the λ coefficient is highly statistically significant. This allows us to reject the null hypothesis, that the model’s residuals are independent across space in favor of the alternative, that the residuals from nearby areas are correlated with one another. This establishes the fact that there is spatial autocorrelation in the error terms.

All four predictors (LNNBELPOV, PCTBACHMOR, PCTSINGLES, and PCTVACANT) have p-values of 0, meaning they are all statistically significant in the Spatial Error model. The signs and magnitudes of these variables indicate the direction of their relationships. LNNBELPOV has a negative coefficient, meaning higher poverty is associated with lower median housing values. Similarly, PCTVACANT has a negative coefficient, showing that tracts with higher vacancy rates correspond to lower housing values. In contrast, PCTBACHMOR has a positive coefficient, indicating that tracts with a higher percentage of residents holding a bachelor’s degree or higher tend to have higher housing values. Lastly, the positive coefficient for PCTSINGLES suggests that areas with more single-person households have slightly higher housing values.

Compared to the OLS model, the Spatial Error model additionally includes the LAMBDA term, which is highly significant and captures spatial dependence in the residuals. Even though all the variables remain highly significant in both models, the coefficient magnitudes are smaller in the Spatial Error model because spatial error dependence is now explicitly modeled. For instance, the coefficient of PCTVACANT decreases from –0.0191 to –0.00578, and PCTBACHMOR decreases from 0.0209 to 0.00981. This reduction indicates that once spatially correlated errors are accounted for, the direct influence of these predictors on housing values becomes smaller, and the model more accurately represents spatial structure in the data.

Since the p-value of the Breusch-Pagan test is less than 0.05, we reject the null hypothesis in favor of the alternative that the residuals are still heteroskedastic. The model improved as the Akaike Information Criterion (AIC) decreased to 755.38 in the Spatial Error model compared to 1432.99 in the OLS regression.

The Spatial Error model also shows a much higher log-likelihood value (–372.69) compared to the OLS model (–711.49). Because higher (less negative) log-likelihood values indicate a better model fit, this suggests that the Spatial Error model fits the data substantially better. Similarly, the Schwarz Criterion is much lower in the Spatial Error model (782.63) than in the OLS model (1460.24), confirming that the Spatial Error model provides a more efficient and parsimonious explanation of the data. Since the p-value is 0 for the Likelihood Ratio Test statistic (677.61), we reject the null hypothesis that there is no spatial dependence in favor of the alternative that spatial dependence exists. Therefore, the Spatial Error model provides a significantly better fit than the OLS model, effectively accounting for spatial correlation in the error terms.

Table 11: Local Moran’s I of the spatial error model residuals.

In the OLS model, the positive Moran’s I (0.313) indicates strong spatial autocorrelation. However, in the Spatial Error model, the Moran’s I value drops sharply to –0.095, which is close to zero and slightly negative, implying that most of the spatial dependence has been removed after accounting for spatial error correlation through the LAMBDA term. Overall, the spatial error model is doing better compared to the OLS model. The spatial lag model has a lower AIC of 523.48 and lower BIC of 556.18 compared to the spatial error model, which has AIC of 755.38 and a BIC of 782.63. Since both the AIC and BIC values are smaller for the spatial lag model, it is the better performing model.

Geographically Weighted Regression Results

Table 4: Geographically weighted regression model using a fixed bandwidth.

$Table 5: Geographically weighted regression model using an adaptive bandwidth.$

Table 5: Geographically weighted regression model using an adaptive bandwidth.

The quasi global R squared for fixed bandwidth GWR is 0.8438, and for adaptive bandwidth it is 0.8479, whereas for the OLS regression, the adjusted R squared is 0.6623. GWR explains a substantially larger proportion of the variation in the dependent variable, LNMEDHVAL. The Akaike Information Criterion (AIC) values for each model are as follows: OLS regression = 1432.99, Spatial Lag model = 523.48, Spatial Error model = 755.38, and GWR (Adaptive Bandwidth) = 308.71. Among all models, the GWR (Adaptive Bandwidth) model has the lowest AIC value, 308.71, substantially lower than the OLS, Spatial Lag, and Spatial Error models.

Figure 12: Scatter plot of GWR residuals and fitted values.

Figure 13: Comparison of Moran’s I across all models.(OLS, Spatial Lag, Sparial Error, and Geographically Weighted Regression

The Moran’s I scatterplot for the GWR residuals with Moran’s I value of 0.033 shows a very weak but positive spatial relationship, indicates that although the autocorrelation is statistically detectable, its magnitude is extremely small. Because the p-value for the GWR Moran’s I test is below 0.05, we reject the null hypothesis of spatial randomness, indicating that a small but statistically significant amount of spatial autocorrelation remains in the GWR residuals. The Spatial Lag and Spatial Error models both produce negative and near-zero Moran’s I values (Lag: –0.0824, Error: –0.0945), with p-values around 0.9999, meaning their residuals display no meaningful spatial autocorrelation. Because the p-values for the Spatial Lag and Spatial Error models are far above 0.05, we cannot reject the null hypothesis of no spatial autocorrelation, meaning both models completely eliminate spatial dependence in their residuals. In fact, the slightly negative values indicate very mild over-correction, which is common when spatial dependence is fully absorbed by the spatial parameters in these models. By contrast, the OLS residuals show a much larger Moran’s I of 0.2053, indicating strong spatial autocorrelation that violates OLS assumptions. Because the OLS Moran’s I test has a p-value of 0.0000, we reject the null hypothesis of spatial randomness, indicating that the OLS residuals exhibit strong and statistically significant spatial autocorrelation.

Figure 14:Standardized coefficient maps

Poverty Rate (log) The LNNEBELPOV standardized coefficient map shows large dark-blue clusters in North and West Philadelphia, indicating strong negative local relationships between poverty and the dependent variable in those neighborhoods. A few red patches appear in parts of the Northeast, suggesting locally positive and possibly significant relationships as well.
Percentage with Bachelor’s Degree The PCTBACHMOR map is dominated by red areas in central and northwestern Philadelphia, indicating positive coefficients above +2, meaning the share of adults with a bachelor’s degree has a strong, locally significant positive effect in these neighborhoods. Some light blue regions in the far Northeast suggest weaker or negative effects elsewhere.
Percentage of Single-Person Households The PCTSINGLES map shows deep red areas in Northwest and parts of South Philadelphia, meaning the percentage of single-person households has a strong, positive, possibly significant effect in those locations. Conversely, dark blue areas in North and West Philadelphia indicate negative and possibly significant local effects.
Percentage of Vacant Housing For PCTVACANT, dark blue areas appear throughout North, West, and Southwest Philadelphia, indicating negative standardized coefficients below –2, meaning vacancy is a strong negative predictor locally. A few small red patches indicate places where vacancy may have a positive and possibly significant association.

Figure 15: Choropleth map of local R-squared values from the GWR model.

The choropleth map of local R-squared values shows that the explanatory power of the GWR model varies substantially across Philadelphia. Many neighborhoods have very low R-squared values (below 0.20), indicating a lousy fit in most parts of the city, where the four predictors explain little of the variation in the dependent variable. However, there are distinct pockets, especially in the far Northeast and Northwest, where the model fits much better, with local R-squared values rising above 0.50. These darker regions show where the GWR model captures strong local relationships, while the lighter areas reveal places where the predictors perform poorly.

Discussion

In this paper, we examined how neighborhood characteristics like poverty, education, vacancy rates, and single-person households affect median housing values across Philadelphia’s census block groups using four regression methods: OLS, Spatial Lag, Spatial Error, and Geographically Weighted Regression (GWR). Our findings show that spatial dependence and heterogeneity significantly influence housing values. While the Spatial Lag and Spatial Error models improved upon OLS by accounting for spatial autocorrelation, GWR with an adaptive bandwidth provided the best overall fit, with the lowest AIC (308.71). This indicates that relationships between predictors and housing values vary across space and that GWR most effectively captures these local variations. However, the Moran’s I test reveals that GWR does not completely eliminate spatial autocorrelation: the p-value for the GWR residuals is below 0.05, meaning we reject the null hypothesis of spatial randomness, and a small amount of spatial dependence remains. In contrast, the Spatial Lag and Spatial Error models produce residuals with high p-values, so we fail to reject the null hypothesis and conclude that those models successfully remove spatial autocorrelation. Finally, although GWR provides the best model fit and captures meaningful local patterns, many spatial statisticians caution that it should be treated as an exploratory tool rather than a definitive substitute for global spatial regression models. Therefore, the Spatial Lag model with the next lowest AIC of 523.48 serves as the next best choice, with a spatial dependence correction. Additionally, several predictors were found to be strongly correlated. Poverty (LNNBELPOV) and vacancy rate (PCTVACANT) were negatively correlated with housing values, while education (PCTBACHMOR) and single-person households (PCTSINGLES) were positively correlated, indicating clear socioeconomic patterns across Philadelphia. A key limitation of this analysis is that several OLS assumptions were not fully met. The assumption of homoscedasticity was violated, as indicated by the Breusch–Pagan, Koenker–Bassett, and White tests, suggesting non-constant variance in the residuals. Additionally, the assumption of normality of residuals was not satisfied, based on the significant Jarque–Bera test results. Finally, the assumption of independent observations was violated due to significant spatial autocorrelation in the OLS residuals (Moran’s I = 0.313), necessitating the use of spatial regression models. For the spatial lag model and spatial error model, the assumption of homoscedasticity was still not met, as indicated by the significant Breusch–Pagan test (p < 0.05), meaning the residual variance was not constant across observations. Weighted, or spatially lagged residuals, which are used in spatial lag regression, are created by averaging each observation’s residuals with those of its neighboring observations, using a spatial weights matrix, for example, the queen contiguity matrix. In contrast, spatial lag model residuals are the residuals obtained after fitting a Spatial Lag regression, which already includes the spatially lagged dependent variable as a predictor. These residuals represent the remaining unexplained variation after spatial dependence has been explicitly modeled. Spatial lag regression assumes that the spatial autocorrelation in the OLS residuals can be accounted for by including the spatial lag of the dependent variable as a predictor in the regression. On the other hand, spatial error regression assumes that the spatial autocorrelation in the OLS residuals is due to a missing spatial variable, and includes a spatially weighted error term as a predictor in the regression. However, both of these methods assume that we’re dealing with spatial stationarity, which is the assumption that modeled relationships are constant across space. ArcGIS is problematic for GWR because it does not adjust for the massive multiple-testing problem that occurs when thousands of local t-tests are performed. With a 4-predictor model at around 2,000 locations, GWR produces about 10,000 significance tests, and at α = 0.05 we would expect roughly 500 false positives simply by chance. Since ArcGIS provides no correction for this, its GWR significance results can be misleading.

References

Brusilovskiy, E (2025) Statistical and Data Mining Methods for Urban Data Analysis.

Devore, J.L. (2015) Probability and Statistics for Engineering and the Sciences. Cengage Learning, Boston.

Hillier, A. E. (2003). Spatial analysis of historical redlining: A methodological exploration. Journal of Housing Research, 14(1), 137-167

Hwang, Jackelyn, and Lei Ding. “Unequal Displacement: Gentrification, Racial Stratification, and Residential Destinations in Philadelphia.” The American journal of sociology. 126.2 (2020): 354–406. Web.

Moran, P. A. P. (1950). “Notes on Continuous Stochastic Phenomena.” Biometrika, 37(1-2), 17–23.

Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region. Economic geography, 46(sup1), 234-240.

Wikipedia Contributors. (2025, September 5). Jarque–Bera test. In Wikipedia, The Free Encyclopedia. Retrieved [date you accessed it], from https://en.wikipedia.org/wiki/Jarque–Bera_test