1 Introduction

1.1 Problem & Setting

As in our previous homework assignment, this study aims to examine the relationship between median house values and several neighborhood characteristics. It will establish a model for predicting median house values, with a geographic focus on Philadelphia. Earlier models of house value prediction—especially the influential HUD report, Characteristic Prices of Housing in Fifty-Nine Metropolitan Areas—propose a hedonic model, which has been widely adopted and which we will use here.1

\[R = f (S, N, C, t)\] where

\(R =\) rent or value,

\(S =\) structural characteristics,

\(N =\) neighborhood characteristics (including location),

\(C =\) contract conditions (implicit and explicit), and

\(t =\) time trend (accounting for inflation over time).

A subsequent study of place-to-place housing price indexes and their determinants used the log-linear form of the hedonic model,2 expressed as:

\[\ln R = \beta_0 + S\beta_1 + N\beta_2 + L\beta_3 + C\beta_4 + \epsilon \] where

\(\ln R =\) the natural log of rent,

\(R, S, N, C =\) same as above,

\(L =\) location (here separated out from \(N\)),

\(\beta_i =\) the hedonic regression coefficient, and

\(\epsilon\) = the residuals.


Because the researchers cited “four appealing characteristics” of the log-linear form, including its mitigation of a common form of heteroskedasticity,3 our study will also use logarithmic transformations of some variables.


Both studies identified and categorized structural, neighborhood, and contract characteristics as three separate groups. Nonetheless, their analyses primarily focused on structural characteristics (e.g., the number of bedrooms) and oversimplified the neighborhood characteristics (e.g., using a generic subjective rating of how “good” a neighborhood is instead of more objective measures). Given that previous studies largely focused on structural characteristics, we will adopt similar regression models to explore four neighborhood characteristics that were previously overlooked:

  1. the percentage of individuals with Bachelor’s degrees or higher,

  2. the percentage of vacant houses,

  3. the number of households living in poverty, and

  4. the percentage of single family housing units.

In our previous homework assignment, we found that the map of standardized residuals showed signs of spatial autocorrelation, which undermined the viability of our OLS regression model. In this assignment, we attempt to account for the influence of spatial autocorrelation by using various spatial regression techniques, including spatial lag, spatial error, and geographically weighted regression.

1.2 Prior Analysis

In Homework #1, we utilized OLS (ordinary least squares) regression to build a model examining the relationship between a dependent variable (log-transformed median house values) and four predictors (percentage of vacant houses, percentage of single family houses, percentage of residents holding a Bachelor’s degree or higher, and log-transformed poverty rate).

1.3 Issues with OLS

Although the results of our OLS regression model were relatively acceptable for social science research, the model inevitably suffered from OLS regression’s inability to address the spatial component of our dataset. Given the geographic nature of house price data, taking spatial elements into consideration may substantially improve our model.

1.4 Improving on OLS

This project aims to incorporate the spatial components of our data by considering spatial lag, spatial error, and geographically weighted regression (GWR). In the results and discussion sections, we will revisit the OLS approach from the last project, compare the different approaches, and assess whether spatial lag, spatial error, and/or GWR perform better than OLS.

2 Methods

2.1 Spatial Autocorrelation

Spatial autocorrelation looks at how the value of a variable for an observation varies in comparison to values for other observations nearby. In other words, it is the relationship of values of a single variable at neighboring locations. Positive spatial autocorrelation suggests more similar values at proximate locations, while negative spatial autocorrelation suggests greater differences in the value of the variable for more proximate observations.

2.1.1 Tobler’s 1st Law

Spatial autocorrelation is based on Waldo Tobler’s First Law of Geography. Proposed in 1970, Tobler’s First Law states that, “everything is related to everything else, but near things are more related than distant things.” Because of this potential association between values of a variable and their locations, spatial statistics and autocorrelation should be taken into consideration in models.

2.1.2 Moran’s I

One of the most commonly used tests for spatial autocorrelation (also known as spatial dependence) is Moran’s I. Larger positive values of Moran’s I indicate stronger positive spatial autocorrelation (geospatial clustering of similar values), while larger negative values indicate stronger negative spatial autocorrelation (spatial dispersion of similar values). Values closer to 0 indicate little to no spatial autocorrelation, i.e., a random distribution.

The global Moran’s I is computed as: \[ I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \] where \(n\) is the number of observations, \(x_i\) is the value of the variable at location \(i\), \(\bar{x}\) is the mean of the variable, and \(w_{ij}\) is the spatial weight between locations \(i\) and \(j\) (see the discussion of weight matrices below). The numerator grows when neighboring deviations from the mean share the same sign, so clustering of similar values pushes Moran’s I toward positive values, while dispersion pushes it toward negative values.
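
To make the formula concrete, here is a minimal sketch of Moran’s I computed directly from the definition; x and W are hypothetical stand-ins for a numeric vector of values and a full n-by-n spatial weight matrix (our actual analysis uses the sfdep package, below).

# Minimal sketch: Moran's I straight from the formula.
# x: numeric vector of values; W: n x n spatial weight matrix (hypothetical inputs).
morans_i = function(x, W) {
  n = length(x)
  z = x - mean(x)               # deviations from the mean
  num = sum(W * outer(z, z))    # sum over i,j of w_ij * z_i * z_j
  den = sum(z^2)                # sum over i of z_i^2
  (n / sum(W)) * (num / den)
}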

2.1.3 Weight Matrices

A weight matrix is an \(n \times n\) table (given \(n\) observations in the dataset) used to quantify the extent of spatial autocorrelation when calculating spatial autocorrelation indices. There are different types of weight matrices, with “rook” and “queen” being the two most frequently used contiguity-based spatial weights. Statisticians often try several different weight matrices to ensure that their results are robust, unless they have theoretical reasoning that strongly suggests the application of a particular weight matrix.

In this project, we are using a queen weight matrix, which counts as neighbors the areas directly adjacent to and diagonally touching the area of interest. In comparison to a rook weight matrix, which does not account for diagonal neighbors (reflecting only four directions: up, down, left, and right), queen weights do a better job of reflecting neighboring values in all eight directions. A short illustration follows.
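
For illustration only (our own pipeline builds weights with sfdep below), queen and rook contiguity neighbors can be constructed with spdep; polys stands in for any sf polygon layer.

library(spdep)
queen_nb = poly2nb(polys, queen = TRUE)     # neighbors share an edge or a corner
rook_nb  = poly2nb(polys, queen = FALSE)    # neighbors share an edge only
queen_lw = nb2listw(queen_nb, style = "W")  # row-standardized queen weights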

2.1.4 Significance Testing

To test the significance of our spatial autocorrelation statistic (Moran’s I), we use a Monte Carlo approach. This process requires us to shuffle the values of the dependent variable 999 times, recalculate the Moran’s I value for each permutation, then rank and compare the original Moran’s I value with all 999 permutations. We calculate a pseudo p-value by taking the rank of the original Moran’s I among the permutations and dividing it by 1,000. The pseudo p-value helps us test the significance of our Moran’s I and test our hypotheses (a code sketch of the procedure appears after the hypotheses), which are:

\(H_0\) = there is no spatial autocorrelation (i.e., the pattern is random)

\(H_a\) = there is spatial autocorrelation, whether positive or negative
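
A minimal sketch of this permutation procedure, assuming the hypothetical morans_i() helper sketched in section 2.1.2 and a one-sided comparison against positive autocorrelation:

moran_mc = function(x, W, n_perm = 999) {
  observed = morans_i(x, W)
  # shuffle the values across locations and recompute Moran's I each time
  perms = replicate(n_perm, morans_i(sample(x), W))
  # pseudo p-value: rank of the observed statistic among all permutations
  pseudo_p = (sum(perms >= observed) + 1) / (n_perm + 1)
  list(observed = observed, pseudo_p = pseudo_p)
}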

2.1.5 Local vs. Global Spatial Autocorrelation

Developed by Luc Anselin, Local Indicators of Spatial Association (LISA) are used for calculating local spatial autocorrelation. Unlike the global Moran’s I, which generates one value for the entire dataset, LISAs generate a local Moran’s I value for each location. While the global Moran’s I helps us understand the dataset as a whole, the local Moran’s I values help us dig into specific locations and see how they compare to their neighbors. Our project calculates local Moran’s I values in order to carry out further analysis of spatial autocorrelation.

2.2 OLS Regression and Assumptions

2.2.1 Overview of OLS Regression

OLS (Ordinary least squares) is a method to explore relationships between a dependent variable and one or more explanatory variables. It considers the strength and direction of these relationships and the goodness of model fit. (For more information, please refer to our previous homework assignment.) When running an OLS regression, it is crucial to ensure that the following five assumptions are not violated:

  1. A linear relationship between the dependent variable and predictors
  2. Normality of residuals
  3. Homoscedasticity
  4. Independence of observations (no spatial, temporal or other forms of dependence in the data)
  5. No multicollinearity

In Homework #1, we showed that three of the above five OLS assumptions were violated. First, the relationships between our dependent variable and the predictors were generally not linear. Additionally, the scatterplot of predicted values against standardized residuals showed heteroscedasticity. Finally, our map of the standardized residuals showed signs of spatial autocorrelation, indicating systematic patterns in our data and violating the fourth assumption. These three issues undermine key assumptions of linear regression.

2.2.2 Random Errors in Spatial Data

When residuals contain a systematic pattern, the assumption of randomness of residuals is not met. Specifically, if we have spatially autocorrelated OLS residuals, there is systematic under-prediction or over-prediction in certain parts of the study region; furthermore, the significance estimates for the \(\beta\) coefficients in OLS may be incorrect (inflated). There are a couple of ways to check for spatial autocorrelation of the OLS residuals, though the first thing to do might be to map the residuals. We want the map of the residuals to look random; this would imply that our model’s residuals are not spatially autocorrelated. After looking at the residual maps, we would want to compute the global Moran’s I of the residuals. Ideally, we will see a Moran’s I close to 0; if the Moran’s I is instead substantially different from 0 and significant, there is an obvious problem with spatial autocorrelation of residuals.

Below in section 3.2.2, we will regress residuals \(\hat{ε}\) on the spatially lagged residuals \(W\hat{ε}\). Ideally, we would see that there is no relationship between \(\hat{ε}\) and \(W\hat{ε}\) – that is, that the coefficient of \(W\hat{ε}\), which we denote as \(\beta_1\), is not significantly different from 0. \(\beta_1\) can range between -1 and 1. We are able to regress residuals on the nearest neighbor residuals, thereby filtering the spatial information out of the OLS residuals and decomposing the residuals \(ε\) into two parts: one with a spatial pattern \(λWε\) and one which is simply random noise \(u\). \[ y = β_0 + β_1X_1 +...+ β_nX_n +λWε+ u\]

2.2.3 Testing Other Regression Assumptions in R

Below, in section 3.2 and section 3.3, we will test the null hypothesis of homoscedasticity against the alternative hypothesis of heteroscedasticity using the following tests:

  1. Koenker-Bassett Test : Robust Tests for Heteroscedasticity Based on Regression Quantiles
  2. Breusch-Pagan Test : Simple test for heteroscedasticity and random coefficient variation
  3. White Test : A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.

For these tests, \(H_0\) is the assumption of homoscedasticity, while \(H_a\) is the presence of heteroscedasticity. If the p-value < 0.05, we can reject \(H_0\).

Another important assumption is the normality of residuals. When residuals are not normal, we may run into issues with both OLS regression and spatial regression. We can check the normality of residuals by:

  1. Examining the histogram of residuals, and
  2. Running the Jarque-Bera test.

Here, \(H_0\) is the normality of residuals, while \(H_a\) is a non-normal distribution of residuals. If the p-value < 0.05, we can reject \(H_0\).

2.3 Spatial Lag and Spatial Error Regression

2.3.1 Spatial Lag Regression

The spatial lag model assumes that the value of the dependent variable at one location is associated with the values of that variable in nearby locations. “Nearby” is as defined by the weights matrix W (rook, queen, or within a certain distance of one another). In other words, the spatial lag model includes the spatial lag of the dependent variable as a predictor. In this assignment, our equation for the spatial lag model is as follows: \[ LNMEDHVAL = ρWy\;+\;β_0\;+\;LNNBELPOVβ_1\;+\;PCTBACHMORβ_2\;+\;PCTSINGLESβ_3\;+\;PCTVACANTβ_4\;+\;ε \] Here, \(ρ\) is the coefficient of the \(y\)-lag variable \(Wy\), just as \(β_1\) is the coefficient of the variable \(LNNBELPOV\), and \(ε\) is the residual term.

It may be helpful to consider spatial lag through the analogy of this Stats Wizard Penguin (SWP) and his Inept Companions (ICs): although the SWP may have a stats wizardry quotient (SWQ) of over 9,000, its spatial lag (i.e., the average value of the SWQs of its neighboring ICs) is 0, because all the neighboring penguins know exactly 0 about statistics. (Crucially, we should use a symmetrical matrix of statistically incompetent penguins, as symmetry is a key requirement of spatial lag and spatial error regressions.)

2.3.2 Spatial Error Regression

The spatial error model assumes that the residual at one location is associated with residuals at nearby locations. “Nearby” is as defined by the weights matrix W (rook, queen, or within a certain distance of one another). This requires two steps: First, we run our OLS regression, where we regress Y on the predictors. \[ LNMEDHVAL = β_0\;+\;LNNBELPOVβ_1\;+\;PCTBACHMORβ_2\;+\;PCTSINGLESβ_3\;+\;PCTVACANTβ_4\;+\;ε \] Second, we regress the residuals on the nearest neighbor residuals, thereby filtering the spatial information out of the OLS residuals and decomposing the residuals \(ε\) into two parts, one with a spatial pattern \(λWε\) and one which is simply random noise \(u\).

\[ ε = λWε\;+\; u\] The part with a spatial pattern can be thought of as some variable with a spatial component missing from the first OLS regression. \[ LNMEDHVAL = β_0\;+\;LNNBELPOVβ_1\;+\;PCTBACHMORβ_2\;+\;PCTSINGLESβ_3\;+\;PCTVACANTβ_4\;+\; λWε\;+\; u \]

2.3.3 Assumptions for Spatial Lag and Error Regression

With both spatial lag and spatial error regression models, we still retain all of the assumptions of OLS except for the spatial independence of observations. The relationship between the dependent variable and each of the predictors must still be linear, the residuals must still be normal and homoscedastic, and there should be no multicollinearity.

2.3.4 Goals of Spatial Lag and Error Regression

The goal of spatial regression is to account for the possibility of spatial dependency in the residuals. By using this method, we aim to remove spatial autocorrelation from our model and make our model less heteroscedastic.

2.3.5 Regression Performance Compared

Various measures can be used to compare models. The first are the Akaike Information Criterion (AIC) and the Schwarz Criterion, which measure the goodness of fit of an estimated statistical model. They are relative measures of the information lost when a given model is used to describe reality, and they can be said to describe the trade-off between the precision and the complexity of the model. The equation for the AIC is: \[AIC = 2k - 2\ln(\hat{L})\] where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized value of the likelihood function. Given a set of candidate models for the data, the preferred model is the one with the lowest AIC value. Thus, the AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, which is desirable because increasing the number of parameters in the model almost always improves the goodness of fit.
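
Both criteria are available directly in R for fitted model objects; a quick sketch, assuming reg is a fitted model object (as in section 3.2.1):

AIC(reg)  # Akaike Information Criterion: 2k - 2*ln(L-hat); lower is better
BIC(reg)  # Schwarz Criterion: heavier penalty on parameters as n grows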

The second option is the log likelihood, which is associated with the maximum likelihood method of fitting a statistical model to the data and estimating model parameters. The maximum likelihood approach picks the values of the model parameters that make the data more likely than any other values of the parameters would make them. Thus, the higher the log likelihood, the better the model fit. This comparison should be used for nested models only. OLS is a special case of spatial lag and spatial error models. So we can compare the OLS model with the spatial lag and spatial error models, but we cannot use log likelihood to compare the spatial lag and spatial error models to each other because they are not a special case of each other.

A likelihood ratio test is another tool to compare the OLS model to the spatial model. \(H_0\) for this test is that the spatial model is not a better specification than the OLS model. If the p-value < 0.05, we reject \(H_0\) and state that the spatial lag (error) model is better than the OLS model.

We can use Moran’s I to figure out which model is better among the OLS, spatial lag, and spatial error models. First, we run the OLS regression and examine the residuals for spatial autocorrelation (Moran’s I). If there is no significant spatial autocorrelation in the residuals, we stop. If there is significant spatial autocorrelation in the residuals, we look at the Lagrange Multiplier (lag) and Lagrange Multiplier (error) statistics, which tell us whether we should run a spatial lag or spatial error model. If neither Lagrange Multiplier is significant, then the OLS model is good enough. If the Lagrange Multiplier for the spatial lag model is significant and the Lagrange Multiplier for the spatial error model is not, then we run the spatial lag model. If the Lagrange Multiplier for the spatial error model is significant and the Lagrange Multiplier for the spatial lag model is not, then we run the spatial error model. If both are significant (as in this assignment), we look at the Robust LM statistics for both the spatial lag and spatial error models, choose the model with the lower p-value (or higher test statistic), and run that model.
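
This decision procedure can be sketched with spdep’s Lagrange Multiplier diagnostics; a sketch only, assuming reg is our fitted OLS model and the queen neighbor list built in section 3.1.2:

# plain and robust LM diagnostics for the lag and error specifications
lm.LMtests(reg, listw = nb2listw(reg_data$nb),
           test = c("LMlag", "LMerr", "RLMlag", "RLMerr"))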

2.4 Geographically Weighted Regression

2.4.1 Simpson’s Paradox and Local Regression

Simpson’s Paradox is a statistical phenomenon in which an association between two variables in a population emerges, disappears, or reverses when the population is divided into sub-populations. OLS regression, spatial lag regression, and spatial error regression all assume spatial stationarity, i.e., that the modeled relationships are constant across space. In spatial statistics, stationarity means the homogeneity of an effect: a process works the same way regardless of where you observe it. It is a necessary assumption for these global models, but it is likely untrue in practice. For example, higher home values might be correlated with higher crime rates in some locations and lower crime rates in others. The aim of geographically weighted regression is to divide the study area into many locations and conduct a separate local regression at each, instead of fitting a single, global regression.

2.4.2 GWR Equations

OLS models are run to determine the global regression coefficients (\(β\)) for the independent variables, as follows: \[y_i = β_0 + \displaystyle\sum_{k=1}^{n}{β_kx_{ik}} + ε_i\]

where \(y_i\) is the \(i\)th observation of the dependent variable, \(x_{ik}\) is the \(i\)th observation of the \(k\)th independent variable, the \(ε_i\)s are independent normally distributed error terms with zero means, and each \(β_k\) must be determined from a sample of \(n\) observations. Usually the least squares method is used to estimate the \(β_k\)s. Using matrix notation, this may be expressed as follows: \[\hat{β} = (x^tx)^{-1}x^ty\] where the independent observations are the columns of \(x\) and the dependent observations are the single column vector \(y\). The column vector \(\hat{β}\) contains the coefficient estimates. Each of these estimates can be thought of as a “rate of change” between one of the independent variables and the dependent variable.
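
As a minimal sketch, this matrix computation can be written directly in R; X and y are hypothetical stand-ins for the design matrix (including an intercept column) and the response vector:

# beta-hat = (X'X)^{-1} X'y
beta_hat = solve(t(X) %*% X) %*% t(X) %*% y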

Once the independent variables that you wish to retain in the model are identified, and there is a theoretical basis for thinking that the relationships may differ by space, GWR may be an appropriate next step. GWR is a relatively simple technique that extends the traditional regression framework by allowing local variations in rates of change so that the coefficients in the model are specific to a location \(i\), rather than being global estimates. The regression equation is then:

\[y_i = β_{i0} + \displaystyle\sum_{k=1}^{n}{β_{ik}x_{ik}} + ε_i\] where \(β_{ik}\) is the value of the \(k\)th parameter at location \(i\). Using a weighted least squares approach to calibrating regression models, different emphases can be placed on different observations when generating the estimated parameters. In ordinary least squares, the coefficient estimates minimize the sum of the squared differences between the predicted and actual \(y_i\)s. In weighted least squares, a weighting factor \(w_i\) is applied to each squared difference before minimizing, so that the inaccuracy of some predictions carries more of a penalty than others. If \(w\) is the diagonal matrix of the \(w_i\)s, then the estimated coefficients satisfy \[\hat{β} = (x^twx)^{-1}x^twy\] In GWR, weighting an observation in accordance with its proximity to \(i\) allows an estimate of \(β_{ik}\) to be made that meets the criterion of “closeness of calibration points” set out above. Note that usually in weighted regression models the values of \(w_i\) are constant, so that only one calibration has to be carried out to obtain a set of coefficient estimates; but in this case \(w\) varies with \(i\), so that a different calibration exists for every point in the study area. The parameter estimation formula can therefore be written more generally as \[\hat{β}_i = (x^tw_ix)^{-1}x^tw_iy\]
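
The location-specific estimate differs only in the weight matrix; a sketch, assuming w_i is a hypothetical vector of weights for the observations relative to regression point i:

Wi = diag(w_i)  # diagonal weight matrix for regression point i
# beta-hat_i = (X' W_i X)^{-1} X' W_i y, recomputed for every location i
beta_hat_i = solve(t(X) %*% Wi %*% X) %*% t(X) %*% Wi %*% y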

2.4.3 Running Local Regression

We need multiple locations to run a regression, not just a single location. GWR uses other observations in the dataset to run the regression. Observations that are close to location \(i\) are given greater weights. The weight of an observation varies with proximity to location \(i\). Normally, observations that are closer to location \(i\) have a stronger influence on the estimation of the parameters for location \(i\).

An initial step toward weighting based on locality might be to exclude from the model calibration any observations that are further than some distance \(d\) from the locality. This is equivalent to setting their weights to zero, giving a weighting function of \[w_{ij} = \begin{cases} 1, & if\;\;d_{ij} < d\\ 0, & otherwise \end{cases}\] However, this spatial weighting function suffers from discontinuity: as \(i\) varies around the study area, the regression coefficients could change drastically as one sample point moves into or out of the circular buffer around \(i\) that defines the data to be included in the calibration for location \(i\).

2.4.4 Bandwidth (Adaptive vs. Fixed)

There are a couple of ways to weight nearby locations given the problem of discontinuity. Graphically, GWR involves fitting a spatial kernel to the data, as described in Figure 1. For a given regression point \(X\), the weight \(W\) of a data point is at a maximum at the location of the regression point, and decreases gradually as the distance between the two points increases. A regression model is thus calibrated locally by moving the regression point across the study area. For each location, the data are weighted differently, so that the resulting estimates are unique to a particular location.

Figure 1. GWR with fixed spatial kernel. Source: Fotheringham et al. (2002)

With fixed bandwidth \(h\), \(w_{ij}\) is as follows: \[w_{ij} = \begin{cases} e^{-0.5(\frac{distance_{ij}}{h})^2}, & if\;\;distance_{ij}\leq h\\ 0, & otherwise \end{cases}\] \(Distance_{ij}\) is the distance between regression point \(i\) and data point \(j\); the number of observations will vary around each point \(i\), but the bandwidth distance \(h\) remains constant.
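
A one-line sketch of this fixed Gaussian kernel, with dij a hypothetical vector of distances from regression point i and h the fixed bandwidth:

gauss_w = function(dij, h) ifelse(dij <= h, exp(-0.5 * (dij / h)^2), 0)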

A key issue is deciding between two types of spatial kernel: a fixed kernel or an adaptive kernel. Intuitively, a fixed kernel uses a fixed bandwidth to define a region around all regression points, as displayed in Figure 1: the extent of the kernel is determined by distance to a given regression point, and the kernel is identical at any point in space. An adaptive kernel uses a varying bandwidth to define the region around regression points, as displayed in Figure 2: the extent of the kernel is determined by the number of nearest neighbors from a given regression point, so the kernels have larger bandwidths where the data are sparse.

Figure 2. GWR with adaptive spatial kernel. Source: Fotheringham et al. (2002)

With adaptive bandwidth \(h\), \(w_{ij}\) is as follows: \[w_{ij} = \begin{cases} \Bigl[1-(\frac{distance_{ij}}{h})^2\Bigr]^2, & if\;\;j\;\;is\;one\;of\;i's\;N\;nearest\;neighbors\\ 0, & otherwise \end{cases}\] \(Distance_{ij}\) is the distance between regression point \(i\) and data point \(j\); the number of observations remains constant around each point \(i\), but the bandwidth distance \(h\) varies. For our data, an adaptive bandwidth is more appropriate because the density of observations varies across space.
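
And the corresponding bi-square sketch, where h is now the distance from point i to its Nth nearest neighbor, so the weight falls smoothly to zero at the edge of the adaptive kernel:

bisq_w = function(dij, h) ifelse(dij <= h, (1 - (dij / h)^2)^2, 0)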

2.4.5 Assumptions in GWR

Many of the assumptions in OLS, such as the normality of residuals, homoscedasticity, and the absence of multicollinearity, still need to be met for GWR. GWR builds a local regression equation for each feature in the dataset, so local multicollinearity is a particular concern. When the values of an explanatory variable cluster spatially in a substantial way, we are likely to have a problem: within a kernel, the variable tells the same story at every location. We may also have a problem when two or more variables have similar patterns of clustering in a certain region, or when two variables both have very high (or very low) values at all locations in one part of the study region. The condition number in the GWR results attribute table indicates when results are unstable due to local multicollinearity. To check for this issue, we can look through the Cond.Number field of the results table. As a rule of thumb, multicollinearity issues likely arise for features with a condition number larger than 30, equal to Null, or equal to -1.7976931348623158e+308.

2.4.6 GWR and P-Values

In a global model it is usual to test whether the parameter estimates are significantly different from zero, which can be accomplished with a t-test. The situation with GWR is more complex and is the subject of current research. As there is one set of parameters associated with each regression point, along with one set of standard errors, potentially thousands of tests would be required to determine whether parameters are locally significant. The logic behind these tests means that if the 0.05 significance level is used, we would expect 5 tests in every 100 to be significant purely by chance. With a 5-variable model estimated at 20,000 regression points, we would expect 5,000 of these tests to return a significant result. This is the problem of multiple testing.
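
One common mitigation, which we note but do not apply in our analysis, is to adjust the local p-values for multiple testing; a sketch assuming the lisa$p_ii column computed in section 3.1.3:

# control the false discovery rate across the many local tests
p_fdr = p.adjust(lisa$p_ii, method = "fdr")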

2.5 Tools

In our assignment, we rely on the usual tidyverse, sf, and tmap packages in R. We also incorporate a number of spatial packages such as sp, rgdal, and rgeos. Most of our spatial regression work relies on spdep, spgwr, and GWmodel. Importantly, we use the recently released sfdep package, essentially a tidy version of spdep that allows us to work with sf objects. Finally, we use lmtest, whitestrap, and tseries for the Breusch-Pagan, White, and Jarque-Bera tests, respectively.

2.6 Import and Prepare Data

As in Homework #1, we are completing this assignment in R (rather than using GeoDa and ArcGIS). Once again, we will read in the original shapefile of data and transform the relevant columns to produce the initial data that we need for our analysis.

reg_data = st_read('C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/musa-500-hmwk-2/reg_data_files/Regression Data.shp')
## Reading layer `Regression Data' from data source 
##   `C:\Users\Nissim\Desktop\Fall 2022\Spat Stats\musa-500-hmwk-2\reg_data_files\Regression Data.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1720 features and 14 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 2660605 ymin: 207610.6 xmax: 2750171 ymax: 304858.8
## Projected CRS: NAD83 / Pennsylvania South (ftUS)
#Min 
#reg_data = st_read("C:/Users/vestalk/Desktop/00_Upenn/20.Fall/03.MUSA 5000 Spatial Statistics and Data Analysis/Assignment/HW2/Project/00.221025/musa-500-hmwk-2/data")

#Ann: 
#reg_data = st_read(file.path("/Users/annzhang/Downloads/reg_data_2/Regression Data.shp"))

#define a helper function that returns the fraction of non-zero values in a column

# If the fraction is 1 (no zeros), the mutate() below applies log(x) directly;
# otherwise it applies log(1 + x) to avoid taking the log of zero.

col_zeros = function(a, b) {
                  pct_col_zeros = count(subset(st_drop_geometry(a), b != 0)) |>
                                      pull(n) / nrow(st_drop_geometry(a))
                  return(pct_col_zeros)
                  }
#apply function with case_when statement
#case_when is a vectorized function, while ifelse is not.
#running this with ifelse will result in all row values in the mutated column being the same.
reg_data = reg_data |>
            mutate(
                ln_med_h_val = case_when(col_zeros(reg_data, reg_data$MEDHVAL) == 1 ~ log(reg_data$MEDHVAL),
                                     TRUE ~ log(1 + reg_data$MEDHVAL)),
                   ln_pct_bach_more = case_when(col_zeros(reg_data, reg_data$PCTBACHMOR) == 1 ~ log(reg_data$PCTBACHMOR),
                                     TRUE ~ log(1 + reg_data$PCTBACHMOR)),
                   ln_n_bel_pov_100 = case_when(col_zeros(reg_data, reg_data$NBelPov100) == 1 ~ log(reg_data$NBelPov100),
                                     TRUE ~ log(1 + reg_data$NBelPov100)),
                   ln_pct_vacant = case_when(col_zeros(reg_data, reg_data$PCTVACANT) == 1 ~ log(reg_data$PCTVACANT),
                                     TRUE ~ log(1 + reg_data$PCTVACANT)),
                   ln_pct_singles = case_when(col_zeros(reg_data, reg_data$PCTSINGLES) == 1 ~ log(reg_data$PCTSINGLES),
                                     TRUE ~ log(1 + reg_data$PCTSINGLES)),
                  )

3 Results

3.1 Spatial Autocorrelation

3.1.1 Distribution of Median House Value

The map below displays the spatial distribution of median house value by census tract in Philadelphia. As a starting point, this map offers an overview of (log) median house values in the city. We can observe a clustering of higher values in the northeast and northwest of the city and a clustering of lower values in the central and northern parts. Statistical tests are applied to further explore these patterns, and the results are presented below.

tmap_mode("plot")
## tmap mode set to plotting
tm_shape(reg_data) + 
  tm_polygons(title = "ln(MEDHVAL)", col = "LNMEDHVAL", border.col = NA, border.alpha = 0, lwd = 0, palette = "viridis", style = "jenks") + 
 # tm_shape(phl_city_lims) +
#  tm_borders(col = "grey", lwd = 5) +
  tm_compass(position = c("left", "top")) +
  tm_layout(main.title = "Median House Values by Tract",
            legend.position = c("right", "bottom"),
            frame = FALSE)

3.1.2 Global Moran’s I

Below, we calculate a global Moran’s I value and analyze it using a Monte Carlo approach, which compares our calculated Moran’s I to 999 random permutations.

First, we’ll prepare a weight matrix based on Queen neighbors.4

The results of the global Moran’s I test are indicated below, followed by a histogram. According to the results and the histogram (where the red line marks the observed value), the global Moran’s I value is 0.79, with a pseudo p-value < 0.05. The results suggest that log median house value (LNMEDHVAL) is significantly spatially autocorrelated.

reg_data = reg_data |>
              mutate(nb_geom = st_geometry(reg_data), #geoms for poly2nb
                     nb = st_contiguity(nb_geom), # generate neighbors
                     weights = st_weights(nb)) # weight matrices from neighbors

summary(reg_data$nb)
## Neighbour list object:
## Number of regions: 1720 
## Number of nonzero links: 10526 
## Percentage nonzero weights: 0.3558004 
## Average number of links: 6.119767 
## Link number distribution:
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  18  27 
##   4  16  52 175 348 493 344 177  62  28  10   4   2   1   1   1   1   1 
## 4 least connected regions:
## 441 708 1391 1665 with 1 link
## 1 most connected region:
## 1636 with 27 links

We can visualize the connections between neighbors based on the weight matrix that we’ve constructed:

reg_data_lines = nb2lines(reg_data$nb, 
                          coords = st_centroid(reg_data$geometry), 
                          as_sf = TRUE)

tm_shape(reg_data) + 
  tm_borders(col = "grey", lwd = 0.5) + 
tm_shape(reg_data) +
  tm_dots() +
tm_shape(reg_data_lines) +
  tm_lines(col = "red") +
tm_layout(frame = FALSE)

Now we can apply the weight matrix to calculate the global Moran’s I.

# Global Moran's I

# now use that weight matrix to calculate global moran's i for MEDHVAL

global_moran(reg_data$ln_med_h_val, reg_data$nb, reg_data$weights)$`I`
## [1] 0.7935638

Having calculated a global Moran’s I, we’ll use a Monte Carlo approach to test whether it is statistically significant. If the null hypothesis is true, then the probability of drawing the observed data is the same as that of any other permutation of the values amongst the polygons. Thus, if \(m\) is the number of simulated Moran’s I values exceeding the observed one, and \(M\) is the total number of simulations, then the probability of getting the observed Moran’s I or greater is

\[p = \frac{m+1}{M+1}\]

# reshuffling test w 999 permutations
# present Moran's I value, histogram of Moran's I values with all permutations, and p-value

moranMC = global_moran_perm(reg_data$ln_med_h_val, reg_data$nb, reg_data$weights, alternative = "two.sided", 999)

moranMC
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  x 
## weights: listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.79356, observed rank = 1000, p-value < 2.2e-16
## alternative hypothesis: two.sided

Here we visualize the randomly permuted values from the Monte Carlo test.

#draws a histogram of our randomly permuted values
# and adds a red line for our actual moran's i

moranMCres = moranMC$res |>
              as.data.frame()

colnames(moranMCres) = "col1"

ggplot(moranMCres) +
  geom_histogram(aes(col1), bins = 100) +
  geom_vline(xintercept = moranMC$statistic, col = "red") +
  labs(title = "Histogram of MoranMCres",
       x = "moranMCres",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Additional confirmation of the significance of our Moran’s I value can be found by plotting the relationship between ln_med_h_val of the block groups and their neighbors. If there were no relationship between block group observations and those of their neighbors, we would not see a clear pattern in the plot below. However, we observe that this is not the case.

moran.plot(reg_data$ln_med_h_val, nb2listw(reg_data$nb),
           xlab = "ln_med_h_val", 
            ylab = "Spatial Lag")

3.1.3 Local Moran’s I

The local_moran function in sfdep allows us to compute LISA statistics for each block group. This function reports the Local Moran statistic as ii. We might also be interested in var_ii, the variance of each block group’s Local Moran statistic, and p_ii, which indicates whether the Local Moran statistic can be considered statistically significant.

# Local Moran's I

# local Moran's I analysis with queen weight matrix; print results

lisa = local_moran(reg_data$ln_med_h_val, reg_data$nb, reg_data$weights, nsim = 999)

head(lisa)
##          ii           eii       var_ii     z_ii         p_ii p_ii_sim
## 1 5.3518753 -0.0614898509 1.3364064928 4.682718 2.830954e-06    0.002
## 2 4.4121912 -0.0073062459 0.7118931178 5.238000 1.623261e-07    0.002
## 3 3.5006316  0.0411214557 0.7642634601 3.957246 7.581882e-05    0.002
## 4 2.4444508  0.0235953540 0.2490908184 4.850539 1.231264e-06    0.002
## 5 1.8834699  0.0273707902 0.6441148595 2.312701 2.073908e-02    0.046
## 6 0.0995037  0.0001556868 0.0009613705 3.204157 1.354586e-03    0.002
##   p_folded_sim    skewness    kurtosis      mean    median     pysal
## 1        0.001  0.07783226  0.16297614 High-High High-High High-High
## 2        0.001  0.30980196  0.51776442 High-High High-High High-High
## 3        0.001  0.11421600  0.09571973 High-High High-High High-High
## 4        0.001  0.19659418  0.29340205 High-High High-High High-High
## 5        0.023  0.42558089  0.57809875 High-High High-High High-High
## 6        0.001 -0.02400010 -0.01847904 High-High High-High High-High

Now we can examine the local Moran’s I values to see if there is significant spatial autocorrelation. Mapping the p-values that we’ve just calculated, it appears that we have statistically significant (p < 0.05) clustering in parts of northwest, northeast, north, and west Philadelphia.

# first we need to bind the lisa dataframe to our reg_data dataframe

reg_data = cbind(reg_data, lisa)

tm_shape(reg_data) +
  tm_fill(style = "fixed", 
          col = "p_ii", 
          breaks = c(0,0.001, 0.01, 0.05, 1), 
          palette = '-viridis',
          title = "P-Value") +
  #tm_borders(col = "white",
   #          lwd = 0.05) +
  tm_layout(frame = FALSE, main.title = "P-Value")

We can now visualize statistically significant clustering throughout the city, which the local Moran’s I calculation groups into High-High, High-Low, Low-High, and Low-Low clusters. (We will also exclude clusters for which p > 0.05.)

Based on our map, it appears that there are significant High-High clusters (high house values observed in neighboring areas) in northwest and northeast Philadelphia, as well as in some small neighborhoods near Old City, along City Line Avenue, and near the Navy Yard in south Philadelphia. Significant Low-Low clusters (low house values observed in neighboring areas) are also found in North Philadelphia, parts of West Philadelphia, and small scattered sites near Grays Ferry.

reg_data = reg_data |>
            mutate(
                  mean = as.character(mean),
                  lisa_map = case_when(
                                  p_ii < 0.05 ~ mean,
                                  TRUE ~ "Insignificant"
                                )
                  )

pal = c("#B2182B", "#F4A582", "lightgrey", "#92C5DE", "#2166AC")

tm_shape(reg_data) +
  tm_fill(col = "lisa_map", palette = pal, title = "LISA Clusters")+
  tm_borders(col = "white", lwd = 0.05) +
  tm_layout(frame = FALSE, main.title = "LISA Clusters")

3.2 OLS Regression and Assumptions

3.2.1 OLS Output

Our model regressed the log of median house values (ln_med_h_val) on four predictors, namely,

  1. the percentage of individuals with Bachelor’s degrees or higher (PCTBACHMOR),
  2. the number of households living below the poverty line (LNNBELPOV),
  3. the percentage of vacant houses (PCTVACANT), and
  4. the percentage of single-family housing units (PCTSINGLES).

The output of our OLS regression indicates that all four predictors are statistically significant; the p-value for the F-statistic is less than 0.05, as are the individual p-values for each predictor. Additionally, our OLS model explains more than 60% of the variance of LNMEDHVAL (\(R^2\) is 0.6623 and the Adjusted \(R^2\) is 0.6615). LNNBELPOV and PCTVACANT are negatively associated with LNMEDHVAL, while PCTBACHMOR and PCTSINGLES are positively associated with LNMEDHVAL.

# OLS Regression

reg = lm(formula=ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, data = reg_data)

summary(reg)
## 
## Call:
## lm(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + 
##     PCTVACANT, data = reg_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25825 -0.20391  0.03822  0.21744  2.24347 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      11.1137661  0.0465330 238.836  < 2e-16 ***
## ln_n_bel_pov_100 -0.0789054  0.0084569  -9.330  < 2e-16 ***
## PCTBACHMOR        0.0209098  0.0005432  38.494  < 2e-16 ***
## PCTSINGLES        0.0029769  0.0007032   4.234 2.42e-05 ***
## PCTVACANT        -0.0191569  0.0009779 -19.590  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3665 on 1715 degrees of freedom
## Multiple R-squared:  0.6623, Adjusted R-squared:  0.6615 
## F-statistic: 840.9 on 4 and 1715 DF,  p-value: < 2.2e-16

In order to compare our OLS model to the other models generated later in this assignment, we will compute its log likelihood. We will return to these results later.

logLik(reg)
## 'log Lik.' -717.6489 (df=4)

To test for heteroscedasticity, we will run three tests:

  1. the Breusch-Pagan test
bptest(reg, studentize = FALSE)
## 
##  Breusch-Pagan test
## 
## data:  reg
## BP = 68.649, df = 2, p-value = 1.239e-15
  2. the Koenker-Bassett test (also known as the Studentized Breusch-Pagan test)
bptest(reg, studentize = TRUE)
## 
##  studentized Breusch-Pagan test
## 
## data:  reg
## BP = 20.892, df = 2, p-value = 2.906e-05
  3. the White test
white_test(reg)
## White's test results
## 
## Null hypothesis: Homoskedasticity of the residuals
## Alternative hypothesis: Heteroskedasticity of the residuals
## Test Statistic: 52.48
## P-value: 0

In all three cases, p < 0.05, indicating a problem with heteroscedasticity in our data.

Finally, we will run the Jarque-Bera test to assess whether the residuals of our regression are normal. Here, too, p < 0.05, indicating that our residuals are not normally distributed.

jarque.bera.test(reg$residuals)
## 
##  Jarque Bera Test
## 
## data:  reg$residuals
## X-squared = 2012.4, df = 2, p-value < 2.2e-16

3.2.2 Scatterplot of Residuals

Here, we’ll generate standardized residuals, which are OLS Model residuals divided by an estimate of their standard deviation. We can evaluate these standardized residuals to consider spatial autocorrelation in our OLS model by comparing them to their spatial lag.

First, we plot the standardized residuals against the spatially lagged residuals. We see a strong correlation, which is confirmed by the output of the linear model: \(\beta_1\) (the slope of our linear model in the plot) is 0.732, indicating strong spatial autocorrelation in our OLS model.

reg_data = reg_data |>
              mutate(stand_resids = rstandard(reg),
                     spatial_lag = st_lag(stand_resids, 
                                             nb, 
                                             weights))

ggplot(reg_data, aes(x = spatial_lag, y = stand_resids)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = 'lm') +
  labs(x = "Spatial Lag of Residuals",
       y = "Standardized Residuals") +
  theme_minimal() +
  theme(aspect.ratio = 1)
## `geom_smooth()` using formula 'y ~ x'

lm(formula =  stand_resids ~ spatial_lag, data = reg_data)
## 
## Call:
## lm(formula = stand_resids ~ spatial_lag, data = reg_data)
## 
## Coefficients:
## (Intercept)  spatial_lag  
##    -0.01281      0.73235

Further confirmation of this comes from mapping our standardized residuals, which appear to cluster in northwest, northeast, and north Philadelphia, as well as possibly downtown.

tm_shape(reg_data)+
  tm_fill(col = 'stand_resids', 
          style = 'quantile', 
          title = 'Standardized OLS Residuals', 
          palette ='viridis',
          midpoint = 0)+
  tm_layout(frame = FALSE, 
            title = 'Standardized OLS Residuals')

However, a visual assessment is not sufficient, and we will test for the presence of spatial autocorrelation in two ways: 1) by regressing residuals on their queen neighbors, and 2) by looking at the Moran’s I of the residuals.

First, let’s regress the OLS standardized residuals on the spatial lag of the OLS residuals (i.e., OLS residuals at the queen neighbors).

resids_lm = lm(formula = stand_resids ~ spatial_lag, data = reg_data)

summary(resids_lm)
## 
## Call:
## lm(formula = stand_resids ~ spatial_lag, data = reg_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3699 -0.4157  0.0826  0.5578  4.0238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.007314   0.022610   0.323    0.746    
## spatial_lag 0.598182   0.038569  15.509   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9375 on 1718 degrees of freedom
## Multiple R-squared:  0.1228, Adjusted R-squared:  0.1223 
## F-statistic: 240.5 on 1 and 1718 DF,  p-value: < 2.2e-16

3.2.3 Moran’s I Significance

Once again, we can use a Monte Carlo simulation to generate a Moran’s I statistic and a pseudo p-value.

global_moran_perm(reg_data$stand_resids, reg_data$nb, reg_data$weights, alternative = "two.sided", 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  x 
## weights: listw  
## number of simulations + 1: 1000 
## 
## statistic = 0.20532, observed rank = 1000, p-value < 2.2e-16
## alternative hypothesis: two.sided

It is strongly apparent that spatial autocorrelation exists among the regression residuals of the OLS Model. The p-value is very small, indicating that Moran’s I is significant. Because there’s clearly spatial autocorrelation in our OLS residuals, the OLS Model is inappropriate and we need to consider another method. Next, we will attempt to run the spatial lag model, the spatial error model, and geographically weighted regression.

moran.plot(reg_data$stand_resids, nb2listw(reg_data$nb),
           xlab = "Standardized Residuals", 
            ylab = "Spatial Lag")

3.3 Spatial Lag and Spatial Error Regression

3.3.1 Spatial Lag Regression Results

Here we have generated a spatial lag regression model and summarized the output. The coefficient of spatial lag \(\rho = 0.65\) and is significant (p < 0.05). The significance of this coefficient indicates an association between the median house value at one location and the values of houses in the surrounding areas: a 1-unit increase in surrounding house values is associated with a 0.65-unit increase in house value at a given location. Furthermore, in alignment with the results from our OLS model, all four of our explanatory variables (PCTBACHMOR, PCTSINGLES, PCTVACANT, and ln_n_bel_pov_100) have p < 0.05, meaning that they are significant.

spatial_lag_reg = lagsarlm(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, data = reg_data, nb2listw(reg_data$nb))

sumslr = summary(spatial_lag_reg)

sumslr
## 
## Call:lagsarlm(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + 
##     PCTSINGLES + PCTVACANT, data = reg_data, listw = nb2listw(reg_data$nb))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.655464 -0.117249  0.018654  0.133132  1.726460 
## 
## Type: lag 
## Coefficients: (asymptotic standard errors) 
##                     Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)       3.89844052  0.20111373  19.3843 < 2.2e-16
## ln_n_bel_pov_100 -0.03405540  0.00629303  -5.4116 6.246e-08
## PCTBACHMOR        0.00851392  0.00052195  16.3119 < 2.2e-16
## PCTSINGLES        0.00203340  0.00051578   3.9424 8.068e-05
## PCTVACANT        -0.00852964  0.00074369 -11.4694 < 2.2e-16
## 
## Rho: 0.6511, LR test value: 911.51, p-value: < 2.22e-16
## Asymptotic standard error: 0.01805
##     z-value: 36.072, p-value: < 2.22e-16
## Wald statistic: 1301.2, p-value: < 2.22e-16
## 
## Log likelihood: -255.7844 for lag model
## ML residual variance (sigma squared): 0.071952, (sigma: 0.26824)
## Number of observations: 1720 
## Number of parameters estimated: 7 
## AIC: 525.57, (AIC for lm: 1435.1)
## LM test for residual autocorrelation
## test value: 67.739, p-value: 2.2204e-16

To test for heteroscedasticity, we will once again run two of the three tests run above. In both cases, p < 0.05, indicating problems with heteroscedasticity in our spatial lag model.

  1. the Breusch-Pagan test
bptest.Sarlm(spatial_lag_reg, studentize = FALSE)
## 
##  Breusch-Pagan test
## 
## data:  
## BP = 34.725, df = 2, p-value = 2.881e-08
  2. the Koenker-Bassett test (also known as the Studentized Breusch-Pagan test)
bptest.Sarlm(spatial_lag_reg, studentize = TRUE)
## 
##  studentized Breusch-Pagan test
## 
## data:  
## BP = 11.945, df = 2, p-value = 0.002548

Likewise, we again run the Jarque-Bera test to assess whether the residuals of our regression are normal. Here, too, p < 0.05, indicating non-normality in the distribution of the residuals of our spatial lag model.

jarque.bera.test(spatial_lag_reg$residuals)
## 
##  Jarque Bera Test
## 
## data:  spatial_lag_reg$residuals
## X-squared = 1360.1, df = 2, p-value < 2.2e-16

Next, we will consider spatial autocorrelation in our model. Visually, it appears that the residuals from our spatial lag model show less autocorrelation than in our OLS model.

reg_data = reg_data |>
              mutate(
                spatial_lag_resids = lagsarlm(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, data = reg_data, nb2listw(reg_data$nb))$residuals
              )

tm_shape(reg_data)+
  tm_fill(col = 'spatial_lag_resids', 
          style = 'quantile', 
          title = 'Spatial Lag Residuals', 
          palette ='Blues',
          midpoint = 0)+
  tm_layout(frame = FALSE)

First, we can compare the Akaike Information Criterion (AIC) values of the two models: the spatial lag model returned an AIC of 525.57, while the OLS model returned an AIC of 1435.1, implying that the spatial lag model is better than the OLS model. Furthermore, we can compare our spatial lag model to the previous OLS model using the log likelihood. The log likelihood of the spatial lag model is -255.7844, while that of the OLS model is -717.6489, so the spatial lag model performs better than the OLS model. Additionally, the likelihood ratio test gives a p-value < 0.05, so we can reject \(H_0\) and state that the spatial lag model is better than the OLS model.

LR.Sarlm(spatial_lag_reg, reg) # spatial_lag_reg is the spatial lag output; reg is the OLS output
## 
##  Likelihood ratio for spatial linear models
## 
## data:  
## Likelihood ratio = 237.47, df = 1, p-value < 2.2e-16
## sample estimates:
## Log likelihood of spatial_lag_reg             Log likelihood of reg 
##                         -598.9141                         -717.6489
moran.plot(reg_data$spatial_lag_resids, nb2listw(reg_data$nb),
           xlab = "Lag Residuals", 
            ylab = "Lagged Lag Residuals")

Running a Monte Carlo simulation for the Moran’s I of the spatial lag residuals yields a Moran’s I of -0.041 with a significant p-value (0.004), far closer to zero than the OLS residual Moran’s I of 0.205. This implies that the spatial lag residuals are less spatially autocorrelated than the OLS residuals. Overall, the spatial lag model performs better than the OLS model.

global_moran_perm(spatial_lag_reg$residuals, reg_data$nb, reg_data$weights, alternative = "two.sided", 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  x 
## weights: listw  
## number of simulations + 1: 1000 
## 
## statistic = -0.041474, observed rank = 2, p-value = 0.004
## alternative hypothesis: two.sided

3.3.2 Spatial Error Regression Results

According to the results of the spatial error regression shown below, the coefficient of the spatial parameter, \(\lambda = 0.81\), is significant at \(p < 0.001\). Similar to \(\rho\) in the spatial lag regression, a positive, significant \(\lambda\) indicates spatial autocorrelation in the error.

As in the OLS regression, the four explanatory variables in the model (ln_n_bel_pov_100, PCTBACHMOR, PCTSINGLES, and PCTVACANT) all remain significant at \(p < 0.001\).

spatial_error_reg = errorsarlm(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, 
                               data = reg_data, 
                               nb2listw(reg_data$nb))

summary(spatial_error_reg)
## 
## Call:errorsarlm(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + 
##     PCTSINGLES + PCTVACANT, data = reg_data, listw = nb2listw(reg_data$nb))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92652 -0.11541  0.01489  0.13386  1.94867 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                     Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)      10.90641384  0.05346908 203.9761 < 2.2e-16
## ln_n_bel_pov_100 -0.03453480  0.00708951  -4.8712 1.109e-06
## PCTBACHMOR        0.00981310  0.00072898  13.4614 < 2.2e-16
## PCTSINGLES        0.00267790  0.00062085   4.3133 1.608e-05
## PCTVACANT        -0.00578324  0.00088672  -6.5220 6.936e-11
## 
## Lambda: 0.81492, LR test value: 677.61, p-value: < 2.22e-16
## Asymptotic standard error: 0.016373
##     z-value: 49.772, p-value: < 2.22e-16
## Wald statistic: 2477.2, p-value: < 2.22e-16
## 
## Log likelihood: -372.7338 for error model
## ML residual variance (sigma squared): 0.076555, (sigma: 0.27669)
## Number of observations: 1720 
## Number of parameters estimated: 7 
## AIC: 759.47, (AIC for lm: 1435.1)

As with our OLS model before, we will calculate the likelihood ratio test for our spatial error model.

LR.Sarlm(spatial_error_reg, reg) # spatial_error_reg is the spatial error output; reg is the OLS output
## 
##  Likelihood ratio for spatial linear models
## 
## data:  
## Likelihood ratio = 173.5, df = 1, p-value < 2.2e-16
## sample estimates:
## Log likelihood of spatial_error_reg               Log likelihood of reg 
##                           -630.8990                           -717.6489

To test for heteroscedasticity, we will once again run two of the three tests run above:

  1. the Breusch-Pagan test
bptest.Sarlm(spatial_error_reg, studentize = FALSE)
## 
##  Breusch-Pagan test
## 
## data:  
## BP = 19.75, df = 2, p-value = 5.144e-05

Based on the results of the Breusch-Pagan test, since \(p < 0.05\), the spatial error regression residuals are still heteroscedastic.

  2. the Koenker-Bassett test (also known as the Studentized Breusch-Pagan test)
bptest.Sarlm(spatial_error_reg, studentize = TRUE)
## 
##  studentized Breusch-Pagan test
## 
## data:  
## BP = 7.1133, df = 2, p-value = 0.02853

And, again, the Jarque-Bera test to assess whether the residuals of our regression are normal.

jarque.bera.test(spatial_error_reg$residuals)
## 
##  Jarque Bera Test
## 
## data:  spatial_error_reg$residuals
## X-squared = 1263.9, df = 2, p-value < 2.2e-16
reg_data = reg_data |>
              mutate(
                spatial_error_resids = errorsarlm(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, data = reg_data, nb2listw(reg_data$nb))$residuals
              )

tm_shape(reg_data)+
  tm_fill(col = 'spatial_error_resids', 
          style = 'quantile', 
          title = 'Spatial Error Residuals', 
          palette ='Blues',
          midpoint = 0)+
  tm_layout(frame = FALSE)

global_moran_perm(spatial_error_reg$residuals, reg_data$nb, reg_data$weights, alternative = "two.sided", 999)
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  x 
## weights: listw  
## number of simulations + 1: 1000 
## 
## statistic = -0.031944, observed rank = 17, p-value = 0.034
## alternative hypothesis: two.sided

The Moran’s I scatter plot of the spatial error regression residuals is presented below. In comparison to the earlier scatter plot for the OLS regression, the slope (Moran’s I) is closer to 0, indicating less spatial autocorrelation in the residuals than in the OLS regression.

moran.plot(reg_data$spatial_error_resids, nb2listw(reg_data$nb),
           xlab = "Error Residuals", 
            ylab = "Lagged Error Residuals")


3.3.3 Spatial Lag vs. Spatial Error

To compare the spatial lag and spatial error regressions, since neither method is a special case of the other, we should look at the Akaike Information Criterion rather than a likelihood ratio test. As the results from both regressions show, the spatial lag model has a lower AIC (525.57) than the spatial error model (759.47), suggesting that the spatial lag model is the better of the two. Further comparison with other regression models is covered in the later section ‘Model Comparison.’

3.4 Geographically Weighted Regression

3.4.1 GWR Results

Here we estimate an adaptive bandwidth, first with GWmodel::bw.gwr (which returns a number of nearest neighbors) and then with spgwr::gwr.sel (which returns the proportion of observations to include at each regression point).

adapt_bandwidth_est = GWmodel::bw.gwr(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, 
                                data = reg_data |> sf::as_Spatial(),
                                approach = 'AIC', 
                                kernel = 'gaussian', 
                                adaptive = TRUE)
## Take a cup of tea and have a break, it will take a few minutes.
##           -----A kind suggestion from GWmodel development group
## Adaptive bandwidth (number of nearest neighbours): 1070 AICc value: 1334.777 
## Adaptive bandwidth (number of nearest neighbours): 669 AICc value: 1280.776 
## Adaptive bandwidth (number of nearest neighbours): 420 AICc value: 1214.742 
## Adaptive bandwidth (number of nearest neighbours): 267 AICc value: 1129.01 
## Adaptive bandwidth (number of nearest neighbours): 171 AICc value: 1026.228 
## Adaptive bandwidth (number of nearest neighbours): 113 AICc value: 940.2137 
## Adaptive bandwidth (number of nearest neighbours): 76 AICc value: 863.4234 
## Adaptive bandwidth (number of nearest neighbours): 54 AICc value: 803.0275 
## Adaptive bandwidth (number of nearest neighbours): 39 AICc value: 745.4633 
## Adaptive bandwidth (number of nearest neighbours): 31 AICc value: 712.1281 
## Adaptive bandwidth (number of nearest neighbours): 25 AICc value: 686.4389 
## Adaptive bandwidth (number of nearest neighbours): 22 AICc value: 676.2034 
## Adaptive bandwidth (number of nearest neighbours): 19 AICc value: 670.9108 
## Adaptive bandwidth (number of nearest neighbours): 18 AICc value: 667.0847 
## Adaptive bandwidth (number of nearest neighbours): 16 AICc value: 663.0365 
## Adaptive bandwidth (number of nearest neighbours): 16 AICc value: 663.0365
# gwr.sel() from spgwr expresses the adaptive bandwidth as a proportion of
# observations (rather than a neighbour count, as bw.gwr() does); since we fit
# the model with spgwr::gwr() below, this estimate overwrites the GWmodel one.
adapt_bandwidth_est = gwr.sel(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, 
                              data = reg_data |> sf::as_Spatial(),
                              method = "aic",
                              adapt = TRUE)
## Bandwidth: 0.381966 AIC: 1278.726 
## Bandwidth: 0.618034 AIC: 1334.073 
## Bandwidth: 0.236068 AIC: 1209.53 
## Bandwidth: 0.145898 AIC: 1115.876 
## Bandwidth: 0.09016994 AIC: 1007.348 
## Bandwidth: 0.05572809 AIC: 910.4307 
## Bandwidth: 0.03444185 AIC: 821.2906 
## Bandwidth: 0.02128624 AIC: 737.6009 
## Bandwidth: 0.01315562 AIC: 681.6086 
## Bandwidth: 0.008130619 AIC: 660.8787 
## Bandwidth: 0.005024999 AIC: 714.2598 
## Bandwidth: 0.009856269 AIC: 667.0857 
## Bandwidth: 0.006944377 AIC: 667.59 
## Bandwidth: 0.008427569 AIC: 661.757 
## Bandwidth: 0.007677515 AIC: 663.6788 
## Bandwidth: 0.008171309 AIC: 660.931 
## Bandwidth: 0.00805268 AIC: 661.144 
## Bandwidth: 0.008130619 AIC: 660.8787

Here we estimate a fixed bandwidth: a constant kernel distance expressed in the projection’s map units (US feet for this data).

fixed_bandwidth_est = gwr.sel(formula=ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT, 
                              data = reg_data |> sf::as_Spatial(),
                              method = "aic",
                              adapt = FALSE)
## Bandwidth: 47374.26 AIC: 1380.178 
## Bandwidth: 76576.63 AIC: 1412.408 
## Bandwidth: 29326.2 AIC: 1314.513 
## Bandwidth: 18171.89 AIC: 1205.472 
## Bandwidth: 11278.14 AIC: 1056.874 
## Bandwidth: 7017.572 AIC: 904.1873 
## Bandwidth: 4384.396 AIC: 773.8975 
## Bandwidth: 2757.003 AIC: 701.36 
## Bandwidth: 1751.22 AIC: 921 
## Bandwidth: 3378.612 AIC: 715.0242 
## Bandwidth: 2578.467 AIC: 707.8216 
## Bandwidth: 2916.634 AIC: 700.6484 
## Bandwidth: 2860.57 AIC: 700.4427 
## Bandwidth: 2865.223 AIC: 700.4423 
## Bandwidth: 2863.527 AIC: 700.4421 
## Bandwidth: 2863.506 AIC: 700.4421 
## Bandwidth: 2863.504 AIC: 700.4421 
## Bandwidth: 2863.505 AIC: 700.4421 
## Bandwidth: 2863.505 AIC: 700.4421 
## Bandwidth: 2863.505 AIC: 700.4421 
## Bandwidth: 2863.504 AIC: 700.4421 
## Bandwidth: 2863.505 AIC: 700.4421 
## Bandwidth: 2863.505 AIC: 700.4421
fixed_bandwidth_est
## [1] 2863.505

Here we run a basic GWR with an adaptive bandwidth.

gwrmodel_adaptive = gwr(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT,
                        data = reg_data |> sf::as_Spatial(),
                        adapt = adapt_bandwidth_est, #adaptive bandwidth determined by proportion of observations accounted for
                        gweight = gwr.Gauss,
                        se.fit = TRUE, #to return local standard errors
                        hatmatrix = TRUE)

gwrmodel_adaptive
## Call:
## gwr(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + 
##     PCTSINGLES + PCTVACANT, data = sf::as_Spatial(reg_data), 
##     gweight = gwr.Gauss, adapt = adapt_bandwidth_est, hatmatrix = TRUE, 
##     se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.008130619 (about 13 of 1720 data points)
## Summary of GWR coefficient estimates at data points:
##                        Min.    1st Qu.     Median    3rd Qu.       Max.  Global
## X.Intercept.      9.6727001 10.7142952 10.9542194 11.1741864 12.0831379 11.1138
## ln_n_bel_pov_100 -0.2365328 -0.0733584 -0.0401192 -0.0126662  0.0948799 -0.0789
## PCTBACHMOR        0.0010974  0.0101382  0.0149281  0.0202191  0.0347274  0.0209
## PCTSINGLES       -0.0249715 -0.0075552 -0.0016627  0.0042280  0.0143341  0.0030
## PCTVACANT        -0.0317416 -0.0142387 -0.0089603 -0.0035771  0.0167916 -0.0192
## Number of data points: 1720 
## Effective number of parameters (residual: 2traceS - traceS'S): 360.5225 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 1359.477 
## Sigma (residual: 2traceS - traceS'S): 0.276227 
## Effective number of parameters (model: traceS): 257.9061 
## Effective degrees of freedom (model: traceS): 1462.094 
## Sigma (model: traceS): 0.2663572 
## Sigma (ML): 0.2455771 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 660.8787 
## AIC (GWR p. 96, eq. 4.22): 308.7987 
## Residual sum of squares: 103.73 
## Quasi-global R2: 0.8479231

Here we run a basic GWR with a fixed bandwidth.

# Perform a basic GWR with the fixed bandwidth
gwrmodel_fixed = gwr(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT,
              data = reg_data |> sf::as_Spatial(),
              bandwidth = fixed_bandwidth_est, #fixed bandwidth: a constant kernel distance in map units (US feet)
              gweight = gwr.Gauss,
              se.fit = TRUE, #to return local standard errors
              hatmatrix = TRUE)

gwrmodel_fixed
## Call:
## gwr(formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + 
##     PCTSINGLES + PCTVACANT, data = sf::as_Spatial(reg_data), 
##     bandwidth = fixed_bandwidth_est, gweight = gwr.Gauss, hatmatrix = TRUE, 
##     se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Fixed bandwidth: 2863.505 
## Summary of GWR coefficient estimates at data points:
##                        Min.    1st Qu.     Median    3rd Qu.       Max.  Global
## X.Intercept.      9.9111144 10.7328953 10.9397247 11.1639817 14.1200886 11.1138
## ln_n_bel_pov_100 -0.4449947 -0.0737759 -0.0433086 -0.0171177  0.1491675 -0.0789
## PCTBACHMOR       -0.0860921  0.0118751  0.0168151  0.0213557  0.0306660  0.0209
## PCTSINGLES       -0.0238339 -0.0073896 -0.0025703  0.0040499  0.0190000  0.0030
## PCTVACANT        -0.0469927 -0.0137377 -0.0088798 -0.0038448  0.0778877 -0.0192
## Number of data points: 1720 
## Effective number of parameters (residual: 2traceS - traceS'S): 346.7158 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 1373.284 
## Sigma (residual: 2traceS - traceS'S): 0.2785303 
## Effective number of parameters (model: traceS): 255.6016 
## Effective degrees of freedom (model: traceS): 1464.398 
## Sigma (model: traceS): 0.2697262 
## Sigma (ML): 0.2488791 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 700.4421 
## AIC (GWR p. 96, eq. 4.22): 352.4397 
## Residual sum of squares: 106.5382 
## Quasi-global R2: 0.843806
gwr_adaptive_results = as.data.frame(gwrmodel_adaptive$SDF)

gwr_fixed_results = as.data.frame(gwrmodel_fixed$SDF)

reg_data = reg_data |>
              mutate(
                # standardized (pseudo t) local coefficients: estimate / standard error
                coef_ln_n_bel_pov_100_adapt_stand = gwr_adaptive_results$ln_n_bel_pov_100 / gwr_adaptive_results$ln_n_bel_pov_100_se,
                coef_pctbachmor_adapt_stand = gwr_adaptive_results$PCTBACHMOR / gwr_adaptive_results$PCTBACHMOR_se,
                coef_pctsingles_adapt_stand = gwr_adaptive_results$PCTSINGLES / gwr_adaptive_results$PCTSINGLES_se,
                coef_pctvacant_adapt_stand = gwr_adaptive_results$PCTVACANT / gwr_adaptive_results$PCTVACANT_se,
                gwr_adapt_resids = gwr_adaptive_results$gwr.e,
                gwr_adapt_r2 = gwr_adaptive_results$localR2
              )

We can look at a summary of the local regression coefficients, stored in the SDF object within the GWR model. Note in particular the minimum and maximum values of the local \(R^2\) (0.1337407 - 0.8862806). There are no negative local \(R^2\) values, which is as expected (unlike the output produced by some versions of ArcGIS Pro).

We can also compare model performance using AIC: the OLS model had an AIC of 1435, while the GWR models had AICs of 308.8 (adaptive bandwidth) and 352.4 (fixed bandwidth). Both GWR specifications therefore substantially outperform OLS.
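These figures can also be pulled programmatically (a minimal sketch, assuming the fitted objects above; AICh is the component of the spgwr results list that matches the printed ‘AIC’ line):

range(gwrmodel_adaptive$SDF$localR2)  # min and max local R^2
AIC(reg)                              # OLS
gwrmodel_adaptive$results$AICh        # GWR, adaptive bandwidth
gwrmodel_fixed$results$AICh           # GWR, fixed bandwidth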

3.4.2 Moran’s I Scatterplot of GWR Residuals

The Moran’s I scatterplot of the GWR residuals is presented below. The Moran’s I here is closer to 0 than for OLS, Spatial Lag, or Spatial Error, indicating less residual spatial autocorrelation than in the other three models.

spdep::moran.plot(reg_data$gwr_adapt_resids, nb2listw(reg_data$nb),
           xlab = "GWR Residuals (Adaptive)", 
           ylab = "Spatially Lagged Residuals")
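To put a number on this (a sketch, reusing the same permutation test applied to the other models above):

global_moran_perm(reg_data$gwr_adapt_resids, reg_data$nb, reg_data$weights,
                  alternative = "two.sided", 999)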

3.4.3 Local GWR Results

The local coefficients from GWR provide insight into how the relationships between the dependent variable and the explanatory variables vary from place to place.

summary(gwrmodel_adaptive$SDF)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x 2660604.8 2750171.3
## y  207610.6  304858.8
## Is projected: TRUE 
## proj4string :
## [+proj=lcc +lat_0=39.3333333333333 +lon_0=-77.75 +lat_1=40.9666666666667
## +lat_2=39.9333333333333 +x_0=600000 +y_0=0 +datum=NAD83 +units=us-ft
## +no_defs]
## Data attributes:
##      sum.w        X.Intercept.      LNMEDHVAL          PCTVACANT         
##  Min.   :15.94   Min.   :-2.032   Min.   :-0.08287   Min.   :-0.0873804  
##  1st Qu.:23.75   1st Qu.: 4.304   1st Qu.: 0.27806   1st Qu.:-0.0097721  
##  Median :25.83   Median : 5.733   Median : 0.41888   Median :-0.0047076  
##  Mean   :26.60   Mean   : 5.628   Mean   : 0.42758   Mean   :-0.0057903  
##  3rd Qu.:28.42   3rd Qu.: 7.205   3rd Qu.: 0.55664   3rd Qu.: 0.0003928  
##  Max.   :82.64   Max.   :11.068   Max.   : 1.12049   Max.   : 0.0221103  
##  X.Intercept._se  LNMEDHVAL_se      PCTVACANT_se          gwr.e         
##  Min.   :0.417   Min.   :0.03691   Min.   :0.002276   Min.   :-2.12779  
##  1st Qu.:1.006   1st Qu.:0.09035   1st Qu.:0.004919   1st Qu.:-0.13725  
##  Median :1.466   Median :0.13302   Median :0.006560   Median : 0.02338  
##  Mean   :1.615   Mean   :0.14588   Mean   :0.007866   Mean   :-0.00141  
##  3rd Qu.:2.050   3rd Qu.:0.18633   3rd Qu.:0.009363   3rd Qu.: 0.17374  
##  Max.   :6.271   Max.   :0.54688   Max.   :0.031691   Max.   : 1.20816  
##       pred           pred.se           localR2        X.Intercept._se_EDF
##  Min.   : 9.202   Min.   :0.02840   Min.   :0.05789   Min.   :0.4263     
##  1st Qu.: 9.980   1st Qu.:0.05163   1st Qu.:0.24333   1st Qu.:1.0281     
##  Median :10.232   Median :0.06172   Median :0.34231   Median :1.4984     
##  Mean   :10.245   Mean   :0.06740   Mean   :0.35696   Mean   :1.6507     
##  3rd Qu.:10.522   3rd Qu.:0.07704   3rd Qu.:0.45616   3rd Qu.:2.0956     
##  Max.   :12.172   Max.   :0.25217   Max.   :0.75169   Max.   :6.4103     
##  LNMEDHVAL_se_EDF  PCTVACANT_se_EDF     pred.se.1      
##  Min.   :0.03773   Min.   :0.002326   Min.   :0.02904  
##  1st Qu.:0.09236   1st Qu.:0.005028   1st Qu.:0.05278  
##  Median :0.13597   Median :0.006706   Median :0.06309  
##  Mean   :0.14912   Mean   :0.008041   Mean   :0.06890  
##  3rd Qu.:0.19047   3rd Qu.:0.009571   3rd Qu.:0.07875  
##  Max.   :0.55902   Max.   :0.032395   Max.   :0.25777

Here we have mapped the standardized local coefficients (each coefficient divided by its standard error). The higher the absolute value of this ratio, the more plausible it is that the relationship between the predictor and the dependent variable is significant at that location; as a rough rule of thumb, absolute values above 2 suggest local significance at the 0.05 level, though these pseudo t-statistics should be interpreted with caution.
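As an illustrative summary before mapping (a sketch, assuming the standardized coefficient columns created above), we can compute the share of tracts where each local pseudo t-statistic exceeds 2 in absolute value:

# Share of tracts with |coefficient / SE| > 2 for each predictor
reg_data |>
  sf::st_drop_geometry() |>
  dplyr::summarise(dplyr::across(dplyr::ends_with('_adapt_stand'),
                                 ~ mean(abs(.x) > 2)))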

coef_ln_n_bel_pov_100_adapt_stand_map = tm_shape(reg_data)+
                                    tm_fill(col='coef_ln_n_bel_pov_100_adapt_stand', 
                                            breaks=c(-Inf, -6, -4, -2, 0, 2, 4, 6, Inf), 
                                            title='Stand. coef. of ln_n_bel_pov_100', 
                                            palette ='-RdBu')+
                                    tm_layout(frame=FALSE, 
                                              main.title = '# Bel. Pov Line (Log)')

coef_pctbachmor_adapt_stand_map = tm_shape(reg_data)+
                                    tm_fill(col='coef_pctbachmor_adapt_stand', 
                                            breaks=c(-Inf, -6, -4, -2, 0, 2, 4, 6, Inf), 
                                            title='Stand. coef. of PCTBACHMOR', 
                                            palette='-RdBu')+
                                    tm_layout(frame=FALSE, 
                                              main.title = '% w/ Bach. Deg.+')

coef_pctsingles_adapt_stand_map = tm_shape(reg_data)+
                                    tm_fill(col='coef_pctsingles_adapt_stand', 
                                            breaks=c(-Inf, -6, -4, -2, 0, 2, 4, 6, Inf), 
                                            title='Stand. coef. of PCTSINGLES', 
                                            palette='-RdBu')+
                                    tm_layout(frame=FALSE, 
                                              main.title = '% Single Housing Units')

coef_pctvacant_adapt_stand_map = tm_shape(reg_data)+
                                    tm_fill(col='coef_pctvacant_adapt_stand', 
                                            breaks=c(-Inf, -6, -4, -2, 0, 2, 4, 6, Inf), 
                                            title='Stand. coef. of PCTVACANT', 
                                            palette='-RdBu')+
                                    tm_layout(frame=FALSE, 
                                              main.title = 'Percentage of Housing Vacant')


tmap_arrange(coef_ln_n_bel_pov_100_adapt_stand_map, 
             coef_pctbachmor_adapt_stand_map,
             ncol = 2)
## Variable(s) "coef_ln_n_bel_pov_100_adapt_stand" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

tmap_arrange(coef_pctsingles_adapt_stand_map,
             coef_pctvacant_adapt_stand_map, 
             ncol=2)
## Variable(s) "coef_pctsingles_adapt_stand" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
## Variable(s) "coef_pctvacant_adapt_stand" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

3.4.4 Local GWR R-Squared Results

tm_shape(reg_data)+
  tm_fill(col = 'gwr_adapt_r2',  
          breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8), 
          palette = 'Blues',
          title = "R Squared")+
  tm_layout(frame = FALSE, 
            main.title = 'Local R Squared')

Based on this map, we can see an uneven distribution of \(R^2\) values across the city, with parts of Northwest Philadelphia having very high values and some areas just north of Center City having very low values.

3.4.5 Local GWR Residuals Results

tm_shape(reg_data)+
  tm_fill(col = 'gwr_adapt_resids',  
          style = 'quantile',
          palette = 'Blues', 
          title = "Residuals",
          midpoint = 0)+
  tm_layout(frame = FALSE, 
            main.title = 'Residuals',
            title = "Adaptive GWR")

3.5 Model Comparison

As discussed in the sections above, Spatial Lag and Spatial Error each outperform OLS. The following table compares all four models tested in this project. Comparing the Moran’s I of the residuals, GWR has the value closest to 0 (0.033), followed by Spatial Lag, Spatial Error, and OLS. Comparing AIC, GWR again has the lowest value (308.8), followed by Spatial Lag, Spatial Error, and OLS. Finally, comparing the \(R^2\) of OLS and GWR, GWR is higher, suggesting that it accounts for more of the variance in house values. Overall, GWR is the best of the four models.

Note that the global \(R^2\) for our GWR here is not exactly the same as in the function’s printed output; we compute it ourselves as \(1 - RSS/TSS\) (following the package’s source code) so that it updates automatically in the table.

model = c("OLS", "Spatial Lag", "Spatial Error", "GWR")
global_moran = c(global_moran_perm(reg$residuals, reg_data$nb, reg_data$weights, alternative = "two.sided", 999)$statistic,
                 global_moran_perm(spatial_lag_reg$residuals, reg_data$nb, reg_data$weights, alternative = "two.sided", 999)$statistic,
                 global_moran_perm(spatial_error_reg$residuals, reg_data$nb, reg_data$weights, alternative = "two.sided", 999)$statistic,
                 global_moran_perm(reg_data$gwr_adapt_resids, reg_data$nb, reg_data$weights, alternative = "two.sided", 999)$statistic)
  
# Notes on AIC sources: extractAIC()/stepAIC() report AIC only up to an
# additive constant, so their values are not comparable with AIC(); and
# spatial_lag_reg$AIC_lm.model stores the AIC of the corresponding
# *non-spatial* lm fit, which is why the lag and error models returned
# identical values there. We therefore use AIC() for the first three models
# and the AIC reported by spgwr for GWR.
aic = c(AIC(reg),
        AIC(spatial_lag_reg),
        AIC(spatial_error_reg),
        gwrmodel_adaptive$results$AICh)

log_lik = c(logLik(reg), logLik(spatial_lag_reg), NA, NA)

lik_ratio = c(NA, LR.Sarlm(spatial_lag_reg, reg)$statistic, LR.Sarlm(spatial_error_reg, reg)$statistic, NA)
  
lik_rat_p = c(NA, LR.Sarlm(spatial_lag_reg, reg)$p.value, LR.Sarlm(spatial_error_reg, reg)$p.value, NA)

r2 = c(summary(reg)$r.squared, NA, NA, (1 - (gwrmodel_adaptive$results$rss / gwrmodel_adaptive$gTSS))) # unadjusted R^2, for comparability across models
                                            # global GWR R^2 computed as 1 - RSS/TSS, per: https://stackoverflow.com/questions/43927662/return-the-global-r2-of-a-geographically-weighted-regression-gwr-in-r

comp = cbind(model,
             round(global_moran, 3),
             round(aic, 1),
             round(log_lik, 4),
             round(lik_ratio, 1),
             lik_rat_p,
             round(r2, 4)) |>
      as.data.frame()

colnames(comp) = c("Model", "Moran's I", "AIC", "Log Likelihood", "Likelihood Ratio", "L.R. P-Value", "R^2")

table_out = comp |>
        gt() |>
        tab_header(
          title = md("**Model Comparison**"))
     
#print output
table_out
**Model Comparison**

| Model         | Moran's I | AIC    | Log Likelihood | Likelihood Ratio | L.R. P-Value | R^2    |
|---------------|-----------|--------|----------------|------------------|--------------|--------|
| OLS           | 0.313     | 1435.1 | -711.5376      | NA               | NA           | 0.6623 |
| Spatial Lag   | -0.082    | 525.6  | -255.7844      | 911.5            | 0            | NA     |
| Spatial Error | -0.095    | 759.5  | NA             | 677.6            | 0            | NA     |
| GWR           | 0.033     | 308.8  | NA             | NA               | NA           | 0.8479 |
gwr_local_r2 = tm_shape(reg_data)+
                          tm_fill(col = 'gwr_adapt_r2', 
                                  style = 'quantile', 
                                  title = 'GWR Local R^2', 
                                  palette = 'Blues',
                                midpoint = 0)+
                          tm_layout(frame = FALSE)

gwr_coef_ln_n_bel_pov_100 = tm_shape(reg_data)+
                        tm_fill(col = 'coef_ln_n_bel_pov_100_adapt_stand', 
                                style = 'quantile', 
                                title = 'GWR Coef., ln_n_bel_pov_100', 
                                palette = 'Blues',
                                midpoint = 0)+
                        tm_layout(frame = FALSE)

gwr_coef_pctbachmor = tm_shape(reg_data)+
                        tm_fill(col = 'coef_pctbachmor_adapt_stand', 
                                style = 'quantile', 
                                title = 'GWR Coef., PCTBACHMOR', 
                                palette = 'Blues',
                                midpoint = 0)+
                        tm_layout(frame = FALSE)

gwr_coef_pctsingles = tm_shape(reg_data)+
                        tm_fill(col = 'coef_pctsingles_adapt_stand', 
                                style = 'quantile', 
                                title = 'GWR Coef., PCTSINGLES', 
                                palette = 'Blues',
                                midpoint = 0)+
                        tm_layout(frame = FALSE)


gwr_coef_pctvacant = tm_shape(reg_data)+
                        tm_fill(col = 'coef_pctvacant_adapt_stand', 
                                style = 'quantile', 
                                title = 'GWR Coef., PCTVACANT', 
                                palette = 'Blues',
                                midpoint = 0)+
                        tm_layout(frame = FALSE)


gwr_local_r2

tmap_arrange(gwr_coef_ln_n_bel_pov_100,
             gwr_coef_pctbachmor,
             gwr_coef_pctsingles,
             gwr_coef_pctvacant,
             ncol = 2)

4 Discussion

In this project, we tested three modeling approaches beyond OLS—spatial lag, spatial error, and geographically weighted regression (GWR)—to check whether they account for spatial autocorrelation better than OLS does.

Based on the results in the ‘Model Comparison’ section above, GWR is the best of the four regression methods: it has a higher \(R^2\) than OLS, the Moran’s I closest to 0 of the four, and a lower AIC than either spatial error or spatial lag. These results suggest that GWR is overall the better model, with the strongest ability to account for spatial autocorrelation.

Although the GWR model performs best among the four regression models, it has some limitations, including heteroscedasticity and potential local multicollinearity. Based on the significant Breusch-Pagan test results, all four models, including GWR, face potential heteroscedasticity. In addition, according to the maps of local coefficients shown in Section 3.5, there may be slight local multicollinearity between the log-transformed poverty count and the vacancy percentage.
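One way to probe the local multicollinearity concern in future work (a sketch, assuming the GWmodel package loaded earlier; gwr.collin.diagno() computes local correlations, local VIFs, and local condition numbers at each regression point) would be:

# Local collinearity diagnostics; condition numbers above roughly 30 are a
# common rule-of-thumb flag for problematic local collinearity. We assume
# here that the local_CN component holds the local condition numbers.
local_collin = GWmodel::gwr.collin.diagno(
  formula = ln_med_h_val ~ ln_n_bel_pov_100 + PCTBACHMOR + PCTSINGLES + PCTVACANT,
  data = reg_data |> sf::as_Spatial(),
  bw = 16,             # adaptive bandwidth (nearest neighbours) from bw.gwr() above
  kernel = 'gaussian',
  adaptive = TRUE)
summary(local_collin$local_CN)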

5 Citations


  1. Stephen Malpezzi, Larry Ozanne, and Thomas Thibodeau, Characteristic Prices of Housing in Fifty-Nine Metropolitan Areas (Washington, D.C.: Department of Housing and Urban Development, 1980).

  2. Stephen Malpezzi, Gregory H. Chun, and Richard K. Green, “New Place-to-Place Housing Price Indexes for U.S. Metropolitan Areas, and Their Determinants,” Real Estate Economics 26, no. 2 (1998): 235-274. https://doi-org.proxy.library.upenn.edu/10.1111/1540-6229.00745

  3. Malpezzi, Chun, and Green, “New Place-to-Place Housing Price Indexes,” 238.

  4. This analysis is based on Data Analysis and Visualization with R: Spatial (2022).