OLS Regression to Predict Median Household Values in Philadelphia

Brooke Porter, Dennis Espejo, Oindriza Reza Nodi
MUSA 5000: Statistical and Data Mining Methods for Urban Data Analysis
Eugene Brusilovskiy
October 3, 2025

Introduction

Property values provide a monetary marker of a neighborhood that, when considered alongside other neighborhood characteristics, can paint a rich picture of quality of life. Considering how house values relate to other neighborhood characteristics can reveal community assets as well as unmet needs. The present study considers the relationship between median house values in Philadelphia (MEDHVAL) and various community attributes, including the proportion of residents with at least a bachelor’s degree (PCTBACHMOR), the proportion of vacant housing units (PCTVACANT), the percent of single-family housing units (PCTSINGLES), the number of households living in poverty (NBELPOV100), and median household income (MEDHHINC). Analyzing the relationship between these neighborhood characteristics and MEDHVAL at the census block group level can reveal patterns across Philadelphia neighborhoods. Environment and outcomes are deeply intertwined, and this close relationship can lead to pockets of poverty or wealth and to a concentration or dearth of resources in a city. By identifying predictors of median house value, this study examines how social systems are interconnected and how their combined influence shapes outcomes.

Although Philadelphia is no longer America’s poorest big city, its poverty rate remains high at 19.7%, with significant disparities across sex, race and ethnicity, and education level (File & Duchneskie, 2025). Additionally, the city’s history of residential segregation (Logan & Bellman, 2016) and ongoing, racialized gentrification (Hwang & Ding, 2020) suggest that much can be learned by considering the relationship between these community attributes and house values in Philadelphia. Understanding how these neighborhood characteristics might predict MEDHVAL can reveal which predictors are worth targeting. Such insights can guide the development of efficient, targeted community revitalization projects and interventions.

Methods

Data

Philadelphia’s census block shapefiles and demographic information were sourced from the 2000 United States Census Bureau’s “Decennial Census”. Specifically, the following variables of interest, amongst others, were retrieved for use in the current study.

      1) POLY_ID: Census Block Group ID
      2) MEDHVAL: Median value of all owner-occupied housing units
      3) PCTBACHMOR: Proportion of residents in Block Group with at least a bachelor’s degree
      4) PCTVACANT: Proportion of housing units that are vacant
      5) PCTSINGLES: Percent of housing units that are detached single family houses
      6) NBELPOV100: Number of households with incomes below 100% poverty level
            (i.e., number of households living in poverty)
      7) MEDHHINC: Median household income
      

The POLY_ID and Geography columns identify Philadelphia’s Census block groups. Census blocks are the smallest geographic unit of analysis provided by the U.S. Census Bureau (Rossiter, 2011), and block groups, which aggregate them, provide the most granular level of geography available for an in-depth, delineated analysis of trends across specific sectors of Philadelphia. Although the original Philadelphia dataset contains 1,816 block groups, the following exclusion criteria reduced the number of block groups in our final dataset to 1,720:

        1) Block groups where the population <40
        2) Block groups where there are no housing units
        3) Block groups where the median house value is lower than $10,000
        4) One North Philadelphia block group, which had a very high median house value (over $800,000) and a very low median household income (less than $8,000)

Exploratory Data Analysis

To get a baseline understanding of the dataset, we will examine the summary statistics and distributions of variables. As a part of our exploratory data analysis, we will examine the correlations between the predictors. A correlation is a standardized measure of the strength of the relationship between variables that does not depend on the units of measurement of \(x\) and \(y\). The possible range of values for a correlation coefficient \(r\) is

\[ -1 \leq r \leq 1 \]

A correlation of 0 means that there is no relationship between \(x\) and \(y\). The formula to calculate the correlation coefficient \(r\) between variables \(x\) and \(y\) is as follows: \[ r = \frac{1}{n - 1} \sum_{i = 1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
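As a concrete illustration, the pairwise correlations can be computed in R with cor(). The following is a minimal sketch, assuming a data frame named phila containing the predictor columns (the name is ours, not taken from the analysis code):

```r
# Pairwise Pearson correlations among the predictors; cor() applies the
# formula above to every pair of columns. `phila` is an assumed data frame.
preds <- phila[, c("PCTBACHMOR", "PCTVACANT", "PCTSINGLES", "NBELPOV100")]
round(cor(preds, method = "pearson"), 3)
```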


Multiple Regression Analysis

A regression analysis is defined as “the part of statistics that investigates the relationship of two or more variables related in a non-deterministic fashion” (Devore, 2015). Regression analysis is an efficient method of estimating the relationship between one or more predictors (x) and an outcome (y) and is commonly used to unearth patterns between variables. For example, a student’s likelihood of enrolling in a four-year university (outcome) can be modeled as a function of participation in a college prep course (predictor), or stock returns (outcome) can be modeled as a function of the political climate (predictor). The current study uses a multiple regression analysis, in which one outcome is regressed on multiple predictors: our outcome variable LNMEDHVAL is regressed on the predictors PCTBACHMOR, PCTVACANT, PCTSINGLES, and LNNBELPOV100. Our OLS regression equation is:

\[ \ln(\text{MEDHVAL}) = \beta_0 + \beta_1(\text{PCTBACHMOR}) + \beta_2(\text{PCTVACANT}) + \beta_3(\ln(\text{NBELPOV100})) + \beta_4(\text{PCTSINGLES}) + \varepsilon \]

The \(\beta_i\) coefficients represent the amount by which our outcome, ln(MEDHVAL), changes as each predictor increases by one unit, holding all other predictors in the model constant. Each coefficient shows the magnitude and direction of the relationship between the predictor and the outcome. In this model, because the dependent variable is log-transformed, a one-unit increase in a predictor (except for LNNBELPOV100) indicates a percentage change in the dependent variable of

\[ \left( e^{\beta_i} - 1 \right) \times 100\%. \]

When both the dependent and independent variables are logged, as in the relationship between ln(MEDHVAL) and ln(NBELPOV100), a one percent change in the predictor corresponds to an approximate \(\beta_i\) percent change in the dependent variable. For example, if the coefficient \(\beta_i = 0.25\), a 1% increase in x is associated with a 0.25% increase in y.
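For small coefficients, the exact expression \((e^{\beta_i} - 1) \times 100\%\) is close to a simple \(100\beta_i\%\) reading. A quick R check with an illustrative coefficient (0.05, not one of our estimates) makes this concrete:

```r
# Exact vs. approximate percent change in y per one-unit change in x
# for a log-linear model; beta = 0.05 is purely illustrative.
beta   <- 0.05
exact  <- (exp(beta) - 1) * 100   # exact: about 5.13%
approx <- beta * 100              # approximation: 5%
c(exact = exact, approx = approx)
```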

The \(\varepsilon\) variable in our regression equation represents the random error term, which captures the variation of individual data points around the model’s best-fitted regression line. It reflects the portion of the outcome that is not explained by the predictors. The random error for each observation is calculated by subtracting the predicted value of y (based on the model) from the actual observed value of y in the sample. Without \(\varepsilon\), every data point would lie exactly on the regression line, resulting in a perfectly deterministic relationship that does not reflect real-world data.

A regression analysis assumes the following: linearity, independence of observations, normality of residuals, homoscedasticity, and no multicollinearity. Linearity assumes that as a predictor increases or decreases, the outcome variable steadily increases or decreases as well. One simple way to assess linearity is to examine a scatter plot of the predictor and outcome and look for a linear pattern; if no linearity is observed, some type of variable transformation or polynomial regression may be required. Independence of observations requires that the residuals for each observation be independent of those of other observations; that is, the value of one residual shouldn’t tell us anything about a separate residual, and there should be no spatial, temporal, or other forms of dependence. Normality of residuals assumes that residuals have a mean of 0 and follow a normal distribution, although this assumption is less critical when the sample size is large. Homoscedasticity specifies that the variance of the residuals should be constant across all levels of the predictors: in a scatter plot of the residuals against a predictor, the spread of the residuals should remain consistent as the predictor increases, with no distinctive shape or pattern (such as a funnel) formed by the points.

Lastly, each predictor variable in a multiple regression should exhibit low multicollinearity, meaning it should not be excessively correlated with other predictors; a correlation above 0.80 is a common rule of thumb (Brusilovskiy, 2025), though this varies across fields. When predictors are highly multicollinear, the model’s overall predictive ability remains intact, but the attribution of that ability to individual predictors becomes unreliable. In other words, the model cannot accurately tell which predictors are contributing to the outcome.

Aside from meeting the previous assumptions, a multiple regression analysis needs to estimate the following parameters: \(\sigma^2\) and \(\beta_0, \dots, \beta_k\). As previously mentioned, \(\beta_0\) is the y-intercept of the regression line, meaning that when all predictors are 0, the outcome equals \(\beta_0\), and \(\beta_1 \dots \beta_k\) are the coefficient estimates, or slopes, for each predictor. Sigma squared (\(\sigma^2\)) represents the variance of the error term after applying the model. It is used to measure the amount of variability remaining after the model accounts for all predictors and is calculated as:

\[ \sigma^2 = \frac{\text{SSE}}{n - (k + 1)} \]

where n is the number of observations and n − (k + 1) is the residual degrees of freedom (n observations minus k predictor coefficients and the intercept). In multiple regression, the parameters are estimated using the least squares method, which minimizes the sum of squared errors (SSE) between the observed and predicted values. SSE is found by summing the squares of the differences between the actual and predicted values for all data points in the dataset. The following equations display how SSE is calculated:

\[ \text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \]

\[ \text{SSE} = \sum \left[ y_i - \left( b_0 + b_1x_{1i} + b_2x_{2i} + \dots + b_kx_{ki} \right) \right]^2 \]

This equation measures how far each predicted value is from the actual value, squares those differences, and adds them up. The goal is to find the set of b coefficients that makes this total error as small as possible. By taking derivatives of SSE with respect to each coefficient and setting them equal to zero, we solve for the estimates of b0, b1, …, bk. This gives us the line (or plane) that best fits the data in terms of minimizing squared prediction errors.
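Setting those derivatives to zero yields the matrix-form solution \(b = (X'X)^{-1}X'y\). The following R snippet, a sketch on simulated data of our own rather than the Philadelphia dataset, verifies that this matches lm():

```r
# Least squares via the normal equations, b = (X'X)^{-1} X'y, on simulated data.
set.seed(1)
n  <- 100
x1 <- runif(n); x2 <- runif(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n)   # true model plus random error
X  <- cbind(1, x1, x2)                   # design matrix with intercept column
b  <- solve(t(X) %*% X, t(X) %*% y)      # coefficient estimates minimizing SSE
sse <- sum((y - X %*% b)^2)              # the minimized SSE
sigma2_hat <- sse / (n - (2 + 1))        # estimate of sigma^2 from the formula above
cbind(b, coef(lm(y ~ x1 + x2)))          # the two solutions agree
```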

The coefficient of multiple determination, \(R^2\), represents the proportion of variance in the dependent variable that is explained by the model:

\[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}} \]

R² relies on two main functions: SSE and SST. SST (Total Sum of Squares) measures the total variation in the dependent variable by summing the squared differences between each observed value and the overall mean

\[ \text{SST} = \sum_{i=1}^{n}(y_i - \bar{y})^2. \]

The model seeks to minimize SSE relative to SST. Dividing SSE by SST gives the proportion of total variance left unexplained by the model, and subtracting this ratio from 1 gives the proportion of variance explained. An R² of 0 indicates that the model explains none of the variance, while an R² of 1 indicates that the model explains all the variance in the outcome.

The adjusted R² takes this one step further by accounting for the number of predictors in the model. R² will always increase as more predictors are added, even if they are not meaningful. The adjusted R² corrects this by penalizing models that add more predictors through the following equation:

\[ \text{Adjusted } R^2 = 1 - \left[ \frac{(1 - R^2)(n - 1)}{n - (k + 1)} \right] \]

This equation modifies the regular R² to account for model complexity. The penalty is based on the number of predictors (k) relative to the sample size (n): the term (n − k − 1) adjusts the Sum of Squared Errors (SSE) for the degrees of freedom lost when estimating model parameters, since each estimated coefficient uses one degree of freedom, while (n − 1) adjusts the Total Sum of Squares (SST), because one degree of freedom is used when estimating the mean of the dependent variable.
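Continuing the simulated example above, R² and adjusted R² can be computed directly from SSE and SST; this sketch reuses the quantities defined in the previous snippet:

```r
# R^2 and adjusted R^2 from SSE and SST; reuses y, sse, and n from the
# simulated example above, with k = 2 predictors.
k   <- 2
sst <- sum((y - mean(y))^2)                    # total variation around the mean
r2  <- 1 - sse / sst                           # proportion of variance explained
adj <- 1 - (1 - r2) * (n - 1) / (n - (k + 1))  # penalized for model complexity
c(R2 = r2, adj_R2 = adj)
```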

The hypotheses tested in a multiple regression analysis involve both overall model significance and individual coefficients. The overall F-test examines whether the set of predictors, as a group, significantly predicts the outcome variable. The null hypothesis states that all slope coefficients are equal to zero

\[ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \]

meaning none of the predictors have a relationship with the outcome. The alternative hypothesis states that at least one coefficient is not equal to zero

\[ H_a: \text{at least one } \beta_i \neq 0. \]

A significant F-ratio indicates that the model explains a significant amount of variance in the outcome beyond what would be expected by chance. For individual predictors, t-tests are used. The null hypothesis for each predictor states that \(\beta_i = 0\), meaning the predictor has no effect, while the alternative states that \(\beta_i \neq 0\). These tests assess whether each predictor contributes significantly to the model after controlling for the other variables.
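As a sketch, the overall F-statistic can be assembled from the same sums of squares, again reusing the simulated quantities above; summary(lm()) reports the identical test:

```r
# Overall F-test: F = [(SST - SSE)/k] / [SSE/(n - k - 1)], with the
# p-value from the F distribution on (k, n - k - 1) degrees of freedom.
f_stat <- ((sst - sse) / k) / (sse / (n - (k + 1)))
p_val  <- pf(f_stat, df1 = k, df2 = n - (k + 1), lower.tail = FALSE)
c(F = f_stat, p = p_val)
```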


Additional Analysis

Additional analyses can be conducted to further assess model quality. Stepwise regression is a variable selection method that adds or removes predictors based on statistical criteria such as p-values. It is sometimes used to identify a more parsimonious model by including only predictors that meet significance thresholds. However, stepwise methods can exclude theoretically important variables, so results should be interpreted with caution. Another common procedure is k-fold cross-validation. In this approach, the dataset is divided into k equally sized folds, and the model is trained on k−1 folds and validated on the remaining fold. This process is repeated k times, and the results are averaged to evaluate model performance. In our case, k = 5. The Root Mean Squared Error (RMSE) is often used to compare models:

\[ \text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \]

RMSE measures the average size of the prediction errors. Lower RMSE values indicate better model fit and generalization performance.
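A minimal sketch of both procedures in R is shown below. It assumes a data frame dat with the model variables (our name, not from the analysis code); note that R’s step() selects by AIC rather than p-values:

```r
# Stepwise selection and 5-fold cross-validated RMSE for the full model.
fit <- lm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100,
          data = dat)
step(fit, direction = "both")     # stepwise selection (AIC-based in R)

set.seed(42)
k_folds <- 5
folds <- sample(rep(1:k_folds, length.out = nrow(dat)))   # random fold labels
mse <- sapply(1:k_folds, function(i) {
  m    <- lm(formula(fit), data = dat[folds != i, ])      # train on k-1 folds
  test <- dat[folds == i, ]
  mean((test$LNMEDHVAL - predict(m, test))^2)             # error on held-out fold
})
sqrt(mean(mse))                   # cross-validated RMSE
```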


Software

All of our data analysis and statistical tests were performed in R using RStudio, a free, publicly available environment commonly used for programming and statistics. Our final results and write-up are presented through R Markdown, a “file format for making dynamic documents with R” (Grolemund, 2014).

Results

##     Variable         Mean           SD
## 1    MEDHVAL 66287.733140 60006.075990
## 2 PCTBACHMOR    16.081372    17.769558
## 3 NBelPov100   189.770930   164.318480
## 4  PCTVACANT    11.288529     9.628472
## 5 PCTSINGLES     9.226473    13.249250

Table 1: Descriptive statistics of dependent and independent variables

The table above shows the descriptive statistics for all variables used in the regression analysis. The dependent variable, median house value, has a mean of $66,287.73 and a relatively large standard deviation ($60,006.08), indicating considerable variation in housing prices across the study area. Because the SD is almost as large as the mean, the variable is likely right-skewed, with a few areas having much higher house values than others.

Among the predictors, the number of households living in poverty averages 189.77 (SD = 164.32), showing that poverty levels also vary widely, with a high level of variability relative to the mean; this implies that the variable may be right-skewed. The percentage of individuals with a bachelor’s degree or higher has a mean of 16.08% with an SD (17.77) that slightly exceeds the mean, suggesting that some block groups have much higher education levels than others, another possible sign of skewness. The percentage of vacant housing units (mean = 11.29%, SD = 9.63) and the percentage of single-family detached units (mean = 9.23%, SD = 13.25) both display moderate variability, suggesting differences in housing stock and occupancy patterns; both variables might also be somewhat skewed. Overall, the data show substantial variation across all variables.

Figure 1: Histograms of Dependent and Independent Variables


The first set of histograms shows the original variables. All of these variables are strongly right-skewed, with most observations clustered near low values and a few extreme high outliers. This is common for socio-economic data such as house values and poverty counts.

Figure 2: Histograms of Natural Log-Transformed Dependent and Independent Variables


After applying the natural log transformation, as seen in the second set of histograms, the distributions became more symmetric overall. However, not all transformations produced meaningful improvements. While LNMEDHVAL and LNNBELPOV100 showed clear improvement in normality and spread, the other log-transformed predictors (LNPCTBACHMOR, LNPCTVACANT, and LNPCTSINGLES) developed a substantial zero spike, meaning many observations had values close to zero after logging. This distorted their distributions and made interpretation less meaningful. Therefore, it made sense to use the log-transformed versions only for MEDHVAL and NBelPov100, where the transformation successfully reduced skewness and stabilized variance. For the remaining variables, the original forms were retained because the log transformation introduced heavy zero clustering rather than improving normality. In summary, the regression model uses LNMEDHVAL (dependent variable) and LNNBELPOV100 (predictor) in log form, while keeping the other predictors in their original scale to maintain interpretability and distributional balance. Other regression assumptions are examined in a separate section below.

Figure 3: Choropleth Maps of Predictor Variables


The Percent Single-Family Detached Units, Percent with Bachelor’s or More, Log Number of Households Below Poverty, and Percent Vacant Housing Units maps all show similar spatial patterns. Their higher values are concentrated within the same areas, suggesting that neighborhoods with more single-family detached homes and higher educational attainment also tend to have lower vacancy rates. In contrast, areas with higher poverty counts and higher vacancy rates display much lower values for education and for the share of single-family detached units.

Figure 4: Choropleth Map of Outcome Variable


The log-transformed median house value map in Figure 4 shows a cluster of lower-value homes in parts of the Fairhill and Kensington neighborhoods, located centrally on the map and north of Center City. Lower home values also cluster near the Mill Creek neighborhood, north of University City, and within South Philadelphia. These are also the areas with a high concentration of block groups with lower educational attainment, higher vacancy rates, more households below the poverty line, and a lower share of single-family detached units.

This visual relationship indicates that more affluent and educated neighborhoods are spatially clustered and associated with higher housing values, while disadvantaged areas experience overlapping challenges of poverty and vacancy. Therefore, we can expect Percent with Bachelor’s or More and the poverty count to have a strong negative association, while both are likely to be strongly related to Median House Value in opposite directions. Because education, poverty, and single-family housing rates appear to co-vary across space, there may also be some degree of multicollinearity among these predictors in the regression model.

##             PCTVACANT PCTSINGLES PCTBACHMOR LNBELPOV100
## PCTVACANT       1.000     -0.151     -0.298       0.250
## PCTSINGLES     -0.151      1.000      0.198      -0.291
## PCTBACHMOR     -0.298      0.198      1.000      -0.320
## LNBELPOV100     0.250     -0.291     -0.320       1.000
Table 2: Correlations Across All Independent Variables (Multicollinearity Check)


In examining the correlation matrix in Table 2, none of the predictor pairs show correlations above |0.8|, which is typically the threshold for severe multicollinearity (Brusilovskiy, 2025). Here the observed correlations, for example, about –0.32 between the log poverty count and Percent with Bachelor’s or More, and 0.25 between the log poverty count and the vacancy rate, are relatively small. This means that, statistically, the predictors are not strongly linearly dependent, and thus multicollinearity is not severe. The correlation results support the visual interpretation from the maps. Areas with high education and more single-family detached homes correspond to higher median home values and lower poverty, consistent with the moderate negative correlations between the log poverty count and both education (–0.32) and single-family units (–0.29). Meanwhile, the vacancy rate shows a weak positive correlation with the log poverty count (0.25), aligning with the visual clustering of poverty and vacancy in the same neighborhoods. Thus, while the maps suggested overlapping spatial trends, the correlation matrix clarifies that these relationships are real but modest, indicating co-variation without problematic multicollinearity.

## 
## Call:
## lm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + 
##     LNBELPOV100, data = main_df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25825 -0.20391  0.03822  0.21744  2.24347 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 11.1137661  0.0465330 238.836 < 0.0000000000000002 ***
## PCTVACANT   -0.0191569  0.0009779 -19.590 < 0.0000000000000002 ***
## PCTSINGLES   0.0029769  0.0007032   4.234            0.0000242 ***
## PCTBACHMOR   0.0209098  0.0005432  38.494 < 0.0000000000000002 ***
## LNBELPOV100 -0.0789054  0.0084569  -9.330 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3665 on 1715 degrees of freedom
## Multiple R-squared:  0.6623, Adjusted R-squared:  0.6615 
## F-statistic: 840.9 on 4 and 1715 DF,  p-value: < 0.00000000000000022
## Analysis of Variance Table
## 
## Response: LNMEDHVAL
##               Df  Sum Sq Mean Sq  F value                Pr(>F)    
## PCTVACANT      1 180.392 180.392 1343.087 < 0.00000000000000022 ***
## PCTSINGLES     1  24.543  24.543  182.734 < 0.00000000000000022 ***
## PCTBACHMOR     1 235.118 235.118 1750.551 < 0.00000000000000022 ***
## LNBELPOV100    1  11.692  11.692   87.054 < 0.00000000000000022 ***
## Residuals   1715 230.344   0.134                                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 3: OLS Regression Results


The estimated model is:

\[ \ln(\text{MEDHVAL}) = 11.1138 - 0.0192(\text{PCTVACANT}) + 0.0030(\text{PCTSINGLES}) + 0.0209(\text{PCTBACHMOR}) - 0.0789(\ln(\text{NBELPOV100})) + \varepsilon_i \]

Because the dependent variable (LNMEDHVAL) is in logarithmic form while the predictors are in their original scales, this is a log-linear model. In such models, as x goes up by 1 unit, the expected change in the dependent variable y is \((e^{\beta_i} - 1) \times 100\%\), holding other variables constant. However, since all of the \(\beta_i\) values here are well below 0.3, we use the approximation that as x goes up by 1 unit, the expected change in y is approximately \(100\beta_i\%\).

For the Log of Households Below Poverty variable only, the model functions as a log-log model, where both the dependent variable y and the predictor x are log-transformed. Here, as x changes by 1%, the expected value of y changes by \((1.01^{\beta_i} - 1) \times 100\%\). However, because \(\beta_i\) is small in absolute value, we use the approximation \((1.01^{\beta_i} - 1) \times 100\% \approx \beta_i\%\).

Coefficient Interpretation

Percent Vacant Houses (PCTVACANT): The estimated coefficient is –0.0192, meaning that a one percentage point increase in vacancy rate is associated with an approximate 1.9% decrease in median house value, holding other factors constant. The very low p-value (< 0.001) allows us to reject the null hypothesis (H₀: β = 0) and conclude that vacancy rate has a statistically significant negative effect on housing values.

Percent Single-Family Detached Units (PCTSINGLES): The coefficient of 0.00298 suggests that each one percentage point increase in the share of single-family detached housing units is associated with roughly a 0.3% increase in median house value. Since the p-value is less than 0.001, we also reject the null hypothesis H₀, indicating that this relationship is statistically significant.

Percent with Bachelor’s or More (PCTBACHMOR): The coefficient of 0.0209 implies that each additional percentage point of adults with a bachelor’s degree or higher is associated with about a 2.1% increase in median house value. With a t-value of 38.49 and a p-value < 0.001, we can confidently reject the null hypothesis H₀ and conclude a strong positive and statistically significant effect of education on housing values.

Log of Households Below Poverty (LNNBELPOV): In this case, both the dependent variable (LNMEDHVAL) and the predictor (LNNBELPOV) are expressed in natural logarithms. The coefficient for LNNBELPOV is –0.0789, which means that a 1% increase in the number of households living below the poverty line is associated with an expected 0.0789% decrease in the median house value, holding all other factors constant. Again, the p-value < 0.001 leads us to reject null hypothesis H₀, confirming a strong negative and statistically significant association between poverty and housing value.

Model Fit and Significance: The model’s adjusted R² = 0.6615 indicates that approximately 66.15% of the variation in log median house values is explained by these four predictors. The F-statistic (840.9, p < 0.001) shows that, overall, the model is highly statistically significant, meaning we can reject the joint null hypothesis that all coefficients are simultaneously equal to zero.

Figure 7: Scatter plots of the dependent variable and each predictor


In the sections below, we discuss the model assumption tests; the variable distributions were examined earlier. The model assumes a linear relationship between median house value and the predictors; however, the scatter plots of the dependent variable against each predictor in Figure 7 tell varying stories. The PCTVACANT scatter plot visibly resembles an exponential decay curve: as PCTVACANT (percent of vacant housing units) increases, MEDHVAL (median house value) decreases rapidly at first and then levels off as the rate of decline slows. Secondly, the MEDHVAL vs. PCTSINGLES scatter plot does not show a distinct or easily recognizable curve; the points are fairly dispersed, and while there is an upward trend, the relationship between median house value and the percent of single-family detached units is not clearly linear or exponential. Finally, the remaining variables, percent with a bachelor’s degree or higher and the log of the number of households below the poverty line, appear roughly linear, as no distinct curvature emerges from their scatter plots.

Figure 8: Histogram of Standardized Residuals of the regression model


Standardized residuals are the residuals from a regression model that have been scaled by their estimated standard deviation. In other words, they measure how far each observed value is from the predicted value in standard deviation units.

The histogram of standardized residuals in figure 8 appears approximately normally distributed, with most residuals clustering around zero and fewer observations in the tails. This pattern indicates that the normality assumption of OLS regression is reasonably satisfied.
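In R, standardized residuals for a fitted lm object are available via rstandard(). A minimal sketch, assuming the fitted model is called fit as in the earlier snippet:

```r
# Histogram of standardized residuals to eyeball the normality assumption.
std_res <- rstandard(fit)   # residuals scaled by their estimated standard deviation
hist(std_res, breaks = 40, main = "Standardized residuals",
     xlab = "Standard deviations from the fitted value")
```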

Figure 9: Plot of Residual vs Predicted Values


Standardized residuals measure how far each observed value deviates from its predicted value, expressed in standard deviation units. The plot in Figure 9 is used to check whether the residuals behave like random noise, as required by the OLS assumptions.

In this context, the null hypothesis is homoscedasticity, meaning the residuals have constant variance (i.e., no heteroscedasticity). If a formal test produced a p-value less than 0.05, we would reject this null in favor of the alternative hypothesis of heteroscedasticity.

Visually, the residuals in the plot appear randomly scattered around zero with no distinct pattern, curve, or funnel shape. This suggests that the variance of errors remains roughly constant across fitted values, supporting the assumption of homoscedasticity. There are a few isolated points beyond ±3 standardized residuals, which could be considered mild outliers, but their number is small and they do not form a systematic pattern.
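One such formal test, not part of our reported analysis, is the Breusch-Pagan test from the lmtest package. A sketch, again assuming the fitted model fit:

```r
# Breusch-Pagan test: H0 is homoscedasticity (constant error variance);
# a p-value below 0.05 would indicate heteroscedasticity.
library(lmtest)
bptest(fit)
```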

Referencing the maps of the dependent variable and predictors, it appears that the data exhibit spatial autocorrelation, meaning that neighboring block groups tend to have similar values rather than being completely independent. For example, the maps for Percent with Bachelor’s or More, Percent Single-Family Detached Units, and Log Median House Value show clear clusters of high values concentrated in specific regions of the city, especially around the central and northwestern areas. Likewise, the poverty and vacancy maps show clusters of high values in other, often adjacent, more disadvantaged neighborhoods.

This spatial clustering indicates that block groups with high or low values tend to be located near others with similar characteristics, rather than being randomly distributed across space. Therefore, the assumption of independence among observations is likely violated. There appears to be positive spatial autocorrelation in several variables. In practical terms, this means that local housing market conditions, socioeconomic status, and educational attainment are spatially structured and may influence one another across nearby areas.

Figure 10: Choropleth map of standardized residuals


Figure 10 presents a choropleth map of the standardized residuals. From our initial observation of Figure 10, we are confident in stating that our OLS model residuals are spatially autocorrelated. Specifically, there is a large concentration of darker purple block groups at the center of the city, denoting systematic underprediction, and a large concentration of orange/yellow block groups around the North/Northeast Philadelphia area, denoting systematic overprediction.
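A formal check of this impression, beyond what we report here, would be Moran’s I on the residuals, for example with the spdep package. This sketch assumes an sf polygon object blocks whose rows match the regression data:

```r
# Moran's I test for spatial autocorrelation in the OLS residuals.
library(spdep)
nb <- poly2nb(blocks)                                # queen-contiguity neighbor list
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)  # row-standardized spatial weights
moran.test(residuals(fit), lw, zero.policy = TRUE)   # H0: no spatial autocorrelation
```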

##   Step Df Deviance Resid. Df Resid. Dev       AIC
## 1      NA       NA      1715   230.3435 -3448.073
Table 4: Stepwise Regression Results


The stepwise regression began with the full model including all four predictors: Percent Vacant Housing Units (PCTVACANT), Percent Single-Family Detached Units (PCTSINGLES), Percent with Bachelor’s or More (PCTBACHMOR), and Log Number of Households Below Poverty (LNNBELPOV). The final stepwise model retained all four predictors, indicating that each variable made a statistically significant contribution to explaining variation in log median house value (LNMEDHVAL).

The model achieved an R² of 0.6623 (Adjusted R² = 0.6615), indicating that approximately 66% of the variation in log median house value is explained by these predictors. Because the p-value is far below 0.05, we reject the null hypothesis that all regression coefficients (except the intercept) are equal to zero. This means that, collectively, the predictors explain a statistically significant portion of the variation in housing values.

## RMSE (Model 1): 0.3664
## RMSE (Model 2): 0.4427
Figure 11: K-Fold Cross-Validation Results


Cross-validation was used to compare predictive performance between the full model with all four predictors and a reduced model that includes only PCTVACANT and MEDHHINC (median household income).

Root Mean Square Error (RMSE) is a measure of how well a regression model predicts the dependent variable. It represents the average distance between the predicted values and the actual observed values, expressed in the same units as the dependent variable (in this case, log median house value). A lower RMSE indicates that the model’s predictions are closer to the true observed values, meaning the model has better predictive accuracy and smaller residual errors.

In this analysis, the full model’s RMSE of 0.366 is lower than the reduced model’s RMSE of 0.443, showing that the full model predicts log median house values more accurately. This suggests that including all four predictors helps capture more of the underlying variation in housing values, leading to improved model performance.

Discussion and Limitations

We used regression analysis to examine the relationship between the dependent variable, median house value, and our four independent variables, the number of households living in poverty, the percent of individuals with bachelor’s degrees or higher, the percent of vacant housing units, and the percent of single-family detached units, across census block groups in Philadelphia.

To check whether the OLS regression assumptions were met, we performed exploratory data analysis on the dataset. Through this analysis, we identified violations of the OLS linearity assumption, which we addressed by log-transforming variables and determining which log-transformed variables should be included in the model. We then regressed the log-transformed median house value, LN(MEDHVAL), on the proportion of residents with at least a bachelor’s degree (PCTBACHMOR), the proportion of vacant housing units (PCTVACANT), the percent of single-family housing units (PCTSINGLES), and the log-transformed number of households living in poverty, LN(NBELPOV100). Our regression revealed that all four predictors were statistically significant, with p-values < .05; therefore, we rejected the null hypothesis for each variable. This finding aligns with existing literature on how neighborhood characteristics impact residential property values (Musa & Yusoff, 2015).

Overall, we can conclude that this is a strong model based on its R² value and the F-ratio test. The model’s adjusted R² (0.6615) indicates that approximately 66.15% of the variance in log median house value is explained by these four predictors. Additionally, the F-statistic (840.9, p < 0.001) shows that the overall model is statistically significant; as such, we can reject the null hypothesis that all coefficients are jointly equal to zero. While not included in this model, future studies could incorporate other predictors that might be associated with our dependent variable, such as crime statistics, access to green space, access to transportation, and the number of schools.

We also performed a stepwise regression and a k-fold cross-validation to assess the quality of our model. The stepwise regression kept all four predictors, which suggests that these predictors are important enough to remain in the model and affirms the quality of the model. The RMSE for the four-predictor model (0.3664) was smaller than the RMSE for the two-predictor model (0.4427); a lower RMSE indicates that the four-predictor model is better at predicting unseen data.

The model is limited by violations of some key OLS assumptions. Our exploratory analysis revealed that the assumptions of linearity and independence of observations were violated. These violations affect the accuracy of the model. Specifically, the spatial clustering of observations suggests spatial autocorrelation, which can lead to systematic overprediction and underprediction in the model. Additionally, we use NBELPOV100, the number of households living in poverty, as a variable rather than a percentage; using a raw count instead of a rate introduces possible inconsistency because the value can change based on how census block groups are delineated.

Ridge and Lasso regression are alternatives to OLS regression that address some of its limitations, including requirements about the number of observations, multicollinearity, and overfitting. Specifically, Ridge regression allows predictors to outnumber observations, tolerates multicollinearity, and addresses overfitting by shrinking variable coefficients towards 0. Ridge regression is good for prediction accuracy, but it complicates model interpretation because it cannot select variables. Lasso regression, on the other hand, selects variables by setting more coefficients exactly to 0 and performs more shrinkage on the nonzero coefficients. Lasso performs better when a small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or equal to zero; in contrast, Ridge performs better when there are several predictors with coefficients of roughly equal size. In this case, neither Ridge nor Lasso is necessary because we have a large number of observations relative to predictors, the predictors are not strongly correlated, and all predictors make significant contributions to the model, as demonstrated by our cross-validation test.
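Had they been needed, both estimators are available in R through the glmnet package. A hedged sketch, assuming a predictor matrix built from our four variables in the assumed data frame dat:

```r
# Ridge (alpha = 0) and lasso (alpha = 1) fits with lambda chosen by
# cross-validation; `dat` is the assumed analysis data frame.
library(glmnet)
x <- as.matrix(dat[, c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR", "LNNBELPOV100")])
y <- dat$LNMEDHVAL
ridge <- cv.glmnet(x, y, alpha = 0)   # shrinks all coefficients toward zero
lasso <- cv.glmnet(x, y, alpha = 1)   # can zero out coefficients entirely
coef(ridge, s = "lambda.min"); coef(lasso, s = "lambda.min")
```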

References

Brusilovskiy, E. (2025). Statistical and Data Mining Methods for Urban Data Analysis.

Devore, J. L. (2015). Probability and Statistics for Engineering and the Sciences. Boston: Cengage Learning.

File, N., & Duchneskie, J. (2025, September 11). Philly is no longer the country’s poorest big city. The Philadelphia Inquirer. https://www.inquirer.com/news/philadelphia/poverty-rate-census-20250911.html

Grolemund, G. (2014, July 16). Introduction to R Markdown. RStudio. Retrieved from https://rmarkdown.rstudio.com/articles_intro.html

Hwang, J., & Ding, L. (2020). Unequal displacement: Gentrification, racial stratification, and residential destinations in Philadelphia. American Journal of Sociology, 126(2), 354–406.

Logan, J. R., & Bellman, B. (2016). Before The Philadelphia Negro: Residential Segregation in a Nineteenth-Century Northern City. Social Science History, 40(4), 683–706. doi:10.1017/ssh.2016.27

Rossiter, K. (2011, July 11). What are Census blocks? U.S. Census Bureau. Retrieved from https://www.census.gov/newsroom/blogs/random-samplings/2011/07/what-are-census-blocks.html