1 Introduction

This study examines the relationship between median house values and several neighborhood characteristics and establishes a model for predicting median house values, with a geographic focus on Philadelphia. Earlier models of house value prediction—especially the influential HUD report, Characteristic Prices of Housing in Fifty-Nine Metropolitan Areas—proposed a hedonic model, which has been widely adopted and which we will use here.1

\[R = f (S, N, C, t)\] where

            

    \(R =\) rent or value, 

    \(S =\) structural characteristics, 

    \(N =\) neighborhood characteristics (including location),

    \(C =\) contract conditions (implicit and explicit), and

    \(t =\) time trend (accounting for inflation over time).

A subsequent study of place-to-place housing price indexes and their determinants used the log-linear form of the hedonic model,2 expressed as:

\[\ln R = \beta_0 + S\beta_1 + N\beta_2 + L\beta_3 + C\beta_4 + \epsilon \] where

    \(\ln R =\) the natural log of rent, 

    \(R, S, N, C =\) same as above, \(L =\) location,

    \(\beta_i =\) the hedonic regression coefficient, and 

    \(\epsilon\) = the residuals.


Since the researchers mentioned “four appealing characteristics” of the log-linear form, including mitigating a common form of heteroskedasticity,3 our study will also use the logarithmic transformation of some variables. 


Both studies identified and categorized structural, neighborhood, and contract characteristics as three separate groups. Nonetheless, the analysis primarily focused on the structural characteristics (e.g., the number of bedrooms) and oversimplified the variables in neighborhood characteristics (i.e., using a generic subjective rating of how “good” a neighborhood is instead of using more objective parameters). Given that previous studies largely focused on structural characteristics, we will adopt similar regression models to explore four neighborhood characteristics that were previously overlooked:

  1. the percentage of individuals with Bachelor’s degrees or higher,

  2. the percentage of vacant houses,

  3. the number of households living in poverty, and

  4. the percentage of single family housing units.

            

2 Methods

2.1 Data Cleaning

The original Philadelphia block group dataset has 1816 observations. We clean the data by removing the following block groups:

  1. Block groups where population < 40
  2. Block groups where there are no housing units
  3. Block groups where the median house value is lower than $10,000
  4. One North Philadelphia block group which had a very high median house value (over $800,000) and a very low median household income (less than $8,000)

The final dataset contains 1720 block groups.

            

2.2 Exploratory Data Analysis

Before the regression analysis, we will perform basic exploratory analysis on our data. First and foremost, we will use histograms to check whether our data are normally distributed. If any variables are not normally distributed, we will transform them to help mitigate heteroscedasticity, nonlinearity, and the influence of outliers.

Correlation is a standardized measure of the strength of the linear relationship between two variables. The sample correlation coefficient \(r\) (also written \(\hat{ρ}\)) is a point estimate of the population correlation coefficient \(ρ\). The Pearson correlation coefficient \(r\) is given by:

\[\hat{ρ} = r = Corr(x,y) = \frac{\sum_{i=1}^{n}(x_i -\bar{x})(y_i -\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i -\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i -\bar{y})^2}}\] Here, \(\bar{x}\) and \(\bar{y}\) are values of the sample means of \(x\) and \(y\). Values of \(r\) range between -1, which means perfect negative correlation, and 1, which means perfect positive correlation. A value of 0 indicates that there’s no linear association between \(x\) and \(y\).
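To make the formula concrete, here is a minimal R sketch (with made-up vectors x and y) that computes \(r\) directly from the definition above and checks it against R's built-in cor() function:

# minimal sketch: Pearson correlation computed from the formula above,
# checked against R's built-in cor() (x and y are made-up example vectors)
x = c(2, 4, 6, 8, 10)
y = c(1, 3, 2, 7, 9)

r_manual = sum((x - mean(x)) * (y - mean(y))) /
             (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

r_builtin = cor(x, y)              # cor() defaults to the Pearson coefficient

all.equal(r_manual, r_builtin)     # TRUE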

In multiple regression, predictors should not be very strongly correlated with each other, i.e., there should be no multicollinearity. A common rule of thumb is that multicollinearity is present when \(r > 0.9\) or \(r < -0.9\). Multicollinearity can produce unstable or misleading coefficient estimates, making it difficult to determine how well each independent variable predicts, or helps explain, the dependent variable.

            

2.3 Multiple Regression Analysis

    

2.3.1 Method of Regression

Multiple linear regression is used to examine the relationship between a variable of interest (the dependent variable) and one or more explanatory variables. The goal of this method is to fit a linear relationship between a quantitative dependent variable \(Y\) and a set of predictors \(X_1,X_2,...,X_k\). This method allows us to estimate the amount by which the dependent variable changes when a predictor variable changes by one unit, holding the other predictors constant. As with correlation, the fact that an explanatory variable is a significant predictor of the dependent variable does not imply that it causes the dependent variable. Regression models are widely used in the biological, behavioral, and social sciences to explain possible relationships between variables.

    

2.3.2 Equation

For this project the equation is:

\[MEDHVAL\; =\; \beta_0\;+\;\beta_1 PCTBACHMOR\;+\;\beta_2 NBELPOV100\;+\;\beta_3 PCTVACANT\;+\;\beta_4 PCTSINGLES\;+\;\epsilon\]

Each independent variable will have its own slope coefficient, which indicates the relationship of that particular predictor with the dependent variable, controlling for all other independent variables in the regression. The coefficient \(\beta_i\) of each predictor is interpreted as the amount by which the dependent variable changes as that independent variable increases by one unit. For instance, \(\beta=0.5\) means that the dependent variable goes up by 0.5 units when the predictor goes up by 1 unit, holding all other predictors constant. The term \(\epsilon\) is the residual, or random error term, in the model. Without \(\epsilon\), all observed values would fall exactly on the true regression line.

Replacing the coefficients with their estimates gives the fitted regression equation:

\[\widehat{MEDHVAL}\; =\; \hat{\beta_0}\;+\;\hat{\beta_1} PCTBACHMOR\;+\;\hat{\beta_2} NBELPOV100\;+\;\hat{\beta_3} PCTVACANT\;+\;\hat{\beta_4} PCTSINGLES\]

    

2.3.3 Regression Assumptions

The first important assumption for linear regression is that the relationship between the dependent variable and each of the independent variables is linear. That is, the expected change in the dependent variable for a one-unit change in a predictor is the same regardless of the predictor's value. To check this assumption, we draw scatterplots between \(y\) and each of the predictors \(x\). If a relationship is not linear, we may choose to transform the variable, or use polynomial regression.

The second assumption is homoscedasticity, meaning that the variance of the residuals \(ε\) is constant regardless of the values of each \(x\); a closely related assumption is that the observations, and hence their residuals, are independent of one another. That is, the residual by predicted plot and the residual by \(X\) plots should show no systematic patterns across different values of \(X\). To check this, we can use scatterplots of standardized residuals against each predictor and against the predicted values. Heteroscedasticity often means that there is systematic under- or over-prediction in the model. Including additional predictors, or running a spatial regression, may help reduce or eliminate heteroscedasticity.

The normality of residuals matters for point estimation, confidence intervals, and hypothesis tests mainly in small samples; by the Central Limit Theorem, it is less critical with a large sample size (>30) than some of the other assumptions. We can check whether the residuals are normal by looking at their histogram. We are generally more likely to see normal residuals if the dependent variable and the predictor variables are themselves roughly normal. Violations of residual normality might arise either because the dependent or independent variables are non-normal, or because the linearity assumption is violated; in such cases, a nonlinear transformation of variables might solve both problems. Sometimes the problem with the residual distribution is mainly due to one or two very large errors, called outliers. If they are merely errors, or if they can be explained as unique events not likely to be repeated, then we may be able to remove them.

The final key assumption is the absence of multicollinearity, meaning that predictor variables shouldn't be strongly correlated with each other. If the correlation between any pair of predictors is greater than about 0.8 or 0.9 in absolute value, we might have multicollinearity. A better way to test for multicollinearity is to calculate the Variance Inflation Factor (VIF). The VIF for each predictor \(i\) is defined as follows:

\[VIF_i = \frac{1}{1-R_i^2}\]

Here, \(R_i^2\) is the \(R^2\) from regressing the \(i^{th}\) predictor on the remaining predictors. A VIF of 1 means that there is no correlation between the \(i^{th}\) predictor and the remaining predictor variables. The general rule of thumb is that a VIF exceeding 4 warrants further investigation, while a VIF exceeding 10 is a sign of serious multicollinearity requiring correction.
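For illustration, the sketch below computes the VIF manually for the four predictors used later in this report, by regressing each predictor on the others and applying the formula above (it assumes reg_data and ln_n_bel_pov_100 exist as created in Section 3.1.1):

# sketch: manual VIF calculation, using the R^2 of a regression of each predictor
# on all of the other predictors (assumes reg_data from Section 3.1.1)
vif_manual = c(
  PCTVACANT  = 1 / (1 - summary(lm(PCTVACANT ~ PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100, data = reg_data))$r.squared),
  PCTSINGLES = 1 / (1 - summary(lm(PCTSINGLES ~ PCTVACANT + PCTBACHMOR + ln_n_bel_pov_100, data = reg_data))$r.squared),
  PCTBACHMOR = 1 / (1 - summary(lm(PCTBACHMOR ~ PCTVACANT + PCTSINGLES + ln_n_bel_pov_100, data = reg_data))$r.squared),
  ln_n_bel_pov_100 = 1 / (1 - summary(lm(ln_n_bel_pov_100 ~ PCTVACANT + PCTSINGLES + PCTBACHMOR, data = reg_data))$r.squared)
)
round(vif_manual, 2)   # values near 1 indicate little multicollinearity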

    

2.3.4 Parameters that need to be estimated in multiple regression

Given \(n\) observations on \(y\), and \(k\) predictors \(x_1,x_2,...,x_k\), the estimates \(\hat{\beta_0},\hat{\beta_1},\hat{\beta_2},...,\hat{\beta_k}\) are chosen simultaneously to minimize the Error Sum of Squares, given by:

\[SSE = \displaystyle\sum_{i=1}^{n}{\epsilon_i^2} = \displaystyle\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2} = \displaystyle\sum_{i=1}^{n}{(y_i-\hat{\beta_0}-\hat{\beta_1}x_{1i}-\hat{\beta_2}x_{2i}-\;...\;-\hat{\beta_k}x_{ki})^2}\]

In multiple regression, with \(n\) observations on \(y\) and \(k\) predictors \(x_1,x_2,...,x_k\), the error variance is estimated by \(\hat{\sigma}^2 = MSE\;(mean\;squared\;error) = \frac{SSE}{n-(k+1)}\).
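These quantities can be recovered from any fitted model; a minimal sketch, where fit stands for a hypothetical lm() object:

# sketch: SSE and MSE from a fitted lm() object called fit (hypothetical)
sse = sum(residuals(fit)^2)
n   = length(residuals(fit))
k   = length(coef(fit)) - 1          # number of predictors (excluding the intercept)
mse = sse / (n - (k + 1))

sqrt(mse)              # equals the "Residual Std. Error" reported by summary(fit)
summary(fit)$sigma     # the same quantity, as computed by R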

    

2.3.5 Method of Estimating Parameters : OLS Regression

Multiple OLS regression requires us to estimate a critical set of parameters: a regression constant and one regression coefficient for each independent variable in our model. The regression constant tells us the predicted value of the dependent variable (DV, hereafter) when all of the independent variables (IVs, hereafter) equal 0. The unstandardized regression coefficient for each IV tells us how much the predicted value of the DV would change with a one-unit increase in that IV, holding all other IVs constant. OLS estimates these parameters by finding the values for the constant and coefficients that minimize the sum of the squared errors of prediction, i.e., the differences between a case’s actual score on the DV and the score we predict for it using its scores on the IVs. The vector of coefficient estimates \(\textrm{B}\) is given by \[\textrm{B}\;=\;\textrm{(X'X)}^{-1}\textrm{(X'Y)}\] where \(\textrm{Y}\) is an \(N×1\) column matrix of cases’ scores on the DV, \(\textrm{X}\) is an \(N×(k+1)\) matrix of cases’ scores on the IVs (the first column is a placeholder column of ones for the constant and the remaining columns correspond to each IV), and \(\textrm{B}\) is a \((k+1)×1\) column matrix containing the regression constant and coefficients.
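The matrix formula can be checked directly against lm(); the sketch below builds the design matrix with model.matrix() for the model fitted later in Section 3.2.1 (reg_data and its log-transformed columns are assumed to exist as created in Section 3.1.1):

# sketch: OLS coefficients from B = (X'X)^-1 (X'Y), compared with lm()
X = model.matrix(~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100,
                 data = reg_data)             # first column is the placeholder column of 1s
Y = reg_data$ln_med_h_val

B = solve(t(X) %*% X) %*% t(X) %*% Y          # closed-form estimates

cbind(closed_form  = as.vector(B),
      lm_estimates = coef(lm(ln_med_h_val ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100,
                             data = reg_data)))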

    

2.3.6 Coefficient of Multiple Determination R2, and the Adjusted R2

As in simple regression, \(SSE = \sum(y_i-\hat{y_i})^2\), which in the multiple regression case can be written as: \[SSE = \sum(y_i-\hat{y_i})^2 = \sum{[y_i-(\hat{\beta_0}+\hat{\beta_1}x_{1i}+\hat{\beta_2}x_{2i}+\;...\;+\hat{\beta_k}x_{ki})]^2}\] In multiple regression, with \(k\) predictors and \(n\) observations,

\[\sigma^2 = MSE\;(mean\;of\;squared\;error) = \frac{SSE}{n-(k+1)} \]

Also, as in simple regression, \(SST = \sum(y_i-\bar{y})^2\) and \(R^2 = 1- \frac{SSE}{SST}\). In multiple regression, \(R^2\) is the coefficient of multiple determination: the proportion of variance in the dependent variable explained by all \(k\) predictors. Because adding predictors will generally increase \(R^2\), it is typically adjusted for the number of predictors; the result is called the Adjusted \(R^2\): \[R^2_{adj} = \frac{(n-1)R^2-k}{n-(k+1)}\]
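As a quick numerical check, the sketch below recomputes \(R^2\) and the adjusted \(R^2\) from the sums of squares for a hypothetical fitted model fit and compares them to the values reported by summary():

# sketch: R^2 and adjusted R^2 from SSE and SST (fit is a hypothetical lm() object)
y   = model.response(model.frame(fit))
sse = sum(residuals(fit)^2)
sst = sum((y - mean(y))^2)
n   = length(y)
k   = length(coef(fit)) - 1

r2     = 1 - sse / sst
r2_adj = ((n - 1) * r2 - k) / (n - (k + 1))

c(r2, summary(fit)$r.squared)          # should match
c(r2_adj, summary(fit)$adj.r.squared)  # should match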

    

2.3.7 Hypothesis Testing : F-Ratio and T-Test

Before doing a hypothesis test for each individual predictor (t-test), we do a so-called model utility test, referred to as the F-ratio, which tests the null hypothesis \(H_0\) that all coefficients in the model are zero against the alternative hypothesis \(H_a\) that at least one of the coefficients is not 0. Under \(H_0\), none of the independent variables is a significant predictor of the dependent variable, which would mean our model has essentially no explanatory power. In addition to the F-test, a t-test (hypothesis test) is done for each predictor \(i\); most statisticians will only look at these individual t-tests if the F-test is significant. In particular, for each predictor \(i\), \(\frac{\hat{β_i}-β_i}{s_{\hat{β_i}}}\) has a t distribution with \(ν = n-(k+1)\) degrees of freedom. A 95% confidence interval for \(β_i\) is calculated as \(\hat{β_i}\;±\;t_{0.025,ν}\,s_{\hat{β_i}}\). In general, if the p-value for a certain independent variable is less than 0.05, we can reject the null hypothesis \(H_0: β_i=0\) (i.e., that this particular predictor is not a significant predictor of the dependent variable) in favor of \(H_a: β_i≠0\), at most reasonable levels of \(α\), the probability of a Type I error.
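In R, the F-ratio, the per-coefficient t-tests, and the 95% confidence intervals described above can be obtained from a fitted model object; a minimal sketch, with fit again standing for a hypothetical lm() fit:

# sketch: model utility test, t-tests, and 95% confidence intervals (fit is hypothetical)
summary(fit)$fstatistic          # F-ratio and its degrees of freedom
summary(fit)$coefficients        # estimates, standard errors, t values, and p-values

confint(fit, level = 0.95)       # beta_hat +/- t_{0.025, nu} * s_{beta_hat} for each coefficient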

    

2.4 Additional Analyses

    

2.4.1 Stepwise Regression

Stepwise regression is a data mining method that selects predictors based on criteria such as p-values below a certain threshold or the smallest value of the Akaike Information Criterion (AIC), a measure of the relative quality of statistical models.

Stepwise regression has several limitations. First, the final model is not guaranteed to be optimal in any specific sense. Second, the procedure yields only a single final model, although there may be several equally good alternatives. It also does not account for a researcher’s knowledge about the predictors.
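For reference, a minimal sketch of how stepwise selection is typically invoked in R with MASS::stepAIC (full_model is assumed to be an already-fitted lm() object; the same function is used in Section 3.4.1):

# sketch: stepwise selection by AIC with MASS::stepAIC
library(MASS)
step_model = stepAIC(full_model, direction = "both")   # variables may enter and leave
summary(step_model)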

2.4.2 K-Fold Cross-Validation

Imagine that we have a data set with 1000 observations and are trying to fit a model using this data set. One approach is to randomly split the observations into a training data set and a validation data set. We use the training data set to “train” (fit) the regression model. We then take the \(\beta\) coefficients estimated from the training data set and compute the predicted value \(\hat{y_i}\) and the residual \(ε_i=y_i−\hat{y_i}\) for each observation \(i\) in the validation data set, and use those residuals to calculate a statistic known as the Root Mean Square Error (RMSE). If we are comparing two candidate models, we repeat this process for each model.

The problem with this approach is that the RMSE depends on which observations have been randomly assigned to the training data set versus the validation data set. If we repeat the process of randomly splitting the sample set into two parts, we will get a somewhat different estimate for the RMSE each time, and that may affect the selection of the best model. To solve this issue, we run a k-fold cross-validation test. This test involves randomly dividing the set of observations into \(k\) groups or folds of equal size. The 1st fold is treated as a validation data set and the model is fitted on the remaining \(k-1\) folds (training data sets). The MSE is computed for the validation fold. This procedure is repeated \(k\) times; each time, a different fold is treated as the validation data set. This process results in \(k\) estimates of the MSE. The k-fold MSE estimate is computed by averaging the MSEs across the \(k\) folds. The k-fold RMSE is computed as the square root of that MSE. We can compare the RMSE values for different models and choose the model with the smallest RMSE as the best one. In our analysis below, we use \(k = 5\).
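A minimal sketch of this procedure written in base R is shown below (dat, fml, and the response are hypothetical placeholders; in Section 3.4.2 we use the caret package instead):

# sketch: k-fold cross-validated RMSE with base R
# (dat and fml are hypothetical placeholders for a data frame and a model formula)
kfold_rmse = function(dat, fml, k = 5) {
  set.seed(1234)
  folds = sample(rep(1:k, length.out = nrow(dat)))   # randomly assign each row to a fold
  mse_by_fold = numeric(k)
  for (i in 1:k) {
    train = dat[folds != i, ]
    valid = dat[folds == i, ]
    fit   = lm(fml, data = train)
    pred  = predict(fit, newdata = valid)
    obs   = model.response(model.frame(fml, data = valid))
    mse_by_fold[i] = mean((obs - pred)^2)             # MSE for this validation fold
  }
  sqrt(mean(mse_by_fold))                             # k-fold RMSE
}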

2.5 Tools

The analyses and visualizations for this report have all been done in R.

3 Results

3.1 Exploratory Results

3.1.1 Import data

In order to complete this entire project in R (rather than using ArcGIS, too), we have chosen to use the shapefile of data, rather than the .csv. Below, we import the shapefile and use a custom function to apply log transformations to the relevant columns. The function checks whether there are zero values in each column and then applies the appropriate log transformation accordingly.

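#note: packages are assumed to be loaded in a setup chunk not shown here
#(sf, dplyr, ggplot2, ggpubr, gt, tmap, corrplot, stargazer, MASS, and caret,
# based on the functions called throughout this report)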
#Nissim
reg_data = read_sf('C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/ass_1_data_shp/RegressionData.shp')

#Min
#reg_data = read_sf('C:/Users/vestalk/Desktop/00_Upenn/20.Fall/03.MUSA 5000 Spatial Statistics and Data Analysis/Assignment/HW1/RegressionData.shp')


#define a function that returns the share of non-zero values in a column
#(a return value of 1 means the column has no zeros, so a plain log can be taken)
col_zeros = function(a, b) {
                  pct_col_zeros = count(subset(st_drop_geometry(a), b != 0)) |>
                                      pull(n) / nrow(st_drop_geometry(a))
                  return(pct_col_zeros)
                  }


#apply the function with a case_when statement:
#if a column contains no zeros, take log(x); otherwise take log(1 + x) to avoid log(0).
#note: the condition returned by col_zeros() has length 1, so base ifelse() would return a
#single value that mutate() recycles to every row; case_when() broadcasts correctly here.
reg_data = reg_data |>
            mutate(
                ln_med_h_val = case_when(col_zeros(reg_data, reg_data$MEDHVAL) == 1 ~ log(reg_data$MEDHVAL),
                                     TRUE ~ log(1 + reg_data$MEDHVAL)),
                   ln_pct_bach_more = case_when(col_zeros(reg_data, reg_data$PCTBACHMOR) == 1 ~ log(reg_data$PCTBACHMOR),
                                     TRUE ~ log(1 + reg_data$PCTBACHMOR)),
                   ln_n_bel_pov_100 = case_when(col_zeros(reg_data, reg_data$NBelPov100) == 1 ~ log(reg_data$NBelPov100),
                                     TRUE ~ log(1 + reg_data$NBelPov100)),
                   ln_pct_vacant = case_when(col_zeros(reg_data, reg_data$PCTVACANT) == 1 ~ log(reg_data$PCTVACANT),
                                     TRUE ~ log(1 + reg_data$PCTVACANT)),
                   ln_pct_singles = case_when(col_zeros(reg_data, reg_data$PCTSINGLES) == 1 ~ log(reg_data$PCTSINGLES),
                                     TRUE ~ log(1 + reg_data$PCTSINGLES)),
                  )

3.1.2 Data Table

The table below summarizes the mean and standard deviation of the dependent variable (median house value) and the four predictors. We can see that two predictors (percent with a Bachelor’s degree or higher and percent of single house units) have standard deviations larger than their means, indicating a large variation in values. In the following sections (3.1.3 and 3.1.4), the data from this table are visualized in histograms.

med_house_val = c("Median House Value", mean(reg_data$MEDHVAL), sd(reg_data$MEDHVAL))

hhs_in_pov = c("# Households Living in Poverty", mean(reg_data$NBelPov100), sd(reg_data$NBelPov100))

pct_w_bach_or_higher = c("% of Individuals with Bachelor's Degrees or Higher", mean(reg_data$PCTBACHMOR), sd(reg_data$PCTBACHMOR))

pct_vac_houses = c("% of Vacant Houses", mean(reg_data$PCTVACANT), sd(reg_data$PCTVACANT))

pct_sing_house_units = c("% of Single House Units", mean(reg_data$PCTSINGLES), sd(reg_data$PCTSINGLES))

table = as.data.frame(t(data.frame(
              med_house_val,
              hhs_in_pov,
              pct_w_bach_or_higher,
              pct_vac_houses,
              pct_sing_house_units
              )))

colnames(table) = c("Variable", "Mean", "SD")

table$Mean = as.numeric(table$Mean)
table$SD = as.numeric(table$SD)

table = table |>
          mutate_if(is.numeric, round, digits = 3)

table_out = table |>
        gt() |>
        tab_header(
          title = md("**Summary Statistics**")
        ) |>
        tab_row_group(
          label = md('**Predictors**'),
          rows = 2:5
        ) |>
        tab_row_group(
          label = md('**Dependent Variable**'),
          rows = 1
        )

#print output
table_out
Summary Statistics

Variable                                               Mean       SD
Dependent Variable
  Median House Value                                   66287.733  60006.076
Predictors
  # Households Living in Poverty                       189.771    164.318
  % of Individuals with Bachelor's Degrees or Higher   16.081     17.770
  % of Vacant Houses                                   11.289     9.628
  % of Single House Units                              9.226      13.249

3.1.3 Histograms

Figures 1a - 1e show the distributions of the dependent variable and the four predictors before logarithmic transformation. We can see that the dependent variable and all four predictors are not normally distributed; all of them are positively skewed. To bring the variables closer to normal distributions, we performed logarithmic transformations, shown in the following section.

 house_val = ggplot(reg_data) +
                geom_histogram(aes(MEDHVAL)) +
                geom_vline(xintercept = mean(reg_data$MEDHVAL), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$MEDHVAL) + sd(reg_data$MEDHVAL)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$MEDHVAL) - sd(reg_data$MEDHVAL)), linetype = 'dashed') +
    labs(title = "Figure 1a",
        subtitle = "Histogram of Median House Values",
        x = "Median House Value") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$MEDHVAL), y = 850, label = "bar(x) ", size = 7, parse = T)+
    annotate("text", x = (mean(reg_data$MEDHVAL) + sd(reg_data$MEDHVAL)+1000), y = 850, label = "~sigma ", size = 7, parse = T)+
    annotate("text", x = (mean(reg_data$MEDHVAL) - sd(reg_data$MEDHVAL)-2500), y = 850, label = "~-sigma ", size = 7, parse = T)
  
  pct_bach = ggplot(reg_data) +
    geom_histogram(aes(PCTBACHMOR)) +
    geom_vline(xintercept = mean(reg_data$PCTBACHMOR), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$PCTBACHMOR) + sd(reg_data$PCTBACHMOR)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$PCTBACHMOR) - sd(reg_data$PCTBACHMOR)), linetype = 'dashed') +
     labs(title = "Figure 1b",
        subtitle = "Histogram of Educational Achievement",
        x = "% w/a Bachelor's or Higher") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$PCTBACHMOR), y = 300, label = "bar(x) ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTBACHMOR) + sd(reg_data$PCTBACHMOR)), y = 300, label = "~sigma ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTBACHMOR) - sd(reg_data$PCTBACHMOR)), y = 300, label = "~-sigma ", size = 5, parse = T)
  
  nbelpov = ggplot(reg_data) +
    geom_histogram(aes(NBelPov100)) +
    geom_vline(xintercept = mean(reg_data$NBelPov100), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$NBelPov100) + sd(reg_data$NBelPov100)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$NBelPov100) - sd(reg_data$NBelPov100)), linetype = 'dashed') +
     labs(title = "Figure 1c",
        subtitle = "Histogram of Poverty Levels",
        x = "# Below Poverty Line") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$NBelPov100), y = 300, label = "bar(x) ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$NBelPov100) + sd(reg_data$NBelPov100)), y = 300, label = "~sigma ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$NBelPov100) - sd(reg_data$NBelPov100)), y = 300, label = "~-sigma ", size = 5, parse = T)
  
  pct_vac = ggplot(reg_data) +
    geom_histogram(aes(PCTVACANT)) +
    geom_vline(xintercept = mean(reg_data$PCTVACANT), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$PCTVACANT) + sd(reg_data$PCTVACANT)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$PCTVACANT) - sd(reg_data$PCTVACANT)), linetype = 'dashed') +
     labs(title = "Figure 1d",
        subtitle = "Histogram of Vacancy Rates",
        x = "% Vacancy Rate") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$PCTVACANT), y = 275, label = "bar(x) ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTVACANT) + sd(reg_data$PCTVACANT)), y = 275, label = "~sigma ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTVACANT) - sd(reg_data$PCTVACANT)), y = 275, label = "~-sigma ", size = 5, parse = T)
  
  pct_sing = ggplot(reg_data) +
    geom_histogram(aes(PCTSINGLES)) +
    geom_vline(xintercept = mean(reg_data$PCTSINGLES), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$PCTSINGLES) + sd(reg_data$PCTSINGLES)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$PCTSINGLES) - sd(reg_data$PCTSINGLES)), linetype = 'dashed') +
     labs(title = "Figure 1e",
        subtitle = "Histogram of Single House Units",
        x = "% Single House Units") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$PCTSINGLES), y = 450, label = "bar(x) ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTSINGLES) + sd(reg_data$PCTSINGLES)), y = 450, label = "~sigma ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTSINGLES) - sd(reg_data$PCTSINGLES)), y = 450, label = "~-sigma ", size = 5, parse = T)
  
  house_val

  ggarrange(pct_bach, nbelpov, pct_vac, pct_sing)

3.1.4 Log Transform Histograms

Figures 2a - 2e show the distribution of our variables after logarithmic transformation. Log transformation helps to achieve a linear relationship between the dependent variable and the independent variables (predictors), as well as normality of the residuals, both of which are important assumptions for regression.

As is apparent in the figures, there is a high frequency of 0-values for the following log-transformed variables: percent bachelor’s degree (Fig. 2b), vacancy rate (Fig. 2d), and percent single house units (Fig. 2e). This is unhelpful for our regression model. On the other hand, median house value (Fig. 2a) and poverty level (Fig. 2c) are more normally distributed after the transformation. Hence, we use the log-transformed versions of median house value (our dependent variable) and poverty level (one of our predictors), while keeping the other three predictors in their original form in our regression.

In addition to 1) the linear relationship between the dependent variable and the predictors and 2) the normality of residuals—both of which the log transformation helps achieve—we will test several other regression assumptions later in Section 3.3. These are 3) homoscedasticity, 4) the independence of observations, 5) the absence of multicollinearity, and 6) that each predictor has at least 10 observations.

 ln_house_val = ggplot(reg_data) +
                geom_histogram(aes(ln_med_h_val)) +
                geom_vline(xintercept = mean(reg_data$ln_med_h_val), color = 'darkred') +
                geom_vline(xintercept = (mean(reg_data$ln_med_h_val) + sd(reg_data$ln_med_h_val)), linetype = 'dashed')+
                geom_vline(xintercept = (mean(reg_data$ln_med_h_val) - sd(reg_data$ln_med_h_val)), linetype = 'dashed') +
                labs(title = "Figure 2a",
                    subtitle = "Histogram of Ln of Median House Values",
                    x = "Ln of Median House Value") +
                theme_minimal() +
                theme(plot.title = element_text(hjust = 0.5),
                      plot.subtitle = element_text(hjust = 0.5),
                      axis.title.y = element_blank()) +
                annotate("text", x = mean(reg_data$ln_med_h_val), y = 225, label = "bar(x) ", size = 7, parse = T)+
                annotate("text", x = (mean(reg_data$ln_med_h_val) + sd(reg_data$ln_med_h_val)), y = 225, label = "~sigma ", size = 7, parse = T)+
                annotate("text", x = (mean(reg_data$ln_med_h_val) - sd(reg_data$ln_med_h_val)), y = 225, label = "~-sigma ", size = 7, parse = T)
  
  ln_pct_bach = ggplot(reg_data) +
                geom_histogram(aes(ln_pct_bach_more)) +
                geom_vline(xintercept = mean(reg_data$ln_pct_bach_more), color = 'darkred') +
                  geom_vline(xintercept = (mean(reg_data$ln_pct_bach_more) + sd(reg_data$ln_pct_bach_more)), linetype = 'dashed')+
                  geom_vline(xintercept = (mean(reg_data$ln_pct_bach_more) - sd(reg_data$ln_pct_bach_more)), linetype = 'dashed') +
                  labs(title = "Figure 2b",
                      subtitle = "Histogram of Ln of Educational Achievement",
                      x = "Ln of Educational Achievement") +
                  theme_minimal() +
                  theme(plot.title = element_text(hjust = 0.5),
                        plot.subtitle = element_text(hjust = 0.5),
                        axis.title.y = element_blank()) +
                  annotate("text", x = mean(reg_data$ln_pct_bach_more), y = 140, label = "bar(x) ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_bach_more) + sd(reg_data$ln_pct_bach_more)), y = 140, label = "~sigma ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_bach_more) - sd(reg_data$ln_pct_bach_more)), y = 140, label = "~-sigma ", size = 5, parse = T)
  
  ln_nbelpov = ggplot(reg_data) +
                geom_histogram(aes(ln_n_bel_pov_100)) +
                geom_vline(xintercept = mean(reg_data$ln_n_bel_pov_100), color = 'darkred') +
                  geom_vline(xintercept = (mean(reg_data$ln_n_bel_pov_100) + sd(reg_data$ln_n_bel_pov_100)), linetype = 'dashed')+
                  geom_vline(xintercept = (mean(reg_data$ln_n_bel_pov_100) - sd(reg_data$ln_n_bel_pov_100)), linetype = 'dashed') +
                  labs(title = "Figure 2c",
                      subtitle = "Histogram of Ln of Poverty Levels",
                      x = "Ln of # Below Poverty Line") +
                  theme_minimal() +
                  theme(plot.title = element_text(hjust = 0.5),
                        plot.subtitle = element_text(hjust = 0.5),
                        axis.title.y = element_blank()) +
                  annotate("text", x = mean(reg_data$ln_n_bel_pov_100), y = 225, label = "bar(x) ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_n_bel_pov_100) + sd(reg_data$ln_n_bel_pov_100)), y = 225, label = "~sigma ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_n_bel_pov_100) - sd(reg_data$ln_n_bel_pov_100)), y = 225, label = "~-sigma ", size = 5, parse = T)
  
  ln_pct_vac = ggplot(reg_data) +
                geom_histogram(aes(ln_pct_vacant)) +
                geom_vline(xintercept = mean(reg_data$ln_pct_vacant), color = 'darkred') +
                  geom_vline(xintercept = (mean(reg_data$ln_pct_vacant) + sd(reg_data$ln_pct_vacant)), linetype = 'dashed')+
                  geom_vline(xintercept = (mean(reg_data$ln_pct_vacant) - sd(reg_data$ln_pct_vacant)), linetype = 'dashed') +
                  labs(title = "Figure 2d",
                      subtitle = "Histogram of Ln of Vacancy Rates",
                      x = "Ln of % Vacancy Rate") +
                  theme_minimal() +
                  theme(plot.title = element_text(hjust = 0.5),
                        plot.subtitle = element_text(hjust = 0.5),
                        axis.title.y = element_blank()) +
                  annotate("text", x = mean(reg_data$ln_pct_vacant), y = 150, label = "bar(x) ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_vacant) + sd(reg_data$ln_pct_vacant)), y = 150, label = "~sigma ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_vacant) - sd(reg_data$ln_pct_vacant)), y = 150, label = "~-sigma ", size = 5, parse = T)
  
  ln_pct_sing = ggplot(reg_data) +
                geom_histogram(aes(ln_pct_singles)) +
                geom_vline(xintercept = mean(reg_data$ln_pct_singles), color = 'darkred') +
                  geom_vline(xintercept = (mean(reg_data$ln_pct_singles) + sd(reg_data$ln_pct_singles)), linetype = 'dashed')+
                  geom_vline(xintercept = (mean(reg_data$ln_pct_singles) - sd(reg_data$ln_pct_singles)), linetype = 'dashed') +
                  labs(title = "Figure 2e",
                      subtitle = "Histogram of Ln of Single House Units",
                      x = "Ln of % Single House Units") +
                  theme_minimal() +
                  theme(plot.title = element_text(hjust = 0.5),
                        plot.subtitle = element_text(hjust = 0.5),
                        axis.title.y = element_blank()) +
                  annotate("text", x = mean(reg_data$ln_pct_singles), y = 225, label = "bar(x) ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_singles) + sd(reg_data$ln_pct_singles)), y = 225, label = "~sigma ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_singles) - sd(reg_data$ln_pct_singles)), y = 225, label = "~-sigma ", size = 5, parse = T)
  
  ln_house_val

  ggarrange(ln_pct_bach, ln_nbelpov, ln_pct_vac, ln_pct_sing)

3.1.5 Choropleth Maps

Figures 3a - 3e depict the geographic distribution of our variables. Note that all five maps show the log transformed versions of our variables.

By comparing the maps, we can see a noticeable similarity between the log of median house value (Fig. 3a) and the log of percent bachelor’s degree (Fig. 3b), and an opposite pattern in the log of the percent of vacant homes (Fig. 3d). The relationship between house values (Fig. 3a) and poverty levels (Fig. 3c) or single house units (Fig. 3e) is less clear based solely on the maps.

Comparing the maps of the four predictor variables (Fig. 3b, 3c, 3d, and 3e), we see distinct spatial patterns for each variable. This suggests that they are unlikely to be strongly inter-correlated, and therefore unlikely to introduce multicollinearity.

#lifted from lovelace: https://geocompr.robinlovelace.net/adv-map.html#faceted-maps
tmap_mode("plot")
## tmap mode set to plotting
#phl_city_lims = st_read("C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/phl_city_limits/City_Limits.shp")

phl_city_lims = st_read("https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson")
## Reading layer `City_Limits' from data source 
##   `https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -75.28031 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
## Geodetic CRS:  WGS 84
tm_shape(reg_data) + 
  tm_polygons(title = "Ln of Median House Value", col = "ln_med_h_val", border.col = NA, border.alpha = 0, lwd = 0, palette = "Blues", style = "jenks") + 
  tm_shape(phl_city_lims) +
  tm_borders(col = "grey", lwd = 5) +
  tm_compass(position = c("left", "top")) +
  tm_layout(main.title = "Figure 3a",
            legend.position = c("right", "bottom")) 

facets = c("ln_pct_bach_more",
           "ln_n_bel_pov_100",
           "ln_pct_vacant",
           "ln_pct_singles")

facet_titles = c("Ln of Edu. Attain.",
                 "Ln of Pov. Levels",
                 "Ln of Vacancy",
                 "Ln of Single Occ")

tm_shape(reg_data) + 
  tm_polygons(facets, title = facet_titles, border.col = NA, border.alpha = 0,lwd = 0, palette = "Blues", style = "jenks") + 
  tm_facets(nrow = 2, sync = TRUE) +
  tm_layout(legend.position = c("right", "bottom"),
            panel.labels = c("Figure 3b",
                             "Figure 3c",
                             "Figure 3d",
                             "Figure 3e"))

3.1.6 Correlation Matrix

To further test for multicollinearity, we have created a correlation matrix of our predictor variables. The matrix suggests weak correlations across predictors (the coefficients fall between -0.32 and 0.25), meaning no pair of predictors is strongly correlated. Hence, there is no severe multicollinearity, which aligns with our earlier reading of the choropleth maps.

#https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
corr_reg_data = reg_data |>
                  st_drop_geometry() |>
                  dplyr::select(
                                PCTVACANT,
                                PCTSINGLES,
                                PCTBACHMOR,
                                LNNBELPOV)

corrplot(cor(corr_reg_data), method = "number", type = "lower", tl.col = "black", tl.cex = 0.75, number.cex = 1)

3.2 Regression Results

Our model regressed the log of median house values (ln_med_h_val) on four predictors, namely,

  1. the percent of individuals with bachelor’s degrees or higher (PCTBACHMOR),
  2. the number of households living below the poverty line (NBelPov100, log-transformed as ln_n_bel_pov_100),
  3. the percent of vacant houses (PCTVACANT), and
  4. the percent of single house units (PCTSINGLES).

The regression output suggests significant relationships between the log of median house value and each of the four predictors (p < 0.001 for all predictors). Specifically, median house value is positively associated with the percent of individuals with a bachelor’s degree or higher and with the percent of single house units, and negatively associated with the number of households living in poverty and the percent of vacant houses.

Because the dependent variable is log-transformed, the coefficients translate into approximate percent changes in median house value: a one-point increase in the percent of vacant houses (\(\hat{β}\) = -0.019) is associated with a roughly 1.9% decrease, a one-point increase in the percent of single house units (\(\hat{β}\) = 0.003) with a roughly 0.3% increase, and a one-point increase in the percent of individuals with a bachelor’s degree or higher (\(\hat{β}\) = 0.021) with a roughly 2.1% increase in median house value. Because the poverty predictor is also log-transformed, its coefficient (\(\hat{β}\) = -0.079) is interpreted as an elasticity: a 1% increase in the number of households living in poverty is associated with a roughly 0.08% decrease in median house value.

That all four predictors have p-values of less than 0.001 suggests that, if there were actually no relationship between median house value and a given predictor, the probability of obtaining a coefficient estimate at least as extreme as the one observed would be less than 0.001. We can therefore reject \(H_0: β_i = 0\) in favor of \(H_a: β_i ≠ 0\) for each of the four predictors, at most reasonable levels of \(α\) (the probability of a Type I error).

3.2.1 Regression

lm = lm(ln_med_h_val ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100, data = reg_data)

stargazer(lm, type = "text", title = "Linear Regression", align = TRUE)
## 
## Linear Regression
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            ln_med_h_val        
## -----------------------------------------------
## PCTVACANT                    -0.019***         
##                               (0.001)          
##                                                
## PCTSINGLES                   0.003***          
##                               (0.001)          
##                                                
## PCTBACHMOR                   0.021***          
##                               (0.001)          
##                                                
## ln_n_bel_pov_100             -0.079***         
##                               (0.008)          
##                                                
## Constant                     11.114***         
##                               (0.047)          
##                                                
## -----------------------------------------------
## Observations                   1,720           
## R2                             0.662           
## Adjusted R2                    0.662           
## Residual Std. Error      0.366 (df = 1715)     
## F Statistic          840.857*** (df = 4; 1715) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
stargazer(anova(lm), type = "text", title = "Analysis of Variance", align = TRUE)
## 
## Analysis of Variance
## =============================================
## Statistic N  Mean   St. Dev.  Min      Max   
## ---------------------------------------------
## Df        5 343.800 766.524    1      1,715  
## Sum Sq    5 136.418 110.193  11.692  235.118 
## Mean Sq   5 90.376  109.227  0.134   235.118 
## F value   4 840.857 832.891  87.054 1,750.551
## Pr(> F)   4  0.000   0.000     0        0    
## ---------------------------------------------
pred_vals = fitted(lm)

resids = residuals(lm)

stand_resids = rstandard(lm)

lm_df = data.frame(reg_data$ln_med_h_val, pred_vals, stand_resids) |>
          rename(ln_med_h_val = reg_data.ln_med_h_val)

3.3 Regression Assumption Checks

In the earlier sections, we looked at statistical and geospatial distributions of our variables. This allowed us to partially check our regression assumptions, such as linearity, normality, and multi-collinearity. In this section, we conduct further testing to ensure other assumptions are met and our model is apt.

As mentioned above, the regression assumptions that need to be checked include

  1. the linear relationship between the dependent variable and predictors,
  2. the normality of residuals,
  3. homoscedasticity,
  4. the independence of observations,
  5. the absence of multi-collinearity, and
  6. that each predictor has at least 10 observations.

3.3.1 Scatterplots of Predictors

The four scatterplots below depict the relationship between the log of median house value and each of the four predictors. The plots do not show clearly linear relationships, so our first assumption—that the relationships between the dependent variable and its predictors are linear—is not met.

  pct_bach_plot = ggplot(reg_data) +
                    geom_point(aes(x = ln_med_h_val, 
                                   y = PCTBACHMOR)) +
                    theme_minimal()
  
  nbelpov_plot = ggplot(reg_data) +
                    geom_point(aes(x = ln_med_h_val, 
                                   y = NBelPov100)) +
                    theme_minimal()
  
  pct_vac_plot = ggplot(reg_data) +
                    geom_point(aes(x = ln_med_h_val, 
                                   y = PCTVACANT)) +
                    theme_minimal()
    
  pct_sing_plot = ggplot(reg_data) +
                    geom_point(aes(x = ln_med_h_val, 
                                   y = PCTSINGLES)) +
                    theme_minimal()
  
  ggarrange(pct_bach_plot, nbelpov_plot, pct_vac_plot, pct_sing_plot)

3.3.2 Histogram of Standardized Residuals

Below, the histogram of standardized residuals confirms that they are normally distributed, so our second regression assumption is met.

#join lm_df back to reg_data to map stand_resids
#caution: joining on a continuous, non-unique key (ln_med_h_val) can duplicate rows when
#block groups share the same value; binding the columns by row order (fitted() and
#rstandard() preserve the row order of reg_data) would be a safer alternative
reg_data = left_join(reg_data, lm_df, by = "ln_med_h_val")

ggplot(reg_data) +
  geom_histogram(aes(x = stand_resids)) +
  theme_minimal()

3.3.3 Standardized Residual by Predicted Value Scatter Plot

This scatter plot of the standardized residuals by predicted value shows heteroscedasticity: as predicted values increase, the variance of the residuals increases as well. Since the regression model assumes homoscedasticity—that is, that the variance of the residuals does not change as predicted values change—our third assumption is not met.

ggplot(lm_df) +
  geom_point(aes(x = pred_vals, y = stand_resids)) +
  theme_minimal()

3.3.4 Histogram & Choropleth of SRRs


Together with the histogram of standardized residuals in Section 3.3.2, the choropleth map below lets us check whether the residuals are random. The map shows systematic spatial patterns in the residuals of our regression—that is, the residuals appear to be spatially autocorrelated rather than random.

tm_shape(reg_data) + 
  tm_polygons(col = "stand_resids", border.col = NA, border.alpha = 0.1, lwd = 0, palette = "Blues", style = "jenks") + 
  tm_layout(legend.position = c("right", "bottom"))

3.4 Additional Models

In this section, we performed stepwise regression and k-fold cross validation. The stepwise regression allows us to analyze the significance of all four of our predictor variables. K-fold cross validation allows us to compare the root mean square error (RMSE) of our original model with four predictors to a new model with only two predictors.

3.4.1 Stepwise Regression

The stepwise regression retains all four predictor variables: removing any one of them would increase the AIC, and each has a p-value below 0.05 in the fitted model. This suggests that all four predictors are significant, so we keep them in our final model. Note that the poverty predictor is in its log-transformed form.

stargazer(stepAIC(lm), type = "text", title = "StepAIC", align = TRUE)
## Start:  AIC=-3448.07
## ln_med_h_val ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          230.34 -3448.1
## - PCTSINGLES        1     2.407 232.75 -3432.2
## - ln_n_bel_pov_100  1    11.692 242.04 -3364.9
## - PCTVACANT         1    51.546 281.89 -3102.7
## - PCTBACHMOR        1   199.020 429.36 -2379.0
## 
## StepAIC
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            ln_med_h_val        
## -----------------------------------------------
## PCTVACANT                    -0.019***         
##                               (0.001)          
##                                                
## PCTSINGLES                   0.003***          
##                               (0.001)          
##                                                
## PCTBACHMOR                   0.021***          
##                               (0.001)          
##                                                
## ln_n_bel_pov_100             -0.079***         
##                               (0.008)          
##                                                
## Constant                     11.114***         
##                               (0.047)          
##                                                
## -----------------------------------------------
## Observations                   1,720           
## R2                             0.662           
## Adjusted R2                    0.662           
## Residual Std. Error      0.366 (df = 1715)     
## F Statistic          840.857*** (df = 4; 1715) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
stargazer(anova(lm), type = "text", title = "Analysis of Variance", align = TRUE)
## 
## Analysis of Variance
## =============================================
## Statistic N  Mean   St. Dev.  Min      Max   
## ---------------------------------------------
## Df        5 343.800 766.524    1      1,715  
## Sum Sq    5 136.418 110.193  11.692  235.118 
## Mean Sq   5 90.376  109.227  0.134   235.118 
## F value   4 840.857 832.891  87.054 1,750.551
## Pr(> F)   4  0.000   0.000     0        0    
## ---------------------------------------------

3.4.2 K-Fold Cross-Validation

We performed cross-validation on both the original model with 4 predictors and a new model with only 2 predictors: percent of vacant units and median household income. By comparing the root mean squared error (RMSE) of two cross-validations, we can see that the original model has a smaller RMSE and is therefore the better of the two models.

#----
#IGNORING THIS BC I RAN INTO ISSUES W THIS PACKAGE

#RMSE for full model
#cvlm_data = reg_data |>
#              st_drop_geometry() |>
#              dplyr::select(MEDHVAL,
#                            PCTVACANT,
#                            PCTSINGLES,
#                            PCTBACHMOR,
#                            ln_n_bel_pov_100)

#CVlm(data = cvlm_data, form.lm = lm, m = 5)

#class(lm)

#CVlm(reg_data, form.lm = lm, m =5)

#RMSE for model with only PCTVACANT and MEDHHINC
#----


#running into some weird errors with the DAAG cv.lm function
#trying a different one

#rmse for full model
lm_ii = trainControl(method = "cv", number = 5)

cvlm_model = train(ln_med_h_val ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100, data = reg_data, method = "lm", trControl = lm_ii)

stargazer(print(cvlm_model), type = "text", title = "StepAIC", align = TRUE)
## Linear Regression 
## 
## 4844 samples
##    5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3875, 3877, 3874, 3875, 3875 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3471292  0.5932863  0.2622488
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## StepAIC
## ===============================
##     RMSE    Rsquared     MAE   
## -------------------------------
##   0.3471292 0.5932863 0.2622488
## -------------------------------
#rmse for reduced model (just PCTVACANT and MEDHHINC)
lm_ii_reduced = trainControl(method = "cv", number = 5)

cvlm_model_reduced = train(ln_med_h_val ~ PCTVACANT + MEDHHINC, data = reg_data, method = "lm", trControl = lm_ii_reduced)

stargazer(print(cvlm_model_reduced), type = "text", title = "StepAIC", align = TRUE)
## Linear Regression 
## 
## 4844 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3875, 3874, 3877, 3875, 3875 
## Resampling results:
## 
##   RMSE       Rsquared  MAE      
##   0.4086802  0.435782  0.2914365
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## 
## StepAIC
## ==============================
##     RMSE    Rsquared    MAE   
## ------------------------------
##   0.4086802 0.435782 0.2914365
## ------------------------------

4 Discussion and Limitations

4.1 Recap

When constructing our regression model, we log transformed the dependent variable (median house value) and all four predictors. The log-transformed versions of median house value and of the number of households in poverty were approximately normally distributed, so we used those in the regression; the log-transformed versions of the other three predictors were zero-inflated, so we kept those predictors in their original form. Additionally, other than the vacancy rate, the predictor variables did not exhibit clearly linear relationships with the log of median house value. This indicates that we should consider a different regression model, such as a polynomial specification.

According to the summary statistics of our regression model, the p-values associated with each independent variable are less than 0.01. This means that, for each predictor, we can reject the null hypothesis that it is not a significant predictor of the dependent variable. When we look at the sign and magnitude of \(\hat{β}\) for each predictor, the percent of the population with a bachelor’s degree or higher has the largest per-unit association with house values. Specifically, a one-unit change in percent vacancy (\(\hat{β} = -0.019\)), percent single housing units (\(\hat{β} = 0.003\)), and percent bachelor’s degree or higher (\(\hat{β} = 0.021\)) corresponds to a change in median house value of -1.88% (\((e^{-0.019}-1)\times100\)), 0.3% (\((e^{0.003}-1)\times100\)), and 2.12% (\((e^{0.021}-1)\times100\)), respectively; and when the number of households below the poverty line (\(\hat{β} = -0.079\)) changes by 1%, median house value changes by about -0.08% (\((1.01^{-0.079}-1)\times100\)).
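The percentage figures above come directly from exponentiating the coefficients; a quick sketch of the arithmetic (coefficient values copied from the regression table in Section 3.2.1):

# sketch: converting the log-linear coefficients into percent changes in median house value
betas = c(PCTVACANT = -0.019, PCTSINGLES = 0.003, PCTBACHMOR = 0.021)
round((exp(betas) - 1) * 100, 2)     # % change per one-unit increase in each predictor

b_pov = -0.079                       # coefficient on ln_n_bel_pov_100
round((1.01^b_pov - 1) * 100, 2)     # % change per 1% increase in households in poverty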

4.2 Quality of Model

To evaluate our model, we can first conduct the Model Utility Test (F-test / F-ratio) to look at the overall model, then use stepwise regression to look at each predictor. As shown in the regression results in section 3.2, we have an F-ratio of 840.857 (called “F Statistic” in the table) that is statistically significant (p < 0.01). A significant F-ratio lets us reject the null hypothesis that all coefficients are jointly zero and suggests that at least one of the predictors’ coefficients is not zero. This is the important first step in showing that our model has at least some explanatory power.

Referring back to section 3.4.1, where we conducted the stepwise regression, we observe significance for all four predictors (with PCTSINGLES, PCTVACANT, and PCTBACHMOR in their original form and the poverty variable in log-transformed form). Therefore, in the final model, we included all four predictors. In addition, according to the results of the k-fold cross-validation, the RMSE (root mean squared error) is lower for the model with all four predictors than for the two-predictor model.

Given regression results in section 3.2 and additional verification in section 3.4, we suggest that overall this is a solid model for the purpose of social science research (\(R^2\) value of 0.662 is considered high in this field.)

Indeed, given the original hedonic model referenced in our introduction, there are many more possible predictors. One addition that might improve the model, motivated by bid-rent theory and the “location, location, location” cliché, is an accessibility measure such as the distance from downtown to the centroid of each block group. Other significant predictors may fall into the categories of “structural characteristics,” “contract conditions,” or “time trend.”
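If we were to add such an accessibility measure, the sketch below outlines one way to do it with sf; the City Hall coordinates are approximate and CTR_DIST is an illustrative new column name, not part of the original dataset:

# sketch: distance from each block group centroid to Center City
# (City Hall coordinates are approximate; CTR_DIST is an illustrative new column)
city_hall = st_sfc(st_point(c(-75.1636, 39.9524)), crs = 4326) |>
              st_transform(st_crs(reg_data))

centroids = st_centroid(st_geometry(reg_data))
reg_data$CTR_DIST = as.numeric(st_distance(centroids, city_hall))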

4.3 Limitations of Model

As we saw in Section 3.3, a number of our regression assumptions were violated. First, the relationships between our dependent variable and the predictor variables were generally not linear; based on the scatterplots, only PCTVACANT seems to have a vaguely linear relationship to ln_med_h_val. Additionally, the scatterplot of predicted values compared to standardized residuals shows heteroscedasticity. Finally, our map of the standardized residuals shows signs of spatial autocorrelation, indicating systematic patterns in our data. All three of these issues undermine key assumptions of linear regression.

An additional issue with the model is that we used the raw number of households below the poverty line in each block group. Because this predictor is a count rather than a rate, it is spatially extensive: it depends on the size (total number of households) of the block group. This introduces statistical bias related to the modifiable areal unit problem (MAUP). Because we cannot divide the area of analysis into equally sized spatial units, our best option is probably to use the poverty rate, which is spatially intensive, rather than the raw number of households in poverty.
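A sketch of that adjustment is below; TOTHHS is a hypothetical name for a total-households column and would need to be replaced with the actual field in the shapefile:

# sketch: replacing the poverty count with a spatially intensive rate
# (TOTHHS is a hypothetical total-households column name)
reg_data = reg_data |>
             mutate(pct_bel_pov = 100 * NBelPov100 / TOTHHS)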


  1. Stephen Malpezzi, Larry Ozanne, and Thomas Thibodeau, Characteristic Prices of Housing in Fifty-Nine Metropolitan Areas (Washington, D.C.: Department of Housing and Urban Development, 1980).↩︎

  2. Stephen Malpezzi, Gregory H. Chun, and Richard K. Green, “New Place-to-Place Housing Price Indexes for U.S. Metropolitan Areas, and Their Determinants,” Real Estate Economics 26, No. 2 (1998): 235-274. https://doi-org.proxy.library.upenn.edu/10.1111/1540-6229.00745↩︎

  3. Malpezzi, Chun, and Green, “New Place-to-Place Housing Price Indexes,” 238.↩︎