Summary

Zillow’s Market Health Index (MHI) indicates the current health of a given region’s housing market relative to other markets nationwide. It is calculated on a scale from 0 to 10, with 0 representing the least healthy markets and 10 the healthiest markets. I created a model to see how well I could predict MHI using U.S. Census Demographics from the American Community Survey. The model provides some insight into the census variables that are most influential in explaning MHI, but the model does not conform as well as hoped to the assumptions required by ordinary least squares regression. The variation of MHI explained by the final model is around 10%. There may be better techniques than ordinary least squares regression for data of this nature. Adding Core Based Statistical Area as a factor variable increases R^2 to well over 50%, suggesting metropolitan location is a very strong predictor in and of itself.

Methods

Zillow publishes a variety of interesting statistics on home sales and home values. In 2013, Zillow rolled out its Zillow Market Health Index (MHI)1. MHI provides a single number ranging from 0 to 10 (where 0=least healthy and 10=most healthy) that is intended to measure overall health of a given geographical area. MHI is a composite of many home sales / home value data points including:

Zillow states that the MHI is not intended for ratio-scale interpretations (e.g., We can’t say San Jose is twice as healthy as Phoenix, even though on has an MHI of 9.26 while the other is 4.55), although they do suggest it can be used for ordinal purposes (“By design, the index captures current vigor relative to other markets.”).

MHI is is reported for all geography levels including State, Metropolitan Area, County, City, Zip Code, and neighborhood. Zillow updates its metrics monthly and the file I used was downloaded in August 2017.

The United States Census Bureau (Census) publishes an Annual Community Survey (ACS)2, an ongoing survey that provides vital information on a yearly basis about our nation and its people. Information from the survey generates data that help determine how more than $675 billion in federal and state funds are distributed each year.

The ACS has four basic questionnaires covering the following topics:

I downloaded the latest data available in November 2017, and this coincided with the data series that span 2011-2015 (2016 data became available during the interim time period, but given the amount of time needed for data preparation, I stuck with the 2015 series). For my import file, I merged the MHI file (n=13,498 zip codes) with the Census data files (n=33,120 zip codes). I then isolated 23 census variables with MHI and discarded 18 incomplete observations, with a final n=13,480. The variables I started with included the following:

The final model ends up being

\(Y_i = \) the Zillow Market Health Index Average in the \(i^{th}\) zip code

\(X_{i1} =\) the Percent of Married Households in the \(i^{th}\) zip code

\(X_{i2} = \) the Percent of Unmoved Residents in the past year for residents age 1+ in the \(i^{th}\) zip code

\(X_{i3} =\) the log of the Percent of Employed People who Work from Home in the \(i^{th}\) zip code

\(X_{i4} = \) the log of the Percent of Vacant Homes in the \(i^{th}\) zip code

\(X_{i5} =\) The percent of Single Family Detached homes in the \(i^{th}\) zip code

\(X_{i6} =\) The percent of Owner Occupied Homes in the \(i^{th}\) zip code

\(X_{i7} =\) The percent of White residents in the \(i^{th}\) zip code

\(X_{i8} =\) The log of the percent of Black residents in the \(i^{th}\) zip code

The final model is given by \[Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \beta_4X_{i4} + \beta_5X_{i5} + \beta_6X_{i6} + \beta_7X_{i7} + \beta_8X_{i8} + \varepsilon_i\]

where \(\varepsilon_i \sim iidN(0,\sigma^2)\), \(i = 1, 2, . . . , n\), and \(\beta_0, \beta_1, . . . , \beta_8,\) and \(\sigma^2\) are the unknown model parameters.

An interactive datatable with the untransformed final model variables

rm(list=ls())
zillow <- read.csv(
"C:/Users/Zach/Documents/Stat 840 Regression/Final project/Import3.csv", 
sep=",", header=TRUE)
zillow$zip <- clean.zipcodes(zillow$zip)
newdata <- zillow[c(1,2,4,5,9,15,17,19,21,24,25)]
library(DT)
datatable(newdata)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html

Detailed Analysis

I started out by dividing my sample into two randomly assigned, equal size dataframes, then checking the overall fit on the training dataset. R^2 was 0.11.

smp_size <- floor(0.5 * nrow(zillow))
set.seed(123)
train_ind <- sample(seq_len(nrow(zillow)), size = smp_size)
ztrain <- zillow[train_ind, ]
ztest <- zillow[-train_ind, ]

fit.1 <- lm(mhi~married+nonfam+senior+graddeg+unmoved+foreign+emp+drivers+
              pubtran+walkers+wfh+income+hvacant+rvacant+singunit+multunit+
              ownocc+rentocc+age+white+black+asian+hisp, data=ztrain)

summary(fit.1)
## 
## Call:
## lm(formula = mhi ~ married + nonfam + senior + graddeg + unmoved + 
##     foreign + emp + drivers + pubtran + walkers + wfh + income + 
##     hvacant + rvacant + singunit + multunit + ownocc + rentocc + 
##     age + white + black + asian + hisp, data = ztrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7186 -2.2694  0.0129  2.2619  8.1495 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.318e+03  1.226e+03   1.075  0.28240    
## married      6.864e-02  9.286e-03   7.392 1.62e-13 ***
## nonfam       4.729e-02  9.964e-03   4.746 2.11e-06 ***
## senior      -1.404e-02  8.180e-03  -1.717  0.08602 .  
## graddeg      1.221e-02  7.982e-03   1.530  0.12603    
## unmoved     -5.080e-02  8.362e-03  -6.075 1.30e-09 ***
## foreign     -4.026e-02  8.408e-03  -4.788 1.72e-06 ***
## emp          4.841e-04  6.595e-03   0.073  0.94149    
## drivers     -6.009e-02  9.187e-03  -6.541 6.58e-11 ***
## pubtran     -5.910e-02  1.115e-02  -5.299 1.20e-07 ***
## walkers     -8.813e-02  1.531e-02  -5.755 9.03e-09 ***
## wfh          8.672e-02  1.610e-02   5.386 7.46e-08 ***
## income      -2.440e-05  5.799e-06  -4.208 2.61e-05 ***
## hvacant     -1.108e-02  4.386e-03  -2.527  0.01153 *  
## rvacant     -7.415e-03  5.788e-03  -1.281  0.20022    
## singunit     1.464e-02  3.148e-03   4.650 3.38e-06 ***
## multunit     9.406e-03  5.458e-03   1.723  0.08486 .  
## ownocc      -1.310e+01  1.226e+01  -1.068  0.28555    
## rentocc     -1.307e+01  1.226e+01  -1.066  0.28641    
## age          1.812e-02  1.147e-02   1.580  0.11423    
## white       -9.272e-03  6.937e-03  -1.337  0.18141    
## black       -2.088e-02  7.264e-03  -2.875  0.00406 ** 
## asian        5.328e-02  1.200e-02   4.442 9.05e-06 ***
## hisp         2.774e-02  5.122e-03   5.416 6.30e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.738 on 6716 degrees of freedom
## Multiple R-squared:   0.11,  Adjusted R-squared:  0.1069 
## F-statistic: 36.09 on 23 and 6716 DF,  p-value: < 2.2e-16

Histograms of the data shown in the appendix lead me to do log transformations on more than half of the variables. The transformed data improves the overall R^2 slightly to 0.1185. A small constant value needs to be added to the log variables because log 0 is undefined.

fit.2 <- lm(mhi~married+nonfam+log(senior+0.1)+log(graddeg+0.1)+unmoved+
            log(foreign+0.1)+emp+drivers+log(pubtran+0.1)+log(walkers+0.1)+
            log(wfh+0.1)+log(income)+log(hvacant+0.1)+log(rvacant+0.1)+
            singunit+log(multunit+0.1)+ownocc+log(rentocc+0.1)+age+white+
            log(black+0.1)+log(asian+0.1)+log(hisp+0.1),data=ztrain)

summary(fit.2)
## 
## Call:
## lm(formula = mhi ~ married + nonfam + log(senior + 0.1) + log(graddeg + 
##     0.1) + unmoved + log(foreign + 0.1) + emp + drivers + log(pubtran + 
##     0.1) + log(walkers + 0.1) + log(wfh + 0.1) + log(income) + 
##     log(hvacant + 0.1) + log(rvacant + 0.1) + singunit + log(multunit + 
##     0.1) + ownocc + log(rentocc + 0.1) + age + white + log(black + 
##     0.1) + log(asian + 0.1) + log(hisp + 0.1), data = ztrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8227 -2.2740 -0.0021  2.2290  8.0888 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         16.212416   2.169201   7.474 8.77e-14 ***
## married              0.062160   0.009273   6.703 2.20e-11 ***
## nonfam               0.041988   0.009468   4.435 9.36e-06 ***
## log(senior + 0.1)   -0.236173   0.195576  -1.208 0.227254    
## log(graddeg + 0.1)   0.100002   0.085717   1.167 0.243390    
## unmoved             -0.059111   0.008260  -7.156 9.17e-13 ***
## log(foreign + 0.1)   0.009416   0.056963   0.165 0.868708    
## emp                 -0.003403   0.006752  -0.504 0.614297    
## drivers             -0.027210   0.005285  -5.149 2.70e-07 ***
## log(pubtran + 0.1)  -0.247106   0.031670  -7.803 6.99e-15 ***
## log(walkers + 0.1)  -0.124669   0.037505  -3.324 0.000892 ***
## log(wfh + 0.1)       0.403288   0.055431   7.275 3.85e-13 ***
## log(income)         -0.655313   0.219516  -2.985 0.002844 ** 
## log(hvacant + 0.1)  -0.297357   0.055893  -5.320 1.07e-07 ***
## log(rvacant + 0.1)   0.011679   0.022391   0.522 0.601971    
## singunit             0.016084   0.003002   5.357 8.75e-08 ***
## log(multunit + 0.1)  0.072054   0.031056   2.320 0.020362 *  
## ownocc              -0.017006   0.007677  -2.215 0.026778 *  
## log(rentocc + 0.1)   0.310578   0.165836   1.873 0.061139 .  
## age                  0.003711   0.010741   0.346 0.729701    
## white               -0.021836   0.003166  -6.898 5.75e-12 ***
## log(black + 0.1)    -0.403199   0.029063 -13.873  < 2e-16 ***
## log(asian + 0.1)     0.058208   0.039008   1.492 0.135684    
## log(hisp + 0.1)      0.095462   0.040747   2.343 0.019168 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.724 on 6716 degrees of freedom
## Multiple R-squared:  0.1185, Adjusted R-squared:  0.1155 
## F-statistic: 39.25 on 23 and 6716 DF,  p-value: < 2.2e-16

Keeping only those variables that had a significant beta estimate and a relatively low standard error reduced the number of predictor variables from 23 to 15, and the R^2 stays about the same

fit.3 <- lm(mhi~married+nonfam+unmoved+
            drivers+log(pubtran+0.1)+log(walkers+0.1)+
            log(wfh+0.1)+log(income)+log(hvacant+0.1)+
            singunit+log(multunit+0.1)+ownocc+white+
            log(black+0.1)+log(hisp+0.1),data=ztrain)

summary(fit.3)
## 
## Call:
## lm(formula = mhi ~ married + nonfam + unmoved + drivers + log(pubtran + 
##     0.1) + log(walkers + 0.1) + log(wfh + 0.1) + log(income) + 
##     log(hvacant + 0.1) + singunit + log(multunit + 0.1) + ownocc + 
##     white + log(black + 0.1) + log(hisp + 0.1), data = ztrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9346 -2.2848  0.0159  2.2248  8.2636 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         15.470811   1.390409  11.127  < 2e-16 ***
## married              0.065968   0.008933   7.385 1.71e-13 ***
## nonfam               0.043375   0.009064   4.785 1.74e-06 ***
## unmoved             -0.064285   0.007398  -8.690  < 2e-16 ***
## drivers             -0.025012   0.005136  -4.870 1.14e-06 ***
## log(pubtran + 0.1)  -0.239003   0.030884  -7.739 1.15e-14 ***
## log(walkers + 0.1)  -0.120126   0.037026  -3.244  0.00118 ** 
## log(wfh + 0.1)       0.413980   0.054680   7.571 4.20e-14 ***
## log(income)         -0.455427   0.149140  -3.054  0.00227 ** 
## log(hvacant + 0.1)  -0.299589   0.049281  -6.079 1.27e-09 ***
## singunit             0.016049   0.002983   5.380 7.70e-08 ***
## log(multunit + 0.1)  0.096098   0.028678   3.351  0.00081 ***
## ownocc              -0.028712   0.005205  -5.516 3.59e-08 ***
## white               -0.023079   0.003028  -7.621 2.87e-14 ***
## log(black + 0.1)    -0.392040   0.028364 -13.822  < 2e-16 ***
## log(hisp + 0.1)      0.119602   0.030448   3.928 8.65e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.725 on 6724 degrees of freedom
## Multiple R-squared:  0.1172, Adjusted R-squared:  0.1152 
## F-statistic: 59.49 on 15 and 6724 DF,  p-value: < 2.2e-16

It is worth noting that adding the variable metro which is the name of the Core Based Statistical Area greatly increases the overall model R^2, but unfortunately the output includes coefficients for as many of the 695 CBSA’s as the program will physically output. We can deduce the explanatory value of the “metro” variable by seeing the R^2=0.5581 and AdjR^2=0.5106

fit.4 <- lm(mhi~metro+married+nonfam+unmoved+
            drivers+log(pubtran+0.1)+log(walkers+0.1)+
            log(wfh+0.1)+log(income)+log(hvacant+0.1)+
            singunit+log(multunit+0.1)+ownocc+white+
            log(black+0.1)+log(hisp+0.1),data=ztrain)

summary(fit.4)
## 
## Call:
## lm(formula = mhi ~ metro + married + nonfam + unmoved + drivers + 
##     log(pubtran + 0.1) + log(walkers + 0.1) + log(wfh + 0.1) + 
##     log(income) + log(hvacant + 0.1) + singunit + log(multunit + 
##     0.1) + ownocc + white + log(black + 0.1) + log(hisp + 0.1), 
##     data = ztrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0351 -1.2884 -0.0055  1.2535  7.7989 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                  0.733410   2.371159   0.309
## metroAda, OK                                 3.623575   2.343461   1.546
## metroAdrian, MI                              5.467140   2.127332   2.570
## metroAkron, OH                              -0.288168   2.071949  -0.139
## metroAlbany, GA                             -1.682621   2.872496  -0.586
## metroAlbany, NY                             -1.154302   2.049291  -0.563
## metroAlbany, OR                              5.037729   2.222483   2.267
## metroAlbemarle, NC                           4.433122   2.270107   1.953
## metroAlbuquerque, NM                         3.109278   2.099475   1.481
## metroAlexandria, LA                          1.608877   2.486364   0.647
##                                             Pr(>|t|)    
## (Intercept)                                 0.757100    
## metroAda, OK                                0.122096    
## metroAdrian, MI                             0.010195 *  
## metroAkron, OH                              0.889391    
## metroAlbany, GA                             0.558052    
## metroAlbany, NY                             0.573272    
## metroAlbany, OR                             0.023443 *  
## metroAlbemarle, NC                          0.050886 .  
## metroAlbuquerque, NM                        0.138664    
## metroAlexandria, LA                         0.517604    
##  [ reached getOption("max.print") -- omitted 645 rows ]
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.027 on 6085 degrees of freedom
## Multiple R-squared:  0.5581, Adjusted R-squared:  0.5106 
## F-statistic: 11.75 on 654 and 6085 DF,  p-value: < 2.2e-16

In order to narrow my list of predictor variables, I looked at the scatter plots and the correlations. Linear relationships are difficult to detect using the scatterplots, but some relationships between the predictor variables are evident and somewhat expected - black vs white, ownocc vs rentocc, singunit vs multunit. Unmoved, wfh, black, asian and hisp all have the strongest correlations with the response variable. We can definitely drop either rentocc or ownocc because they are 0.99. Nothing else stands out as being extremely highly correlated- married and nonfam are the highest besides rentocc and ownocc.

pairs(mhi~married+nonfam+unmoved, data=ztrain)

pairs(mhi~drivers+log(pubtran+0.1)+log(walkers+0.1), data=ztrain)

pairs(mhi~log(wfh+0.1)+log(income)+log(hvacant+0.1), data=ztrain) 

pairs(mhi~singunit+log(multunit+0.1)+ownocc, data=ztrain) 

pairs(mhi~white+log(black+0.1)+log(hisp+0.1), data=ztrain) 

cor.zil <- ztrain[c(-1,-2,-3)]
cor(cor.zil)
##                  mhi     married       nonfam       senior      graddeg
## mhi       1.00000000  0.07321272 -0.012059365 -0.052684283  0.071830786
## married   0.07321272  1.00000000 -0.836168305  0.148413312  0.096706576
## nonfam   -0.01205937 -0.83616830  1.000000000 -0.037362689  0.182222085
## senior   -0.05268428  0.14841331 -0.037362689  1.000000000 -0.008339695
## graddeg   0.07183079  0.09670658  0.182222085 -0.008339695  1.000000000
## unmoved  -0.11857888  0.56324172 -0.576131999  0.305865311 -0.128351097
## foreign   0.08031620 -0.18615234  0.060537332 -0.155909138  0.228621810
## emp       0.03045622  0.13941994 -0.088710279 -0.628447821  0.230309910
## drivers  -0.07582520  0.38436393 -0.377071589  0.113369596 -0.308668971
## pubtran  -0.03944913 -0.31443214  0.259784901 -0.142640531  0.300872304
## walkers   0.03335264 -0.38527731  0.498761159 -0.122373923  0.265149436
## wfh       0.15898605  0.21177445 -0.002457092  0.226044353  0.428756345
## income    0.03872994  0.28659576  0.014614659  0.098795488  0.815043397
## hvacant  -0.04649090 -0.18710050  0.227484942  0.371937175 -0.088686514
## rvacant  -0.04603547 -0.11483241  0.113092970  0.154493091 -0.032660190
## singunit  0.01927067  0.66322414 -0.629198718  0.188145076 -0.152366297
## multunit  0.04230514 -0.47282583  0.586785302 -0.156142677  0.351484212
## ownocc   -0.04612724  0.79918371 -0.655327613  0.343357287 -0.018185106
## rentocc   0.04612538 -0.79918193  0.655326652 -0.343354090  0.018184902
## age      -0.05327098  0.30540947 -0.072830240  0.770880655  0.108857739
## white     0.04426115  0.52625513 -0.209508552  0.252454370  0.016873807
## black    -0.15362057 -0.53053806  0.234566707 -0.166192339 -0.116328499
## asian     0.12469935 -0.02584123  0.045414414 -0.144095635  0.364629505
## hisp      0.09974202 -0.18810844 -0.058547182 -0.171752377 -0.136600901
##              unmoved     foreign          emp      drivers     pubtran
## mhi      -0.11857888  0.08031620  0.030456221 -0.075825195 -0.03944913
## married   0.56324172 -0.18615234  0.139419937  0.384363934 -0.31443214
## nonfam   -0.57613200  0.06053733 -0.088710279 -0.377071589  0.25978490
## senior    0.30586531 -0.15590914 -0.628447821  0.113369596 -0.14264053
## graddeg  -0.12835110  0.22862181  0.230309910 -0.308668971  0.30087230
## unmoved   1.00000000 -0.16536490  0.054996288  0.318413524 -0.11940097
## foreign  -0.16536490  1.00000000  0.131965172 -0.481992871  0.47921876
## emp       0.05499629  0.13196517  1.000000000  0.002258872  0.10524608
## drivers   0.31841352 -0.48199287  0.002258872  1.000000000 -0.78604109
## pubtran  -0.11940097  0.47921876  0.105246079 -0.786041089  1.00000000
## walkers  -0.41789628  0.16596012 -0.030783737 -0.649939110  0.35308739
## wfh       0.02624587  0.04151206 -0.032646741 -0.288523163  0.01587239
## income    0.11811067  0.13446902  0.277658359 -0.180251808  0.21722033
## hvacant  -0.05203542 -0.13650974 -0.434901975 -0.089518150 -0.06880266
## rvacant  -0.08196683 -0.05272115 -0.205819021  0.007684059 -0.04735578
## singunit  0.49896661 -0.46952585 -0.003552292  0.545631465 -0.51223930
## multunit -0.45319785  0.44336821  0.090408024 -0.527730888  0.47228243
## ownocc    0.66866626 -0.47634927 -0.005159052  0.530252793 -0.42024562
## rentocc  -0.66866388  0.47635207  0.005158074 -0.530255018  0.42024728
## age       0.45699080 -0.27088361 -0.354344345  0.224665833 -0.18058965
## white     0.29961567 -0.49101957  0.028092861  0.379335021 -0.42454367
## black    -0.21795572  0.08820159 -0.099107029 -0.191114748  0.28036173
## asian    -0.14319871  0.65574989  0.159023700 -0.339971500  0.35514834
## hisp     -0.15103771  0.70557286  0.002832721 -0.269065301  0.17737765
##              walkers          wfh       income     hvacant      rvacant
## mhi       0.03335264  0.158986050  0.038729939 -0.04649090 -0.046035468
## married  -0.38527731  0.211774447  0.286595761 -0.18710050 -0.114832415
## nonfam    0.49876116 -0.002457092  0.014614659  0.22748494  0.113092970
## senior   -0.12237392  0.226044353  0.098795488  0.37193718  0.154493091
## graddeg   0.26514944  0.428756345  0.815043397 -0.08868651 -0.032660190
## unmoved  -0.41789628  0.026245875  0.118110672 -0.05203542 -0.081966832
## foreign   0.16596012  0.041512064  0.134469017 -0.13650974 -0.052721152
## emp      -0.03078374 -0.032646741  0.277658359 -0.43490198 -0.205819021
## drivers  -0.64993911 -0.288523163 -0.180251808 -0.08951815  0.007684059
## pubtran   0.35308739  0.015872391  0.217220333 -0.06880266 -0.047355784
## walkers   1.00000000  0.111432092  0.120065881  0.11566088  0.015815089
## wfh       0.11143209  1.000000000  0.482275183  0.17733877  0.014906959
## income    0.12006588  0.482275183  1.000000000 -0.08933728 -0.058141140
## hvacant   0.11566088  0.177338768 -0.089337275  1.00000000  0.405976259
## rvacant   0.01581509  0.014906959 -0.058141140  0.40597626  1.000000000
## singunit -0.43851058  0.077189405  0.002836292 -0.05329581 -0.097769659
## multunit  0.52018593  0.098273237  0.271341519  0.03362323  0.067849217
## ownocc   -0.44869910  0.139458075  0.201317205 -0.02474489 -0.048031645
## rentocc   0.44869980 -0.139456096 -0.201315955  0.02474579  0.048028435
## age      -0.21439484  0.302508030  0.332688423  0.35002959  0.106932215
## white    -0.12629722  0.182410040  0.179595439  0.02232680 -0.058623374
## black     0.04954477 -0.219761046 -0.251129713  0.08244940  0.114901972
## asian     0.16492451  0.080563788  0.287239110 -0.18708986 -0.085774118
## hisp      0.04592290 -0.082115376 -0.204764271 -0.08729903 -0.017111999
##              singunit    multunit       ownocc      rentocc         age
## mhi       0.019270669  0.04230514 -0.046127241  0.046125383 -0.05327098
## married   0.663224138 -0.47282583  0.799183713 -0.799181930  0.30540947
## nonfam   -0.629198718  0.58678530 -0.655327613  0.655326652 -0.07283024
## senior    0.188145076 -0.15614268  0.343357287 -0.343354090  0.77088066
## graddeg  -0.152366297  0.35148421 -0.018185106  0.018184902  0.10885774
## unmoved   0.498966610 -0.45319785  0.668666264 -0.668663883  0.45699080
## foreign  -0.469525846  0.44336821 -0.476349272  0.476352073 -0.27088361
## emp      -0.003552292  0.09040802 -0.005159052  0.005158074 -0.35434434
## drivers   0.545631465 -0.52773089  0.530252793 -0.530255018  0.22466583
## pubtran  -0.512239301  0.47228243 -0.420245617  0.420247283 -0.18058965
## walkers  -0.438510581  0.52018593 -0.448699103  0.448699800 -0.21439484
## wfh       0.077189405  0.09827324  0.139458075 -0.139456096  0.30250803
## income    0.002836292  0.27134152  0.201317205 -0.201315955  0.33268842
## hvacant  -0.053295805  0.03362323 -0.024744890  0.024745793  0.35002959
## rvacant  -0.097769659  0.06784922 -0.048031645  0.048028435  0.10693222
## singunit  1.000000000 -0.67873233  0.767213984 -0.767214444  0.30217620
## multunit -0.678732332  1.00000000 -0.601655451  0.601655709 -0.16976278
## ownocc    0.767213984 -0.60165545  1.000000000 -0.999999986  0.51888839
## rentocc  -0.767214444  0.60165571 -0.999999986  1.000000000 -0.51888556
## age       0.302176198 -0.16976278  0.518888389 -0.518885557  1.00000000
## white     0.448544983 -0.29157643  0.568917474 -0.568916798  0.41722203
## black    -0.299769885  0.14083219 -0.393201596  0.393200558 -0.28003515
## asian    -0.302682927  0.34643518 -0.271704010  0.271704553 -0.15670950
## hisp     -0.269715105  0.17411667 -0.416483628  0.416484943 -0.37011166
##                white       black       asian         hisp
## mhi       0.04426115 -0.15362057  0.12469935  0.099742016
## married   0.52625513 -0.53053806 -0.02584123 -0.188108441
## nonfam   -0.20950855  0.23456671  0.04541441 -0.058547182
## senior    0.25245437 -0.16619234 -0.14409564 -0.171752377
## graddeg   0.01687381 -0.11632850  0.36462951 -0.136600901
## unmoved   0.29961567 -0.21795572 -0.14319871 -0.151037715
## foreign  -0.49101957  0.08820159  0.65574989  0.705572862
## emp       0.02809286 -0.09910703  0.15902370  0.002832721
## drivers   0.37933502 -0.19111475 -0.33997150 -0.269065301
## pubtran  -0.42454367  0.28036173  0.35514834  0.177377648
## walkers  -0.12629722  0.04954477  0.16492451  0.045922903
## wfh       0.18241004 -0.21976105  0.08056379 -0.082115376
## income    0.17959544 -0.25112971  0.28723911 -0.204764271
## hvacant   0.02232680  0.08244940 -0.18708986 -0.087299034
## rvacant  -0.05862337  0.11490197 -0.08577412 -0.017111999
## singunit  0.44854498 -0.29976989 -0.30268293 -0.269715105
## multunit -0.29157643  0.14083219  0.34643518  0.174116667
## ownocc    0.56891747 -0.39320160 -0.27170401 -0.416483628
## rentocc  -0.56891680  0.39320056  0.27170455  0.416484943
## age       0.41722203 -0.28003515 -0.15670950 -0.370111663
## white     1.00000000 -0.82118013 -0.42334406 -0.350844573
## black    -0.82118013  1.00000000 -0.03028360  0.032912066
## asian    -0.42334406 -0.03028360  1.00000000  0.192484674
## hisp     -0.35084457  0.03291207  0.19248467  1.000000000

Added variable plots, shown in the appendix, give me a lot of good candidates for thinning the number of variables further. Using these as guidance I feel pretty good about removing pubtran, walkers, and multiunit, as they are all fairly flat in appearance, indicating they do not add much variance explanation with all other variables accounted for in the model.

This still leaves me with 12 predictor variables which is too many so I moved into the automatic variable selection techniques. Base on these coutcome I am inclined to keep all eight that were outputted in the exhaustive agorithm, because those would coincide with the lowest reported Cp, highest adjusted R^2, and the lowest BIC.

library(leaps)
ma <- regsubsets(mhi~married+nonfam+unmoved+
                   drivers+
                   log(wfh+0.1)+log(income)+log(hvacant+0.1)+
                   singunit+ownocc+white+
                   log(black+0.1)+log(hisp+0.1),data=ztrain)
(sma <- summary(ma))
## Subset selection object
## Call: regsubsets.formula(mhi ~ married + nonfam + unmoved + drivers + 
##     log(wfh + 0.1) + log(income) + log(hvacant + 0.1) + singunit + 
##     ownocc + white + log(black + 0.1) + log(hisp + 0.1), data = ztrain)
## 12 Variables  (and intercept)
##                    Forced in Forced out
## married                FALSE      FALSE
## nonfam                 FALSE      FALSE
## unmoved                FALSE      FALSE
## drivers                FALSE      FALSE
## log(wfh + 0.1)         FALSE      FALSE
## log(income)            FALSE      FALSE
## log(hvacant + 0.1)     FALSE      FALSE
## singunit               FALSE      FALSE
## ownocc                 FALSE      FALSE
## white                  FALSE      FALSE
## log(black + 0.1)       FALSE      FALSE
## log(hisp + 0.1)        FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          married nonfam unmoved drivers log(wfh + 0.1) log(income)
## 1  ( 1 ) " "     " "    " "     " "     "*"            " "        
## 2  ( 1 ) " "     " "    "*"     " "     " "            " "        
## 3  ( 1 ) " "     " "    "*"     " "     "*"            " "        
## 4  ( 1 ) " "     " "    "*"     " "     "*"            " "        
## 5  ( 1 ) "*"     " "    "*"     " "     "*"            " "        
## 6  ( 1 ) "*"     " "    "*"     " "     "*"            " "        
## 7  ( 1 ) "*"     " "    "*"     " "     "*"            " "        
## 8  ( 1 ) "*"     " "    "*"     " "     "*"            " "        
##          log(hvacant + 0.1) singunit ownocc white log(black + 0.1)
## 1  ( 1 ) " "                " "      " "    " "   " "             
## 2  ( 1 ) " "                " "      " "    " "   "*"             
## 3  ( 1 ) " "                " "      " "    " "   "*"             
## 4  ( 1 ) "*"                " "      " "    " "   "*"             
## 5  ( 1 ) " "                " "      " "    "*"   "*"             
## 6  ( 1 ) " "                " "      "*"    "*"   "*"             
## 7  ( 1 ) " "                "*"      "*"    "*"   "*"             
## 8  ( 1 ) "*"                "*"      "*"    "*"   "*"             
##          log(hisp + 0.1)
## 1  ( 1 ) " "            
## 2  ( 1 ) " "            
## 3  ( 1 ) " "            
## 4  ( 1 ) " "            
## 5  ( 1 ) " "            
## 6  ( 1 ) " "            
## 7  ( 1 ) " "            
## 8  ( 1 ) " "
sma$cp 
## [1] 609.54982 362.03800 252.43255 198.07426 127.29664  85.04244  65.43556
## [8]  50.17375
sma$bic
## [1] -156.0411 -380.1466 -478.1790 -523.9697 -586.3304 -621.1655 -633.7652
## [8] -642.1073
sma$adjr2
## [1] 0.02529368 0.05826250 0.07293640 0.08028046 0.08980673 0.09554815
## [7] 0.09828345 0.10044239
fit.5 <- lm(mhi~married+unmoved+
            log(wfh+0.1)+log(hvacant+0.1)+
            singunit+ownocc+white+log(black+0.1),data=ztrain)

summary(fit.5)
## 
## Call:
## lm(formula = mhi ~ married + unmoved + log(wfh + 0.1) + log(hvacant + 
##     0.1) + singunit + ownocc + white + log(black + 0.1), data = ztrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5611 -2.3032  0.0112  2.2827  8.3464 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        13.172125   0.595675  22.113  < 2e-16 ***
## married             0.039820   0.004917   8.098 6.55e-16 ***
## unmoved            -0.081620   0.007086 -11.519  < 2e-16 ***
## log(wfh + 0.1)      0.447780   0.047502   9.427  < 2e-16 ***
## log(hvacant + 0.1) -0.193967   0.046828  -4.142 3.48e-05 ***
## singunit            0.012711   0.002706   4.697 2.69e-06 ***
## ownocc             -0.034753   0.004680  -7.425 1.27e-13 ***
## white              -0.016830   0.002640  -6.375 1.95e-10 ***
## log(black + 0.1)   -0.401159   0.027548 -14.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.748 on 6731 degrees of freedom
## Multiple R-squared:  0.1015, Adjusted R-squared:  0.1004 
## F-statistic: 95.06 on 8 and 6731 DF,  p-value: < 2.2e-16

Owner occupied is a little high on the VIF measure of multicollinearity, but we will keep it for now. Checking normality and some standard residual plots, we can see there is some issues with heavy tails. There is an obvious down slope of the studentized residuals as the fitted values increase, indicating model bias and lurking variable(s). The Breusch Pagan test for heteroskedasticity does suggest adquately equal variance. A sequence plot of the residuals uncovers nothing suspicious about the independence of the error terms. The output from the GVLMA is troubling - stating that the assumptions have not been satisfied for the ‘Global Stat’, the ‘Kurtosis’ or the ‘Link Function’.

vif(fit.5)
##            married            unmoved     log(wfh + 0.1) 
##           3.508731           1.925396           1.084238 
## log(hvacant + 0.1)           singunit             ownocc 
##           1.162303           2.504836           5.165443 
##              white   log(black + 0.1) 
##           2.441918           2.158622
plot(rstandard(fit.5)~predict(fit.5), xlab = expression(hat(Y)), 
     ylab = "Studentized Deleted Residuals")
abline(h=0)

ncvTest(fit.5)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.161995    Df = 1     p = 0.1414609
plot(residuals(fit.5),type="l",ylab=expression(e[i]),
     main="Sequence Plot of Residuals")
points(residuals(fit.5),pch=16,col="darkgray")
abline(0,0,lty=2)

library(gvlma)
gvmodel <- gvlma(fit.5) 
summary(gvmodel)
## 
## Call:
## lm(formula = mhi ~ married + unmoved + log(wfh + 0.1) + log(hvacant + 
##     0.1) + singunit + ownocc + white + log(black + 0.1), data = ztrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5611 -2.3032  0.0112  2.2827  8.3464 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        13.172125   0.595675  22.113  < 2e-16 ***
## married             0.039820   0.004917   8.098 6.55e-16 ***
## unmoved            -0.081620   0.007086 -11.519  < 2e-16 ***
## log(wfh + 0.1)      0.447780   0.047502   9.427  < 2e-16 ***
## log(hvacant + 0.1) -0.193967   0.046828  -4.142 3.48e-05 ***
## singunit            0.012711   0.002706   4.697 2.69e-06 ***
## ownocc             -0.034753   0.004680  -7.425 1.27e-13 ***
## white              -0.016830   0.002640  -6.375 1.95e-10 ***
## log(black + 0.1)   -0.401159   0.027548 -14.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.748 on 6731 degrees of freedom
## Multiple R-squared:  0.1015, Adjusted R-squared:  0.1004 
## F-statistic: 95.06 on 8 and 6731 DF,  p-value: < 2.2e-16
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = fit.5) 
## 
##                        Value   p-value                   Decision
## Global Stat        287.82367 0.000e+00 Assumptions NOT satisfied!
## Skewness             0.05285 8.182e-01    Assumptions acceptable.
## Kurtosis           258.39266 0.000e+00 Assumptions NOT satisfied!
## Link Function       29.16469 6.648e-08 Assumptions NOT satisfied!
## Heteroscedasticity   0.21348 6.441e-01    Assumptions acceptable.

There are many many outliers and high leverage points. The observation with the greatest Cook’s Distance by far is#2022, which corresponds to zip code 12570 in the NYC metro area.

outlierTest(fit.5)
## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferonni p
## 6164 3.048528          0.0023086           NA
summary(influence.measures(fit.5))
## Potentially influential observations of
##   lm(formula = mhi ~ married + unmoved + log(wfh + 0.1) + log(hvacant +      0.1) + singunit + ownocc + white + log(black + 0.1), data = ztrain) :
## 
##       dfb.1_ dfb.mrrd dfb.unmv dfb.lg(w+0.1) dfb.lg(h+0.1) dfb.sngn
## 11974 -0.01  -0.01     0.00     0.00          0.00          0.01   
## 1920  -0.02   0.04     0.00     0.01          0.13         -0.01   
## 1637   0.06  -0.10    -0.05     0.00         -0.03          0.01   
##       dfb.ownc dfb.whit dfb.lg(b+0.1) dffit   cov.r   cook.d hat    
## 11974  0.01     0.03     0.02         -0.04    1.00_*  0.00   0.00  
## 1920  -0.02    -0.01     0.01         -0.14_*  1.00    0.00   0.01_*
## 1637   0.10    -0.04    -0.05         -0.13_*  1.00_*  0.00   0.01_*
##  [ reached getOption("max.print") -- omitted 414 rows ]
leveragePlots(fit.5)

m2i <- influence(fit.5)
halfnorm(cooks.distance(fit.5))

even though I am unlikely to remove an observation, we can check to see how removing obs #2022 impacts the model. As expected, this model is robust. Taking away a single observation from a sample this large has almost no impact.

summary(lm(mhi~married+unmoved+
            log(wfh+0.1)+log(hvacant+0.1)+
            singunit+ownocc+white+log(black+0.1),data=ztrain))
## 
## Call:
## lm(formula = mhi ~ married + unmoved + log(wfh + 0.1) + log(hvacant + 
##     0.1) + singunit + ownocc + white + log(black + 0.1), data = ztrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5611 -2.3032  0.0112  2.2827  8.3464 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        13.172125   0.595675  22.113  < 2e-16 ***
## married             0.039820   0.004917   8.098 6.55e-16 ***
## unmoved            -0.081620   0.007086 -11.519  < 2e-16 ***
## log(wfh + 0.1)      0.447780   0.047502   9.427  < 2e-16 ***
## log(hvacant + 0.1) -0.193967   0.046828  -4.142 3.48e-05 ***
## singunit            0.012711   0.002706   4.697 2.69e-06 ***
## ownocc             -0.034753   0.004680  -7.425 1.27e-13 ***
## white              -0.016830   0.002640  -6.375 1.95e-10 ***
## log(black + 0.1)   -0.401159   0.027548 -14.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.748 on 6731 degrees of freedom
## Multiple R-squared:  0.1015, Adjusted R-squared:  0.1004 
## F-statistic: 95.06 on 8 and 6731 DF,  p-value: < 2.2e-16
summary(lm(mhi~married+unmoved+
            log(wfh+0.1)+log(hvacant+0.1)+
            singunit+ownocc+white+log(black+0.1),data=ztrain, subset=-2022))
## 
## Call:
## lm(formula = mhi ~ married + unmoved + log(wfh + 0.1) + log(hvacant + 
##     0.1) + singunit + ownocc + white + log(black + 0.1), data = ztrain, 
##     subset = -2022)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5687 -2.3007  0.0142  2.2831  8.3322 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        13.291009   0.597643  22.239  < 2e-16 ***
## married             0.039528   0.004917   8.039 1.06e-15 ***
## unmoved            -0.082670   0.007097 -11.648  < 2e-16 ***
## log(wfh + 0.1)      0.440222   0.047596   9.249  < 2e-16 ***
## log(hvacant + 0.1) -0.202885   0.046968  -4.320 1.59e-05 ***
## singunit            0.012647   0.002705   4.675 3.00e-06 ***
## ownocc             -0.034339   0.004682  -7.334 2.50e-13 ***
## white              -0.016881   0.002639  -6.396 1.70e-10 ***
## log(black + 0.1)   -0.403735   0.027561 -14.649  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.747 on 6730 degrees of freedom
## Multiple R-squared:  0.1019, Adjusted R-squared:  0.1008 
## F-statistic: 95.43 on 8 and 6730 DF,  p-value: < 2.2e-16

We want to see how good a job our model does in describing a different dataset from the same population, so we will run the model using the test data ztest which was created earlier as the complement to our training data ztrain, and then comparing the two in terms of mean squares,coefficients of determinations. The output is very similar for the two models. One notable exception is singunit, which has a significant F test value for the test data but not using the training data.

fit.6 <- lm(mhi~married+unmoved+
            log(wfh+0.1)+log(hvacant+0.1)+
            singunit+ownocc+white+log(black+0.1),data=ztest)
summary(fit.6)
## 
## Call:
## lm(formula = mhi ~ married + unmoved + log(wfh + 0.1) + log(hvacant + 
##     0.1) + singunit + ownocc + white + log(black + 0.1), data = ztest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0889 -2.2969  0.0051  2.2820  8.0813 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        11.731738   0.598656  19.597  < 2e-16 ***
## married             0.043410   0.004871   8.911  < 2e-16 ***
## unmoved            -0.068083   0.006971  -9.766  < 2e-16 ***
## log(wfh + 0.1)      0.485324   0.048407  10.026  < 2e-16 ***
## log(hvacant + 0.1) -0.171502   0.046026  -3.726 0.000196 ***
## singunit            0.020412   0.002679   7.619 2.90e-14 ***
## ownocc             -0.046051   0.004676  -9.849  < 2e-16 ***
## white              -0.013933   0.002570  -5.421 6.15e-08 ***
## log(black + 0.1)   -0.382667   0.027812 -13.759  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.732 on 6731 degrees of freedom
## Multiple R-squared:  0.1018, Adjusted R-squared:  0.1007 
## F-statistic: 95.36 on 8 and 6731 DF,  p-value: < 2.2e-16
summary(fit.5)
## 
## Call:
## lm(formula = mhi ~ married + unmoved + log(wfh + 0.1) + log(hvacant + 
##     0.1) + singunit + ownocc + white + log(black + 0.1), data = ztrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5611 -2.3032  0.0112  2.2827  8.3464 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        13.172125   0.595675  22.113  < 2e-16 ***
## married             0.039820   0.004917   8.098 6.55e-16 ***
## unmoved            -0.081620   0.007086 -11.519  < 2e-16 ***
## log(wfh + 0.1)      0.447780   0.047502   9.427  < 2e-16 ***
## log(hvacant + 0.1) -0.193967   0.046828  -4.142 3.48e-05 ***
## singunit            0.012711   0.002706   4.697 2.69e-06 ***
## ownocc             -0.034753   0.004680  -7.425 1.27e-13 ***
## white              -0.016830   0.002640  -6.375 1.95e-10 ***
## log(black + 0.1)   -0.401159   0.027548 -14.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.748 on 6731 degrees of freedom
## Multiple R-squared:  0.1015, Adjusted R-squared:  0.1004 
## F-statistic: 95.06 on 8 and 6731 DF,  p-value: < 2.2e-16
anova(fit.6)
## Analysis of Variance Table
## 
## Response: mhi
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## married               1    539  539.14  72.2139 < 2.2e-16 ***
## unmoved               1   1625 1624.61 217.6038 < 2.2e-16 ***
## log(wfh + 0.1)        1    871  871.46 116.7249 < 2.2e-16 ***
## log(hvacant + 0.1)    1    126  126.38  16.9277 3.929e-05 ***
## singunit              1     62   61.63   8.2547 0.0040775 ** 
## ownocc                1    967  966.67 129.4770 < 2.2e-16 ***
## white                 1     93   92.50  12.3899 0.0004345 ***
## log(black + 0.1)      1   1413 1413.39 189.3128 < 2.2e-16 ***
## Residuals          6731  50253    7.47                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit.5)
## Analysis of Variance Table
## 
## Response: mhi
##                      Df Sum Sq Mean Sq  F value    Pr(>F)    
## married               1    303  303.12  40.1550 2.497e-10 ***
## unmoved               1   2116 2115.50 280.2439 < 2.2e-16 ***
## log(wfh + 0.1)        1    830  829.71 109.9126 < 2.2e-16 ***
## log(hvacant + 0.1)    1    151  151.45  20.0623 7.620e-06 ***
## singunit              1     15   14.81   1.9615  0.161395    
## ownocc                1    660  659.67  87.3873 < 2.2e-16 ***
## white                 1     66   65.54   8.6815  0.003226 ** 
## log(black + 0.1)      1   1601 1600.76 212.0560 < 2.2e-16 ***
## Residuals          6731  50811    7.55                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(fit.6)
##                          2.5 %       97.5 %
## (Intercept)        10.55818283 12.905293219
## married             0.03386090  0.052959278
## unmoved            -0.08174901 -0.054417536
## log(wfh + 0.1)      0.39043034  0.580217008
## log(hvacant + 0.1) -0.26172671 -0.081277299
## singunit            0.01516014  0.025663426
## ownocc             -0.05521688 -0.036885872
## white              -0.01897134 -0.008893934
## log(black + 0.1)   -0.43718698 -0.328146675
confint(fit.5)
##                           2.5 %      97.5 %
## (Intercept)        12.004413921 14.33983553
## married             0.030181159  0.04945882
## unmoved            -0.095510005 -0.06773023
## log(wfh + 0.1)      0.354661623  0.54089904
## log(hvacant + 0.1) -0.285765784 -0.10216872
## singunit            0.007406171  0.01801610
## ownocc             -0.043927972 -0.02557759
## white              -0.022005271 -0.01165460
## log(black + 0.1)   -0.455161552 -0.34715574

Conclusion & Discussion

There appears to be some predictive power in using census data to model MHI. Increases in the % married, % wfh, and the % single unit detached homes in a zip code correspond with increases in MHI, on average. For every log(0.485324) (i.e., 1.62%) increase in the percentage of people working from home, the MHI goes up 1 index point, or 10%. On the other hand, a decrease in the population that self-reports as being Black (one race only) of log(-0.401159) or -0.68% also increase MHI by 1 index point, or 10%. These demographic variables could add to the accuracy of MHI.

While this model leaves a great deal of unexplained variance in MHI, it is significantly better than no model at all. The model R^2 would be vastly improved if we were to incorporate the factor variable for the associated metropolitan area. Cities like Denver and Seattle tend to have many high MHI zip codes whereas Cleveland and St. Louis may have relatively few. It would be interesting to control for this variability and then use the model for zip code level predictions.

The model and the data have several problems that may be impacting the validity of the findings. First and foremost, this is a purely observational study and no experimental controls have been put in place. Given that Zillow only has the MHI for 13,498 zip codes and the Census shows that there are 33,120 - we are missing data for about 60% of the zip codes, so it is somewhat of a convenience sample. I will note that even though Zillow’s MHI is only reported for 40% of the zip codes, those 13,498 zip codes contain about 75% of the U.S. Population.

There is also the issue of the ecological fallacy, given that we are describing statistical properties of individuals while using aggregated units of observation for our analysis. There are also potential issues with non-conforming time periods of the different data - Zillow’s data is from August 2017 while the Census Data is a cumultive sample spanning 2011-2015. To the extent that one or more of the census variable are either leading or lagging, we would expect some error to be present due to these time differences. There are also some clear warning signs that the data may be unsuited for linear regression - e.g., heavy tails and kurtosis violations could be impacting the applicability of my findings.

Additional analysis could be considered– It would make sense to look at the primary interactions and to experiment with polynomial terms to see if we could get better overall fit. Other potential areas for exploration could include a fixed effect model to control for CBSA variability, a GLM to attenuate the impact of violated distributional assumptions.

Appendix

Histograms of the Variables and Added Variable Plots


  1. https://www.zillow.com/research/data/

  2. https://www.census.gov/acs/www/data/data-tables-and-tools/data-profiles/2016/