Introduction

The previous AT2 study found that the relationship between property value and earthquakes is not significant; other social indicators carry more weight in the model. That model is a multiple linear regression (MLR), and it explains 67.47% of the quarter-on-quarter change in house prices (STDS Group Disaster, 2020). However, it rests on assumptions that limit it. This paper extends the analysis by applying multilevel regression to the existing datasets. It considers random effects on several features, with the ultimate aim of improving model accuracy.

Background and Justifications

The previous analysis examined California home values from the perspective of earthquakes and social indicators, and the team found that earthquakes are not significant for predicting home value. The team concluded that the predictors in the final model explain only 67.47% of the variation in home value. This performance is lower than expected, and during data exploration the home values formed clusters: some groups have consistently higher home values. Some assumptions of linear regression may therefore be violated:

Independence of data points. Each county's home value is derived from the median prices of its cities, and the linear model assumes that the home value indicators of cities and counties are independent of each other. In practice, the data points can be autocorrelated in both space and time.
Constant variance (homoscedasticity). In the diagnostic plots of the AT2 report, the Scale-Location plot shows that the residuals are not spread equally (STDS Group Disaster, 2020). Under such heteroscedasticity, the standard errors of a linear model are unreliable.
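
The constant-variance concern can also be checked numerically rather than only by reading a Scale-Location plot. The sketch below is not part of the AT2 analysis: it simulates data whose noise grows with the predictor (an assumption made purely for illustration) and computes Koenker's studentized Breusch-Pagan statistic by hand, as n times the R-squared of an auxiliary regression of squared residuals on the predictor. All variable names are hypothetical.

```r
# Simulated data (not the AT2 data): the noise sd grows with x,
# so the constant-variance assumption is violated by construction.
set.seed(1)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor.
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan (Koenker) statistic: n * R^2,
# compared against a chi-squared with one df per predictor.
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

lm_stat
p_value
```

A small p-value here indicates that the residual variance depends on the predictor. On the real data, the same steps would be applied to lm.quake with all of its predictors in the auxiliary regression.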

For these reasons, the model needs to incorporate the hierarchical nature of the data, which has both temporal and spatial granularity. To limit model complexity, we still assume the data points are independent over time.

Research Questions

Based on the datasets from AT2, this paper discusses whether multilevel models can answer the research questions below:

What are the effects of social indicators and earthquakes on home value once differences between spatial groups are taken into account?
What features could be used as random effects besides spatial groups, and what patterns do they show?

Initial Model Review

This is the final model proposed in assessment task 2 (AT2). Its summary is shown below:

summary(lm.quake)

## 
## Call:
## lm(formula = log(house_price) ~ date + crime_index + interest + 
##     unemployment_rate + fault_score + quake + log(income) + log(population), 
##     data = train_quake)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87985 -0.23738 -0.02105  0.23739  1.09757 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       19.064542  17.053909   1.118 0.263820    
## date              -0.008393   0.008445  -0.994 0.320465    
## crime_index       -4.203468   1.188437  -3.537 0.000419 ***
## interest          -0.047282   0.043940  -1.076 0.282110    
## unemployment_rate -0.050240   0.003371 -14.902  < 2e-16 ***
## fault_score        0.104306   0.008013  13.017  < 2e-16 ***
## quake1            -0.036217   0.023855  -1.518 0.129202    
## log(income)        1.199167   0.059208  20.254  < 2e-16 ***
## log(population)    0.035957   0.007532   4.774 2.01e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3585 on 1279 degrees of freedom
## Multiple R-squared:  0.6747, Adjusted R-squared:  0.6727 
## F-statistic: 331.6 on 8 and 1279 DF,  p-value: < 2.2e-16

The model's results suggest that income, population, fault score, unemployment rate, and crime index are significant. The next section uses this model as a baseline for developing new mixed-effects models.
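
Because the response is log(house_price), the coefficients are easier to read once converted to percentage changes. The short calculation below uses two estimates copied from the summary above. The interpretation assumes the usual reading of log-log and log-linear terms, with all other predictors held fixed, and the variable names are only illustrative.

```r
# Estimates copied from summary(lm.quake) above.
beta_log_income   <- 1.199167   # log-log term
beta_unemployment <- -0.050240  # log-linear term

# Log-log: % change in house price for a 1% increase in income.
pct_per_1pct_income <- (1.01^beta_log_income - 1) * 100

# Log-linear: % change in house price for a one-unit increase
# in unemployment_rate.
pct_per_unit_unemployment <- (exp(beta_unemployment) - 1) * 100

round(pct_per_1pct_income, 2)
round(pct_per_unit_unemployment, 2)
```

Under these assumptions, a 1% rise in income goes with roughly a 1.2% rise in house price, and a one-unit rise in the unemployment rate goes with roughly a 4.9% fall.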

Model Analysis

The mixed-effects models start with a random intercept for county because the plots below show evident variation between counties. Some counties, such as Marin, Santa Clara, and San Mateo, have consistently higher box plots. Following the research questions, I also consider other indicators as random effects. To compare results, I use RMSE to evaluate model accuracy and the likelihood ratio test to assess whether the models differ significantly. The model with the lowest AIC (Akaike information criterion) is also a candidate for the preferred model (Zajic, 2019).

Because the lme4 package does not provide p-values, I fit the models with the lmerTest package, which reports the significance of each feature. Also, to avoid a higher chance of Type I error, the models are fit with a random intercept (Winter, n.d.).

First, the multilevel model starts from the house price trend for each county. Figure 1 below shows the house price trends from 2010 to 2018 and the house price distribution. Because the price level differs between counties, model A includes county as a random intercept. Figure 2 shows the box plot of house prices among counties. Santa Barbara has the highest median price ($1,250,000), and Modoc has the lowest ($98,000). The range is broad, so it is sensible to fit an intercept for each county.

Figure 1: Median house price trends for each county

Figure 2: House price boxplot

Model A

For comparison with the final model from AT2, model A is the simplest mixed model, with a random intercept only. The formula is shown below.

Formula:

## $call
## lmerTest::lmer(formula = log(house_price) ~ date + crime_index + 
##     interest + unemployment_rate + fault_score + quake + log(income) + 
##     log(population) + (1 | county), data = train_quake)

The RMSE drops from 206110.5 for the AT2 model to 67377.02 for model A. This is a positive sign that the new model performs better than the original one: a smaller RMSE on the test dataset means better prediction (Moody, 2019). In addition, an ANOVA comparison shows that model A has a lower AIC and BIC, which strengthens the conclusion that model A is better (Rblog, 2018).

# Model evaluation
anova(lm.a,lm.quake,test='Chisq')

## Data: train_quake
## Models:
## lm.quake: log(house_price) ~ date + crime_index + interest + unemployment_rate + 
## lm.quake:     fault_score + quake + log(income) + log(population)
## lm.a: log(house_price) ~ date + crime_index + interest + unemployment_rate + 
## lm.a:     fault_score + quake + log(income) + log(population) + (1 | 
## lm.a:     county)
##          npar     AIC     BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## lm.quake   10  1023.5  1075.1 -501.77   1003.5                         
## lm.a       11 -3377.0 -3320.3 1699.51  -3399.0 4402.6  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
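
The numbers in this table can be reproduced by hand from the log-likelihoods, which makes it clearer what the likelihood ratio test and the information criteria measure. The sketch below uses only values printed in the table, together with the 1288 training observations reported in the model summaries. The helper functions aic and bic are defined here for illustration and are not part of lme4.

```r
# Values copied from the anova() table above.
loglik_lm <- -501.77; npar_lm <- 10   # lm.quake
loglik_a  <- 1699.51; npar_a  <- 11   # lm.a
n_obs <- 1288                         # observations in train_quake

# Likelihood ratio test: twice the gain in log-likelihood,
# compared with a chi-squared on the number of extra parameters.
lrt_stat <- 2 * (loglik_a - loglik_lm)
lrt_df   <- npar_a - npar_lm
p_value  <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

# Information criteria: fit penalised by the number of parameters.
aic <- function(loglik, npar) 2 * npar - 2 * loglik
bic <- function(loglik, npar, n) log(n) * npar - 2 * loglik

lrt_stat                          # matches Chisq = 4402.6
aic(loglik_lm, npar_lm)           # matches AIC = 1023.5
aic(loglik_a, npar_a)             # matches AIC = -3377.0
bic(loglik_a, npar_a, n_obs)      # matches BIC = -3320.3
```

The same arithmetic explains the later comparisons: when an extra random effect leaves the log-likelihood unchanged, the test statistic is 0 and the AIC rises by exactly 2, the penalty for one more parameter.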

##RMSE test A
test_quake$prediction <- predict(lm.a, test_quake)

p <- length(attr(summary(lm.a)$terms, 'term.labels'))
n <- nrow(test_quake)
y <- test_quake$house_price
y_predict <- exp(test_quake$prediction)

RSS <- sum((y - y_predict)^2)
MSE <- RSS / (n - p - 1)
RMSE.a <- sqrt(MSE)
RMSE.a

## [1] 67377.02
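
The same RMSE steps are repeated for every model below, so they could be wrapped in a small helper. The function here is a hypothetical refactoring, not code from the original analysis. With p supplied it divides by n - p - 1, as the code above does; with p left as NULL it divides by n, which is the more common definition of RMSE on a test set.

```r
# Hypothetical helper: RMSE on the original price scale.
#   actual    - observed house prices
#   predicted - predicted house prices (already back-transformed)
#   p         - number of model terms; NULL gives the plain RMSE
rmse <- function(actual, predicted, p = NULL) {
  stopifnot(length(actual) == length(predicted))
  rss <- sum((actual - predicted)^2)
  n <- length(actual)
  denom <- if (is.null(p)) n else n - p - 1
  sqrt(rss / denom)
}

# Toy values for illustration only.
y_obs  <- c(100, 200, 300)
y_pred <- c(110, 190, 310)
rmse(y_obs, y_pred)         # plain RMSE
rmse(y_obs, y_pred, p = 1)  # degrees-of-freedom adjusted
```

With a helper like this, each model's evaluation would reduce to one call of the form rmse(test_quake$house_price, exp(predict(model, test_quake)), p), which also keeps the denominator choice consistent across models.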

Model B

Based on the summary of model A, we remove the insignificant features crime_index, fault_score, and log(income), and add county as both a fixed and a random effect. A possible explanation for the insignificance of fault_score is that its spatial information overlaps with the county grouping.

Formula:

## $call
## lmerTest::lmer(formula = log(house_price) ~ date + interest + 
##     unemployment_rate + quake + county + log(population) + (1 | 
##     county), data = train_quake)

The ANOVA test between models A and B shows that model B fits significantly better. However, its RMSE is slightly higher, at 67503.36.

# Model evaluation
anova(lm.b,lm.a,test='Chisq')

## Data: train_quake
## Models:
## lm.a: log(house_price) ~ date + crime_index + interest + unemployment_rate + 
## lm.a:     fault_score + quake + log(income) + log(population) + (1 | 
## lm.a:     county)
## lm.b: log(house_price) ~ date + interest + unemployment_rate + quake + 
## lm.b:     county + log(population) + (1 | county)
##      npar     AIC     BIC logLik deviance  Chisq Df Pr(>Chisq)    
## lm.a   11 -3377.0 -3320.3 1699.5  -3399.0                         
## lm.b   53 -3746.3 -3472.8 1926.2  -3852.3 453.29 42  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##RMSE test B
test_quake$prediction <- predict(lm.b, test_quake)

p <- length(attr(summary(lm.b)$terms, 'term.labels'))
n <- nrow(test_quake)
y <- test_quake$house_price
y_predict <- exp(test_quake$prediction)

RSS <- sum((y - y_predict)^2)
MSE <- RSS / (n - p - 1)
RMSE.b <- sqrt(MSE)
RMSE.b

## [1] 67503.36

Model C

Figure 3: House price vs quake each county

The data exploration in figure 3 suggests that whether a county has a quake has little impact on house price. Nevertheless, from a modeling perspective, we test quake as another random effect.

Formula:

## $call
## lmerTest::lmer(formula = log(house_price) ~ date + interest + 
##     unemployment_rate + quake + county + log(population) + (1 | 
##     county) + (1 | quake), data = train_quake)

The ANOVA test confirms that models B and C are not significantly different. I keep model B because a simpler model is easier to interpret (Wenger & Olden, 2012).

# Model evaluation
anova(lm.b,lm.c,test='Chisq')

## Data: train_quake
## Models:
## lm.b: log(house_price) ~ date + interest + unemployment_rate + quake + 
## lm.b:     county + log(population) + (1 | county)
## lm.c: log(house_price) ~ date + interest + unemployment_rate + quake + 
## lm.c:     county + log(population) + (1 | county) + (1 | quake)
##      npar     AIC     BIC logLik deviance Chisq Df Pr(>Chisq)
## lm.b   53 -3746.3 -3472.8 1926.2  -3852.3                    
## lm.c   54 -3744.3 -3465.6 1926.2  -3852.3     0  1          1

##RMSE test C
test_quake$prediction <- predict(lm.c, test_quake)

p <- length(attr(summary(lm.c)$terms, 'term.labels'))
n <- nrow(test_quake)
y <- test_quake$house_price
y_predict <- exp(test_quake$prediction)

RSS <- sum((y - y_predict)^2)
MSE <- RSS / (n - p - 1)
RMSE.c <- sqrt(MSE)
RMSE.c

## [1] 67503.36

Model D

Figure 4 shows the income trend for each county. Figure 5 shows that average income among counties follows a similar tendency to house prices. Model D therefore adds income as a random slope on top of model B.

Figure 4: Average income trend for each county

Figure 5: Average income trend

Formula:

## $call
## lmerTest::lmer(formula = log(house_price) ~ date + interest + 
##     unemployment_rate + quake + county + log(population) + (1 + 
##     log(income) | county), data = train_quake)

The likelihood ratio test shows that the effect of log(income) varies significantly between counties (p < 2.2e-16). The RMSE drops from 67503.36 (model B) to 40430.97 (model D).
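
A quick way to see what a random slope captures is to fit a separate regression per county and compare the slopes. The sketch below does this on simulated data. It is a no-pooling exploratory check, not the mixed model itself, and the county names, slopes, and noise level are all invented for illustration.

```r
# Simulated data: five hypothetical counties whose house prices
# respond differently to income (the true slopes differ by design).
set.seed(42)
counties    <- paste0("county_", 1:5)
true_slopes <- c(0.5, 0.8, 1.0, 1.2, 1.5)
n_per       <- 30

sim <- do.call(rbind, lapply(seq_along(counties), function(j) {
  log_income <- runif(n_per, min = 10, max = 11.5)
  data.frame(
    county     = counties[j],
    log_income = log_income,
    log_price  = 1 + true_slopes[j] * log_income + rnorm(n_per, sd = 0.05)
  )
}))

# One ordinary regression per county; keep only the income slope.
county_slopes <- sapply(split(sim, sim$county), function(d) {
  coef(lm(log_price ~ log_income, data = d))[["log_income"]]
})

round(county_slopes, 2)
```

When the per-county slopes spread out like this, a single fixed income coefficient cannot describe every county. The term (1 + log(income) | county) in model D lets the slope vary by county while shrinking each county's estimate toward the overall mean.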

# Model evaluation
anova(lm.b,lm.d,test='Chisq')

## Data: train_quake
## Models:
## lm.b: log(house_price) ~ date + interest + unemployment_rate + quake + 
## lm.b:     county + log(population) + (1 | county)
## lm.d: log(house_price) ~ date + interest + unemployment_rate + quake + 
## lm.d:     county + log(population) + (1 + log(income) | county)
##      npar     AIC     BIC logLik deviance  Chisq Df Pr(>Chisq)    
## lm.b   53 -3746.3 -3472.8 1926.2  -3852.3                         
## lm.d   55 -3987.6 -3703.7 2048.8  -4097.6 245.28  2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##RMSE test D
test_quake$prediction <- predict(lm.d, test_quake)

p <- length(attr(summary(lm.d)$terms, 'term.labels'))
n <- nrow(test_quake)
y <- test_quake$house_price
y_predict <- exp(test_quake$prediction)

RSS <- sum((y - y_predict)^2)
MSE <- RSS / (n - p - 1)
RMSE.d <- sqrt(MSE)
RMSE.d

## [1] 40430.97

Model E

For model E, I engineered a feature that categorizes population into three classes. The reason is that population varies considerably among counties, as figure 6 below shows. We classify it using two thresholds, the 75th and 25th percentiles of the population values, and store the result in a new column called pop_tier. If the population is above the 75th percentile, it is marked as tier1. If it is between the 25th and 75th percentiles, it is marked as tier2. The rest are tier3.
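
The tiering rule can be written in a few lines of base R. The sketch below uses a made-up data frame rather than the project data, and it assumes that values exactly on the 25th or 75th percentile fall into tier2, since the rule does not state how the boundaries are handled.

```r
# Toy data for illustration; the real column would come from the
# AT2 dataset.
toy <- data.frame(
  county     = paste0("county_", 1:9),
  population = c(10, 20, 30, 40, 50, 60, 70, 80, 90) * 1000
)

# Thresholds: 25th and 75th percentiles of population.
q <- quantile(toy$population, probs = c(0.25, 0.75))

# Assumption: boundary values are assigned to tier2.
toy$pop_tier <- factor(
  ifelse(toy$population > q[["75%"]], "tier1",
         ifelse(toy$population >= q[["25%"]], "tier2", "tier3")),
  levels = c("tier1", "tier2", "tier3")
)

table(toy$pop_tier)
```

Storing pop_tier as a factor lets it be used directly as a grouping variable in a random-effect term such as (1 | pop_tier).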

Figure 6: Population for each county

Figure 7: House price by population tier

Figure 7 shows the relationship between house price and population tier. The top tier (larger population) has higher house prices.

Formula:

## $call
## lmerTest::lmer(formula = log(house_price) ~ date + interest + 
##     unemployment_rate + quake + county + log(population) + (1 + 
##     log(income) | county) + (1 + 1 | pop_tier), data = train_quake)

Although the RMSE drops to 37986.34, the likelihood ratio test does not show a significant difference between models D and E, so I keep model D as the final model.

# Model evaluation
anova(lm.e,lm.d,test='Chisq')

## Data: train_quake
## Models:
## lm.d: log(house_price) ~ date + interest + unemployment_rate + quake + 
## lm.d:     county + log(population) + (1 + log(income) | county)
## lm.e: log(house_price) ~ date + interest + unemployment_rate + quake + 
## lm.e:     county + log(population) + (1 + log(income) | county) + (1 + 
## lm.e:     1 | pop_tier)
##      npar     AIC     BIC logLik deviance Chisq Df Pr(>Chisq)
## lm.d   55 -3987.6 -3703.7 2048.8  -4097.6                    
## lm.e   56 -3985.6 -3696.5 2048.8  -4097.6     0  1          1

##RMSE test E
test_quake$prediction <- predict(lm.e, test_quake)

p <- length(attr(summary(lm.e)$terms, 'term.labels'))
n <- nrow(test_quake)
y <- test_quake$house_price
y_predict <- exp(test_quake$prediction)

RSS <- sum((y - y_predict)^2)
MSE <- RSS / (n - p - 1)
RMSE.e <- sqrt(MSE)
RMSE.e

## [1] 37986.34

Final Model

Model D is the final model, chosen as the best of these five models. Its random effects are shown in figure 8 below.

Model D expression:

log(house_price) ~ date + interest + unemployment_rate + quake + county + log(population) + (1 + log(income) | county) + ϵ
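
For readers less used to lme4 syntax, the same model can be written out in multilevel notation. This is a sketch that assumes the standard interpretation of the formula: i indexes observations, j indexes counties, and gamma_j denotes the county fixed effect (zero for the reference county). The symbols are introduced here for illustration; the variance estimates quoted afterwards are the ones reported in the Model D summary in the appendix.

```latex
\begin{aligned}
\log(\text{house\_price}_{ij}) ={}& \beta_0
  + \beta_1\,\text{date}_{ij}
  + \beta_2\,\text{interest}_{ij}
  + \beta_3\,\text{unemployment\_rate}_{ij}
  + \beta_4\,\text{quake}_{ij} \\
 &+ \gamma_j
  + \beta_5 \log(\text{population}_{ij})
  + u_{0j}
  + u_{1j}\log(\text{income}_{ij})
  + \epsilon_{ij}, \\[4pt]
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
  \sim{}& \mathcal{N}\!\left(
    \begin{pmatrix} 0 \\ 0 \end{pmatrix},
    \begin{pmatrix}
      \sigma_{0}^{2} & \rho\,\sigma_{0}\sigma_{1} \\
      \rho\,\sigma_{0}\sigma_{1} & \sigma_{1}^{2}
    \end{pmatrix}
  \right),
\qquad \epsilon_{ij} \sim \mathcal{N}(0, \sigma^{2}).
\end{aligned}
```

In the fitted model, the estimated intercept variance is 0.002846, the slope variance is 0.176533, their correlation is -0.55, and the residual variance is 0.002197.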

Caveat: The time-series perspective is not controlled for in the random effects. This model assumes the data points are independent over time so that it can focus on the spatial perspective.

Figure 8: Random effect for each county

Conclusion

Overall, the county effects on house price are mostly similar, but some counties, such as Humboldt, have a negative coefficient, which means investors should think carefully before buying property there. Counties such as Solano and Stanislaus have a positive random effect. Other social indicators, such as population and income, also affect house prices. Furthermore, the model RMSE improved dramatically, from 206110.5 to 40430.97, so the mixed-effects model performs better than linear regression on this data. Additionally, model D fits significantly better than the simpler models it was compared with. Therefore, location is a vital indicator when buying a house for investment.

Reflection

Mixed-effects models offer many interesting strategies. With more time, I believe I could build better knowledge of this domain, and with this experience I am more confident in dealing with this kind of model. There are plenty of learning materials online, and I did not have enough time to review all of them, but I can browse them and come back to them whenever I need to. Useful resources include Datacamp and the lme4 book.

Another critical point concerns the datasets. I have always believed that, no matter how good the model is, we should hold data collection and wrangling to a high standard. Throughout the project, the most challenging part was not the modeling: exploring the data and understanding the relationships within it were just as critical.

Reference

Winter, B. (n.d.). Linear models and linear mixed-effects models in R with linguistic applications. University of California. https://arxiv.org/ftp/arxiv/papers/1308/1308.5499.pdf

Moody, J. (2019, September 6). What does RMSE really mean? Medium. https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e

Rblog, B. (2018, April 12). How do I interpret the AIC. R-Bloggers. https://www.r-bloggers.com/2018/04/how-do-i-interpret-the-aic/

STDS Group Disaster. (2020). The impact of earthquake activity on property market values in California. UTS.

Wenger, S., & Olden, J. (2012). Assessing transferability of ecological models: An underappreciated aspect of statistical validation. Methods in Ecology and Evolution, 3(2), 260–267. https://doi.org/10.1111/j.2041-210X.2011.00170.x

Zajic, A. (2019, December 27). Introduction to AIC — Akaike Information Criterion. Medium. https://towardsdatascience.com/introduction-to-aic-akaike-information-criterion-9c9ba1c96ced

Appendix

Model A summary

summary(lm.a)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## log(house_price) ~ date + crime_index + interest + unemployment_rate +  
##     fault_score + quake + log(income) + log(population) + (1 |      county)
##    Data: train_quake
## 
## REML criterion at convergence: -3342.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.4044 -0.6294  0.0468  0.6179  4.0869 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  county   (Intercept) 0.49725  0.70516 
##  Residual             0.00312  0.05586 
## Number of obs: 1288, groups:  county, 46
## 
## Fixed effects:
##                     Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)       -1.638e+02  5.086e+00  1.241e+03 -32.208  < 2e-16 ***
## date               8.504e-02  2.560e-03  1.238e+03  33.225  < 2e-16 ***
## crime_index       -5.043e-01  3.702e-01  1.244e+03  -1.362   0.1734    
## interest          -8.102e-02  7.064e-03  1.204e+03 -11.469  < 2e-16 ***
## unemployment_rate -3.620e-03  1.581e-03  1.214e+03  -2.290   0.0222 *  
## fault_score       -1.972e-03  6.817e-02  2.369e+01  -0.029   0.9772    
## quake1            -1.270e-02  4.861e-03  1.201e+03  -2.613   0.0091 ** 
## log(income)       -2.225e-03  3.015e-02  1.210e+03  -0.074   0.9412    
## log(population)    4.263e-01  4.548e-02  8.782e+01   9.374  7.1e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) date   crm_nd intrst unmpl_ flt_sc quake1 lg(nc)
## date        -0.992                                                 
## crime_index -0.009 -0.006                                          
## interest     0.506 -0.490 -0.034                                   
## unmplymnt_r -0.844  0.835 -0.035 -0.237                            
## fault_score -0.084  0.023 -0.039  0.017  0.003                     
## quake1      -0.005 -0.001  0.061 -0.072 -0.030 -0.021              
## log(income)  0.271 -0.309 -0.039 -0.087 -0.008 -0.007  0.058       
## log(popltn)  0.015 -0.101  0.164 -0.052 -0.046 -0.236  0.040 -0.055
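
The variance components in this summary show how much of the remaining variation sits between counties. The calculation below takes the two variances printed under "Random effects" and computes the intraclass correlation (ICC). Because model A also contains fixed effects, this is a conditional ICC: the share of variance left after the fixed effects that is attributable to county.

```r
# Variances copied from the "Random effects" block above.
var_county   <- 0.49725   # county (Intercept)
var_residual <- 0.00312   # Residual

# Intraclass correlation: between-county share of the
# variance not explained by the fixed effects.
icc <- var_county / (var_county + var_residual)
round(icc, 4)
```

An ICC this close to 1 means that observations within the same county are far from independent, which supports treating county as a grouping factor rather than relying on the independence assumption of the linear model.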

Model B summary

summary(lm.b)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log(house_price) ~ date + interest + unemployment_rate + quake +  
##     county + log(population) + (1 | county)
##    Data: train_quake
## 
## REML criterion at convergence: -3472
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3941 -0.6411  0.0348  0.6285  4.0063 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  county   (Intercept) 0.034392 0.18545 
##  Residual             0.003063 0.05534 
## Number of obs: 1288, groups:  county, 46
## 
## Fixed effects:
##                         Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)           -1.645e+02  4.836e+00  3.461e+01 -34.020   <2e-16 ***
## date                   8.316e-02  2.433e-03  1.169e+03  34.175   <2e-16 ***
## interest              -8.400e-02  6.977e-03  1.228e+03 -12.040   <2e-16 ***
## unemployment_rate     -3.983e-03  1.569e-03  1.191e+03  -2.538   0.0113 *  
## quake1                -1.105e-02  4.802e-03  1.237e+03  -2.301   0.0216 *  
## countycolusa           1.922e+00  3.762e-01  3.568e-04   5.108   0.9978    
## countycontra costa    -1.487e-03  2.637e-01  8.618e-05  -0.006   1.0000    
## countydel norte        1.728e+00  3.643e-01  3.141e-04   4.743   0.9980    
## countyel dorado        9.329e-01  2.953e-01  1.356e-04   3.159   0.9991    
## countyfresno          -1.013e+00  2.648e-01  8.767e-05  -3.823   0.9994    
## countyglenn            1.731e+00  3.643e-01  3.141e-04   4.751   0.9980    
## countyhumboldt         1.011e+00  3.043e-01  1.528e-04   3.323   0.9990    
## countyimperial         2.931e-01  2.987e-01  1.419e-04   0.981   0.9993    
## countyinyo             2.554e+00  3.831e-01  3.840e-04   6.666   0.9975    
## countykern            -1.005e+00  2.657e-01  8.883e-05  -3.782   0.9994    
## countylake             1.013e+00  3.303e-01  2.120e-04   3.068   0.9987    
## countylassen           1.379e+00  3.571e-01  2.899e-04   3.862   0.9982    
## countylos angeles     -1.645e+00  2.871e-01  1.211e-04  -5.729   0.9992    
## countymarin            1.834e+00  2.862e-01  1.196e-04   6.409   0.9992    
## countymendocino        1.431e+00  3.189e-01  1.844e-04   4.487   0.9988    
## countymerced           5.002e-02  2.863e-01  1.197e-04   0.175   0.9996    
## countymodoc            1.813e+00  4.152e-01  5.296e-04   4.367   0.9969    
## countymono             2.974e+00  3.953e-01  4.351e-04   7.523   0.9971    
## countymonterey         5.256e-01  2.753e-01  1.024e-04   1.909   0.9994    
## countynapa             1.760e+00  3.036e-01  1.513e-04   5.796   0.9990    
## countynevada           1.419e+00  3.148e-01  1.751e-04   4.506   0.9989    
## countyorange          -5.299e-01  2.661e-01  8.933e-05  -1.992   0.9995    
## countyplacer           5.027e-01  2.784e-01  1.070e-04   1.806   0.9994    
## countyplumas           1.800e+00  3.807e-01  3.745e-04   4.728   0.9977    
## countyriverside       -1.114e+00  2.637e-01  8.625e-05  -4.222   0.9994    
## countysan benito       2.324e+00  3.352e-01  2.250e-04   6.933   0.9985    
## countysan bernardino  -1.103e+00  2.633e-01  8.569e-05  -4.188   0.9994    
## countysan diego       -8.551e-01  2.664e-01  8.982e-05  -3.210   0.9994    
## countysan joaquin     -1.985e-01  2.676e-01  9.145e-05  -0.742   0.9995    
## countysan luis obispo  1.078e+00  2.845e-01  1.167e-04   3.789   0.9992    
## countysan mateo        1.075e+00  2.669e-01  9.043e-05   4.029   0.9994    
## countysanta barbara    8.464e-01  2.748e-01  1.017e-04   3.080   0.9993    
## countysanta clara      3.109e-01  2.629e-01  8.513e-05   1.183   0.9995    
## countysanta cruz       1.336e+00  2.854e-01  1.183e-04   4.680   0.9992    
## countyshasta           3.149e-01  2.962e-01  1.372e-04   1.063   0.9993    
## countysierra           3.095e+00  4.703e-01  8.720e-04   6.581   0.9947    
## countysiskiyou         1.214e+00  3.451e-01  2.528e-04   3.517   0.9985    
## countysolano           2.610e-01  2.752e-01  1.023e-04   0.948   0.9995    
## countysonoma           5.506e-01  2.726e-01  9.836e-05   2.020   0.9994    
## countystanislaus      -2.508e-01  2.718e-01  9.725e-05  -0.923   0.9995    
## countysutter           9.022e-01  3.161e-01  1.779e-04   2.855   0.9989    
## countytehama           1.019e+00  3.309e-01  2.138e-04   3.078   0.9987    
## countytulare          -6.033e-01  2.743e-01  1.010e-04  -2.199   0.9994    
## countyventura          1.627e-01  2.657e-01  8.877e-05   0.612   0.9996    
## countyyolo             7.260e-01  2.919e-01  1.295e-04   2.487   0.9992    
## log(population)        7.306e-01  6.269e-02  1.235e+03  11.655   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## Correlation matrix not shown by default, as p = 51 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

Model C summary

summary(lm.c)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log(house_price) ~ date + interest + unemployment_rate + quake +  
##     county + log(population) + (1 | county) + (1 | quake)
##    Data: train_quake
## 
## REML criterion at convergence: -3472
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3941 -0.6411  0.0348  0.6285  4.0063 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  county   (Intercept) 3.795e-02 0.194809
##  quake    (Intercept) 1.193e-05 0.003453
##  Residual             3.063e-03 0.055343
## Number of obs: 1288, groups:  county, 46; quake, 2
## 
## Fixed effects:
##                         Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)           -1.645e+02  4.836e+00  8.656e+01 -34.017   <2e-16 ***
## date                   8.316e-02  2.433e-03  7.011e+01  34.175   <2e-16 ***
## interest              -8.400e-02  6.977e-03  4.860e+02 -12.040   <2e-16 ***
## unemployment_rate     -3.983e-03  1.569e-03  1.110e+02  -2.538   0.0125 *  
## quake1                -1.105e-02  6.849e-03  5.509e-05  -1.613   0.9997    
## countycolusa           1.922e+00  3.855e-01  2.100e-04   4.984   0.9986    
## countycontra costa    -1.487e-03  2.769e-01  5.588e-05  -0.005   1.0000    
## countydel norte        1.728e+00  3.740e-01  1.860e-04   4.621   0.9988    
## countyel dorado        9.329e-01  3.071e-01  8.463e-05   3.037   0.9995    
## countyfresno          -1.013e+00  2.779e-01  5.675e-05  -3.643   0.9996    
## countyglenn            1.731e+00  3.740e-01  1.860e-04   4.629   0.9988    
## countyhumboldt         1.011e+00  3.158e-01  9.457e-05   3.202   0.9994    
## countyimperial         2.931e-01  3.104e-01  8.829e-05   0.944   0.9995    
## countyinyo             2.554e+00  3.923e-01  2.252e-04   6.510   0.9985    
## countykern            -1.005e+00  2.788e-01  5.743e-05  -3.604   0.9996    
## countylake             1.013e+00  3.409e-01  1.284e-04   2.973   0.9992    
## countylassen           1.379e+00  3.669e-01  1.724e-04   3.759   0.9989    
## countylos angeles     -1.645e+00  2.992e-01  7.625e-05  -5.496   0.9995    
## countymarin            1.834e+00  2.984e-01  7.541e-05   6.147   0.9995    
## countymendocino        1.431e+00  3.299e-01  1.126e-04   4.338   0.9992    
## countymerced           5.002e-02  2.984e-01  7.543e-05   0.168   0.9997    
## countymodoc            1.813e+00  4.237e-01  3.064e-04   4.280   0.9981    
## countymono             2.974e+00  4.042e-01  2.538e-04   7.357   0.9983    
## countymonterey         5.256e-01  2.880e-01  6.540e-05   1.825   0.9996    
## countynapa             1.760e+00  3.151e-01  9.370e-05   5.585   0.9993    
## countynevada           1.419e+00  3.259e-01  1.073e-04   4.353   0.9993    
## countyorange          -5.299e-01  2.791e-01  5.773e-05  -1.899   0.9996    
## countyplacer           5.027e-01  2.909e-01  6.808e-05   1.728   0.9996    
## countyplumas           1.800e+00  3.900e-01  2.199e-04   4.616   0.9986    
## countyriverside       -1.114e+00  2.769e-01  5.592e-05  -4.021   0.9996    
## countysan benito       2.324e+00  3.456e-01  1.357e-04   6.723   0.9990    
## countysan bernardino  -1.103e+00  2.765e-01  5.559e-05  -3.988   0.9996    
## countysan diego       -8.551e-01  2.795e-01  5.802e-05  -3.060   0.9996    
## countysan joaquin     -1.985e-01  2.806e-01  5.898e-05  -0.707   0.9997    
## countysan luis obispo  1.078e+00  2.967e-01  7.369e-05   3.632   0.9995    
## countysan mateo        1.075e+00  2.799e-01  5.838e-05   3.842   0.9996    
## countysanta barbara    8.464e-01  2.875e-01  6.495e-05   2.944   0.9996    
## countysanta clara      3.109e-01  2.761e-01  5.526e-05   1.126   0.9997    
## countysanta cruz       1.336e+00  2.976e-01  7.463e-05   4.488   0.9995    
## countyshasta           3.149e-01  3.080e-01  8.554e-05   1.023   0.9995    
## countysierra           3.095e+00  4.778e-01  4.957e-04   6.478   0.9968    
## countysiskiyou         1.214e+00  3.553e-01  1.515e-04   3.416   0.9990    
## countysolano           2.610e-01  2.879e-01  6.530e-05   0.907   0.9996    
## countysonoma           5.506e-01  2.853e-01  6.302e-05   1.930   0.9996    
## countystanislaus      -2.508e-01  2.846e-01  6.237e-05  -0.881   0.9997    
## countysutter           9.022e-01  3.271e-01  1.089e-04   2.758   0.9993    
## countytehama           1.019e+00  3.415e-01  1.294e-04   2.983   0.9992    
## countytulare          -6.033e-01  2.870e-01  6.454e-05  -2.102   0.9996    
## countyventura          1.627e-01  2.787e-01  5.740e-05   0.584   0.9997    
## countyyolo             7.260e-01  3.039e-01  8.110e-05   2.389   0.9995    
## log(population)        7.306e-01  6.269e-02  1.236e+03  11.655   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## Correlation matrix not shown by default, as p = 51 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

Model D summary

summary(lm.d)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log(house_price) ~ date + interest + unemployment_rate + quake +  
##     county + log(population) + (1 + log(income) | county)
##    Data: train_quake
## 
## REML criterion at convergence: -3769.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8083 -0.6361 -0.0365  0.5957  2.7256 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  county   (Intercept) 0.002846 0.05334       
##           log(income) 0.176533 0.42016  -0.55
##  Residual             0.002197 0.04687       
## Number of obs: 1288, groups:  county, 46
## 
## Fixed effects:
##                         Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)           -1.687e+02  4.547e+00  1.160e+02 -37.099  < 2e-16 ***
## date                   8.631e-02  2.299e-03  1.007e+02  37.538  < 2e-16 ***
## interest              -7.808e-02  6.027e-03  7.015e+02 -12.955  < 2e-16 ***
## unemployment_rate     -3.721e-03  1.454e-03  1.788e+02  -2.559   0.0113 *  
## quake1                -9.642e-03  4.226e-03  1.205e+03  -2.282   0.0227 *  
## countycolusa           4.706e+00  8.841e-01  5.618e-01   5.323   0.2463    
## countycontra costa    -3.444e+00  1.169e+00  2.191e-01  -2.947   0.5830    
## countydel norte        7.405e-01  1.810e+00  1.314e-01   0.409   0.8851    
## countyel dorado        4.820e+00  1.071e+00  2.449e-01   4.502   0.5014    
## countyfresno           9.292e-01  1.240e+00  1.788e-01   0.750   0.7958    
## countyglenn            3.871e+00  1.024e+00  2.831e-01   3.779   0.4841    
## countyhumboldt         8.859e+00  1.200e+00  1.860e-01   7.383   0.5247    
## countyimperial         7.575e+00  1.424e+00  1.551e-01   5.320   0.6049    
## countyinyo             2.835e+00  1.276e+00  1.816e-01   2.222   0.6600    
## countykern            -9.463e-02  1.649e+00  1.388e-01  -0.057   0.9805    
## countylake             1.655e+00  1.431e+00  1.465e-01   1.156   0.7713    
## countylassen           2.379e+00  1.102e+00  2.114e-01   2.160   0.6312    
## countylos angeles      7.937e-01  1.210e+00  2.041e-01   0.656   0.7984    
## countymarin            2.098e+00  1.116e+00  2.410e-01   1.880   0.6220    
## countymendocino        6.539e+00  1.095e+00  2.205e-01   5.972   0.4982    
## countymerced          -2.287e+00  1.136e+00  2.104e-01  -2.014   0.6413    
## countymodoc            4.907e-01  9.248e-01  3.627e-01   0.531   0.7742    
## countymono             5.807e+00  8.979e-01  5.365e-01   6.467   0.2329    
## countymonterey         2.003e+00  1.208e+00  1.917e-01   1.658   0.6849    
## countynapa             1.063e+00  1.139e+00  2.259e-01   0.933   0.7366    
## countynevada           2.955e+00  1.226e+00  1.881e-01   2.409   0.6430    
## countyorange           4.985e+00  1.295e+00  1.830e-01   3.850   0.5961    
## countyplacer           2.988e+00  1.269e+00  1.908e-01   2.355   0.6425    
## countyplumas           8.331e+00  1.179e+00  1.878e-01   7.068   0.5262    
## countyriverside       -1.686e-02  1.340e+00  1.617e-01  -0.013   0.9954    
## countysan benito       7.636e+00  1.151e+00  2.353e-01   6.635   0.4679    
## countysan bernardino  -2.111e+00  1.341e+00  1.638e-01  -1.574   0.7197    
## countysan diego        2.032e+00  1.316e+00  1.784e-01   1.545   0.7069    
## countysan joaquin     -4.501e+00  1.308e+00  1.718e-01  -3.440   0.6234    
## countysan luis obispo  3.713e+00  1.238e+00  1.866e-01   2.999   0.6192    
## countysan mateo        1.114e-01  6.899e-01  4.919e+00   0.161   0.8782    
## countysanta barbara    6.938e+00  1.333e+00  1.722e-01   5.206   0.5801    
## countysanta clara     -3.437e+00  9.301e-01  4.249e-01  -3.696   0.3758    
## countysanta cruz       3.114e+00  1.361e+00  1.683e-01   2.288   0.6724    
## countyshasta           6.524e-01  1.242e+00  1.794e-01   0.525   0.8379    
## countysierra           4.517e+00  1.108e+00  3.021e-01   4.077   0.4559    
## countysiskiyou         4.450e+00  1.067e+00  2.291e-01   4.169   0.5289    
## countysolano          -4.428e+00  1.273e+00  1.903e-01  -3.477   0.5977    
## countysonoma          -1.735e+00  1.166e+00  2.064e-01  -1.488   0.6854    
## countystanislaus      -6.684e+00  1.275e+00  1.775e-01  -5.242   0.5716    
## countysutter           3.055e+00  1.210e+00  1.908e-01   2.526   0.6341    
## countytehama           3.921e+00  9.725e-01  3.127e-01   4.032   0.4478    
## countytulare           4.300e-01  1.108e+00  2.112e-01   0.388   0.8606    
## countyventura          3.392e+00  1.254e+00  1.872e-01   2.705   0.6304    
## countyyolo             2.381e+00  1.234e+00  1.961e-01   1.929   0.6614    
## log(population)        5.012e-01  6.763e-02  1.214e+03   7.411 2.34e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## Correlation matrix not shown by default, as p = 51 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

Model E summary

summary(lm.e)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log(house_price) ~ date + interest + unemployment_rate + quake +  
##     county + log(population) + (1 + log(income) | county) + (1 +  
##     1 | pop_tier)
##    Data: train_quake
## 
## REML criterion at convergence: -3776.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8641 -0.6432 -0.0285  0.5947  2.7341 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  county   (Intercept) 0.002098 0.04580       
##           log(income) 0.175277 0.41866  -0.17
##  pop_tier (Intercept) 0.001227 0.03502       
##  Residual             0.002179 0.04667       
## Number of obs: 1288, groups:  county, 46; pop_tier, 3
## 
## Fixed effects:
##                         Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)           -1.675e+02  4.542e+00  9.774e+01 -36.885  < 2e-16 ***
## date                   8.586e-02  2.294e-03  1.352e+02  37.431  < 2e-16 ***
## interest              -7.762e-02  6.029e-03  8.313e+02 -12.873  < 2e-16 ***
## unemployment_rate     -3.937e-03  1.450e-03  2.381e+02  -2.715 0.007103 ** 
## quake1                -9.234e-03  4.215e-03  1.205e+03  -2.191 0.028665 *  
## countycolusa           4.682e+00  8.858e-01  1.941e+00   5.286 0.036191 *  
## countycontra costa    -3.538e+00  1.171e+00  1.555e+01  -3.022 0.008305 ** 
## countydel norte        7.302e-01  1.814e+00  6.990e+01   0.403 0.688529    
## countyel dorado        4.846e+00  1.073e+00  7.646e+00   4.518 0.002194 ** 
## countyfresno           9.003e-01  1.242e+00  3.377e+01   0.725 0.473610    
## countyglenn            3.846e+00  1.026e+00  5.344e+00   3.747 0.011805 *  
## countyhumboldt         8.910e+00  1.202e+00  2.330e+01   7.410 1.44e-07 ***
## countyimperial         7.635e+00  1.427e+00  3.422e+02   5.350 1.61e-07 ***
## countyinyo             2.805e+00  1.278e+00  4.434e+01   2.195 0.033457 *  
## countykern            -1.381e-01  1.653e+00  1.997e+02  -0.084 0.933478    
## countylake             1.663e+00  1.435e+00  4.929e+02   1.159 0.246910    
## countylassen           2.438e+00  1.104e+00  1.035e+01   2.207 0.050905 .  
## countylos angeles      8.276e-01  1.213e+00  2.239e+01   0.682 0.501947    
## countymarin            2.085e+00  1.118e+00  1.010e+01   1.865 0.091384 .  
## countymendocino        6.582e+00  1.097e+00  9.589e+00   5.999 0.000157 ***
## countymerced          -2.280e+00  1.138e+00  1.296e+01  -2.004 0.066460 .  
## countymodoc            4.800e-01  9.268e-01  2.735e+00   0.518 0.643459    
## countymono             5.785e+00  8.994e-01  2.114e+00   6.432 0.020285 *  
## countymonterey         2.017e+00  1.210e+00  2.387e+01   1.666 0.108740    
## countynapa             1.038e+00  1.142e+00  1.246e+01   0.909 0.380497    
## countynevada           2.976e+00  1.229e+00  2.818e+01   2.422 0.022127 *  
## countyorange           4.973e+00  1.297e+00  5.099e+01   3.833 0.000349 ***
## countyplacer           2.940e+00  1.271e+00  3.864e+01   2.313 0.026143 *  
## countyplumas           8.373e+00  1.181e+00  1.972e+01   7.088 7.77e-07 ***
## countyriverside       -9.427e-02  1.343e+00  1.077e+02  -0.070 0.944176    
## countysan benito       7.591e+00  1.153e+00  1.304e+01   6.582 1.74e-05 ***
## countysan bernardino  -2.151e+00  1.344e+00  1.046e+02  -1.601 0.112472    
## countysan diego        2.001e+00  1.318e+00  6.526e+01   1.518 0.133767    
## countysan joaquin     -4.542e+00  1.311e+00  6.644e+01  -3.464 0.000935 ***
## countysan luis obispo  3.703e+00  1.241e+00  3.116e+01   2.985 0.005476 ** 
## countysan mateo        5.729e-01  7.068e-01  5.857e-01   0.811 0.630901    
## countysanta barbara    6.937e+00  1.335e+00  8.180e+01   5.195 1.47e-06 ***
## countysanta clara     -3.513e+00  9.317e-01  2.665e+00  -3.771 0.039962 *  
## countysanta cruz       3.102e+00  1.364e+00  1.201e+02   2.275 0.024678 *  
## countyshasta           6.889e-01  1.245e+00  3.423e+01   0.553 0.583629    
## countysierra           4.472e+00  1.110e+00  8.366e+00   4.030 0.003453 ** 
## countysiskiyou         4.462e+00  1.070e+00  7.840e+00   4.171 0.003259 ** 
## countysolano          -4.497e+00  1.276e+00  4.004e+01  -3.525 0.001078 ** 
## countysonoma          -1.739e+00  1.169e+00  1.627e+01  -1.488 0.155796    
## countystanislaus      -6.723e+00  1.278e+00  4.598e+01  -5.261 3.64e-06 ***
## countysutter           3.072e+00  1.212e+00  2.432e+01   2.534 0.018101 *  
## countytehama           3.518e+00  1.049e+00  6.315e+00   3.355 0.014182 *  
## countytulare           4.504e-01  1.110e+00  1.078e+01   0.406 0.692862    
## countyventura          3.382e+00  1.256e+00  3.516e+01   2.692 0.010814 *  
## countyyolo             2.352e+00  1.236e+00  2.824e+01   1.903 0.067343 .  
## log(population)        4.790e-01  6.770e-02  1.219e+03   7.075 2.51e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## Correlation matrix not shown by default, as p = 51 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

## convergence code: 0
## unable to evaluate scaled gradient
## Model failed to converge: degenerate  Hessian with 2 negative eigenvalues
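
The convergence warning above means the Model E estimates should be read with caution. A likely contributor is that `county` enters the formula both as a fixed factor and as the grouping factor of a random intercept, so the county intercept variance cannot be separated from the county dummies. The term `(1 + 1 | pop_tier)` is also just `(1 | pop_tier)`. The sketch below shows one possible way to respecify and check such a model. It is not the model fitted in this report: the data frame `toy` is simulated purely for illustration, and the centred columns `date_c` and `log_income_c` are hypothetical names introduced here.

```r
library(lme4)

set.seed(42)

# Toy stand-in for train_quake: simulated values only, NOT the AT2 data.
n_county  <- 12
n_quarter <- 28
toy <- expand.grid(
  county  = factor(paste0("county_", seq_len(n_county))),
  quarter = seq_len(n_quarter)
)
idx <- as.integer(toy$county)  # county index for county-level draws

toy$date              <- 2012 + (toy$quarter - 1) / 4
toy$interest          <- (4 + rnorm(n_quarter, sd = 0.3))[toy$quarter]
toy$unemployment_rate <- 8 + rnorm(nrow(toy))
toy$quake             <- factor(rbinom(nrow(toy), 1, 0.3))
toy$population <- exp(rnorm(n_county, 12, 1))[idx] * (1 + 0.002 * toy$quarter)
toy$income     <- exp(rnorm(n_county, 11, 0.3))[idx] *
  exp(rnorm(nrow(toy), sd = 0.05))
toy$pop_tier <- cut(
  toy$population,
  breaks = quantile(toy$population, c(0, 1 / 3, 2 / 3, 1)),
  labels = c("low", "mid", "high"),
  include.lowest = TRUE
)

# Centre predictors with large raw magnitudes (hypothetical column names).
toy$date_c       <- toy$date - mean(toy$date)
toy$log_income_c <- log(toy$income) - mean(log(toy$income))

# Simulated response with a county-level intercept and income slope.
u0 <- rnorm(n_county, sd = 0.2)[idx]
u1 <- rnorm(n_county, sd = 0.3)[idx]
toy$house_price <- exp(
  12.5 + 0.08 * toy$date_c - 0.08 * toy$interest -
    0.004 * toy$unemployment_rate - 0.01 * (toy$quake == "1") +
    0.1 * log(toy$population) + u0 + u1 * toy$log_income_c +
    rnorm(nrow(toy), sd = 0.05)
)

# County appears only as a random grouping factor, not as a fixed factor.
fit <- lmer(
  log(house_price) ~ date_c + interest + unemployment_rate + quake +
    log(population) + (1 + log_income_c | county) + (1 | pop_tier),
  data    = toy,
  REML    = TRUE,
  control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)

# Diagnostics: a boundary (singular) fit and any optimiser messages.
print(isSingular(fit))
print(fit@optinfo$conv$lme4$messages)
```

If `isSingular()` returns `TRUE` on the real data, the random-effects structure is probably still too rich. With only three population tiers, for example, the `pop_tier` variance is estimated from very few groups, and treating the tier as a fixed effect may be the simpler choice.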

Further analysis of earthquake activity on property market values in California

Tianyang Gao

12 November 2020
