This document was created with R Markdown and then printed as a PDF for peer-graded evaluation purposes.
Code chunks will not be echoed in the paper.
This applied project uses the data in the file Capstone-HousePrices. The data come from Anglin and Gencay, “Semiparametric Estimation of a Hedonic Price Function” (Journal of Applied Econometrics 11, 1996, pages 633-648). We consider the modeling and prediction of house prices.
Data are available for 546 observations of the following variables:
- sell: Sale price of the house
- lot: Lot size of the property in square feet
- bdms: Number of bedrooms
- fb: Number of full bathrooms
- sty: Number of stories excluding basement
- drv: Dummy that is 1 if the house has a driveway and 0 otherwise
- rec: Dummy that is 1 if the house has a recreational room and 0 otherwise
- ffin: Dummy that is 1 if the house has a full finished basement and 0 otherwise
- ghw: Dummy that is 1 if the house uses gas for hot water heating and 0 otherwise
- ca: Dummy that is 1 if there is central air conditioning and 0 otherwise
- gar: Number of covered garage places
- reg: Dummy that is 1 if the house is located in a preferred neighborhood of the city and 0 otherwise
- obs: Observation number, needed in part (h)
##
## Call:
## lm(formula = sell ~ . - obs, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41389 -9307 -591 7353 74875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4038.3504 3409.4713 -1.184 0.236762
## lot 3.5463 0.3503 10.124 < 2e-16 ***
## bdms 1832.0035 1047.0002 1.750 0.080733 .
## fb 14335.5585 1489.9209 9.622 < 2e-16 ***
## sty 6556.9457 925.2899 7.086 4.37e-12 ***
## drv 6687.7789 2045.2458 3.270 0.001145 **
## rec 4511.2838 1899.9577 2.374 0.017929 *
## ffin 5452.3855 1588.0239 3.433 0.000642 ***
## ghw 12831.4063 3217.5971 3.988 7.60e-05 ***
## ca 12632.8904 1555.0211 8.124 3.15e-15 ***
## gar 4244.8290 840.5442 5.050 6.07e-07 ***
## reg 9369.5132 1669.0907 5.614 3.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15420 on 534 degrees of freedom
## Multiple R-squared: 0.6731, Adjusted R-squared: 0.6664
## F-statistic: 99.97 on 11 and 534 DF, p-value: < 2.2e-16
##
## RESET test
##
## data: model_a
## RESET = 26.986, df1 = 1, df2 = 533, p-value = 2.922e-07
With a statistic of about 26.99 and a p-value of about 2.9e-07, Ramsey’s RESET test rejects \(H_0\) of correct (linear) specification: the linear model for the sale price in levels appears to be misspecified.
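As a rough illustration of what the RESET test does, the sketch below augments a fitted model with its squared fitted values and F-tests the added term. The single numerator degree of freedom in the output above is consistent with this form, but the call that produced the output is not shown in this document. `reset_manual` is therefore a hypothetical helper, and the built-in `mtcars` data stand in for the house-price data.

```r
# Manual RESET-style test: add fitted^2 to the regressors and F-test it.
# `reset_manual` is a hypothetical helper, not the function used in the report.
reset_manual <- function(model) {
  X    <- model.matrix(model)                  # original regressors
  y    <- model.response(model.frame(model))   # dependent variable
  rss0 <- sum(residuals(model)^2)              # RSS of the original model

  X1   <- cbind(X, fitted_sq = fitted(model)^2)
  rss1 <- sum(lm.fit(X1, y)$residuals^2)       # RSS of the augmented model

  df1 <- 1
  df2 <- nrow(X1) - ncol(X1)
  f   <- ((rss0 - rss1) / df1) / (rss1 / df2)
  list(statistic = f, df1 = df1, df2 = df2,
       p_value = pf(f, df1, df2, lower.tail = FALSE))
}

# Stand-in example on a built-in data set (assumption: not the report's data).
res <- reset_manual(lm(mpg ~ wt, data = mtcars))
print(res)

# p-value implied by the statistic and degrees of freedom reported above.
p_reported <- pf(26.986, 1, 533, lower.tail = FALSE)
print(p_reported)
```

For the house-price model, 546 observations and 12 original coefficients plus the added squared term give 546 - 13 = 533 denominator degrees of freedom, as in the output.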
##
## Jarque Bera Test
##
## data: model_a$residuals
## X-squared = 247.62, df = 2, p-value < 2.2e-16
We also tested the residuals: with a statistic of about 247.62 and a p-value below 2.2e-16, the Jarque-Bera test rejects the null hypothesis that the residuals of the linear model are normally distributed. Non-normal residuals do not by themselves prove misspecification, but together with the RESET result they count against the linear model in levels.
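The Jarque-Bera statistic combines the sample skewness \(S\) and kurtosis \(K\) of the residuals as \(JB = n\,(S^2/6 + (K-3)^2/24)\) and is compared with a chi-squared distribution with 2 degrees of freedom. A minimal sketch follows. `jb_test` is a hypothetical helper written for illustration (the function behind the output above is not shown), and the simulated vector is not the report's residuals.

```r
# Jarque-Bera statistic from sample skewness and kurtosis.
# `jb_test` is a hypothetical helper, not the function used in the report.
jb_test <- function(x) {
  n    <- length(x)
  d    <- x - mean(x)
  m2   <- mean(d^2)                 # second central moment
  skew <- mean(d^3) / m2^1.5        # sample skewness
  kurt <- mean(d^4) / m2^2          # sample kurtosis (3 for a normal)
  jb   <- n * (skew^2 / 6 + (kurt - 3)^2 / 24)
  list(statistic = jb,
       p_value = pchisq(jb, df = 2, lower.tail = FALSE))
}

# Toy example on skewed simulated data (assumption: illustration only).
set.seed(1)
sim <- jb_test(rexp(500))
print(sim)

# p-value implied by the statistic reported above.
p_reported <- pchisq(247.62, df = 2, lower.tail = FALSE)
print(p_reported)
```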
##
## Call:
## lm(formula = log(sell) ~ . - obs, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67865 -0.12211 0.01666 0.12868 0.67737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.003e+01 4.724e-02 212.210 < 2e-16 ***
## lot 5.057e-05 4.854e-06 10.418 < 2e-16 ***
## bdms 3.402e-02 1.451e-02 2.345 0.01939 *
## fb 1.678e-01 2.065e-02 8.126 3.10e-15 ***
## sty 9.227e-02 1.282e-02 7.197 2.10e-12 ***
## drv 1.307e-01 2.834e-02 4.610 5.04e-06 ***
## rec 7.352e-02 2.633e-02 2.792 0.00542 **
## ffin 9.940e-02 2.200e-02 4.517 7.72e-06 ***
## ghw 1.784e-01 4.458e-02 4.000 7.22e-05 ***
## ca 1.780e-01 2.155e-02 8.262 1.14e-15 ***
## gar 5.076e-02 1.165e-02 4.358 1.58e-05 ***
## reg 1.271e-01 2.313e-02 5.496 6.02e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2137 on 534 degrees of freedom
## Multiple R-squared: 0.6766, Adjusted R-squared: 0.6699
## F-statistic: 101.6 on 11 and 534 DF, p-value: < 2.2e-16
##
## RESET test
##
## data: model_b
## RESET = 0.27031, df1 = 1, df2 = 533, p-value = 0.6033
With a statistic of about 0.27 and a p-value of about 0.603, Ramsey’s RESET test does not reject \(H_0\) of correct specification at the 5% level of significance, so the linear model for the logarithm of the sale price may be correctly specified.
##
## Jarque Bera Test
##
## data: model_b$residuals
## X-squared = 8.4432, df = 2, p-value = 0.01467
The residuals are still not fully satisfactory: with a statistic of about 8.443 and a p-value of about 0.0147, the Jarque-Bera test rejects the null hypothesis of normally distributed residuals at the 5% level (though not at the 1% level).
##
## Call:
## lm(formula = log(sell) ~ . - obs + log(lot), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68573 -0.12380 0.00785 0.12521 0.68112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.150e+00 6.830e-01 10.469 < 2e-16 ***
## lot -1.490e-05 1.624e-05 -0.918 0.359086
## bdms 3.489e-02 1.429e-02 2.442 0.014915 *
## fb 1.659e-01 2.033e-02 8.161 2.40e-15 ***
## sty 9.121e-02 1.263e-02 7.224 1.76e-12 ***
## drv 1.068e-01 2.847e-02 3.752 0.000195 ***
## rec 5.467e-02 2.630e-02 2.078 0.038156 *
## ffin 1.052e-01 2.171e-02 4.848 1.64e-06 ***
## ghw 1.791e-01 4.390e-02 4.079 5.20e-05 ***
## ca 1.643e-01 2.146e-02 7.657 9.01e-14 ***
## gar 4.826e-02 1.148e-02 4.203 3.09e-05 ***
## reg 1.344e-01 2.284e-02 5.884 7.10e-09 ***
## log(lot) 3.827e-01 9.070e-02 4.219 2.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2104 on 533 degrees of freedom
## Multiple R-squared: 0.687, Adjusted R-squared: 0.68
## F-statistic: 97.51 on 12 and 533 DF, p-value: < 2.2e-16
We conclude that it is better to include the logarithm of lot size in the model rather than lot size itself: when both are included, log(lot) is highly significant (p-value of about 2.9e-05), whereas lot is not (p-value = 0.359).
The lot coefficient is close to zero in both log-price models, but that magnitude mainly reflects the unit of measurement (square feet); the evidence against lot is that it loses significance once log(lot) is included.
##
## Call:
## lm(formula = log(sell) ~ log(lot) + bdms + fb + sty + drv + rec +
## ffin + ghw + ca + gar + reg + log(lot) * bdms + log(lot) *
## fb + log(lot) * sty + log(lot) * drv + log(lot) * rec + log(lot) *
## ffin + log(lot) * ghw + log(lot) * ca + log(lot) * gar +
## log(lot) * reg, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68306 -0.11612 0.00591 0.12486 0.65998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.966499 1.070667 8.375 5.09e-16 ***
## log(lot) 0.152685 0.128294 1.190 0.2345
## bdms 0.019075 0.326700 0.058 0.9535
## fb -0.368234 0.429048 -0.858 0.3911
## sty 0.488885 0.309700 1.579 0.1150
## drv -1.463371 0.717225 -2.040 0.0418 *
## rec 1.673992 0.655919 2.552 0.0110 *
## ffin -0.031844 0.445543 -0.071 0.9430
## ghw -0.505889 0.902733 -0.560 0.5754
## ca -0.340276 0.496041 -0.686 0.4930
## gar 0.401941 0.258646 1.554 0.1208
## reg 0.118484 0.479856 0.247 0.8051
## log(lot):bdms 0.002070 0.038654 0.054 0.9573
## log(lot):fb 0.062037 0.050145 1.237 0.2166
## log(lot):sty -0.046361 0.035942 -1.290 0.1977
## log(lot):drv 0.191542 0.087361 2.193 0.0288 *
## log(lot):rec -0.188462 0.076373 -2.468 0.0139 *
## log(lot):ffin 0.015913 0.052851 0.301 0.7635
## log(lot):ghw 0.081135 0.106929 0.759 0.4483
## log(lot):ca 0.059549 0.058024 1.026 0.3052
## log(lot):gar -0.041359 0.030142 -1.372 0.1706
## log(lot):reg 0.001515 0.055990 0.027 0.9784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2095 on 524 degrees of freedom
## Multiple R-squared: 0.6951, Adjusted R-squared: 0.6829
## F-statistic: 56.89 on 21 and 524 DF, p-value: < 2.2e-16
Using the 5% significance level, only two of the ten interaction terms are individually significant:
- \(\log(lot) \times drv\)
- \(\log(lot) \times rec\)
## Linear hypothesis test
##
## Hypothesis:
## log(lot):bdms = 0
## log(lot):fb = 0
## log(lot):sty = 0
## log(lot):drv = 0
## log(lot):rec = 0
## log(lot):ffin = 0
## log(lot):ghw = 0
## log(lot):ca = 0
## log(lot):gar = 0
## log(lot):reg = 0
##
## Model 1: restricted model
## Model 2: log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot) * bdms + log(lot) * fb + log(lot) *
## sty + log(lot) * drv + log(lot) * rec + log(lot) * ffin +
## log(lot) * ghw + log(lot) * ca + log(lot) * gar + log(lot) *
## reg
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 534 23.638
## 2 524 22.993 10 0.64555 1.4712 0.1466
The F-test statistic is equal to 1.47 (10 and 524 degrees of freedom), with a p-value of 0.15.
We cannot reject the null hypothesis that the ten interaction coefficients are jointly zero at any commonly used level of significance (1%, 5%, or 10%).
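The F statistic in the output above can be reproduced from the two residual sums of squares. Take \(RSS_0 = 23.638\) for the restricted model, \(RSS_1 = 22.993\) for the model with interactions, \(g = 10\) restrictions, and \(n - k = 546 - 22 = 524\) residual degrees of freedom (the symbols are notation chosen here for illustration). Using the difference 0.64555 printed in the output:

\[
F = \frac{(RSS_0 - RSS_1)/g}{RSS_1/(n-k)} = \frac{0.64555/10}{22.993/524} \approx \frac{0.06456}{0.04388} \approx 1.47
\]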
We use the Akaike information criterion (AIC) with backward elimination to select the best-fitting regressors among the interaction variables:
## Start: AIC=-1685.42
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot) * bdms + log(lot) * fb + log(lot) *
## sty + log(lot) * drv + log(lot) * rec + log(lot) * ffin +
## log(lot) * ghw + log(lot) * ca + log(lot) * gar + log(lot) *
## reg
##
## Df Sum of Sq RSS AIC
## - log(lot):reg 1 0.000032 22.993 -1687.4
## - log(lot):bdms 1 0.000126 22.993 -1687.4
## - log(lot):ffin 1 0.003978 22.997 -1687.3
## - log(lot):ghw 1 0.025263 23.018 -1686.8
## - log(lot):ca 1 0.046216 23.039 -1686.3
## - log(lot):fb 1 0.067158 23.060 -1685.8
## - log(lot):sty 1 0.073009 23.066 -1685.7
## - log(lot):gar 1 0.082614 23.075 -1685.5
## <none> 22.993 -1685.4
## - log(lot):drv 1 0.210938 23.204 -1682.4
## - log(lot):rec 1 0.267192 23.260 -1681.1
##
## Step: AIC=-1687.42
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot):bdms + log(lot):fb + log(lot):sty +
## log(lot):drv + log(lot):rec + log(lot):ffin + log(lot):ghw +
## log(lot):ca + log(lot):gar
##
## Df Sum of Sq RSS AIC
## - log(lot):bdms 1 0.000120 22.993 -1689.4
## - log(lot):ffin 1 0.004469 22.997 -1689.3
## - log(lot):ghw 1 0.025258 23.018 -1688.8
## - log(lot):ca 1 0.046221 23.039 -1688.3
## - log(lot):fb 1 0.067158 23.060 -1687.8
## - log(lot):sty 1 0.075407 23.068 -1687.6
## - log(lot):gar 1 0.083053 23.076 -1687.5
## <none> 22.993 -1687.4
## - log(lot):drv 1 0.223420 23.216 -1684.1
## - log(lot):rec 1 0.267912 23.261 -1683.1
##
## Step: AIC=-1689.42
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv +
## log(lot):rec + log(lot):ffin + log(lot):ghw + log(lot):ca +
## log(lot):gar
##
## Df Sum of Sq RSS AIC
## - log(lot):ffin 1 0.004691 22.998 -1691.3
## - log(lot):ghw 1 0.025142 23.018 -1690.8
## - log(lot):ca 1 0.046110 23.039 -1690.3
## - log(lot):sty 1 0.082302 23.075 -1689.5
## - log(lot):gar 1 0.083231 23.076 -1689.5
## <none> 22.993 -1689.4
## - log(lot):fb 1 0.086271 23.079 -1689.4
## - log(lot):drv 1 0.226616 23.220 -1686.1
## - log(lot):rec 1 0.268554 23.261 -1685.1
##
## Step: AIC=-1691.31
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv +
## log(lot):rec + log(lot):ghw + log(lot):ca + log(lot):gar
##
## Df Sum of Sq RSS AIC
## - log(lot):ghw 1 0.026401 23.024 -1692.7
## - log(lot):ca 1 0.049554 23.047 -1692.1
## - log(lot):gar 1 0.083413 23.081 -1691.3
## <none> 22.998 -1691.3
## - log(lot):sty 1 0.085854 23.083 -1691.3
## - log(lot):fb 1 0.087649 23.085 -1691.2
## - log(lot):drv 1 0.223713 23.221 -1688.0
## - log(lot):rec 1 0.268612 23.266 -1687.0
##
## Step: AIC=-1692.68
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv +
## log(lot):rec + log(lot):ca + log(lot):gar
##
## Df Sum of Sq RSS AIC
## - log(lot):ca 1 0.046603 23.071 -1693.6
## - log(lot):gar 1 0.081651 23.106 -1692.8
## <none> 23.024 -1692.7
## - log(lot):fb 1 0.086091 23.110 -1692.6
## - log(lot):sty 1 0.086948 23.111 -1692.6
## - log(lot):drv 1 0.219028 23.243 -1689.5
## - log(lot):rec 1 0.271714 23.296 -1688.3
##
## Step: AIC=-1693.58
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv +
## log(lot):rec + log(lot):gar
##
## Df Sum of Sq RSS AIC
## - log(lot):gar 1 0.059114 23.130 -1694.2
## - log(lot):sty 1 0.078731 23.149 -1693.7
## <none> 23.071 -1693.6
## - log(lot):fb 1 0.087655 23.158 -1693.5
## - log(lot):drv 1 0.217903 23.288 -1690.4
## - log(lot):rec 1 0.250082 23.321 -1689.7
##
## Step: AIC=-1694.18
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv +
## log(lot):rec
##
## Df Sum of Sq RSS AIC
## - log(lot):fb 1 0.076126 23.206 -1694.4
## - log(lot):sty 1 0.077329 23.207 -1694.4
## <none> 23.130 -1694.2
## - log(lot):drv 1 0.177733 23.307 -1692.0
## - log(lot):rec 1 0.236301 23.366 -1690.6
##
## Step: AIC=-1694.39
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot):sty + log(lot):drv + log(lot):rec
##
## Df Sum of Sq RSS AIC
## - log(lot):sty 1 0.048915 23.255 -1695.2
## <none> 23.206 -1694.4
## - log(lot):drv 1 0.183096 23.389 -1692.1
## - log(lot):rec 1 0.223736 23.430 -1691.2
##
## Step: AIC=-1695.24
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw +
## ca + gar + reg + log(lot):drv + log(lot):rec
##
## Df Sum of Sq RSS AIC
## <none> 23.255 -1695.2
## - log(lot):drv 1 0.16833 23.423 -1693.3
## - log(lot):rec 1 0.23412 23.489 -1691.8
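The AIC values in the trace above are consistent with the form \(n \log(RSS/n) + 2k\), where \(k\) counts the estimated coefficients including the intercept. The sketch below, with the hypothetical helper `step_aic`, recomputes the first and last AIC from the printed RSS values. Small differences from the trace come from the rounding of RSS in the output.

```r
# AIC up to an additive constant, as it appears in the selection trace above.
# `step_aic` is a hypothetical helper name introduced for illustration.
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 546  # number of observations

# Full model: intercept + 11 main effects + 10 interactions = 22 coefficients.
aic_full <- step_aic(22.993, n, k = 22)

# Final model: intercept + 11 main effects + 2 interactions = 14 coefficients.
aic_final <- step_aic(23.255, n, k = 14)

print(round(c(full = aic_full, final = aic_final), 2))
```

Dropping eight interaction terms raises RSS only from 22.993 to 23.255, so the penalty saved (2 per coefficient) outweighs the loss of fit and the AIC falls.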
Only two interaction terms, log(lot):drv and log(lot):rec, survive the backward elimination. Our model is the following:
##
## Call:
## lm(formula = log(sell) ~ log(lot) + bdms + fb + sty + drv + rec +
## ffin + ghw + ca + gar + reg + log(lot):drv + log(lot):rec,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67934 -0.12225 0.00849 0.12259 0.65051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.74189 0.62863 13.906 < 2e-16 ***
## log(lot) 0.17906 0.07707 2.323 0.02053 *
## bdms 0.03881 0.01430 2.714 0.00686 **
## fb 0.16145 0.02025 7.971 9.62e-15 ***
## sty 0.09083 0.01254 7.242 1.56e-12 ***
## drv -1.18996 0.66462 -1.790 0.07395 .
## rec 1.50253 0.62553 2.402 0.01665 *
## ffin 0.10276 0.02157 4.763 2.46e-06 ***
## ghw 0.18448 0.04368 4.223 2.83e-05 ***
## ca 0.16526 0.02121 7.792 3.48e-14 ***
## gar 0.04690 0.01142 4.107 4.65e-05 ***
## reg 0.13260 0.02255 5.880 7.24e-09 ***
## log(lot):drv 0.15943 0.08124 1.962 0.05024 .
## log(lot):rec -0.16826 0.07270 -2.314 0.02103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2091 on 532 degrees of freedom
## Multiple R-squared: 0.6916, Adjusted R-squared: 0.6841
## F-statistic: 91.79 on 13 and 532 DF, p-value: < 2.2e-16
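Because \(drv\) and \(rec\) enter both directly and through an interaction with \(\log(lot)\), their main-effect coefficients cannot be read in isolation. For example, the estimated effect of a driveway on \(\log(sell)\) is \(-1.18996 + 0.15943 \log(lot)\). The sketch below evaluates this expression for a few lot sizes. The helper name and the lot sizes are illustrative assumptions, while the coefficients are copied from the output above.

```r
# Coefficients copied from the model output above.
b_drv     <- -1.18996   # main effect of drv
b_lot_drv <-  0.15943   # interaction log(lot):drv

# Estimated effect of a driveway on log(sell) at a given lot size.
# `drv_effect` is a hypothetical helper name.
drv_effect <- function(lot) b_drv + b_lot_drv * log(lot)

lots <- c(2000, 5000, 10000)   # illustrative lot sizes in square feet
print(round(drv_effect(lots), 3))

# Lot size at which the estimated driveway effect changes sign.
break_even <- exp(-b_drv / b_lot_drv)
print(break_even)
```

The negative main-effect coefficient on \(drv\) therefore does not mean that driveways lower prices; under this model the estimated effect is positive for lots larger than the break-even size and grows with lot size.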
We know that only newer houses are supplied with air conditioning, and we assume that a lower age (and therefore a better condition) of a house affects its selling price positively.
Since we do not have a \(condition\) variable, part of its effect on price will be captured by the \(ca\) variable. The effect of air conditioning (\(ca\)) on the logarithm of the sale price, \(\log(sell)\), will therefore be overestimated.
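The direction of this bias can be sketched with the standard omitted-variable argument. Suppose, as an assumption for illustration, that the true model contains \(condition\) with coefficient \(\gamma > 0\), and that \(condition\) is positively related to \(ca\) with slope \(\delta > 0\) after controlling for the other regressors. The symbols \(\beta_{ca}\), \(\gamma\) and \(\delta\) are notation introduced here, not estimates from the data. The coefficient estimated without \(condition\) then satisfies, approximately,

\[
E[b_{ca}] \approx \beta_{ca} + \gamma\,\delta > \beta_{ca},
\]

so the estimated \(ca\) coefficient absorbs part of the effect of \(condition\).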
##
## Call:
## lm(formula = log(sell) ~ . - obs - lot + log(lot), data = df[1:400,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66582 -0.13906 0.00796 0.14694 0.67596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.67309 0.29240 26.241 < 2e-16 ***
## bdms 0.03787 0.01744 2.172 0.030469 *
## fb 0.15238 0.02469 6.170 1.71e-09 ***
## sty 0.08824 0.01819 4.850 1.79e-06 ***
## drv 0.08641 0.03141 2.751 0.006216 **
## rec 0.05465 0.03392 1.611 0.107975
## ffin 0.11471 0.02673 4.291 2.25e-05 ***
## ghw 0.19870 0.05301 3.748 0.000205 ***
## ca 0.17763 0.02724 6.521 2.17e-10 ***
## gar 0.05301 0.01480 3.583 0.000383 ***
## reg 0.15116 0.04215 3.586 0.000378 ***
## log(lot) 0.31378 0.03615 8.680 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2238 on 388 degrees of freedom
## Multiple R-squared: 0.6705, Adjusted R-squared: 0.6611
## F-statistic: 71.77 on 11 and 388 DF, p-value: < 2.2e-16
| LOGmean | LOGsd | MAE |
|---|---|---|
| 11.06 | 0.372 | 0.128 |
The Mean Absolute Error (MAE) of 0.128 is about a third (roughly 34%) of the standard deviation of the dependent variable (0.372), which leads to the conclusion that the model has some predictive ability.
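For reference, the MAE is the average absolute difference between observed and predicted values of \(\log(sell)\). The sketch below defines it on toy numbers and recomputes the ratio of the reported MAE to the reported standard deviation. The commented lines show how it might be applied to the report's objects. That observations 401 to 546 form the prediction sample is an assumption based on the estimation call above, which uses `df[1:400, ]`.

```r
# Mean absolute error between observed and predicted values.
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Toy values on the log-price scale (assumption: illustration only).
actual    <- c(11.0, 10.8, 11.4)
predicted <- c(11.1, 10.9, 11.2)
toy_mae   <- mae(actual, predicted)
print(toy_mae)

# Ratio of the reported MAE to the reported standard deviation of log(sell).
ratio <- 0.128 / 0.372
print(round(ratio, 3))

# Possible use with the report's objects (not run here; assumes the hold-out
# sample is rows 401 to 546 of df):
# test <- df[401:546, ]
# mae(log(test$sell), predict(model_h, newdata = test))
```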
##
## Jarque Bera Test
##
## data: model_h$residuals
## X-squared = 0.69757, df = 2, p-value = 0.7055
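Our final model is the only one of the tested models whose Jarque-Bera test does not reject the null hypothesis of normality of the residuals.
By this criterion it is the most satisfactory specification among the models considered, although normal residuals alone do not guarantee that a model is correctly specified.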