Econometrics: Methods and Applications by Erasmus University Rotterdam

Week 7 Assignment: Case Project

This document was created with R Markdown and then exported as a PDF for peer-graded evaluation purposes.

Code chunks will not be echoed in the paper.


Data set

This project is of an applied nature and uses data available in the data file Capstone-HousePrices. The source of these data is Anglin and Gencay, “Semiparametric Estimation of a Hedonic Price Function” (Journal of Applied Econometrics 11, 1996, pages 633-648). We consider the modeling and prediction of house prices.
Data are available for 546 observations of the following variables:
- sell: Sale price of the house
- lot: Lot size of the property in square feet
- bdms: Number of bedrooms
- fb: Number of full bathrooms
- sty: Number of stories excluding basement
- drv: Dummy that is 1 if the house has a driveway and 0 otherwise
- rec: Dummy that is 1 if the house has a recreational room and 0 otherwise
- ffin: Dummy that is 1 if the house has a full finished basement and 0 otherwise
- ghw: Dummy that is 1 if the house uses gas for hot water heating and 0 otherwise
- ca: Dummy that is 1 if there is central air conditioning and 0 otherwise
- gar: Number of covered garage places
- reg: Dummy that is 1 if the house is located in a preferred neighborhood of the city and 0 otherwise
- obs: Observation number, needed in part (h)


Questions

(a) Consider a linear model where the sale price of a house is the dependent variable and the explanatory variables are the other variables given above. Perform a test for linearity. What do you conclude based on the test result?

## 
## Call:
## lm(formula = sell ~ . - obs, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41389  -9307   -591   7353  74875 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4038.3504  3409.4713  -1.184 0.236762    
## lot             3.5463     0.3503  10.124  < 2e-16 ***
## bdms         1832.0035  1047.0002   1.750 0.080733 .  
## fb          14335.5585  1489.9209   9.622  < 2e-16 ***
## sty          6556.9457   925.2899   7.086 4.37e-12 ***
## drv          6687.7789  2045.2458   3.270 0.001145 ** 
## rec          4511.2838  1899.9577   2.374 0.017929 *  
## ffin         5452.3855  1588.0239   3.433 0.000642 ***
## ghw         12831.4063  3217.5971   3.988 7.60e-05 ***
## ca          12632.8904  1555.0211   8.124 3.15e-15 ***
## gar          4244.8290   840.5442   5.050 6.07e-07 ***
## reg          9369.5132  1669.0907   5.614 3.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15420 on 534 degrees of freedom
## Multiple R-squared:  0.6731, Adjusted R-squared:  0.6664 
## F-statistic: 99.97 on 11 and 534 DF,  p-value: < 2.2e-16
## 
##  RESET test
## 
## data:  model_a
## RESET = 26.986, df1 = 1, df2 = 533, p-value = 2.922e-07

With a statistic of ~26.986 and a p-value of ~0, Ramsey's RESET test indicates that the linear model is NOT correctly specified (\(H_0\) of correct linear specification is rejected).

## 
##  Jarque Bera Test
## 
## data:  model_a$residuals
## X-squared = 247.62, df = 2, p-value < 2.2e-16

In addition, we tested the residuals: with a statistic of ~247.62 and a p-value of ~0, the Jarque-Bera test rejects normality of the residuals of the linear model, which is further evidence against the linear specification.
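
For reference, a minimal sketch of the R code behind this part (the exact file name and format of the Capstone-HousePrices data set are assumptions; model_a matches the name in the test output):

```r
# Sketch: load the data, fit the linear model in levels, run RESET and Jarque-Bera.
library(lmtest)   # resettest()
library(tseries)  # jarque.bera.test()

df <- read.csv("Capstone-HousePrices.csv")      # assumed CSV export of the data
model_a <- lm(sell ~ . - obs, data = df)        # all regressors except obs
summary(model_a)

resettest(model_a, power = 2, type = "fitted")  # RESET with squared fitted values (df1 = 1)
jarque.bera.test(model_a$residuals)             # normality test on the residuals
```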


(b) Now consider a linear model where the log of the sale price of the house is the dependent variable and the explanatory variables are as before. Perform again the test for linearity. What do you conclude now?

## 
## Call:
## lm(formula = log(sell) ~ . - obs, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67865 -0.12211  0.01666  0.12868  0.67737 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.003e+01  4.724e-02 212.210  < 2e-16 ***
## lot         5.057e-05  4.854e-06  10.418  < 2e-16 ***
## bdms        3.402e-02  1.451e-02   2.345  0.01939 *  
## fb          1.678e-01  2.065e-02   8.126 3.10e-15 ***
## sty         9.227e-02  1.282e-02   7.197 2.10e-12 ***
## drv         1.307e-01  2.834e-02   4.610 5.04e-06 ***
## rec         7.352e-02  2.633e-02   2.792  0.00542 ** 
## ffin        9.940e-02  2.200e-02   4.517 7.72e-06 ***
## ghw         1.784e-01  4.458e-02   4.000 7.22e-05 ***
## ca          1.780e-01  2.155e-02   8.262 1.14e-15 ***
## gar         5.076e-02  1.165e-02   4.358 1.58e-05 ***
## reg         1.271e-01  2.313e-02   5.496 6.02e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2137 on 534 degrees of freedom
## Multiple R-squared:  0.6766, Adjusted R-squared:  0.6699 
## F-statistic: 101.6 on 11 and 534 DF,  p-value: < 2.2e-16
## 
##  RESET test
## 
## data:  model_b
## RESET = 0.27031, df1 = 1, df2 = 533, p-value = 0.6033

With a statistic of ~0.27 and a p-value of ~0.60, Ramsey's RESET test suggests that the linear model for the logarithm of the sale price may be correctly specified (\(H_0\) of correct linear specification is NOT rejected at the 5% significance level).

## 
##  Jarque Bera Test
## 
## data:  model_b$residuals
## X-squared = 8.4432, df = 2, p-value = 0.01467

The residuals are still not fully satisfactory: with a statistic of ~8.44 and a p-value of ~0.015, the Jarque-Bera test rejects the null hypothesis of normally distributed residuals at the 5% level.
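
A sketch of the corresponding code for the log-price specification (assumes df and the packages loaded in the sketch for part (a)):

```r
# Sketch: same regressors and diagnostics with log(sell) as dependent variable.
model_b <- lm(log(sell) ~ . - obs, data = df)
summary(model_b)

resettest(model_b, power = 2, type = "fitted")
jarque.bera.test(model_b$residuals)
```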


(c) Continue with the linear model from question (b). Estimate a model that includes both the lot size variable and its logarithm, as well as all other explanatory variables without transformation. What is your conclusion, should we include lot size itself or its logarithm?

## 
## Call:
## lm(formula = log(sell) ~ . - obs + log(lot), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68573 -0.12380  0.00785  0.12521  0.68112 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.150e+00  6.830e-01  10.469  < 2e-16 ***
## lot         -1.490e-05  1.624e-05  -0.918 0.359086    
## bdms         3.489e-02  1.429e-02   2.442 0.014915 *  
## fb           1.659e-01  2.033e-02   8.161 2.40e-15 ***
## sty          9.121e-02  1.263e-02   7.224 1.76e-12 ***
## drv          1.068e-01  2.847e-02   3.752 0.000195 ***
## rec          5.467e-02  2.630e-02   2.078 0.038156 *  
## ffin         1.052e-01  2.171e-02   4.848 1.64e-06 ***
## ghw          1.791e-01  4.390e-02   4.079 5.20e-05 ***
## ca           1.643e-01  2.146e-02   7.657 9.01e-14 ***
## gar          4.826e-02  1.148e-02   4.203 3.09e-05 ***
## reg          1.344e-01  2.284e-02   5.884 7.10e-09 ***
## log(lot)     3.827e-01  9.070e-02   4.219 2.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2104 on 533 degrees of freedom
## Multiple R-squared:  0.687,  Adjusted R-squared:   0.68 
## F-statistic: 97.51 on 12 and 533 DF,  p-value: < 2.2e-16

We conclude that it is better to include the logarithm of lot size rather than the level of lot size: when both are included, log(lot) remains highly significant (p-value ~0) while lot itself is not (p-value = 0.359). Once log(lot) is in the model, the level of lot size adds no explanatory power.
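
A sketch of the part (c) regression with both lot and log(lot) (model_c is an assumed name):

```r
# Sketch: add log(lot) next to the untransformed lot size.
model_c <- lm(log(sell) ~ . - obs + log(lot), data = df)
summary(model_c)
```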


(d) Consider now a model where the log of the sale price of the house is the dependent variable and the explanatory variables are the log transformation of lot size, with all other explanatory variables as before. We now consider interaction effects of the log lot size with the other variables. Construct these interaction variables. How many are individually significant?

## 
## Call:
## lm(formula = log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + 
##     ffin + ghw + ca + gar + reg + log(lot) * bdms + log(lot) * 
##     fb + log(lot) * sty + log(lot) * drv + log(lot) * rec + log(lot) * 
##     ffin + log(lot) * ghw + log(lot) * ca + log(lot) * gar + 
##     log(lot) * reg, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68306 -0.11612  0.00591  0.12486  0.65998 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.966499   1.070667   8.375 5.09e-16 ***
## log(lot)       0.152685   0.128294   1.190   0.2345    
## bdms           0.019075   0.326700   0.058   0.9535    
## fb            -0.368234   0.429048  -0.858   0.3911    
## sty            0.488885   0.309700   1.579   0.1150    
## drv           -1.463371   0.717225  -2.040   0.0418 *  
## rec            1.673992   0.655919   2.552   0.0110 *  
## ffin          -0.031844   0.445543  -0.071   0.9430    
## ghw           -0.505889   0.902733  -0.560   0.5754    
## ca            -0.340276   0.496041  -0.686   0.4930    
## gar            0.401941   0.258646   1.554   0.1208    
## reg            0.118484   0.479856   0.247   0.8051    
## log(lot):bdms  0.002070   0.038654   0.054   0.9573    
## log(lot):fb    0.062037   0.050145   1.237   0.2166    
## log(lot):sty  -0.046361   0.035942  -1.290   0.1977    
## log(lot):drv   0.191542   0.087361   2.193   0.0288 *  
## log(lot):rec  -0.188462   0.076373  -2.468   0.0139 *  
## log(lot):ffin  0.015913   0.052851   0.301   0.7635    
## log(lot):ghw   0.081135   0.106929   0.759   0.4483    
## log(lot):ca    0.059549   0.058024   1.026   0.3052    
## log(lot):gar  -0.041359   0.030142  -1.372   0.1706    
## log(lot):reg   0.001515   0.055990   0.027   0.9784    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2095 on 524 degrees of freedom
## Multiple R-squared:  0.6951, Adjusted R-squared:  0.6829 
## F-statistic: 56.89 on 21 and 524 DF,  p-value: < 2.2e-16

Using the 5% significance level, only two of the ten interaction variables are individually significant (a sketch of the estimation follows this list):
- \(LOG(lot)*drv\)
- \(LOG(lot)*rec\)
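
A sketch of how the interaction model of part (d) can be estimated (model_d is an assumed name; the compact formula expands to the long list of terms shown in the output above):

```r
# Sketch: interact log(lot) with every other regressor.
# log(lot) * (...) adds the main effects plus the ten interaction terms.
model_d <- lm(log(sell) ~ log(lot) * (bdms + fb + sty + drv + rec + ffin +
                                      ghw + ca + gar + reg),
              data = df)
summary(model_d)
```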


(e) Perform an F-test for the joint significance of the interaction effects from question (d).

## Linear hypothesis test
## 
## Hypothesis:
## log(lot):bdms = 0
## log(lot):fb = 0
## log(lot):sty = 0
## log(lot):drv = 0
## log(lot):rec = 0
## log(lot):ffin = 0
## log(lot):ghw = 0
## log(lot):ca = 0
## log(lot):gar = 0
## log(lot):reg = 0
## 
## Model 1: restricted model
## Model 2: log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot) * bdms + log(lot) * fb + log(lot) * 
##     sty + log(lot) * drv + log(lot) * rec + log(lot) * ffin + 
##     log(lot) * ghw + log(lot) * ca + log(lot) * gar + log(lot) * 
##     reg
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    534 23.638                           
## 2    524 22.993 10   0.64555 1.4712 0.1466

The F-test statistic equals 1.47, with a p-value of 0.15.
We cannot reject the null hypothesis that the interaction coefficients are jointly zero at any commonly used significance level.
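
A sketch of the joint test (the “Linear hypothesis test” header above corresponds to car::linearHypothesis(); model_d is the assumed name from the sketch in part (d)):

```r
# Sketch: F-test that all ten interaction coefficients are jointly zero.
library(car)
interactions <- c("log(lot):bdms = 0", "log(lot):fb = 0",  "log(lot):sty = 0",
                  "log(lot):drv = 0",  "log(lot):rec = 0", "log(lot):ffin = 0",
                  "log(lot):ghw = 0",  "log(lot):ca = 0",  "log(lot):gar = 0",
                  "log(lot):reg = 0")
linearHypothesis(model_d, interactions)
```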


(f) Now perform model specification on the interaction variables using the general-to-specific approach. (Only eliminate the interaction effects.)

We use the Akaike information criterion (AIC) to perform a backward (general-to-specific) selection among the interaction variables:
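
A sketch of this step in R (model_d is the assumed name from part (d) and model_f an assumed name for the result; the trace below is the output of such a call):

```r
# Sketch: backward elimination by AIC, keeping all main effects in the lower
# scope so that only the interaction terms are candidates for removal.
model_f <- step(model_d,
                scope = list(lower = log(sell) ~ log(lot) + bdms + fb + sty +
                               drv + rec + ffin + ghw + ca + gar + reg,
                             upper = formula(model_d)),
                direction = "backward")
summary(model_f)
```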

## Start:  AIC=-1685.42
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot) * bdms + log(lot) * fb + log(lot) * 
##     sty + log(lot) * drv + log(lot) * rec + log(lot) * ffin + 
##     log(lot) * ghw + log(lot) * ca + log(lot) * gar + log(lot) * 
##     reg
## 
##                 Df Sum of Sq    RSS     AIC
## - log(lot):reg   1  0.000032 22.993 -1687.4
## - log(lot):bdms  1  0.000126 22.993 -1687.4
## - log(lot):ffin  1  0.003978 22.997 -1687.3
## - log(lot):ghw   1  0.025263 23.018 -1686.8
## - log(lot):ca    1  0.046216 23.039 -1686.3
## - log(lot):fb    1  0.067158 23.060 -1685.8
## - log(lot):sty   1  0.073009 23.066 -1685.7
## - log(lot):gar   1  0.082614 23.075 -1685.5
## <none>                       22.993 -1685.4
## - log(lot):drv   1  0.210938 23.204 -1682.4
## - log(lot):rec   1  0.267192 23.260 -1681.1
## 
## Step:  AIC=-1687.42
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot):bdms + log(lot):fb + log(lot):sty + 
##     log(lot):drv + log(lot):rec + log(lot):ffin + log(lot):ghw + 
##     log(lot):ca + log(lot):gar
## 
##                 Df Sum of Sq    RSS     AIC
## - log(lot):bdms  1  0.000120 22.993 -1689.4
## - log(lot):ffin  1  0.004469 22.997 -1689.3
## - log(lot):ghw   1  0.025258 23.018 -1688.8
## - log(lot):ca    1  0.046221 23.039 -1688.3
## - log(lot):fb    1  0.067158 23.060 -1687.8
## - log(lot):sty   1  0.075407 23.068 -1687.6
## - log(lot):gar   1  0.083053 23.076 -1687.5
## <none>                       22.993 -1687.4
## - log(lot):drv   1  0.223420 23.216 -1684.1
## - log(lot):rec   1  0.267912 23.261 -1683.1
## 
## Step:  AIC=-1689.42
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv + 
##     log(lot):rec + log(lot):ffin + log(lot):ghw + log(lot):ca + 
##     log(lot):gar
## 
##                 Df Sum of Sq    RSS     AIC
## - log(lot):ffin  1  0.004691 22.998 -1691.3
## - log(lot):ghw   1  0.025142 23.018 -1690.8
## - log(lot):ca    1  0.046110 23.039 -1690.3
## - log(lot):sty   1  0.082302 23.075 -1689.5
## - log(lot):gar   1  0.083231 23.076 -1689.5
## <none>                       22.993 -1689.4
## - log(lot):fb    1  0.086271 23.079 -1689.4
## - log(lot):drv   1  0.226616 23.220 -1686.1
## - log(lot):rec   1  0.268554 23.261 -1685.1
## 
## Step:  AIC=-1691.31
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv + 
##     log(lot):rec + log(lot):ghw + log(lot):ca + log(lot):gar
## 
##                Df Sum of Sq    RSS     AIC
## - log(lot):ghw  1  0.026401 23.024 -1692.7
## - log(lot):ca   1  0.049554 23.047 -1692.1
## - log(lot):gar  1  0.083413 23.081 -1691.3
## <none>                      22.998 -1691.3
## - log(lot):sty  1  0.085854 23.083 -1691.3
## - log(lot):fb   1  0.087649 23.085 -1691.2
## - log(lot):drv  1  0.223713 23.221 -1688.0
## - log(lot):rec  1  0.268612 23.266 -1687.0
## 
## Step:  AIC=-1692.68
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv + 
##     log(lot):rec + log(lot):ca + log(lot):gar
## 
##                Df Sum of Sq    RSS     AIC
## - log(lot):ca   1  0.046603 23.071 -1693.6
## - log(lot):gar  1  0.081651 23.106 -1692.8
## <none>                      23.024 -1692.7
## - log(lot):fb   1  0.086091 23.110 -1692.6
## - log(lot):sty  1  0.086948 23.111 -1692.6
## - log(lot):drv  1  0.219028 23.243 -1689.5
## - log(lot):rec  1  0.271714 23.296 -1688.3
## 
## Step:  AIC=-1693.58
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv + 
##     log(lot):rec + log(lot):gar
## 
##                Df Sum of Sq    RSS     AIC
## - log(lot):gar  1  0.059114 23.130 -1694.2
## - log(lot):sty  1  0.078731 23.149 -1693.7
## <none>                      23.071 -1693.6
## - log(lot):fb   1  0.087655 23.158 -1693.5
## - log(lot):drv  1  0.217903 23.288 -1690.4
## - log(lot):rec  1  0.250082 23.321 -1689.7
## 
## Step:  AIC=-1694.18
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot):fb + log(lot):sty + log(lot):drv + 
##     log(lot):rec
## 
##                Df Sum of Sq    RSS     AIC
## - log(lot):fb   1  0.076126 23.206 -1694.4
## - log(lot):sty  1  0.077329 23.207 -1694.4
## <none>                      23.130 -1694.2
## - log(lot):drv  1  0.177733 23.307 -1692.0
## - log(lot):rec  1  0.236301 23.366 -1690.6
## 
## Step:  AIC=-1694.39
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot):sty + log(lot):drv + log(lot):rec
## 
##                Df Sum of Sq    RSS     AIC
## - log(lot):sty  1  0.048915 23.255 -1695.2
## <none>                      23.206 -1694.4
## - log(lot):drv  1  0.183096 23.389 -1692.1
## - log(lot):rec  1  0.223736 23.430 -1691.2
## 
## Step:  AIC=-1695.24
## log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + ffin + ghw + 
##     ca + gar + reg + log(lot):drv + log(lot):rec
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      23.255 -1695.2
## - log(lot):drv  1   0.16833 23.423 -1693.3
## - log(lot):rec  1   0.23412 23.489 -1691.8

Only two interaction variables, \(LOG(lot)*drv\) and \(LOG(lot)*rec\), survive the backward selection. Our model from this step is the following:

## 
## Call:
## lm(formula = log(sell) ~ log(lot) + bdms + fb + sty + drv + rec + 
##     ffin + ghw + ca + gar + reg + log(lot):drv + log(lot):rec, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67934 -0.12225  0.00849  0.12259  0.65051 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.74189    0.62863  13.906  < 2e-16 ***
## log(lot)      0.17906    0.07707   2.323  0.02053 *  
## bdms          0.03881    0.01430   2.714  0.00686 ** 
## fb            0.16145    0.02025   7.971 9.62e-15 ***
## sty           0.09083    0.01254   7.242 1.56e-12 ***
## drv          -1.18996    0.66462  -1.790  0.07395 .  
## rec           1.50253    0.62553   2.402  0.01665 *  
## ffin          0.10276    0.02157   4.763 2.46e-06 ***
## ghw           0.18448    0.04368   4.223 2.83e-05 ***
## ca            0.16526    0.02121   7.792 3.48e-14 ***
## gar           0.04690    0.01142   4.107 4.65e-05 ***
## reg           0.13260    0.02255   5.880 7.24e-09 ***
## log(lot):drv  0.15943    0.08124   1.962  0.05024 .  
## log(lot):rec -0.16826    0.07270  -2.314  0.02103 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2091 on 532 degrees of freedom
## Multiple R-squared:  0.6916, Adjusted R-squared:  0.6841 
## F-statistic: 91.79 on 13 and 532 DF,  p-value: < 2.2e-16

(g) One may argue that some of the explanatory variables are endogenous and that there may be omitted variables. For example, the ‘condition’ of the house in terms of how it is maintained is not a variable (and difficult to measure) but will affect the house price. It will also affect, or be reflected in, some of the other variables, such as whether the house has an air conditioning (which is mostly in newer houses). If the condition of the house is missing, will the effect of air conditioning on the (log of the) sale price be over- or underestimated? (For this question no computer calculations are required.)

Air conditioning is found mostly in newer houses, and we assume that a lower age (and therefore a better condition) of the house affects the selling price positively.

Since we do not observe a \(condition\) variable, part of its positive effect on price is absorbed by the correlated \(ca\) variable. The effect of air conditioning (\(ca\)) on the log of the sale price \(LOG(sell)\) will therefore be overestimated.
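
This is the standard omitted-variable-bias argument. As a sketch, the expected value of the estimated \(ca\) coefficient when \(condition\) is omitted can be written as

\[
\operatorname{E}\left[b_{ca}\right] = \beta_{ca} + \gamma_{cond}\,\delta_{ca},
\]

where \(\gamma_{cond} > 0\) is the effect of \(condition\) on \(LOG(sell)\) and \(\delta_{ca} > 0\) is the partial association between \(ca\) and the omitted \(condition\) (air-conditioned houses tend to be in better condition). Both terms are positive, so the bias is upward.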


(h) Finally we analyze the predictive ability of the model. Consider again the model where the log of the sale price of the house is the dependent variable and the explanatory variables are the log transformation of lot size, with all other explanatory variables in their original form (and no interaction effects). Estimate the parameters of the model using the first 400 observations. Make predictions on the log of the price and calculate the MAE for the other 146 observations. How good is the predictive power of the model (relative to the variability in the log of the price)?

## 
## Call:
## lm(formula = log(sell) ~ . - obs - lot + log(lot), data = df[1:400, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66582 -0.13906  0.00796  0.14694  0.67596 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.67309    0.29240  26.241  < 2e-16 ***
## bdms         0.03787    0.01744   2.172 0.030469 *  
## fb           0.15238    0.02469   6.170 1.71e-09 ***
## sty          0.08824    0.01819   4.850 1.79e-06 ***
## drv          0.08641    0.03141   2.751 0.006216 ** 
## rec          0.05465    0.03392   1.611 0.107975    
## ffin         0.11471    0.02673   4.291 2.25e-05 ***
## ghw          0.19870    0.05301   3.748 0.000205 ***
## ca           0.17763    0.02724   6.521 2.17e-10 ***
## gar          0.05301    0.01480   3.583 0.000383 ***
## reg          0.15116    0.04215   3.586 0.000378 ***
## log(lot)     0.31378    0.03615   8.680  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2238 on 388 degrees of freedom
## Multiple R-squared:  0.6705, Adjusted R-squared:  0.6611 
## F-statistic: 71.77 on 11 and 388 DF,  p-value: < 2.2e-16
##  mean(log(sell))  sd(log(sell))    MAE
##            11.06          0.372  0.128

The Mean Absolute Error (MAE) of 0.128 is roughly one third of the standard deviation of the log of the sale price (0.372), so the model has some predictive ability relative to the variability in the log price.
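
A sketch of the code for this out-of-sample exercise (model_h matches the name in the Jarque-Bera output below; which sample the mean and standard deviation of log(sell) were computed on is an assumption here):

```r
# Sketch: estimate on the first 400 observations, predict the remaining 146.
train <- df[1:400, ]
test  <- df[401:546, ]

model_h <- lm(log(sell) ~ . - obs - lot + log(lot), data = train)
pred    <- predict(model_h, newdata = test)      # predicted log prices

mae <- mean(abs(log(test$sell) - pred))          # mean absolute error
c(LOGmean = mean(log(test$sell)),                # assumed: hold-out sample statistics
  LOGsd   = sd(log(test$sell)),
  MAE     = mae)
```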


Bonus Material

Plot of Actual vs Forecasted Sale Prices
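
A sketch of how such a plot could be produced (an assumed implementation; pred and test come from the sketch in part (h), and predictions are mapped back to price levels with exp()):

```r
# Sketch: actual vs forecasted sale prices for the hold-out observations.
plot(test$sell, exp(pred),
     xlab = "Actual sale price",
     ylab = "Forecasted sale price",
     main = "Actual vs forecasted sale prices (hold-out sample)")
abline(0, 1, col = "red")  # 45-degree line: perfect forecasts would lie on it
```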


Model Residuals Plots
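
A sketch of standard residual diagnostics for model_h (assumed to be the kind of plots shown here):

```r
# Sketch: the four default lm diagnostic plots for the part (h) model.
par(mfrow = c(2, 2))
plot(model_h)   # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))
```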


Model Residuals Jarque-Bera Test

## 
##  Jarque Bera Test
## 
## data:  model_h$residuals
## X-squared = 0.69757, df = 2, p-value = 0.7055

Our final model is the only one for which the Jarque-Bera test does not reject the null hypothesis of normally distributed residuals.

It is therefore the only model for which none of our diagnostic tests signals misspecification.