MOOC Econometrics

Case Project – House Prices

Notes:

• See website for how to submit your answers and how feedback is organized.
• This exercise uses the datafile CaseProject-HousePrices and requires a computer.
• The dataset CaseProject-HousePrices is available on the website.
• Perform all tests at a 5% significance level.

Goals and skills being used:

• Experience the processes of variable transformation and model selection.
• Apply tests to evaluate models, including effects of endogeneity.
• Study the predictive ability of a model.

Background

This project is of an applied nature and uses data that are available in the data file CaseProject-HousePrices. The source of these data is Anglin and Gencay, “Semiparametric Estimation of a Hedonic Price Function” (Journal of Applied Econometrics 11, 1996, pages 633-648). We consider the modeling and prediction of house prices. Data are available for 546 observations of the following variables:

• sell: Sale price of the house
• lot: Lot size of the property in square feet
• bdms: Number of bedrooms
• fb: Number of full bathrooms
• sty: Number of stories excluding basement
• drv: Dummy that is 1 if the house has a driveway and 0 otherwise
• rec: Dummy that is 1 if the house has a recreational room and 0 otherwise
• ffin: Dummy that is 1 if the house has a full finished basement and 0 otherwise
• ghw: Dummy that is 1 if the house uses gas for hot water heating and 0 otherwise
• ca: Dummy that is 1 if there is central air conditioning and 0 otherwise
• gar: Number of covered garage places
• reg: Dummy that is 1 if the house is located in a preferred neighborhood of the city and 0 otherwise
• obs: Observation number, needed in part (h)

Questions

(a) Consider a linear model where the sale price of a house is the dependent variable and the explanatory variables are the other variables given above. Perform a test for linearity. What do you conclude based on the test result?

First we need to load the data:

library(readxl)
library(ggplot2)
library(dplyr)
library(aTSA)
library(knitr)
library(dynlm)
## Warning: package 'dynlm' was built under R version 4.0.2
df <- read_excel("CaseHouse.xls", col_names = TRUE)
head(df)
## # A tibble: 6 x 13
##     obs  sell   lot  bdms    fb   sty   drv   rec  ffin   ghw    ca   gar   reg
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1 42000  5850     3     1     2     1     0     1     0     0     1     0
## 2     2 38500  4000     2     1     1     1     0     0     0     0     0     0
## 3     3 49500  3060     3     1     1     1     0     0     0     0     0     0
## 4     4 60500  6650     3     1     2     1     1     0     0     0     0     0
## 5     5 61000  6360     2     1     1     1     0     0     0     0     0     0
## 6     6 66000  4160     3     1     1     1     1     1     0     1     0     0
tail(df)
## # A tibble: 6 x 13
##     obs   sell   lot  bdms    fb   sty   drv   rec  ffin   ghw    ca   gar   reg
##   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   541  85000  6525     3     2     4     1     0     0     0     0     1     0
## 2   542  91500  4800     3     2     4     1     1     0     0     1     0     0
## 3   543  94000  6000     3     2     4     1     0     0     0     1     0     0
## 4   544 103000  6000     3     2     4     1     1     0     0     1     1     0
## 5   545 105000  6000     3     2     2     1     1     0     0     1     1     0
## 6   546 105000  6000     3     1     2     1     0     0     0     1     1     0
str(df)
## tibble [546 x 13] (S3: tbl_df/tbl/data.frame)
##  $ obs : num [1:546] 1 2 3 4 5 6 7 8 9 10 ...
##  $ sell: num [1:546] 42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
##  $ lot : num [1:546] 5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
##  $ bdms: num [1:546] 3 2 3 3 2 3 3 3 3 3 ...
##  $ fb  : num [1:546] 1 1 1 1 1 1 2 1 1 2 ...
##  $ sty : num [1:546] 2 1 1 2 1 1 2 3 1 4 ...
##  $ drv : num [1:546] 1 1 1 1 1 1 1 1 1 1 ...
##  $ rec : num [1:546] 0 0 0 1 0 1 0 0 1 1 ...
##  $ ffin: num [1:546] 1 0 0 0 0 1 1 0 1 0 ...
##  $ ghw : num [1:546] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ca  : num [1:546] 0 0 0 0 0 1 0 0 0 1 ...
##  $ gar : num [1:546] 1 0 0 0 0 0 2 0 0 1 ...
##  $ reg : num [1:546] 0 0 0 0 0 0 0 0 0 0 ...

Exploratory analysis

df %>% ggplot(aes(sell, lot)) + geom_point()

df %>% ggplot(aes(sell, lot, color= sty)) + geom_point()

df %>% ggplot(aes(sell, lot, color= fb)) + geom_point()

df %>% ggplot(aes(sell, lot, color= bdms)) + geom_point()

df %>% ggplot(aes(sell, lot, color= rec)) + geom_point()

df %>% ggplot(aes(sell, lot, color= gar)) + geom_point() + facet_grid(.~ fb)

First Model

Then we need to fit the model:

fit <- lm(sell~.-obs, data = df)
summary(fit)
## 
## Call:
## lm(formula = sell ~ . - obs, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41389  -9307   -591   7353  74875 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4038.3504  3409.4713  -1.184 0.236762    
## lot             3.5463     0.3503  10.124  < 2e-16 ***
## bdms         1832.0035  1047.0002   1.750 0.080733 .  
## fb          14335.5585  1489.9209   9.622  < 2e-16 ***
## sty          6556.9457   925.2899   7.086 4.37e-12 ***
## drv          6687.7789  2045.2458   3.270 0.001145 ** 
## rec          4511.2838  1899.9577   2.374 0.017929 *  
## ffin         5452.3855  1588.0239   3.433 0.000642 ***
## ghw         12831.4063  3217.5971   3.988 7.60e-05 ***
## ca          12632.8904  1555.0211   8.124 3.15e-15 ***
## gar          4244.8290   840.5442   5.050 6.07e-07 ***
## reg          9369.5132  1669.0907   5.614 3.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15420 on 534 degrees of freedom
## Multiple R-squared:  0.6731, Adjusted R-squared:  0.6664 
## F-statistic: 99.97 on 11 and 534 DF,  p-value: < 2.2e-16

Model characteristics: R2 = 0.6731 , F-statistic: 99.97 on 11 and 534 DF

Model Linearity testing

Ramsey’s RESET (linearity) testing
library(lmtest)

# H0: the model is correctly specified as linear

modelA.RESET <- resettest(fit, power = 2, type = "fitted", 
                          data = df)

modelA.RESET
## 
##  RESET test
## 
## data:  fit
## RESET = 26.986, df1 = 1, df2 = 533, p-value = 2.922e-07

With a statistic of 26.986 and a p-value of ~0.000, Ramsey’s RESET test rejects H0: the linear model is NOT correctly specified.
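To see what the test does, here is a hand-rolled equivalent of resettest with power = 2 and type = "fitted" (a sketch; the helper column name yhat_sq is our own): the regression is augmented with the squared fitted values, and an F-test checks whether the added term matters.

df_reset <- df
df_reset$yhat_sq <- fitted(fit)^2                  # squared fitted values of model A
fit_reset <- lm(sell ~ . - obs, data = df_reset)   # '.' now also picks up yhat_sq
anova(fit, fit_reset)                              # F on yhat_sq reproduces RESET = 26.986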

Jarque-Bera (residuals normality)

library(tsoutliers)
## Warning: package 'tsoutliers' was built under R version 4.0.2
# H0: the model errors are normally distributed
modelA.JB <- JarqueBera.test(fit$residuals)
modelA.JB
## 
##  Jarque Bera Test
## 
## data:  fit$residuals
## X-squared = 247.62, df = 2, p-value < 2.2e-16
## 
## 
##  Skewness
## 
## data:  fit$residuals
## statistic = 0.85278, p-value = 4.119e-16
## 
## 
##  Kurtosis
## 
## data:  fit$residuals
## statistic = 5.8241, p-value < 2.2e-16

With a statistic of ~247.62 and a p-value of ~0, the Jarque-Bera test indicates that the residuals are NOT normally distributed, casting further doubt on the specification of the linear model. The residual distribution is right-skewed (skewness ≈ 0.85) and leptokurtic (kurtosis ≈ 5.82), both far from the normal benchmarks of 0 and 3.
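For reference, the Jarque-Bera statistic combines the residual skewness S and kurtosis K and is asymptotically chi-squared with 2 degrees of freedom under normality:

$$ JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right) \;\sim\; \chi^2(2) $$

Plugging in n = 546, S ≈ 0.853 and K ≈ 5.824 indeed gives (546/6)(0.727 + 1.994) ≈ 247.6, matching the reported statistic.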

Both Ramsey’s RESET and Jarque-Bera tests suggest that the considered linear model is NOT correctly specified.

Real to fitted-values diagram

modelA.fitted <- fitted.values(fit)

df %>% ggplot(aes(modelA.fitted, sell)) +
    geom_point(shape=16) +
    geom_smooth() + ggtitle("Actual vs Fitted Value of Model A")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

(b) Now consider a linear model where the log of the sale price of the house is the dependent variable and the explanatory variables are as before. Perform again the test for linearity. What do you conclude now?

Second model estimation - Logarithmic

df$sell_LOG <- log(df$sell)
df$lot_LOG  <- log(df$lot)
fit2 <- lm(sell_LOG ~ lot + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg, data = df)
summary(fit2)
## 
## Call:
## lm(formula = sell_LOG ~ lot + bdms + fb + sty + drv + rec + ffin + 
##     ghw + ca + gar + reg, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67865 -0.12211  0.01666  0.12868  0.67737 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.003e+01  4.724e-02 212.210  < 2e-16 ***
## lot         5.057e-05  4.854e-06  10.418  < 2e-16 ***
## bdms        3.402e-02  1.451e-02   2.345  0.01939 *  
## fb          1.678e-01  2.065e-02   8.126 3.10e-15 ***
## sty         9.227e-02  1.282e-02   7.197 2.10e-12 ***
## drv         1.307e-01  2.834e-02   4.610 5.04e-06 ***
## rec         7.352e-02  2.633e-02   2.792  0.00542 ** 
## ffin        9.940e-02  2.200e-02   4.517 7.72e-06 ***
## ghw         1.784e-01  4.458e-02   4.000 7.22e-05 ***
## ca          1.780e-01  2.155e-02   8.262 1.14e-15 ***
## gar         5.076e-02  1.165e-02   4.358 1.58e-05 ***
## reg         1.271e-01  2.313e-02   5.496 6.02e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2137 on 534 degrees of freedom
## Multiple R-squared:  0.6766, Adjusted R-squared:  0.6699 
## F-statistic: 101.6 on 11 and 534 DF,  p-value: < 2.2e-16

Model characteristics: R2 = 0.6766 , F-statistic: 101.56 on 11 and 534 DF

Model linearity testing

Ramsey’s RESET
# H0: the model is correctly specified as linear

modelB.RESET <- resettest(fit2, power = 2, type = "fitted", 
                          data = df)

modelB.RESET
## 
##  RESET test
## 
## data:  fit2
## RESET = 0.27031, df1 = 1, df2 = 533, p-value = 0.6033

With a statistic of ~0.27 and a p-value of ~0.6033, Ramsey’s RESET test suggests that the second model might be correctly specified: we fail to reject H0 at the 5% level of significance.

Jarque-Bera
# H0: the model errors are normally distributed
modelB.JB <- JarqueBera.test(fit2$residuals)
modelB.JB
## 
##  Jarque Bera Test
## 
## data:  fit2$residuals
## X-squared = 8.4432, df = 2, p-value = 0.01467
## 
## 
##  Skewness
## 
## data:  fit2$residuals
## statistic = 0.19898, p-value = 0.05768
## 
## 
##  Kurtosis
## 
## data:  fit2$residuals
## statistic = 3.4613, p-value = 0.0278

Also with this model, with a statistic of ~8.443 and a p-value of ~0.0147, the Jarque-Bera test suggests that the residuals are still NOT normally distributed.

Therefore the model is still not fully correctly specified, although the second model’s JB statistic is much smaller than the first’s, a substantial improvement.

Taken together, Ramsey’s RESET and Jarque-Bera indicate that the second model is a clear improvement over the first, but one issue remains:

  • Ramsey’s RESET suggests that the second model might be correctly specified,

  • while Jarque-Bera suggests that its residuals are still NOT normal (although much closer to normality than before).

Real to fitted-values diagram
modelB.fitted <- fitted.values(fit2)

df %>% ggplot(aes(modelB.fitted, sell_LOG)) +
    geom_point(shape=16) +
    geom_smooth() + ggtitle("Actual vs Fitted Value of Model B")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

(c) Continue with the linear model from question (b). Estimate a model that includes both the lot size variable and its logarithm, as well as all other explanatory variables without transformation. What is your conclusion, should we include lot size itself or its logarithm?

Third model estimation

fit3 <- lm(sell_LOG ~ lot + lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg, 
             data = df)

summary(fit3)
## 
## Call:
## lm(formula = sell_LOG ~ lot + lot_LOG + bdms + fb + sty + drv + 
##     rec + ffin + ghw + ca + gar + reg, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68573 -0.12380  0.00785  0.12521  0.68112 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.150e+00  6.830e-01  10.469  < 2e-16 ***
## lot         -1.490e-05  1.624e-05  -0.918 0.359086    
## lot_LOG      3.827e-01  9.070e-02   4.219 2.88e-05 ***
## bdms         3.489e-02  1.429e-02   2.442 0.014915 *  
## fb           1.659e-01  2.033e-02   8.161 2.40e-15 ***
## sty          9.121e-02  1.263e-02   7.224 1.76e-12 ***
## drv          1.068e-01  2.847e-02   3.752 0.000195 ***
## rec          5.467e-02  2.630e-02   2.078 0.038156 *  
## ffin         1.052e-01  2.171e-02   4.848 1.64e-06 ***
## ghw          1.791e-01  4.390e-02   4.079 5.20e-05 ***
## ca           1.643e-01  2.146e-02   7.657 9.01e-14 ***
## gar          4.826e-02  1.148e-02   4.203 3.09e-05 ***
## reg          1.344e-01  2.284e-02   5.884 7.10e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2104 on 533 degrees of freedom
## Multiple R-squared:  0.687,  Adjusted R-squared:   0.68 
## F-statistic: 97.51 on 12 and 533 DF,  p-value: < 2.2e-16

Model characteristics: R2 = 0.687 , F-statistic: 97.51 on 12 and 533 DF

Model linearity testing

Ramsey’s RESET
# H0: the model is correctly specified as linear

modelC.RESET <- resettest(fit3, power = 2, type = "fitted", 
                          data = df)

modelC.RESET
## 
##  RESET test
## 
## data:  fit3
## RESET = 0.06769, df1 = 1, df2 = 532, p-value = 0.7948

With a statistic of ~0.068 and a p-value of ~0.7948, Ramsey’s RESET test suggests that the third model might be correctly specified (we fail to reject H0 at the 5% level of significance).

It is also the best model constructed so far by this criterion: it has the lowest statistic and the highest p-value of all Ramsey’s RESET tests run so far.

Jarque-Bera
# H0: the model errors are normally distributed
modelC.JB <- JarqueBera.test(fit3$residuals)
modelC.JB
## 
##  Jarque Bera Test
## 
## data:  fit3$residuals
## X-squared = 9.3643, df = 2, p-value = 0.009259
## 
## 
##  Skewness
## 
## data:  fit3$residuals
## statistic = 0.18025, p-value = 0.08553
## 
## 
##  Kurtosis
## 
## data:  fit3$residuals
## statistic = 3.5307, p-value = 0.01136

With a statistic of ~9.364 and a p-value of ~0.0093, the Jarque-Bera test indicates that the residuals are still NOT normally distributed; by this criterion the model is still not fully correctly specified.

The Jarque-Bera test shows no further improvement; in fact the second model’s residuals were slightly closer to normality than the third’s.

Both tests agree that the third model improves considerably on the first, and Ramsey’s RESET even prefers it to the second. The remaining tension is:

  • Ramsey’s RESET suggests that the third model might be correctly specified,

  • while Jarque-Bera suggests that its residuals are still NOT normal.

Real to fitted-values diagram
modelC.fitted <- fitted.values(fit3)

df %>% ggplot(aes(modelC.fitted, sell_LOG)) +
    geom_point(shape=16) +
    geom_smooth() + ggtitle("Actual vs Fitted Value of Model C")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

It is concluded that the logarithm of lot size, rather than lot size itself, should be included in the model, for the following reasons:

  • Across the three models tested so far, the Ramsey’s RESET tests show that the log of lot size markedly improves the specification, while the Jarque-Bera tests show an acceptable (so far) level of residual normality.

  • When both variables are used together (third model), the coefficient of the log of lot size is highly significant (p ≈ 0.000), while the coefficient of lot size itself is not (p ≈ 0.359).

  • The lot variable ends up with a near-zero coefficient in the (improved) second and third models anyway.

(d) Consider now a model where the log of the sale price of the house is the dependent variable and the explanatory variables are the log transformation of lot size, with all other explanatory variables as before. We now consider interaction effects of the log lot size with the other variables. Construct these interaction variables. How many are individually significant?

Fourth model estimation

fit4 <- lm(sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg +
                 lot_LOG * bdms + lot_LOG * fb + lot_LOG * sty + lot_LOG * drv + lot_LOG * rec + 
                 lot_LOG * ffin + lot_LOG * ghw + lot_LOG * ca + lot_LOG * gar + lot_LOG * reg, 
             data = df)
summary(fit4)
## 
## Call:
## lm(formula = sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + 
##     ffin + ghw + ca + gar + reg + lot_LOG * bdms + lot_LOG * 
##     fb + lot_LOG * sty + lot_LOG * drv + lot_LOG * rec + lot_LOG * 
##     ffin + lot_LOG * ghw + lot_LOG * ca + lot_LOG * gar + lot_LOG * 
##     reg, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68306 -0.11612  0.00591  0.12486  0.65998 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.966499   1.070667   8.375 5.09e-16 ***
## lot_LOG       0.152685   0.128294   1.190   0.2345    
## bdms          0.019075   0.326700   0.058   0.9535    
## fb           -0.368234   0.429048  -0.858   0.3911    
## sty           0.488885   0.309700   1.579   0.1150    
## drv          -1.463371   0.717225  -2.040   0.0418 *  
## rec           1.673992   0.655919   2.552   0.0110 *  
## ffin         -0.031844   0.445543  -0.071   0.9430    
## ghw          -0.505889   0.902733  -0.560   0.5754    
## ca           -0.340276   0.496041  -0.686   0.4930    
## gar           0.401941   0.258646   1.554   0.1208    
## reg           0.118484   0.479856   0.247   0.8051    
## lot_LOG:bdms  0.002070   0.038654   0.054   0.9573    
## lot_LOG:fb    0.062037   0.050145   1.237   0.2166    
## lot_LOG:sty  -0.046361   0.035942  -1.290   0.1977    
## lot_LOG:drv   0.191542   0.087361   2.193   0.0288 *  
## lot_LOG:rec  -0.188462   0.076373  -2.468   0.0139 *  
## lot_LOG:ffin  0.015913   0.052851   0.301   0.7635    
## lot_LOG:ghw   0.081135   0.106929   0.759   0.4483    
## lot_LOG:ca    0.059549   0.058024   1.026   0.3052    
## lot_LOG:gar  -0.041359   0.030142  -1.372   0.1706    
## lot_LOG:reg   0.001515   0.055990   0.027   0.9784    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2095 on 524 degrees of freedom
## Multiple R-squared:  0.6951, Adjusted R-squared:  0.6829 
## F-statistic: 56.89 on 21 and 524 DF,  p-value: < 2.2e-16

Model characteristics: R2 = 0.6951, F-statistic: 56.89 on 21 and 524 DF; the lot variable is excluded. Note the introduction of ten (10) interaction variables between the log of lot size and each of the other explanatory variables.

Model linearity testing

Ramsey’s RESET
# H0: the model is correctly specified as linear

modelD.RESET <- resettest(fit4, power = 2, type = "fitted", 
                          data = df)

modelD.RESET
## 
##  RESET test
## 
## data:  fit4
## RESET = 0.011571, df1 = 1, df2 = 523, p-value = 0.9144

With a statistic of ~0.012 and a p-value of ~0.9144, Ramsey’s RESET test suggests that the fourth model might be correctly specified; we fail to reject H0 at the 5% level of significance.

It is also the best model constructed so far by this criterion, with the lowest statistic and the highest p-value of all Ramsey’s RESET tests run so far.

Jarque-Bera
# H0: the model errors are normally distributed
modelD.JB <- JarqueBera.test(fit4$residuals)
modelD.JB
## 
##  Jarque Bera Test
## 
## data:  fit4$residuals
## X-squared = 8.2029, df = 2, p-value = 0.01655
## 
## 
##  Skewness
## 
## data:  fit4$residuals
## statistic = 0.17281, p-value = 0.09924
## 
## 
##  Kurtosis
## 
## data:  fit4$residuals
## statistic = 3.491, p-value = 0.01918

With a statistic of ~8.203 and a p-value of ~0.0165, the Jarque-Bera test indicates that the residuals are still NOT normally distributed; by this criterion the model is still not fully correctly specified.

This is nevertheless the best Jarque-Bera result so far: introducing the interaction variables slightly improves residual normality relative to the previous best (the second model).

Both tests agree that the fourth model improves on all models considered previously, but the familiar pattern remains:

  • Ramsey’s RESET suggests that the fourth model might be correctly specified,

  • while Jarque-Bera suggests that its residuals are still NOT normal.

Real to fitted-values diagram
modelD.fitted <- fitted.values(fit4)

df %>% ggplot(aes(modelD.fitted, sell_LOG)) +
    geom_point(shape=16) +
    geom_smooth() + ggtitle("Actual vs Fitted Value of Model D")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Using the 5% significance level, only two (2) of the ten (10) interaction variables used are individually significant:

  • LOG(lot)⋅drv

  • LOG(lot)⋅rec

(e) Perform an F-test for the joint significance of the interaction effects from question (d).

The corresponding restricted model (fifth model) will use only the two (2) significant interaction variables identified in part (d):

  • LOG(lot)⋅drv

  • LOG(lot)⋅rec

It will also use the rest of the fourth model variables (as is).

Fifth model estimation

fit5 <- lm(sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg +
                 lot_LOG * drv + lot_LOG * rec, 
             data = df)
summary(fit5)
## 
## Call:
## lm(formula = sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + 
##     ffin + ghw + ca + gar + reg + lot_LOG * drv + lot_LOG * rec, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67934 -0.12225  0.00849  0.12259  0.65051 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.74189    0.62863  13.906  < 2e-16 ***
## lot_LOG      0.17906    0.07707   2.323  0.02053 *  
## bdms         0.03881    0.01430   2.714  0.00686 ** 
## fb           0.16145    0.02025   7.971 9.62e-15 ***
## sty          0.09083    0.01254   7.242 1.56e-12 ***
## drv         -1.18996    0.66462  -1.790  0.07395 .  
## rec          1.50253    0.62553   2.402  0.01665 *  
## ffin         0.10276    0.02157   4.763 2.46e-06 ***
## ghw          0.18448    0.04368   4.223 2.83e-05 ***
## ca           0.16526    0.02121   7.792 3.48e-14 ***
## gar          0.04690    0.01142   4.107 4.65e-05 ***
## reg          0.13260    0.02255   5.880 7.24e-09 ***
## lot_LOG:drv  0.15943    0.08124   1.962  0.05024 .  
## lot_LOG:rec -0.16826    0.07270  -2.314  0.02103 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2091 on 532 degrees of freedom
## Multiple R-squared:  0.6916, Adjusted R-squared:  0.6841 
## F-statistic: 91.79 on 13 and 532 DF,  p-value: < 2.2e-16

Model characteristics: R2 = 0.6916 , F-statistic: 91.79 on 13 and 532 DF

Model linearity testing

Ramsey’s RESET
# H0: the model is correctly specified as linear

modelE.RESET <- resettest(fit5, power = 2, type = "fitted", 
                          data = df)

modelE.RESET
## 
##  RESET test
## 
## data:  fit5
## RESET = 0.045677, df1 = 1, df2 = 531, p-value = 0.8308

With a statistic of ~0.046 and a p-value of ~0.8308, Ramsey’s RESET test suggests that the fifth model might be correctly specified (H0 of correct/linear specification NOT rejected at the 5% level of significance).

It also suggests that this is not the best model constructed so far: the fourth model scored an even lower statistic and a higher p-value.

Jarque-Bera

# H0: the model errors are normally distributed
modelE.JB <- JarqueBera.test(fit5$residuals)
modelE.JB
## 
##  Jarque Bera Test
## 
## data:  fit5$residuals
## X-squared = 9.2372, df = 2, p-value = 0.009866
## 
## 
##  Skewness
## 
## data:  fit5$residuals
## statistic = 0.18929, p-value = 0.07096
## 
## 
##  Kurtosis
## 
## data:  fit5$residuals
## statistic = 3.5125, p-value = 0.0145

With a statistic of ~9.237 and a p-value of ~0.0099, the Jarque-Bera test suggests that the model residuals are still NOT normally distributed; therefore the model is still NOT correctly specified.

This Jarque-Bera result is not the best scored so far either: the fourth model’s test indicated better residual normality.

Both Ramsey’s RESET and Jarque-Bera therefore suggest that the fourth model was better than this fifth (restricted) one. The fifth model is nevertheless the second best so far according to Ramsey’s RESET and the third best according to Jarque-Bera, with the same open issues as before.

Real to fitted-values diagram
modelE.fitted <- fitted.values(fit5)

df %>% ggplot(aes(modelE.fitted, sell_LOG)) +
    geom_point(shape=16) +
    geom_smooth() + ggtitle("Actual vs Fitted Value of Model E")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Using SSR

The F-test for the joint significance of the interaction effects can be performed either:

  • using the sum of square residuals (SSR) of the restricted and the unrestricted models, or

  • using the R2 of both the restricted and the unrestricted models.
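Both versions compute the same statistic. With g restrictions, n observations and k parameters in the unrestricted model,

$$ F \;=\; \frac{(SSR_r - SSR_u)/g}{SSR_u/(n-k)} \;=\; \frac{(R_u^2 - R_r^2)/g}{(1 - R_u^2)/(n-k)} \;\sim\; F(g,\, n-k) \quad \text{under } H_0 $$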

SSR_u <- sum(fit4$residuals ^ 2)   # unrestricted model: all ten interactions
SSR_r <- sum(fit5$residuals ^ 2)   # restricted model: only two interactions kept

n <- 546   # number of observations
k <- 22    # parameters in the unrestricted model
g <- 10    # restrictions as tested here; note that strictly only 8 restrictions
           # separate fit5 from fit4, since fit5 retains two interactions

SSR_nom <- (SSR_r - SSR_u) / g
SSR_den <- SSR_u / (n - k)
SSR_ <- SSR_nom / SSR_den

# pf() returns the lower tail by default; the p-value is the upper tail
SSR_p <- pf(SSR_, g, n - k, lower.tail = FALSE)
SSR_p
## [1] 0.8167404

p-value for F ∼ F(10, 524) = 0.8167

Using R2
R2_u <- 0.6951
R2_r <- 0.6916

R2_nom <- (R2_u - R2_r) / g
# strictly the denominator degrees of freedom should be n - k, as in the SSR
# version; with n - g the statistic differs only marginally here
R2_den <- (1 - R2_u) / (n - g)
R2_ <- R2_nom / R2_den

R2_p <- pf(R2_, g, n - k, lower.tail = FALSE)
R2_p
## [1] 0.8013399

p-value for F ∼ F(10, 524) = 0.8013

Both the SSR and the R2 method produce an F-statistic of about 0.60-0.62, with an (upper-tail) p-value of about 0.80, far above 0.05.

We therefore fail to reject the null hypothesis: the interaction effects are NOT jointly significant at the 5% significance level.
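As a cross-check, the same nested-model F-test can be obtained directly from the two fitted models (a sketch, assuming fit4 and fit5 are still in memory); note that anova() will count only 8 restrictions, since the restricted model fit5 retains two of the ten interactions:

# direct F-test of the restricted model (fit5) against the unrestricted one (fit4)
anova(fit5, fit4)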

(f) Now perform model specification on the interaction variables using the general-to-specific approach. (Only eliminate the interaction effects.)

General-to-specific model

Variables elimination after regression, one at a time, produced the following results:

  • After regression round #1: LOG(lot)⋅reg interaction variable was chosen to be removed.

  • After regression round #2: LOG(lot)⋅bdms interaction variable was chosen to be removed.

  • After regression round #3: LOG(lot)⋅ffin interaction variable was chosen to be removed.

  • After regression round #4: LOG(lot)⋅ghw interaction variable was chosen to be removed.

  • After regression round #5: LOG(lot)⋅ca interaction variable was chosen to be removed.

  • After regression round #6: LOG(lot)⋅gar interaction variable was chosen to be removed.

  • After regression round #7: LOG(lot)⋅fb interaction variable was chosen to be removed.

  • After regression round #8: LOG(lot)⋅sty interaction variable was chosen to be removed.

  • After regression round #9: LOG(lot)⋅drv interaction variable was chosen to be removed.

  • After regression round #10: all remaining variables were found to be significant; variable removal stops here. In conclusion, the only interaction variable that survives is LOG(lot)⋅rec. (A minimal sketch of the elimination loop is given below, followed by the resulting model.)
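For reproducibility, here is a minimal sketch of the elimination loop just described (an illustration assuming fit4 from part (d) is in memory, not necessarily the exact code used): in each round, the interaction term with the largest p-value is dropped, until every remaining interaction is significant at the 5% level. Base variables are never dropped.

interactions <- paste0("lot_LOG:", c("bdms", "fb", "sty", "drv", "rec",
                                     "ffin", "ghw", "ca", "gar", "reg"))
current <- fit4
repeat {
    pvals <- summary(current)$coefficients[, "Pr(>|t|)"]
    pvals <- pvals[names(pvals) %in% interactions]
    if (length(pvals) == 0 || max(pvals) < 0.05) break
    worst <- names(pvals)[which.max(pvals)]           # least significant interaction
    current <- update(current, as.formula(paste(". ~ . -", worst)))
}
summary(current)   # per the rounds above, only lot_LOG:rec survives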

fit6 <- lm(sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg +
           lot_LOG * rec, data = df)
summary(fit6)  
## 
## Call:
## lm(formula = sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + 
##     ffin + ghw + ca + gar + reg + lot_LOG * rec, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68111 -0.12208  0.00593  0.12731  0.66275 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.59071    0.22656  33.505  < 2e-16 ***
## lot_LOG      0.32024    0.02770  11.562  < 2e-16 ***
## bdms         0.03842    0.01434   2.680   0.0076 ** 
## fb           0.16318    0.02029   8.043 5.71e-15 ***
## sty          0.09080    0.01258   7.220 1.80e-12 ***
## drv          0.11312    0.02815   4.018 6.72e-05 ***
## rec          1.44313    0.62646   2.304   0.0216 *  
## ffin         0.10450    0.02161   4.835 1.74e-06 ***
## ghw          0.18429    0.04380   4.208 3.03e-05 ***
## ca           0.16593    0.02126   7.804 3.19e-14 ***
## gar          0.04810    0.01144   4.206 3.05e-05 ***
## reg          0.13373    0.02260   5.917 5.89e-09 ***
## lot_LOG:rec -0.16112    0.07281  -2.213   0.0273 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2096 on 533 degrees of freedom
## Multiple R-squared:  0.6894, Adjusted R-squared:  0.6824 
## F-statistic: 98.59 on 12 and 533 DF,  p-value: < 2.2e-16

Model linearity testing

Ramsey’s RESET
# H0: the model is correctly specified as linear

modelF.RESET <- resettest(fit6, power = 2, type = "fitted", 
                          data = df)

modelF.RESET
## 
##  RESET test
## 
## data:  fit6
## RESET = 0.43102, df1 = 1, df2 = 532, p-value = 0.5118

With a statistic of ~0.431 and a p-value of ~0.5118, Ramsey’s RESET test suggests that the sixth model might be correctly specified (H0 of correct/linear specification NOT rejected at the 5% level of significance).

It is, however, the weakest RESET result among the log models: all previous models except the first scored a lower statistic and a higher p-value.

# H0: the model errors are normally distributed
modelF.JB <- JarqueBera.test(fit6$residuals)
modelF.JB
## 
##  Jarque Bera Test
## 
## data:  fit6$residuals
## X-squared = 10.348, df = 2, p-value = 0.005661
## 
## 
##  Skewness
## 
## data:  fit6$residuals
## statistic = 0.18972, p-value = 0.07032
## 
## 
##  Kurtosis
## 
## data:  fit6$residuals
## statistic = 3.5576, p-value = 0.007826

With a statistic of ~10.348 and a p-value of ~0.0057, the Jarque-Bera test suggests that the model residuals are still NOT normally distributed; therefore the model is still NOT correctly specified.

This Jarque-Bera test result is not the best scored so far: all previous models except the first showed better residual normality.

modelF.fitted <- fitted.values(fit6)

df %>% ggplot(aes(modelF.fitted, sell_LOG)) +
    geom_point(shape=16) +
    geom_smooth() + ggtitle("Actual vs Fitted Value of Model F")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Both Ramsey’s RESET and Jarque-Bera tests suggest that the sixth model is less linear, and its residuals less normal, than those of the models tested before (except the first model).

The Ramsey’s RESET test suggests that the sixth model might be correctly specified, while the Jarque-Bera test suggests that it is still NOT correctly specified.

(g) One may argue that some of the explanatory variables are endogenous and that there may be omitted variables. For example, the ‘condition’ of the house in terms of how it is maintained is not a variable (and difficult to measure) but will affect the house price. It will also affect, or be reflected in, some of the other variables, such as whether the house has air conditioning (which is mostly in newer houses). If the condition of the house is missing, will the effect of air conditioning on the (log of the) sale price be over- or underestimated? (For this question no computer calculations are required.)

Air conditioning is found mostly in newer houses, and we may assume that a lower age, and hence a better condition, affects the selling price positively.

Since the condition of the house is omitted from the model, part of its (positive) effect on the price is picked up by the correlated ca variable. The estimated effect of ca on LOG(sell) therefore absorbs both the true air-conditioning effect and part of the condition effect, and will be OVERestimated.
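Formally, this is the standard omitted-variable result. Writing the true model as LOG(sell) = β_ca·ca + γ·cond + (other terms) with γ > 0, omitting cond yields

$$ E[\hat{\beta}_{ca}] = \beta_{ca} + \gamma\,\delta $$

where δ is the slope of the auxiliary regression of cond on ca (given the other regressors). Since newer, better-conditioned houses are more likely to have air conditioning, δ > 0, so the bias γδ is positive and the coefficient of ca is overestimated.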

(h) Finally we analyze the predictive ability of the model. Consider again the model where the log of the sale price of the house is the dependent variable and the explanatory variables are the log transformation of lot size, with all other explanatory variables in their original form (and no interaction effects). Estimate the parameters of the model using the first 400 observations. Make predictions on the log of the price and calculate the MAE for the other 146 observations. How good is the predictive power of the model (relative to the variability in the log of the price)?

Separating the data sample in two groups

df1 <- df[which(df$obs <= 400), ]
n1 <- nrow(df1)

print(paste("Data group#1 has", n1, "entries."))
## [1] "Data group#1 has 400 entries."
summary(df1)
##       obs             sell             lot             bdms     
##  Min.   :  1.0   Min.   : 25000   Min.   : 1650   Min.   :1.00  
##  1st Qu.:100.8   1st Qu.: 46150   1st Qu.: 3495   1st Qu.:2.00  
##  Median :200.5   Median : 59250   Median : 4180   Median :3.00  
##  Mean   :200.5   Mean   : 64977   Mean   : 4905   Mean   :2.95  
##  3rd Qu.:300.2   3rd Qu.: 78000   3rd Qu.: 6000   3rd Qu.:3.00  
##  Max.   :400.0   Max.   :190000   Max.   :16200   Max.   :6.00  
##        fb             sty             drv              rec        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :1.000   Median :2.000   Median :1.0000   Median :0.0000  
##  Mean   :1.278   Mean   :1.718   Mean   :0.8125   Mean   :0.1625  
##  3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :4.000   Max.   :4.000   Max.   :1.0000   Max.   :1.0000  
##       ffin             ghw             ca             gar        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00   Median :0.000   Median :0.0000  
##  Mean   :0.3475   Mean   :0.05   Mean   :0.285   Mean   :0.6925  
##  3rd Qu.:1.0000   3rd Qu.:0.00   3rd Qu.:1.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00   Max.   :1.000   Max.   :3.0000  
##       reg           sell_LOG        lot_LOG     
##  Min.   :0.000   Min.   :10.13   Min.   :7.409  
##  1st Qu.:0.000   1st Qu.:10.74   1st Qu.:8.159  
##  Median :0.000   Median :10.99   Median :8.338  
##  Mean   :0.105   Mean   :11.01   Mean   :8.420  
##  3rd Qu.:0.000   3rd Qu.:11.26   3rd Qu.:8.700  
##  Max.   :1.000   Max.   :12.15   Max.   :9.693

Now for data 2:

df2 <- df[which(df$obs > 400), ]
n2 <- nrow(df2)
print(paste("Data group#2 has", n2, "entries."))
## [1] "Data group#2 has 146 entries."
summary(df2)
##       obs             sell             lot             bdms      
##  Min.   :401.0   Min.   : 31900   Min.   : 1950   Min.   :2.000  
##  1st Qu.:437.2   1st Qu.: 60000   1st Qu.: 4678   1st Qu.:3.000  
##  Median :473.5   Median : 72750   Median : 6000   Median :3.000  
##  Mean   :473.5   Mean   : 76737   Mean   : 5821   Mean   :3.007  
##  3rd Qu.:509.8   3rd Qu.: 91125   3rd Qu.: 6652   3rd Qu.:3.000  
##  Max.   :546.0   Max.   :174500   Max.   :12944   Max.   :5.000  
##        fb             sty             drv              rec        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :1.000   Median :2.000   Median :1.0000   Median :0.0000  
##  Mean   :1.308   Mean   :2.055   Mean   :0.9863   Mean   :0.2192  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :2.000   Max.   :4.000   Max.   :1.0000   Max.   :1.0000  
##       ffin             ghw                ca              gar        
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.3562   Mean   :0.03425   Mean   :0.4041   Mean   :0.6918  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :3.0000  
##       reg           sell_LOG        lot_LOG     
##  Min.   :0.000   Min.   :10.37   Min.   :7.576  
##  1st Qu.:0.000   1st Qu.:11.00   1st Qu.:8.451  
##  Median :1.000   Median :11.19   Median :8.700  
##  Mean   :0.589   Mean   :11.21   Mean   :8.595  
##  3rd Qu.:1.000   3rd Qu.:11.42   3rd Qu.:8.803  
##  Max.   :1.000   Max.   :12.07   Max.   :9.468

Seventh model estimation

fit7 <- lm(sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg, 
             data = df1)
summary(fit7)
## 
## Call:
## lm(formula = sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + 
##     ffin + ghw + ca + gar + reg, data = df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66582 -0.13906  0.00796  0.14694  0.67596 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.67309    0.29240  26.241  < 2e-16 ***
## lot_LOG      0.31378    0.03615   8.680  < 2e-16 ***
## bdms         0.03787    0.01744   2.172 0.030469 *  
## fb           0.15238    0.02469   6.170 1.71e-09 ***
## sty          0.08824    0.01819   4.850 1.79e-06 ***
## drv          0.08641    0.03141   2.751 0.006216 ** 
## rec          0.05465    0.03392   1.611 0.107975    
## ffin         0.11471    0.02673   4.291 2.25e-05 ***
## ghw          0.19870    0.05301   3.748 0.000205 ***
## ca           0.17763    0.02724   6.521 2.17e-10 ***
## gar          0.05301    0.01480   3.583 0.000383 ***
## reg          0.15116    0.04215   3.586 0.000378 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2238 on 388 degrees of freedom
## Multiple R-squared:  0.6705, Adjusted R-squared:  0.6611 
## F-statistic: 71.77 on 11 and 388 DF,  p-value: < 2.2e-16

Model characteristics: R2 = 0.6705 , F-statistic: 71.77 on 11 and 388 DF

Model linearity testing

Ramsey’s RESET
# H0: the model is correctly specified as linear

modelG.RESET <- resettest(fit7, power = 2, type = "fitted", 
                          data = df1)

modelG.RESET
## 
##  RESET test
## 
## data:  fit7
## RESET = 0.03955, df1 = 1, df2 = 387, p-value = 0.8425

With a statistic of ~0.04 and a p-value of ~0.8425, Ramsey’s RESET test suggests that the seventh model might be correctly specified (H0 of correct/linear specification NOT rejected at the 5% level of significance).

It is the second best RESET result so far; only the fourth model scored a lower statistic and a higher p-value.

Jarque-Bera
# H0: the model errors are normally distributed
modelG.JB <- JarqueBera.test(fit7$residuals)
modelG.JB
## 
##  Jarque Bera Test
## 
## data:  fit7$residuals
## X-squared = 0.69757, df = 2, p-value = 0.7055
## 
## 
##  Skewness
## 
## data:  fit7$residuals
## statistic = 0.08868, p-value = 0.469
## 
## 
##  Kurtosis
## 
## data:  fit7$residuals
## statistic = 3.102, p-value = 0.6772

With a statistic of ~0.698 and a p-value of ~0.7055, the Jarque-Bera test suggests that the model residuals are normally distributed; therefore the model is considered correctly specified.

This Jarque-Bera test result is the best scored so far and indicates satisfactory residual normality.

Real to fitted-values diagram
modelG.fitted <- fitted.values(fit7)

df1 %>% ggplot(aes(modelG.fitted, sell_LOG)) +
    geom_point(shape=16) +
    geom_smooth() + ggtitle("Actual vs Fitted Value of Model G")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Both Ramsey’s RESET and Jarque-Bera tests suggest that the seventh model is sufficiently linear, has approximately normal residuals, and might therefore be correctly specified.

This is also intuitively confirmed by the seventh model’s actual-to-fitted-values diagram shown above, which looks close to a linear relationship.

Model predictive ability

The seventh model was estimated on the first data group. Before turning to predictions, we first inspect its in-sample fit, plotting the fitted LOG(sell) values against the actual ones:

fitted_value <- fit7$fitted.values
df1$Fitted <- fitted_value
df1 %>% ggplot(aes(obs, sell_LOG)) + geom_point() + geom_line(color = "red") +
  geom_line(aes(y = Fitted), color = "blue") + ggtitle("Fitted (blue) vs Actual (red), Model G")

Residuals

library(AER)
## Warning: package 'AER' was built under R version 4.0.2
## Loading required package: car
## Warning: package 'car' was built under R version 4.0.2
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## Loading required package: sandwich
## Loading required package: survival
residualPlot(fit7)

df1%>% ggplot(aes(obs, residuals.lm(fit7))) + geom_point()+ geom_line(color="blue") + ggtitle("Residuals")

LOG(sell) mean and standard deviation

sell_LOG_mean <- mean(df$sell_LOG)
sell_LOG_sd   <- sd(df$sell_LOG)

sell_LOG_mean
## [1] 11.05896
sell_LOG_sd
## [1] 0.3719849

MAE (Mean Absolute Error) index calculation

digits <- 3

# out-of-sample predictions for the last 146 observations (the hold-out group);
# note the MAE must be computed from these predictions, not from the in-sample
# residuals of the estimation group
pred <- predict(fit7, newdata = df2)
n <- nrow(df2)
MAE <- mean(abs(df2$sell_LOG - pred))   # mean absolute prediction error on df2
round(MAE, digits)
## [1] 0.128
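As a quick relative check (using the MAE just computed and the sample standard deviation of LOG(sell) from above):

round(MAE / sell_LOG_sd, digits)   # 0.128 / 0.372, roughly one third of the sd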

The calculated MAE of 0.128 is well below half the standard deviation of LOG(sell) (0.372), which leads to the conclusion that the model has significant predictive ability relative to the variability in the log of the price.