• See website for how to submit your answers and how feedback is organized. • This exercise uses the datafile CaseProject-HousePrices and requires a computer. • The dataset CaseProject-HousePrices is available on the website. • Perform all tests at a 5% significance level.
• Experience the processes of variable transformation and model selection. • Apply tests to evaluate models, including effects of endogeneity. • Study the predictive ability of a model.
This project is of an applied nature and uses data that are available in the data file Capstone-HousePrices. The source of these data is Anglin and Gencay, “Semiparametric Estimation of a Hedonic Price Function”(Journal of Applied Econometrics 11, 1996, pages 633-648). We consider the modeling and prediction of house prices. Data are available for 546 observations of the following variables:
• sell: Sale price of the house • lot: Lot size of the property in square feet • bdms: Number of bedrooms • fb: Number of full bathrooms • sty: Number of stories excluding basement • drv: Dummy that is 1 if the house has a driveway and 0 otherwise • rec: Dummy that is 1 if the house has a recreational room and 0 otherwise • ffin: Dummy that is 1 if the house has a full finished basement and 0 otherwise • ghw: Dummy that is 1 if the house uses gas for hot water heating and 0 otherwise • ca: Dummy that is 1 if there is central air conditioning and 0 otherwise • gar: Number of covered garage places • reg: Dummy that is 1 if the house is located in a preferred neighborhood of the city and 0 otherwise • obs: Observation number, needed in part (h)
First we need to load the data:
library(readxl)
library(ggplot2)
library(dplyr)
library(aTSA)
library(knitr)
library(dynlm)
## Warning: package 'dynlm' was built under R version 4.0.2
df <- read_excel("CaseHouse.xls", col_names = TRUE)
head(df)
## # A tibble: 6 x 13
## obs sell lot bdms fb sty drv rec ffin ghw ca gar reg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 42000 5850 3 1 2 1 0 1 0 0 1 0
## 2 2 38500 4000 2 1 1 1 0 0 0 0 0 0
## 3 3 49500 3060 3 1 1 1 0 0 0 0 0 0
## 4 4 60500 6650 3 1 2 1 1 0 0 0 0 0
## 5 5 61000 6360 2 1 1 1 0 0 0 0 0 0
## 6 6 66000 4160 3 1 1 1 1 1 0 1 0 0
tail(df)
## # A tibble: 6 x 13
## obs sell lot bdms fb sty drv rec ffin ghw ca gar reg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 541 85000 6525 3 2 4 1 0 0 0 0 1 0
## 2 542 91500 4800 3 2 4 1 1 0 0 1 0 0
## 3 543 94000 6000 3 2 4 1 0 0 0 1 0 0
## 4 544 103000 6000 3 2 4 1 1 0 0 1 1 0
## 5 545 105000 6000 3 2 2 1 1 0 0 1 1 0
## 6 546 105000 6000 3 1 2 1 0 0 0 1 1 0
str(df)
## tibble [546 x 13] (S3: tbl_df/tbl/data.frame)
## $ obs : num [1:546] 1 2 3 4 5 6 7 8 9 10 ...
## $ sell: num [1:546] 42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
## $ lot : num [1:546] 5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
## $ bdms: num [1:546] 3 2 3 3 2 3 3 3 3 3 ...
## $ fb : num [1:546] 1 1 1 1 1 1 2 1 1 2 ...
## $ sty : num [1:546] 2 1 1 2 1 1 2 3 1 4 ...
## $ drv : num [1:546] 1 1 1 1 1 1 1 1 1 1 ...
## $ rec : num [1:546] 0 0 0 1 0 1 0 0 1 1 ...
## $ ffin: num [1:546] 1 0 0 0 0 1 1 0 1 0 ...
## $ ghw : num [1:546] 0 0 0 0 0 0 0 0 0 0 ...
## $ ca : num [1:546] 0 0 0 0 0 1 0 0 0 1 ...
## $ gar : num [1:546] 1 0 0 0 0 0 2 0 0 1 ...
## $ reg : num [1:546] 0 0 0 0 0 0 0 0 0 0 ...
df %>% ggplot(aes(sell, lot)) + geom_point()
df %>% ggplot(aes(sell, lot, color= sty)) + geom_point()
df %>% ggplot(aes(sell, lot, color= fb)) + geom_point()
df %>% ggplot(aes(sell, lot, color= bdms)) + geom_point()
df %>% ggplot(aes(sell, lot, color= rec)) + geom_point()
df %>% ggplot(aes(sell, lot, color= gar)) + geom_point() + facet_grid(.~ fb)
Then we need to fit the model:
fit <- lm(sell~.-obs, data = df)
summary(fit)
##
## Call:
## lm(formula = sell ~ . - obs, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41389 -9307 -591 7353 74875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4038.3504 3409.4713 -1.184 0.236762
## lot 3.5463 0.3503 10.124 < 2e-16 ***
## bdms 1832.0035 1047.0002 1.750 0.080733 .
## fb 14335.5585 1489.9209 9.622 < 2e-16 ***
## sty 6556.9457 925.2899 7.086 4.37e-12 ***
## drv 6687.7789 2045.2458 3.270 0.001145 **
## rec 4511.2838 1899.9577 2.374 0.017929 *
## ffin 5452.3855 1588.0239 3.433 0.000642 ***
## ghw 12831.4063 3217.5971 3.988 7.60e-05 ***
## ca 12632.8904 1555.0211 8.124 3.15e-15 ***
## gar 4244.8290 840.5442 5.050 6.07e-07 ***
## reg 9369.5132 1669.0907 5.614 3.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15420 on 534 degrees of freedom
## Multiple R-squared: 0.6731, Adjusted R-squared: 0.6664
## F-statistic: 99.97 on 11 and 534 DF, p-value: < 2.2e-16
Model characteristics: R2 = 0.6731 , F-statistic: 99.97 on 11 and 534 DF
library(lmtest)
# Ho : The model is correct linearly specified
modelA.RESET <- resettest(fit, power = 2, type = "fitted",
data = df)
modelA.RESET
##
## RESET test
##
## data: fit
## RESET = 26.986, df1 = 1, df2 = 533, p-value = 2.922e-07
With a statistic of 26.986 and a p-value of ~0.000, the Ramsey’s RESET test suggests that the linear model is NOT correctly specified. So we reject Ho.
library(tsoutliers)
## Warning: package 'tsoutliers' was built under R version 4.0.2
# Ho: The errors of the model are distributed normal
modelA.JB <- JarqueBera.test(fit$residuals)
modelA.JB
##
## Jarque Bera Test
##
## data: fit$residuals
## X-squared = 247.62, df = 2, p-value < 2.2e-16
##
##
## Skewness
##
## data: fit$residuals
## statistic = 0.85278, p-value = 4.119e-16
##
##
## Kurtosis
##
## data: fit$residuals
## statistic = 5.8241, p-value < 2.2e-16
With a statistic of ~247.62 and a p-value of ~0, the Jarque-Bera test suggests that the linear model residuals are NOT normally distributed, therefore the linear model is NOT correctly specified. Also is not distributed as normal distribution, and its distribution is leptokurtic.
Both Ramsey’s RESET and Jarque-Bera tests suggest that the considered linear model is NOT correctly specified.
modelA.fitted <- fitted.values(fit)
df %>% ggplot(aes(modelA.fitted, sell)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model A")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df$sell_LOG <- log(df$sell)
df$lot_LOG <- log(df$lot)
fit2 <- lm(sell_LOG ~ lot + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg, data = df)
summary(fit2)
##
## Call:
## lm(formula = sell_LOG ~ lot + bdms + fb + sty + drv + rec + ffin +
## ghw + ca + gar + reg, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67865 -0.12211 0.01666 0.12868 0.67737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.003e+01 4.724e-02 212.210 < 2e-16 ***
## lot 5.057e-05 4.854e-06 10.418 < 2e-16 ***
## bdms 3.402e-02 1.451e-02 2.345 0.01939 *
## fb 1.678e-01 2.065e-02 8.126 3.10e-15 ***
## sty 9.227e-02 1.282e-02 7.197 2.10e-12 ***
## drv 1.307e-01 2.834e-02 4.610 5.04e-06 ***
## rec 7.352e-02 2.633e-02 2.792 0.00542 **
## ffin 9.940e-02 2.200e-02 4.517 7.72e-06 ***
## ghw 1.784e-01 4.458e-02 4.000 7.22e-05 ***
## ca 1.780e-01 2.155e-02 8.262 1.14e-15 ***
## gar 5.076e-02 1.165e-02 4.358 1.58e-05 ***
## reg 1.271e-01 2.313e-02 5.496 6.02e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2137 on 534 degrees of freedom
## Multiple R-squared: 0.6766, Adjusted R-squared: 0.6699
## F-statistic: 101.6 on 11 and 534 DF, p-value: < 2.2e-16
Model characteristics: R2 = 0.6766 , F-statistic: 101.56 on 11 and 534 DF
# Ho : The model is correct linearly specified
modelB.RESET <- resettest(fit2, power = 2, type = "fitted",
data = df)
modelB.RESET
##
## RESET test
##
## data: fit2
## RESET = 0.27031, df1 = 1, df2 = 533, p-value = 0.6033
With a statistic of ~0.27 and a p-value of ~0.6033, the Ramsey’s RESET test suggests that the second linear model might be correctly specified. We accept Ho, at the 5% level of significance).
# Ho: The errors of the model are distributed normal
modelA.JB <- JarqueBera.test(fit2$residuals)
modelA.JB
##
## Jarque Bera Test
##
## data: fit2$residuals
## X-squared = 8.4432, df = 2, p-value = 0.01467
##
##
## Skewness
##
## data: fit2$residuals
## statistic = 0.19898, p-value = 0.05768
##
##
## Kurtosis
##
## data: fit2$residuals
## statistic = 3.4613, p-value = 0.0278
Also with this model, With a statistic of ~8.443 and a p-value of ~0.0147, the Jarque-Bera test suggests that the linear model residuals are still NOT normally distributed.
Therefore the linear model is still NOT correctly specified, althought that the second model’s JB statistic is significantly decreased (and therefore the model significantly improved).
Both Ramsey’s RESET and Jarque-Bera tests suggest that the second model is significantly improved than the model considered first, but has a problem:
The Ramsey’s RESET test suggests that the second linear model might be correctly specified.
while the Jarque-Bera test suggests that it is still NOT correctly specified (although significantly improved).
modelB.fitted <- fitted.values(fit2)
df %>% ggplot(aes(modelB.fitted, sell_LOG)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model B")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
fit3 <- lm(sell_LOG ~ lot + lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg,
data = df)
summary(fit3)
##
## Call:
## lm(formula = sell_LOG ~ lot + lot_LOG + bdms + fb + sty + drv +
## rec + ffin + ghw + ca + gar + reg, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68573 -0.12380 0.00785 0.12521 0.68112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.150e+00 6.830e-01 10.469 < 2e-16 ***
## lot -1.490e-05 1.624e-05 -0.918 0.359086
## lot_LOG 3.827e-01 9.070e-02 4.219 2.88e-05 ***
## bdms 3.489e-02 1.429e-02 2.442 0.014915 *
## fb 1.659e-01 2.033e-02 8.161 2.40e-15 ***
## sty 9.121e-02 1.263e-02 7.224 1.76e-12 ***
## drv 1.068e-01 2.847e-02 3.752 0.000195 ***
## rec 5.467e-02 2.630e-02 2.078 0.038156 *
## ffin 1.052e-01 2.171e-02 4.848 1.64e-06 ***
## ghw 1.791e-01 4.390e-02 4.079 5.20e-05 ***
## ca 1.643e-01 2.146e-02 7.657 9.01e-14 ***
## gar 4.826e-02 1.148e-02 4.203 3.09e-05 ***
## reg 1.344e-01 2.284e-02 5.884 7.10e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2104 on 533 degrees of freedom
## Multiple R-squared: 0.687, Adjusted R-squared: 0.68
## F-statistic: 97.51 on 12 and 533 DF, p-value: < 2.2e-16
Model characteristics: R2 = 0.687 , F-statistic: 97.51 on 12 and 533 DF
# Ho : The model is correct linearly specified
modelC.RESET <- resettest(fit3, power = 2, type = "fitted",
data = df)
modelC.RESET
##
## RESET test
##
## data: fit3
## RESET = 0.06769, df1 = 1, df2 = 532, p-value = 0.7948
With a statistic of ~0.068 and a p-value of ~0.7948, the Ramsey’s RESET test suggests that the third linear model might be correctly specified We accept Ho, at the 5% level of significance).
It also suggests that this is the best model constructed so far, as it has the lowest statistic and the highest p-value scored by all Ramsey’s RESET tests ran so far.
# Ho: The errors of the model are distributed normal
modelC.JB <- JarqueBera.test(fit3$residuals)
modelC.JB
##
## Jarque Bera Test
##
## data: fit3$residuals
## X-squared = 9.3643, df = 2, p-value = 0.009259
##
##
## Skewness
##
## data: fit3$residuals
## statistic = 0.18025, p-value = 0.08553
##
##
## Kurtosis
##
## data: fit3$residuals
## statistic = 3.5307, p-value = 0.01136
With a statistic of ~9.364 and a p-value of ~0.0093, the Jarque-Bera test suggests that the linear model residuals are still NOT normally distributed; therefore the linear model is still NOT correctly specified.
No further model improvement is indicated by the Jarque-Bera residuals normality test; in fact the second model’s residuals were slightly more normal than the third’s.
Both Ramsey’s RESET and Jarque-Bera tests suggest that the third model is significantly improved than the model considered first, while the Ramsey’s RESET test suggests that it is even more improved than the model considered second.But the problems are:
The Ramsey’s RESET test suggests that the third linear model might be correctly specified.
while the Jarque-Bera test suggests that it is still NOT correctly specified.
modelC.fitted <- fitted.values(fit3)
df %>% ggplot(aes(modelC.fitted, sell_LOG)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model c")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It is concluded that it would be better to include the lot size logarithm in the model, rather than the lot size variable itself, due to the following reasons:
The three models testing performed so far, see Table 4 (above): “Models linearity test results’ comparison chart”. The Ramsey’s RESET tests showed that the lot size logarithm variable significantly improves the model linearity, while the Jarque-Bera tests showed that it produces a satisfactory (so far) level of residuals normality.
The (much better) lot size logarithm variable coefficient p-value (0), compared to the lot size variable itself coefficient p-value (0.359), when used together. See Table 5 (above): “Third model lot related variables’ coefficients’ comparison chart”.
The fact that lot variable ended with a ~zero (0) coefficient anyway at the (improved) second and third models.
fit4 <- lm(sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg +
lot_LOG * bdms + lot_LOG * fb + lot_LOG * sty + lot_LOG * drv + lot_LOG * rec +
lot_LOG * ffin + lot_LOG * ghw + lot_LOG * ca + lot_LOG * gar + lot_LOG * reg,
data = df)
summary(fit4)
##
## Call:
## lm(formula = sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec +
## ffin + ghw + ca + gar + reg + lot_LOG * bdms + lot_LOG *
## fb + lot_LOG * sty + lot_LOG * drv + lot_LOG * rec + lot_LOG *
## ffin + lot_LOG * ghw + lot_LOG * ca + lot_LOG * gar + lot_LOG *
## reg, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68306 -0.11612 0.00591 0.12486 0.65998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.966499 1.070667 8.375 5.09e-16 ***
## lot_LOG 0.152685 0.128294 1.190 0.2345
## bdms 0.019075 0.326700 0.058 0.9535
## fb -0.368234 0.429048 -0.858 0.3911
## sty 0.488885 0.309700 1.579 0.1150
## drv -1.463371 0.717225 -2.040 0.0418 *
## rec 1.673992 0.655919 2.552 0.0110 *
## ffin -0.031844 0.445543 -0.071 0.9430
## ghw -0.505889 0.902733 -0.560 0.5754
## ca -0.340276 0.496041 -0.686 0.4930
## gar 0.401941 0.258646 1.554 0.1208
## reg 0.118484 0.479856 0.247 0.8051
## lot_LOG:bdms 0.002070 0.038654 0.054 0.9573
## lot_LOG:fb 0.062037 0.050145 1.237 0.2166
## lot_LOG:sty -0.046361 0.035942 -1.290 0.1977
## lot_LOG:drv 0.191542 0.087361 2.193 0.0288 *
## lot_LOG:rec -0.188462 0.076373 -2.468 0.0139 *
## lot_LOG:ffin 0.015913 0.052851 0.301 0.7635
## lot_LOG:ghw 0.081135 0.106929 0.759 0.4483
## lot_LOG:ca 0.059549 0.058024 1.026 0.3052
## lot_LOG:gar -0.041359 0.030142 -1.372 0.1706
## lot_LOG:reg 0.001515 0.055990 0.027 0.9784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2095 on 524 degrees of freedom
## Multiple R-squared: 0.6951, Adjusted R-squared: 0.6829
## F-statistic: 56.89 on 21 and 524 DF, p-value: < 2.2e-16
Model characteristics: R2 = 0.6951 , F-statistic: 56.89 on 21 and 524 DF, without “Lot” variable. Notice also the ten (10) interaction variables introduction, between the log lot size and each one of all other variables.
# Ho : The model is correct linearly specified
modelD.RESET <- resettest(fit4, power = 2, type = "fitted",
data = df)
modelD.RESET
##
## RESET test
##
## data: fit4
## RESET = 0.011571, df1 = 1, df2 = 523, p-value = 0.9144
With a statistic of ~0.012 and a p-value of ~0.9144, the Ramsey’s RESET test suggests that the fourth model might be correctly specified, We accept Ho at the 5% level of significance.
It also suggests that this is the best model constructed so far, as it has an even lower statistic and the highest p-value scored by all Ramsey’s RESET tests ran so far.
# Ho: The errors of the model are distributed normal
modelD.JB <- JarqueBera.test(fit4$residuals)
modelD.JB
##
## Jarque Bera Test
##
## data: fit4$residuals
## X-squared = 8.2029, df = 2, p-value = 0.01655
##
##
## Skewness
##
## data: fit4$residuals
## statistic = 0.17281, p-value = 0.09924
##
##
## Kurtosis
##
## data: fit4$residuals
## statistic = 3.491, p-value = 0.01918
With a statistic of ~8.203 and a p-value of ~0.0165, the Jarque-Bera test suggests that the model residuals are still NOT normally distributed; therefore the model is still NOT correctly specified.
This Jarque-Bera test result, however, is the best scored so far. It seems that the interaction variables introduction slightly improves the (previous best) second model residuals normality.
Both Ramsey’s RESET and Jarque-Bera tests suggest that the fourth model is significantly improved than the models previously considered.But continues with the problems:
The Ramsey’s RESET test suggests that the fourth model might be correctly specified
While the Jarque-Bera test suggests that it is still NOT correctly specified.
modelD.fitted <- fitted.values(fit4)
df %>% ggplot(aes(modelD.fitted, sell_LOG)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model D")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Using the 5% significance level, only two (2) of the ten (10) interaction variables used are individually significant:
LOG(lot)⋅drv
LOG(lot)⋅rec
The corresponding restricted model (fifth model) will use only the two (2) significant interaction variables, as identified at the previous section (d):
LOG(lot)⋅drv
LOG(lot)⋅rec
It will also use the rest of the fourth model variables (as is)
fit5 <- lm(sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg +
lot_LOG * drv + lot_LOG * rec,
data = df)
summary(fit5)
##
## Call:
## lm(formula = sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec +
## ffin + ghw + ca + gar + reg + lot_LOG * drv + lot_LOG * rec,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67934 -0.12225 0.00849 0.12259 0.65051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.74189 0.62863 13.906 < 2e-16 ***
## lot_LOG 0.17906 0.07707 2.323 0.02053 *
## bdms 0.03881 0.01430 2.714 0.00686 **
## fb 0.16145 0.02025 7.971 9.62e-15 ***
## sty 0.09083 0.01254 7.242 1.56e-12 ***
## drv -1.18996 0.66462 -1.790 0.07395 .
## rec 1.50253 0.62553 2.402 0.01665 *
## ffin 0.10276 0.02157 4.763 2.46e-06 ***
## ghw 0.18448 0.04368 4.223 2.83e-05 ***
## ca 0.16526 0.02121 7.792 3.48e-14 ***
## gar 0.04690 0.01142 4.107 4.65e-05 ***
## reg 0.13260 0.02255 5.880 7.24e-09 ***
## lot_LOG:drv 0.15943 0.08124 1.962 0.05024 .
## lot_LOG:rec -0.16826 0.07270 -2.314 0.02103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2091 on 532 degrees of freedom
## Multiple R-squared: 0.6916, Adjusted R-squared: 0.6841
## F-statistic: 91.79 on 13 and 532 DF, p-value: < 2.2e-16
Model characteristics: R2 = 0.6916 , F-statistic: 91.79 on 13 and 532 DF
# Ho : The model is correct linearly specified
modelE.RESET <- resettest(fit5, power = 2, type = "fitted",
data = df)
modelE.RESET
##
## RESET test
##
## data: fit5
## RESET = 0.045677, df1 = 1, df2 = 531, p-value = 0.8308
With a statistic of ~0.046 and a p-value of ~0.8308, the Ramsey’s RESET test suggests that the fifth model might be correctly specified (H0 of correct/linear specification NOT rejected, at the 5% level of significance).
It also suggests that this is not the best model constructed so far, as the fourth model had scored an even lower statistic and a higher p-value than this. ##### Jarque-Bera
# Ho: The errors of the model are distributed normal
modelE.JB <- JarqueBera.test(fit5$residuals)
modelE.JB
##
## Jarque Bera Test
##
## data: fit5$residuals
## X-squared = 9.2372, df = 2, p-value = 0.009866
##
##
## Skewness
##
## data: fit5$residuals
## statistic = 0.18929, p-value = 0.07096
##
##
## Kurtosis
##
## data: fit5$residuals
## statistic = 3.5125, p-value = 0.0145
With a statistic of ~9.237 and a p-value of ~0.0099, the Jarque-Bera test suggests that the model residuals are still NOT normally distributed; therefore the model is still NOT correctly specified.
This Jarque-Bera test result is not the best scored so far either, as the fourth model related test had indicated an even better residuals normality.
Both Ramsey’s RESET and Jarque-Bera tests suggest that the fourth model was better than this fifth one (restricted model); which, however, seems to be the second best (according to Ramsey’s RESET testing) or the third best (according to Jarque-Bera testing) model constructed so far, but with the same problems before showed.
modelE.fitted <- fitted.values(fit5)
df %>% ggplot(aes(modelE.fitted, sell_LOG)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model E")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
##### Using SSR
An interaction effects joint significance F-test can be performed either:
using the sum of square residuals (SSR) of the restricted and the unrestricted models, or
using the R2 of both the restricted and the unrestricted models.
SSR_u <- sum(fit4$residuals ^ 2)
SSR_r <- sum(fit5$residuals ^ 2)
n<- 546
k <- 22
g <- 10
SSR_nom <- (SSR_r - SSR_u) / g
SSR_den <- SSR_u / (n - k)
SSR_ <- SSR_nom / SSR_den
SSR_p <- pf(SSR_, g, n - k)
SSR_p
## [1] 0.1832596
p.valueforF∼F(10,524)=0.1833
R2_u <- 0.6951
R2_r <- 0.6916
R2_nom <- (R2_u - R2_r) / g
R2_den <- (1 - R2_u) / (n - g)
R2_ <- R2_nom / R2_den
R2_p <- pf(R2_, g, n - k)
R2_p
## [1] 0.1986601
p.valueforF∼F(10,524)=0.1948
Both SSR and R2 methods produced an F-test statistic of 0.6−0.61, with a p-value of 0.18−0.19.
The above results conclude that interactions are jointly significant at the 5% significance level.
Variables elimination after regression, one at a time, produced the following results:
After regression round #1: LOG(lot)⋅reg interaction variable was chosen to be removed.
After regression round #2: LOG(lot)⋅bdms interaction variable was chosen to be removed.
After regression round #3: LOG(lot)⋅ffin interaction variable was chosen to be removed.
After regression round #4: LOG(lot)⋅ghw interaction variable was chosen to be removed.
After regression round #5: LOG(lot)⋅ca interaction variable was chosen to be removed.
After regression round #6: LOG(lot)⋅gar interaction variable was chosen to be removed.
After regression round #7: LOG(lot)⋅fb interaction variable was chosen to be removed.
After regression round #8: LOG(lot)⋅sty interaction variable was chosen to be removed.
After regression round #9: LOG(lot)⋅drv interaction variable was chosen to be removed.
After regression round #10: all remaining variables were found to be significant; variables removal stops here. Conclusively, the only interaction variable found to be significant is LOG(lot)⋅rec.
fit6 <- lm(sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg +
lot_LOG * rec, data = df)
summary(fit6)
##
## Call:
## lm(formula = sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec +
## ffin + ghw + ca + gar + reg + lot_LOG * rec, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68111 -0.12208 0.00593 0.12731 0.66275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.59071 0.22656 33.505 < 2e-16 ***
## lot_LOG 0.32024 0.02770 11.562 < 2e-16 ***
## bdms 0.03842 0.01434 2.680 0.0076 **
## fb 0.16318 0.02029 8.043 5.71e-15 ***
## sty 0.09080 0.01258 7.220 1.80e-12 ***
## drv 0.11312 0.02815 4.018 6.72e-05 ***
## rec 1.44313 0.62646 2.304 0.0216 *
## ffin 0.10450 0.02161 4.835 1.74e-06 ***
## ghw 0.18429 0.04380 4.208 3.03e-05 ***
## ca 0.16593 0.02126 7.804 3.19e-14 ***
## gar 0.04810 0.01144 4.206 3.05e-05 ***
## reg 0.13373 0.02260 5.917 5.89e-09 ***
## lot_LOG:rec -0.16112 0.07281 -2.213 0.0273 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2096 on 533 degrees of freedom
## Multiple R-squared: 0.6894, Adjusted R-squared: 0.6824
## F-statistic: 98.59 on 12 and 533 DF, p-value: < 2.2e-16
# Ho : The model is correct linearly specified
modelF.RESET <- resettest(fit6, power = 2, type = "fitted",
data = df)
modelF.RESET
##
## RESET test
##
## data: fit6
## RESET = 0.43102, df1 = 1, df2 = 532, p-value = 0.5118
With a statistic of ~0.431 and a p-value of ~0.5118, the Ramsey’s RESET test suggests that the sixth model might be correctly specified (H0 of correct/linear specification NOT rejected, at the 5% level of significance).
It also suggests that this is far from a linear model, as all the previous models (except from the first) had scored a lower statistic and a higher p-value than this.
# Ho: The errors of the model are distributed normal
modelF.JB <- JarqueBera.test(fit6$residuals)
modelF.JB
##
## Jarque Bera Test
##
## data: fit6$residuals
## X-squared = 10.348, df = 2, p-value = 0.005661
##
##
## Skewness
##
## data: fit6$residuals
## statistic = 0.18972, p-value = 0.07032
##
##
## Kurtosis
##
## data: fit6$residuals
## statistic = 3.5576, p-value = 0.007826
With a statistic of ~10.348 and a p-value of ~0.0057, the Jarque-Bera test suggests that the model residuals are still NOT normally distributed; therefore the model is still NOT correctly specified.
This Jarque-Bera test result is not the best scored so far. All the previous models (except from the first) related test had indicated an even better residuals normality.
modelF.fitted <- fitted.values(fit6)
df %>% ggplot(aes(modelF.fitted, sell_LOG)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model F")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Both Ramsey’s RESET and Jarque-Bera tests suggest that the sixth model is less linear and with less residuals normality than the models tested before (except from the first model).
The Ramsey’s RESET test suggests that the sixth model might be correctly specified, while the Jarque-Bera test suggests that it is still NOT correctly specified.
We know that only newer houses are supplied with air conditioning and we assume that lower age (and therefore better condition) of houses affect the house selling price positively.
The effect of the air conditioning ca variable on the logarithm of the sale price LOG(sell) variable will be overestimated, because it is usually affected by the age (and therefore the condition) of houses both of which (logically) affect the house selling price positively.
Since we do not have a condition variable, part of its effect on price will be encapsulated in the ca variable. So the effect of the air conditioning ca variable on the logarithm of the sale price LOG(sell) variable will be overestimated.
So, the effect of the age and condition house properties (which are not available to our models as variables) is partially included in the air conditioning ca variable. And since that effect is expected to be positive on the house sale price (and its logarithm), it will increase the effect of the air-conditioning ca variable in our models (thus, its estimated effect is overestimated).
df1 <- df[which(df$obs <= 400), ]
n1 <- nrow(df1)
print(paste("Data group#1 has", n1, "entries."))
## [1] "Data group#1 has 400 entries."
summary(df1)
## obs sell lot bdms
## Min. : 1.0 Min. : 25000 Min. : 1650 Min. :1.00
## 1st Qu.:100.8 1st Qu.: 46150 1st Qu.: 3495 1st Qu.:2.00
## Median :200.5 Median : 59250 Median : 4180 Median :3.00
## Mean :200.5 Mean : 64977 Mean : 4905 Mean :2.95
## 3rd Qu.:300.2 3rd Qu.: 78000 3rd Qu.: 6000 3rd Qu.:3.00
## Max. :400.0 Max. :190000 Max. :16200 Max. :6.00
## fb sty drv rec
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.000 Median :2.000 Median :1.0000 Median :0.0000
## Mean :1.278 Mean :1.718 Mean :0.8125 Mean :0.1625
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :1.0000 Max. :1.0000
## ffin ghw ca gar
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00 Median :0.000 Median :0.0000
## Mean :0.3475 Mean :0.05 Mean :0.285 Mean :0.6925
## 3rd Qu.:1.0000 3rd Qu.:0.00 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00 Max. :1.000 Max. :3.0000
## reg sell_LOG lot_LOG
## Min. :0.000 Min. :10.13 Min. :7.409
## 1st Qu.:0.000 1st Qu.:10.74 1st Qu.:8.159
## Median :0.000 Median :10.99 Median :8.338
## Mean :0.105 Mean :11.01 Mean :8.420
## 3rd Qu.:0.000 3rd Qu.:11.26 3rd Qu.:8.700
## Max. :1.000 Max. :12.15 Max. :9.693
Now for data 2:
df2 <- df[which(df$obs > 400), ]
n2 <- nrow(df2)
print(paste("Data group#2 has", n2, "entries."))
## [1] "Data group#2 has 146 entries."
summary(df2)
## obs sell lot bdms
## Min. :401.0 Min. : 31900 Min. : 1950 Min. :2.000
## 1st Qu.:437.2 1st Qu.: 60000 1st Qu.: 4678 1st Qu.:3.000
## Median :473.5 Median : 72750 Median : 6000 Median :3.000
## Mean :473.5 Mean : 76737 Mean : 5821 Mean :3.007
## 3rd Qu.:509.8 3rd Qu.: 91125 3rd Qu.: 6652 3rd Qu.:3.000
## Max. :546.0 Max. :174500 Max. :12944 Max. :5.000
## fb sty drv rec
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.000 Median :2.000 Median :1.0000 Median :0.0000
## Mean :1.308 Mean :2.055 Mean :0.9863 Mean :0.2192
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :2.000 Max. :4.000 Max. :1.0000 Max. :1.0000
## ffin ghw ca gar
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.3562 Mean :0.03425 Mean :0.4041 Mean :0.6918
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :3.0000
## reg sell_LOG lot_LOG
## Min. :0.000 Min. :10.37 Min. :7.576
## 1st Qu.:0.000 1st Qu.:11.00 1st Qu.:8.451
## Median :1.000 Median :11.19 Median :8.700
## Mean :0.589 Mean :11.21 Mean :8.595
## 3rd Qu.:1.000 3rd Qu.:11.42 3rd Qu.:8.803
## Max. :1.000 Max. :12.07 Max. :9.468
fit7 <- lm(sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec + ffin + ghw + ca + gar + reg,
data = df1)
summary(fit7)
##
## Call:
## lm(formula = sell_LOG ~ lot_LOG + bdms + fb + sty + drv + rec +
## ffin + ghw + ca + gar + reg, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66582 -0.13906 0.00796 0.14694 0.67596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.67309 0.29240 26.241 < 2e-16 ***
## lot_LOG 0.31378 0.03615 8.680 < 2e-16 ***
## bdms 0.03787 0.01744 2.172 0.030469 *
## fb 0.15238 0.02469 6.170 1.71e-09 ***
## sty 0.08824 0.01819 4.850 1.79e-06 ***
## drv 0.08641 0.03141 2.751 0.006216 **
## rec 0.05465 0.03392 1.611 0.107975
## ffin 0.11471 0.02673 4.291 2.25e-05 ***
## ghw 0.19870 0.05301 3.748 0.000205 ***
## ca 0.17763 0.02724 6.521 2.17e-10 ***
## gar 0.05301 0.01480 3.583 0.000383 ***
## reg 0.15116 0.04215 3.586 0.000378 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2238 on 388 degrees of freedom
## Multiple R-squared: 0.6705, Adjusted R-squared: 0.6611
## F-statistic: 71.77 on 11 and 388 DF, p-value: < 2.2e-16
Model characteristics: R2 = 0.6705 , F-statistic: 71.77 on 11 and 388 DF
# Ho : The model is correct linearly specified
modelG.RESET <- resettest(fit7, power = 2, type = "fitted",
data = df1)
modelG.RESET
##
## RESET test
##
## data: fit7
## RESET = 0.03955, df1 = 1, df2 = 387, p-value = 0.8425
With a statistic of ~0.04 and a p-value of ~0.8425, the Ramsey’s RESET test suggests that the seventh model might be correctly specified (H0 of correct/linear specification NOT rejected, at the 5% level of significance).
It also suggests that this is not the best model constructed so far, as the fourth model had scored an even lower statistic and a higher p-value than this; this seventh model scored the second best Ramsey’s RESET test score.
# Ho: The errors of the model are distributed normal
modelG.JB <- JarqueBera.test(fit7$residuals)
modelG.JB
##
## Jarque Bera Test
##
## data: fit7$residuals
## X-squared = 0.69757, df = 2, p-value = 0.7055
##
##
## Skewness
##
## data: fit7$residuals
## statistic = 0.08868, p-value = 0.469
##
##
## Kurtosis
##
## data: fit7$residuals
## statistic = 3.102, p-value = 0.6772
With a statistic of ~0.698 and a p-value of ~0.7055, the Jarque-Bera test suggests that the model residuals are normally distributed; therefore the model is considered correctly specified.
This Jarque-Bera test result is the best scored so far, and indicates a sufficient residuals normality.
modelG.fitted <- fitted.values(fit7)
df1 %>% ggplot(aes(modelG.fitted, sell_LOG)) +
geom_point(shape=16) +
geom_smooth() + ggtitle("Actual vs Fitted Value of Model F")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Both Ramsey’s RESET and Jarque-Bera tests suggest that the seventh model is sufficiently linear and with good residuals normality.
Both Ramsey’s RESET and Jarque-Bera tests suggest that the seventh model might be correctly specified.
This is also intuitively demonstrated by the seventh model real to fitted-values diagram shown at the next page (looks about the same or more like a linear relationship).
The seventh model, as estimated using the first data group, produced the following LOG(lot)^ values on the second data group:
fiited_value <- fit7$fitted.values
df1$Fitted <- fiited_value
df1%>% ggplot(aes(obs, sell_LOG)) + geom_point()+ geom_line(color="red") +
geom_line(y= df1$Fitted, color= "blue") +ggtitle("Fitted(Blue) vs Actual(Red) Model")
library(AER)
## Warning: package 'AER' was built under R version 4.0.2
## Loading required package: car
## Warning: package 'car' was built under R version 4.0.2
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: sandwich
## Loading required package: survival
residualPlot(fit7)
df1%>% ggplot(aes(obs, residuals.lm(fit7))) + geom_point()+ geom_line(color="blue") + ggtitle("Residuals")
sell_LOG_mean <- mean(df$sell_LOG)
sell_LOG_sd <- sd(df$sell_LOG)
sell_LOG_mean
## [1] 11.05896
sell_LOG_sd
## [1] 0.3719849
digits <- 3
n <- nrow(df2)
resids_SUM <- sum(abs(fit7$residuals))
resids_SUM
## [1] 69.68977
MAE <- resids_SUM/n
round(MAE, digits)
## [1] 0.477
Calculated MAE value of 0.128 is much less than the dependent variable standard deviation half, which leads to the conclusion that the model has some/significant predictive ability.