This file introduces model development, testing, and validation using frequentist and Bayesian approaches.
In this scenario we play the role of a consultant hired by a real estate investment firm in Ames, Iowa, a Midwestern town in the United States. Our job is to analyze the data in order to provide insight into how the firm should invest for the highest profits, and to quantify and communicate to the company's management what types of real estate properties are good investments and why.
As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
Use the code block below to load any necessary packages.
First: location; second: quantity.
On one hand, we see a clear relationship between price and location: the better the neighborhood, the higher the price.
On the other hand, we can observe that, within a location, the price per unit of area goes down as area increases. We should also note that more expensive neighborhoods tend to have larger properties. This circumstance skews a naive comparison of mean/median price per square foot across neighborhoods.
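To illustrate, here is a minimal sketch (assuming ames_train and dplyr are already loaded, as elsewhere in this report) comparing the median price per square foot with the median area by neighborhood:
# sketch: neighborhoods with larger homes tend to rank differently on price per sq. foot
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(median_psf = median(price / area),
            median_area = median(area)) %>%
  arrange(desc(median_psf))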
For further convenience, we will collect all the data transformations in this block. We will discuss these transformations below.
# price/area
ames_train <- ames_train %>%
  mutate(PSF = price / area,
         price.log = log(price),
         area.log = log(area))
# other transformations: recode missing/empty basement fields as "NoBasement"
ames_train <- ames_train %>%
  mutate(Bsmt.Qual = if_else(is.na(Bsmt.Qual) | Bsmt.Qual == "", "NoBasement", as.character(Bsmt.Qual)),
         Bsmt.Cond = if_else(is.na(Bsmt.Cond) | Bsmt.Cond == "", "NoBasement", as.character(Bsmt.Cond)),
         Bsmt.Exposure = if_else(is.na(Bsmt.Exposure) | Bsmt.Exposure == "", "NoBasement", as.character(Bsmt.Exposure)),
         BsmtFin.Type.1 = if_else(is.na(BsmtFin.Type.1) | BsmtFin.Type.1 == "", "NoBasement", as.character(BsmtFin.Type.1)),
         Functional.Tr = (Functional == "Typ"))
# treat missing basement full-bath counts as zero
index <- which(is.na(ames_train$Bsmt.Full.Bath))
ames_train[index, "Bsmt.Full.Bath"] <- 0
# Age
vAge15 <- c("StoneBr","NridgHt","NoRidge","GrnHill","Somerst","Blmngtn")
vAge50 <- c("Timber","Greens","Veenker","CollgCr","ClearCr","NWAmes","Gilbert","SawyerW",
"Mitchel","NPkVill", "NAmes", "Sawyer", "Blueste", "BrDale", "MeadowV")
vAge80 <- c("Crawfor","SWISU", "Edwards", "BrkSide", "OldTown", "IDOTRR")
ames_train <- ames_train %>%
mutate(Age.Blt = 2020 - Year.Built) %>%
mutate(AgeGroup = factor(case_when(Neighborhood %in% vAge15 ~ "New",
Neighborhood %in% vAge50 ~ "Medium",
Neighborhood %in% vAge80 ~ "Old"),
levels = c("New","Medium","Old")))
# Sort order
## Neighborhood
dfNeighborhoodOrdLev <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(medianPrice = median(price)) %>%  # median sale price per neighborhood
  arrange(desc(medianPrice))
vNeighborhoodOrdLev <- as.character(dfNeighborhoodOrdLev$Neighborhood)
ames_train$Neighborhood <- factor(ames_train$Neighborhood, levels = vNeighborhoodOrdLev)
rm(dfNeighborhoodOrdLev)
Distribution of the price of the houses.
gplot_price <- ames_train %>%
ggplot(aes(x = price)) +
geom_histogram(bins = 30) +
ggtitle("Price") + ylab("")
gplot_area <- ames_train %>%
ggplot(aes(x = area)) +
geom_histogram(bins = 30) +
ggtitle("Area") + ylab("")
gplot_price_area <- ames_train %>%
ggplot(aes(x = PSF)) +
geom_histogram(bins = 30) +
ggtitle("'Price/area' of the houses") +
xlab("price peer square feet") + ylab("") +
scale_y_continuous(position = "right")
gplot_price.log <- ames_train %>%
ggplot(aes(x = price.log)) +
geom_histogram(bins = 30) +
ggtitle("log(Price)") + ylab("")
gplot_area.log <- ames_train %>%
ggplot(aes(x = area.log)) +
geom_histogram(bins = 30) +
ggtitle("log(Area)") + ylab("") +
scale_y_continuous(position = "right")
grid.arrange(grid.arrange(gplot_price, gplot_area, ncol = 1), gplot_price_area, ncol = 2)
## price:
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  12789  129762  159467  181190  213000  615000
## area:
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    334    1092    1411    1477    1743    4676
## PSF (price per square foot):
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   15.4   102.5   121.2   123.3   141.2   276.2
## price.log:
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   9.46   11.77   11.98   12.02   12.27   13.33
## area.log:
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   5.81    7.00    7.25    7.24    7.46    8.45
Price and area are both right-skewed. We can use derived variables such as log(price), log(area), or price/area to make the distributions more symmetric.
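As a quick numeric check of the skewness claim, one could compute the sample skewness; this is only a sketch and assumes the e1071 package, which is not used elsewhere in this report:
library(e1071)  # assumption: e1071 is installed; skewness() is its sample-skewness function
c(price = skewness(ames_train$price),
  area = skewness(ames_train$area),
  price.log = skewness(ames_train$price.log),
  area.log = skewness(ames_train$area.log))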
Neighborhood
gplot_price_area_box <- ames_train %>%
ggplot(aes(y = Neighborhood, x = PSF)) +
geom_boxplot() + ylab("") +
ggtitle("Home price per sq.feet")
gplot_price_box <- ames_train %>%
ggplot(aes(y = Neighborhood, x = price)) +
geom_boxplot() + ylab("") +
ggtitle("Home price")
grid.arrange(gplot_price_box, gplot_price_area_box, ncol = 2)
ames_train %>% group_by(Neighborhood) %>%
  summarise(median(price), IQR(price), median(PSF), IQR(PSF)) %>% tail(5)
## # A tibble: 5 x 5
## Neighborhood `median(price)` `IQR(price)` `median(PSF)` `IQR(PSF)`
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Blueste 123900 10250 113. 3.55
## 2 OldTown 120000 29550 92.4 33.3
## 3 IDOTRR 99500 48900 86.8 32.0
## 4 BrDale 98750 16725 92.9 11.7
## 5 MeadowV 85750 20150 83.5 41.2
ames_train %>% group_by(Neighborhood) %>%
  summarise(median(price), IQR(price), median(PSF), IQR(PSF)) %>% head(5)
## # A tibble: 5 x 5
## Neighborhood `median(price)` `IQR(price)` `median(PSF)` `IQR(PSF)`
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 StoneBr 340692. 151358 166. 56.7
## 2 NridgHt 336860 148800 159. 45.8
## 3 NoRidge 290000 50312. 128. 19.6
## 4 GrnHill 280000 50000 199. 21.1
## 5 Timber 232500 151200 161. 41.9
The graphs show a clear relationship between home price and neighborhood. Ordering neighborhoods from MeadowV up to StoneBr, the spread of the price distribution within each group widens as the average price increases. From this point of view, the distributions of home price per square foot look more stable across neighborhood groups. This points to significant differences in the area distributions between neighborhoods.
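A rough numeric version of this observation (a sketch): across neighborhoods, the spread of price grows with its median, while the spread of PSF tracks its median much less strongly.
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(med_price = median(price), iqr_price = IQR(price),
            med_psf = median(PSF), iqr_psf = IQR(PSF)) %>%
  summarise(cor_price = cor(med_price, iqr_price),  # expected strongly positive, per the tables above
            cor_psf = cor(med_psf, iqr_psf))        # expected noticeably weaker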
Let's review the relationship between price and area.
gplot_pricefeet_area_point_all <- ames_train %>%
ggplot(aes(x = area, y = PSF)) +
geom_point() +
ggtitle("Home price per sq.feet by area") + ylab("") +
scale_y_continuous(position = "right") +
geom_smooth(method = "lm")
gplot_price_area_point_all <- ames_train %>%
ggplot(aes(x = area, y = price)) +
geom_point() +
ggtitle("Home price by area") + ylab("") +
geom_smooth(method = "lm")
gplot_pricefeet_area_pointAgeGroup <- ames_train %>%
ggplot(aes(x = area, y = PSF)) +
geom_point() +
facet_wrap("AgeGroup", ncol = 2) +
geom_smooth(method = "lm") +
ggtitle("Home price per sq.feet by area") + ylab("") +
scale_y_continuous(position = "right")
gplot_price_area_pointAgeGroup <- ames_train %>%
ggplot(aes(x = area, y = price)) +
geom_point() +
facet_wrap("AgeGroup", ncol = 2) +
geom_smooth(method = "lm") +
ggtitle("Home price by area") + ylab("")
gplot_area_by_Neighborhood_boxplot <- ames_train %>%
ggplot(aes(y = Neighborhood, x = area)) +
geom_boxplot() +
ylab("") +
ggtitle("Home area")
grid.arrange(gplot_price_area_point_all, gplot_pricefeet_area_point_all, ncol = 2)
#grid.arrange(gplot_price_area_point, gplot_pricefeet_area_point, ncol = 2)
grid.arrange(gplot_price_area_pointAgeGroup, gplot_pricefeet_area_pointAgeGroup, ncol = 2)
As expected, we see a clear positive relationship between home price and area. We should note that the residuals grow with both price and area. On the other hand, there is no clear relationship between home price per square foot and area; it is almost impossible to say what that relationship is.
However, if we construct these plots separately by neighborhood, both relationships become much clearer, and each neighborhood behaves differently. We illustrate two special cases below: one of the most expensive neighborhoods, Northridge Heights, and one of the budget districts, Iowa DOT and Rail Road.
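The sketch below (our own construction, following the plotting style used above) reproduces those two contrasting cases:
ames_train %>%
  filter(Neighborhood %in% c("NridgHt", "IDOTRR")) %>%
  ggplot(aes(x = area, y = price)) +
  geom_point() +
  facet_wrap("Neighborhood", scales = "free") +
  geom_smooth(method = "lm") +
  ggtitle("Home price by area: NridgHt vs IDOTRR")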
Year.Built, transformed to Age.Blt, by neighborhood groups
# make plot
gplot_price_yBuild_all_point <- ames_train %>%
ggplot() +
geom_point(aes(x = Age.Blt, y = price)) +
geom_smooth(aes(x = Age.Blt, y = price), method = "lm") +
xlab("") + ylab("") + ggtitle("Price by age build")
gplot_price_area_yBuild_all_point <- ames_train %>%
ggplot() +
geom_point(aes(x = Age.Blt, y = PSF)) +
geom_smooth(aes(x = Age.Blt, y = PSF), method = "lm") +
xlab("") + ylab("") + ggtitle("PSF by age build") +
scale_y_continuous(position = "right")
gplot_price_area_yBuild_all_hist <- ames_train %>%
ggplot() +
geom_histogram(aes(x = Age.Blt), bins = 30) +
ylab("count") + xlab("age")
gplot_psf_agebuild <- ames_train %>%
ggplot(aes(x = Age.Blt, y = PSF)) +
geom_point() +
facet_wrap("AgeGroup", ncol = 3) +
geom_smooth(method = "lm") +
ggtitle("~ psf") + ylab("")
gplot_price_agebuild <- ames_train %>%
ggplot(aes(x = Age.Blt, y = price)) +
geom_point() +
facet_wrap("AgeGroup", ncol = 3) +
geom_smooth(method = "lm") +
ggtitle("~ price") + ylab("") + xlab("age")
# print plot
grid.arrange(grid.arrange(gplot_price_yBuild_all_point, gplot_price_area_yBuild_all_point, ncol = 2),
             gplot_price_area_yBuild_all_hist, heights = c(2.5, 1))
Mostly, price and PSF behave the same way: as age increases, price decreases, with a few exceptions such as OldTown. The strength of the relationship also differs between neighborhoods; roughly, price changes more strongly with age in neighborhoods with relatively small spreads in age. The distribution of home age is right-skewed, with three visible peaks: most cases sit around 15 years, with smaller groups centered around 60 and 100 years. The most expensive neighborhoods usually fall into the first group, while the others are spread roughly across all groups. This leads us to the same conclusion: different approaches may be needed for different price classes of properties.
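A rough count of the three age groups (a sketch; the 40- and 80-year cut points are our own reading of the histogram, matching the vAge15/vAge50/vAge80 grouping above):
ames_train %>%
  mutate(age_band = cut(Age.Blt, breaks = c(0, 40, 80, Inf),
                        labels = c("around 15", "around 60", "around 100"))) %>%
  count(age_band)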
Let's build the full models that we expect to be the most promising.
# Bayesian approach
# BIC
m.BIC.full.baslm.price.log <- bas.lm(price.log ~ Neighborhood + area.log * AgeGroup + Age.Blt * AgeGroup +
Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
data = ames_train,
prior = "BIC",
modelprior = uniform(),
method = "MCMC", MCMC.iterations = 10 ^ 5)
m.BIC.full.baslm.PSF <- bas.lm(PSF ~ Neighborhood + area * AgeGroup + Age.Blt * AgeGroup +
Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
data = ames_train,
prior = "BIC",
modelprior = uniform(),
method = "MCMC", MCMC.iterations = 10 ^ 5)
# AIC
m.AIC.full.baslm.price.log <- bas.lm(price.log ~ Neighborhood + area.log * AgeGroup + Age.Blt * AgeGroup +
Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
data = ames_train,
prior = "AIC",
modelprior = uniform(),
method = "MCMC", MCMC.iterations = 10 ^ 5)
m.AIC.full.baslm.PSF <- bas.lm(PSF ~ Neighborhood + area * AgeGroup + Age.Blt * AgeGroup +
Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
data = ames_train,
prior = "AIC",
modelprior = uniform(),
method = "MCMC", MCMC.iterations = 10 ^ 5)
# Frequentist approach
m.full.lm.price.log <- lm(price.log ~ Neighborhood + area.log * AgeGroup + Age.Blt * AgeGroup +
Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
data = ames_train)
m.full.lm.PSF <- lm(PSF ~ Neighborhood + area * AgeGroup + Age.Blt * AgeGroup +
Bldg.Type + House.Style + Overall.Qual + Bsmt.Exposure +
Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
data = ames_train)
Let's get the summary and ANOVA output for m.full.lm.price.log.
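The summary output below was presumably produced by a call like the following (the call itself is not shown in the source):
summary(m.full.lm.price.log)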
##
## Call:
## lm(formula = price.log ~ Neighborhood + area.log * AgeGroup +
## Age.Blt * AgeGroup + Bldg.Type + House.Style + Overall.Qual +
## Bsmt.Exposure + Bsmt.Full.Bath + Functional.Tr + Sale.Condition,
## data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6608 -0.0630 0.0012 0.0648 0.5870
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.14919 0.46374 13.26 < 0.0000000000000002 ***
## NeighborhoodNridgHt -0.06369 0.04115 -1.55 0.12204
## NeighborhoodNoRidge -0.06787 0.04933 -1.38 0.16920
## NeighborhoodGrnHill 0.40875 0.11256 3.63 0.00030 ***
## NeighborhoodTimber 1.34936 0.46616 2.89 0.00388 **
## NeighborhoodSomerst -0.08970 0.04121 -2.18 0.02975 *
## NeighborhoodGreens 1.32242 0.46230 2.86 0.00432 **
## NeighborhoodVeenker 1.31285 0.46615 2.82 0.00496 **
## NeighborhoodCrawfor 1.06353 0.48201 2.21 0.02759 *
## NeighborhoodCollgCr 1.29236 0.46409 2.78 0.00546 **
## NeighborhoodBlmngtn -0.16960 0.05943 -2.85 0.00442 **
## NeighborhoodClearCr 1.40153 0.46846 2.99 0.00285 **
## NeighborhoodNWAmes 1.29089 0.46684 2.77 0.00580 **
## NeighborhoodGilbert 1.28216 0.46577 2.75 0.00602 **
## NeighborhoodSawyerW 1.26901 0.46493 2.73 0.00646 **
## NeighborhoodMitchel 1.29738 0.46323 2.80 0.00520 **
## NeighborhoodNPkVill 1.31048 0.46302 2.83 0.00475 **
## NeighborhoodNAmes 1.28394 0.46332 2.77 0.00570 **
## NeighborhoodSawyer 1.29058 0.46325 2.79 0.00544 **
## NeighborhoodSWISU 0.84533 0.48163 1.76 0.07956 .
## NeighborhoodEdwards 0.85046 0.47870 1.78 0.07595 .
## NeighborhoodBrkSide 0.93640 0.47784 1.96 0.05033 .
## NeighborhoodBlueste 1.30785 0.46472 2.81 0.00499 **
## NeighborhoodOldTown 0.87431 0.47920 1.82 0.06839 .
## NeighborhoodIDOTRR 0.76613 0.47771 1.60 0.10910
## NeighborhoodBrDale 1.20077 0.45987 2.61 0.00917 **
## NeighborhoodMeadowV 1.11671 0.45771 2.44 0.01488 *
## area.log 0.77504 0.06025 12.86 < 0.0000000000000002 ***
## AgeGroupMedium NA NA NA NA
## AgeGroupOld NA NA NA NA
## Age.Blt -0.00995 0.00334 -2.97 0.00301 **
## Bldg.Type2fmCon -0.05199 0.03569 -1.46 0.14552
## Bldg.TypeDuplex -0.19317 0.02899 -6.66 0.000000000045163 ***
## Bldg.TypeTwnhs -0.12838 0.03590 -3.58 0.00037 ***
## Bldg.TypeTwnhsE -0.07110 0.02436 -2.92 0.00360 **
## House.Style1.5Unf 0.05670 0.05603 1.01 0.31183
## House.Style1Story 0.05673 0.02054 2.76 0.00586 **
## House.Style2.5Unf -0.03157 0.05074 -0.62 0.53398
## House.Style2Story -0.04604 0.02087 -2.21 0.02760 *
## House.StyleSFoyer 0.08665 0.03639 2.38 0.01745 *
## House.StyleSLvl 0.00343 0.03118 0.11 0.91245
## Overall.Qual 0.07340 0.00654 11.23 < 0.0000000000000002 ***
## Bsmt.ExposureGd 0.06718 0.02020 3.33 0.00092 ***
## Bsmt.ExposureMn -0.04033 0.02093 -1.93 0.05425 .
## Bsmt.ExposureNo -0.03234 0.01554 -2.08 0.03770 *
## Bsmt.ExposureNoBasement -0.15282 0.03715 -4.11 0.000042261941308 ***
## Bsmt.Full.Bath 0.07775 0.01000 7.78 0.000000000000019 ***
## Functional.TrTRUE 0.07275 0.02093 3.48 0.00053 ***
## Sale.ConditionAdjLand 0.16150 0.11146 1.45 0.14770
## Sale.ConditionAlloca 0.21191 0.07744 2.74 0.00633 **
## Sale.ConditionFamily -0.05314 0.04094 -1.30 0.19453
## Sale.ConditionNormal 0.11684 0.01970 5.93 0.000000004233025 ***
## Sale.ConditionPartial 0.13200 0.02824 4.67 0.000003387783519 ***
## area.log:AgeGroupMedium -0.21349 0.06025 -3.54 0.00041 ***
## area.log:AgeGroupOld -0.16640 0.06294 -2.64 0.00834 **
## AgeGroupMedium:Age.Blt 0.00709 0.00338 2.10 0.03637 *
## AgeGroupOld:Age.Blt 0.00744 0.00339 2.20 0.02822 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.145 on 945 degrees of freedom
## Multiple R-squared: 0.888, Adjusted R-squared: 0.882
## F-statistic: 139 on 54 and 945 DF, p-value: <0.0000000000000002
We got a relatively good multiple R-squared of 0.888. Now we briefly discuss the most important predictors. As expected from the EDA section, the model shows a positive relationship between area.log and price.log. For AgeGroupMedium and AgeGroupOld, the negative interaction coefficients apply an extra penalty to the area effect. The log(area) variable is statistically significant, with a very small p-value. The Neighborhood variable is categorical, and most of its levels are statistically significant as well.
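The ANOVA table that follows was presumably produced by:
anova(m.full.lm.price.log)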
## Analysis of Variance Table
##
## Response: price.log
## Df Sum Sq Mean Sq F value Pr(>F)
## Neighborhood 26 106.8 4.11 196.20 < 0.0000000000000002 ***
## area.log 1 30.6 30.62 1462.23 < 0.0000000000000002 ***
## Age.Blt 1 4.0 3.98 190.00 < 0.0000000000000002 ***
## Bldg.Type 4 3.0 0.75 36.05 < 0.0000000000000002 ***
## House.Style 6 2.9 0.48 23.15 < 0.0000000000000002 ***
## Overall.Qual 1 4.9 4.93 235.33 < 0.0000000000000002 ***
## Bsmt.Exposure 4 1.4 0.36 17.26 0.0000000000001117 ***
## Bsmt.Full.Bath 1 1.3 1.32 62.85 0.0000000000000063 ***
## Functional.Tr 1 0.2 0.24 11.43 0.00075 ***
## Sale.Condition 5 1.3 0.26 12.37 0.0000000000121697 ***
## area.log:AgeGroup 2 0.3 0.14 6.63 0.00138 **
## AgeGroup:Age.Blt 2 0.1 0.05 2.44 0.08786 .
## Residuals 945 19.8 0.02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The most significant predictor is Neighborhood: it explains roughly 60% (106.8/176.7) of the variation in the response. Next comes area.log, which explains roughly 17% of price.log.
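These shares can be computed directly from the ANOVA table rather than by hand (a sketch):
av <- anova(m.full.lm.price.log)
setNames(round(av$`Sum Sq` / sum(av$`Sum Sq`), 2), rownames(av))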
Let’s use Bayesian and Frequentist approaches.
# Bayesian approach
# BIC
# price.log
#BIC.BMA_pred_price.log <- predict(m.BIC.full.baslm.price.log, estimator = "BMA", se.fit = TRUE, nsim = 10000)
BIC.BPM_pred_price.log <- predict(m.BIC.full.baslm.price.log, estimator = "BPM", se.fit = TRUE, nsim = 10000)
BIC.HPM_pred_price.log <- predict(m.BIC.full.baslm.price.log, estimator = "HPM", se.fit = TRUE, nsim = 10000)
BIC.MPM_pred_price.log <- predict(m.BIC.full.baslm.price.log, estimator = "MPM", se.fit = TRUE, nsim = 10000)
# PSF
#BIC.BMA_pred_PSF <- predict(m.BIC.full.baslm.PSF, estimator = "BMA", se.fit = TRUE, nsim = 10000)
BIC.BPM_pred_PSF <- predict(m.BIC.full.baslm.PSF, estimator = "BPM", se.fit = TRUE, nsim = 10000)
BIC.HPM_pred_PSF <- predict(m.BIC.full.baslm.PSF, estimator = "HPM", se.fit = TRUE, nsim = 10000)
BIC.MPM_pred_PSF <- predict(m.BIC.full.baslm.PSF, estimator = "MPM", se.fit = TRUE, nsim = 10000)
# AIC
# price.log
#AIC.BMA_pred_price.log <- predict(m.AIC.full.baslm.price.log, estimator = "BMA", se.fit = TRUE, nsim = 10000)
AIC.BPM_pred_price.log <- predict(m.AIC.full.baslm.price.log, estimator = "BPM", se.fit = TRUE, nsim = 10000)
AIC.HPM_pred_price.log <- predict(m.AIC.full.baslm.price.log, estimator = "HPM", se.fit = TRUE, nsim = 10000)
AIC.MPM_pred_price.log <- predict(m.AIC.full.baslm.price.log, estimator = "MPM", se.fit = TRUE, nsim = 10000)
# PSF
#AIC.BMA_pred_PSF <- predict(m.AIC.full.baslm.PSF, estimator = "BMA", se.fit = TRUE, nsim = 10000)
AIC.BPM_pred_PSF <- predict(m.AIC.full.baslm.PSF, estimator = "BPM", se.fit = TRUE, nsim = 10000)
AIC.HPM_pred_PSF <- predict(m.AIC.full.baslm.PSF, estimator = "HPM", se.fit = TRUE, nsim = 10000)
AIC.MPM_pred_PSF <- predict(m.AIC.full.baslm.PSF, estimator = "MPM", se.fit = TRUE, nsim = 10000)
# ---
# Frequentist approach
# Stepwise selection
# BIC
n <- nrow(ames_train)
m.BIC.lm.price.log <- stepAIC(m.full.lm.price.log, k = log(n), trace = F)
m.BIC.lm.PSF <- stepAIC(m.full.lm.PSF, k = log(n), trace = F)
# AIC
m.AIC.lm.price.log <- stepAIC(m.full.lm.price.log, k = 2, trace = F)
m.AIC.lm.PSF <- stepAIC(m.full.lm.PSF, k = 2, trace = F)
Generally, the models arrive at similar results across the different approaches. We will discuss this in more detail in the next section.
Here we should note that we commented out the BMA method in the code because it is very RAM-heavy to run on this machine alongside the other models. We did try that method as well, and in our case those models did not rank among the most accurate.
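If memory is the binding constraint, one possible workaround (a sketch; to our knowledge the BAS predict method accepts a top argument that restricts BMA to the top posterior-probability models, and 1000 is an arbitrary choice) would be:
BIC.BMA_pred_price.log <- predict(m.BIC.full.baslm.price.log, estimator = "BMA",
                                  top = 1000, se.fit = TRUE, nsim = 10000)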
Let's construct credible and confidence intervals for our models.
# Prediction
# Bayesian approach
# ---
# function for get credible intervals & unite result in the data frame
get.credible.intervals <- function(pred.baslm) {
  df <- data.frame(confint(pred.baslm, parm = "coef")[, 1:3],
                   confint(pred.baslm, parm = "pred")[, 1:2])
  colnames(df) <- c("low_mu", "hight_mu", "pred", "low", "hight")
  df[, c("low", "low_mu", "pred", "hight_mu", "hight")]
}
# BIC
#pred.BIC.BMA.baslm.price.log <- exp(get.credible.intervals(BIC.BMA_pred_price.log))
pred.BIC.BPM.baslm.price.log <- exp(get.credible.intervals(BIC.BPM_pred_price.log))
pred.BIC.HPM.baslm.price.log <- exp(get.credible.intervals(BIC.HPM_pred_price.log))
pred.BIC.MPM.baslm.price.log <- exp(get.credible.intervals(BIC.MPM_pred_price.log))
#pred.BIC.BMA.baslm.PSF <- get.credible.intervals(BIC.BMA_pred_PSF)*ames_train$area
pred.BIC.BPM.baslm.PSF <- get.credible.intervals(BIC.BPM_pred_PSF)*ames_train$area
pred.BIC.HPM.baslm.PSF <- get.credible.intervals(BIC.HPM_pred_PSF)*ames_train$area
pred.BIC.MPM.baslm.PSF <- get.credible.intervals(BIC.MPM_pred_PSF)*ames_train$area
# AIC
#pred.AIC.BMA.baslm.price.log <- exp(get.credible.intervals(AIC.BMA_pred_price.log))
pred.AIC.BPM.baslm.price.log <- exp(get.credible.intervals(AIC.BPM_pred_price.log))
pred.AIC.HPM.baslm.price.log <- exp(get.credible.intervals(AIC.HPM_pred_price.log))
pred.AIC.MPM.baslm.price.log <- exp(get.credible.intervals(AIC.MPM_pred_price.log))
#pred.AIC.BMA.baslm.PSF <- get.credible.intervals(AIC.BMA_pred_PSF)*ames_train$area
pred.AIC.BPM.baslm.PSF <- get.credible.intervals(AIC.BPM_pred_PSF)*ames_train$area
pred.AIC.HPM.baslm.PSF <- get.credible.intervals(AIC.HPM_pred_PSF)*ames_train$area
pred.AIC.MPM.baslm.PSF <- get.credible.intervals(AIC.MPM_pred_PSF)*ames_train$area
# Frequentist approach
# ---
pred.BIC.lm.price.log <- data.frame(exp(predict(m.BIC.lm.price.log, ames_train, interval = "prediction")))
pred.AIC.lm.price.log <- data.frame(exp(predict(m.AIC.lm.price.log, ames_train, interval = "prediction")))
pred.BIC.lm.PSF <- data.frame(predict(m.BIC.lm.PSF, ames_train, interval = "prediction")*ames_train$area)
pred.AIC.lm.PSF <- data.frame(predict(m.AIC.lm.PSF, ames_train, interval = "prediction")*ames_train$area)
We will also develop a function that helps us make residual plots from several perspectives.
# create function which help with building resid plots
plot_resid <- function(df_ames = ames_train,
df_ames_pred,
numeric_variables = c("price","pred","area","Age.Blt"),
categorical_variables = c("Neighborhood","Bldg.Type","House.Style","Overall.Qual",
"Bsmt.Exposure","Bsmt.Full.Bath","Functional.Tr","Sale.Condition"),
indexPlot = TRUE,
pricepredPlot = TRUE,
outliersindex) {
df <- cbind(df_ames, df_ames_pred)
if(!("low_mu" %in% colnames(df))) {
df$low <- df$lwr
df$low_mu <- df$lwr
df$pred <- df$fit
df$hight_mu <- df$upr
df$hight <- df$upr
}
df$resid <- df$price - df$pred
if (pricepredPlot == T) {
gplot_price <- ggplot(data = df) +
geom_point(aes(x = pred, y = price), color = "blue", size = 0.6) +
geom_line(aes(x = pred, y = low_mu, lty = "second")) +
geom_line(aes(x = pred, y = hight_mu, lty = "second")) +
geom_line(aes(x = pred, y = low, lty = "third")) +
geom_line(aes(x = pred, y = hight, lty = "third")) +
geom_line(aes(x = pred, y = pred, color = "orange")) +
scale_colour_manual(values = c("orange"), labels = "Posterior mean", name = "") +
scale_linetype_manual(values = c(2, 3), labels = c("95% CI for mean", "95% CI for predictions")
, name = "") +
theme_bw() +
theme(legend.position = c(1, 0), legend.justification = c(1.5, 0)) +
ggtitle(paste0('Model uncertainty distribution: ',substitute(df_ames_pred))) +
geom_point(data = df[outliersindex,], aes(x = pred, y = price), color = "orange", pch = 1, cex = 6)
plot(gplot_price)
}
if (indexPlot == T) {
gplot_index <- ggplot(data = df) +
geom_point(aes(x = as.numeric(row.names(df)), y = resid), color = "blue", size = 0.6) +
geom_hline(yintercept=0, linetype="dashed") +
ggtitle(paste0("Residuals by index")) +
geom_point(data = df[outliersindex,], aes(x = as.numeric(row.names(df[outliersindex,])), y = resid), color = "orange", pch = 1, cex = 6)
plot(gplot_index)
}
if (length(numeric_variables) != 0) {
for (variable in numeric_variables) {
df[,"variable"] <- df[,variable]
gplot_numeric <- ggplot(data = df) +
geom_point(aes(x = variable, y = resid), color = "blue", size = 0.6) +
geom_hline(yintercept=0, linetype="dashed") +
ggtitle(paste0("Residuals by <",variable,">")) +
xlab(variable) +
geom_point(data = df[outliersindex,], aes(x = variable, y = resid), color = "orange", pch = 1, cex = 6)
print(gplot_numeric)
}
}
if (length(categorical_variables) != 0) {
for (variable in categorical_variables) {
df[,"variable"] <- df[,variable]
gplot_categorical <- ggplot(data = df) +
geom_boxplot(aes(y = as.factor(variable), x = resid, color = as.factor(variable))) +
ggtitle(paste0("Residuals by <",variable,">")) +
ylab(variable)
print(gplot_categorical)
}
}
}
Let's define the outliers. For all models, we got 5 outlier cases with an area above 3,500 sq. feet or a price above $550k, with prediction errors within roughly ±$250k.
outliersindex <- which(ames_train$area > 3500 | ames_train$price > 550000)
ames_train[outliersindex, c("area","price","Neighborhood","Overall.Qual")]
## # A tibble: 5 x 4
## area price Neighborhood Overall.Qual
## <int> <int> <fct> <int>
## 1 2470 615000 NridgHt 10
## 2 4676 184750 Edwards 10
## 3 3672 415000 Edwards 6
## 4 2364 611657 NridgHt 9
## 5 2338 591587 StoneBr 9
Let's briefly discuss the most promising models.
pred.AIC.lm.price.log
Increased inaccuracy roughly within:
Underestimates roughly for:
Overestimates roughly for: none.
plot_resid(df_ames_pred = pred.AIC.lm.price.log,
           outliersindex = outliersindex,
           categorical_variables = c("Neighborhood","Overall.Qual","Bsmt.Full.Bath"))
pred.AIC.BPM.baslm.price.log
Increased inaccuracy roughly within:
Underestimates roughly for:
Overestimates roughly for:
plot_resid(df_ames_pred = pred.AIC.BPM.baslm.price.log, outliersindex = outliersindex,
           categorical_variables = c("Neighborhood","Bldg.Type","Overall.Qual","Bsmt.Full.Bath","House.Style"))
pred.AIC.MPM.baslm.price.log
Increased inaccuracy roughly within:
Underestimates roughly for:
Overestimates roughly for:
plot_resid(df_ames_pred = pred.AIC.MPM.baslm.price.log, outliersindex = outliersindex,
           categorical_variables = c("Neighborhood","Bldg.Type","Overall.Qual","Bsmt.Full.Bath","House.Style"))
Let's calculate the root mean squared error (RMSE) in original dollar values.
df_RMSE <- data.frame(RMSE = double())
# Bayesian approach
# BIC
#df_RMSE["pred.BIC.BMA.baslm.price.log","RMSE"] <- sqrt(mean((pred.BIC.BMA.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.BIC.BPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.BIC.BPM.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.BIC.HPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.BIC.HPM.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.BIC.MPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.BIC.MPM.baslm.price.log$pred - ames_train$price)^2))
#df_RMSE["pred.BIC.BMA.baslm.PSF","RMSE"] <- sqrt(mean((pred.BIC.BMA.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.BIC.BPM.baslm.PSF","RMSE"] <- sqrt(mean((pred.BIC.BPM.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.BIC.HPM.baslm.PSF","RMSE"] <- sqrt(mean((pred.BIC.HPM.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.BIC.MPM.baslm.PSF","RMSE"] <- sqrt(mean((pred.BIC.MPM.baslm.PSF$pred - ames_train$price)^2))
# AIC
#df_RMSE["pred.AIC.BMA.baslm.price.log","RMSE"] <- sqrt(mean((pred.AIC.BMA.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.AIC.BPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.AIC.BPM.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.AIC.HPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.AIC.HPM.baslm.price.log$pred - ames_train$price)^2))
df_RMSE["pred.AIC.MPM.baslm.price.log","RMSE"] <- sqrt(mean((pred.AIC.MPM.baslm.price.log$pred - ames_train$price)^2))
#df_RMSE["pred.AIC.BMA.baslm.PSF","RMSE"] <- sqrt(mean((pred.AIC.BMA.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.AIC.BPM.baslm.PSF","RMSE"] <- sqrt(mean((pred.AIC.BPM.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.AIC.HPM.baslm.PSF","RMSE"] <- sqrt(mean((pred.AIC.HPM.baslm.PSF$pred - ames_train$price)^2))
df_RMSE["pred.AIC.MPM.baslm.PSF","RMSE"] <- sqrt(mean((pred.AIC.MPM.baslm.PSF$pred - ames_train$price)^2))
# Frequentist approach
df_RMSE["pred.BIC.lm.price.log","RMSE"] <- sqrt(mean((pred.BIC.lm.price.log$fit - ames_train$price)^2, na.rm =T))
df_RMSE["pred.AIC.lm.price.log","RMSE"] <- sqrt(mean((pred.AIC.lm.price.log$fit - ames_train$price)^2, na.rm =T))
df_RMSE["pred.BIC.lm.PSF","RMSE"] <- sqrt(mean((pred.BIC.lm.PSF$fit - ames_train$price)^2))
df_RMSE["pred.AIC.lm.PSF","RMSE"] <- sqrt(mean((pred.AIC.lm.PSF$fit - ames_train$price)^2))
df_RMSE %>%
  arrange(RMSE) %>%
  head()
## RMSE
## pred.AIC.lm.price.log 25602
## pred.AIC.BPM.baslm.price.log 25766
## pred.AIC.MPM.baslm.price.log 25787
## pred.AIC.HPM.baslm.price.log 26206
## pred.AIC.lm.PSF 26625
## pred.AIC.MPM.baslm.PSF 26643
For the first three models, we got relatively close RMSE values. We should also mention that, based on the residual plots, these models behave roughly similarly.
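The repeated RMSE expression above could be factored into a small helper (a sketch; the rmse name is ours and is not used elsewhere in this report):
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2, na.rm = TRUE))
# e.g. df_RMSE["pred.AIC.lm.price.log", "RMSE"] <- rmse(pred.AIC.lm.price.log$fit, ames_train$price)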
Let's make the needed data transformations. All our models include categorical explanatory variables, so we have to be careful about factor levels that do not appear in the training data.
ames_test <- ames_test %>%
filter(Neighborhood != "Landmrk") %>%
filter(House.Style != "2.5Fin") %>%
mutate(area.log = log(area),
price.log = log(price),
PSF = price/area,
Functional.Tr = Functional == "Typ",
AgeGroup = factor(case_when(Neighborhood %in% vAge15 ~ "New",
Neighborhood %in% vAge50 ~ "Medium",
Neighborhood %in% vAge80 ~ "Old"),
levels = c("New","Medium","Old")),
Age.Blt = 2020 - Year.Built,
Bsmt.Exposure = if_else(is.na(Bsmt.Exposure) | Bsmt.Exposure == "", "NoBasement", as.character(Bsmt.Exposure)))
Let's take all the steps needed to get the RMSE results for the test data set.
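Before predicting, a quick guard against the new-factor-level issue mentioned above (a sketch):
# any result other than character(0) would break predict() on the test set
setdiff(unique(as.character(ames_test$Neighborhood)), levels(ames_train$Neighborhood))
setdiff(unique(as.character(ames_test$House.Style)), unique(as.character(ames_train$House.Style)))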
k <- nrow(ames_train)/nrow(ames_test)
pred.BIC.lm.price.log_test <- data.frame(exp(predict(m.BIC.lm.price.log, ames_test, interval = "prediction")))
pred.AIC.lm.price.log_test <- data.frame(exp(predict(m.AIC.lm.price.log, ames_test, interval = "prediction")))
## Warning in predict.lm(m.AIC.lm.price.log, ames_test, interval = "prediction"):
## prediction from a rank-deficient fit may be misleading
pred.BIC.lm.PSF_test <- data.frame(predict(m.BIC.lm.PSF, ames_test, interval = "prediction")*ames_test$area)
pred.AIC.lm.PSF_test <- data.frame(predict(m.AIC.lm.PSF, ames_test, interval = "prediction")*ames_test$area)
## Warning in predict.lm(m.AIC.lm.PSF, ames_test, interval = "prediction"):
## prediction from a rank-deficient fit may be misleading
df_RMSE["pred.BIC.lm.price.log","RMSE_test"] <- sqrt(mean((pred.BIC.lm.price.log_test$fit - ames_test$price)^2, na.rm =T))*k
df_RMSE["pred.AIC.lm.price.log","RMSE_test"] <- sqrt(mean((pred.AIC.lm.price.log_test$fit - ames_test$price)^2, na.rm =T))*k
df_RMSE["pred.BIC.lm.PSF","RMSE_test"] <- sqrt(mean((pred.BIC.lm.PSF_test$fit - ames_test$price)^2))*k
df_RMSE["pred.AIC.lm.PSF","RMSE_test"] <- sqrt(mean((pred.AIC.lm.PSF_test$fit - ames_test$price)^2))*k
# BIC
# price.log
BIC.BPM_pred_price.log_test <- predict(m.BIC.full.baslm.price.log, estimator = "BPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
BIC.HPM_pred_price.log_test <- predict(m.BIC.full.baslm.price.log, estimator = "HPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
BIC.MPM_pred_price.log_test <- predict(m.BIC.full.baslm.price.log, estimator = "MPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
# PSF
BIC.BPM_pred_PSF_test <- predict(m.BIC.full.baslm.PSF, estimator = "BPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
BIC.HPM_pred_PSF_test <- predict(m.BIC.full.baslm.PSF, estimator = "HPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
BIC.MPM_pred_PSF_test <- predict(m.BIC.full.baslm.PSF, estimator = "MPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
# AIC
# price.log
AIC.BPM_pred_price.log_test <- predict(m.AIC.full.baslm.price.log, estimator = "BPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
AIC.HPM_pred_price.log_test <- predict(m.AIC.full.baslm.price.log, estimator = "HPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
AIC.MPM_pred_price.log_test <- predict(m.AIC.full.baslm.price.log, estimator = "MPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
# PSF
AIC.BPM_pred_PSF_test <- predict(m.AIC.full.baslm.PSF, estimator = "BPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
AIC.HPM_pred_PSF_test <- predict(m.AIC.full.baslm.PSF, estimator = "HPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
AIC.MPM_pred_PSF_test <- predict(m.AIC.full.baslm.PSF, estimator = "MPM", se.fit = TRUE, nsim = 10000, newdata = ames_test)
# BIC
pred.BIC.BPM.baslm.price.log_test <- exp(get.credible.intervals(BIC.BPM_pred_price.log_test))
pred.BIC.HPM.baslm.price.log_test <- exp(get.credible.intervals(BIC.HPM_pred_price.log_test))
pred.BIC.MPM.baslm.price.log_test <- exp(get.credible.intervals(BIC.MPM_pred_price.log_test))
pred.BIC.BPM.baslm.PSF_test <- get.credible.intervals(BIC.BPM_pred_PSF_test)*ames_test$area
pred.BIC.HPM.baslm.PSF_test <- get.credible.intervals(BIC.HPM_pred_PSF_test)*ames_test$area
pred.BIC.MPM.baslm.PSF_test <- get.credible.intervals(BIC.MPM_pred_PSF_test)*ames_test$area
# AIC
pred.AIC.BPM.baslm.price.log_test <- exp(get.credible.intervals(AIC.BPM_pred_price.log_test))
pred.AIC.HPM.baslm.price.log_test <- exp(get.credible.intervals(AIC.HPM_pred_price.log_test))
pred.AIC.MPM.baslm.price.log_test <- exp(get.credible.intervals(AIC.MPM_pred_price.log_test))
pred.AIC.BPM.baslm.PSF_test <- get.credible.intervals(AIC.BPM_pred_PSF_test)*ames_test$area
pred.AIC.HPM.baslm.PSF_test <- get.credible.intervals(AIC.HPM_pred_PSF_test)*ames_test$area
pred.AIC.MPM.baslm.PSF_test <- get.credible.intervals(AIC.MPM_pred_PSF_test)*ames_test$area
# Bayesian approach
# BIC
df_RMSE["pred.BIC.BPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.BIC.BPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.HPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.BIC.HPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.MPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.BIC.MPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.BPM.baslm.PSF","RMSE_test"] <- sqrt(mean((pred.BIC.BPM.baslm.PSF_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.HPM.baslm.PSF","RMSE_test"] <- sqrt(mean((pred.BIC.HPM.baslm.PSF_test$pred - ames_test$price)^2))*k
df_RMSE["pred.BIC.MPM.baslm.PSF","RMSE_test"] <- sqrt(mean((pred.BIC.MPM.baslm.PSF_test$pred - ames_test$price)^2))*k
# AIC
df_RMSE["pred.AIC.BPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.AIC.BPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.HPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.AIC.HPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.MPM.baslm.price.log","RMSE_test"] <- sqrt(mean((pred.AIC.MPM.baslm.price.log_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.BPM.baslm.PSF","RMSE_test"] <- sqrt(mean((pred.AIC.BPM.baslm.PSF_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.HPM.baslm.PSF","RMSE_test"] <- sqrt(mean((pred.AIC.HPM.baslm.PSF_test$pred - ames_test$price)^2))*k
df_RMSE["pred.AIC.MPM.baslm.PSF","RMSE_test"] <- sqrt(mean((pred.AIC.MPM.baslm.PSF_test$pred - ames_test$price)^2))*k
df_RMSE$RMSE_test_diff <- df_RMSE$RMSE_test - df_RMSE$RMSE
## RMSE RMSE_test RMSE_test_diff
## pred.AIC.lm.price.log 25602 25493 -108.9
## pred.AIC.BPM.baslm.price.log 25766 25303 -463.1
## pred.AIC.MPM.baslm.price.log 25787 25346 -440.6
## pred.AIC.HPM.baslm.price.log 26206 25706 -500.0
## pred.AIC.lm.PSF 26625 28378 1753.2
## pred.AIC.MPM.baslm.PSF 26643 28379 1736.2
## pred.AIC.BPM.baslm.PSF 26652 28360 1708.1
## pred.AIC.HPM.baslm.PSF 26850 28178 1328.9
## pred.BIC.HPM.baslm.price.log 27094 26346 -748.4
## pred.BIC.MPM.baslm.price.log 27117 26835 -281.5
## pred.BIC.BPM.baslm.PSF 27274 27856 581.4
## pred.BIC.MPM.baslm.PSF 27359 27828 468.9
## pred.BIC.HPM.baslm.PSF 27359 27828 468.9
## pred.BIC.BPM.baslm.price.log 27796 27313 -482.9
## pred.BIC.lm.price.log 28582 28333 -248.9
## pred.BIC.lm.PSF 29102 28970 -131.9
We got very close results for all of our top-3 models, both among themselves and between the train and test data. We will use the pred.AIC.lm.price.log model for the next step, though in a real-life situation we would probably try all of them.
Essentially, this means our top-3 models give roughly the same prediction accuracy. We built the models on the training data, applied them to the test data, and got roughly the same results; based on that, we can assume they are relatively reliable within their own restrictions.
First, we will start with the ANOVA output of our pre-final model, pred.AIC.lm.price.log.
## Analysis of Variance Table
##
## Response: price.log
## Df Sum Sq Mean Sq F value Pr(>F)
## Neighborhood 26 106.8 4.11 196.20 < 0.0000000000000002 ***
## area.log 1 30.6 30.62 1462.23 < 0.0000000000000002 ***
## Age.Blt 1 4.0 3.98 190.00 < 0.0000000000000002 ***
## Bldg.Type 4 3.0 0.75 36.05 < 0.0000000000000002 ***
## House.Style 6 2.9 0.48 23.15 < 0.0000000000000002 ***
## Overall.Qual 1 4.9 4.93 235.33 < 0.0000000000000002 ***
## Bsmt.Exposure 4 1.4 0.36 17.26 0.0000000000001117 ***
## Bsmt.Full.Bath 1 1.3 1.32 62.85 0.0000000000000063 ***
## Functional.Tr 1 0.2 0.24 11.43 0.00075 ***
## Sale.Condition 5 1.3 0.26 12.37 0.0000000000121697 ***
## area.log:AgeGroup 2 0.3 0.14 6.63 0.00138 **
## AgeGroup:Age.Blt 2 0.1 0.05 2.44 0.08786 .
## Residuals 945 19.8 0.02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using this information, we can make changes that may improve the accuracy of the final model. We will add some variables and transform others; we discuss this in more detail below.
ames_train <- ames_train %>%
mutate(Garage.Area = if_else(is.na(Garage.Area), as.numeric(0),as.numeric(Garage.Area) ),
Total.Bsmt.SF = if_else(is.na(Total.Bsmt.SF),as.numeric(0),as.numeric(Total.Bsmt.SF)))
m.full.lm.price.log.final <- lm(price.log ~ Neighborhood + area.log * AgeGroup + Age.Blt * AgeGroup +
Bldg.Type + House.Style + log(Overall.Qual) + Bsmt.Exposure +
Bsmt.Full.Bath + Functional.Tr + Sale.Condition +
log(Garage.Area + 1) + log(Total.Bsmt.SF + 1) + log(X1st.Flr.SF) + log(X2nd.Flr.SF + 1) + log(Lot.Area) +
log(as.numeric(Exter.Qual)),
data = ames_train)
m.AIC.lm.price.log.final <- stepAIC(m.full.lm.price.log.final, k = 2, trace = F)
anova(m.AIC.lm.price.log.final)
## Analysis of Variance Table
##
## Response: price.log
## Df Sum Sq Mean Sq F value Pr(>F)
## Neighborhood 26 106.8 4.11 203.80 < 0.0000000000000002 ***
## area.log 1 30.6 30.62 1518.85 < 0.0000000000000002 ***
## Age.Blt 1 4.0 3.98 197.36 < 0.0000000000000002 ***
## Bldg.Type 4 3.0 0.75 37.45 < 0.0000000000000002 ***
## log(Overall.Qual) 1 6.3 6.25 310.14 < 0.0000000000000002 ***
## Bsmt.Exposure 4 2.1 0.52 25.63 < 0.0000000000000002 ***
## Bsmt.Full.Bath 1 1.9 1.85 91.81 < 0.0000000000000002 ***
## Functional.Tr 1 0.2 0.16 7.95 0.0049 **
## Sale.Condition 5 1.1 0.23 11.30 0.00000000013 ***
## log(Garage.Area + 1) 1 0.1 0.11 5.55 0.0187 *
## log(Total.Bsmt.SF + 1) 1 0.5 0.51 25.15 0.00000063374 ***
## log(X1st.Flr.SF) 1 0.4 0.38 19.06 0.00001407664 ***
## log(Lot.Area) 1 0.2 0.20 9.79 0.0018 **
## log(as.numeric(Exter.Qual)) 1 0.2 0.20 9.75 0.0018 **
## area.log:AgeGroup 2 0.3 0.13 6.44 0.0017 **
## AgeGroup:Age.Blt 2 0.1 0.05 2.60 0.0751 .
## Residuals 946 19.1 0.02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing the residual sums of squares, we can assume that we got a more accurate model (19.1 against 19.8).
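The two selected models can also be compared directly on AIC (a sketch):
AIC(m.AIC.lm.price.log, m.AIC.lm.price.log.final)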
##
## Call:
## lm(formula = price.log ~ Neighborhood + area.log + AgeGroup +
## Age.Blt + Bldg.Type + log(Overall.Qual) + Bsmt.Exposure +
## Bsmt.Full.Bath + Functional.Tr + Sale.Condition + log(Garage.Area +
## 1) + log(Total.Bsmt.SF + 1) + log(X1st.Flr.SF) + log(Lot.Area) +
## log(as.numeric(Exter.Qual)) + area.log:AgeGroup + AgeGroup:Age.Blt,
## data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5596 -0.0619 0.0024 0.0645 0.5970
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.85814 0.46170 12.69 < 0.0000000000000002
## NeighborhoodNridgHt -0.07264 0.04041 -1.80 0.07258
## NeighborhoodNoRidge -0.07130 0.04853 -1.47 0.14213
## NeighborhoodGrnHill 0.36456 0.11087 3.29 0.00105
## NeighborhoodTimber 1.34109 0.45647 2.94 0.00338
## NeighborhoodSomerst -0.08524 0.04073 -2.09 0.03660
## NeighborhoodGreens 1.34349 0.45257 2.97 0.00307
## NeighborhoodVeenker 1.29788 0.45598 2.85 0.00452
## NeighborhoodCrawfor 1.20087 0.46834 2.56 0.01050
## NeighborhoodCollgCr 1.27945 0.45398 2.82 0.00493
## NeighborhoodBlmngtn -0.14087 0.05972 -2.36 0.01854
## NeighborhoodClearCr 1.36865 0.45819 2.99 0.00289
## NeighborhoodNWAmes 1.28478 0.45620 2.82 0.00496
## NeighborhoodGilbert 1.26717 0.45527 2.78 0.00549
## NeighborhoodSawyerW 1.25457 0.45443 2.76 0.00588
## NeighborhoodMitchel 1.29040 0.45264 2.85 0.00446
## NeighborhoodNPkVill 1.31238 0.45268 2.90 0.00383
## NeighborhoodNAmes 1.28849 0.45311 2.84 0.00456
## NeighborhoodSawyer 1.28931 0.45292 2.85 0.00451
## NeighborhoodSWISU 0.99566 0.46849 2.13 0.03382
## NeighborhoodEdwards 0.99443 0.46542 2.14 0.03289
## NeighborhoodBrkSide 1.07931 0.46475 2.32 0.02043
## NeighborhoodBlueste 1.32710 0.45444 2.92 0.00358
## NeighborhoodOldTown 1.01589 0.46628 2.18 0.02960
## NeighborhoodIDOTRR 0.91568 0.46531 1.97 0.04937
## NeighborhoodBrDale 1.21461 0.44964 2.70 0.00703
## NeighborhoodMeadowV 1.17403 0.44763 2.62 0.00886
## area.log 0.60250 0.05588 10.78 < 0.0000000000000002
## AgeGroupMedium NA NA NA NA
## AgeGroupOld NA NA NA NA
## Age.Blt -0.00943 0.00330 -2.86 0.00432
## Bldg.Type2fmCon -0.05330 0.03445 -1.55 0.12216
## Bldg.TypeDuplex -0.16997 0.02839 -5.99 0.0000000030404748
## Bldg.TypeTwnhs -0.08136 0.03868 -2.10 0.03569
## Bldg.TypeTwnhsE -0.04256 0.02586 -1.65 0.10015
## log(Overall.Qual) 0.41014 0.03360 12.21 < 0.0000000000000002
## Bsmt.ExposureGd 0.06266 0.01994 3.14 0.00173
## Bsmt.ExposureMn -0.04086 0.01976 -2.07 0.03893
## Bsmt.ExposureNo -0.03168 0.01416 -2.24 0.02545
## Bsmt.ExposureNoBasement 0.03167 0.10574 0.30 0.76463
## Bsmt.Full.Bath 0.07744 0.00982 7.89 0.0000000000000084
## Functional.TrTRUE 0.07548 0.02068 3.65 0.00028
## Sale.ConditionAdjLand 0.18937 0.11011 1.72 0.08580
## Sale.ConditionAlloca 0.20694 0.07606 2.72 0.00663
## Sale.ConditionFamily -0.05958 0.04017 -1.48 0.13829
## Sale.ConditionNormal 0.10949 0.01941 5.64 0.0000000223248549
## Sale.ConditionPartial 0.12180 0.02781 4.38 0.0000131717894850
## log(Garage.Area + 1) 0.00798 0.00409 1.95 0.05155
## log(Total.Bsmt.SF + 1) 0.02212 0.01577 1.40 0.16098
## log(X1st.Flr.SF) 0.10083 0.02627 3.84 0.00013
## log(Lot.Area) 0.04729 0.01605 2.95 0.00330
## log(as.numeric(Exter.Qual)) -0.05536 0.02263 -2.45 0.01463
## area.log:AgeGroupMedium -0.21104 0.05912 -3.57 0.00038
## area.log:AgeGroupOld -0.18486 0.06126 -3.02 0.00261
## AgeGroupMedium:Age.Blt 0.00631 0.00334 1.89 0.05914
## AgeGroupOld:Age.Blt 0.00709 0.00334 2.12 0.03393
##
## (Intercept) ***
## NeighborhoodNridgHt .
## NeighborhoodNoRidge
## NeighborhoodGrnHill **
## NeighborhoodTimber **
## NeighborhoodSomerst *
## NeighborhoodGreens **
## NeighborhoodVeenker **
## NeighborhoodCrawfor *
## NeighborhoodCollgCr **
## NeighborhoodBlmngtn *
## NeighborhoodClearCr **
## NeighborhoodNWAmes **
## NeighborhoodGilbert **
## NeighborhoodSawyerW **
## NeighborhoodMitchel **
## NeighborhoodNPkVill **
## NeighborhoodNAmes **
## NeighborhoodSawyer **
## NeighborhoodSWISU *
## NeighborhoodEdwards *
## NeighborhoodBrkSide *
## NeighborhoodBlueste **
## NeighborhoodOldTown *
## NeighborhoodIDOTRR *
## NeighborhoodBrDale **
## NeighborhoodMeadowV **
## area.log ***
## AgeGroupMedium
## AgeGroupOld
## Age.Blt **
## Bldg.Type2fmCon
## Bldg.TypeDuplex ***
## Bldg.TypeTwnhs *
## Bldg.TypeTwnhsE
## log(Overall.Qual) ***
## Bsmt.ExposureGd **
## Bsmt.ExposureMn *
## Bsmt.ExposureNo *
## Bsmt.ExposureNoBasement
## Bsmt.Full.Bath ***
## Functional.TrTRUE ***
## Sale.ConditionAdjLand .
## Sale.ConditionAlloca **
## Sale.ConditionFamily
## Sale.ConditionNormal ***
## Sale.ConditionPartial ***
## log(Garage.Area + 1) .
## log(Total.Bsmt.SF + 1)
## log(X1st.Flr.SF) ***
## log(Lot.Area) **
## log(as.numeric(Exter.Qual)) *
## area.log:AgeGroupMedium ***
## area.log:AgeGroupOld **
## AgeGroupMedium:Age.Blt .
## AgeGroupOld:Age.Blt *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.142 on 946 degrees of freedom
## Multiple R-squared: 0.892, Adjusted R-squared: 0.886
## F-statistic: 148 on 53 and 946 DF, p-value: <0.0000000000000002
We got increased residual inaccuracy for high levels of the Overall.Qual variable, which we can observe in the plot below. The distribution of this variable in the ames_train data is roughly symmetric, although that may not be easy to see. For model-development purposes, we decided it could help to shift the center of the distribution to the right, which can theoretically improve accuracy for high values.
plot_resid(df_ames_pred = pred.AIC.lm.price.log,
           outliersindex = outliersindex,
           numeric_variables = c(),
           categorical_variables = c("Overall.Qual"), indexPlot = F, pricepredPlot = F)
plotOverall.Qual <- ames_train %>%
ggplot() +
geom_histogram(aes(x = Overall.Qual), bins = 10) +
ggtitle("Overall.Qual")
plotOverall.Qual.log <- ames_train %>%
ggplot() +
geom_histogram(aes(x = log(Overall.Qual)), bins = 10) +
ggtitle("log(Overall.Qual)")
grid.arrange(plotOverall.Qual, plotOverall.Qual.log, ncol = 2)
## [1] "Overall.Qual summary:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 6.00 6.09 7.00 10.00
## [1] "log(Overall.Qual) summary:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.61 1.79 1.78 1.95 2.30
For the final model we include the interactions between area.log and AgeGroup and between Age.Blt and AgeGroup. In the EDA section we found an age distribution with three peaks, so we decided to examine the relationships with area and Age.Blt separately for each age group, and we found different behavior between those groups. We also found a relationship between area and price that led us to create the PSF (price per square foot) variable as a response variable, but it was not included in the final model.
df_area_slope <- data.frame("area_slope" = numeric(),
"age_slope" = numeric())
index_new <- which(ames_train$AgeGroup == "New")
index_medium <- which(ames_train$AgeGroup == "Medium")
index_old <- which(ames_train$AgeGroup == "Old")
df_area_slope["New","area_slope"] <- lm(price ~ area, data = ames_train[index_new,])[["coefficients"]][["area"]]
df_area_slope["Medium","area_slope"] <- lm(price ~ area, data = ames_train[index_medium,])[["coefficients"]][["area"]]
df_area_slope["Old","area_slope"] <- lm(price ~ area, data = ames_train[index_old,])[["coefficients"]][["area"]]
df_area_slope["New","age_slope"] <- lm(price ~ Age.Blt, data = ames_train[index_new,])[["coefficients"]][["Age.Blt"]]
df_area_slope["Medium","age_slope"] <- lm(price ~ Age.Blt, data = ames_train[index_medium,])[["coefficients"]][["Age.Blt"]]
df_area_slope["Old","age_slope"] <- lm(price ~ Age.Blt, data = ames_train[index_old,])[["coefficients"]][["Age.Blt"]]
ames_train %>%
ggplot(aes(x = area, y = price)) +
geom_point() +
facet_wrap("AgeGroup", ncol = 3) +
geom_smooth(method = "lm") +
ggtitle("Home price by area") + ylab("price")## `geom_smooth()` using formula 'y ~ x'
ames_train %>%
ggplot(aes(x = Age.Blt, y = price)) +
geom_point() +
facet_wrap("AgeGroup", ncol = 3) +
geom_smooth(method = "lm") +
ggtitle("Home price by Age.Blt") + ylab("price")## `geom_smooth()` using formula 'y ~ x'
## area_slope age_slope
## New 142.12 -2751.5
## Medium 85.97 -1475.1
## Old 66.93 -533.8
First, we performed exploratory data analysis to choose (or transform/create) the most reliable variables, response and explanatory, that visually make logical sense to include in the full models. Second, we decided to use two conceptually different approaches: Bayesian and frequentist.
For both approaches we used AIC (Akaike information criterion) and BIC (Bayesian information criterion) selection. For both criteria, the model with the smallest value is preferable. In general, BIC selection leads to fewer variables in the best model than adjusted R-squared or AIC.
For the Bayesian approach we also used four estimators: BMA, BPM, HPM, and MPM.
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
We got similar RMSE values for the train and test data, which confirms our earlier choice of the best preliminary model. As a next step, we explored the summary and ANOVA output, which give information about variable significance. The most significant predictor is Neighborhood, followed by log(area); other variables carry less weight. It seemed reasonable to add variables that measure other aspects of property area; following this logic, we added log(Garage.Area + 1), log(Total.Bsmt.SF + 1), log(X1st.Flr.SF), log(X2nd.Flr.SF + 1), and log(Lot.Area).
Also, from the residual plots we found increased residual inaccuracy for high levels of the Overall.Qual variable and decided to transform it to log(Overall.Qual). From the EDA we know that Exter.Qual has a good visual relationship with price, behaving much like Overall.Qual, which led us to add the log(as.numeric(Exter.Qual)) variable.
For the final model, we used the same frequentist approach to evaluate the final set of included variables.
## Warning in predict.lm(m.AIC.lm.price.log.final, ames_train, interval =
## "prediction"): prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(m.AIC.lm.price.log.final, ames_test, interval =
## "prediction"): prediction from a rank-deficient fit may be misleading
We got a mostly constant residual distribution.
Increased inaccuracy roughly within:
Underestimates roughly for:
Overestimates roughly for: none.
We will calculate the root mean squared error (RMSE) in original dollar values.
k <- nrow(ames_train)/nrow(ames_test)
df_RMSE["pred.AIC.lm.price.log.final","RMSE"] <- sqrt(mean((pred.AIC.lm.price.log.final$fit - ames_train$price)^2, na.rm =T))
df_RMSE["pred.AIC.lm.price.log.final","RMSE_test"] <- sqrt(mean((pred.AIC.lm.price.log.final_test$fit - ames_test$price)^2, na.rm =T))*k
df_RMSE$RMSE_test_diff <- df_RMSE$RMSE_test - df_RMSE$RMSE
df_RMSE %>%
  arrange(RMSE) %>%
  head(2)
## RMSE RMSE_test RMSE_test_diff
## pred.AIC.lm.price.log.final 25357 24791 -566.0
## pred.AIC.lm.price.log 25602 25493 -108.9
For our final model, pred.AIC.lm.price.log.final, we got slightly better RMSE values of 25,357 and 24,791 dollars for the train and test data sets respectively, compared with 25,602 and 25,493 for the pre-final pred.AIC.lm.price.log. The results are relatively similar for the train and test data sets (after adjusting for the number of rows), which suggests the model is relatively stable on outside data. It is worth noting that we got this result on the first try, which may indicate that the ames_test data set does not differ much from the training data.
In general, this means our average error stays around $25k.
What are some strengths and weaknesses of your model?
Among the model's strengths are stable predicted results for most cases. Based on all the output we got, we can expect the average error of the predicted price to be around $25-35k.
## [1] 159467
As for weaknesses, the same $25-35k is a relatively large number compared with the most common price of around $160k. Also, the model should be applied with great caution to houses younger than 15 years and to houses with predicted prices above $370k. The model cannot be applied to neighborhoods not present in the training data; some other model should be used for that purpose.
Let’s make the needed data transformations.
ames_validation <- ames_validation %>%
filter(Neighborhood != "Landmrk") %>%
filter(House.Style != "2.5Fin") %>%
mutate(area.log = log(area),
price.log = log(price),
PSF = price/area,
Functional.Tr = Functional == "Typ",
AgeGroup = factor(case_when(Neighborhood %in% vAge15 ~ "New",
Neighborhood %in% vAge50 ~ "Medium",
Neighborhood %in% vAge80 ~ "Old"),
levels = c("New","Medium","Old")),
Age.Blt = 2020 - Year.Built,
Bsmt.Exposure = if_else(is.na(Bsmt.Exposure) | Bsmt.Exposure == "", "NoBasement", as.character(Bsmt.Exposure)))
k <- nrow(ames_train)/nrow(ames_validation)
pred.AIC.lm.price.log.finals_validation <- data.frame(exp(predict(m.AIC.lm.price.log.final, ames_validation, interval = "prediction")))
## Warning in predict.lm(m.AIC.lm.price.log.final, ames_validation, interval =
## "prediction"): prediction from a rank-deficient fit may be misleading
df_RMSE["pred.AIC.lm.price.log.final","RMSE_validation"] <- sqrt(mean((pred.AIC.lm.price.log.finals_validation$fit -
ames_validation$price)^2, na.rm =T))*k
df_RMSE$RMSE_diff_validation <- df_RMSE$RMSE_validation - df_RMSE$RMSE
df_RMSE %>%
  arrange(RMSE) %>%
  head(1)
## RMSE RMSE_test RMSE_test_diff RMSE_validation
## pred.AIC.lm.price.log.final 25357 24791 -566 25369
## RMSE_diff_validation
## pred.AIC.lm.price.log.final 11.59
We got an RMSE of 25,369 dollars on the validation data set (adjusted for the number of rows). That is a relatively similar result compared with the RMSE for the train (25,357) and test (24,791) data.
coverage.prob <- mean(ames_validation$price > pred.AIC.lm.price.log.finals_validation$lwr &
ames_validation$price < pred.AIC.lm.price.log.finals_validation$upr, na.rm = T)
coverage.prob
## [1] 0.9697
About 97% of the prediction intervals contain the true price of the house in the validation data set, which means the final model properly reflects uncertainty.
The multiple R-squared of the final model is 0.892, meaning the model explains roughly 89.2% of the variability in house prices.
The most interesting insight from the data and the model is that price changes differently for different classes of property.
Since this is an observational study rather than an experiment, we can only draw conclusions about association, not causation.