As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.
load("ames_train.Rdata")
Use the code block below to load any necessary packages
library(statsr)
library(dplyr)
library(BAS)
library(ggplot2)
library(MASS)
library(GGally)
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
The response variable (“price”) shows a right-skewed distribution: most homes fall in the low price range ($100,000 ~ $200,000), with a long tail of a few homes at high prices (> $400,000). The home size variable (“area”), one of the predictors, is also right-skewed, with most homes between 1,000 and 2,000 sq ft and a few houses above 3,000 sq ft. When home price is plotted against home area, the relationship is roughly linear (a required condition for a linear regression model). However, the variance around that line increases as home area grows. When both variables are log-transformed, the linear relationship improves greatly. Price and area are log-transformed here because both variables are right-skewed and take only values greater than 0.
summary(ames_train$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129763 159467 181190 213000 615000
summary(ames_train$area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1092 1411 1477 1743 4676
par(mfrow= c(2,2))
hist(ames_train$price)
hist(ames_train$area)
plot(ames_train$area, ames_train$price, pch = 16, cex = 0.8, col = "blue")
abline(lm(ames_train$price ~ ames_train$area), col = "red")
plot(log(ames_train$area), log(ames_train$price), pch = 16, cex = 0.8, col = "blue")
abline(lm(log(ames_train$price) ~ log(ames_train$area)), col = "red")
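As a quick numeric check of the skew claims, one can compare a moment-based skewness before and after the log transform; the skew() helper below is hand-rolled for illustration, not from any loaded package:

skew <- function(x) mean((x - mean(x))^3) / sd(x)^3  # moment coefficient of skewness
skew(ames_train$price); skew(log(ames_train$price))
skew(ames_train$area); skew(log(ames_train$area))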
The relationship between home price and home area differs by sale condition. Partial (unfinished home) sales show the highest price per unit area, while prices in adjoining-land purchases (AdjLand) or allocations (Alloca) are nearly constant as area changes. In principle, then, a linear regression model should be fitted separately for each sale condition. However, there are not enough observations for sale conditions other than Normal, so the regression model in this study is built only on normal-condition sales.
ggplot(ames_train, aes(x = log(area), y = log(price), col = Sale.Condition)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
ames_train %>% group_by(Sale.Condition) %>% summarize(median= median(price), n = n())
## # A tibble: 6 x 3
## Sale.Condition median n
## <fct> <dbl> <int>
## 1 Abnorml 132000 61
## 2 AdjLand 138750 2
## 3 Alloca 152556. 4
## 4 Family 149000 17
## 5 Normal 155500 834
## 6 Partial 254146. 82
ames_train1 <- ames_train %>% filter(Sale.Condition == "Normal")
The home age variable (“Year.Built”) is plotted against price. Old homes (built before 1920) do not fit the regression line well: both very old homes and very new homes tend to command a price premium relative to “middle-aged” homes. Therefore a quadratic term for home age might be expected in a multiple linear regression model for price. To facilitate this, home age is rescaled by subtracting 1940 (roughly the midpoint of Year.Built) from “Year.Built” and dividing by 10 (decades instead of years). The resulting variable has a mean close to 3 and a standard deviation of about 2.9, and represents the number of decades away from 1940. I then created a quadratic variable, Home.Decade.Sqr (which is Home.Decade^2), and added it to the linear model. This transformation improved the constancy of the residual variance.
summary(ames_train1$Year.Built)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1972 1970 1997 2009
ggplot(ames_train1, aes(x = Year.Built, y = price)) + geom_point(pch = 16, col = "blue") + stat_smooth(method = "lm", se = FALSE, col = "red")
ames_train1 <- ames_train1 %>% mutate(Home.Decade = (Year.Built - 1940)/10)
ames_train1 <- ames_train1 %>% mutate(Home.Decade.Sqr = (Home.Decade)^2)
lm1 <- lm(log(price) ~ Year.Built, data = ames_train1)
lm2 <- lm(log(price) ~ Home.Decade + Home.Decade.Sqr, data = ames_train1)
par(mfrow = c(1, 2))
plot(lm1, which = 1, pch = 16, cex = 0.8, col = "blue")
plot(lm2, which = 1, pch = 16, cex = 0.8, col = "blue")
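The claimed center and spread of the rescaled age variable can be verified directly (a quick check, not part of the modeling pipeline):

mean(ames_train1$Home.Decade)  # expected to be close to 3
sd(ames_train1$Home.Decade)    # expected to be close to 2.9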
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)
Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
For the initial model, I chose the variables most likely to predict home price (“logprice”): size (“logarea”), location (“NeighborhoodII”), age (“Home.Decade”, “Home.Decade.Sqr”), quality (“Overall.Qual”), bedroom count (“Bedroom.AbvGr”), and bathroom count (“Full.Bath”). Price and area are log-transformed as logprice and logarea, and Neighborhood was releveled as NeighborhoodII. The initial linear regression model for home price is as follows:
\[
\begin{aligned}
logprice = {} & 7.444 + 0.559*logarea + 0.036*Home.Decade - 0.002*Home.Decade.Sqr \\
& - 0.249*NeighborhoodIIGr1 - 0.048*NeighborhoodIIGr2 + 0.108*NeighborhoodIIGr4 \\
& + 0.101*Overall.Qual - 0.027*Bedroom.AbvGr - 0.043*Full.Bath
\end{aligned}
\]
Because the response is log(price), coefficients are interpreted on the log scale. The coefficient on logarea (0.559) is an elasticity: holding everything else constant, a 1% increase in area is associated with roughly a 0.56% increase in price. Each one-point increase in Overall.Qual raises log price by 0.101, i.e. roughly a 10.7% increase in price. Neighborhood Group 3 is the reference level; relative to it, Group 1 and Group 2 lower log price by 0.249 and 0.048, respectively, while Group 4 raises it by 0.108, all else held constant. The age effect is more complicated: each additional decade (Home.Decade) raises log price by 0.036, while the quadratic term (Home.Decade.Sqr) subtracts 0.002 per unit, slightly bending the trend. Increasing the number of bedrooms or full baths appears to depreciate the home price in this model. Larger houses on average have more bedrooms and full bathrooms and sell for higher prices; holding the size of a house constant, however, additional bedrooms and full bathrooms are associated with lower valuations. This model can be used for prediction, but it is not appropriate for testing hypotheses about the relationships of these individual variables to home price.
ames_train1 <- ames_train1 %>% mutate(logprice = log(price))
ames_train1 <- ames_train1 %>% mutate(logarea = log(area))
ames_train1 <- ames_train1 %>% mutate(logLot.Area = log(Lot.Area))
ames_train1 <- ames_train1 %>% mutate(NeighborhoodII = Neighborhood)
levels(ames_train1$NeighborhoodII)[levels(ames_train1$NeighborhoodII)%in%c("MeadowV", "BrDale", "IDOTRR")] <- "Gr1"
levels(ames_train1$NeighborhoodII)[levels(ames_train1$NeighborhoodII)%in%c("OldTown", "Blueste", "BrkSide", "Edwards", "SWISU", "Sawyer", "NAmes", "NPkVill", "Mitchel", "Landmrk")] <- "Gr2"
levels(ames_train1$NeighborhoodII)[levels(ames_train1$NeighborhoodII)%in%c("Timber", "Somerst", "Veenker", "Greens", "Crawfor", "CollgCr","SawyerW", "Gilbert", "ClearCr", "NWAmes", "Blmngtn")] <- "Gr3"
levels(ames_train1$NeighborhoodII)[levels(ames_train1$NeighborhoodII)%in%c("StoneBr", "GrnHill", "NoRidge", "NridgHt")] <- "Gr4"
im <- lm(logprice ~ logarea + Home.Decade + Home.Decade.Sqr + NeighborhoodII + Overall.Qual + Bedroom.AbvGr + Full.Bath, data = ames_train1)
summary(im)
##
## Call:
## lm(formula = logprice ~ logarea + Home.Decade + Home.Decade.Sqr +
## NeighborhoodII + Overall.Qual + Bedroom.AbvGr + Full.Bath,
## data = ames_train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81896 -0.08984 -0.00070 0.09595 0.62093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.444172 0.180236 41.302 < 2e-16 ***
## logarea 0.559485 0.029368 19.050 < 2e-16 ***
## Home.Decade 0.036103 0.003037 11.886 < 2e-16 ***
## Home.Decade.Sqr -0.002196 0.000670 -3.278 0.00109 **
## NeighborhoodIIGr2 -0.048506 0.016640 -2.915 0.00365 **
## NeighborhoodIIGr1 -0.248779 0.027048 -9.198 < 2e-16 ***
## NeighborhoodIIGr4 0.108113 0.020822 5.192 2.62e-07 ***
## Overall.Qual 0.101434 0.006694 15.153 < 2e-16 ***
## Bedroom.AbvGr -0.026766 0.008917 -3.002 0.00277 **
## Full.Bath -0.043089 0.015335 -2.810 0.00507 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1516 on 824 degrees of freedom
## Multiple R-squared: 0.8442, Adjusted R-squared: 0.8424
## F-statistic: 495.9 on 9 and 824 DF, p-value: < 2.2e-16
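Since the response is log(price), a coefficient \(\beta\) corresponds to an approximate \((e^{\beta} - 1) \times 100\%\) change in price per unit change in the predictor. A small sketch of this conversion using the fitted coefficients:

# approximate percentage change in price per unit change in each predictor
(exp(coef(im)[c("Overall.Qual", "Bedroom.AbvGr", "Full.Bath")]) - 1) * 100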
Now, either using BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model, or do they disagree? What do you think this means?
I started with my initial model (im) and used principled stepwise statistical methods to select a single parsimonious model. The stepAIC function using BIC (Bayesian Information Criterion) as the criterion (in which \(k = \log(n)\)) returned the same linear model (im_BIC) as the best model. In addition, the Bayesian linear model (im_bas) using BMA (Bayesian Model Averaging) selected a highest-posterior-probability model with the same set of predictors (model dimension 10, including the intercept), with slightly different coefficients.
im_BIC <- stepAIC(im, k = log(834))
## Start: AIC=-3089.67
## logprice ~ logarea + Home.Decade + Home.Decade.Sqr + NeighborhoodII +
## Overall.Qual + Bedroom.AbvGr + Full.Bath
##
## Df Sum of Sq RSS AIC
## <none> 18.934 -3089.7
## - Full.Bath 1 0.1814 19.115 -3088.4
## - Bedroom.AbvGr 1 0.2070 19.141 -3087.3
## - Home.Decade.Sqr 1 0.2469 19.181 -3085.6
## - NeighborhoodII 3 2.6198 21.553 -3001.8
## - Home.Decade 1 3.2464 22.180 -2964.4
## - Overall.Qual 1 5.2762 24.210 -2891.4
## - logarea 1 8.3392 27.273 -2792.0
summary(im_BIC)
##
## Call:
## lm(formula = logprice ~ logarea + Home.Decade + Home.Decade.Sqr +
## NeighborhoodII + Overall.Qual + Bedroom.AbvGr + Full.Bath,
## data = ames_train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81896 -0.08984 -0.00070 0.09595 0.62093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.444172 0.180236 41.302 < 2e-16 ***
## logarea 0.559485 0.029368 19.050 < 2e-16 ***
## Home.Decade 0.036103 0.003037 11.886 < 2e-16 ***
## Home.Decade.Sqr -0.002196 0.000670 -3.278 0.00109 **
## NeighborhoodIIGr2 -0.048506 0.016640 -2.915 0.00365 **
## NeighborhoodIIGr1 -0.248779 0.027048 -9.198 < 2e-16 ***
## NeighborhoodIIGr4 0.108113 0.020822 5.192 2.62e-07 ***
## Overall.Qual 0.101434 0.006694 15.153 < 2e-16 ***
## Bedroom.AbvGr -0.026766 0.008917 -3.002 0.00277 **
## Full.Bath -0.043089 0.015335 -2.810 0.00507 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1516 on 824 degrees of freedom
## Multiple R-squared: 0.8442, Adjusted R-squared: 0.8424
## F-statistic: 495.9 on 9 and 824 DF, p-value: < 2.2e-16
im_bas <- bas.lm(logprice ~ logarea + Home.Decade + Home.Decade.Sqr + NeighborhoodII + Overall.Qual + Bedroom.AbvGr + Full.Bath, data = ames_train1, prior = "BIC", modelprior = uniform())
summary(im_bas)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.000000 1.000000
## logarea 1.0000000 1.0000 1.000000 1.000000
## Home.Decade 1.0000000 1.0000 1.000000 1.000000
## Home.Decade.Sqr 0.8672066 1.0000 1.000000 1.000000
## NeighborhoodIIGr2 1.0000000 1.0000 1.000000 1.000000
## NeighborhoodIIGr1 1.0000000 1.0000 1.000000 1.000000
## NeighborhoodIIGr4 1.0000000 1.0000 1.000000 1.000000
## Overall.Qual 1.0000000 1.0000 1.000000 1.000000
## Bedroom.AbvGr 0.7722829 1.0000 1.000000 0.000000
## Full.Bath 0.7350049 1.0000 0.000000 1.000000
## BF NA 1.0000 0.541435 0.309893
## PostProbs NA 0.4611 0.249700 0.142900
## R2 NA 0.8442 0.842700 0.842400
## dim NA 10.0000 9.000000 9.000000
## logmarg NA -1260.0062 -1260.619758 -1261.177754
## model 4 model 5
## Intercept 1.0000000 1.0000000
## logarea 1.0000000 1.0000000
## Home.Decade 1.0000000 1.0000000
## Home.Decade.Sqr 0.0000000 0.0000000
## NeighborhoodIIGr2 1.0000000 1.0000000
## NeighborhoodIIGr1 1.0000000 1.0000000
## NeighborhoodIIGr4 1.0000000 1.0000000
## Overall.Qual 1.0000000 1.0000000
## Bedroom.AbvGr 0.0000000 1.0000000
## Full.Bath 1.0000000 1.0000000
## BF 0.1538398 0.1301926
## PostProbs 0.0709000 0.0600000
## R2 0.8409000 0.8421000
## dim 8.0000000 9.0000000
## logmarg -1261.8780697 -1262.0449662
coef(im_bas)
##
## Marginal Posterior Summaries of Coefficients:
##
## Using BMA
##
## Based on the top 141 models
## post mean post SD post p(B != 0)
## Intercept 11.995840 0.005264 1.000000
## logarea 0.541211 0.034924 1.000000
## Home.Decade 0.035194 0.003450 1.000000
## Home.Decade.Sqr -0.001985 0.001025 0.867207
## NeighborhoodIIGr2 -0.046020 0.017311 1.000000
## NeighborhoodIIGr1 -0.246385 0.027363 1.000000
## NeighborhoodIIGr4 0.109901 0.021040 1.000000
## Overall.Qual 0.101918 0.006993 1.000000
## Bedroom.AbvGr -0.021509 0.014229 0.772283
## Full.Bath -0.034911 0.025273 0.735005
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
The residual plots show that the initial model (im) satisfies the assumptions of linear regression: (1) the variance of the residuals is roughly constant, (2) the residuals scatter around zero with no systematic trend against the fitted values (linearity), (3) the distribution of the residuals is close to normal, and (4) the residuals appear independent (i.e., there is no repeated pattern related to observation order). The Cook’s distance plot flags the most influential (high-leverage) outliers (observations 611, 21, and 325). Observation 325 has an extreme lot area (19,800 sq ft), observation 21 has an extreme Year.Built (1880), and observation 611 is extreme within NeighborhoodII group Gr1. This linear model may be missing variables needed to predict these houses, so adding more predictors might resolve this outlier issue.
par(mfrow = c(2,2))
plot(im, which = 1, pch = 16, cex = 0.7, col = "blue")
plot(im, which = 2, pch = 16, cex = 0.7, col = "blue")
hist(im$resid, col = "blue")
plot(im$resid, pch = 16, cex = 0.7, col = "blue")
plot(im, which = 4, col = "blue")
You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
The RMSE from the frequentist linear model (im) is $28,681. This means the typical error in price prediction is about $28,681 when the im model is used. This is slightly smaller than the RMSE from the Bayesian linear model (im_bas), which is $28,807.43.
pred.im.train <- exp(predict(im, ames_train1))
resid.im.train <- ames_train1$price - pred.im.train
RMSE.im.train <- sqrt(mean(resid.im.train^2))
RMSE.im.train
## [1] 28681
pred.im_bas.train <- exp(predict(im_bas, ames_train1, estimator = "BMA")$fit)
resid.im_bas.train <- ames_train1$price - pred.im_bas.train
RMSE.im_bas.train <- sqrt(mean(resid.im_bas.train^2))
RMSE.im_bas.train
## [1] 28807.43
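Since this exponentiate-and-compare calculation recurs throughout the report, a small helper can reduce repetition; rmse_dollars is a hypothetical name, not from any loaded package:

rmse_dollars <- function(pred_logprice, actual_price) {
  # predictions are on the log scale; convert back to dollars before squaring
  sqrt(mean((actual_price - exp(pred_logprice))^2))
}
rmse_dollars(predict(im, ames_train1), ames_train1$price)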
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
When the frequentist linear regression model (im) is applied to the out-of-sample data, the RMSE is $29,383.23, slightly greater than the RMSE on the training data ($28,681). When the Bayesian linear regression model (im_bas) is used, the out-of-sample RMSE ($29,407.58) is similarly slightly greater than the training RMSE ($28,807.43). So there is some overfitting to the training data, but it is not large enough to warrant modifying the initial model. With the initial model, the proportion of test prices that fall within the 95% prediction interval is very close to 95% (94.86%). Therefore my initial model properly reflects uncertainty.
ames_test1 <- ames_test %>% filter(Sale.Condition == "Normal")
ames_test1 <- ames_test1 %>% mutate(logprice = log(price))
ames_test1 <- ames_test1 %>% mutate(logarea = log(area))
ames_test1 <- ames_test1 %>% mutate(logLot.Area = log(Lot.Area))
ames_test1 <- ames_test1 %>% mutate(Home.Decade = (Year.Built - 1940)/10)
ames_test1 <- ames_test1 %>% mutate(Home.Decade.Sqr = (Home.Decade)^2)
ames_test1 <- ames_test1 %>% mutate(NeighborhoodII = Neighborhood)
levels(ames_test1$NeighborhoodII)[levels(ames_test1$NeighborhoodII)%in%c("MeadowV", "BrDale", "IDOTRR")] <- "Gr1"
levels(ames_test1$NeighborhoodII)[levels(ames_test1$NeighborhoodII)%in%c("OldTown", "Blueste", "BrkSide", "Edwards", "SWISU", "Sawyer", "NAmes", "NPkVill", "Mitchel", "Landmrk")] <- "Gr2"
levels(ames_test1$NeighborhoodII)[levels(ames_test1$NeighborhoodII)%in%c("Timber", "Somerst", "Veenker", "Greens", "Crawfor", "CollgCr","SawyerW", "Gilbert", "ClearCr", "NWAmes", "Blmngtn")] <- "Gr3"
levels(ames_test1$NeighborhoodII)[levels(ames_test1$NeighborhoodII)%in%c("StoneBr", "GrnHill", "NoRidge", "NridgHt")] <- "Gr4"
pred.im.test <- exp(predict(im, ames_test1))
resid.im.test <- ames_test1$price - pred.im.test
RMSE.im.test <- sqrt(mean(resid.im.test^2))
RMSE.im.test
## [1] 29383.23
pred.im_bas.test <- exp(predict(im_bas, ames_test1, estimator = "BMA")$fit)
resid.im_bas.test <- ames_test1$price - pred.im_bas.test
RMSE.im_bas.test <- sqrt(mean(resid.im_bas.test^2))
RMSE.im_bas.test
## [1] 29407.58
int.im.test <- exp(predict(im, ames_test1, interval = "prediction"))
cover.im.test <- mean(ames_test1$price > int.im.test[,"lwr"] &
ames_test1$price < int.im.test[,"upr"])
cover.im.test
## [1] 0.9485924
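The test-set preparation above duplicates the training-set steps line by line; one possible way to keep the three data sets in sync is to wrap the pipeline in a function (prep_ames is a hypothetical helper mirroring the steps used here):

prep_ames <- function(df) {
  df <- df %>%
    filter(Sale.Condition == "Normal") %>%
    mutate(logprice = log(price),
           logarea = log(area),
           logLot.Area = log(Lot.Area),
           Home.Decade = (Year.Built - 1940) / 10,
           Home.Decade.Sqr = Home.Decade^2,
           NeighborhoodII = Neighborhood)
  grp <- list(Gr1 = c("MeadowV", "BrDale", "IDOTRR"),
              Gr2 = c("OldTown", "Blueste", "BrkSide", "Edwards", "SWISU",
                      "Sawyer", "NAmes", "NPkVill", "Mitchel", "Landmrk"),
              Gr3 = c("Timber", "Somerst", "Veenker", "Greens", "Crawfor",
                      "CollgCr", "SawyerW", "Gilbert", "ClearCr", "NWAmes", "Blmngtn"),
              Gr4 = c("StoneBr", "GrnHill", "NoRidge", "NridgHt"))
  # collapse the 27 Neighborhood levels into the four groups defined above
  for (g in names(grp)) {
    levels(df$NeighborhoodII)[levels(df$NeighborhoodII) %in% grp[[g]]] <- g
  }
  df
}
# e.g. ames_test1 <- prep_ames(ames_test)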
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
The final model (fm_BIC) is as follows:
\[
\begin{aligned}
logprice = {} & 5.25 - 0.1*NeighborhoodIIGr1 - 0.026*NeighborhoodIIGr2 + 0.0823*NeighborhoodIIGr4 \\
& + 0.349*logarea + 0.028*Home.Decade + 0.078*Overall.Qual + 0.094*logLot.Area \\
& + 0.00014*Garage.Area + 0.00018*BsmtFin.SF.1 + 0.000075*Bsmt.Unf.SF \\
& + 0.0012*Year.Remod.Add + 0.0512*Overall.Cond + 0.037*Fireplaces
\end{aligned}
\]
fm <- lm(logprice ~ NeighborhoodII + logarea + Home.Decade + Home.Decade.Sqr + Overall.Qual + logLot.Area + Garage.Area + BsmtFin.SF.1 + Bsmt.Unf.SF + Year.Remod.Add + Overall.Cond + Full.Bath + Fireplaces, data = ames_train1)
summary(fm)
##
## Call:
## lm(formula = logprice ~ NeighborhoodII + logarea + Home.Decade +
## Home.Decade.Sqr + Overall.Qual + logLot.Area + Garage.Area +
## BsmtFin.SF.1 + Bsmt.Unf.SF + Year.Remod.Add + Overall.Cond +
## Full.Bath + Fireplaces, data = ames_train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75835 -0.05806 0.00056 0.06169 0.37042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.471e+00 5.739e-01 9.534 < 2e-16 ***
## NeighborhoodIIGr2 -2.620e-02 1.197e-02 -2.188 0.028959 *
## NeighborhoodIIGr1 -1.000e-01 2.080e-02 -4.808 1.82e-06 ***
## NeighborhoodIIGr4 8.017e-02 1.489e-02 5.384 9.54e-08 ***
## logarea 3.703e-01 1.962e-02 18.873 < 2e-16 ***
## Home.Decade 2.754e-02 2.364e-03 11.650 < 2e-16 ***
## Home.Decade.Sqr 9.158e-04 5.772e-04 1.587 0.112932
## Overall.Qual 7.555e-02 5.069e-03 14.906 < 2e-16 ***
## logLot.Area 9.474e-02 8.835e-03 10.723 < 2e-16 ***
## Garage.Area 1.376e-04 2.419e-05 5.689 1.78e-08 ***
## BsmtFin.SF.1 1.840e-04 1.266e-05 14.533 < 2e-16 ***
## Bsmt.Unf.SF 7.462e-05 1.264e-05 5.904 5.21e-09 ***
## Year.Remod.Add 9.976e-04 3.009e-04 3.316 0.000954 ***
## Overall.Cond 5.284e-02 4.331e-03 12.200 < 2e-16 ***
## Full.Bath -2.355e-02 1.072e-02 -2.197 0.028318 *
## Fireplaces 3.910e-02 7.181e-03 5.446 6.83e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1071 on 818 degrees of freedom
## Multiple R-squared: 0.9228, Adjusted R-squared: 0.9214
## F-statistic: 652 on 15 and 818 DF, p-value: < 2.2e-16
fm_BIC <- stepAIC(fm, k = log(834))
## Start: AIC=-3635.32
## logprice ~ NeighborhoodII + logarea + Home.Decade + Home.Decade.Sqr +
## Overall.Qual + logLot.Area + Garage.Area + BsmtFin.SF.1 +
## Bsmt.Unf.SF + Year.Remod.Add + Overall.Cond + Full.Bath +
## Fireplaces
##
## Df Sum of Sq RSS AIC
## - Home.Decade.Sqr 1 0.0289 9.4062 -3639.5
## - Full.Bath 1 0.0553 9.4327 -3637.1
## <none> 9.3773 -3635.3
## - Year.Remod.Add 1 0.1260 9.5034 -3630.9
## - Fireplaces 1 0.3400 9.7173 -3612.3
## - Garage.Area 1 0.3710 9.7483 -3609.7
## - Bsmt.Unf.SF 1 0.3996 9.7769 -3607.2
## - NeighborhoodII 3 0.5859 9.9633 -3605.0
## - logLot.Area 1 1.3180 10.6954 -3532.4
## - Home.Decade 1 1.5560 10.9334 -3514.0
## - Overall.Cond 1 1.7062 11.0835 -3502.6
## - BsmtFin.SF.1 1 2.4212 11.7985 -3450.5
## - Overall.Qual 1 2.5470 11.9244 -3441.6
## - logarea 1 4.0833 13.4607 -3340.6
##
## Step: AIC=-3639.48
## logprice ~ NeighborhoodII + logarea + Home.Decade + Overall.Qual +
## logLot.Area + Garage.Area + BsmtFin.SF.1 + Bsmt.Unf.SF +
## Year.Remod.Add + Overall.Cond + Full.Bath + Fireplaces
##
## Df Sum of Sq RSS AIC
## - Full.Bath 1 0.0441 9.4503 -3642.3
## <none> 9.4062 -3639.5
## - Year.Remod.Add 1 0.2485 9.6547 -3624.5
## - Fireplaces 1 0.3132 9.7194 -3618.9
## - Garage.Area 1 0.4013 9.8075 -3611.4
## - Bsmt.Unf.SF 1 0.4143 9.8205 -3610.3
## - NeighborhoodII 3 0.6402 10.0465 -3604.7
## - logLot.Area 1 1.2894 10.6956 -3539.1
## - Overall.Cond 1 1.7770 11.1832 -3501.9
## - Home.Decade 1 1.8725 11.2787 -3494.8
## - BsmtFin.SF.1 1 2.4086 11.8148 -3456.1
## - Overall.Qual 1 2.9531 12.3593 -3418.5
## - logarea 1 4.0568 13.4630 -3347.2
##
## Step: AIC=-3642.31
## logprice ~ NeighborhoodII + logarea + Home.Decade + Overall.Qual +
## logLot.Area + Garage.Area + BsmtFin.SF.1 + Bsmt.Unf.SF +
## Year.Remod.Add + Overall.Cond + Fireplaces
##
## Df Sum of Sq RSS AIC
## <none> 9.4503 -3642.3
## - Year.Remod.Add 1 0.2300 9.6803 -3629.0
## - Fireplaces 1 0.3284 9.7787 -3620.5
## - Bsmt.Unf.SF 1 0.4047 9.8550 -3614.1
## - Garage.Area 1 0.4085 9.8588 -3613.7
## - NeighborhoodII 3 0.6206 10.0709 -3609.4
## - logLot.Area 1 1.3282 10.7785 -3539.4
## - Home.Decade 1 1.8293 11.2796 -3501.5
## - Overall.Cond 1 1.8501 11.3004 -3499.9
## - BsmtFin.SF.1 1 2.4274 11.8777 -3458.4
## - Overall.Qual 1 2.9953 12.4456 -3419.4
## - logarea 1 4.8352 14.2854 -3304.4
summary(fm_BIC)
##
## Call:
## lm(formula = logprice ~ NeighborhoodII + logarea + Home.Decade +
## Overall.Qual + logLot.Area + Garage.Area + BsmtFin.SF.1 +
## Bsmt.Unf.SF + Year.Remod.Add + Overall.Cond + Fireplaces,
## data = ames_train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74716 -0.05958 -0.00059 0.06119 0.35376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.250e+00 5.037e-01 10.424 < 2e-16 ***
## NeighborhoodIIGr2 -2.610e-02 1.136e-02 -2.298 0.0218 *
## NeighborhoodIIGr1 -1.004e-01 2.033e-02 -4.939 9.52e-07 ***
## NeighborhoodIIGr4 8.228e-02 1.491e-02 5.520 4.55e-08 ***
## logarea 3.494e-01 1.706e-02 20.483 < 2e-16 ***
## Home.Decade 2.813e-02 2.233e-03 12.599 < 2e-16 ***
## Overall.Qual 7.834e-02 4.859e-03 16.122 < 2e-16 ***
## logLot.Area 9.364e-02 8.723e-03 10.736 < 2e-16 ***
## Garage.Area 1.433e-04 2.408e-05 5.953 3.89e-09 ***
## BsmtFin.SF.1 1.841e-04 1.269e-05 14.513 < 2e-16 ***
## Bsmt.Unf.SF 7.491e-05 1.264e-05 5.926 4.57e-09 ***
## Year.Remod.Add 1.175e-03 2.630e-04 4.468 9.01e-06 ***
## Overall.Cond 5.117e-02 4.039e-03 12.670 < 2e-16 ***
## Fireplaces 3.747e-02 7.020e-03 5.338 1.22e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1074 on 820 degrees of freedom
## Multiple R-squared: 0.9222, Adjusted R-squared: 0.921
## F-statistic: 747.8 on 13 and 820 DF, p-value: < 2.2e-16
fm_bas <- bas.lm(logprice ~ NeighborhoodII + logarea + Home.Decade + Home.Decade.Sqr + Overall.Qual + logLot.Area + Garage.Area + BsmtFin.SF.1 + Bsmt.Unf.SF + Year.Remod.Add + Overall.Cond + Full.Bath + Fireplaces, data = ames_train1, prior = "BIC", modelprior = uniform())
summary(fm_bas)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.00000000
## NeighborhoodIIGr2 0.99999990 1.0000 1.0000000 1.00000000
## NeighborhoodIIGr1 0.99999990 1.0000 1.0000000 1.00000000
## NeighborhoodIIGr4 0.99999990 1.0000 1.0000000 1.00000000
## logarea 1.00000000 1.0000 1.0000000 1.00000000
## Home.Decade 1.00000000 1.0000 1.0000000 1.00000000
## Home.Decade.Sqr 0.08671465 0.0000 0.0000000 1.00000000
## Overall.Qual 1.00000000 1.0000 1.0000000 1.00000000
## logLot.Area 1.00000000 1.0000 1.0000000 1.00000000
## Garage.Area 0.99999918 1.0000 1.0000000 1.00000000
## BsmtFin.SF.1 1.00000000 1.0000 1.0000000 1.00000000
## Bsmt.Unf.SF 0.99999927 1.0000 1.0000000 1.00000000
## Year.Remod.Add 0.98982858 1.0000 1.0000000 1.00000000
## Overall.Cond 1.00000000 1.0000 1.0000000 1.00000000
## Full.Bath 0.20324629 0.0000 1.0000000 0.00000000
## Fireplaces 0.99997893 1.0000 1.0000000 1.00000000
## BF NA 1.0000 0.2431523 0.07538122
## PostProbs NA 0.7338 0.1784000 0.05530000
## R2 NA 0.9222 0.9226000 0.92240000
## dim NA 14.0000 15.0000000 15.00000000
## logmarg NA -983.6842 -985.0982967 -986.26942647
## model 4 model 5
## Intercept 1.00000000 1.000000e+00
## NeighborhoodIIGr2 1.00000000 1.000000e+00
## NeighborhoodIIGr1 1.00000000 1.000000e+00
## NeighborhoodIIGr4 1.00000000 1.000000e+00
## logarea 1.00000000 1.000000e+00
## Home.Decade 1.00000000 1.000000e+00
## Home.Decade.Sqr 1.00000000 1.000000e+00
## Overall.Qual 1.00000000 1.000000e+00
## logLot.Area 1.00000000 1.000000e+00
## Garage.Area 1.00000000 1.000000e+00
## BsmtFin.SF.1 1.00000000 1.000000e+00
## Bsmt.Unf.SF 1.00000000 1.000000e+00
## Year.Remod.Add 1.00000000 0.000000e+00
## Overall.Cond 1.00000000 1.000000e+00
## Full.Bath 1.00000000 0.000000e+00
## Fireplaces 1.00000000 1.000000e+00
## BF 0.03033427 9.107045e-03
## PostProbs 0.02230000 6.700000e-03
## R2 0.92280000 9.213000e-01
## dim 16.00000000 1.400000e+01
## logmarg -987.17970675 -9.883829e+02
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
As shown above, the response variable (“price”) and the size variables (“area” and “Lot.Area”) are log-transformed, since the log transformation yields a much more linear relationship. I also tried binning Lot.Area into categories, so that a unit change in category would be associated with a fixed change in price; however, this transformation gave no advantage over the log transformation.
Both very old homes and very new homes tend to command a price premium relative to “middle-aged” homes, suggesting a quadratic effect of home age on home price. To include this quadratic relationship in the model, the “Year.Built” variable was transformed into “Home.Decade” and “Home.Decade.Sqr”: Home.Decade was calculated by subtracting the middle value (1940, between 1872 and 2009) from Year.Built and then dividing by 10, and Home.Decade.Sqr is the square of Home.Decade.
Neighborhood is a categorical variable with 27 levels, some of which are not significantly different from each other. Therefore, the “Neighborhood” variable is releveled into 4 levels (Gr1, Gr2, Gr3, Gr4) in the NeighborhoodII variable.
neighborhood_price <- with(ames_train1, reorder(Neighborhood,price,median))
give.n <- function(x){
return(c(y = median(x)*0.96, label = length(x)))
}
ggplot(data=ames_train1, aes(x=neighborhood_price, y=log(price), fill = neighborhood_price))+
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Prices of houses in different neighborhoods")+
xlab("Neighborhood")+
ylab("Log(Price)")+
scale_y_continuous() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks=element_blank()) +
stat_summary(fun.data = give.n, geom = "text", fun.y = median,
position = position_dodge(width = 0.75), size = 3, color = "Red")
neighborhoodII_price <- with(ames_train1, reorder(NeighborhoodII,price,median))
give.n <- function(x){
return(c(y = median(x)*0.97, label = length(x)))
}
ggplot(data=ames_train1, aes(x=neighborhoodII_price, y=log(price), fill = neighborhoodII_price))+
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Prices of houses in different neighborhoodII")+
xlab("Neighborhood")+
ylab("Log(Price)")+
scale_y_continuous() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks=element_blank()) +
stat_summary(fun.data = give.n, geom = "text", fun.y = median,
position = position_dodge(width = 0.75), size = 4, color = "Red")
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
When predictor variables are correlated, including both in the model causes redundant fitting and produces unreliable coefficients. When Bedroom.AbvGr or Full.Bath is regressed against price on its own, each has a positive coefficient. However, when these variables are modeled together with the area variable, the model shows negative coefficients for the bedroom and full-bath counts (see the summary of im). Such a model can still be used for prediction, but it is not suitable for hypothesis tests about the relationship between an individual predictor and the response.
I checked whether there is an interaction between bedroom count and full-bath count, for example whether having two or more full baths relates negatively to price in one- or two-bedroom homes but positively in three- or four-bedroom homes. To study this interaction, I added a “BedBath” variable, the product of the Bedroom.AbvGr and Full.Bath variables. However, this term ended up not significant in the linear model, so I did not include the interaction. A minimal sketch of this check follows.
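Assuming the same training data, one way to run that check; in R's formula syntax, Bedroom.AbvGr * Full.Bath expands to the main effects plus the product (interaction) term, equivalent to the hand-made BedBath variable:

im_int <- lm(logprice ~ logarea + Home.Decade + Home.Decade.Sqr + NeighborhoodII +
               Overall.Qual + Bedroom.AbvGr * Full.Bath, data = ames_train1)
summary(im_int)$coefficients["Bedroom.AbvGr:Full.Bath", ]  # the interaction row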
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
Home price is determined by location, size, age, and quality. As the location variable, I chose “NeighborhoodII” (the releveled Neighborhood), which contributes 3 indicator variables (Gr3 is the reference; Gr1, Gr2, Gr4). As size variables, I included five: area (“logarea”), Lot.Area (“logLot.Area”), “Garage.Area”, and the finished and unfinished basement areas (“BsmtFin.SF.1” and “Bsmt.Unf.SF”). As age variables, I included three: “Home.Decade” (transformed from Year.Built), “Home.Decade.Sqr”, and “Year.Remod.Add”. As quality variables, I included four: “Overall.Qual”, “Overall.Cond”, “Full.Bath”, and “Fireplaces”.
I used forward selection with adjusted R² as the criterion (the greater this value, the better the model), adding variables one at a time and keeping those that increased adjusted R², until I built the full model (fm) with 13 predictor variables (15 model terms once NeighborhoodII is expanded); a sketch of one such step is shown just below. Then I used the stepAIC function and the Bayesian linear model function, which search for the best model by stepwise iteration using BIC or BMA as the criterion, respectively. fm_BIC dropped the Home.Decade.Sqr and Full.Bath variables, and the Bayesian model's highest-posterior-probability model (fm_bas, model 1) drops the same two variables, so the two methods agree.
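As referenced above, a sketch of one forward-selection step, assuming an illustrative base model and candidate set (not the exact sequence used):

base_terms <- c("logarea", "NeighborhoodII", "Overall.Qual")
candidates <- c("Overall.Cond", "Fireplaces", "Garage.Area", "logLot.Area")
# adjusted R^2 after adding each candidate to the base model
sapply(candidates, function(v) {
  f <- reformulate(c(base_terms, v), response = "logprice")
  summary(lm(f, data = ames_train1))$adj.r.squared
})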
I picked the fm_BIC model (model 1 in the Bayesian output) as my prediction model, based on adjusted R² and the significance of the variables in the model. Thus the Home.Decade.Sqr and Full.Bath variables are removed. Relative to the initial model, the bedroom-count variable (“Bedroom.AbvGr”) is also excluded due to collinearity with the area variable: bigger homes have more bedrooms, and including two correlated variables makes the fit redundant and the coefficients unstable.
Some variables are not included because they have only a remote relationship to price; others are excluded because of collinearity with variables already in the model. For example, Garage.Cars and Garage.Area are strongly correlated, so Garage.Area is included but Garage.Cars is not (see the plot and correlation check below).
ggplot(ames_train1, aes(x = Bedroom.AbvGr, y = logarea)) + geom_point( col = "blue") + geom_smooth(method = lm, col= "red", lty = "dashed", se = FALSE)
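To quantify the collinearity noted above (a quick check, not a formal diagnostic):

cor(ames_train1$logarea, ames_train1$Bedroom.AbvGr, use = "complete.obs")
cor(ames_train1$Garage.Area, ames_train1$Garage.Cars, use = "complete.obs")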
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
Testing the final model on out-of-sample data gives a slightly greater RMSE than on the training data: $19,759.67 (out-of-sample) vs. $19,476.08 (training). However, both values are greatly reduced from the initial model, which gave an RMSE of $29,383.23 on out-of-sample data and $28,681.00 on the training data. The slight overfitting could perhaps be reduced further by simplifying the model; however, I did not change the model for this reason.
The coverage probability for the final model (94.98%) is close to 95% on the out-of-sample data, suggesting that the final model properly reflects uncertainty. However, the 95% prediction intervals are somewhat wide.
pred.fm_BIC.train <- exp(predict(fm_BIC, ames_train1))
resid.fm_BIC.train <- ames_train1$price - pred.fm_BIC.train
RMSE.fm_BIC.train <- sqrt(mean(resid.fm_BIC.train^2))
RMSE.fm_BIC.train
## [1] 19476.08
pred.fm_BIC.test <- exp(predict(fm_BIC, ames_test1))
resid.fm_BIC.test <- ames_test1$price - pred.fm_BIC.test
RMSE.fm_BIC.test <- sqrt(mean(resid.fm_BIC.test^2))
RMSE.fm_BIC.test
## [1] 19759.67
int.fm_BIC.test <- round(exp(predict(fm_BIC, ames_test1, interval = "prediction")))
cover.fm_BIC.test <- mean(ames_test1$price > int.fm_BIC.test[, "lwr"] & ames_test1$price < int.fm_BIC.test[, "upr"])
cover.fm_BIC.test
## [1] 0.9498164
prediction.interval.test <- cbind(int.fm_BIC.test, ames_test1$price)
colnames(prediction.interval.test) <-c("fit", "lwr", "upr", "saleprice")
head(prediction.interval.test, 20)
## fit lwr upr saleprice
## 1 304639 246151 377024 305900
## 2 237625 192073 293980 257500
## 3 109083 88273 134798 116000
## 4 145140 117428 179393 143450
## 5 294760 238287 364618 277500
## 6 197786 159856 244718 200000
## 7 123616 100021 152777 105000
## 8 605220 488601 749673 755000
## 9 232999 188368 288204 204000
## 10 231658 187283 286547 255000
## 11 174117 140593 215634 156500
## 12 85851 69342 106292 86000
## 13 221157 179014 273222 240900
## 14 248478 200879 307356 268000
## 15 213160 172322 263677 244000
## 16 167138 135143 206709 155000
## 17 155467 125774 192170 144500
## 18 109316 88272 135377 122500
## 19 145760 117882 180232 141000
## 20 119820 96969 148055 131000
For your final model, create and briefly interpret an informative plot of the residuals.
The residual plots for the final model show that it broadly satisfies the model assumptions: constant residual variance, no systematic trend in the residuals, an approximately normal residual distribution, and independence of the residuals. However, a few prominent outliers affect all of those assumptions. Cook’s distance flags three high-influence outliers in the final model (#325, #560, #611). The hatvalues function returns the highest-leverage home (#472).
Home #611 has the largest residual, with a sold price of $40,000 against a predicted value of $84,440, while the predicted prices for homes #325 and #560 ($296,657 and $161,471, respectively) were much lower than the actual sold prices ($415,000 and $230,000, respectively). A high-leverage home does not necessarily have a large residual, since it has already pulled the fit toward itself: the predicted price for home #472 is $100,127 against a sale price of $103,000.
When I investigated these homes, I could not find any obvious data errors. Homes #325 and #472 have extremely large lot areas. Homes #472 and #611 are in unusual MS Zoning categories (industrial and commercial, respectively), and both are located in the lowest-priced neighborhood group (IDOTRR, Gr1). In addition, home #472 has the lowest overall condition rating (1). Homes #325 and #611 were outliers in the initial model as well, suggesting that improving the regression model (adding additional predictors) did not resolve the outlier issue for these homes. The model may need to be respecified in some other way.
I used the Bayes.outlier function to check whether any observation lies more than three standard deviations out (k = 3). There are several candidate outliers, but their posterior probabilities are around 4e-05, and none of them exceeds the 0.5 threshold. Therefore, no outlier would dramatically change the regression if excluded.
par(mfrow = c(2,2))
plot(fm_BIC, which = 1, pch = 16, cex = 0.7, col = "blue")
plot(fm_BIC, which = 2, pch = 16, cex = 0.7, col = "blue")
hist(fm_BIC$resid, col = "blue")
plot(fm_BIC$resid, pch = 16, cex = 0.7, col = "blue")
plot(fm_BIC, which = 4, col = "blue")
which(hatvalues(fm_BIC) == max(hatvalues(fm_BIC)))
## 472
## 472
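The influential cases can also be listed numerically rather than read off the Cook's distance plot; a quick sketch:

# largest Cook's distances in the final model
head(sort(cooks.distance(fm_BIC), decreasing = TRUE), 4)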
outlier1 <- Bayes.outlier(fm_BIC, k = 3)
ggplot(data.frame(probability = outlier1$prob.outlier,
case = 1:length(outlier1$prob.outlier)),
aes(ymax=probability, ymin=0, x=case)) +
geom_linerange() +
ylab("Probability")
which(outlier1$prob.outlier > 0.50)
## integer(0)
outlier <- ames_train1[c(325, 472, 560, 611), ]
p.price <- round(exp(predict(fm_BIC, outlier)))
out.data <- t(ames_train1[c(325, 472, 560, 611), ])
out.data <- rbind(p.price, out.data)
out.data
## 1 2 3 4
## p.price "296657" "100127" "161471" "84440"
## PID " 905427030" "1007100110" " 916252170" " 911102170"
## area "3672" "1836" "1295" "1317"
## price "415000" "103000" "230000" " 40000"
## MS.SubClass " 75" " 70" "120" " 70"
## MS.Zoning "RL" "I (all)" "RM" "C (all)"
## Lot.Frontage "60" NA NA "50"
## Lot.Area "19800" "56600" " 8239" " 8500"
## Street "Pave" "Pave" "Pave" "Pave"
## Alley NA NA NA "Pave"
## Lot.Shape "Reg" "IR1" "IR1" "Reg"
## Land.Contour "Lvl" "Low" "Lvl" "Lvl"
## Utilities "AllPub" "AllPub" "AllPub" "AllPub"
## Lot.Config "Inside" "Inside" "Inside" "Inside"
## Land.Slope "Gtl" "Gtl" "Gtl" "Gtl"
## Neighborhood "Edwards" "IDOTRR" "GrnHill" "IDOTRR"
## Condition.1 "Norm" "Norm" "Norm" "Feedr"
## Condition.2 "Norm" "Norm" "Norm" "Norm"
## Bldg.Type "1Fam" "1Fam" "TwnhsE" "1Fam"
## House.Style "2.5Unf" "2.5Unf" "1Story" "2Story"
## Overall.Qual "6" "5" "7" "4"
## Overall.Cond "8" "1" "5" "4"
## Year.Built "1935" "1900" "1986" "1920"
## Year.Remod.Add "1990" "1950" "1986" "1950"
## Roof.Style "Gable" "Hip" "Gable" "Gambrel"
## Roof.Matl "CompShg" "CompShg" "CompShg" "CompShg"
## Exterior.1st "BrkFace" "Wd Sdng" "BrkFace" "BrkFace"
## Exterior.2nd "Wd Sdng" "Wd Sdng" "Wd Sdng" "BrkFace"
## Mas.Vnr.Type "None" "None" "None" "None"
## Mas.Vnr.Area "0" "0" "0" "0"
## Exter.Qual "TA" "TA" "Gd" "TA"
## Exter.Cond "TA" "TA" "TA" "Fa"
## Foundation "BrkTil" "BrkTil" "CBlock" "BrkTil"
## Bsmt.Qual "TA" "TA" NA "TA"
## Bsmt.Cond "TA" "TA" NA "TA"
## Bsmt.Exposure "No" "No" NA "No"
## BsmtFin.Type.1 "Rec" "Unf" NA "Unf"
## BsmtFin.SF.1 "425" " 0" " 0" " 0"
## BsmtFin.Type.2 "Unf" "Unf" NA "Unf"
## BsmtFin.SF.2 "0" "0" "0" "0"
## Bsmt.Unf.SF "1411" " 686" " 0" " 649"
## Total.Bsmt.SF "1836" " 686" " 0" " 649"
## Heating "GasA" "GasA" "GasA" "GasA"
## Heating.QC "Gd" "Ex" "Gd" "TA"
## Central.Air "Y" "Y" "Y" "N"
## Electrical "SBrkr" "SBrkr" "SBrkr" "SBrkr"
## X1st.Flr.SF "1836" "1150" "1295" " 649"
## X2nd.Flr.SF "1836" " 686" " 0" " 668"
## Low.Qual.Fin.SF "0" "0" "0" "0"
## Bsmt.Full.Bath "0" "0" "0" "0"
## Bsmt.Half.Bath "0" "0" "0" "0"
## Full.Bath "3" "2" "2" "1"
## Half.Bath "1" "0" "0" "0"
## Bedroom.AbvGr "5" "4" "2" "3"
## Kitchen.AbvGr "1" "1" "1" "1"
## Kitchen.Qual "Gd" "TA" "Gd" "TA"
## TotRms.AbvGrd "7" "7" "5" "6"
## Functional "Typ" "Maj1" "Typ" "Typ"
## Fireplaces "2" "0" "0" "0"
## Fireplace.Qu "Gd" NA NA NA
## Garage.Type "Detchd" "Detchd" "Attchd" "Detchd"
## Garage.Yr.Blt "1993" "1900" "1986" "1920"
## Garage.Finish "Unf" "Unf" "RFn" "Unf"
## Garage.Cars "2" "1" "1" "1"
## Garage.Area "836" "288" "312" "250"
## Garage.Qual "TA" "TA" "TA" "TA"
## Garage.Cond "TA" "Fa" "TA" "Fa"
## Paved.Drive "Y" "N" "Y" "N"
## Wood.Deck.SF "684" " 0" " 0" " 0"
## Open.Porch.SF "80" " 0" " 0" "54"
## Enclosed.Porch " 32" " 0" " 0" "172"
## X3Ssn.Porch "0" "0" "0" "0"
## Screen.Porch "0" "0" "0" "0"
## Pool.Area "0" "0" "0" "0"
## Pool.QC NA NA NA NA
## Fence NA NA NA "MnPrv"
## Misc.Feature NA NA NA NA
## Misc.Val "0" "0" "0" "0"
## Mo.Sold "12" " 1" "11" " 7"
## Yr.Sold "2006" "2008" "2006" "2008"
## Sale.Type "WD " "WD " "WD " "WD "
## Sale.Condition "Normal" "Normal" "Normal" "Normal"
## Home.Decade "-0.5" "-4.0" " 4.6" "-2.0"
## Home.Decade.Sqr " 0.25" "16.00" "21.16" " 4.00"
## logprice "12.93603" "11.54248" "12.34583" "10.59663"
## logarea "8.208492" "7.515345" "7.166266" "7.183112"
## logLot.Area " 9.893437" "10.943764" " 9.016634" " 9.047821"
## NeighborhoodII "Gr2" "Gr1" "Gr4" "Gr1"
For your final model, calculate and briefly comment on the RMSE.
The final model RMSE is greatly improved over the initial model, from $28,681 to $19,476. Overfitting is also reduced, from a gap of $702.23 for the initial model ($29,383.23 on test data − $28,681.00 on training data) to $283.59 for the final model ($19,759.67 on test data − $19,476.08 on training data).
pred.fm_BIC.train <- exp(predict(fm_BIC, ames_train1))
resid.fm_BIC.train <- ames_train1$price - pred.fm_BIC.train
RMSE.fm_BIC.train <- sqrt(mean(resid.fm_BIC.train^2))
RMSE.fm_BIC.train
## [1] 19476.08
pred.fm_bas.train <- exp(predict(fm_bas, ames_train1, estimator = "BMA")$fit)
resid.fm_bas.train <- ames_train1$price - pred.fm_bas.train
RMSE.fm_bas.train <- sqrt(mean(resid.fm_bas.train^2))
RMSE.fm_bas.train
## [1] 19441.73
What are some strengths and weaknesses of your model?
This final model can be considered quite successful in its ability to explain the variation in sale price (adjusted R² = 92.1%); only 7.9% of the variation in price remains unexplained. The model also quantifies the contribution of each explanatory variable. However, the 95% prediction intervals are still quite wide: for example, ($246,151, $377,024) for home #1, ($192,073, $293,980) for home #2, and ($488,601, $749,673) for home #8. Collecting more observations, or new predictor variables such as school information (elementary, middle, and high school), might tighten these intervals.
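One way to quantify how wide the intervals are is their average width relative to the point prediction; a rough sketch using the test-set intervals computed earlier:

# mean relative width of the 95% prediction intervals on the test set
mean((int.fm_BIC.test[, "upr"] - int.fm_BIC.test[, "lwr"]) / int.fm_BIC.test[, "fit"])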
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings; be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
load("ames_validation.Rdata")
The RMSE of my final model on the validation data is $20,842.02. This is noticeably greater than the RMSE on the training and test data ($19,476.08 and $19,759.67, respectively). The coverage probability, i.e. the proportion of 95% predictive confidence intervals that contain the true house price in the validation set, is slightly over 95% (95.54%). Therefore, I conclude that my final model properly reflects uncertainty.
ames_val1 <- ames_validation %>% filter(Sale.Condition == "Normal")
ames_val1 <- ames_val1 %>% mutate(logprice = log(price))
ames_val1 <- ames_val1 %>% mutate(logarea = log(area))
ames_val1 <- ames_val1 %>% mutate(logLot.Area = log(Lot.Area))
ames_val1 <- ames_val1 %>% mutate(Home.Decade = (Year.Built - 1940)/10)
ames_val1 <- ames_val1 %>% mutate(Home.Decade.Sqr = (Home.Decade)^2)
ames_val1 <- ames_val1 %>% mutate(NeighborhoodII = Neighborhood)
levels(ames_val1$NeighborhoodII)[levels(ames_val1$NeighborhoodII)%in%c("MeadowV", "BrDale", "IDOTRR")] <- "Gr1"
levels(ames_val1$NeighborhoodII)[levels(ames_val1$NeighborhoodII)%in%c("OldTown", "Blueste", "BrkSide", "Edwards", "SWISU", "Sawyer", "NAmes", "NPkVill", "Mitchel", "Landmrk")] <- "Gr2"
levels(ames_val1$NeighborhoodII)[levels(ames_val1$NeighborhoodII)%in%c("Timber", "Somerst", "Veenker", "Greens", "Crawfor", "CollgCr","SawyerW", "Gilbert", "ClearCr", "NWAmes", "Blmngtn")] <- "Gr3"
levels(ames_val1$NeighborhoodII)[levels(ames_val1$NeighborhoodII)%in%c("StoneBr", "GrnHill", "NoRidge", "NridgHt")] <- "Gr4"
pred.fm_BIC.val <- exp(predict(fm_BIC, ames_val1))
resid.fm_BIC.val <- ames_val1$price - pred.fm_BIC.val
RMSE.fm_BIC.val <- sqrt(mean(resid.fm_BIC.val^2))
RMSE.fm_BIC.val
## [1] 20842.02
pred.fm_bas.val <- exp(predict(fm_bas, ames_val1, estimator = "BMA")$fit)
resid.fm_bas.val <- ames_val1$price - pred.fm_bas.val
RMSE.fm_bas.val <- sqrt(mean(resid.fm_bas.val^2))
RMSE.fm_bas.val
## [1] 20827.2
int.fm_BIC.val <- round(exp(predict(fm_BIC, ames_val1, interval = "prediction")))
cover.fm_BIC.val <- mean(ames_val1$price > int.fm_BIC.val[,"lwr"] &
ames_val1$price < int.fm_BIC.val[,"upr"])
cover.fm_BIC.val
## [1] 0.9554391
prediction.interval.val <- cbind(int.fm_BIC.val, ames_val1$price)
colnames(prediction.interval.val) <-c("fit", "lwr", "upr", "saleprice")
prediction.interval.val <- as.data.frame(prediction.interval.val)
low.group <- prediction.interval.val %>% filter(saleprice < 72500)
high.group <- prediction.interval.val %>% filter(saleprice > 379000)
mid.group <- prediction.interval.val %>% filter(saleprice > 169000 & saleprice <171000)
low.group
## fit lwr upr saleprice
## 1 64162 51799 79475 55000
## 2 69723 56321 86315 58500
## 3 52110 42038 64596 35000
## 4 78172 63205 96684 64000
## 5 61274 49451 75924 60000
## 6 61880 49846 76820 57625
## 7 75668 61102 93707 55000
## 8 64337 51980 79632 64500
## 9 72230 58305 89480 68000
high.group
## fit lwr upr saleprice
## 1 346157 278539 430192 475000
## 2 355736 286848 441167 584500
## 3 446248 360447 552472 460000
## 4 449622 363095 556768 445000
## 5 369048 298171 456772 410000
## 6 393299 317876 486617 470000
## 7 370957 299811 458986 441929
## 8 366454 296140 453463 410000
## 9 420642 339918 520538 451950
mid.group
## fit lwr upr saleprice
## 1 181916 147216 224795 170000
## 2 170492 137614 211225 170000
## 3 175990 142388 217523 170000
## 4 171484 138628 212127 170000
## 5 177720 143629 219903 170000
## 6 146570 118580 181168 169500
## 7 138842 111955 172186 169500
## 8 179325 145070 221669 170000
## 9 194807 157312 241240 170000
## 10 165039 133515 204007 170000
## 11 168774 136307 208973 170000
## 12 163251 131963 201956 170000
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
My final model (fm_BIC) is a fairly good model, with a high adjusted R² (92.1%) and good coverage probability, although the 95% prediction intervals are quite wide and need to be tightened. In the low sale-price group, the sale price tends to sit near the lower limit of the prediction interval; in the high sale-price group, near (or even above) the upper limit; and in the middle sale-price group, close to the fitted price. I can use this trend when predicting home prices with this model. The model showed that location (NeighborhoodII), size (area and Lot.Area), overall quality, overall condition, and home age (in decades) are the deciding predictors of home price, as I expected. In addition, “Fireplaces” also turns out to be an important predictor. Finally, I learned that the quadratic relationship between home price and home age is not evident in Ames homes once the other predictors are included.