As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.
load("ames_train.Rdata")
Use the code block below to load any necessary packages.
library(tidyverse)
library(statsr)
library(dplyr)
library(BAS)
library(plotly)
library(corrplot)
library(MASS)
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
Let’s get an understanding of the missing values in the dataset.
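As a quick first check (a minimal sketch using only base R), we can tabulate the per-column NA counts:

# Count missing values per column, keeping only columns with at least one NA
na_counts <- sort(colSums(is.na(ames_train)), decreasing = TRUE)
na_counts[na_counts > 0]

For the variables imputed below, an NA generally indicates that the house simply lacks the feature (no garage, no basement, no masonry veneer), so replacing NA with 0 is a reasonable choice: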
ames_train <- ames_train %>%
  mutate(Garage.Area=replace_na(Garage.Area,0),
         Garage.Cars=replace_na(Garage.Cars,0),
         Lot.Frontage=replace_na(Lot.Frontage,0),
         BsmtFin.SF.1=replace_na(BsmtFin.SF.1,0),
         BsmtFin.SF.2=replace_na(BsmtFin.SF.2,0),
         Total.Bsmt.SF=replace_na(Total.Bsmt.SF,0),
         Bsmt.Unf.SF=replace_na(Bsmt.Unf.SF,0),
         Bsmt.Full.Bath=replace_na(Bsmt.Full.Bath,0),
         Bsmt.Half.Bath=replace_na(Bsmt.Half.Bath,0),
         Mas.Vnr.Area=replace_na(Mas.Vnr.Area,0))
Let’s look at the distribution of price:
m <- list(l=50,r=50,b=50,t=50,pad=4)
price_density <- density(ames_train$price)
fig <- plot_ly(data=ames_train, x=~price, type="histogram",
nbinsx=75,name="Histogram") %>%
add_trace(x=price_density$x,y=price_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density") %>%
layout(autosize = F, width=800, height=400, margin=m,
title="Histogram and Density Plot of 'ames_train$price'",
xaxis=list(title="House Price ($)"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.7, y = 0.8))
fig
print(paste("Mean: ",round(mean(ames_train$price),0)))
## [1] "Mean: 181190"
print(paste("Median: ",round(median(ames_train$price),0)))
## [1] "Median: 159467"
print(paste("SD: ",round(sd(ames_train$price),0)))
## [1] "SD: 81910"
We can see that price is right-skewed; the mean ($181,190) sits well above the median ($159,467). Because of that, it might be helpful to take a look at the distribution of log(price) = lprice.
m <- list(l=50,r=50,b=50,t=50,pad=4)
ames_train <- ames_train %>% mutate(lprice=log(price))
price_density <- density(ames_train$lprice)
fig <- plot_ly(data=ames_train, x=~lprice, type="histogram",
nbinsx=75,name="Histogram") %>%
add_trace(x=price_density$x,y=price_density$y,type="scatter",
mode="lines",fill="tozeroy",yaxis="y2",name="Density") %>%
layout(autosize = F, width=800, height=400, margin=m,
title="Histogram and Density Plot of 'lprice'",
xaxis=list(title="log(price)"),yaxis=list(title="Number of Observations"),
yaxis2 = list(overlaying = "y", side = "right", title="Density"),
legend = list(x = 0.2, y = 0.8))
fig
print(paste("Mean: ",round(mean(ames_train$lprice),2)))
## [1] "Mean: 12.02"
print(paste("Median: ",round(median(ames_train$lprice),2)))
## [1] "Median: 11.98"
print(paste("SD: ",round(sd(ames_train$lprice),2)))
## [1] "SD: 0.42"
It looks like the distribution of log(price) is less skewed. Something to keep in mind…
Now, let’s take a look at the correlation among the numerical variables in ames_train.
# Subset all numeric columns in `ames_train`, dropping PID (an ID, not a
# predictor) and Garage.Yr.Blt (still NA for houses without a garage)
ames_train_numeric <- select_if(ames_train, is.numeric) %>%
  dplyr::select(-c(PID,Garage.Yr.Blt))
corr_matrix <- cor(ames_train_numeric)
corrplot(corr_matrix,method="circle",tl.cex=0.6)
Upon examination of the correlation matrix, we can see that a number of variables are noticeably correlated with lprice. We will use most, if not all, of these features to build our initial model.
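For a more direct view, we can also rank the numeric variables by their correlation with lprice (a small sketch reusing the corr_matrix computed above; price itself is excluded since it is trivially correlated with its own logarithm):

# Sort the numeric variables by their correlation with lprice
lprice_corr <- corr_matrix[,"lprice"]
round(sort(lprice_corr[!names(lprice_corr) %in% c("price","lprice")],
           decreasing=TRUE), 2)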
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)
Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
Based on the correlation matrix, it makes sense to develop a base model comprising the variables that have a positive correlation with lprice. Incidentally, we select lprice as the dependent variable to avoid the skewness of the original price feature. Our selected features are area, Overall.Cond, Total.Bsmt.SF, X1st.Flr.SF, Garage.Area, Full.Bath, and Garage.Cars.
initial_model <- lm(lprice~area+Overall.Cond+Total.Bsmt.SF+X1st.Flr.SF+Full.Bath+
                    Garage.Area+Garage.Cars,data=ames_train)
summary(initial_model)
##
## Call:
## lm(formula = lprice ~ area + Overall.Cond + Total.Bsmt.SF + X1st.Flr.SF +
##     Full.Bath + Garage.Area + Garage.Cars, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06512 -0.08879 0.02230 0.12331 0.86716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.050e+01 4.952e-02 212.071 < 2e-16 ***
## area 2.782e-04 2.072e-05 13.426 < 2e-16 ***
## Overall.Cond 5.656e-02 6.594e-03 8.578 < 2e-16 ***
## Total.Bsmt.SF 3.488e-04 2.979e-05 11.706 < 2e-16 ***
## X1st.Flr.SF -1.273e-05 3.514e-05 -0.362 0.717
## Full.Bath 8.822e-02 1.751e-02 5.037 5.62e-07 ***
## Garage.Area 8.112e-05 7.447e-05 1.089 0.276
## Garage.Cars 1.492e-01 2.177e-02 6.853 1.27e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.217 on 992 degrees of freedom
## Multiple R-squared: 0.7356, Adjusted R-squared: 0.7338
## F-statistic: 394.3 on 7 and 992 DF, p-value: < 2.2e-16
initial_model.res <- residuals(initial_model)
fig <- plot_ly(y=initial_model.res,type="scatter",mode="markers") %>%
  layout(title="Residuals Plot for Base Linear Model",
         xaxis=list(title="Observation"),
         yaxis=list(title="Residual (log(price) scale)"))
fig
We can observe that our initial model has a reasonable \(R^2\) of 0.7356 and that the residuals appear to be evenly distributed around the zero line.
Now, either using BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model, or do they disagree? What do you think this means?
Let’s use the stepAIC function from the MASS package to perform backward selection on our model. With the penalty parameter k = 2, stepAIC minimizes AIC; setting k = log(n) instead corresponds to BIC.
initial_model_AIC <- stepAIC(initial_model, direction = 'backward', k=2, trace = FALSE)
summary(initial_model_AIC)
##
## Call:
## lm(formula = lprice ~ area + Overall.Cond + Total.Bsmt.SF + Full.Bath +
## Garage.Cars, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04521 -0.09056 0.02470 0.12493 0.86424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.050e+01 4.895e-02 214.502 < 2e-16 ***
## area 2.774e-04 1.962e-05 14.136 < 2e-16 ***
## Overall.Cond 5.637e-02 6.589e-03 8.555 < 2e-16 ***
## Total.Bsmt.SF 3.447e-04 1.937e-05 17.801 < 2e-16 ***
## Full.Bath 8.539e-02 1.724e-02 4.953 8.59e-07 ***
## Garage.Cars 1.692e-01 1.150e-02 14.712 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2169 on 994 degrees of freedom
## Multiple R-squared: 0.7353, Adjusted R-squared: 0.734
## F-statistic: 552.2 on 5 and 994 DF, p-value: < 2.2e-16
initial_model_BIC <- stepAIC(initial_model, direction = 'backward', k=log(nrow(ames_train)), trace = FALSE)
summary(initial_model_BIC)
##
## Call:
## lm(formula = lprice ~ area + Overall.Cond + Total.Bsmt.SF + Full.Bath +
## Garage.Cars, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04521 -0.09056 0.02470 0.12493 0.86424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.050e+01 4.895e-02 214.502 < 2e-16 ***
## area 2.774e-04 1.962e-05 14.136 < 2e-16 ***
## Overall.Cond 5.637e-02 6.589e-03 8.555 < 2e-16 ***
## Total.Bsmt.SF 3.447e-04 1.937e-05 17.801 < 2e-16 ***
## Full.Bath 8.539e-02 1.724e-02 4.953 8.59e-07 ***
## Garage.Cars 1.692e-01 1.150e-02 14.712 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2169 on 994 degrees of freedom
## Multiple R-squared: 0.7353, Adjusted R-squared: 0.734
## F-statistic: 552.2 on 5 and 994 DF, p-value: < 2.2e-16
We see that, in both cases (AIC and BIC), the final model selected includes area, Overall.Cond, Total.Bsmt.SF, Full.Bath, and Garage.Cars, while omitting X1st.Flr.SF and Garage.Area. This is a good indication that the retained house features are statistically significant in determining lprice.
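As a quick sanity check (a short sketch using base R’s AIC() and BIC() functions), we can confirm that the selected model scores at least as well as the full initial model under each criterion:

# Lower values are better under both criteria
AIC(initial_model, initial_model_AIC)
BIC(initial_model, initial_model_BIC)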
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
initial_model_AIC <- lm(lprice~area+Overall.Cond+Total.Bsmt.SF+Full.Bath+Garage.Cars,
data=ames_train)
par(mfrow=c(2,2))
plot(initial_model_AIC)
An inspection of the plots above indicates that the residuals generated by initial_model_AIC do not appear abnormal. However, let’s take a closer look:
initial_model_AIC.res <- residuals(initial_model_AIC)
fig <- plot_ly(y=initial_model_AIC.res,type="scatter",mode="markers") %>%
  layout(title="Residuals Plot for Selected Initial Linear Model - 'ames_train'",
         xaxis=list(title="Observation"),
         yaxis=list(title="Residual (log(price) scale)"))
fig
Let’s look at a couple of the residual outliers.
The first is observation 310, with a residual of -1.8. Upon further inspection of the original ames_train data, this observation was a partial sale, hence the significant difference between \(\widehat{lprice}\) and \(lprice_{obs}\).
The second is observation 428, with a residual of -2.04. Upon further inspection of the original ames_train data, this observation was classified as an “abnormal” sale, again explaining the large gap between \(\widehat{lprice}\) and \(lprice_{obs}\).
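To verify this (a quick sketch, assuming the plotted observation indices correspond to row numbers in ames_train), we can pull those two rows directly:

# Sale conditions for the two largest negative residuals
ames_train[c(310,428), c("price","area","Sale.Condition")]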
Next, calculate the RMSE of your initial model. You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
library(Metrics)
round(rmse(ames_train$lprice,predict(initial_model_AIC,ames_train)),2)
## [1] 0.22
This will be our baseline RMSE.
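Because this baseline RMSE is on the log(price) scale, a dollar-scale figure is more interpretable. A minimal sketch (note that exponentiating log-scale predictions gives a median-type rather than mean-unbiased price prediction):

# Back-transform predictions to dollars before computing the RMSE
pred_price <- exp(predict(initial_model_AIC, ames_train))
round(rmse(ames_train$price, pred_price), 0)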
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
# Create the `lprice` in `ames_test`.
ames_test <- ames_test %>% mutate(lprice=log(price))
round(rmse(ames_test$lprice,predict(initial_model_AIC,ames_test)),2)
## [1] 0.18
The RMSE on ames_test using this linear model is 0.18, which is about 18% lower than the training-set RMSE of 0.22 on ames_train. This indicates that our linear model, initial_model_AIC, does a decent job of generalizing to the test data. One reason the RMSE for ames_test might be lower is that the spread of residual outliers in ames_test may be smaller than in ames_train. To investigate, let’s plot the residuals of ames_test under initial_model_AIC:
# Note: computed as predicted - actual, so the signs are flipped
# relative to residuals(); the spread is what matters here
initial_model_AIC.res_test <- predict(initial_model_AIC,ames_test)-ames_test$lprice
fig <- plot_ly(y=initial_model_AIC.res_test,type="scatter",mode="markers") %>%
  layout(title="Residuals Plot for Selected Initial Linear Model - 'ames_test'",
         xaxis=list(title="Observation"),
         yaxis=list(title="Residual (log(price) scale)"))
fig
Indeed, the outlier spread among the ames_test residuals is quite a bit less than that of ames_train. This probably explains the lower RMSE.
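We can quantify this as well (a small sketch comparing five-number summaries of the two residual vectors computed above):

# Compare residual spread on the training and test sets
rbind(train = summary(initial_model_AIC.res),
      test  = summary(initial_model_AIC.res_test))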
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
For the final model, we will carry over most of the numeric predictors from our earlier work (dropping X1st.Flr.SF and Full.Bath, while keeping Garage.Area in the candidate pool) and add several categorical features. We can intuitively guess which categorical variables are likely to affect lprice and include them; we also include a few features where we have no strong intuition about their effect on the sales price. The candidate predictors are:

- area (in initial model)
- Overall.Cond (in initial model)
- Total.Bsmt.SF (in initial model)
- Garage.Area (in initial model)
- Garage.Cars (in initial model)
- MS.SubClass (newly added)
- Neighborhood (newly added)
- House.Style (newly added)
- Heating.QC (newly added)
- Kitchen.Qual (newly added)
- Sale.Type (newly added)
- Sale.Condition (newly added)

# Subset `ames_train` down to the features of interest
ames_train_final <- ames_train %>%
dplyr::select(c(lprice,area,Overall.Cond,Total.Bsmt.SF,Garage.Area,
Garage.Cars,MS.SubClass,Neighborhood,House.Style,Heating.QC,
Kitchen.Qual,Sale.Type,Sale.Condition))
final_model <- lm(lprice~.,data=ames_train_final)
final_model_AIC <- stepAIC(final_model, direction='backward', trace = FALSE)
summary(final_model_AIC)
##
## Call:
## lm(formula = lprice ~ area + Overall.Cond + Total.Bsmt.SF + Garage.Cars +
## MS.SubClass + Neighborhood + House.Style + Heating.QC + Kitchen.Qual +
## Sale.Type + Sale.Condition, data = ames_train_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64705 -0.07017 0.00242 0.08819 0.56862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.111e+01 8.488e-02 130.872 < 2e-16 ***
## area 2.776e-04 2.016e-05 13.772 < 2e-16 ***
## Overall.Cond 6.019e-02 5.275e-03 11.410 < 2e-16 ***
## Total.Bsmt.SF 2.217e-04 2.077e-05 10.678 < 2e-16 ***
## Garage.Cars 6.294e-02 9.941e-03 6.332 3.75e-10 ***
## MS.SubClass -5.242e-04 1.662e-04 -3.154 0.001662 **
## NeighborhoodBlueste -1.071e-01 1.042e-01 -1.028 0.304440
## NeighborhoodBrDale -1.925e-01 7.328e-02 -2.627 0.008748 **
## NeighborhoodBrkSide -2.667e-01 5.861e-02 -4.550 6.07e-06 ***
## NeighborhoodClearCr -2.027e-02 6.789e-02 -0.299 0.765363
## NeighborhoodCollgCr -3.130e-02 5.239e-02 -0.598 0.550314
## NeighborhoodCrawfor -7.556e-03 5.884e-02 -0.128 0.897842
## NeighborhoodEdwards -2.067e-01 5.518e-02 -3.745 0.000191 ***
## NeighborhoodGilbert 2.044e-03 5.504e-02 0.037 0.970387
## NeighborhoodGreens 2.451e-01 9.202e-02 2.664 0.007852 **
## NeighborhoodGrnHill 6.004e-01 1.211e-01 4.956 8.54e-07 ***
## NeighborhoodIDOTRR -4.024e-01 5.951e-02 -6.762 2.39e-11 ***
## NeighborhoodMeadowV -3.217e-01 6.592e-02 -4.880 1.24e-06 ***
## NeighborhoodMitchel -6.083e-02 5.590e-02 -1.088 0.276795
## NeighborhoodNAmes -1.409e-01 5.334e-02 -2.641 0.008404 **
## NeighborhoodNoRidge 6.905e-02 5.885e-02 1.173 0.240970
## NeighborhoodNPkVill -8.631e-02 9.263e-02 -0.932 0.351705
## NeighborhoodNridgHt 1.270e-01 5.339e-02 2.378 0.017618 *
## NeighborhoodNWAmes -9.492e-02 5.713e-02 -1.662 0.096943 .
## NeighborhoodOldTown -3.655e-01 5.544e-02 -6.593 7.17e-11 ***
## NeighborhoodSawyer -1.185e-01 5.542e-02 -2.137 0.032842 *
## NeighborhoodSawyerW -9.256e-02 5.492e-02 -1.685 0.092274 .
## NeighborhoodSomerst 7.167e-02 5.130e-02 1.397 0.162757
## NeighborhoodStoneBr 2.150e-01 5.995e-02 3.587 0.000352 ***
## NeighborhoodSWISU -2.125e-01 7.052e-02 -3.013 0.002652 **
## NeighborhoodTimber 7.663e-02 6.123e-02 1.252 0.211053
## NeighborhoodVeenker 5.451e-03 7.053e-02 0.077 0.938411
## House.Style1.5Unf -3.220e-02 6.355e-02 -0.507 0.612490
## House.Style1Story -3.958e-02 2.307e-02 -1.716 0.086560 .
## House.Style2.5Unf 1.730e-02 5.450e-02 0.318 0.750916
## House.Style2Story -6.176e-03 2.253e-02 -0.274 0.784052
## House.StyleSFoyer 7.096e-02 3.637e-02 1.951 0.051343 .
## House.StyleSLvl 5.411e-02 3.263e-02 1.658 0.097589 .
## Heating.QCFa -1.632e-01 3.712e-02 -4.398 1.22e-05 ***
## Heating.QCGd -5.855e-02 1.585e-02 -3.694 0.000233 ***
## Heating.QCPo -1.074e-01 1.692e-01 -0.635 0.525628
## Heating.QCTA -7.005e-02 1.478e-02 -4.741 2.46e-06 ***
## Kitchen.QualFa -2.405e-01 4.650e-02 -5.171 2.84e-07 ***
## Kitchen.QualGd -1.019e-01 2.455e-02 -4.152 3.59e-05 ***
## Kitchen.QualPo -2.572e-01 1.710e-01 -1.505 0.132752
## Kitchen.QualTA -1.713e-01 2.808e-02 -6.103 1.52e-09 ***
## Sale.TypeCon 2.327e-01 8.043e-02 2.893 0.003910 **
## Sale.TypeConLD -8.334e-03 6.999e-02 -0.119 0.905237
## Sale.TypeConLI -1.153e-02 7.752e-02 -0.149 0.881810
## Sale.TypeConLw 3.116e-02 7.288e-02 0.427 0.669124
## Sale.TypeCWD 2.318e-01 9.634e-02 2.406 0.016316 *
## Sale.TypeNew -1.213e-01 1.039e-01 -1.167 0.243382
## Sale.TypeOth -7.395e-02 8.498e-02 -0.870 0.384384
## Sale.TypeVWD 1.888e-02 1.604e-01 0.118 0.906348
## Sale.TypeWD 2.960e-02 3.313e-02 0.894 0.371769
## Sale.ConditionAdjLand 1.283e-01 1.184e-01 1.084 0.278859
## Sale.ConditionAlloca 2.597e-01 8.224e-02 3.158 0.001641 **
## Sale.ConditionFamily 7.222e-04 4.489e-02 0.016 0.987167
## Sale.ConditionNormal 1.030e-01 2.227e-02 4.624 4.29e-06 ***
## Sale.ConditionPartial 2.962e-01 9.912e-02 2.989 0.002876 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1555 on 940 degrees of freedom
## Multiple R-squared: 0.8714, Adjusted R-squared: 0.8633
## F-statistic: 107.9 on 59 and 940 DF, p-value: < 2.2e-16
The final_model_AIC summary shows that the addition of the categorical variables increases the \(R^2\) value from 0.7353 to 0.8714, so the model now explains 87.1% of the variance in log(price).
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
For my proposed model, I compared Bayesian model averaging (bas.lm with an AIC prior) using both price and log(price) as the dependent variable.
ames_train_trans <- ames_train %>%
dplyr::select(c(price,lprice,area,Overall.Cond,Total.Bsmt.SF,Full.Bath,
Garage.Cars,MS.SubClass,Neighborhood,House.Style,Heating.QC,
Kitchen.Qual,Sale.Type,Sale.Condition))
final_modela <- bas.lm(price ~ .-lprice, data=ames_train_trans,
prior = "AIC",
modelprior = uniform())
summary(final_modela)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.000000e+00 1.000000e+00
## area 1.00000000 1.0000 1.000000e+00 1.000000e+00
## Overall.Cond 0.99999999 1.0000 1.000000e+00 1.000000e+00
## Total.Bsmt.SF 1.00000000 1.0000 1.000000e+00 1.000000e+00
## Full.Bath 0.68055628 1.0000 0.000000e+00 1.000000e+00
## Garage.Cars 1.00000000 1.0000 1.000000e+00 1.000000e+00
## MS.SubClass 1.00000000 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodBlueste 0.16483759 0.0000 0.000000e+00 0.000000e+00
## NeighborhoodBrDale 0.69472858 0.0000 0.000000e+00 0.000000e+00
## NeighborhoodBrkSide 0.98745936 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodClearCr 0.36904301 0.0000 0.000000e+00 0.000000e+00
## NeighborhoodCollgCr 0.79723671 0.0000 0.000000e+00 0.000000e+00
## NeighborhoodCrawfor 0.99689033 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodEdwards 0.99015523 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodGilbert 0.98199052 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodGreens 0.99970491 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodGrnHill 1.00000000 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodIDOTRR 0.99978187 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodMeadowV 0.05295983 0.0000 0.000000e+00 0.000000e+00
## NeighborhoodMitchel 0.22163885 0.0000 0.000000e+00 0.000000e+00
## NeighborhoodNAmes 0.99465912 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodNoRidge 0.99999992 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodNPkVill 0.11172371 0.0000 0.000000e+00 0.000000e+00
## NeighborhoodNridgHt 1.00000000 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodNWAmes 0.41676567 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodOldTown 1.00000000 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodSawyer 0.71245965 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodSawyerW 0.12437260 1.0000 1.000000e+00 0.000000e+00
## NeighborhoodSomerst 0.99997841 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodStoneBr 1.00000000 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodSWISU 0.84966538 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodTimber 0.99996453 1.0000 1.000000e+00 1.000000e+00
## NeighborhoodVeenker 0.94617490 1.0000 1.000000e+00 0.000000e+00
## House.Style1.5Unf 0.05759816 0.0000 0.000000e+00 0.000000e+00
## House.Style1Story 0.07703241 0.0000 0.000000e+00 0.000000e+00
## House.Style2.5Unf 0.15262856 0.0000 0.000000e+00 0.000000e+00
## House.Style2Story 0.11852581 0.0000 0.000000e+00 0.000000e+00
## House.StyleSFoyer 0.99997759 1.0000 1.000000e+00 1.000000e+00
## House.StyleSLvl 0.99185632 1.0000 1.000000e+00 1.000000e+00
## Heating.QCFa 0.95157740 1.0000 1.000000e+00 1.000000e+00
## Heating.QCGd 0.99678851 1.0000 1.000000e+00 1.000000e+00
## Heating.QCPo 0.18797317 0.0000 0.000000e+00 0.000000e+00
## Heating.QCTA 0.99648531 1.0000 1.000000e+00 1.000000e+00
## Kitchen.QualFa 1.00000000 1.0000 1.000000e+00 1.000000e+00
## Kitchen.QualGd 1.00000000 1.0000 1.000000e+00 1.000000e+00
## Kitchen.QualPo 0.80199937 1.0000 1.000000e+00 1.000000e+00
## Kitchen.QualTA 1.00000000 1.0000 1.000000e+00 1.000000e+00
## Sale.TypeCon 0.99986579 1.0000 1.000000e+00 1.000000e+00
## Sale.TypeConLD 0.05975033 0.0000 0.000000e+00 0.000000e+00
## Sale.TypeConLI 0.06338566 0.0000 0.000000e+00 0.000000e+00
## Sale.TypeConLw 0.16338769 0.0000 0.000000e+00 0.000000e+00
## Sale.TypeCWD 0.99637480 1.0000 1.000000e+00 1.000000e+00
## Sale.TypeNew 0.06530685 0.0000 0.000000e+00 0.000000e+00
## Sale.TypeOth 0.04720290 0.0000 0.000000e+00 0.000000e+00
## Sale.TypeVWD 0.04138376 0.0000 0.000000e+00 0.000000e+00
## Sale.TypeWD 0.93998456 1.0000 1.000000e+00 1.000000e+00
## Sale.ConditionAdjLand 0.06677131 0.0000 0.000000e+00 0.000000e+00
## Sale.ConditionAlloca 0.40651006 0.0000 0.000000e+00 0.000000e+00
## Sale.ConditionFamily 0.92768803 1.0000 1.000000e+00 1.000000e+00
## Sale.ConditionNormal 0.82354779 1.0000 1.000000e+00 1.000000e+00
## Sale.ConditionPartial 0.99837380 1.0000 1.000000e+00 1.000000e+00
## BF NA 1.0000 8.748728e-01 8.369908e-01
## PostProbs NA 0.0008 7.000000e-04 7.000000e-04
## R2 NA 0.8624 8.621000e-01 8.618000e-01
## dim NA 41.0000 4.000000e+01 3.900000e+01
## logmarg NA -13816.0916 -1.381623e+04 -1.381627e+04
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## area 1.000000e+00 1.000000e+00
## Overall.Cond 1.000000e+00 1.000000e+00
## Total.Bsmt.SF 1.000000e+00 1.000000e+00
## Full.Bath 1.000000e+00 1.000000e+00
## Garage.Cars 1.000000e+00 1.000000e+00
## MS.SubClass 1.000000e+00 1.000000e+00
## NeighborhoodBlueste 0.000000e+00 0.000000e+00
## NeighborhoodBrDale 0.000000e+00 0.000000e+00
## NeighborhoodBrkSide 1.000000e+00 1.000000e+00
## NeighborhoodClearCr 0.000000e+00 0.000000e+00
## NeighborhoodCollgCr 0.000000e+00 0.000000e+00
## NeighborhoodCrawfor 1.000000e+00 1.000000e+00
## NeighborhoodEdwards 1.000000e+00 1.000000e+00
## NeighborhoodGilbert 1.000000e+00 1.000000e+00
## NeighborhoodGreens 1.000000e+00 1.000000e+00
## NeighborhoodGrnHill 1.000000e+00 1.000000e+00
## NeighborhoodIDOTRR 1.000000e+00 1.000000e+00
## NeighborhoodMeadowV 0.000000e+00 0.000000e+00
## NeighborhoodMitchel 0.000000e+00 0.000000e+00
## NeighborhoodNAmes 1.000000e+00 1.000000e+00
## NeighborhoodNoRidge 1.000000e+00 1.000000e+00
## NeighborhoodNPkVill 0.000000e+00 0.000000e+00
## NeighborhoodNridgHt 1.000000e+00 1.000000e+00
## NeighborhoodNWAmes 1.000000e+00 1.000000e+00
## NeighborhoodOldTown 1.000000e+00 1.000000e+00
## NeighborhoodSawyer 1.000000e+00 1.000000e+00
## NeighborhoodSawyerW 0.000000e+00 0.000000e+00
## NeighborhoodSomerst 1.000000e+00 1.000000e+00
## NeighborhoodStoneBr 1.000000e+00 1.000000e+00
## NeighborhoodSWISU 1.000000e+00 1.000000e+00
## NeighborhoodTimber 1.000000e+00 1.000000e+00
## NeighborhoodVeenker 1.000000e+00 1.000000e+00
## House.Style1.5Unf 0.000000e+00 0.000000e+00
## House.Style1Story 0.000000e+00 0.000000e+00
## House.Style2.5Unf 0.000000e+00 0.000000e+00
## House.Style2Story 0.000000e+00 0.000000e+00
## House.StyleSFoyer 1.000000e+00 1.000000e+00
## House.StyleSLvl 1.000000e+00 1.000000e+00
## Heating.QCFa 1.000000e+00 1.000000e+00
## Heating.QCGd 1.000000e+00 1.000000e+00
## Heating.QCPo 0.000000e+00 0.000000e+00
## Heating.QCTA 1.000000e+00 1.000000e+00
## Kitchen.QualFa 1.000000e+00 1.000000e+00
## Kitchen.QualGd 1.000000e+00 1.000000e+00
## Kitchen.QualPo 1.000000e+00 1.000000e+00
## Kitchen.QualTA 1.000000e+00 1.000000e+00
## Sale.TypeCon 1.000000e+00 1.000000e+00
## Sale.TypeConLD 0.000000e+00 0.000000e+00
## Sale.TypeConLI 0.000000e+00 0.000000e+00
## Sale.TypeConLw 0.000000e+00 0.000000e+00
## Sale.TypeCWD 1.000000e+00 1.000000e+00
## Sale.TypeNew 0.000000e+00 0.000000e+00
## Sale.TypeOth 0.000000e+00 0.000000e+00
## Sale.TypeVWD 0.000000e+00 0.000000e+00
## Sale.TypeWD 1.000000e+00 1.000000e+00
## Sale.ConditionAdjLand 0.000000e+00 0.000000e+00
## Sale.ConditionAlloca 0.000000e+00 0.000000e+00
## Sale.ConditionFamily 1.000000e+00 1.000000e+00
## Sale.ConditionNormal 1.000000e+00 0.000000e+00
## Sale.ConditionPartial 1.000000e+00 1.000000e+00
## BF 8.310813e-01 7.298727e-01
## PostProbs 7.000000e-04 6.000000e-04
## R2 8.621000e-01 8.617000e-01
## dim 4.000000e+01 3.900000e+01
## logmarg -1.381628e+04 -1.381641e+04
final_modelb <- bas.lm(lprice ~ .-price, data=ames_train_trans,
prior = "AIC",
modelprior = uniform())
summary(final_modelb)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## area 1.00000000 1.0000 1.0000000 1.0000000
## Overall.Cond 1.00000000 1.0000 1.0000000 1.0000000
## Total.Bsmt.SF 1.00000000 1.0000 1.0000000 1.0000000
## Full.Bath 0.15589379 0.0000 0.0000000 0.0000000
## Garage.Cars 1.00000000 1.0000 1.0000000 1.0000000
## MS.SubClass 0.99598799 1.0000 1.0000000 1.0000000
## NeighborhoodBlueste 0.08406351 0.0000 0.0000000 0.0000000
## NeighborhoodBrDale 0.94999463 1.0000 1.0000000 1.0000000
## NeighborhoodBrkSide 0.99999999 1.0000 1.0000000 1.0000000
## NeighborhoodClearCr 0.43189768 0.0000 0.0000000 0.0000000
## NeighborhoodCollgCr 0.95649305 0.0000 0.0000000 0.0000000
## NeighborhoodCrawfor 0.97037407 1.0000 1.0000000 1.0000000
## NeighborhoodEdwards 0.99999183 1.0000 1.0000000 1.0000000
## NeighborhoodGilbert 0.99521472 1.0000 1.0000000 1.0000000
## NeighborhoodGreens 0.99991089 1.0000 1.0000000 1.0000000
## NeighborhoodGrnHill 1.00000000 1.0000 1.0000000 1.0000000
## NeighborhoodIDOTRR 1.00000000 1.0000 1.0000000 1.0000000
## NeighborhoodMeadowV 0.99999977 1.0000 1.0000000 1.0000000
## NeighborhoodMitchel 0.34334468 1.0000 0.0000000 0.0000000
## NeighborhoodNAmes 0.99042285 1.0000 1.0000000 1.0000000
## NeighborhoodNoRidge 0.99987104 1.0000 1.0000000 1.0000000
## NeighborhoodNPkVill 0.06123173 0.0000 0.0000000 0.0000000
## NeighborhoodNridgHt 0.99999926 1.0000 1.0000000 1.0000000
## NeighborhoodNWAmes 0.15742911 1.0000 1.0000000 1.0000000
## NeighborhoodOldTown 1.00000000 1.0000 1.0000000 1.0000000
## NeighborhoodSawyer 0.69758307 1.0000 1.0000000 1.0000000
## NeighborhoodSawyerW 0.16048388 1.0000 1.0000000 1.0000000
## NeighborhoodSomerst 0.99998707 1.0000 1.0000000 1.0000000
## NeighborhoodStoneBr 1.00000000 1.0000 1.0000000 1.0000000
## NeighborhoodSWISU 0.96818995 1.0000 1.0000000 1.0000000
## NeighborhoodTimber 0.99976917 1.0000 1.0000000 1.0000000
## NeighborhoodVeenker 0.61833430 0.0000 1.0000000 0.0000000
## House.Style1.5Unf 0.11620362 0.0000 0.0000000 0.0000000
## House.Style1Story 0.75230598 1.0000 1.0000000 0.0000000
## House.Style2.5Unf 0.07071815 0.0000 0.0000000 0.0000000
## House.Style2Story 0.08001648 0.0000 0.0000000 0.0000000
## House.StyleSFoyer 0.98792143 1.0000 1.0000000 1.0000000
## House.StyleSLvl 0.97559161 1.0000 1.0000000 1.0000000
## Heating.QCFa 0.99999303 1.0000 1.0000000 1.0000000
## Heating.QCGd 0.99990390 1.0000 1.0000000 1.0000000
## Heating.QCPo 0.15167128 0.0000 0.0000000 0.0000000
## Heating.QCTA 0.99999980 1.0000 1.0000000 1.0000000
## Kitchen.QualFa 0.99999957 1.0000 1.0000000 1.0000000
## Kitchen.QualGd 0.99992032 1.0000 1.0000000 1.0000000
## Kitchen.QualPo 0.46553014 1.0000 0.0000000 0.0000000
## Kitchen.QualTA 1.00000000 1.0000 1.0000000 1.0000000
## Sale.TypeCon 0.99889764 1.0000 1.0000000 1.0000000
## Sale.TypeConLD 0.06755610 0.0000 0.0000000 0.0000000
## Sale.TypeConLI 0.07300743 0.0000 0.0000000 0.0000000
## Sale.TypeConLw 0.08785220 0.0000 0.0000000 0.0000000
## Sale.TypeCWD 0.98356802 1.0000 1.0000000 1.0000000
## Sale.TypeNew 0.32469206 1.0000 0.0000000 0.0000000
## Sale.TypeOth 0.36366065 0.0000 0.0000000 1.0000000
## Sale.TypeVWD 0.15057357 0.0000 0.0000000 0.0000000
## Sale.TypeWD 0.58715994 0.0000 1.0000000 1.0000000
## Sale.ConditionAdjLand 0.26089371 0.0000 0.0000000 0.0000000
## Sale.ConditionAlloca 0.99826757 1.0000 1.0000000 1.0000000
## Sale.ConditionFamily 0.05848334 0.0000 0.0000000 0.0000000
## Sale.ConditionNormal 0.99999980 1.0000 1.0000000 1.0000000
## Sale.ConditionPartial 0.99921753 1.0000 1.0000000 1.0000000
## BF NA 1.0000 0.9735828 0.9252926
## PostProbs NA 0.0004 0.0004000 0.0004000
## R2 NA 0.8703 0.8701000 0.8698000
## dim NA 43.0000 42.0000000 41.0000000
## logmarg NA -1608.9388 -1608.9655390 -1609.0164119
## model 4 model 5
## Intercept 1.0000000 1.0000000
## area 1.0000000 1.0000000
## Overall.Cond 1.0000000 1.0000000
## Total.Bsmt.SF 1.0000000 1.0000000
## Full.Bath 0.0000000 0.0000000
## Garage.Cars 1.0000000 1.0000000
## MS.SubClass 1.0000000 1.0000000
## NeighborhoodBlueste 0.0000000 0.0000000
## NeighborhoodBrDale 1.0000000 1.0000000
## NeighborhoodBrkSide 1.0000000 1.0000000
## NeighborhoodClearCr 0.0000000 0.0000000
## NeighborhoodCollgCr 1.0000000 1.0000000
## NeighborhoodCrawfor 0.0000000 1.0000000
## NeighborhoodEdwards 1.0000000 1.0000000
## NeighborhoodGilbert 1.0000000 1.0000000
## NeighborhoodGreens 1.0000000 1.0000000
## NeighborhoodGrnHill 1.0000000 1.0000000
## NeighborhoodIDOTRR 1.0000000 1.0000000
## NeighborhoodMeadowV 1.0000000 1.0000000
## NeighborhoodMitchel 1.0000000 0.0000000
## NeighborhoodNAmes 1.0000000 1.0000000
## NeighborhoodNoRidge 1.0000000 1.0000000
## NeighborhoodNPkVill 0.0000000 0.0000000
## NeighborhoodNridgHt 1.0000000 1.0000000
## NeighborhoodNWAmes 1.0000000 0.0000000
## NeighborhoodOldTown 1.0000000 1.0000000
## NeighborhoodSawyer 1.0000000 1.0000000
## NeighborhoodSawyerW 1.0000000 0.0000000
## NeighborhoodSomerst 1.0000000 1.0000000
## NeighborhoodStoneBr 1.0000000 1.0000000
## NeighborhoodSWISU 1.0000000 1.0000000
## NeighborhoodTimber 1.0000000 1.0000000
## NeighborhoodVeenker 0.0000000 1.0000000
## House.Style1.5Unf 0.0000000 0.0000000
## House.Style1Story 1.0000000 1.0000000
## House.Style2.5Unf 0.0000000 0.0000000
## House.Style2Story 0.0000000 0.0000000
## House.StyleSFoyer 1.0000000 1.0000000
## House.StyleSLvl 1.0000000 1.0000000
## Heating.QCFa 1.0000000 1.0000000
## Heating.QCGd 1.0000000 1.0000000
## Heating.QCPo 0.0000000 0.0000000
## Heating.QCTA 1.0000000 1.0000000
## Kitchen.QualFa 1.0000000 1.0000000
## Kitchen.QualGd 1.0000000 1.0000000
## Kitchen.QualPo 1.0000000 0.0000000
## Kitchen.QualTA 1.0000000 1.0000000
## Sale.TypeCon 1.0000000 1.0000000
## Sale.TypeConLD 0.0000000 0.0000000
## Sale.TypeConLI 0.0000000 0.0000000
## Sale.TypeConLw 0.0000000 0.0000000
## Sale.TypeCWD 1.0000000 1.0000000
## Sale.TypeNew 1.0000000 0.0000000
## Sale.TypeOth 1.0000000 0.0000000
## Sale.TypeVWD 0.0000000 0.0000000
## Sale.TypeWD 0.0000000 1.0000000
## Sale.ConditionAdjLand 0.0000000 0.0000000
## Sale.ConditionAlloca 1.0000000 1.0000000
## Sale.ConditionFamily 0.0000000 0.0000000
## Sale.ConditionNormal 1.0000000 1.0000000
## Sale.ConditionPartial 1.0000000 1.0000000
## BF 0.9227676 0.8591699
## PostProbs 0.0004000 0.0003000
## R2 0.8706000 0.8698000
## dim 44.0000000 41.0000000
## logmarg -1609.0191446 -1609.0905553
Since the top models with lprice as the dependent variable achieve a higher \(R^2\) (about 0.870 versus 0.862 with price), and since lprice also avoids the skewness noted in the EDA, we will keep the transformed response lprice in our model. (Strictly speaking, \(R^2\) values are not directly comparable across different response variables, so the skewness argument carries most of the weight.)
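A fairer comparison would be on a common dollar scale. A minimal sketch, assuming BAS’s fitted() method with an estimator argument (estimator = "BMA"):

# In-sample dollar-scale RMSE for the price-based and lprice-based BMA fits
# (exp() of log-scale fits slightly under-predicts the mean; fine for a rough comparison)
fit_price  <- fitted(final_modela, estimator="BMA")
fit_lprice <- exp(fitted(final_modelb, estimator="BMA"))
round(c(price_model  = rmse(ames_train_trans$price, fit_price),
        lprice_model = rmse(ames_train_trans$price, fit_lprice)), 0)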
Let’s calculate the RMSE for this model:
round(rmse(ames_train_final$lprice,predict(final_model_AIC,ames_train_final)),2)
## [1] 0.15
Notice that, with the inclusion of the categorical variables, the training-set RMSE has decreased from 0.22 (initial model) to 0.15.
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
No explicit interaction terms were included. Instead, I included a number of categorical variables to capture how the different levels they take on influence lprice. The p-values of the coefficients show that the levels of Neighborhood can significantly impact the dependent variable; Sale.Condition is another categorical variable whose levels have a significant effect.
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
As a quick litmus test, I created a draft linear model and then examined the p-values of the coefficients. Any numeric variable with a p-value well above 0.05 would likely be dropped, provided that removing it did not meaningfully reduce the resulting \(R^2\).
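For example (a sketch using the draft final_model fitted above), the coefficient names with p-values above 0.05 can be listed directly:

# Coefficients whose p-values exceed 0.05 in the draft model
coefs <- summary(final_model)$coefficients
rownames(coefs)[coefs[,"Pr(>|t|)"] > 0.05]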
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
When I ran my final linear model on the ames_test dataset, I discovered that ames_test contains some levels of Neighborhood and House.Style that never appear in ames_train. Let’s see how many observations in ames_test this affects:
nrow(ames_test[ames_test$Neighborhood=="Landmrk",])
## [1] 1
nrow(ames_test[ames_test$House.Style=="2.5Fin",])
## [1] 2
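More generally (a small base-R sketch), unseen factor levels can be detected programmatically rather than by trial and error:

# Levels present in ames_test but absent from ames_train
setdiff(unique(ames_test$Neighborhood), unique(ames_train$Neighborhood))
setdiff(unique(ames_test$House.Style), unique(ames_train$House.Style))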
Since there are only three such observations, we will remove them from our analysis.
ames_test_final <- ames_test %>% filter(Neighborhood!="Landmrk",House.Style!="2.5Fin")
Let’s calculate the RMSE for the test data set…
round(rmse(ames_test_final$lprice,predict(final_model_AIC,ames_test_final)),2)
## [1] 0.13
Great! It looks as though the RMSEs for the train and test sets are very close using this lprice-based model.
For your final model, create and briefly interpret an informative plot of the residuals.
Let’s look at the residuals for the final_model_AIC model.
par(mfrow=c(2,2))
plot(final_model_AIC)
And a closer look…
final_model_AIC.res <- residuals(final_model_AIC)
fig <- plot_ly(y=final_model_AIC.res,type="scatter",mode="markers") %>%
  layout(title="Residuals Plot for Final Linear Model - 'ames_train'",
         xaxis=list(title="Observation"),
         yaxis=list(title="Residual (log(price) scale)"))
fig
Now, let’s look at the residuals generated by ames_test using this model:
ames_test_final_model <- ames_test %>% filter(Neighborhood!="Landmrk",House.Style!="2.5Fin")
# Note: computed as predicted - actual, so the signs are flipped relative to residuals()
final_model_AIC.res_test <- predict(final_model_AIC,ames_test_final_model)-ames_test_final_model$lprice
fig <- plot_ly(y=final_model_AIC.res_test,type="scatter",mode="markers") %>%
  layout(title="Residuals Plot for Selected Linear Model - 'ames_test'",
         xaxis=list(title="Observation"),
         yaxis=list(title="Residual (log(price) scale)"))
fig
For your final model, calculate and briefly comment on the RMSE.
Previously, we calculated the RMSE for both the train and test sets using our final model. We will reproduce those results here:
# RMSE for train dataset
round(rmse(ames_train_final$lprice,predict(final_model_AIC,ames_train_final)),2)
## [1] 0.15
# RMSE for test dataset
round(rmse(ames_test_final$lprice,predict(final_model_AIC,ames_test_final)),2)
## [1] 0.13
Since the RMSEs are quite close, we can conclude that our model is not over-fitting on ames_train.
What are some strengths and weaknesses of your model?
Model Strengths:

- The model generalizes well to the out-of-sample ames_test dataset.
- It models log(price) to avoid the skewness of price.

Model Weaknesses:
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings; be sure to mention:

- What is the RMSE of your final model when applied to the validation data?
- How does this value compare to that of the training data and/or testing data?
- What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
- From this result, does your final model properly reflect uncertainty?
We prepare ames_validation for modeling by removing observations with feature levels not seen in ames_train. This will exclude 4 observations from `ames_validation`.
load("ames_validation.Rdata")
nrow(ames_validation)
## [1] 763
ames_validation_model <- ames_validation %>% mutate(lprice=log(price)) %>% dplyr::select(c(price,lprice,area,Overall.Cond,Total.Bsmt.SF,Garage.Area,
Garage.Cars,MS.SubClass,Neighborhood,House.Style,Heating.QC,
Kitchen.Qual,Sale.Type,Sale.Condition)) %>%
filter(House.Style!="2.5Fin")
nrow(ames_validation_model)
## [1] 759
Let’s determine what percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set.
predict.ci <- as.data.frame(exp(predict(final_model_AIC, ames_validation_model, interval = "prediction")))
ci_pct <- mean(ames_validation_model$price > predict.ci$lwr &
ames_validation_model$price < predict.ci$upr)
ci_pct
## [1] 0.9762846
We observe that 97.6% of the validation-set prices fall within their 95% prediction intervals. Since the empirical coverage meets (indeed slightly exceeds) the nominal 95%, we conclude that our final model reflects uncertainty appropriately, erring if anything on the conservative side.
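As a complementary check (a one-line sketch using the predict.ci data frame built above), the average width of those intervals in dollars shows how precise the predictions are:

# Mean width of the 95% prediction intervals, in dollars
round(mean(predict.ci$upr - predict.ci$lwr), 0)

Finally, let’s compute the RMSE on the validation set: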
round(rmse(ames_validation_model$lprice,predict(final_model_AIC,ames_validation_model)),2)
## [1] 0.13
The RMSE on the ames_validation dataset (0.13) matches that of the ames_test dataset. We can conclude that our model generalizes well when exposed to new data.
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
This housing dataset contains some very informative features that enable a linear model to predict the selling price of houses. The final model achieves a low RMSE on the validation data, and it appears to quantify uncertainty well. According to the AIC-based selection, the variables most important for predicting the log(price) of a house in this model are:
- area
- Overall.Cond
- Total.Bsmt.SF
- Garage.Cars
- MS.SubClass
- Neighborhood
- House.Style
- Heating.QC
- Kitchen.Qual
- Sale.Type
- Sale.Condition

The final model has a multiple \(R^2\) of 0.8714 (adjusted \(R^2\) 0.8633), explaining roughly 87% of the variance in the logarithm of house prices.
In the future, collecting additional variables or applying more advanced prediction methodologies than linear regression could yield better predictive power, with this final linear regression model serving as a springboard.