1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

load("ames_train.Rdata")

Loading all necessary packages

library(statsr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.2
library(BAS)
## Warning: package 'BAS' was built under R version 4.0.2
library(MASS)
library(ggplot2)
library(scales)
library(reshape)
## Warning: package 'reshape' was built under R version 4.0.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.2
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.2

2.1 Part 1 - Exploratory Data Analysis (EDA)

In this section we will taking a deep dive into the data by exploring all the variables.

In the dataset we have houses with normal sales and abnormal sales behavior. In order to make safe predictions, we would start by working with only houses that exhibit a normal sales behavior. The plot below shows that the statement above is valid.

n.Sale.Condition = length(levels(ames_train$Sale.Condition))
par(mar=c(5,4,4,10))
plot(log(price) ~ I(area), 
     data=ames_train, col=Sale.Condition,
     pch=as.numeric(Sale.Condition)+15, main="Training Data")
legend(x=,"right", legend=levels(ames_train$Sale.Condition),
       col=1:n.Sale.Condition, pch=15+(1:n.Sale.Condition),
       bty="n", xpd=TRUE, inset=c(-.5,0))

ames_train <- ames_train[ames_train$Sale.Condition=="Normal",]
ames_train <- ames_train[,names(ames_train) != "Sale.Condition"]
dim(ames_train)
## [1] 834  80

We see that doing this decreases the dataset from 1000 observations to 834.

Price and Area First we explore the price variable by getting a quick summary statistic.

summary(ames_train$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   39300  129000  155500  174622  205000  615000

Next, we explore spread of values in the area variable

summary(ames_train$area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1077    1383    1450    1718    3672

Finally, we make a histogram of the price and area column to see the spread and inspect the distribution of values.

par(mfrow= c(2,2))
hist(ames_train$price)
hist(ames_train$area)
plot(ames_train$area, ames_train$price, pch = 16, cex = 0.8, col = "cyan")
abline(lm(ames_train$price ~ ames_train$area), col = "purple")
plot(log(ames_train$area), log(ames_train$price), pch = 16, cex = 0.8, col = "cyan")
abline(lm(log(ames_train$price) ~ log(ames_train$area)), col = "purple")

From the plots shown above, - We see that the response variable, price is right-skewed, meaning that most of the homes are in the $(100k - 200) range. - The predictor variable in this case, area which is the size of a home, also exhibits a right-skewed distribution, with most of the homes falling between falling in between 1,000 to 2,000 sq. ft. - A plot of price against area shows a linear relationship between the two variables, however there seems to be increased variance as the area increases. - Taking a log transform of price vs. area shows an increased linear relationship between the two variables.

Price and Year.Built The age of a home Year.Built would be inspected in this section. Let’s start by checking the statistical summary of the values in this variable.

summary(ames_train$Year.Built)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1872    1954    1972    1970    1997    2009
ggplot(ames_train, aes(x = Year.Built, y = price)) + geom_point(pch = 16, col = "purple") + stat_smooth(method = "lm", se = FALSE, col = "red")
## `geom_smooth()` using formula 'y ~ x'

  • We regard homes built before 1920 as Old homes
  • These homes did not fit in the regression line, showing that both very old homes and very new homes tend to command a price premium relative to “middle age” homes. - Therefore, a quadratic effect might be expected for an age variable in a multiple linear regression model to predict price.
ames_train <- ames_train %>% mutate(Home.Decade = (Year.Built - 1940)/10)
lm1 <- lm(log(price) ~ Year.Built, data = ames_train)
lm2 <- lm(log(price) ~ Home.Decade, data = ames_train)
par(mfrow = c(1, 2))
plot(lm1, which = 1, pch = 16, cex = 0.8, col = "grey")
plot(lm2, which = 1, pch = 16, cex = 0.8, col = "grey")

  • To facilitate this, home age is rescaled by subtracting 1940 (middle point) from “year built” and dividing by 10 (getting decade value instead of year value).
  • The resulting variable has a mean close to 3 and a standard deviation of 2.9, and represents the number of decades away from 1940. * * *

2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

2.2.1 Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


The principal goal of this analysis is to construct a linear model to predict home prices in Ames, Iowa. For our initial model, we’ll pick 10 variables that we suspect may play a role in predicting sale prices.

basic.model <- lm(log(price) ~ log(area) + Neighborhood + Lot.Config + Overall.Qual + Overall.Cond + Year.Built + Bedroom.AbvGr + TotRms.AbvGrd + Full.Bath + Half.Bath, data = ames_train)

summary(basic.model)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Neighborhood + Lot.Config + 
##     Overall.Qual + Overall.Cond + Year.Built + Bedroom.AbvGr + 
##     TotRms.AbvGrd + Full.Bath + Half.Bath, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82470 -0.06932  0.00258  0.06941  0.57418 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.3570337  0.8159562  -1.663 0.096682 .  
## log(area)            0.6135130  0.0322996  18.994  < 2e-16 ***
## NeighborhoodBlueste -0.1697473  0.0906972  -1.872 0.061631 .  
## NeighborhoodBrDale  -0.2293101  0.0720926  -3.181 0.001526 ** 
## NeighborhoodBrkSide  0.0352540  0.0602078   0.586 0.558351    
## NeighborhoodClearCr  0.2001283  0.0654703   3.057 0.002312 ** 
## NeighborhoodCollgCr  0.0644743  0.0523347   1.232 0.218328    
## NeighborhoodCrawfor  0.1842985  0.0601109   3.066 0.002243 ** 
## NeighborhoodEdwards -0.0262641  0.0559441  -0.469 0.638862    
## NeighborhoodGilbert  0.0481809  0.0550334   0.875 0.381574    
## NeighborhoodGreens   0.1003157  0.0839611   1.195 0.232526    
## NeighborhoodGrnHill  0.4227095  0.1041967   4.057 5.46e-05 ***
## NeighborhoodIDOTRR  -0.0617053  0.0617986  -0.998 0.318346    
## NeighborhoodMeadowV -0.2131498  0.0624733  -3.412 0.000678 ***
## NeighborhoodMitchel  0.1077189  0.0548759   1.963 0.050000 *  
## NeighborhoodNAmes    0.0826843  0.0543084   1.522 0.128282    
## NeighborhoodNoRidge  0.1773116  0.0561163   3.160 0.001639 ** 
## NeighborhoodNPkVill -0.0129324  0.0833957  -0.155 0.876804    
## NeighborhoodNridgHt  0.2179680  0.0539758   4.038 5.91e-05 ***
## NeighborhoodNWAmes   0.0591712  0.0557710   1.061 0.289027    
## NeighborhoodOldTown -0.0223834  0.0599983  -0.373 0.709197    
## NeighborhoodSawyer   0.0700234  0.0558962   1.253 0.210669    
## NeighborhoodSawyerW -0.0029331  0.0546950  -0.054 0.957247    
## NeighborhoodSomerst  0.0911723  0.0533145   1.710 0.087640 .  
## NeighborhoodStoneBr  0.1572116  0.0620226   2.535 0.011444 *  
## NeighborhoodSWISU   -0.0055450  0.0696201  -0.080 0.936539    
## NeighborhoodTimber   0.1829796  0.0583770   3.134 0.001785 ** 
## NeighborhoodVeenker  0.2054619  0.0672313   3.056 0.002318 ** 
## Lot.ConfigCulDSac    0.0396866  0.0204631   1.939 0.052803 .  
## Lot.ConfigFR2       -0.0539824  0.0263539  -2.048 0.040852 *  
## Lot.ConfigFR3       -0.0373977  0.0670469  -0.558 0.577149    
## Lot.ConfigInside    -0.0045445  0.0122872  -0.370 0.711587    
## Overall.Qual         0.0769272  0.0061076  12.595  < 2e-16 ***
## Overall.Cond         0.0546726  0.0046223  11.828  < 2e-16 ***
## Year.Built           0.0041750  0.0003825  10.914  < 2e-16 ***
## Bedroom.AbvGr       -0.0257164  0.0089748  -2.865 0.004275 ** 
## TotRms.AbvGrd        0.0006206  0.0057566   0.108 0.914182    
## Full.Bath           -0.0312279  0.0145077  -2.152 0.031658 *  
## Half.Bath           -0.0458998  0.0117595  -3.903 0.000103 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1293 on 795 degrees of freedom
## Multiple R-squared:  0.8905, Adjusted R-squared:  0.8853 
## F-statistic: 170.2 on 38 and 795 DF,  p-value: < 2.2e-16

From the summary of the output of the initial model, the adjusted \(R^2\) of 88.53% means that 88.53% of the variation in the log price of homes in the training data is explained by the variables in the model.


2.2.2 Section 2.2 Model Selection

Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


Our next step will be to trim down variables that limit the predictive efficacy of our model. We’ll contrast two different approaches, starting with

  • BIC backward-selection, followed by
  • AIC stepwise regression

Model One: BIC Backward-Selection Here we will utilize BIC backward-selection. Starting with the full array of regressors, we’ll remove regressors at each step until the BIC for the model is at its maximum. This method utilizes k = log(n) to penalize free parameters; thus, we’ll start by calculating k before including it in the model.

k <- log(nrow(ames_train))

basic.bic <- stepAIC(object = basic.model, 
                     k = k, 
                     direction = "backward",
                     trace = F)
summary(basic.bic)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Neighborhood + Overall.Qual + 
##     Overall.Cond + Year.Built + Bedroom.AbvGr + Half.Bath, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81247 -0.07022  0.00254  0.07092  0.53892 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.9584890  0.7806089  -1.228 0.219855    
## log(area)            0.5921246  0.0257431  23.001  < 2e-16 ***
## NeighborhoodBlueste -0.1363319  0.0907711  -1.502 0.133509    
## NeighborhoodBrDale  -0.2164677  0.0724463  -2.988 0.002894 ** 
## NeighborhoodBrkSide  0.0470206  0.0604722   0.778 0.437060    
## NeighborhoodClearCr  0.2216233  0.0655798   3.379 0.000761 ***
## NeighborhoodCollgCr  0.0767736  0.0524170   1.465 0.143404    
## NeighborhoodCrawfor  0.1937265  0.0603992   3.207 0.001392 ** 
## NeighborhoodEdwards -0.0103211  0.0560116  -0.184 0.853850    
## NeighborhoodGilbert  0.0495075  0.0552419   0.896 0.370419    
## NeighborhoodGreens   0.1490987  0.0828698   1.799 0.072365 .  
## NeighborhoodGrnHill  0.4290936  0.1047953   4.095 4.66e-05 ***
## NeighborhoodIDOTRR  -0.0500023  0.0620712  -0.806 0.420734    
## NeighborhoodMeadowV -0.2010462  0.0625700  -3.213 0.001365 ** 
## NeighborhoodMitchel  0.1231850  0.0547510   2.250 0.024725 *  
## NeighborhoodNAmes    0.1014822  0.0540927   1.876 0.061008 .  
## NeighborhoodNoRidge  0.1873107  0.0562265   3.331 0.000904 ***
## NeighborhoodNPkVill -0.0475013  0.0828138  -0.574 0.566405    
## NeighborhoodNridgHt  0.2230420  0.0541394   4.120 4.19e-05 ***
## NeighborhoodNWAmes   0.0662393  0.0559956   1.183 0.237185    
## NeighborhoodOldTown -0.0121881  0.0603080  -0.202 0.839892    
## NeighborhoodSawyer   0.0942239  0.0555920   1.695 0.090480 .  
## NeighborhoodSawyerW  0.0003273  0.0549068   0.006 0.995245    
## NeighborhoodSomerst  0.0904320  0.0534635   1.691 0.091136 .  
## NeighborhoodStoneBr  0.1700781  0.0622356   2.733 0.006418 ** 
## NeighborhoodSWISU    0.0021616  0.0698347   0.031 0.975315    
## NeighborhoodTimber   0.1908097  0.0587137   3.250 0.001203 ** 
## NeighborhoodVeenker  0.2123185  0.0666101   3.187 0.001491 ** 
## Overall.Qual         0.0779738  0.0061247  12.731  < 2e-16 ***
## Overall.Cond         0.0553607  0.0046495  11.907  < 2e-16 ***
## Year.Built           0.0040205  0.0003725  10.793  < 2e-16 ***
## Bedroom.AbvGr       -0.0294856  0.0081357  -3.624 0.000308 ***
## Half.Bath           -0.0375582  0.0111360  -3.373 0.000780 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1303 on 801 degrees of freedom
## Multiple R-squared:  0.8881, Adjusted R-squared:  0.8836 
## F-statistic: 198.7 on 32 and 801 DF,  p-value: < 2.2e-16

In this first instance, we see that three non-significant variables were removed from the full model. Unfortunately, this backward-selection does not seem to have improved our adjusted \(R^2\) value or our residual standard error. This suggests that the present model does not hold any more predictive power than the model in full.

Model Two: AIC Stepwise-Regression Our second approach will utilize AIC stepwise regression. In this method, we’ll consider adding or subtracting regressors at each step - this will allow us to consider all variables at each step, which gives us confidence that the most robust model will be selected. Unlike BIC regression, our AIC approach will utilize k = 2 to penalize free parameters; therefore, no need to calculate k prior to selecting our regressors.

basic.aic <- stepAIC(basic.model, 
                     k = 2,
                     direction = "both",
                     trace = F)
summary(basic.aic)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Neighborhood + Lot.Config + 
##     Overall.Qual + Overall.Cond + Year.Built + Bedroom.AbvGr + 
##     Full.Bath + Half.Bath, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82492 -0.06937  0.00329  0.06946  0.57216 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.363227   0.813426  -1.676 0.094149 .  
## log(area)            0.615197   0.028256  21.772  < 2e-16 ***
## NeighborhoodBlueste -0.170027   0.090604  -1.877 0.060938 .  
## NeighborhoodBrDale  -0.229311   0.072048  -3.183 0.001515 ** 
## NeighborhoodBrkSide  0.035087   0.060151   0.583 0.559841    
## NeighborhoodClearCr  0.199953   0.065409   3.057 0.002311 ** 
## NeighborhoodCollgCr  0.064186   0.052234   1.229 0.219501    
## NeighborhoodCrawfor  0.184297   0.060074   3.068 0.002229 ** 
## NeighborhoodEdwards -0.026624   0.055810  -0.477 0.633466    
## NeighborhoodGilbert  0.047985   0.054969   0.873 0.382959    
## NeighborhoodGreens   0.099608   0.083652   1.191 0.234110    
## NeighborhoodGrnHill  0.422220   0.104033   4.059 5.43e-05 ***
## NeighborhoodIDOTRR  -0.061964   0.061713  -1.004 0.315652    
## NeighborhoodMeadowV -0.213667   0.062250  -3.432 0.000629 ***
## NeighborhoodMitchel  0.107515   0.054809   1.962 0.050155 .  
## NeighborhoodNAmes    0.082460   0.054235   1.520 0.128802    
## NeighborhoodNoRidge  0.177129   0.056056   3.160 0.001638 ** 
## NeighborhoodNPkVill -0.013581   0.083127  -0.163 0.870268    
## NeighborhoodNridgHt  0.218061   0.053935   4.043 5.79e-05 ***
## NeighborhoodNWAmes   0.059118   0.055734   1.061 0.289144    
## NeighborhoodOldTown -0.022467   0.059956  -0.375 0.707969    
## NeighborhoodSawyer   0.069814   0.055828   1.251 0.211475    
## NeighborhoodSawyerW -0.003127   0.054631  -0.057 0.954367    
## NeighborhoodSomerst  0.090815   0.053178   1.708 0.088073 .  
## NeighborhoodStoneBr  0.157017   0.061958   2.534 0.011459 *  
## NeighborhoodSWISU   -0.006106   0.069382  -0.088 0.929892    
## NeighborhoodTimber   0.182825   0.058323   3.135 0.001784 ** 
## NeighborhoodVeenker  0.205168   0.067134   3.056 0.002317 ** 
## Lot.ConfigCulDSac    0.039716   0.020449   1.942 0.052457 .  
## Lot.ConfigFR2       -0.054150   0.026291  -2.060 0.039759 *  
## Lot.ConfigFR3       -0.037472   0.067002  -0.559 0.576136    
## Lot.ConfigInside    -0.004517   0.012277  -0.368 0.713012    
## Overall.Qual         0.076918   0.006103  12.603  < 2e-16 ***
## Overall.Cond         0.054671   0.004619  11.835  < 2e-16 ***
## Year.Built           0.004174   0.000382  10.925  < 2e-16 ***
## Bedroom.AbvGr       -0.025336   0.008245  -3.073 0.002193 ** 
## Full.Bath           -0.031154   0.014482  -2.151 0.031765 *  
## Half.Bath           -0.045974   0.011732  -3.919 9.67e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1293 on 796 degrees of freedom
## Multiple R-squared:  0.8905, Adjusted R-squared:  0.8854 
## F-statistic:   175 on 37 and 796 DF,  p-value: < 2.2e-16

The remaining coefficients in this latter model are nearly identical to those retained in the BIC model, except for its inclusion of Lot.Config as a predictive variable. We see a slight improvement in adjusted \(R^2\) and residual standard error relative to the full model. Given our three options - the full model, BIC model, and AIC model - the latter approach grants us the most robust predictive power.

We’ll proceed by using our AIC Stepwise Regression model.


2.2.3 Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


Next, we’ll plot the residuals for the selected model. This will give us some indication of its predictive accuracy; if we observe a non-normative distribution of residual values, it may suggest that the model does a poor job predicting price values.

par(mfrow = c(2,2))
plot(basic.aic, which = 1, pch = 16, cex = 0.7, col = "orange")
plot(basic.aic, which = 2, pch = 16, cex = 0.7, col = "orange")
hist(basic.aic$resid, col = "orange")
plot(basic.aic$resid, pch = 16, cex = 0.7, col = "orange")

As we see from the plots above, the residuals in this plot satisfy the assumptions of linear regression. The residuals:

  1. Appear relatively symmetrical and are centered near the middle of the plot
  2. Are clustered near 0 on the y-axis, and
  3. Do not display a clear pattern or shape
  4. Do not display a repeated pattern (hence the appear indepenedent)

Between the robust \(R^2\) value and the normal distribution of residual values, we can move forward with confidence in this model’s predictive efficacy.

plot(basic.aic, which = 4, col = "orange")

Cook’s distance plot shows the most influential (leveraging) outliers (observation 611, 90, 325). This linear model might not include critical variables to predict these houses. Therefore, adding more variables might resolve this outlier issue. * * *

2.2.4 Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


Next, we’ll check our chosen model’s RMSE. This will give us a basic estimate of how our model will perform when applied to data outside of the training set.

basic.predictions <- exp(predict(basic.aic, ames_train))

basic.resids <- ames_train$price - basic.predictions

basic.rmse <- sqrt(mean(basic.resids^2))

paste("Basic Model RMSE: $", round(basic.rmse,2), sep = "")
## [1] "Basic Model RMSE: $25404.73"

The RMSE for the present model translates to $25,404.73. This value may be interpreted as the standard deviation of the unexplained variance (or prediction errors) in the present model. Considering the range of home prices in the current data set - 12,789 dollars to 615,000 dollars - this RMSE value is acceptable, as it is relatively low in the context of the grand data set. * * *

2.2.5 Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


Since the basic.aic model does not include the Landmark neighborhood as a factor level, we must first clean the set so the data and model are homogenous. We’ll then use the model to predict log(price) values in the test data set, before adding the predicted values to the original dataframe for comparison.

ames_test <- ames_test %>% 
        filter(Neighborhood != "Landmrk")

test.values <- exp(predict(basic.aic, ames_test))

ames_test['predicted'] = test.values
ggplot(ames_test, aes(x = price, y = predicted)) +
        geom_point(alpha = 0.45) +
        geom_smooth() +
        labs(x = "Actual Values",
             y = "Predicted Vales")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Applying the selected model to the test data set provides evidence for its effectiveness. We see a tight cluster of data around the line of best fit, with very few data points straying too far from the line * * *

2.3 Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

2.3.1 Section 3.1 Final Model

Provide the summary table for your model.


To begin building our final model, we’ll need to first identify and remove variables that will not significantly contribute to predicting home prices.

# Identify variables with lots of missing data
which(sapply(ames_train, function(x) (sum(is.na(x)) >= 50)))
## Lot.Frontage        Alley Fireplace.Qu      Pool.QC        Fence Misc.Feature 
##            6            9           59           74           75           76
# Identify factors with < 2 unique levels ... add these to running list
which(sapply(ames_train, function(x) (is.factor(x)) & length(unique(x))<2)) 
## Utilities 
##        12
# Assign poorly represented variables to string vector + create new dataset without these variables included
bad.variables <- c("Lot.Frontage", "Alley", "Fireplace.Qu", "Pool.QC", "Fence", "Misc.Feature", "Utilities")

clean.data <- ames_train[, !(names(ames_train) %in% bad.variables)]

# Remove incomplete cases from new data set
clean.data <- na.omit(clean.data)
dim(clean.data)
## [1] 778  74

We now have an improved data set (cleaned) but with 778 observations.

# Base model including all variables (besides instance identifier, which is meaningless at a predictive level)
final.model <- lm(log(price) ~ . -PID, 
                  data = clean.data)

# Replace skewed variables with natural log-transformed versions
final.model <- update(final.model, . ~ . -area -Lot.Area -X1st.Flr.SF
                      +log(area) +log(Lot.Area) +log(X1st.Flr.SF))

# Remove extraneous variables via stepwise regression
final.step <- stepAIC(final.model,
                      direction = "both", 
                      k = 2,
                      trace = F)

initial.call <- final.step$call

# Print adjusted R^2 value for reduced model
initial.r2 <- summary(final.step)$adj.r.squared

paste(round(initial.r2 * 100, 2), "% Explained Variance", sep = "")
## [1] "96.02% Explained Variance"

Given the maximum number of coefficients allowed for this project to be 20, we will work to cut down on extraneous coefficients while maximizing the model’s adjusted \(R^2\) value

summary(final.step)
## 
## Call:
## lm(formula = log(price) ~ MS.Zoning + Neighborhood + Condition.1 + 
##     Condition.2 + Bldg.Type + Overall.Qual + Overall.Cond + Year.Built + 
##     Year.Remod.Add + Exterior.1st + Exterior.2nd + Mas.Vnr.Type + 
##     Mas.Vnr.Area + Exter.Qual + Exter.Cond + Bsmt.Qual + Bsmt.Exposure + 
##     BsmtFin.Type.1 + BsmtFin.SF.1 + BsmtFin.SF.2 + Bsmt.Unf.SF + 
##     Heating + Heating.QC + Central.Air + X2nd.Flr.SF + Low.Qual.Fin.SF + 
##     Bsmt.Full.Bath + Kitchen.AbvGr + Kitchen.Qual + Functional + 
##     Fireplaces + Garage.Type + Garage.Finish + Garage.Cars + 
##     Garage.Area + Garage.Qual + Garage.Cond + Paved.Drive + Wood.Deck.SF + 
##     Open.Porch.SF + X3Ssn.Porch + Screen.Porch + Sale.Type + 
##     log(area) + log(Lot.Area) + log(X1st.Flr.SF), data = clean.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31667 -0.03754  0.00262  0.04157  0.21983 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          8.014e-02  8.642e-01   0.093 0.926144    
## MS.ZoningFV          5.180e-01  5.981e-02   8.660  < 2e-16 ***
## MS.ZoningI (all)     5.833e-01  1.107e-01   5.271 1.87e-07 ***
## MS.ZoningRH          4.435e-01  6.374e-02   6.957 8.77e-12 ***
## MS.ZoningRL          5.437e-01  5.068e-02  10.728  < 2e-16 ***
## MS.ZoningRM          4.883e-01  4.796e-02  10.182  < 2e-16 ***
## NeighborhoodBlueste  6.227e-03  6.028e-02   0.103 0.917756    
## NeighborhoodBrDale  -4.696e-03  5.228e-02  -0.090 0.928459    
## NeighborhoodBrkSide  6.289e-02  4.392e-02   1.432 0.152637    
## NeighborhoodClearCr -1.437e-02  4.449e-02  -0.323 0.746726    
## NeighborhoodCollgCr -5.293e-02  3.565e-02  -1.485 0.138131    
## NeighborhoodCrawfor  1.165e-01  4.057e-02   2.870 0.004241 ** 
## NeighborhoodEdwards -6.834e-02  3.763e-02  -1.816 0.069806 .  
## NeighborhoodGilbert -4.711e-02  3.684e-02  -1.279 0.201447    
## NeighborhoodGreens   4.945e-02  5.198e-02   0.951 0.341849    
## NeighborhoodGrnHill  4.807e-01  8.242e-02   5.833 8.75e-09 ***
## NeighborhoodIDOTRR   2.211e-02  4.942e-02   0.447 0.654788    
## NeighborhoodMeadowV -1.300e-01  5.144e-02  -2.528 0.011720 *  
## NeighborhoodMitchel -2.760e-02  3.736e-02  -0.739 0.460317    
## NeighborhoodNAmes   -3.919e-02  3.730e-02  -1.051 0.293808    
## NeighborhoodNoRidge -2.320e-02  3.867e-02  -0.600 0.548737    
## NeighborhoodNPkVill  1.542e-02  8.266e-02   0.187 0.852060    
## NeighborhoodNridgHt  8.585e-03  3.674e-02   0.234 0.815327    
## NeighborhoodNWAmes  -8.565e-02  3.905e-02  -2.194 0.028638 *  
## NeighborhoodOldTown -7.344e-03  4.476e-02  -0.164 0.869706    
## NeighborhoodSawyer  -7.792e-03  3.811e-02  -0.204 0.838057    
## NeighborhoodSawyerW -7.425e-02  3.712e-02  -2.000 0.045927 *  
## NeighborhoodSomerst  4.243e-02  4.524e-02   0.938 0.348651    
## NeighborhoodStoneBr  1.104e-03  3.947e-02   0.028 0.977687    
## NeighborhoodSWISU    5.404e-03  4.607e-02   0.117 0.906668    
## NeighborhoodTimber  -2.997e-02  3.967e-02  -0.755 0.450277    
## NeighborhoodVeenker -4.417e-03  4.372e-02  -0.101 0.919561    
## Condition.1Feedr     1.406e-02  2.796e-02   0.503 0.615306    
## Condition.1Norm      8.024e-02  2.402e-02   3.341 0.000883 ***
## Condition.1PosA      1.105e-01  3.913e-02   2.825 0.004886 ** 
## Condition.1PosN      9.374e-02  3.709e-02   2.527 0.011739 *  
## Condition.1RRAe      8.164e-02  4.081e-02   2.000 0.045899 *  
## Condition.1RRAn      4.338e-02  4.347e-02   0.998 0.318650    
## Condition.1RRNe     -7.907e-02  6.406e-02  -1.234 0.217556    
## Condition.1RRNn     -1.097e-02  5.871e-02  -0.187 0.851813    
## Condition.2Feedr     9.643e-02  1.084e-01   0.890 0.373998    
## Condition.2Norm      2.184e-01  1.023e-01   2.135 0.033110 *  
## Condition.2PosN      1.813e-01  1.343e-01   1.349 0.177727    
## Condition.2RRNn      2.665e-01  1.321e-01   2.018 0.043980 *  
## Bldg.Type2fmCon     -1.804e-02  3.041e-02  -0.593 0.553145    
## Bldg.TypeDuplex     -1.280e-01  3.162e-02  -4.046 5.86e-05 ***
## Bldg.TypeTwnhs      -3.296e-02  2.475e-02  -1.332 0.183421    
## Bldg.TypeTwnhsE     -2.450e-02  1.700e-02  -1.442 0.149860    
## Overall.Qual         4.372e-02  4.710e-03   9.281  < 2e-16 ***
## Overall.Cond         4.221e-02  3.960e-03  10.660  < 2e-16 ***
## Year.Built           2.770e-03  3.522e-04   7.864 1.64e-14 ***
## Year.Remod.Add       7.042e-04  2.466e-04   2.856 0.004435 ** 
## Exterior.1stBrkComm  9.435e-02  1.089e-01   0.867 0.386548    
## Exterior.1stBrkFace  3.795e-02  5.467e-02   0.694 0.487897    
## Exterior.1stCemntBd -4.035e-02  8.254e-02  -0.489 0.625104    
## Exterior.1stHdBoard -6.979e-02  5.391e-02  -1.295 0.195928    
## Exterior.1stImStucc -1.709e-02  9.631e-02  -0.177 0.859240    
## Exterior.1stMetalSd  2.859e-02  6.281e-02   0.455 0.649150    
## Exterior.1stPlywood -6.052e-02  5.226e-02  -1.158 0.247280    
## Exterior.1stStucco  -7.600e-03  6.425e-02  -0.118 0.905872    
## Exterior.1stVinylSd -8.988e-02  5.302e-02  -1.695 0.090540 .  
## Exterior.1stWd Sdng -8.148e-02  5.096e-02  -1.599 0.110353    
## Exterior.1stWdShing -5.048e-02  5.531e-02  -0.913 0.361815    
## Exterior.2ndBrk Cmn  6.139e-02  1.057e-01   0.581 0.561457    
## Exterior.2ndBrkFace  1.268e-01  6.308e-02   2.011 0.044800 *  
## Exterior.2ndCmentBd  1.775e-01  8.554e-02   2.075 0.038392 *  
## Exterior.2ndHdBoard  1.530e-01  5.930e-02   2.581 0.010091 *  
## Exterior.2ndImStucc  1.233e-01  6.533e-02   1.887 0.059622 .  
## Exterior.2ndMetalSd  1.011e-01  6.715e-02   1.505 0.132771    
## Exterior.2ndPlywood  1.329e-01  5.704e-02   2.329 0.020155 *  
## Exterior.2ndStucco   7.436e-02  7.231e-02   1.028 0.304219    
## Exterior.2ndVinylSd  1.821e-01  5.664e-02   3.214 0.001376 ** 
## Exterior.2ndWd Sdng  1.850e-01  5.581e-02   3.315 0.000968 ***
## Exterior.2ndWd Shng  1.831e-01  5.834e-02   3.139 0.001774 ** 
## Mas.Vnr.TypeBrkFace  2.529e-02  4.005e-02   0.632 0.527898    
## Mas.Vnr.TypeNone     4.469e-02  4.044e-02   1.105 0.269526    
## Mas.Vnr.TypeStone    2.083e-02  4.188e-02   0.497 0.619025    
## Mas.Vnr.Area         6.602e-05  2.559e-05   2.580 0.010121 *  
## Exter.QualFa        -4.605e-02  4.637e-02  -0.993 0.321056    
## Exter.QualGd        -5.407e-02  2.620e-02  -2.064 0.039459 *  
## Exter.QualTA        -7.876e-02  2.838e-02  -2.775 0.005679 ** 
## Exter.CondFa        -8.544e-02  5.662e-02  -1.509 0.131793    
## Exter.CondGd        -3.039e-02  4.898e-02  -0.620 0.535206    
## Exter.CondTA        -1.203e-02  4.829e-02  -0.249 0.803281    
## Bsmt.QualFa         -2.854e-02  2.983e-02  -0.957 0.338946    
## Bsmt.QualGd         -1.207e-02  1.521e-02  -0.794 0.427782    
## Bsmt.QualPo          1.024e+00  1.332e-01   7.690 5.73e-14 ***
## Bsmt.QualTA         -1.317e-02  1.858e-02  -0.709 0.478862    
## Bsmt.ExposureAv      7.033e-02  7.410e-02   0.949 0.342927    
## Bsmt.ExposureGd      1.032e-01  7.485e-02   1.379 0.168271    
## Bsmt.ExposureMn      4.553e-02  7.444e-02   0.612 0.541017    
## Bsmt.ExposureNo      6.184e-02  7.395e-02   0.836 0.403331    
## BsmtFin.Type.1BLQ   -2.177e-02  1.187e-02  -1.835 0.067038 .  
## BsmtFin.Type.1GLQ    8.842e-03  1.047e-02   0.844 0.398743    
## BsmtFin.Type.1LwQ   -5.131e-02  1.530e-02  -3.355 0.000843 ***
## BsmtFin.Type.1Rec   -2.791e-02  1.175e-02  -2.375 0.017829 *  
## BsmtFin.Type.1Unf   -1.017e-02  1.244e-02  -0.818 0.413738    
## BsmtFin.SF.1         1.402e-04  2.033e-05   6.898 1.30e-11 ***
## BsmtFin.SF.2         1.293e-04  2.453e-05   5.272 1.86e-07 ***
## Bsmt.Unf.SF          7.441e-05  1.887e-05   3.943 8.96e-05 ***
## HeatingGasW          1.878e-01  4.215e-02   4.454 9.96e-06 ***
## HeatingGrav          3.648e-01  1.104e-01   3.306 0.001001 ** 
## Heating.QCFa        -6.170e-02  2.473e-02  -2.495 0.012854 *  
## Heating.QCGd        -1.299e-02  9.197e-03  -1.412 0.158324    
## Heating.QCPo        -1.087e-01  8.914e-02  -1.219 0.223289    
## Heating.QCTA        -1.568e-02  8.928e-03  -1.756 0.079559 .  
## Central.AirY         9.812e-02  2.368e-02   4.145 3.87e-05 ***
## X2nd.Flr.SF          2.143e-04  3.020e-05   7.093 3.56e-12 ***
## Low.Qual.Fin.SF      1.322e-04  6.656e-05   1.986 0.047421 *  
## Bsmt.Full.Bath       2.520e-02  7.798e-03   3.231 0.001297 ** 
## Kitchen.AbvGr        7.386e-02  3.206e-02   2.304 0.021577 *  
## Kitchen.QualFa      -1.053e-01  3.193e-02  -3.296 0.001035 ** 
## Kitchen.QualGd      -5.801e-02  1.890e-02  -3.069 0.002242 ** 
## Kitchen.QualPo       3.168e-01  9.516e-02   3.330 0.000921 ***
## Kitchen.QualTA      -8.161e-02  2.045e-02  -3.991 7.35e-05 ***
## FunctionalMin1       1.659e-01  5.643e-02   2.939 0.003415 ** 
## FunctionalMin2       1.347e-01  5.526e-02   2.438 0.015027 *  
## FunctionalMod        2.912e-02  5.723e-02   0.509 0.611003    
## FunctionalTyp        1.885e-01  5.223e-02   3.609 0.000331 ***
## Fireplaces           1.716e-02  5.922e-03   2.897 0.003898 ** 
## Garage.TypeAttchd    9.580e-02  3.081e-02   3.109 0.001964 ** 
## Garage.TypeBasment   9.528e-02  4.411e-02   2.160 0.031138 *  
## Garage.TypeBuiltIn   6.881e-02  3.428e-02   2.007 0.045165 *  
## Garage.TypeCarPort   2.708e-02  8.217e-02   0.330 0.741851    
## Garage.TypeDetchd    8.461e-02  3.136e-02   2.698 0.007174 ** 
## Garage.FinishRFn    -1.610e-02  9.053e-03  -1.779 0.075767 .  
## Garage.FinishUnf    -8.284e-04  1.051e-02  -0.079 0.937208    
## Garage.Cars          2.982e-02  1.087e-02   2.743 0.006260 ** 
## Garage.Area          4.878e-05  3.693e-05   1.321 0.187094    
## Garage.QualFa       -9.435e-02  8.429e-02  -1.119 0.263426    
## Garage.QualGd       -7.712e-02  9.687e-02  -0.796 0.426280    
## Garage.QualPo       -6.011e-01  1.267e-01  -4.746 2.58e-06 ***
## Garage.QualTA       -8.945e-02  8.237e-02  -1.086 0.277898    
## Garage.CondFa       -3.456e-02  2.742e-02  -1.260 0.208058    
## Garage.CondGd       -6.601e-03  4.625e-02  -0.143 0.886555    
## Garage.CondPo        1.661e-01  5.643e-02   2.944 0.003361 ** 
## Garage.CondTA               NA         NA      NA       NA    
## Paved.DriveP        -4.977e-02  2.190e-02  -2.273 0.023369 *  
## Paved.DriveY        -1.493e-02  1.850e-02  -0.807 0.420002    
## Wood.Deck.SF         6.604e-05  2.636e-05   2.505 0.012493 *  
## Open.Porch.SF        1.655e-04  5.022e-05   3.296 0.001035 ** 
## X3Ssn.Porch          2.034e-04  9.403e-05   2.163 0.030951 *  
## Screen.Porch         1.789e-04  5.418e-05   3.302 0.001014 ** 
## Sale.TypeCon         5.147e-02  4.469e-02   1.152 0.249908    
## Sale.TypeConLD       1.798e-01  5.115e-02   3.516 0.000469 ***
## Sale.TypeConLI       4.706e-03  5.793e-02   0.081 0.935273    
## Sale.TypeConLw       7.613e-02  4.132e-02   1.843 0.065844 .  
## Sale.TypeCWD         4.915e-02  5.874e-02   0.837 0.403080    
## Sale.TypeOth        -1.723e-01  8.082e-02  -2.132 0.033373 *  
## Sale.TypeVWD         8.075e-03  8.119e-02   0.099 0.920810    
## Sale.TypeWD          2.027e-02  2.132e-02   0.951 0.341953    
## log(area)            1.029e-01  4.565e-02   2.255 0.024468 *  
## log(Lot.Area)        5.846e-02  1.104e-02   5.296 1.65e-07 ***
## log(X1st.Flr.SF)     2.611e-01  3.993e-02   6.538 1.30e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07256 on 625 degrees of freedom
## Multiple R-squared:  0.968,  Adjusted R-squared:  0.9602 
## F-statistic: 124.2 on 152 and 625 DF,  p-value: < 2.2e-16

We’ll anchor on the most-significant level per factor. As long as one level of the factor is highly robust, we’ll consider it important to the efficacy of the model (at least for the time being).

final.step <- update(final.step, . ~ . -Lot.Shape -Lot.Config -Condition.1 -Condition.2 -Year.Remod.Add
                     -Exter.Qual -Exter.Cond -BsmtExposure -Heating -X2nd.Flr.SF -Bedroom.AbvGr -Functional
                     -Fireplaces -Garage.Cars -Wood.Deck.SF -Enclosed.Porch -X3Ssn.Porch -Sale.Type)

x <- (summary(final.step)$adj.r.squared) * 100

paste(round(x, 2), "% Explained Variance", sep = "")
## [1] "94.61% Explained Variance"
final.step <- update(final.step, . ~ . -Bsmt.Exposure -  Mas.Vnr.Type- Mas.Vnr.Area-BsmtFin.SF.2 -Bsmt.Unf.SF -Heating.QC)

x <- (summary(final.step)$adj.r.squared) * 100

paste(round(x, 2), "% Explained Variance", sep = "")
## [1] "94.28% Explained Variance"

Our final model explains 94.28% of the variance in house prices in Ames, IA.

summary(final.step)
## 
## Call:
## lm(formula = log(price) ~ MS.Zoning + Neighborhood + Bldg.Type + 
##     Overall.Qual + Overall.Cond + Year.Built + Exterior.1st + 
##     Exterior.2nd + Bsmt.Qual + BsmtFin.Type.1 + BsmtFin.SF.1 + 
##     Central.Air + Low.Qual.Fin.SF + Bsmt.Full.Bath + Kitchen.AbvGr + 
##     Kitchen.Qual + Garage.Type + Garage.Finish + Garage.Area + 
##     Garage.Qual + Garage.Cond + Paved.Drive + Open.Porch.SF + 
##     Screen.Porch + log(area) + log(Lot.Area) + log(X1st.Flr.SF), 
##     data = clean.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42182 -0.04496  0.00283  0.04650  0.41391 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.495e-01  8.092e-01   0.432 0.665960    
## MS.ZoningFV          4.015e-01  6.439e-02   6.235 7.93e-10 ***
## MS.ZoningI (all)     2.942e-01  1.065e-01   2.761 0.005914 ** 
## MS.ZoningRH          2.423e-01  7.072e-02   3.427 0.000648 ***
## MS.ZoningRL          4.026e-01  5.499e-02   7.321 6.99e-13 ***
## MS.ZoningRM          3.437e-01  5.196e-02   6.614 7.58e-11 ***
## NeighborhoodBlueste -4.560e-02  6.801e-02  -0.671 0.502728    
## NeighborhoodBrDale  -5.145e-02  5.876e-02  -0.876 0.381592    
## NeighborhoodBrkSide -2.226e-02  4.942e-02  -0.451 0.652462    
## NeighborhoodClearCr -2.734e-02  5.161e-02  -0.530 0.596399    
## NeighborhoodCollgCr -8.909e-02  4.062e-02  -2.193 0.028637 *  
## NeighborhoodCrawfor  6.461e-02  4.614e-02   1.400 0.161871    
## NeighborhoodEdwards -1.257e-01  4.285e-02  -2.932 0.003477 ** 
## NeighborhoodGilbert -8.216e-02  4.228e-02  -1.943 0.052439 .  
## NeighborhoodGreens   7.570e-02  5.941e-02   1.274 0.203026    
## NeighborhoodGrnHill  4.157e-01  9.692e-02   4.289 2.06e-05 ***
## NeighborhoodIDOTRR  -6.240e-02  5.492e-02  -1.136 0.256330    
## NeighborhoodMeadowV -1.577e-01  5.847e-02  -2.698 0.007159 ** 
## NeighborhoodMitchel -7.832e-02  4.257e-02  -1.840 0.066220 .  
## NeighborhoodNAmes   -9.857e-02  4.253e-02  -2.318 0.020767 *  
## NeighborhoodNoRidge -8.747e-03  4.307e-02  -0.203 0.839133    
## NeighborhoodNPkVill  1.347e-02  9.700e-02   0.139 0.889582    
## NeighborhoodNridgHt  9.746e-04  4.107e-02   0.024 0.981077    
## NeighborhoodNWAmes  -1.373e-01  4.359e-02  -3.150 0.001705 ** 
## NeighborhoodOldTown -6.347e-02  5.062e-02  -1.254 0.210292    
## NeighborhoodSawyer  -8.762e-02  4.341e-02  -2.019 0.043919 *  
## NeighborhoodSawyerW -1.126e-01  4.248e-02  -2.651 0.008217 ** 
## NeighborhoodSomerst -2.017e-02  4.887e-02  -0.413 0.679953    
## NeighborhoodStoneBr -3.081e-02  4.516e-02  -0.682 0.495306    
## NeighborhoodSWISU   -6.701e-02  5.337e-02  -1.255 0.209738    
## NeighborhoodTimber  -6.934e-02  4.502e-02  -1.540 0.123957    
## NeighborhoodVeenker -6.111e-02  5.035e-02  -1.214 0.225318    
## Bldg.Type2fmCon     -3.117e-02  3.382e-02  -0.922 0.357064    
## Bldg.TypeDuplex     -1.565e-01  3.573e-02  -4.381 1.37e-05 ***
## Bldg.TypeTwnhs      -3.863e-02  2.761e-02  -1.399 0.162199    
## Bldg.TypeTwnhsE     -2.074e-02  1.930e-02  -1.075 0.282871    
## Overall.Qual         5.314e-02  5.068e-03  10.484  < 2e-16 ***
## Overall.Cond         4.722e-02  4.083e-03  11.565  < 2e-16 ***
## Year.Built           3.227e-03  3.862e-04   8.355 3.72e-16 ***
## Exterior.1stBrkComm  2.236e-01  1.204e-01   1.857 0.063728 .  
## Exterior.1stBrkFace  7.032e-02  6.277e-02   1.120 0.262964    
## Exterior.1stCemntBd -4.562e-02  9.329e-02  -0.489 0.624976    
## Exterior.1stHdBoard -3.245e-02  6.118e-02  -0.530 0.596015    
## Exterior.1stImStucc -9.805e-02  1.121e-01  -0.875 0.381927    
## Exterior.1stMetalSd  3.181e-02  7.047e-02   0.451 0.651830    
## Exterior.1stPlywood -2.628e-02  5.914e-02  -0.444 0.656914    
## Exterior.1stStucco  -9.856e-03  7.146e-02  -0.138 0.890336    
## Exterior.1stVinylSd -8.072e-02  6.100e-02  -1.323 0.186194    
## Exterior.1stWd Sdng -5.266e-02  5.758e-02  -0.915 0.360751    
## Exterior.1stWdShing -5.048e-02  6.263e-02  -0.806 0.420542    
## Exterior.2ndBrk Cmn -2.867e-02  1.233e-01  -0.233 0.816202    
## Exterior.2ndBrkFace  3.272e-02  7.193e-02   0.455 0.649319    
## Exterior.2ndCmentBd  1.227e-01  9.636e-02   1.273 0.203320    
## Exterior.2ndHdBoard  5.366e-02  6.717e-02   0.799 0.424602    
## Exterior.2ndImStucc  7.068e-02  7.486e-02   0.944 0.345403    
## Exterior.2ndMetalSd  3.433e-02  7.549e-02   0.455 0.649417    
## Exterior.2ndPlywood  2.970e-02  6.445e-02   0.461 0.645075    
## Exterior.2ndStucco   1.052e-01  8.059e-02   1.306 0.192122    
## Exterior.2ndVinylSd  1.097e-01  6.488e-02   1.690 0.091393 .  
## Exterior.2ndWd Sdng  1.044e-01  6.331e-02   1.649 0.099518 .  
## Exterior.2ndWd Shng  1.087e-01  6.647e-02   1.635 0.102452    
## Bsmt.QualFa         -9.881e-02  3.361e-02  -2.940 0.003399 ** 
## Bsmt.QualGd         -3.587e-02  1.759e-02  -2.040 0.041779 *  
## Bsmt.QualPo          7.161e-01  1.414e-01   5.064 5.29e-07 ***
## Bsmt.QualTA         -4.318e-02  2.140e-02  -2.017 0.044044 *  
## BsmtFin.Type.1BLQ   -2.474e-02  1.339e-02  -1.847 0.065130 .  
## BsmtFin.Type.1GLQ    5.098e-03  1.188e-02   0.429 0.667983    
## BsmtFin.Type.1LwQ   -4.295e-02  1.750e-02  -2.455 0.014336 *  
## BsmtFin.Type.1Rec   -4.125e-02  1.339e-02  -3.080 0.002152 ** 
## BsmtFin.Type.1Unf   -1.459e-02  1.415e-02  -1.031 0.302838    
## BsmtFin.SF.1         6.501e-05  1.374e-05   4.730 2.74e-06 ***
## Central.AirY         9.648e-02  2.394e-02   4.030 6.21e-05 ***
## Low.Qual.Fin.SF     -8.282e-05  6.868e-05  -1.206 0.228276    
## Bsmt.Full.Bath       4.270e-02  8.548e-03   4.995 7.49e-07 ***
## Kitchen.AbvGr        8.884e-02  3.551e-02   2.502 0.012582 *  
## Kitchen.QualFa      -1.128e-01  3.497e-02  -3.226 0.001314 ** 
## Kitchen.QualGd      -8.090e-02  1.952e-02  -4.144 3.84e-05 ***
## Kitchen.QualPo       5.574e-02  1.015e-01   0.549 0.583182    
## Kitchen.QualTA      -1.190e-01  2.144e-02  -5.551 4.08e-08 ***
## Garage.TypeAttchd    8.783e-02  3.524e-02   2.492 0.012933 *  
## Garage.TypeBasment   7.970e-02  4.825e-02   1.652 0.099045 .  
## Garage.TypeBuiltIn   6.236e-02  3.864e-02   1.614 0.106972    
## Garage.TypeCarPort   3.612e-02  9.577e-02   0.377 0.706213    
## Garage.TypeDetchd    8.175e-02  3.583e-02   2.282 0.022821 *  
## Garage.FinishRFn    -1.843e-02  1.037e-02  -1.777 0.075956 .  
## Garage.FinishUnf    -1.070e-02  1.195e-02  -0.895 0.371284    
## Garage.Area          1.531e-04  2.589e-05   5.912 5.35e-09 ***
## Garage.QualFa       -8.644e-02  9.581e-02  -0.902 0.367282    
## Garage.QualGd       -1.538e-01  1.069e-01  -1.439 0.150753    
## Garage.QualPo       -4.479e-01  1.294e-01  -3.461 0.000572 ***
## Garage.QualTA       -1.090e-01  9.400e-02  -1.160 0.246505    
## Garage.CondFa       -6.430e-02  2.964e-02  -2.169 0.030426 *  
## Garage.CondGd        2.127e-02  5.097e-02   0.417 0.676594    
## Garage.CondPo        1.517e-01  6.022e-02   2.519 0.012015 *  
## Garage.CondTA               NA         NA      NA       NA    
## Paved.DriveP        -6.128e-02  2.501e-02  -2.450 0.014540 *  
## Paved.DriveY        -7.179e-03  2.023e-02  -0.355 0.722844    
## Open.Porch.SF        1.568e-04  5.777e-05   2.715 0.006795 ** 
## Screen.Porch         1.810e-04  6.130e-05   2.952 0.003267 ** 
## log(area)            4.004e-01  1.760e-02  22.752  < 2e-16 ***
## log(Lot.Area)        8.447e-02  1.195e-02   7.067 3.95e-12 ***
## log(X1st.Flr.SF)     8.246e-02  1.810e-02   4.555 6.21e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08696 on 677 degrees of freedom
## Multiple R-squared:  0.9502, Adjusted R-squared:  0.9428 
## F-statistic:   129 on 100 and 677 DF,  p-value: < 2.2e-16

2.3.2 Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


The area and lot area of a home should be somewhat predictive of its eventual sales price - larger homes ought to command higher prices. This relationship is moderately robust: .

paste("Correlation between Price and Area:", round(cor(log(ames_train$price), ames_train$area), 2))
## [1] "Correlation between Price and Area: 0.75"
area.values <- ames_train %>% 
                summarise(normal.area.median = median(area),
                  normal.area.mean = mean(area),
                  log.area.median = median(log(area)),
                  log.area.mean = mean(log(area)))
kable(area.values) %>%
        kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
normal.area.median normal.area.mean log.area.median log.area.mean
1383 1450.048 7.23201 7.224067
ggplot(ames_train, aes(x = area)) +
        geom_histogram(color = 'white') +
        labs(x = "Area",
             y = "")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(ames_train, aes(x = log(area))) +
        geom_histogram(color = 'white') +
        labs(x = "Log(Area)",
             y = "")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

* * *

2.3.3 Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


The present model does not include any variable interactions. This is mainly for the sake of parsimony - we achieved a highly robust \(R^2\) value without including any interactions, and ultimately the simpler and more streamlined a model the better. While it’s theoretically possible that an interaction would marginally improve the efficacy of the model, this parsimonious approach has facilitated a robust model as is. * * *

2.3.4 Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


We arrived at the final model using AIC stepwise regression, in which each regressor is considered at every step - this method is more effective than backward-only-selection, as it does not permanently remove a regressor from consideration. Using the basic.aic model developed in section two as a baseline, the final.step model starts with the majority of exploratory variables considered, and adds and removes variables stepwise until the most robust model possible remains. The addition of these addition variables makes a quantitative difference in the efficacy of the model - see the difference between the basic and final models below:

basic <- round(summary(basic.aic)$adj.r.squared * 100, 2)
final <- round(summary(final.step)$adj.r.squared * 100, 2)
improved <- final - basic

paste("Basic Model Explained Variance: ", basic, "%", sep = "")
## [1] "Basic Model Explained Variance: 88.54%"
paste("Final Model Explained Variance: ", final, "%", sep = "")
## [1] "Final Model Explained Variance: 94.28%"
paste("Improvement: ", improved, "%", sep = "")
## [1] "Improvement: 5.73999999999999%"

A 5.7% increase in explained variance is significant, and certainly provides evidence to support our methodological approach.

Notably, the stepwise regression model initially yielded a model with even more predictive power than our final model. Eliminating regressors to comply with the requirements of this assignment deteriorated the adjusted R^2 value of the final model, albeit slightly.

massive <- round(initial.r2 * 100, 2)
paste("Final Model Without Variable Subtraction: ", massive, "% Explained Variance", sep = "")
## [1] "Final Model Without Variable Subtraction: 96.02% Explained Variance"

2.3.5 Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


Our basic model was quite successful in predicting real estate prices in Ames. However, an RMSE value of $25404.73 is certainly not an insignificant average squared residual - this suggests that the average prediction error is higher than we would like. All things being equal, we’d hope that adding more significant regressors would afford us a boost in \(R^2\), which would reduce our average residuals.

ames_test <- ames_test %>% 
        filter(Exterior.2nd != "AsphShn" & Exterior.2nd != "Stone")

ames_test['final.predictions'] <- exp(predict(final.step, ames_test))
ggplot(ames_test, aes(x = final.predictions)) +
        geom_histogram(color = 'white') +
        labs(x = "Predicted Prices via Final Model", y = "")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 54 rows containing non-finite values (stat_bin).

ggplot(ames_test, aes(x = price, y = final.predictions)) +
        geom_point(alpha = 0.45) +
        geom_smooth() +
        labs(x = "Actual Prices", y = "Predicted Prices")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 54 rows containing non-finite values (stat_smooth).
## Warning: Removed 54 rows containing missing values (geom_point).

* * *

2.4 Part 4 Final Model Assessment

2.4.1 Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.


Here, we plot the distribution of residual values in the final model.

ggplot(final.step, aes(x = .resid)) +
        geom_histogram(color = 'white', binwidth = 0.05) +
        labs(x = "Distribution of Residuals")

ggplot(final.step, aes(x = .fitted, y = .resid)) +
        geom_point(alpha = 0.45) +
        geom_hline(yintercept = 0) +
        labs(x = "Fitted Values",
             y = "Residual Values")

Much like our training model, the residuals in the final model are clearly normally distributed. Plotted against the fitted values, the residuals in this model show no defined pattern, are centered around the y-intercept line, and appear to be fairly symmetrical. This suggests that the final model’s predictive power does not have a “blind spot”, so to speak.


2.4.2 Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


final.predictions <- exp(predict(final.step, ames_test))

final.resids <- ames_test$price - final.predictions

final.rmse <- sqrt(mean(final.resids^2, na.rm = T))

paste("Final Model RMSE in Test Data: $", round(final.rmse, 2), sep = "")
## [1] "Final Model RMSE in Test Data: $20682.28"

Compared to the RMSE in our test model, this value is markedly improved. We observe a mean standard error of $01 in predicting price. This is not an insignificant level of error, but in the context of the data it is certainly acceptable.

paste("Basic Model RMSE in Test Data: $", round(basic.rmse, 2), sep = "")
## [1] "Basic Model RMSE in Test Data: $25404.73"
paste("Improvment in RMSE: $", round(basic.rmse - final.rmse, 2), sep = "")
## [1] "Improvment in RMSE: $4722.45"

In the context of our data, our final model predicts home prices $4,722.45 more accurately. * * *

2.4.3 Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model? Strengths - Good variety of regressors used in building the model.

final.step$call
## lm(formula = log(price) ~ MS.Zoning + Neighborhood + Bldg.Type + 
##     Overall.Qual + Overall.Cond + Year.Built + Exterior.1st + 
##     Exterior.2nd + Bsmt.Qual + BsmtFin.Type.1 + BsmtFin.SF.1 + 
##     Central.Air + Low.Qual.Fin.SF + Bsmt.Full.Bath + Kitchen.AbvGr + 
##     Kitchen.Qual + Garage.Type + Garage.Finish + Garage.Area + 
##     Garage.Qual + Garage.Cond + Paved.Drive + Open.Porch.SF + 
##     Screen.Porch + log(area) + log(Lot.Area) + log(X1st.Flr.SF), 
##     data = clean.data)

Weakness - We reduced the number of regressors used down to 20, which resulted in a decrease in predictive power

initial.call
## lm(formula = log(price) ~ MS.Zoning + Neighborhood + Condition.1 + 
##     Condition.2 + Bldg.Type + Overall.Qual + Overall.Cond + Year.Built + 
##     Year.Remod.Add + Exterior.1st + Exterior.2nd + Mas.Vnr.Type + 
##     Mas.Vnr.Area + Exter.Qual + Exter.Cond + Bsmt.Qual + Bsmt.Exposure + 
##     BsmtFin.Type.1 + BsmtFin.SF.1 + BsmtFin.SF.2 + Bsmt.Unf.SF + 
##     Heating + Heating.QC + Central.Air + X2nd.Flr.SF + Low.Qual.Fin.SF + 
##     Bsmt.Full.Bath + Kitchen.AbvGr + Kitchen.Qual + Functional + 
##     Fireplaces + Garage.Type + Garage.Finish + Garage.Cars + 
##     Garage.Area + Garage.Qual + Garage.Cond + Paved.Drive + Wood.Deck.SF + 
##     Open.Porch.SF + X3Ssn.Porch + Screen.Porch + Sale.Type + 
##     log(area) + log(Lot.Area) + log(X1st.Flr.SF), data = clean.data)
paste(round(initial.r2 * 100, 2), "% Explained Variance", sep = "")
## [1] "96.02% Explained Variance"


2.4.4 Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

First, check the number of observations in the validation set

paste(nrow(ames_validation), "observations")
## [1] "763 observations"

We need to create harmony between the test and validation set. Hence;

ames_validation <- ames_validation %>% 
        filter(MS.Zoning != "A (agr)") %>% 
        filter(Exterior.2nd != "CBlock" & Exterior.2nd != "PreCast" & Exterior.2nd != "Stone")

paste(nrow(ames_validation), "observations")
## [1] "757 observations"

2.4.5 Fitting the Model

ames_validation['predicted.values'] <- exp(predict(final.step, ames_validation))
ggplot(ames_validation, aes(x = price, y = predicted.values)) +
        geom_point(alpha = 0.45) +
        geom_smooth() +
        labs(x = "Actual Values", y = "Predicted Values")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

As expected, the model provides an excellent fit to this data set. - We observe a tight cluster of data around the regression line, with few data points landing far away from the group. - The slight curve of the regression line near the maximum limit of the x-axis suggests that the model accurately follows the trend of the data - in other words, it is not strictly linear and instead “bends” to fit the data.

RMSE

# Exponentiate the data
valid.predictions <- exp(predict(final.step, ames_validation))

valid.resids <- ames_validation$price - valid.predictions

valid.squares <- valid.resids^2

valid.rmse <- sqrt(mean(valid.squares, na.rm = T))

paste("Validation RMSE: $", round(valid.rmse, 2), sep = "")
## [1] "Validation RMSE: $18540.22"
  • The final model yields a much lower RMSE value.
  • We can interpret this as less anticipated error in fitting our model.
  • $18540.22 is tolerable in this case, especially given the variance in home prices in the validation data.
val.priceRange <- range(ames_validation$price)

paste("Price Range in Validation Data Set: $", val.priceRange[2] - val.priceRange[1], sep = "")
## [1] "Price Range in Validation Data Set: $549500"

In general, the final model seems a better and more accurate predictor of real estate prices than the test model.

Confidence Intervals

# Calculate 95% confidence interval for each observation
predicted.values.ci <- exp(predict(final.step, ames_validation, interval = "prediction", level = 0.95))

# Calculate proportion of homes with prices within CI
coverage.probability <- mean(ames_validation$price > predicted.values.ci[, "lwr"] &
                                     ames_validation$price < predicted.values.ci[, "upr"], na.rm = T) * 100

paste(round(coverage.probability, 2), "% Coverage Probability", sep = "")
## [1] "94.56% Coverage Probability"

We observe that there is less than a 6% chance that predicted values derived from this model fall outside of a 95% confidence interval.


2.5 Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


The final model offers a highly accurate means of predicting real estate prices in Ames, IA. Using a separate validation data set, we observe a coverage probability of 94.56%. Ames is observed to be a relatively diverse real estate market, marked by ample variability in price, square footage, and age, to name a few factors. Given this variability, the predictive efficacy of our model is that much more important.