Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now, we will load the training data set; the others will be loaded and used later.

load("ames_train.Rdata.gz")

Use the code block below to load any necessary packages.

library(statsr)
library(dplyr)
library(BAS)
library(ggplot2)
library(GGally)
library(magrittr)
library(MASS)

Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).


Filtering the data frame for only homes sold under normal conditions:

ames_train <- ames_train %>%
  filter(Sale.Condition == "Normal")

Visual Group 1

ordering_med <- order(as.numeric(by(ames_train$price, ames_train$Neighborhood, median)), decreasing = TRUE)
ames_train$Neighborhood <- ordered(ames_train$Neighborhood, levels=levels(ames_train$Neighborhood)[ordering_med]) 
boxplot(price ~ Neighborhood, data = ames_train, las = 2, ylab = "Price", col = "gold",main = "House price by neighborhood in Ames, Iowa")

This box plot shows the median, interquartile range, and spread of home prices by neighborhood. Ordering the neighborhoods by median price makes the plot much easier to read and compare, which is what makes it so informative.

Visual Group 2

Scatterplot of average home price by year

avg_price_year <- sapply(split(ames_train$price, ames_train$Year.Built), mean)
plot(avg_price_year, xaxt="n", main="Average Price of Homes in Ames by Year")
axis(1, at=seq(1,102,4), labels=sort(unique(ames_train$Year.Built))[seq(1,102,4)])

This plot gives us the average home price for each year built. One caveat: the x-axis labels omit some years. Though this violates a maxim of graph design, the graph is still informative.

Scatterplot of home price by year

ggplot(ames_train, aes(x=Year.Built, y=price)) +
  labs(title="Prices of Homes in Ames by Year") +
  geom_point()

This plot is similar to the one above but also conveys the number of homes built in each year and the spread of prices within each year.

Visual Group 3 (two visuals)

suppressWarnings(suppressMessages(ggpairs(ames_train, columns = c(3,2,18,20,21,22))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

suppressWarnings(suppressMessages(ggpairs(ames_train, columns=c(3,31,53,56,81))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

These matrices of plots let us inspect how multiple variables interact. The variables were chosen for their effect on price, as evidenced during EDA. I broke the ggpairs output into two matrices to make it easier to interpret.


Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


initial_model_adjr2 <- lm(log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF +
                            X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) +
                            Bsmt.Full.Bath + Condition.1 + Bldg.Type, data=ames_train)
summary(initial_model_adjr2)
## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF + 
##     X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) + 
##     Bsmt.Full.Bath + Condition.1 + Bldg.Type, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67235 -0.05374 -0.00048  0.06030  0.35242 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.374e+00  6.108e-01   2.249 0.024801 *  
## Overall.Qual      7.421e-02  4.814e-03  15.415  < 2e-16 ***
## Neighborhood.L   -2.912e-01  4.857e-02  -5.995 3.10e-09 ***
## Neighborhood.Q    4.558e-02  3.282e-02   1.389 0.165321    
## Neighborhood.C   -3.028e-02  3.012e-02  -1.005 0.315102    
## Neighborhood^4   -1.126e-01  3.084e-02  -3.651 0.000278 ***
## Neighborhood^5    1.797e-02  3.295e-02   0.545 0.585565    
## Neighborhood^6   -2.523e-02  3.343e-02  -0.755 0.450533    
## Neighborhood^7    3.128e-02  3.426e-02   0.913 0.361558    
## Neighborhood^8   -4.657e-02  3.114e-02  -1.495 0.135244    
## Neighborhood^9    6.342e-02  3.186e-02   1.991 0.046823 *  
## Neighborhood^10   8.027e-02  3.015e-02   2.662 0.007920 ** 
## Neighborhood^11  -1.727e-01  3.772e-02  -4.578 5.45e-06 ***
## Neighborhood^12   8.869e-02  4.423e-02   2.005 0.045283 *  
## Neighborhood^13  -1.898e-01  3.848e-02  -4.933 9.88e-07 ***
## Neighborhood^14   8.983e-02  3.568e-02   2.518 0.012012 *  
## Neighborhood^15  -1.294e-02  3.198e-02  -0.405 0.685890    
## Neighborhood^16   6.450e-02  3.150e-02   2.048 0.040939 *  
## Neighborhood^17  -5.001e-02  3.649e-02  -1.371 0.170845    
## Neighborhood^18   5.139e-02  3.324e-02   1.546 0.122531    
## Neighborhood^19   3.643e-02  3.366e-02   1.082 0.279494    
## Neighborhood^20  -3.077e-02  2.643e-02  -1.164 0.244697    
## Neighborhood^21  -2.186e-02  3.068e-02  -0.713 0.476234    
## Neighborhood^22  -1.096e-01  3.767e-02  -2.911 0.003709 ** 
## Neighborhood^23   3.227e-02  2.693e-02   1.198 0.231144    
## Neighborhood^24   1.796e-02  2.595e-02   0.692 0.489047    
## Neighborhood^25   6.338e-02  2.909e-02   2.179 0.029625 *  
## Neighborhood^26   8.219e-02  3.103e-02   2.649 0.008245 ** 
## X1st.Flr.SF       4.202e-04  1.594e-05  26.357  < 2e-16 ***
## X2nd.Flr.SF       2.912e-04  1.160e-05  25.094  < 2e-16 ***
## Overall.Cond      5.828e-02  3.729e-03  15.627  < 2e-16 ***
## Year.Built        4.219e-03  2.974e-04  14.186  < 2e-16 ***
## log(Lot.Area)     1.004e-01  1.240e-02   8.096 2.14e-15 ***
## Bsmt.Full.Bath    8.102e-02  7.592e-03  10.672  < 2e-16 ***
## Condition.1Feedr -1.619e-02  2.958e-02  -0.547 0.584239    
## Condition.1Norm   7.182e-02  2.504e-02   2.868 0.004243 ** 
## Condition.1PosA   4.222e-02  4.721e-02   0.894 0.371430    
## Condition.1PosN   7.540e-02  4.324e-02   1.744 0.081595 .  
## Condition.1RRAe  -3.880e-03  4.500e-02  -0.086 0.931321    
## Condition.1RRAn   2.166e-02  4.311e-02   0.503 0.615436    
## Condition.1RRNe   1.477e-02  7.705e-02   0.192 0.847979    
## Condition.1RRNn  -4.188e-02  7.752e-02  -0.540 0.589178    
## Bldg.Type2fmCon  -1.180e-02  2.637e-02  -0.448 0.654572    
## Bldg.TypeDuplex  -1.404e-01  2.154e-02  -6.517 1.28e-10 ***
## Bldg.TypeTwnhs   -6.290e-02  2.871e-02  -2.191 0.028775 *  
## Bldg.TypeTwnhsE  -2.094e-02  1.978e-02  -1.058 0.290156    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1021 on 788 degrees of freedom
## Multiple R-squared:  0.9324, Adjusted R-squared:  0.9285 
## F-statistic: 241.4 on 45 and 788 DF,  p-value: < 2.2e-16

Section 2.2 Model Selection

Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model, or do they disagree? What do you think this means?


First we’ll use Bayesian Model Averaging (BMA) to find which variables ought to be in the final model. This method uses Bayesian Adaptive Sampling, which samples models without replacement from the posterior distribution over models.
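
The BMA output itself is not reproduced here. A minimal sketch of how such a fit could be run with the BAS package, using the initial model’s formula; the prior and model-prior choices below are illustrative assumptions, not the only reasonable options:

bma_initial <- bas.lm(log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF +
                        X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) +
                        Bsmt.Full.Bath + Condition.1 + Bldg.Type,
                      data = ames_train,
                      prior = "ZS-null",       # Zellner-Siow null-based coefficient prior (assumed choice)
                      modelprior = uniform(),  # uniform prior over candidate models
                      method = "BAS")          # adaptive sampling without replacement
summary(bma_initial)  # posterior inclusion probabilities for each predictor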

As we can see, at least one level of each variable has a posterior inclusion probability of essentially 100%. Because some of the variables in this analysis are multi-level categorical factors, certain levels of a variable may not be found to be as significant as others.

Model selection using adjusted R2

The model below was built through forward selection using the adjusted R2 score. This process checks the adjusted R2 at each step and adds the variable with the highest score: at the first step, one fits a one-variable model for each candidate and keeps the best; at the second step, one combines that variable with each remaining candidate in turn and again keeps the best; and so on until no addition improves the score. A sketch of the procedure follows.
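
A minimal sketch of forward selection by adjusted R2, assuming the candidate list below; this loop is illustrative, not the exact code used to build the model:

# Candidate predictors identified during EDA (illustrative list)
candidates <- c("Overall.Qual", "Neighborhood", "X1st.Flr.SF", "X2nd.Flr.SF",
                "Overall.Cond", "Year.Built", "log(Lot.Area)",
                "Bsmt.Full.Bath", "Condition.1", "Bldg.Type")
selected <- character(0)
best_adjr2 <- -Inf
repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break
  # Adjusted R^2 for the current model plus each remaining candidate
  scores <- sapply(remaining, function(v) {
    f <- as.formula(paste("log(price) ~", paste(c(selected, v), collapse = " + ")))
    summary(lm(f, data = ames_train))$adj.r.squared
  })
  if (max(scores) <= best_adjr2) break  # stop once no addition improves the score
  best_adjr2 <- max(scores)
  selected <- c(selected, names(which.max(scores)))
}
selected  # variables chosen by forward selection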

initial_model_adjr2 <- lm(log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF +
                            X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) +
                            Bsmt.Full.Bath + Condition.1 + Bldg.Type, data=ames_train)
summary(initial_model_adjr2)
## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF + 
##     X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) + 
##     Bsmt.Full.Bath + Condition.1 + Bldg.Type, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.67235 -0.05374 -0.00048  0.06030  0.35242 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.374e+00  6.108e-01   2.249 0.024801 *  
## Overall.Qual      7.421e-02  4.814e-03  15.415  < 2e-16 ***
## Neighborhood.L   -2.912e-01  4.857e-02  -5.995 3.10e-09 ***
## Neighborhood.Q    4.558e-02  3.282e-02   1.389 0.165321    
## Neighborhood.C   -3.028e-02  3.012e-02  -1.005 0.315102    
## Neighborhood^4   -1.126e-01  3.084e-02  -3.651 0.000278 ***
## Neighborhood^5    1.797e-02  3.295e-02   0.545 0.585565    
## Neighborhood^6   -2.523e-02  3.343e-02  -0.755 0.450533    
## Neighborhood^7    3.128e-02  3.426e-02   0.913 0.361558    
## Neighborhood^8   -4.657e-02  3.114e-02  -1.495 0.135244    
## Neighborhood^9    6.342e-02  3.186e-02   1.991 0.046823 *  
## Neighborhood^10   8.027e-02  3.015e-02   2.662 0.007920 ** 
## Neighborhood^11  -1.727e-01  3.772e-02  -4.578 5.45e-06 ***
## Neighborhood^12   8.869e-02  4.423e-02   2.005 0.045283 *  
## Neighborhood^13  -1.898e-01  3.848e-02  -4.933 9.88e-07 ***
## Neighborhood^14   8.983e-02  3.568e-02   2.518 0.012012 *  
## Neighborhood^15  -1.294e-02  3.198e-02  -0.405 0.685890    
## Neighborhood^16   6.450e-02  3.150e-02   2.048 0.040939 *  
## Neighborhood^17  -5.001e-02  3.649e-02  -1.371 0.170845    
## Neighborhood^18   5.139e-02  3.324e-02   1.546 0.122531    
## Neighborhood^19   3.643e-02  3.366e-02   1.082 0.279494    
## Neighborhood^20  -3.077e-02  2.643e-02  -1.164 0.244697    
## Neighborhood^21  -2.186e-02  3.068e-02  -0.713 0.476234    
## Neighborhood^22  -1.096e-01  3.767e-02  -2.911 0.003709 ** 
## Neighborhood^23   3.227e-02  2.693e-02   1.198 0.231144    
## Neighborhood^24   1.796e-02  2.595e-02   0.692 0.489047    
## Neighborhood^25   6.338e-02  2.909e-02   2.179 0.029625 *  
## Neighborhood^26   8.219e-02  3.103e-02   2.649 0.008245 ** 
## X1st.Flr.SF       4.202e-04  1.594e-05  26.357  < 2e-16 ***
## X2nd.Flr.SF       2.912e-04  1.160e-05  25.094  < 2e-16 ***
## Overall.Cond      5.828e-02  3.729e-03  15.627  < 2e-16 ***
## Year.Built        4.219e-03  2.974e-04  14.186  < 2e-16 ***
## log(Lot.Area)     1.004e-01  1.240e-02   8.096 2.14e-15 ***
## Bsmt.Full.Bath    8.102e-02  7.592e-03  10.672  < 2e-16 ***
## Condition.1Feedr -1.619e-02  2.958e-02  -0.547 0.584239    
## Condition.1Norm   7.182e-02  2.504e-02   2.868 0.004243 ** 
## Condition.1PosA   4.222e-02  4.721e-02   0.894 0.371430    
## Condition.1PosN   7.540e-02  4.324e-02   1.744 0.081595 .  
## Condition.1RRAe  -3.880e-03  4.500e-02  -0.086 0.931321    
## Condition.1RRAn   2.166e-02  4.311e-02   0.503 0.615436    
## Condition.1RRNe   1.477e-02  7.705e-02   0.192 0.847979    
## Condition.1RRNn  -4.188e-02  7.752e-02  -0.540 0.589178    
## Bldg.Type2fmCon  -1.180e-02  2.637e-02  -0.448 0.654572    
## Bldg.TypeDuplex  -1.404e-01  2.154e-02  -6.517 1.28e-10 ***
## Bldg.TypeTwnhs   -6.290e-02  2.871e-02  -2.191 0.028775 *  
## Bldg.TypeTwnhsE  -2.094e-02  1.978e-02  -1.058 0.290156    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1021 on 788 degrees of freedom
## Multiple R-squared:  0.9324, Adjusted R-squared:  0.9285 
## F-statistic: 241.4 on 45 and 788 DF,  p-value: < 2.2e-16
  Variables used:
  • Overall.Qual
  • Neighborhood
  • X1st.Flr.SF
  • X2nd.Flr.SF
  • Overall.Cond
  • Year.Built
  • log(Lot.Area)
  • Bsmt.Full.Bath
  • Condition.1
  • Bldg.Type

Model selection using AIC score

Here we will use another method of model building, with the Akaike information criterion (AIC) as the metric of interest. Per Wikipedia, the AIC “offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. In doing so, it deals with the trade-off between the goodness of fit of the model and the complexity of the model.”
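
Formally, AIC = 2k − 2 ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood of the model; lower values indicate a better trade-off between fit and complexity.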

The goal of the stepwise AIC method is to minimize the AIC score. In the backward direction used here, each step drops the variable whose removal lowers the model’s AIC the most; if no removal produces a lower AIC than the current model, the current model is kept and the process stops.

stepAIC(initial_model_adjr2, direction="backward", trace=TRUE)
## Start:  AIC=-3761.09
## log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF + X2nd.Flr.SF + 
##     Overall.Cond + Year.Built + log(Lot.Area) + Bsmt.Full.Bath + 
##     Condition.1 + Bldg.Type
## 
##                  Df Sum of Sq     RSS     AIC
## <none>                         8.2171 -3761.1
## - Condition.1     8    0.4271  8.6443 -3734.8
## - Bldg.Type       4    0.4799  8.6970 -3721.8
## - log(Lot.Area)   1    0.6836  8.9007 -3696.4
## - Bsmt.Full.Bath  1    1.1877  9.4048 -3650.5
## - Neighborhood   26    1.8893 10.1064 -3640.5
## - Year.Built      1    2.0986 10.3158 -3573.4
## - Overall.Qual    1    2.4778 10.6950 -3543.3
## - Overall.Cond    1    2.5465 10.7636 -3537.9
## - X2nd.Flr.SF     1    6.5663 14.7835 -3273.3
## - X1st.Flr.SF     1    7.2444 15.4615 -3235.9
## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF + 
##     X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) + 
##     Bsmt.Full.Bath + Condition.1 + Bldg.Type, data = ames_train)
## 
## Coefficients:
##      (Intercept)      Overall.Qual    Neighborhood.L    Neighborhood.Q  
##        1.3735383         0.0742148        -0.2911943         0.0455773  
##   Neighborhood.C    Neighborhood^4    Neighborhood^5    Neighborhood^6  
##       -0.0302815        -0.1126002         0.0179737        -0.0252344  
##   Neighborhood^7    Neighborhood^8    Neighborhood^9   Neighborhood^10  
##        0.0312779        -0.0465685         0.0634244         0.0802664  
##  Neighborhood^11   Neighborhood^12   Neighborhood^13   Neighborhood^14  
##       -0.1727068         0.0886878        -0.1898106         0.0898339  
##  Neighborhood^15   Neighborhood^16   Neighborhood^17   Neighborhood^18  
##       -0.0129395         0.0645015        -0.0500129         0.0513899  
##  Neighborhood^19   Neighborhood^20   Neighborhood^21   Neighborhood^22  
##        0.0364303        -0.0307745        -0.0218634        -0.1096496  
##  Neighborhood^23   Neighborhood^24   Neighborhood^25   Neighborhood^26  
##        0.0322705         0.0179585         0.0633843         0.0821874  
##      X1st.Flr.SF       X2nd.Flr.SF      Overall.Cond        Year.Built  
##        0.0004202         0.0002912         0.0582798         0.0042194  
##    log(Lot.Area)    Bsmt.Full.Bath  Condition.1Feedr   Condition.1Norm  
##        0.1003969         0.0810199        -0.0161937         0.0718197  
##  Condition.1PosA   Condition.1PosN   Condition.1RRAe   Condition.1RRAn  
##        0.0422218         0.0753995        -0.0038797         0.0216620  
##  Condition.1RRNe   Condition.1RRNn   Bldg.Type2fmCon   Bldg.TypeDuplex  
##        0.0147742        -0.0418779        -0.0118010        -0.1404000  
##   Bldg.TypeTwnhs   Bldg.TypeTwnhsE  
##       -0.0628974        -0.0209419

We see from the above result that no single-variable deletion lowered the AIC; the original model has the lowest AIC of any of the candidates considered. By this metric, it is the optimal model.

  Variables used:
  • Overall.Qual
  • Neighborhood
  • X1st.Flr.SF
  • X2nd.Flr.SF
  • Overall.Cond
  • Year.Built
  • log(Lot.Area)
  • Bsmt.Full.Bath
  • Condition.1
  • Bldg.Type

The above three methods (forward selection using adjusted R2, stepwise AIC, and Bayesian Model Averaging) select the same model: all 10 variables from the original model are retained. This agreement across criteria suggests that each variable carries genuine explanatory power rather than being an artifact of any one selection method.


Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


ggplot(ames_train, aes(x=initial_model_adjr2$residuals)) + geom_histogram() + labs(title="Initial Model Residuals") + xlab("residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We see what resembles a normal distribution in the spread of the residuals, which satisfies the nearly-normal-residuals condition for linear regression.

plot(initial_model_adjr2$residuals, col="red", ylab="residuals", main="Residuals Scatterplot")
abline(h=0, lty=2)

We see a random scatter around 0 on the y-axis, which supports the linearity and constant-variability conditions.

qqnorm(initial_model_adjr2$residuals, col="red")
qqline(initial_model_adjr2$residuals)

We see a mostly linear relationship between theoretical and sample quantiles, which is what we would look for.


Section 2.4 Initial Model RMSE

For your initial model, calculate and briefly comment on the Root Mean Squared Error (RMSE). You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


initial_residuals <- residuals(initial_model_adjr2)
initial_rmse <- sd(initial_residuals)
initial_rmse
## [1] 0.09932028
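
This RMSE is on the log(price) scale, so it reads roughly as a proportional error of about 10% (for mean-zero residuals, sd() is essentially the RMSE, up to the n − 1 divisor). For a figure in the original units (dollars), a minimal sketch, assuming the model used every row of ames_train so the fitted values align with the observed prices:

pred_dollars <- exp(fitted(initial_model_adjr2))  # back-transform fitted log-prices to dollars
dollar_rmse <- sqrt(mean((ames_train$price - pred_dollars)^2))
dollar_rmse  # RMSE in dollars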

Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata.gz")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


Below is the code to find the RMSE of the model on the test data.

initial_testmodel_adjr2 <- lm(log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF +
                                X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) +
                                Bsmt.Full.Bath + Condition.1 + Bldg.Type, data=ames_test)
initial_test_residuals <- residuals(initial_testmodel_adjr2)
initial_test_rmse <- sd(initial_test_residuals)
initial_test_rmse
## [1] 0.1035496

We see that the RMSE is somewhat greater on the test data (0.1035496 vs. 0.09932028 on the training data), as we would expect. Since the difference is small, there is little evidence that the model is overfit; see the sketch below for a more direct out-of-sample check.
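
One caveat about the comparison above: the code refits the model on ames_test, so its RMSE measures in-sample fit on the test data rather than true out-of-sample error. A sketch of the more direct check, predicting test prices with the training-fit model (assuming ames_test contains no factor levels unseen in training, which would make predict() fail):

test_pred <- predict(initial_model_adjr2, newdata = ames_test)  # predictions on the log(price) scale
test_rmse <- sqrt(mean((log(ames_test$price) - test_pred)^2, na.rm = TRUE))
test_rmse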

We want to see if Bayesian Model Averaging produces the same model using the ames_test dataset. Let’s try that now.
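
A sketch of that check, reusing the illustrative bas.lm call from Section 2.2 on the test data:

bma_test <- bas.lm(log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF +
                     X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) +
                     Bsmt.Full.Bath + Condition.1 + Bldg.Type,
                   data = ames_test,
                   prior = "ZS-null", modelprior = uniform(), method = "BAS")
summary(bma_test)  # compare posterior inclusion probabilities with the training fit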

The results of running our model through Bayesian Model Averaging give us the same model.


Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

Section 3.1 Final Model

Provide the summary table for your model.


final_model <- lm(log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF + 
                    X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) + 
                    Bsmt.Full.Bath + Condition.1 + Bldg.Type + Garage.Area + 
                    Functional + Kitchen.Qual, data=ames_train)
summary(final_model)
## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF + 
##     X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) + 
##     Bsmt.Full.Bath + Condition.1 + Bldg.Type + Garage.Area + 
##     Functional + Kitchen.Qual, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70395 -0.05130  0.00145  0.05612  0.30457 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.644e+00  6.330e-01   4.176 3.30e-05 ***
## Overall.Qual      6.372e-02  4.834e-03  13.180  < 2e-16 ***
## Neighborhood.L   -2.892e-01  4.727e-02  -6.118 1.50e-09 ***
## Neighborhood.Q    1.595e-02  3.250e-02   0.491  0.62380    
## Neighborhood.C   -2.331e-02  2.902e-02  -0.803  0.42207    
## Neighborhood^4   -1.248e-01  3.013e-02  -4.142 3.82e-05 ***
## Neighborhood^5    4.100e-02  3.202e-02   1.280  0.20080    
## Neighborhood^6   -2.729e-02  3.240e-02  -0.842  0.39987    
## Neighborhood^7    4.328e-02  3.323e-02   1.302  0.19315    
## Neighborhood^8   -6.432e-02  3.027e-02  -2.125  0.03392 *  
## Neighborhood^9    4.811e-02  3.109e-02   1.548  0.12208    
## Neighborhood^10   7.519e-02  2.921e-02   2.574  0.01022 *  
## Neighborhood^11  -1.637e-01  3.648e-02  -4.487 8.31e-06 ***
## Neighborhood^12   1.273e-01  4.304e-02   2.957  0.00320 ** 
## Neighborhood^13  -1.786e-01  3.721e-02  -4.799 1.91e-06 ***
## Neighborhood^14   8.977e-02  3.432e-02   2.616  0.00908 ** 
## Neighborhood^15  -2.054e-02  3.105e-02  -0.662  0.50834    
## Neighborhood^16   7.107e-02  3.038e-02   2.339  0.01957 *  
## Neighborhood^17  -5.040e-02  3.532e-02  -1.427  0.15406    
## Neighborhood^18   4.757e-02  3.201e-02   1.486  0.13761    
## Neighborhood^19   2.584e-02  3.255e-02   0.794  0.42759    
## Neighborhood^20  -2.578e-02  2.548e-02  -1.012  0.31203    
## Neighborhood^21  -2.788e-02  2.962e-02  -0.941  0.34677    
## Neighborhood^22  -9.370e-02  3.648e-02  -2.569  0.01039 *  
## Neighborhood^23   5.116e-02  2.601e-02   1.967  0.04952 *  
## Neighborhood^24   2.136e-02  2.504e-02   0.853  0.39389    
## Neighborhood^25   5.207e-02  2.857e-02   1.822  0.06877 .  
## Neighborhood^26   7.162e-02  2.984e-02   2.400  0.01663 *  
## X1st.Flr.SF       4.089e-04  1.642e-05  24.894  < 2e-16 ***
## X2nd.Flr.SF       2.888e-04  1.158e-05  24.938  < 2e-16 ***
## Overall.Cond      5.411e-02  3.767e-03  14.366  < 2e-16 ***
## Year.Built        3.613e-03  3.046e-04  11.861  < 2e-16 ***
## log(Lot.Area)     9.425e-02  1.220e-02   7.723 3.49e-14 ***
## Bsmt.Full.Bath    7.516e-02  7.334e-03  10.248  < 2e-16 ***
## Condition.1Feedr -2.393e-02  2.866e-02  -0.835  0.40395    
## Condition.1Norm   6.490e-02  2.434e-02   2.666  0.00783 ** 
## Condition.1PosA   4.669e-02  4.565e-02   1.023  0.30674    
## Condition.1PosN   7.038e-02  4.164e-02   1.690  0.09136 .  
## Condition.1RRAe  -9.503e-03  4.390e-02  -0.216  0.82869    
## Condition.1RRAn   1.485e-02  4.167e-02   0.356  0.72161    
## Condition.1RRNe   2.119e-03  7.411e-02   0.029  0.97720    
## Condition.1RRNn  -5.330e-02  7.461e-02  -0.714  0.47517    
## Bldg.Type2fmCon  -1.215e-02  2.569e-02  -0.473  0.63644    
## Bldg.TypeDuplex  -1.366e-01  2.102e-02  -6.500 1.43e-10 ***
## Bldg.TypeTwnhs   -3.394e-02  2.833e-02  -1.198  0.23122    
## Bldg.TypeTwnhsE  -7.674e-03  1.935e-02  -0.397  0.69176    
## Garage.Area       1.222e-04  2.361e-05   5.175 2.91e-07 ***
## FunctionalMaj2   -8.363e-02  8.617e-02  -0.971  0.33208    
## FunctionalMin1    6.341e-02  5.692e-02   1.114  0.26566    
## FunctionalMin2    5.571e-02  5.541e-02   1.005  0.31498    
## FunctionalMod    -5.768e-03  5.813e-02  -0.099  0.92099    
## FunctionalTyp     1.049e-01  5.157e-02   2.035  0.04222 *  
## Kitchen.QualFa   -1.281e-01  3.313e-02  -3.867  0.00012 ***
## Kitchen.QualGd   -5.621e-02  1.988e-02  -2.827  0.00482 ** 
## Kitchen.QualPo    1.219e-01  1.121e-01   1.088  0.27705    
## Kitchen.QualTA   -8.663e-02  2.187e-02  -3.961 8.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09798 on 778 degrees of freedom
## Multiple R-squared:  0.9385, Adjusted R-squared:  0.9342 
## F-statistic: 215.9 on 55 and 778 DF,  p-value: < 2.2e-16

Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


I transformed the variables price and Lot.Area.

The distributions of these variables before the transformation were skewed, but the distributions of their log transformations were much closer to normal, which makes them much better behaved in a linear model.

ggplot(ames_train, aes(x=price)) + geom_histogram() + labs(title="Histogram of Prices")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(ames_train, aes(x=log(price))) + geom_histogram() + labs(title="Histogram of Prices, Log-Transformed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(ames_train, aes(x=Lot.Area)) + geom_histogram() + labs(title="Histogram of Lot Areas")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(ames_train, aes(x=log(Lot.Area))) + geom_histogram() + labs(title="Histogram of Lot Areas, Log-Transformed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that the distributions of the log-transformations are much closer to a normal distribution.


Section 3.3 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


I built my model using forward selection with the adjusted R2 score as the criterion (the procedure sketched in Section 2.2). I felt that by using this metric, I could build an informative model in a reliable way. With adjusted R2, finding the best variable to add to the model at each step is straightforward. Using the p-value in either forward or backward selection creates much more grey area: a certain level of a multi-level categorical variable may be very significant, while its other levels are less significant or even very insignificant.

Through the course of my EDA I found that certain variables would not be informative to the model.

For instance, I took a look at which variables had a large number of missing values.

sort(sapply(ames_train, function(x) sum(is.na(x))), decreasing=TRUE)
##         Pool.QC    Misc.Feature           Alley           Fence    Fireplace.Qu 
##             831             806             781             655             414 
##    Lot.Frontage   Garage.Yr.Blt     Garage.Qual     Garage.Cond     Garage.Type 
##             156              36              36              36              35 
##   Garage.Finish       Bsmt.Qual       Bsmt.Cond   Bsmt.Exposure  BsmtFin.Type.1 
##              35              20              20              20              20 
##  BsmtFin.Type.2    Mas.Vnr.Area             PID            area           price 
##              20               4               0               0               0 
##     MS.SubClass       MS.Zoning        Lot.Area          Street       Lot.Shape 
##               0               0               0               0               0 
##    Land.Contour       Utilities      Lot.Config      Land.Slope    Neighborhood 
##               0               0               0               0               0 
##     Condition.1     Condition.2       Bldg.Type     House.Style    Overall.Qual 
##               0               0               0               0               0 
##    Overall.Cond      Year.Built  Year.Remod.Add      Roof.Style       Roof.Matl 
##               0               0               0               0               0 
##    Exterior.1st    Exterior.2nd    Mas.Vnr.Type      Exter.Qual      Exter.Cond 
##               0               0               0               0               0 
##      Foundation    BsmtFin.SF.1    BsmtFin.SF.2     Bsmt.Unf.SF   Total.Bsmt.SF 
##               0               0               0               0               0 
##         Heating      Heating.QC     Central.Air      Electrical     X1st.Flr.SF 
##               0               0               0               0               0 
##     X2nd.Flr.SF Low.Qual.Fin.SF  Bsmt.Full.Bath  Bsmt.Half.Bath       Full.Bath 
##               0               0               0               0               0 
##       Half.Bath   Bedroom.AbvGr   Kitchen.AbvGr    Kitchen.Qual   TotRms.AbvGrd 
##               0               0               0               0               0 
##      Functional      Fireplaces     Garage.Cars     Garage.Area     Paved.Drive 
##               0               0               0               0               0 
##    Wood.Deck.SF   Open.Porch.SF  Enclosed.Porch     X3Ssn.Porch    Screen.Porch 
##               0               0               0               0               0 
##       Pool.Area        Misc.Val         Mo.Sold         Yr.Sold       Sale.Type 
##               0               0               0               0               0 
##  Sale.Condition 
##               0

Off the bat, we can disqualify Pool.QC, Misc.Feature, Alley, Fence, and Fireplace.Qu because of their very high proportion of missing values. These variables will not be useful in our model.

The variable Utilities would also not prove insightful:

levels(ames_train$Utilities)
## [1] "AllPub" "NoSeWa" "NoSewr"
#[1] "AllPub" "NoSeWa" "NoSewr"
length(which(ames_train$Utilities=="AllPub"))
## [1] 834

The variable has three distinct levels, but every observation in the training data takes the value AllPub.

Street, too, would not be useful to the model.

length(which(ames_train$Street=="Pave"))
## [1] 832
length(which(ames_train$Street=="Grvl"))
## [1] 2

As virtually all of the streets are paved, this variable would not be very informative.

The non-missing values of the variable Roof.Matl overwhelmingly fall under just one of its 8 levels, CompShg. This lack of variety and lopsidedness would not be useful to the model.

summary(ames_train$Roof.Matl)
## ClyTile CompShg Membran   Metal    Roll Tar&Grv WdShake WdShngl 
##       0     822       0       1       0       8       1       2

Section 3.4 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


I ran the Bayesian Model Averaging algorithm on the same model with the testing data and found that, of the 20 variables in the candidate model, five were deemed not worthy of inclusion. The five that did not significantly inform the model were Exterior.1st, Exter.Qual, Lot.Frontage, Heating.QC, and Year.Remod.Add.


Part 4 Final Model Assessment

Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.


ggplot(na.omit(ames_train), aes(x=final_model$residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot(final_model$residuals, col="red", main="Scatterplot of Residuals")
abline(h=0, lty=2)

qqnorm(final_model$residuals, col="red")
qqline(final_model$residuals)

As with the initial model, these three plots check the regression conditions: the histogram for nearly normal residuals centered at zero, the scatterplot for random scatter around zero with constant variability, and the Q-Q plot for near-linearity between sample and theoretical quantiles.


Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


final_model_residuals <- residuals(final_model)
final_model_rmse <- sd(final_model_residuals)
final_model_rmse
## [1] 0.09469104

The RMSE of the final model is somewhat lower than that of the initial model: 0.09469104 vs. 0.09932028 (both on the log(price) scale), an improvement of about 0.0046.


Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


In retrospect, I am sure I could have benefited from taking a closer look at variable interactions. While I was building this model, I felt confident that methods considering only individual variables would produce a strong enough model. The final score was decent, but I believe it could have been higher had I investigated variable interactions more closely.


Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings; be sure to mention:

* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata.gz")

final_model_validation <- lm(log(price) ~ Overall.Qual + Neighborhood + X1st.Flr.SF + 
                    X2nd.Flr.SF + Overall.Cond + Year.Built + log(Lot.Area) + 
                    Bsmt.Full.Bath + Condition.1 + Bldg.Type + MS.Zoning +
                    Garage.Area + Functional + Kitchen.Qual + Heating, data=ames_validation)
final_model_validation_residuals <- residuals(final_model_validation)
final_model_validation_rmse <- sd(final_model_validation_residuals)
final_model_validation_rmse
## [1] 0.09174737

This validation RMSE (0.09174737) is lower than both the final model’s training RMSE (0.09469104) and the initial model’s (0.09932028).
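
As with the test set, the code above refits the model on ames_validation (and adds MS.Zoning and Heating to the formula), so the RMSE it reports is again an in-sample quantity. A sketch of the more direct check, predicting validation prices with the final model fit on the training data (again assuming no unseen factor levels):

val_pred <- predict(final_model, newdata = ames_validation)  # predictions on the log(price) scale
val_rmse <- sqrt(mean((log(ames_validation$price) - val_pred)^2, na.rm = TRUE))
val_rmse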

predictions <- as.data.frame(predict(final_model, ames_validation, interval = "prediction", level = 0.95))
predictions$log_validation.prices <- log(ames_validation$price)
predictions <- predictions %>%
  mutate(in_range = ifelse(log_validation.prices > lwr & log_validation.prices < upr, "yes", "no"))
length(which(predictions$in_range=="yes"))/nrow(predictions)
## [1] 0.948886

We see that 94.89% of the true sale prices in the validation set fall within their 95% prediction intervals, which suggests the model reflects uncertainty appropriately.


Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


The model performed well when judged by interval coverage: 94.89% of the true prices in the validation set fell within the 95% prediction intervals.

While building the model, I was sometimes surprised by which variables were informative and which were not. For instance, I assumed going in that central air conditioning would be an important factor; my experience as a Houston, Texas native may have colored that assumption. The weather in that region is very hot and humid for many months of the year, and AC is a necessity. The average high temperatures in Ames, Iowa are only 84 degrees F in July and 83 degrees F in August, quite tolerable compared with the Gulf region.

Another variable I found interesting was the number of basement full baths (Bsmt.Full.Bath), which made it into the model while the number of basement half baths proved quite insignificant.

Finally, as noted in Section 4.3, I suspect a closer look at variable interactions would have improved the model further; the selection methods I used considered variables only individually.