1 Determinants of Housing Prices in Ames, Iowa

1.1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

1.2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.

load("ames_train.Rdata")
library(statsr)
library(dplyr)
library(BAS)
library(ggplot2)
library(MASS)
library(gridExtra)

1.2.1 Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).

1.2.1.1 Price and Area


In the first histogram we see that the variable we are trying to predict in this project, the response variable price, is right-skewed: most homes in the training data set sold in the lower price range, with few homes selling at much higher prices. The size variable area, a predictor of price, is also right-skewed, with most homes between 1,000 and 2,000 square feet and a few above 3,000 square feet. When these two variables are plotted against each other, we see a mostly linear relationship with increasing variability as the homes get larger. When both variables are log-transformed, however, the linear relationship improves considerably. In the rest of this analysis, price and area will therefore be log-transformed, since both are right-skewed and strictly positive.

summary(ames_train$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129762  159467  181190  213000  615000
summary(ames_train$area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1092    1411    1477    1743    4676
ggplot(data = ames_train, aes(x = price)) +
  geom_histogram(bins = 25) +
  geom_vline(xintercept = mean(ames_train$price), linetype = "dashed", color = "purple") +
  geom_label(label = "Mean Price", x = mean(ames_train$price) + 30000, y = 100, label.padding = unit(0.3, "lines"), label.size = .35, color = "purple") +
  geom_vline(xintercept = median(ames_train$price), color = "light blue") +
  geom_label(label = "Median Price", x = median(ames_train$price) - 40000, y = 150, label.padding = unit(0.3, "lines"), label.size = .35, color = "light blue")

ggplot(data = ames_train, aes(x = area)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = mean(ames_train$area), linetype = "dashed", color = "purple") +
  geom_label(label = "Mean Area", x = mean(ames_train$area) + 300, y = 75, label.padding = unit(0.3, "lines"), label.size = .35, color = "purple") +
  geom_vline(xintercept = median(ames_train$area), color = "light blue") +
  geom_label(label = "Median Area", x = median(ames_train$area) - 350, y = 55, label.padding = unit(0.3, "lines"), label.size = .35, color = "light blue")

areaplot <- ggplot(ames_train, aes(x = area, y = price)) + geom_point() + geom_smooth(method = "lm")
logareaplot <- ggplot(ames_train, aes(x = log(area), y = log(price))) + geom_point() + geom_smooth(method = "lm")
grid.arrange(areaplot, logareaplot, nrow = 1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'


1.2.1.2 Sale Conditions


In a linear model, we assume that all observations in the data are generated from the same process. However, houses sold under abnormal sale conditions do not behave like houses sold under normal conditions. A house sold under abnormal conditions often sells for much less than expected given its square footage. Conversely, a partial sale tends to bring a higher price on average, holding square footage constant; one partial sale has a total square footage of over 4,000, which is highly influential on the regression results. Since the relationship between log(price) and log(area) varies by sale condition, a linear regression model should be fitted separately for different sale conditions. Because houses with non-normal selling conditions exhibit atypical behavior and can disproportionately influence the model, the rest of this analysis will model housing prices under normal sale conditions only, which also account for most of the data, comprising 83.4% of the original observations.

ggplot(data = ames_train, aes(x = log(area), y = log(price), col = Sale.Condition)) + geom_jitter() + stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

normal <- ames_train %>% filter(Sale.Condition == 'Normal')
print(paste('The percentage of the original observations remaining is', nrow(normal)/nrow(ames_train)*100))
## [1] "The percentage of the original observations remaining is 83.4"

1.2.1.3 Neighborhoods

The mantra in real estate is “Location, Location, Location!” Below is a graphical display that relates home price to neighborhood in Ames, Iowa. The median price is the most relevant summary statistic for identifying the most and least expensive neighborhoods, while the IQR is the most relevant for identifying the most heterogeneous neighborhood in terms of price. However, Neighborhood was not very useful when included in the model directly: it has 27 levels, and many of them overlap one another, as seen in the first boxplot. Using the quartiles of price overall, the Neighborhood variable was therefore re-leveled into a new variable, NeighborhoodGrouped, with 4 levels, which narrows the spread of each level so that there is less overlap between levels. (A reusable version of this grouping is sketched after the grouped boxplot below.) * * *

normal$Neighborhood = with(normal, reorder(Neighborhood, log(price), median))
neighborhoods <- ggplot(normal, aes(x = log(price), y = Neighborhood, fill = Neighborhood)) + geom_boxplot()
neighborhoods

pricesum <- summary(normal$price)
qt1 <- pricesum["1st Qu."]
med <- pricesum["Median"]
qt3 <- pricesum["3rd Qu."]
normal %>% group_by(Neighborhood) %>% summarise(median= median(price), .groups = 'drop') %>% filter(median <= qt1)
## # A tibble: 7 x 2
##   Neighborhood median
##   <fct>         <dbl>
## 1 MeadowV       85750
## 2 BrDale       100500
## 3 IDOTRR       103000
## 4 OldTown      120000
## 5 Blueste      123900
## 6 BrkSide      125250
## 7 Edwards      125400
normal %>% group_by(Neighborhood) %>% summarise(median= median(price), .groups = 'drop') %>% filter(median > qt1 & median <= med)
## # A tibble: 4 x 2
##   Neighborhood median
##   <fct>         <dbl>
## 1 SWISU        129500
## 2 Sawyer       135000
## 3 NAmes        140000
## 4 NPkVill      142100
normal %>% group_by(Neighborhood) %>% summarise(median= median(price), .groups = 'drop') %>% filter(median > med & median <= qt3)
## # A tibble: 8 x 2
##   Neighborhood median
##   <fct>         <dbl>
## 1 Mitchel      160000
## 2 SawyerW      181000
## 3 Gilbert      184000
## 4 ClearCr      187500
## 5 NWAmes       190000
## 6 Blmngtn      192000
## 7 CollgCr      195000
## 8 Crawfor      198000
normal %>% group_by(Neighborhood) %>% summarise(median= median(price), .groups = 'drop') %>% filter(median > qt3)
## # A tibble: 8 x 2
##   Neighborhood median
##   <fct>         <dbl>
## 1 Greens       212625
## 2 Veenker      217500
## 3 Somerst      220000
## 4 Timber       232500
## 5 StoneBr      255500
## 6 GrnHill      280000
## 7 NoRidge      290000
## 8 NridgHt      305500
normal <- normal %>% mutate(NeighborhoodGrouped = Neighborhood)
levels(normal$NeighborhoodGrouped)[levels(normal$NeighborhoodGrouped) %in% c("MeadowV", "BrDale", "IDOTRR", "OldTown", "Blueste", "BrkSide", "Edwards")] <- "Group1"
levels(normal$NeighborhoodGrouped)[levels(normal$NeighborhoodGrouped) %in% c("SWISU", "Sawyer", "NAmes", "NPkVill")] <- "Group2"
levels(normal$NeighborhoodGrouped)[levels(normal$NeighborhoodGrouped) %in% c("Mitchel", "SawyerW", "Gilbert", "ClearCr", "NWAmes", "Blmngtn", "CollgCr", "Crawfor")] <- "Group3"
levels(normal$NeighborhoodGrouped)[levels(normal$NeighborhoodGrouped) %in% c("Greens", "Veenker", "Somerst", "Timber", "StoneBr", "GrnHill", "NoRidge", "NridgHt")] <- "Group4"
group_neighborhood_price <- with(normal, reorder(NeighborhoodGrouped, log(price), median))
groupedneighborhoods <- ggplot(normal, aes(x = log(price), y = group_neighborhood_price, fill = group_neighborhood_price)) + geom_boxplot()
groupedneighborhoods
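The same four-group mapping must be applied to the test and validation sets later in the analysis. As a convenience, the releveling could be wrapped in a small helper function; the sketch below is a hypothetical refactoring, not part of the original code, using the group assignments derived above and folding in Landmrk, a neighborhood that appears only outside the training data and is treated as Group2 later on.

# Sketch of a reusable releveling helper using the quartile-based group
# assignments derived from the training data. Landmrk (absent from the
# training set) is mapped to Group2, matching its treatment later on.
group_neighborhoods <- function(df) {
  groups <- list(
    Group1 = c("MeadowV", "BrDale", "IDOTRR", "OldTown", "Blueste", "BrkSide", "Edwards"),
    Group2 = c("SWISU", "Sawyer", "NAmes", "NPkVill", "Landmrk"),
    Group3 = c("Mitchel", "SawyerW", "Gilbert", "ClearCr", "NWAmes", "Blmngtn", "CollgCr", "Crawfor"),
    Group4 = c("Greens", "Veenker", "Somerst", "Timber", "StoneBr", "GrnHill", "NoRidge", "NridgHt")
  )
  df$NeighborhoodGrouped <- df$Neighborhood
  for (g in names(groups)) {
    levels(df$NeighborhoodGrouped)[levels(df$NeighborhoodGrouped) %in% groups[[g]]] <- g
  }
  df
}
# Usage: normal <- group_neighborhoods(normal)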


Development and assessment of an initial model, following a semi-guided process of analysis

1.2.1.4 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


For the initial model, I chose variables that I expected to contribute to price: location (NeighborhoodGrouped), size (log(area), log(Lot.Area)), rooms and amenities (Garage.Cars, Full.Bath, Half.Bath, Bedroom.AbvGr), age (Year.Built), and quality (Overall.Qual, Central.Air). The variables price, area, and Lot.Area are log-transformed, and Neighborhood is re-leveled to NeighborhoodGrouped. The fitted initial linear regression model is: \[\begin{aligned} \log(price) ={} & 2.529 + 0.490\,\log(area) + 0.091\,Overall.Qual + 0.039\,Garage.Cars \\ & - 0.031\,Full.Bath - 0.035\,Half.Bath - 0.030\,Bedroom.AbvGr + 0.002\,Year.Built \\ & + 0.122\,\log(Lot.Area) + 0.188\,Central.AirY + 0.070\,NeighborhoodGroupedGroup2 \\ & + 0.077\,NeighborhoodGroupedGroup3 + 0.177\,NeighborhoodGroupedGroup4 \end{aligned}\]

Because both price and area are log-transformed, the coefficient on log(area) means that a 1% increase in area is associated with roughly a 0.490% increase in predicted price, holding everything else constant; likewise, a 1% increase in Lot.Area raises the predicted price by about 0.122%. The remaining coefficients act on log(price), so each is approximately a proportional change in price, via exp(coefficient) − 1 (the snippet after the model summary below applies this conversion to all coefficients). Relative to the reference level Group 1, homes in NeighborhoodGrouped Groups 2, 3, and 4 are predicted to sell for roughly 7%, 8%, and 19% more, respectively. Each additional year of Year.Built adds only about 0.2% to the predicted price, while a one-point increase in Overall.Qual adds about 9.5% and having Central.Air adds about 21%. Among the variables that enumerate specific features of a house, increasing Bedroom.AbvGr, Half.Bath, or Full.Bath is associated with a lower predicted price. Larger houses on average have more bedrooms and bathrooms and sell for more, but with size held constant, additional bedrooms and bathrooms are associated with lower property values. This model can be used to predict home price, but it is not appropriate for testing hypotheses about causal relationships among these variables.

i_model <- lm(log(price) ~ log(area) + Overall.Qual +  Garage.Cars + Full.Bath + Half.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) +  Central.Air + NeighborhoodGrouped, data = normal)
summary(i_model)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Garage.Cars + 
##     Full.Bath + Half.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + NeighborhoodGrouped, data = normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83142 -0.07087  0.00467  0.07418  0.61810 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.528774   0.609275   4.150 3.66e-05 ***
## log(area)                  0.489732   0.028455  17.211  < 2e-16 ***
## Overall.Qual               0.090792   0.005703  15.921  < 2e-16 ***
## Garage.Cars                0.039351   0.008471   4.646 3.95e-06 ***
## Full.Bath                 -0.030561   0.013629  -2.242 0.025202 *  
## Half.Bath                 -0.034653   0.011106  -3.120 0.001870 ** 
## Bedroom.AbvGr             -0.030281   0.007890  -3.838 0.000134 ***
## Year.Built                 0.002079   0.000273   7.615 7.24e-14 ***
## log(Lot.Area)              0.122484   0.010419  11.756  < 2e-16 ***
## Central.AirY               0.188363   0.021824   8.631  < 2e-16 ***
## NeighborhoodGroupedGroup2  0.069989   0.014752   4.745 2.46e-06 ***
## NeighborhoodGroupedGroup3  0.077277   0.018038   4.284 2.05e-05 ***
## NeighborhoodGroupedGroup4  0.176752   0.023402   7.553 1.14e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1287 on 821 degrees of freedom
## Multiple R-squared:  0.888,  Adjusted R-squared:  0.8864 
## F-statistic: 542.4 on 12 and 821 DF,  p-value: < 2.2e-16
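Because the response is log-transformed, the raw coefficients above are easiest to read once converted to approximate percentage effects, as in the interpretation above. A quick sketch of that conversion, applied to the i_model fit just shown:

# Approximate percentage effect on price of a one-unit change in each
# predictor (or of the indicator switching on), all else held constant,
# using the (exp(b) - 1) * 100 rule.
round((exp(coef(i_model)) - 1) * 100, 1)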

1.2.1.5 Model Selection

Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


To refine the model, I started with my initial model (i_model) and used two selection methods to arrive at a single parsimonious model. The function stepAIC, using BIC (the Bayesian Information Criterion, obtained by setting k = log(n)) as the criterion, returned the same linear model (imodel_BIC) minus the bathroom variables Full.Bath and Half.Bath as the best model. The Bayesian linear model (imodel_bas), fit with bas.lm and summarized by Bayesian Model Averaging (BMA), identified the same eight-predictor model (again dropping Full.Bath and Half.Bath) as the highest-posterior-probability model, though the BMA-averaged coefficients differ slightly (the snippet after the coefficient summary below confirms this agreement programmatically). These two variables were likely removed because they are correlated with other predictors, such as the number of bedrooms in a house.

imodel_BIC <- stepAIC(i_model, k = log(nrow(normal)))
## Start:  AIC=-3344.97
## log(price) ~ log(area) + Overall.Qual + Garage.Cars + Full.Bath + 
##     Half.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + NeighborhoodGrouped
## 
##                       Df Sum of Sq    RSS     AIC
## - Full.Bath            1    0.0833 13.691 -3346.6
## <none>                             13.608 -3345.0
## - Half.Bath            1    0.1614 13.769 -3341.9
## - Bedroom.AbvGr        1    0.2441 13.852 -3336.9
## - Garage.Cars          1    0.3577 13.965 -3330.1
## - NeighborhoodGrouped  3    1.1209 14.729 -3299.1
## - Year.Built           1    0.9612 14.569 -3294.8
## - Central.Air          1    1.2347 14.842 -3279.3
## - log(Lot.Area)        1    2.2905 15.898 -3222.0
## - Overall.Qual         1    4.2012 17.809 -3127.3
## - log(area)            1    4.9094 18.517 -3094.8
## 
## Step:  AIC=-3346.61
## log(price) ~ log(area) + Overall.Qual + Garage.Cars + Half.Bath + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     NeighborhoodGrouped
## 
##                       Df Sum of Sq    RSS     AIC
## - Half.Bath            1    0.1057 13.797 -3346.9
## <none>                             13.691 -3346.6
## - Bedroom.AbvGr        1    0.3222 14.013 -3333.9
## - Garage.Cars          1    0.3435 14.034 -3332.7
## - Year.Built           1    0.8780 14.569 -3301.5
## - NeighborhoodGrouped  3    1.1438 14.835 -3299.9
## - Central.Air          1    1.3530 15.044 -3274.7
## - log(Lot.Area)        1    2.4623 16.153 -3215.4
## - Overall.Qual         1    4.2419 17.933 -3128.2
## - log(area)            1    5.3467 19.038 -3078.4
## 
## Step:  AIC=-3346.92
## log(price) ~ log(area) + Overall.Qual + Garage.Cars + Bedroom.AbvGr + 
##     Year.Built + log(Lot.Area) + Central.Air + NeighborhoodGrouped
## 
##                       Df Sum of Sq    RSS     AIC
## <none>                             13.797 -3346.9
## - Bedroom.AbvGr        1    0.3365 14.133 -3333.5
## - Garage.Cars          1    0.3614 14.158 -3332.1
## - Year.Built           1    0.8125 14.609 -3305.9
## - NeighborhoodGrouped  3    1.1741 14.971 -3299.0
## - Central.Air          1    1.3217 15.118 -3277.3
## - log(Lot.Area)        1    2.7185 16.515 -3203.6
## - Overall.Qual         1    4.3626 18.159 -3124.5
## - log(area)            1    5.4124 19.209 -3077.6
summary(imodel_BIC)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Garage.Cars + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     NeighborhoodGrouped, data = normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81147 -0.07578  0.00434  0.07664  0.58971 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.2581704  0.5668561   5.748 1.27e-08 ***
## log(area)                  0.4423930  0.0246207  17.968  < 2e-16 ***
## Overall.Qual               0.0922315  0.0057174  16.132  < 2e-16 ***
## Garage.Cars                0.0394760  0.0085023   4.643 4.00e-06 ***
## Bedroom.AbvGr             -0.0347306  0.0077517  -4.480 8.51e-06 ***
## Year.Built                 0.0018177  0.0002611   6.962 6.87e-12 ***
## log(Lot.Area)              0.1301937  0.0102238  12.734  < 2e-16 ***
## Central.AirY               0.1927620  0.0217093   8.879  < 2e-16 ***
## NeighborhoodGroupedGroup2  0.0741760  0.0147042   5.045 5.60e-07 ***
## NeighborhoodGroupedGroup3  0.0706354  0.0180062   3.923 9.48e-05 ***
## NeighborhoodGroupedGroup4  0.1731179  0.0234618   7.379 3.91e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1295 on 823 degrees of freedom
## Multiple R-squared:  0.8864, Adjusted R-squared:  0.8851 
## F-statistic: 642.4 on 10 and 823 DF,  p-value: < 2.2e-16
imodel_bas <- bas.lm(log(price) ~ log(area) + Overall.Qual +  Garage.Cars + Full.Bath + Half.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) +  Central.Air + NeighborhoodGrouped, data = normal, prior = "BIC", modelprior = uniform())
summary(imodel_bas)
##                           P(B != 0 | Y)    model 1       model 2       model 3
## Intercept                     1.0000000     1.0000     1.0000000     1.0000000
## log(area)                     1.0000000     1.0000     1.0000000     1.0000000
## Overall.Qual                  1.0000000     1.0000     1.0000000     1.0000000
## Garage.Cars                   0.9992891     1.0000     1.0000000     1.0000000
## Full.Bath                     0.1992516     0.0000     0.0000000     1.0000000
## Half.Bath                     0.5329967     0.0000     1.0000000     1.0000000
## Bedroom.AbvGr                 0.9958146     1.0000     1.0000000     1.0000000
## Year.Built                    1.0000000     1.0000     1.0000000     1.0000000
## log(Lot.Area)                 1.0000000     1.0000     1.0000000     1.0000000
## Central.AirY                  1.0000000     1.0000     1.0000000     1.0000000
## NeighborhoodGroupedGroup2     0.9982808     1.0000     1.0000000     1.0000000
## NeighborhoodGroupedGroup3     0.9894356     1.0000     1.0000000     1.0000000
## NeighborhoodGroupedGroup4     1.0000000     1.0000     1.0000000     1.0000000
## BF                                   NA     1.0000     0.8557881     0.3780827
## PostProbs                            NA     0.4255     0.3641000     0.1609000
## R2                                   NA     0.8864     0.8873000     0.8880000
## dim                                  NA    11.0000    12.0000000    13.0000000
## logmarg                              NA -1131.3810 -1131.5367253 -1132.3536351
##                                 model 4       model 5
## Intercept                  1.000000e+00     1.0000000
## log(area)                  1.000000e+00     1.0000000
## Overall.Qual               1.000000e+00     1.0000000
## Garage.Cars                1.000000e+00     1.0000000
## Full.Bath                  1.000000e+00     0.0000000
## Half.Bath                  0.000000e+00     0.0000000
## Bedroom.AbvGr              1.000000e+00     1.0000000
## Year.Built                 1.000000e+00     1.0000000
## log(Lot.Area)              1.000000e+00     1.0000000
## Central.AirY               1.000000e+00     1.0000000
## NeighborhoodGroupedGroup2  1.000000e+00     1.0000000
## NeighborhoodGroupedGroup3  1.000000e+00     0.0000000
## NeighborhoodGroupedGroup4  1.000000e+00     1.0000000
## BF                         8.000118e-02     0.0127523
## PostProbs                  3.400000e-02     0.0054000
## R2                         8.867000e-01     0.8843000
## dim                        1.200000e+01    10.0000000
## logmarg                   -1.133907e+03 -1135.7430366
coef(imodel_bas)
## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  4096 models 
##                            post mean  post SD    post p(B != 0)
## Intercept                  11.995840   0.004474   1.000000     
## log(area)                   0.457832   0.030796   1.000000     
## Overall.Qual                0.091655   0.005756   1.000000     
## Garage.Cars                 0.039120   0.008568   0.999289     
## Full.Bath                  -0.005638   0.013044   0.199252     
## Half.Bath                  -0.015474   0.016693   0.532997     
## Bedroom.AbvGr              -0.033541   0.008206   0.995815     
## Year.Built                  0.001903   0.000285   1.000000     
## log(Lot.Area)               0.127409   0.010837   1.000000     
## Central.AirY                0.192814   0.021870   1.000000     
## NeighborhoodGroupedGroup2   0.072994   0.015449   0.998281     
## NeighborhoodGroupedGroup3   0.071826   0.019531   0.989436     
## NeighborhoodGroupedGroup4   0.172940   0.024548   1.000000
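The agreement between the two procedures can be confirmed programmatically: BAS can report its single highest-posterior-probability model (HPM) instead of the averaged one. A sketch, assuming the predict() interface of the BAS package:

# List the predictors included in the highest-probability model (HPM);
# this should match the variables retained by stepAIC with the BIC penalty.
hpm <- predict(imodel_bas, estimator = "HPM")
imodel_bas$namesx[hpm$bestmodel + 1]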

1.2.1.6 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


The following residual plots show that the initial model (i_model) satisfies the assumptions of linear regression modeling:

1. the variance of the residuals is roughly constant;
2. the residuals show a nearly linear relationship;
3. the residuals are nearly normally distributed;
4. the residuals are independent (no particular pattern is observed).

The Cook’s distance plot flags the most influential outliers (observations 611, 21, and 325). Observation 325 has an extreme Lot.Area (19,800 sq ft), observation 21 has an extreme Year.Built (1880), and observation 611 is an extreme value within NeighborhoodGrouped Group 1. The model may simply lack variables needed to predict these houses well, so adding more predictors might resolve the outlier issue. (A snippet for identifying the most influential observations programmatically follows the plots.)

par(mfrow = c(2,2))
plot(i_model, which = 1, pch = 16, cex = .7, col = "light blue")
plot(i_model, which = 2, pch = 16, cex = .7, col = "light blue")
hist(i_model$resid, col = "light blue")
plot(i_model$resid, pch = 16, cex = 0.7, col = "light blue")

par(mfrow = c(1,2))
plot(i_model, which = 3, col = "light blue")
plot(i_model, which = 4, col = "light blue")
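The influential observations discussed above can also be extracted directly rather than read off the Cook’s distance plot; a small sketch using base R diagnostics:

# Rank observations by Cook's distance to confirm the influential points
# flagged in the plots, then inspect their key variables.
top_infl <- sort(cooks.distance(i_model), decreasing = TRUE)[1:3]
top_infl
normal[as.numeric(names(top_infl)), c("price", "area", "Lot.Area", "Year.Built", "Neighborhood")]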


1.2.1.7 Initial Model RMSE

For your initial model, calculate and briefly comment on the RMSE. You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


One metric for comparing the performance of multiple models is root mean squared error (RMSE): within the context of linear modeling, the square root of the mean of the squared residuals. In general, the better the model fit, the lower the RMSE. Since predictions are converted back to dollars before computing residuals, the RMSE below is in the original units. On the training data, the RMSE for the frequentist linear model (i_model) is $24,405.07, the BIC-selected model (imodel_BIC) gives $24,644.17, and the Bayesian model (imodel_bas, under BMA) gives $24,527.42. (A helper wrapping this computation is sketched after the code below.)

pred.imodel.train <- exp(predict(i_model, normal))
resid.imodel.train <- normal$price - pred.imodel.train
RMSE.imodel.train <- sqrt(mean(resid.imodel.train^2))                
RMSE.imodel.train 
## [1] 24405.07
pred.im_BIC.train <- exp(predict(imodel_BIC, normal))
resid.im_BIC.train <- normal$price - pred.im_BIC.train
RMSE.im_BIC.train <- sqrt(mean(resid.im_BIC.train^2))
RMSE.im_BIC.train
## [1] 24644.17
pred.im_bas.train <- exp(predict(imodel_bas, newdata = normal, estimator = "BMA")$fit)
resid.im_bas.train <- normal$price - pred.im_bas.train
RMSE.im_bas.train <- sqrt(mean(resid.im_bas.train^2))                
RMSE.im_bas.train 
## [1] 24527.42
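Since this back-transform-then-compare pattern recurs for every model and data set in this report, it could be wrapped in a small helper; a sketch (rmse_dollars is a hypothetical name, not used elsewhere in the analysis):

# Dollar-scale RMSE for a model fit on log(price): back-transform the
# predictions with exp(), then compare against observed prices.
rmse_dollars <- function(model, data) {
  pred <- exp(predict(model, data))
  sqrt(mean((data$price - pred)^2))
}
# Usage: rmse_dollars(i_model, normal)   # should reproduce 24405.07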

1.2.1.8 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


The RMSE for the frequentist model remains smaller than that of the Bayesian model on the out-of-sample data (ames_test), and each is higher than its training-data counterpart. The test RMSE for i_model (frequentist) is $26,257.29, while that of imodel_bas (Bayesian) is $26,311.66. Since these are somewhat higher than the training RMSEs ($24,405 and $24,527), the models overfit the training data to a degree. Some gap is expected, however: because a model is built to fit the training data, it will generally fit out-of-sample data somewhat worse, so the RMSE for out-of-sample predictions will generally be higher.

One way to assess how well a model reflects uncertainty is its coverage probability. If the model assumptions are met, a 95% prediction interval for price should contain the true price roughly 95% of the time. If the proportion of out-of-sample prices that fall within the 95% prediction intervals is significantly greater or less than 0.95, some assumption about uncertainty may be violated. The initial model’s coverage on the test set is roughly 95% (95.6%), so it appears to reflect uncertainty properly. (A reusable version of this computation is sketched after the code below.)

normal_test <- ames_test %>% filter(Sale.Condition == "Normal") %>% mutate(NeighborhoodGrouped = Neighborhood)
levels(normal_test$NeighborhoodGrouped)[levels(normal_test$NeighborhoodGrouped) %in% c("MeadowV", "BrDale", "IDOTRR", "OldTown", "Blueste", "BrkSide", "Edwards")] <- "Group1"
# Landmrk does not appear in the training data; it is assigned to Group2 here
levels(normal_test$NeighborhoodGrouped)[levels(normal_test$NeighborhoodGrouped) %in% c("SWISU", "Sawyer", "NAmes", "NPkVill", "Landmrk")] <- "Group2"
levels(normal_test$NeighborhoodGrouped)[levels(normal_test$NeighborhoodGrouped) %in% c("Mitchel", "SawyerW", "Gilbert", "ClearCr", "NWAmes", "Blmngtn", "CollgCr", "Crawfor")] <- "Group3"
levels(normal_test$NeighborhoodGrouped)[levels(normal_test$NeighborhoodGrouped) %in% c("Greens", "Veenker", "Somerst", "Timber", "StoneBr", "GrnHill", "NoRidge", "NridgHt")] <- "Group4"
pred.im.test <- exp(predict(i_model, normal_test))
resid.im.test <- normal_test$price - pred.im.test
RMSE.im.test <- sqrt(mean(resid.im.test^2))                
RMSE.im.test
## [1] 26257.29
pred.im_BIC.test <- exp(predict(imodel_BIC, normal_test))
resid.im_BIC.test <- normal_test$price - pred.im_BIC.test
RMSE.im_BIC.test <- sqrt(mean(resid.im_BIC.test^2))
RMSE.im_BIC.test
## [1] 26467.19
pred.im_bas.test <- exp(predict(imodel_bas, normal_test, estimator = "BMA")$fit)
resid.im_bas.test <- normal_test$price - pred.im_bas.test
RMSE.im_bas.test <- sqrt(mean(resid.im_bas.test^2))                
RMSE.im_bas.test 
## [1] 26311.66
int.im.test <- exp(predict(i_model, normal_test, interval = "prediction"))
cover.im.test <- mean(normal_test$price > int.im.test[,"lwr"] &
                            normal_test$price < int.im.test[,"upr"])
cover.im.test
## [1] 0.9559364
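The coverage computation above can be wrapped the same way as the RMSE; a sketch of a hypothetical helper mirroring the code just shown:

# Empirical coverage of the 95% prediction interval on the dollar scale:
# back-transform the interval bounds, then count how often the observed
# price falls inside.
coverage_95 <- function(model, data) {
  pi <- exp(predict(model, data, interval = "prediction"))
  mean(data$price > pi[, "lwr"] & data$price < pi[, "upr"])
}
# Usage: coverage_95(i_model, normal_test)   # should reproduce 0.9559364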

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

1.2.2 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

1.2.2.1 Final Model

Provide the summary table for your model.


The full candidate model (f_model) is fit below and then reduced by BIC; the resulting final model (fm_BIC) is: \[\begin{aligned} \log(price) ={} & 0.001 + 0.492\,\log(area) + 0.073\,Overall.Qual + 0.039\,Garage.Cars \\ & - 0.026\,Half.Bath - 0.039\,Bedroom.AbvGr + 0.003\,Year.Built + 0.130\,\log(Lot.Area) \\ & + 0.081\,Central.AirY + 0.064\,NeighborhoodGroupedGroup2 + 0.059\,NeighborhoodGroupedGroup3 \\ & + 0.173\,NeighborhoodGroupedGroup4 + 0.029\,\log(Total.Bsmt.SF + 1) + 0.054\,Overall.Cond \end{aligned}\]

f_model <- lm(log(price) ~ log(area) + Overall.Qual +  Garage.Cars + Full.Bath + Half.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) +  Central.Air + NeighborhoodGrouped + log(Total.Bsmt.SF + 1) + log(Garage.Area + 1) + Overall.Cond, data = normal)
summary(f_model)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Garage.Cars + 
##     Full.Bath + Half.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + NeighborhoodGrouped + log(Total.Bsmt.SF + 1) + 
##     log(Garage.Area + 1) + Overall.Cond, data = normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80183 -0.06701  0.00298  0.06210  0.48850 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -0.2105725  0.5853546  -0.360  0.71914    
## log(area)                  0.5097184  0.0255206  19.973  < 2e-16 ***
## Overall.Qual               0.0733576  0.0052134  14.071  < 2e-16 ***
## Garage.Cars                0.0517727  0.0095197   5.438 7.10e-08 ***
## Full.Bath                 -0.0210566  0.0123004  -1.712  0.08730 .  
## Half.Bath                 -0.0323238  0.0099339  -3.254  0.00119 ** 
## Bedroom.AbvGr             -0.0367339  0.0070326  -5.223 2.23e-07 ***
## Year.Built                 0.0032410  0.0002608  12.428  < 2e-16 ***
## log(Lot.Area)              0.1287537  0.0092879  13.863  < 2e-16 ***
## Central.AirY               0.0865723  0.0212907   4.066 5.24e-05 ***
## NeighborhoodGroupedGroup2  0.0654199  0.0132403   4.941 9.43e-07 ***
## NeighborhoodGroupedGroup3  0.0638032  0.0161074   3.961 8.11e-05 ***
## NeighborhoodGroupedGroup4  0.1752058  0.0208216   8.415  < 2e-16 ***
## log(Total.Bsmt.SF + 1)     0.0285908  0.0038848   7.360 4.49e-13 ***
## log(Garage.Area + 1)      -0.0095445  0.0045720  -2.088  0.03714 *  
## Overall.Cond               0.0533910  0.0042641  12.521  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1145 on 818 degrees of freedom
## Multiple R-squared:  0.9117, Adjusted R-squared:  0.9101 
## F-statistic:   563 on 15 and 818 DF,  p-value: < 2.2e-16
fm_BIC <- stepAIC(f_model, k = log(nrow(normal)))
## Start:  AIC=-3523.04
## log(price) ~ log(area) + Overall.Qual + Garage.Cars + Full.Bath + 
##     Half.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + NeighborhoodGrouped + log(Total.Bsmt.SF + 1) + 
##     log(Garage.Area + 1) + Overall.Cond
## 
##                          Df Sum of Sq    RSS     AIC
## - Full.Bath               1    0.0384 10.767 -3526.8
## - log(Garage.Area + 1)    1    0.0572 10.786 -3525.3
## <none>                                10.729 -3523.0
## - Half.Bath               1    0.1389 10.868 -3519.0
## - Central.Air             1    0.2169 10.946 -3513.1
## - Bedroom.AbvGr           1    0.3578 11.087 -3502.4
## - Garage.Cars             1    0.3879 11.117 -3500.1
## - log(Total.Bsmt.SF + 1)  1    0.7104 11.439 -3476.3
## - NeighborhoodGrouped     3    1.2091 11.938 -3454.2
## - Year.Built              1    2.0258 12.755 -3385.5
## - Overall.Cond            1    2.0562 12.785 -3383.5
## - log(Lot.Area)           1    2.5205 13.249 -3353.8
## - Overall.Qual            1    2.5968 13.326 -3349.0
## - log(area)               1    5.2321 15.961 -3198.5
## 
## Step:  AIC=-3526.78
## log(price) ~ log(area) + Overall.Qual + Garage.Cars + Half.Bath + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     NeighborhoodGrouped + log(Total.Bsmt.SF + 1) + log(Garage.Area + 
##     1) + Overall.Cond
## 
##                          Df Sum of Sq    RSS     AIC
## - log(Garage.Area + 1)    1    0.0467 10.814 -3529.9
## <none>                                10.767 -3526.8
## - Half.Bath               1    0.1060 10.873 -3525.3
## - Central.Air             1    0.2327 11.000 -3515.7
## - Garage.Cars             1    0.3651 11.132 -3505.7
## - Bedroom.AbvGr           1    0.4270 11.194 -3501.1
## - log(Total.Bsmt.SF + 1)  1    0.7576 11.525 -3476.8
## - NeighborhoodGrouped     3    1.2310 11.998 -3456.7
## - Year.Built              1    2.0140 12.781 -3390.5
## - Overall.Cond            1    2.0651 12.832 -3387.2
## - Overall.Qual            1    2.6036 13.371 -3352.9
## - log(Lot.Area)           1    2.6509 13.418 -3349.9
## - log(area)               1    5.9370 16.704 -3167.3
## 
## Step:  AIC=-3529.89
## log(price) ~ log(area) + Overall.Qual + Garage.Cars + Half.Bath + 
##     Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air + 
##     NeighborhoodGrouped + log(Total.Bsmt.SF + 1) + Overall.Cond
## 
##                          Df Sum of Sq    RSS     AIC
## <none>                                10.814 -3529.9
## - Half.Bath               1    0.1039 10.918 -3528.6
## - Central.Air             1    0.1991 11.013 -3521.4
## - Garage.Cars             1    0.3528 11.167 -3509.8
## - Bedroom.AbvGr           1    0.4153 11.229 -3505.2
## - log(Total.Bsmt.SF + 1)  1    0.7550 11.569 -3480.3
## - NeighborhoodGrouped     3    1.2279 12.042 -3460.4
## - Year.Built              1    2.0656 12.880 -3390.8
## - Overall.Cond            1    2.0660 12.880 -3390.8
## - Overall.Qual            1    2.5997 13.414 -3356.9
## - log(Lot.Area)           1    2.6220 13.436 -3355.6
## - log(area)               1    5.9589 16.773 -3170.6
summary(fm_BIC)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Garage.Cars + 
##     Half.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) + 
##     Central.Air + NeighborhoodGrouped + log(Total.Bsmt.SF + 1) + 
##     Overall.Cond, data = normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80653 -0.06724  0.00348  0.06008  0.46365 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                0.0014579  0.5587384   0.003 0.997919    
## log(area)                  0.4918998  0.0231409  21.257  < 2e-16 ***
## Overall.Qual               0.0733940  0.0052273  14.040  < 2e-16 ***
## Garage.Cars                0.0391379  0.0075667   5.172 2.91e-07 ***
## Half.Bath                 -0.0263243  0.0093770  -2.807 0.005114 ** 
## Bedroom.AbvGr             -0.0386649  0.0068901  -5.612 2.74e-08 ***
## Year.Built                 0.0031624  0.0002527  12.515  < 2e-16 ***
## log(Lot.Area)              0.1299868  0.0092187  14.100  < 2e-16 ***
## Central.AirY               0.0807343  0.0207799   3.885 0.000110 ***
## NeighborhoodGroupedGroup2  0.0643906  0.0130608   4.930 9.95e-07 ***
## NeighborhoodGroupedGroup3  0.0591165  0.0160145   3.691 0.000238 ***
## NeighborhoodGroupedGroup4  0.1729250  0.0208114   8.309 3.97e-16 ***
## log(Total.Bsmt.SF + 1)     0.0292871  0.0038707   7.566 1.03e-13 ***
## Overall.Cond               0.0535114  0.0042753  12.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1148 on 820 degrees of freedom
## Multiple R-squared:  0.911,  Adjusted R-squared:  0.9096 
## F-statistic: 645.5 on 13 and 820 DF,  p-value: < 2.2e-16
fm_bas <- bas.lm(log(price) ~ log(area) + Overall.Qual +  Garage.Cars + Full.Bath + Half.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) +  Central.Air + NeighborhoodGrouped + log(Total.Bsmt.SF + 1) + log(Garage.Area + 1) + Overall.Cond, data = normal, prior = "BIC", modelprior = uniform())
summary(fm_bas)
##                           P(B != 0 | Y)    model 1       model 2       model 3
## Intercept                    1.00000000     1.0000     1.0000000     1.0000000
## log(area)                    1.00000000     1.0000     1.0000000     1.0000000
## Overall.Qual                 1.00000000     1.0000     1.0000000     1.0000000
## Garage.Cars                  0.99996527     1.0000     1.0000000     1.0000000
## Full.Bath                    0.07895577     0.0000     0.0000000     0.0000000
## Half.Bath                    0.66388938     1.0000     0.0000000     1.0000000
## Bedroom.AbvGr                0.99999378     1.0000     1.0000000     1.0000000
## Year.Built                   1.00000000     1.0000     1.0000000     1.0000000
## log(Lot.Area)                1.00000000     1.0000     1.0000000     1.0000000
## Central.AirY                 0.98399432     1.0000     1.0000000     1.0000000
## NeighborhoodGroupedGroup2    0.99668639     1.0000     1.0000000     1.0000000
## NeighborhoodGroupedGroup3    0.96383823     1.0000     1.0000000     1.0000000
## NeighborhoodGroupedGroup4    1.00000000     1.0000     1.0000000     1.0000000
## log(Total.Bsmt.SF + 1)       1.00000000     1.0000     1.0000000     1.0000000
## log(Garage.Area + 1)         0.17054435     0.0000     0.0000000     1.0000000
## Overall.Cond                 1.00000000     1.0000     1.0000000     1.0000000
## BF                                   NA     1.0000     0.5349702     0.2106944
## PostProbs                            NA     0.4725     0.2528000     0.0996000
## R2                                   NA     0.9110     0.9101000     0.9114000
## dim                                  NA    14.0000    13.0000000    15.0000000
## logmarg                              NA -1039.8926 -1040.5181024 -1041.4499046
##                                 model 4       model 5
## Intercept                     1.0000000     1.0000000
## log(area)                     1.0000000     1.0000000
## Overall.Qual                  1.0000000     1.0000000
## Garage.Cars                   1.0000000     1.0000000
## Full.Bath                     0.0000000     1.0000000
## Half.Bath                     0.0000000     1.0000000
## Bedroom.AbvGr                 1.0000000     1.0000000
## Year.Built                    1.0000000     1.0000000
## log(Lot.Area)                 1.0000000     1.0000000
## Central.AirY                  1.0000000     1.0000000
## NeighborhoodGroupedGroup2     1.0000000     1.0000000
## NeighborhoodGroupedGroup3     1.0000000     1.0000000
## NeighborhoodGroupedGroup4     1.0000000     1.0000000
## log(Total.Bsmt.SF + 1)        1.0000000     1.0000000
## log(Garage.Area + 1)          1.0000000     0.0000000
## Overall.Cond                  1.0000000     1.0000000
## BF                            0.1022366     0.1020917
## PostProbs                     0.0483000     0.0482000
## R2                            0.9105000     0.9112000
## dim                          14.0000000    15.0000000
## logmarg                   -1042.1730239 -1042.1744422

1.2.2.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


As shown in 1.2.1.1 Price and Area, the response variable price and the size variable area were both log-transformed to make their relationship more linear, and the plots below show the same for Lot.Area. Garage.Area and Total.Bsmt.SF were also log-transformed; because these variables can take values equal to zero (and \(\log(0) = -\infty\)), 1 was added to each value before transforming.
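As an aside, base R’s log1p() computes log(1 + x) directly, so it is equivalent to the manual log(x + 1) used in the model formulas; a quick sketch of the equivalence:

# log1p(x) = log(1 + x), matching the log(Total.Bsmt.SF + 1) transform
# used in the model (and more accurate for values of x near zero).
all.equal(log1p(normal$Total.Bsmt.SF), log(normal$Total.Bsmt.SF + 1))  # should be TRUE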

Neighborhood is a categorical variable with 27 levels, many of which overlap considerably, so it was re-leveled into four groups (Group1 through Group4) in the NeighborhoodGrouped variable, as seen in 1.2.1.3 Neighborhoods. The groups were formed by comparing each neighborhood’s median price to the quartiles of price overall.

lotarea <- ggplot(data = normal, aes(x = Lot.Area, y = price)) + geom_point() + geom_smooth(method = "lm")
loglotarea <- ggplot(data = normal, aes(x = log(Lot.Area), y = log(price))) + geom_point() + geom_smooth(method = "lm")
grid.arrange(lotarea, loglotarea, nrow = 1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

neighborhoods

groupedneighborhoods


1.2.2.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


When predictor variables are correlated, including both in the model causes redundant fitting and produces unreliable coefficients. When Bedroom.AbvGr or Full.Bath is regressed against price on its own, each has a positive coefficient. However, once the area variable enters the model, the coefficients for Bedroom.AbvGr and Full.Bath turn negative, as seen in the summary of i_model (the sketch below illustrates the underlying correlations). To avoid compounding such redundancy, I did not include any variable interactions.
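The collinearity described above can be verified with pairwise correlations among the size and room-count variables; a small sketch (its output is not shown, as it was not part of the original run):

# Pairwise correlations among area and the room-count predictors; strong
# positive correlations here explain the sign flips described above.
cor(normal[, c("area", "Bedroom.AbvGr", "Full.Bath", "Half.Bath")])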


1.2.2.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


I categorized the variables in the data set by what I expected to drive price: a home’s location, size, age, and quality. I chose NeighborhoodGrouped (the re-leveled Neighborhood) to represent location, which contributes three coefficients (Groups 2, 3, and 4, with Group 1 as the reference). As size variables, I included four: area, Lot.Area, Garage.Area, and Total.Bsmt.SF (total basement area). I chose Year.Built to represent a home’s age. To assess quality, I included seven variables: Garage.Cars, Overall.Cond, Overall.Qual, Full.Bath, Half.Bath, Bedroom.AbvGr, and Central.Air.

Some variables were excluded because they bear only a remote relationship to price; others were excluded because of collinearity with variables already included in the model.

I picked the fm_BIC model as my prediction model, based on its adjusted R² and the significance of the variables it retains; in that model, the Garage.Area and Full.Bath variables are removed.


1.2.2.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


Testing the final model on out-of-sample data gives a slightly greater RMSE than on the training data: $24,306.57 (test) vs. $22,821.09 (training). Both values improved on the initial model, whose RMSEs were $26,467.19 (test) and $24,644.17 (training). The remaining slight overfitting could be reduced further by simplifying the model.

The coverage probability for the final model on the out-of-sample data (94.86%) is close to 95%, suggesting that the final model properly reflects uncertainty, although the 95% prediction intervals are somewhat wide.

pred.fm_BIC.train <- exp(predict(fm_BIC, normal))
resid.fm_BIC.train <- normal$price - pred.fm_BIC.train
RMSE.fm_BIC.train <- sqrt(mean(resid.fm_BIC.train^2))
RMSE.fm_BIC.train
## [1] 22821.09
pred.fm_BIC.test <- exp(predict(fm_BIC, normal_test))
resid.fm_BIC.test <- normal_test$price - pred.fm_BIC.test
RMSE.fm_BIC.test <- sqrt(mean(resid.fm_BIC.test^2))
RMSE.fm_BIC.test
## [1] 24306.57
int.fm_BIC.test <- round(exp(predict(fm_BIC, normal_test, interval = "prediction")))
cover.fm_BIC.test <- mean(normal_test$price > int.fm_BIC.test[, "lwr"] & normal_test$price < int.fm_BIC.test[, "upr"]) 
cover.fm_BIC.test
## [1] 0.9485924

1.2.3 Final Model Assessment

1.2.3.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.


The residual plots for the final model show that it satisfies the model assumptions: constant variance, linearity, near-normality, and independence of the residuals. There are, however, a few outliers affecting these assumptions. Cook’s distance flags three influential observations (#21, #560, #611), and the hatvalues function identifies the highest-leverage home (#506). Two of these four houses were built in 1880, far earlier than most homes in the data, which may contribute to other differences that make them outliers. Home #611 has the largest residual: it sold for $40,000 against a predicted $89,605. Homes #21 and #560 sold for much more than predicted ($265,979 vs. $182,109 and $230,000 vs. $158,874, respectively). A high-leverage home does not necessarily have a large residual, as seen with home #506, whose predicted price ($135,669) is close to its actual sale price ($131,000).

par(mfrow = c(2,2))
plot(fm_BIC, which = 1, pch = 16, cex = .7, col = "light blue")
plot(fm_BIC, which = 2, pch = 16, cex = .7, col = "light blue")
hist(fm_BIC$resid, col = "light blue")
plot(fm_BIC$resid, pch = 16, cex = 0.7, col = "light blue")

par(mfrow = c(1,2))
plot(fm_BIC, which = 3, col = "light blue")
plot(fm_BIC, which = 4, col = "light blue")

which(hatvalues(fm_BIC) == max(hatvalues(fm_BIC)))
## 506 
## 506
outlier <- normal[c(21, 506, 560, 611), ]
p.price <- round(exp(predict(fm_BIC, outlier)))
outlier$prediction <- p.price
outlier %>% dplyr::select(prediction, price, Lot.Area, Year.Built)
## # A tibble: 4 x 4
##   prediction  price Lot.Area Year.Built
##        <dbl>  <int>    <int>      <int>
## 1     182109 265979    11700       1880
## 2     135669 131000    26400       1880
## 3     158874 230000     8239       1986
## 4      89605  40000     8500       1920

1.2.3.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


The final model’s RMSE is lower than the initial model’s on both the training and the testing data. Overfitting also improved: the test-minus-training RMSE gap fell from $1,823.02 for the initial model to $1,485.48 for the final model.

RMSE.im_BIC.train
## [1] 24644.17
RMSE.im_BIC.test
## [1] 26467.19
RMSE.im_BIC.test - RMSE.im_BIC.train
## [1] 1823.017
RMSE.fm_BIC.train
## [1] 22821.09
RMSE.fm_BIC.test
## [1] 24306.57
RMSE.fm_BIC.test - RMSE.fm_BIC.train
## [1] 1485.479

1.2.3.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


This final model can be considered quite successful in explaining the variation in sale price (adjusted R² = 90.96%); only 9.04% of the variation in price remains unexplained. The model also quantifies the contribution of each explanatory variable. However, the 95% prediction intervals are still fairly wide. Collecting more observations, or adding new predictors such as school district information (elementary, middle, and high school) or other amenities in the area, might tighten these intervals.


1.2.3.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings; be sure to mention:

* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

The RMSE of the final model on the validation data is $21,844.61. This is lower than the final model’s RMSE on the training and test data ($22,821.09 and $24,306.57, respectively). The coverage probability — the share of 95% prediction intervals that contain the true price of the house in the validation data set — is just under 95%, at 94.5%. I can therefore conclude that my final model properly reflects uncertainty.

normal_val <- ames_validation %>% filter(Sale.Condition == "Normal")
normal_val <- normal_val %>% mutate(NeighborhoodGrouped = Neighborhood)
levels(normal_val$NeighborhoodGrouped)[levels(normal_val$NeighborhoodGrouped) %in% c("MeadowV", "BrDale", "IDOTRR", "OldTown", "Blueste", "BrkSide", "Edwards")] <- "Group1"
levels(normal_val$NeighborhoodGrouped)[levels(normal_val$NeighborhoodGrouped) %in% c("SWISU", "Sawyer", "NAmes", "NPkVill")] <- "Group2"
levels(normal_val$NeighborhoodGrouped)[levels(normal_val$NeighborhoodGrouped) %in% c("Mitchel", "SawyerW", "Gilbert", "ClearCr", "NWAmes", "Blmngtn", "CollgCr", "Crawfor")] <- "Group3"
levels(normal_val$NeighborhoodGrouped)[levels(normal_val$NeighborhoodGrouped) %in% c("Greens", "Veenker", "Somerst", "Timber", "StoneBr", "GrnHill", "NoRidge", "NridgHt")] <- "Group4"
pred.fm_BIC.val <- exp(predict(fm_BIC, normal_val))
resid.fm_BIC.val <- normal_val$price - pred.fm_BIC.val
RMSE.fm_BIC.val <- sqrt(mean(resid.fm_BIC.val^2))                
RMSE.fm_BIC.val 
## [1] 21844.61
pred.fm_bas.val <- exp(predict(fm_bas, normal_val, estimator = "BMA")$fit)
resid.fm_bas.val <- normal_val$price - pred.fm_bas.val
RMSE.fm_bas.val <- sqrt(mean(resid.fm_bas.val^2))                
RMSE.fm_bas.val 
## [1] 21784.54
int.fm_BIC.val <- round(exp(predict(fm_BIC, normal_val, interval = "prediction")))
cover.fm_BIC.val <- mean(normal_val$price > int.fm_BIC.val[,"lwr"] &
                            normal_val$price < int.fm_BIC.val[,"upr"])
cover.fm_BIC.val
## [1] 0.9449541

1.2.4 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


My final model (fm_BIC) is a decent model, with a high adjusted R² (90.96%) and good coverage probability, though its 95% prediction intervals are quite wide. Predictions appear more precise for homes with lower sale prices. Many of the variables in the data set are related to one another, so although there are many variables to choose from, including too many would introduce redundancy into the model. Predictive ability might improve if other variables were available that are not correlated with the existing ones. The model confirmed that location, size, quality, home age, and a few other specifications are the deciding predictors of home price, as I expected. Surprisingly, Half.Bath also turned out to be a predictor of home price, more so than Full.Bath. I learned that the selection of variables for the initial model is critical to the development of the final model. * * *