1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.

load("~/Desktop/R Programming/Statistics_Coursera/Capstone/Peer_Assignment_II/ames_train.Rdata")

Use the code block below to load any necessary packages

library(statsr)
library(dplyr)
library(MASS)
library(plyr)
library(BAS)
library(broom)
library(kableExtra)
library(corrplot)
library(ggplot2)
library(ggthemes)
library(graphics)
library(PerformanceAnalytics)

2.1 Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).


Figure 1 shows the distribution of home sale prices in Ames, Iowa, with a median price of 159,467 USD. The distribution is strongly right-skewed: most homes sell near the median, while a long upper tail of expensive homes points to a number of high-price outliers.

Figure 2 compares the distribution of sale price across neighborhoods, ordered by median price. It suggests that location is an important factor in the price of a property. StoneBr has the highest median price (340,691.5 USD), while MeadowV has the lowest (85,750 USD).

The overall quality of the property is another important determinant of price, as shown in Figure 3. Recently remodeled houses with high quality ratings (8 to 10) fetch the highest prices.

Figure 4 shows the sale price per square foot of lot area, for houses priced at 25 USD per square foot or more, by neighborhood and zoning. The selling_diff variable is created to record how long after being built or remodeled a house was sold. The figure indicates that houses in the floating village (FV) zone in the Somerst neighborhood are the most expensive per square foot, even though most of them were sold almost ten years after being remodeled. Most houses sold within a year of being remodeled, by contrast, are located in the NridgHt neighborhood, a low density residential (RL) area.

price_data <- ames_train %>%
  filter(!is.na(price)) %>%
  dplyr::select(price, Neighborhood) 

median_price <- median(price_data$price/1000)
line <- data.frame(vlines = 270, labels = "median price = 159,467 USD", stringsAsFactors = FALSE)
  
p1 <- price_data %>%
  ggplot(aes(x = price/1000)) +
  geom_histogram() +
  geom_vline(xintercept = median_price, linetype = "dashed") +
  geom_text(data = line, aes(x = vlines, y = 170, label = labels)) +
  theme_solarized() +
  labs(title = "Figure 1 - Distribution of sale price",
       x = "sale price (thousands)",
       y = "count")

p2 <- price_data %>%
  ggplot(aes(x = reorder(Neighborhood, price/1000, FUN = median), y = price/1000)) +
  geom_boxplot() +
  theme_solarized() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Figure 2 - Distribution of sale price by neighborhood",
       y = "Sale price (thousands)",
       x = "Neighborhood")

p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p2

location_price <- ames_train %>%
  dplyr::select(Neighborhood, price) 

price_summary <- location_price %>%
  group_by(Neighborhood) %>%
  # dplyr::summarise is called explicitly: plyr is attached after dplyr, so a
  # bare summarise() resolves to plyr::summarise(), which ignores group_by()
  # and returns a single row of whole-sample statistics
  dplyr::summarise(min_price = min(price),
                   max_price = max(price),
                   mean_price = mean(price),
                   median_price = median(price),
                   IQR_price = IQR(price),
                   sd_price = sd(price))

ex_neighborhood <- price_summary[which(price_summary$median_price == max(price_summary$median_price)),]
leastex_neighborhood <- price_summary[which(price_summary$median_price == min(price_summary$median_price)),]

ex_neighborhood %>%
  dplyr::select(Neighborhood, median_price) %>%
  kbl(caption = "<b>Most expensive neighborhood</b>") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Most expensive neighborhood
Neighborhood median_price
StoneBr 340691.5
leastex_neighborhood %>%
  dplyr::select(Neighborhood, median_price) %>%
  kbl(caption = "<b>Least expensive neighborhood</b>") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Least expensive neighborhood
Neighborhood median_price
MeadowV 85750
ames_train$Overall.Qual <- as.factor(ames_train$Overall.Qual)

ggplot(ames_train, aes(x = Year.Remod.Add, y = price/1000, 
                       color = Overall.Qual)) +
  geom_point() +
  theme_solarized() +
  labs(title = "Figure 3 - Sale price distribution by remodel year and overall quality",
       x = "Remodel year",
       y = "Sale price (thousands)")

ames_train$price.Sqfeet <- ames_train$price/ames_train$Lot.Area
ames_train$selling_diff <- ames_train$Yr.Sold - ames_train$Year.Remod.Add

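# Note: the labels of the last three bins do not match their cut-points
# (<= 25, <= 50, and > 50 years). The label strings are kept unchanged because
# they appear as coefficient names in the model output.
ames_train$selling_diff <- case_when(
    ames_train$selling_diff <= 1 ~ "within a year",
    ames_train$selling_diff <= 5 ~ "within five years",
    ames_train$selling_diff <= 10 ~ "within ten years",
    ames_train$selling_diff <= 25 ~ "within fifteen years",
    ames_train$selling_diff <= 50 ~ "within twenty years",
    TRUE ~ "more than twenty years")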
  
ames_train$selling_diff <- as.factor(ames_train$selling_diff)  
price.more25 <- ames_train %>%
  filter(price.Sqfeet >= 25) %>%
  dplyr::select(price.Sqfeet, Neighborhood, selling_diff, MS.Zoning)

ggplot(price.more25, aes(x = Neighborhood, y = price.Sqfeet, 
                       color = selling_diff, shape = MS.Zoning)) +
  geom_point() +
  theme_solarized() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Figure 4 - Sale price per square feet by neighborhood & zoning",
       x = "Neighborhood",
       y = "Sale price per square feet",
       color = "House sold",
       shape = "House zone")
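
The binning of years since remodel used for selling_diff can also be sketched with base R's cut(). This is an alternative to the case_when() call above, not the code that produced Figure 4: the input vector and the label strings are hypothetical, and only the cut-points (1, 5, 10, 25, 50) are taken from the original code.

```r
# Hypothetical values for Yr.Sold - Year.Remod.Add
years_since_remodel <- c(0, 1, 3, 5, 8, 10, 20, 25, 40, 50, 60)

# Illustrative labels that state the cut-points directly
bin_labels <- c("<= 1 year", "2-5 years", "6-10 years",
                "11-25 years", "26-50 years", "> 50 years")

# Intervals are closed on the right by default, so a value equal to a
# cut-point falls in the lower bin, matching the <= tests in case_when()
selling_bin <- cut(years_since_remodel,
                   breaks = c(-Inf, 1, 5, 10, 25, 50, Inf),
                   labels = bin_labels)

table(selling_bin)
```

cut() returns a factor whose levels follow the order of the breaks, whereas as.factor() on character labels sorts them alphabetically. That alphabetical order is why "more than twenty years" ends up as the baseline level in the models below.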


2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

2.2.1 Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


selling_diff, Overall.Qual, Lot.Area, Neighborhood, and MS.Zoning are selected as predictors in a linear model for log(price), with Lot.Area also log-transformed. Stepwise selection by AIC is then used to check whether any of them should be dropped.

Stepwise AIC retains all five predictors, since dropping any of them raises the AIC. The model has an adjusted \(R^2\) of 0.8208 and a residual standard error of 0.178 on the log scale. selling_diff, Overall.Qual, and log(Lot.Area) are strongly significant, with p-values below 0.001 for every level except Overall.Qual2, so they should be included in the model.

#clean ames_train dataset
ames_train$Garage.Area[is.na(ames_train$Garage.Area)] <- 0
ames_train$Total.Bsmt.SF[is.na(ames_train$Total.Bsmt.SF)] <- 0
initial_model <- lm(log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
                      Neighborhood + MS.Zoning, 
                    data = ames_train)

initial_model_AIC <- stepAIC(initial_model, k = 2)
## Start:  AIC=-3405.64
## log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + Neighborhood + 
##     MS.Zoning
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       30.208 -3405.6
## - MS.Zoning      5    0.7560 30.964 -3390.9
## - selling_diff   5    3.2860 33.494 -3312.4
## - Neighborhood  26    4.8295 35.038 -3309.3
## - log(Lot.Area)  1    4.8296 35.038 -3259.3
## - Overall.Qual   9   19.2858 49.494 -2929.9
summary(initial_model_AIC)
## 
## Call:
## lm(formula = log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + 
##     Neighborhood + MS.Zoning, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.40748 -0.09052  0.00262  0.09133  0.81040 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       8.844553   0.236605  37.381  < 2e-16 ***
## selling_diffwithin a year         0.204444   0.027375   7.468 1.83e-13 ***
## selling_diffwithin fifteen years  0.175801   0.024113   7.291 6.47e-13 ***
## selling_diffwithin five years     0.196541   0.025383   7.743 2.47e-14 ***
## selling_diffwithin ten years      0.231974   0.024902   9.316  < 2e-16 ***
## selling_diffwithin twenty years   0.102827   0.022184   4.635 4.06e-06 ***
## Overall.Qual2                     0.124814   0.193324   0.646 0.518680    
## Overall.Qual3                     0.632486   0.190539   3.319 0.000936 ***
## Overall.Qual4                     0.808254   0.182785   4.422 1.09e-05 ***
## Overall.Qual5                     0.948730   0.181905   5.216 2.25e-07 ***
## Overall.Qual6                     1.091936   0.182245   5.992 2.94e-09 ***
## Overall.Qual7                     1.203669   0.182888   6.581 7.68e-11 ***
## Overall.Qual8                     1.360050   0.183938   7.394 3.12e-13 ***
## Overall.Qual9                     1.589812   0.186388   8.530  < 2e-16 ***
## Overall.Qual10                    1.650782   0.193904   8.513  < 2e-16 ***
## log(Lot.Area)                     0.198656   0.016094  12.344  < 2e-16 ***
## NeighborhoodBlueste              -0.077823   0.121803  -0.639 0.523024    
## NeighborhoodBrDale               -0.263515   0.086477  -3.047 0.002373 ** 
## NeighborhoodBrkSide              -0.254784   0.069076  -3.688 0.000238 ***
## NeighborhoodClearCr              -0.150555   0.081591  -1.845 0.065312 .  
## NeighborhoodCollgCr              -0.162075   0.060711  -2.670 0.007723 ** 
## NeighborhoodCrawfor              -0.009752   0.068078  -0.143 0.886127    
## NeighborhoodEdwards              -0.249268   0.063991  -3.895 0.000105 ***
## NeighborhoodGilbert              -0.191096   0.063946  -2.988 0.002876 ** 
## NeighborhoodGreens               -0.091811   0.107415  -0.855 0.392916    
## NeighborhoodGrnHill               0.190387   0.142768   1.334 0.182672    
## NeighborhoodIDOTRR               -0.324120   0.075876  -4.272 2.13e-05 ***
## NeighborhoodMeadowV              -0.192138   0.080614  -2.383 0.017347 *  
## NeighborhoodMitchel              -0.176658   0.066526  -2.655 0.008052 ** 
## NeighborhoodNAmes                -0.210709   0.061855  -3.407 0.000685 ***
## NeighborhoodNoRidge               0.063023   0.069330   0.909 0.363569    
## NeighborhoodNPkVill              -0.089052   0.106713  -0.835 0.404207    
## NeighborhoodNridgHt               0.059462   0.063103   0.942 0.346282    
## NeighborhoodNWAmes               -0.119535   0.066457  -1.799 0.072385 .  
## NeighborhoodOldTown              -0.306890   0.068198  -4.500 7.63e-06 ***
## NeighborhoodSawyer               -0.223468   0.065084  -3.434 0.000621 ***
## NeighborhoodSawyerW              -0.182844   0.064147  -2.850 0.004461 ** 
## NeighborhoodSomerst              -0.003329   0.071951  -0.046 0.963106    
## NeighborhoodStoneBr               0.045517   0.071364   0.638 0.523750    
## NeighborhoodSWISU                -0.202073   0.079543  -2.540 0.011229 *  
## NeighborhoodTimber               -0.079227   0.071757  -1.104 0.269825    
## NeighborhoodVeenker              -0.146865   0.084142  -1.745 0.081232 .  
## MS.ZoningFV                       0.236980   0.088476   2.678 0.007524 ** 
## MS.ZoningI (all)                 -0.100718   0.191190  -0.527 0.598456    
## MS.ZoningRH                       0.195618   0.100782   1.941 0.052553 .  
## MS.ZoningRL                       0.297223   0.073254   4.057 5.37e-05 ***
## MS.ZoningRM                       0.275782   0.067623   4.078 4.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.178 on 953 degrees of freedom
## Multiple R-squared:  0.8291, Adjusted R-squared:  0.8208 
## F-statistic: 100.5 on 46 and 953 DF,  p-value: < 2.2e-16
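
Because the response is log(price), the coefficients are easier to read once they are converted to percentage changes in price. The sketch below does that arithmetic for two coefficients copied from the summary output above; the variable names are illustrative.

```r
# Coefficients copied from the summary output above
b_log_lot     <- 0.198656   # log(Lot.Area)
b_within_year <- 0.204444   # selling_diff "within a year"

# Log-log term: a 10% increase in lot area multiplies price by 1.10^b
pct_lot_10 <- (1.10^b_log_lot - 1) * 100

# Indicator term: being in this level multiplies price by exp(b)
# relative to the baseline level of the factor
pct_within_year <- (exp(b_within_year) - 1) * 100

round(c(lot_area_plus_10pct = pct_lot_10,
        sold_within_a_year  = pct_within_year), 1)
```

Under this model, a 10% larger lot is associated with a price about 1.9% higher, and a house sold within a year of its remodel with a price about 23% higher than the baseline level (more than twenty years), holding the other predictors fixed.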

2.2.2 Section 2.2 Model Selection

Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


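Garage.Area and Total.Bsmt.SF are added to the initial model. Stepwise selection with the BIC penalty (k = log(n)) drops Neighborhood, which lowers the criterion from -3306.64 to -3363.8, and retains selling_diff, Overall.Qual, log(Lot.Area), MS.Zoning, Garage.Area, and Total.Bsmt.SF.

BAS does not arrive at exactly the same model. Its highest probability model (posterior probability 0.1759, \(R^2\) of 0.8475) keeps the six variables retained by BIC, but it works on individual dummy variables rather than whole factors: it retains the Crawfor, GrnHill, NoRidge, NridgHt, and StoneBr neighborhood indicators and drops Overall.Qual2 and MS.ZoningI (all). The two methods therefore agree on the core predictors and disagree mainly on how much of the neighborhood effect is worth keeping.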
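
The BIC search reuses stepAIC() with the penalty k = log(n) in place of the AIC default k = 2. A minimal sketch on simulated data (the data frame sim and its columns are made up for illustration) checks that this penalty reproduces the criterion computed by extractAIC(), which is the value stepAIC() reports in its AIC column, and that it differs from BIC() only by a constant that is the same for every model fitted to the same data.

```r
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

fit_full  <- lm(y ~ x1 + x2, data = sim)
fit_small <- lm(y ~ x1, data = sim)

# BIC penalty per estimated coefficient
k_bic <- log(nrow(sim))

# Criterion used in the stepwise search when k = log(n)
step_crit <- function(fit) extractAIC(fit, k = k_bic)[2]

# Same quantity by hand: n * log(RSS / n) + k * (number of coefficients)
rss <- sum(resid(fit_full)^2)
edf <- n - df.residual(fit_full)
manual_crit <- n * log(rss / n) + k_bic * edf

same_as_manual <- isTRUE(all.equal(step_crit(fit_full), manual_crit))

# The gap to BIC() is identical for both models, so rankings agree
same_gap <- isTRUE(all.equal(BIC(fit_full) - step_crit(fit_full),
                             BIC(fit_small) - step_crit(fit_small)))

c(same_as_manual = same_as_manual, same_gap = same_gap)
```

Since log(1000) is roughly 6.9, each extra coefficient costs far more under BIC than under AIC, which is consistent with the 26-coefficient Neighborhood factor surviving the AIC search and being dropped by the BIC search.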

model_BIC <- lm(log(price) ~ selling_diff + Overall.Qual +
                              log(Lot.Area) + Neighborhood + 
                              MS.Zoning + Garage.Area + Total.Bsmt.SF, 
                    data = ames_train)

k.BIC <- log(nrow(ames_train))
initial_model_BIC <- stepAIC(model_BIC, k = k.BIC)
## Start:  AIC=-3306.64
## log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + Neighborhood + 
##     MS.Zoning + Garage.Area + Total.Bsmt.SF
## 
##                 Df Sum of Sq    RSS     AIC
## - Neighborhood  26    3.4018 29.520 -3363.8
## - MS.Zoning      5    0.8139 26.932 -3310.5
## <none>                       26.118 -3306.6
## - Garage.Area    1    1.4493 27.568 -3259.5
## - log(Lot.Area)  1    1.7787 27.897 -3247.7
## - Total.Bsmt.SF  1    2.1223 28.241 -3235.4
## - selling_diff   5    3.0673 29.186 -3230.1
## - Overall.Qual   9   12.2628 38.381 -2983.9
## 
## Step:  AIC=-3363.8
## log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + MS.Zoning + 
##     Garage.Area + Total.Bsmt.SF
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       29.520 -3363.8
## - MS.Zoning      5    1.9613 31.481 -3334.0
## - Garage.Area    1    1.8650 31.385 -3309.5
## - log(Lot.Area)  1    2.2560 31.776 -3297.1
## - Total.Bsmt.SF  1    2.7437 32.264 -3281.8
## - selling_diff   5    3.6956 33.216 -3280.4
## - Overall.Qual   9   20.9165 50.437 -2890.3
model_bas <- bas.lm(log(price) ~ selling_diff + Overall.Qual +
                              log(Lot.Area) + Neighborhood + 
                              MS.Zoning + Garage.Area + Total.Bsmt.SF, 
                       data = ames_train, prior = "BIC",
                       modelprior = uniform())

summary(model_bas)
##                                  P(B != 0 | Y)    model 1       model 2
## Intercept                           1.00000000     1.0000     1.0000000
## selling_diffwithin a year           1.00000000     1.0000     1.0000000
## selling_diffwithin fifteen years    1.00000000     1.0000     1.0000000
## selling_diffwithin five years       1.00000000     1.0000     1.0000000
## selling_diffwithin ten years        1.00000000     1.0000     1.0000000
## selling_diffwithin twenty years     0.99845238     1.0000     1.0000000
## Overall.Qual2                       0.04535582     0.0000     0.0000000
## Overall.Qual3                       0.98466910     1.0000     1.0000000
## Overall.Qual4                       0.99956592     1.0000     1.0000000
## Overall.Qual5                       0.99999258     1.0000     1.0000000
## Overall.Qual6                       0.99999987     1.0000     1.0000000
## Overall.Qual7                       1.00000000     1.0000     1.0000000
## Overall.Qual8                       1.00000000     1.0000     1.0000000
## Overall.Qual9                       1.00000000     1.0000     1.0000000
## Overall.Qual10                      1.00000000     1.0000     1.0000000
## log(Lot.Area)                       1.00000000     1.0000     1.0000000
## NeighborhoodBlueste                 0.05539778     0.0000     0.0000000
## NeighborhoodBrDale                  0.03583905     0.0000     0.0000000
## NeighborhoodBrkSide                 0.03063710     0.0000     0.0000000
## NeighborhoodClearCr                 0.15582098     0.0000     0.0000000
## NeighborhoodCollgCr                 0.04863838     0.0000     0.0000000
## NeighborhoodCrawfor                 0.99999931     1.0000     1.0000000
## NeighborhoodEdwards                 0.08070309     0.0000     0.0000000
## NeighborhoodGilbert                 0.04811348     0.0000     0.0000000
## NeighborhoodGreens                  0.03786438     0.0000     0.0000000
## NeighborhoodGrnHill                 0.99672306     1.0000     1.0000000
## NeighborhoodIDOTRR                  0.04664861     0.0000     0.0000000
## NeighborhoodMeadowV                 0.09176215     0.0000     0.0000000
## NeighborhoodMitchel                 0.02934080     0.0000     0.0000000
## NeighborhoodNAmes                   0.06308684     0.0000     0.0000000
## NeighborhoodNoRidge                 0.99981501     1.0000     1.0000000
## NeighborhoodNPkVill                 0.03015002     0.0000     0.0000000
## NeighborhoodNridgHt                 0.99880119     1.0000     1.0000000
## NeighborhoodNWAmes                  0.12720562     0.0000     0.0000000
## NeighborhoodOldTown                 0.25734326     0.0000     1.0000000
## NeighborhoodSawyer                  0.03016981     0.0000     0.0000000
## NeighborhoodSawyerW                 0.03049529     0.0000     0.0000000
## NeighborhoodSomerst                 0.13349009     0.0000     0.0000000
## NeighborhoodStoneBr                 0.99457804     1.0000     1.0000000
## NeighborhoodSWISU                   0.03916785     0.0000     0.0000000
## NeighborhoodTimber                  0.04655959     0.0000     0.0000000
## NeighborhoodVeenker                 0.03363327     0.0000     0.0000000
## MS.ZoningFV                         0.99902756     1.0000     1.0000000
## MS.ZoningI (all)                    0.03200003     0.0000     0.0000000
## MS.ZoningRH                         0.92405762     1.0000     1.0000000
## MS.ZoningRL                         0.99969750     1.0000     1.0000000
## MS.ZoningRM                         0.99891563     1.0000     1.0000000
## Garage.Area                         1.00000000     1.0000     1.0000000
## Total.Bsmt.SF                       1.00000000     1.0000     1.0000000
## BF                                          NA     1.0000     0.3424334
## PostProbs                                   NA     0.1759     0.0602000
## R2                                          NA     0.8475     0.8482000
## dim                                         NA    26.0000    27.0000000
## logmarg                                     NA -1736.7529 -1737.8245952
##                                        model 3       model 4       model 5
## Intercept                            1.0000000     1.0000000     1.0000000
## selling_diffwithin a year            1.0000000     1.0000000     1.0000000
## selling_diffwithin fifteen years     1.0000000     1.0000000     1.0000000
## selling_diffwithin five years        1.0000000     1.0000000     1.0000000
## selling_diffwithin ten years         1.0000000     1.0000000     1.0000000
## selling_diffwithin twenty years      1.0000000     1.0000000     1.0000000
## Overall.Qual2                        0.0000000     0.0000000     0.0000000
## Overall.Qual3                        1.0000000     1.0000000     1.0000000
## Overall.Qual4                        1.0000000     1.0000000     1.0000000
## Overall.Qual5                        1.0000000     1.0000000     1.0000000
## Overall.Qual6                        1.0000000     1.0000000     1.0000000
## Overall.Qual7                        1.0000000     1.0000000     1.0000000
## Overall.Qual8                        1.0000000     1.0000000     1.0000000
## Overall.Qual9                        1.0000000     1.0000000     1.0000000
## Overall.Qual10                       1.0000000     1.0000000     1.0000000
## log(Lot.Area)                        1.0000000     1.0000000     1.0000000
## NeighborhoodBlueste                  0.0000000     0.0000000     0.0000000
## NeighborhoodBrDale                   0.0000000     0.0000000     0.0000000
## NeighborhoodBrkSide                  0.0000000     0.0000000     0.0000000
## NeighborhoodClearCr                  1.0000000     0.0000000     0.0000000
## NeighborhoodCollgCr                  0.0000000     0.0000000     0.0000000
## NeighborhoodCrawfor                  1.0000000     1.0000000     1.0000000
## NeighborhoodEdwards                  0.0000000     0.0000000     0.0000000
## NeighborhoodGilbert                  0.0000000     0.0000000     0.0000000
## NeighborhoodGreens                   0.0000000     0.0000000     0.0000000
## NeighborhoodGrnHill                  1.0000000     1.0000000     1.0000000
## NeighborhoodIDOTRR                   0.0000000     0.0000000     0.0000000
## NeighborhoodMeadowV                  0.0000000     0.0000000     0.0000000
## NeighborhoodMitchel                  0.0000000     0.0000000     0.0000000
## NeighborhoodNAmes                    0.0000000     0.0000000     0.0000000
## NeighborhoodNoRidge                  1.0000000     1.0000000     1.0000000
## NeighborhoodNPkVill                  0.0000000     0.0000000     0.0000000
## NeighborhoodNridgHt                  1.0000000     1.0000000     1.0000000
## NeighborhoodNWAmes                   0.0000000     0.0000000     1.0000000
## NeighborhoodOldTown                  0.0000000     0.0000000     0.0000000
## NeighborhoodSawyer                   0.0000000     0.0000000     0.0000000
## NeighborhoodSawyerW                  0.0000000     0.0000000     0.0000000
## NeighborhoodSomerst                  0.0000000     1.0000000     0.0000000
## NeighborhoodStoneBr                  1.0000000     1.0000000     1.0000000
## NeighborhoodSWISU                    0.0000000     0.0000000     0.0000000
## NeighborhoodTimber                   0.0000000     0.0000000     0.0000000
## NeighborhoodVeenker                  0.0000000     0.0000000     0.0000000
## MS.ZoningFV                          1.0000000     1.0000000     1.0000000
## MS.ZoningI (all)                     0.0000000     0.0000000     0.0000000
## MS.ZoningRH                          1.0000000     1.0000000     1.0000000
## MS.ZoningRL                          1.0000000     1.0000000     1.0000000
## MS.ZoningRM                          1.0000000     1.0000000     1.0000000
## Garage.Area                          1.0000000     1.0000000     1.0000000
## Total.Bsmt.SF                        1.0000000     1.0000000     1.0000000
## BF                                   0.1934976     0.1528193     0.1459556
## PostProbs                            0.0340000     0.0269000     0.0257000
## R2                                   0.8481000     0.8480000     0.8480000
## dim                                 27.0000000    27.0000000    27.0000000
## logmarg                          -1738.3954071 -1738.6314164 -1738.6773701

2.2.3 Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


The residual plots are drawn for model_BIC, the seven-predictor linear model that still includes Neighborhood (the stepwise result is stored separately in initial_model_BIC). The residuals appear to be approximately normally distributed. Three observations (181, 310, and 428) stand out as possible outliers, although none is influential. The Scale-Location plot, however, indicates a violation of the constant variance assumption.

The BAS model shows the same pattern. It flags observations 181, 310, and 428 as possible outliers, for which the model predicts a sale price well above or below the actual one. These sales are affected by factors such as sales between family members, houses not completed when assessed, and abnormal sales. Limiting the dataset to houses sold under normal conditions would give more accurate results.

par(mfrow=c(2,2))
plot(model_BIC)
## Warning: not plotting observations with leverage one:
##   569, 918

plot(model_bas, which = 1)
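
The distinction between an outlier and an influential point can also be checked numerically. The following is a minimal sketch on simulated data with one planted outlier, not a re-analysis of the Ames model: the data frame sim, the cut-off of 3 for standardized residuals, and the 4/n rule of thumb for Cook's distance are all assumptions made for illustration.

```r
set.seed(7)
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n, sd = 0.5)

# Plant one outlier by shifting a single response value upwards
sim$y[10] <- sim$y[10] + 5

fit <- lm(y ~ x, data = sim)

std_res <- rstandard(fit)        # standardized residuals
cooks   <- cooks.distance(fit)   # influence of each observation on the fit

# Flag observations whose standardized residual exceeds 3 in absolute value
flagged <- which(abs(std_res) > 3)

data.frame(obs         = flagged,
           std_resid   = round(std_res[flagged], 2),
           cooks_d     = round(cooks[flagged], 3),
           influential = cooks[flagged] > 4 / n)
```

A large standardized residual marks a poorly predicted observation, while Cook's distance indicates whether removing it would noticeably change the fitted coefficients.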


2.2.4 Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


On the training dataset, the RMSE of the linear model (model_BIC) is USD 30,809, while the BAS highest probability model (HPM) gives a higher RMSE of USD 31,412.

RMSE <- function(model, dataset) {
  pred_price <- exp(predict(model, dataset))
  resids <- dataset$price - pred_price
  rmse <- round(sqrt(mean(resids^2))) 
  rmse
}
initial_rmse <- RMSE(model_BIC, ames_train)
print(paste("The RMSE for the model predicting on the training dataset is USD", initial_rmse))
## [1] "The RMSE for the model predicting on the training dataset is USD 30809"
predict_bas <- predict(model_bas, newdata = ames_train, 
                           estimator = "HPM")

predict_bas_HPM <- sqrt(mean((exp(predict_bas$fit)-ames_train$price)^2))

print(paste("The RMSE for the model using HPM (Highest Probability Model) to predict on the training dataset is USD", round(predict_bas_HPM)))
## [1] "The RMSE for the model using HPM (Highest Probability Model) to predict on the training dataset is USD 31412"

2.2.5 Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("~/Desktop/R Programming/Statistics_Coursera/Capstone/ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


The coverage probability of the initial model on the out-of-sample data (ames_test) is calculated. The result shows that 97.18% of the actual prices in ames_test fall within the model's 95% prediction intervals. The RMSE of the model on the testing dataset is USD 28,853, lower than the RMSE on the training dataset (USD 30,809), so there is no sign of overfitting for this model.

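For the BAS highest probability model, USD 31,412 is the training-set RMSE (predict_bas_HPM). Its out-of-sample counterpart is computed as predict_test_bas_HPM, and it is that value which should be compared against USD 31,412 to judge overfitting.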
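
Coverage is measured against prediction intervals rather than confidence intervals, and the difference matters. The sketch below uses simulated data (the make_data() helper, sample sizes, and coefficients are hypothetical) to compare how often new observations fall inside each kind of 95% interval from predict().

```r
set.seed(42)

# Hypothetical data generator: linear signal plus unit-variance noise
make_data <- function(n) {
  x <- runif(n, 0, 10)
  data.frame(x = x, y = 3 + 0.5 * x + rnorm(n))
}

train <- make_data(500)
test  <- make_data(2000)

fit <- lm(y ~ x, data = train)

# Confidence interval: uncertainty about the mean response at each x
ci_bounds <- predict(fit, newdata = test, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about an individual new observation
pi_bounds <- predict(fit, newdata = test, interval = "prediction", level = 0.95)

# Share of actual values that fall strictly inside the interval
coverage <- function(bounds, actual) {
  mean(actual > bounds[, "lwr"] & actual < bounds[, "upr"])
}

cov_ci <- coverage(ci_bounds, test$y)
cov_pi <- coverage(pi_bounds, test$y)

round(c(confidence = cov_ci, prediction = cov_pi), 3)
```

Only the prediction interval is built to contain individual outcomes, so only its coverage should be close to the nominal 95%. The confidence interval is much narrower and captures far fewer of the new observations.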

#create selling_diff variable & convert to factor class
ames_test$selling_diff <- ames_test$Yr.Sold - ames_test$Year.Remod.Add
ames_test$selling_diff <- case_when(
    ames_test$selling_diff <= 1 ~ "within a year",
    ames_test$selling_diff <= 5 ~ "within five years",
    ames_test$selling_diff <= 10 ~ "within ten years",
    ames_test$selling_diff <= 25 ~ "within fifteen years",
    ames_test$selling_diff <= 50 ~ "within twenty years",
    TRUE ~ "more than twenty years")
ames_test$selling_diff <- as.factor(ames_test$selling_diff)  

#clean Garage.Area & Total.Bsmt.SF
ames_test$Garage.Area[is.na(ames_test$Garage.Area)] <- 0
ames_test$Total.Bsmt.SF[is.na(ames_test$Total.Bsmt.SF)] <- 0

#convert Overall.Qual to factor class
ames_test$Overall.Qual <- as.factor(ames_test$Overall.Qual)

#remove level "Landmrk" from Neighborhood
ames_test <- ames_test %>%
  filter(Neighborhood != "Landmrk")
pred_test <- exp(predict(model_BIC, ames_test, interval = "prediction"))

coverage_prob_full <- mean(ames_test$price > pred_test[, "lwr"] &
                             ames_test$price < pred_test[, "upr"])

rmse_model_BIC <- RMSE(model_BIC, ames_test)

print(paste("The out-of-sample coverage for the model is", coverage_prob_full * 100))
## [1] "The out-of-sample coverage for the model is 97.1813725490196"
print(paste("The RMSE for the model predicting on the testing dataset is USD", rmse_model_BIC))
## [1] "The RMSE for the model predicting on the testing dataset is USD 28853"
predict_test_bas <- predict(model_bas, newdata = ames_test, 
                           estimator = "HPM")

predict_test_bas_HPM <- sqrt(mean((exp(predict_test_bas$fit)-ames_test$price)^2))

# Report the test-set RMSE (predict_test_bas_HPM), not the training-set value
print(paste("The RMSE for the model using HPM (Highest Probability Model) to predict on the testing dataset is USD", round(predict_test_bas_HPM)))

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

2.3 Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

2.3.1 Section 3.1 Final Model

Provide the summary table for your model.


Based on the residual analysis, the outliers appear to come from sales made under conditions other than normal. To obtain more accurate predictions, the training data are filtered to houses sold under normal conditions and stored as normal_sale_trainset.

Stepwise selection by AIC retains selling_diff, Overall.Qual, log(Lot.Area), Neighborhood, MS.Zoning, Total.Bsmt.SF, Garage.Area, Bldg.Type, BsmtFin.Type.1, log(area), Land.Slope, House.Style, Exter.Cond, Foundation, Central.Air, and Wood.Deck.SF, with \(Adj.R^2\) of 0.9349 and a residual standard error (RSE) of 0.09746 on the log scale.

Stepwise selection by BIC retains a smaller set: selling_diff, Overall.Qual, log(Lot.Area), MS.Zoning, Total.Bsmt.SF, Garage.Area, BsmtFin.Type.1, log(area), Land.Slope, Central.Air, and Wood.Deck.SF, with \(Adj.R^2\) of 0.9174 and RSE of 0.1097.

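Using BAS, the model with the highest posterior probability (model 1 in the summary below, posterior probability 0.0102, \(R^2\) of 0.9351) includes at least one coefficient from selling_diff, Overall.Qual, log(Lot.Area), Neighborhood, MS.Zoning, Total.Bsmt.SF, Garage.Area, Bldg.Type, BsmtFin.Type.1, log(area), Land.Slope, House.Style, Exter.Cond, Foundation, Central.Air, and Wood.Deck.SF. It excludes Lot.Shape, Full.Bath, and Land.Contour entirely.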

#filter ames_train to houses sold under normal selling conditions
normal_sale_trainset <- ames_train %>%
  filter(Sale.Condition == "Normal")

#clean dataset
levels(normal_sale_trainset$BsmtFin.Type.1) <- c(levels(normal_sale_trainset$BsmtFin.Type.1),"NB")
normal_sale_trainset$BsmtFin.Type.1[is.na(normal_sale_trainset$BsmtFin.Type.1)] <- "NB"

normal_sale_trainset$Garage.Cars[is.na(normal_sale_trainset$Garage.Cars)] <- 0
FullModel <- lm(log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
                  Neighborhood + MS.Zoning + Total.Bsmt.SF + Lot.Shape +
                  Full.Bath + Garage.Area  + Bldg.Type + 
                  BsmtFin.Type.1 + log(area) + Land.Contour + Land.Slope +
                  House.Style + Exter.Cond + Foundation + Central.Air +
                  Wood.Deck.SF, 
                data = normal_sale_trainset)

FullModel_AIC <- stepAIC(FullModel, trace = FALSE, k = 2)
summary(FullModel_AIC)
## 
## Call:
## lm(formula = log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + 
##     Neighborhood + MS.Zoning + Total.Bsmt.SF + Garage.Area + 
##     Bldg.Type + BsmtFin.Type.1 + log(area) + Land.Slope + House.Style + 
##     Exter.Cond + Foundation + Central.Air + Wood.Deck.SF, data = normal_sale_trainset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45944 -0.05364  0.00232  0.05816  0.31117 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       7.669e+00  2.129e-01  36.020  < 2e-16 ***
## selling_diffwithin a year         9.321e-02  1.886e-02   4.943 9.48e-07 ***
## selling_diffwithin fifteen years  8.286e-02  1.514e-02   5.473 6.01e-08 ***
## selling_diffwithin five years     1.203e-01  1.557e-02   7.725 3.56e-14 ***
## selling_diffwithin ten years      1.145e-01  1.586e-02   7.217 1.30e-12 ***
## selling_diffwithin twenty years   3.900e-02  1.415e-02   2.757 0.005979 ** 
## Overall.Qual2                     1.676e-01  1.200e-01   1.396 0.163105    
## Overall.Qual3                     2.417e-01  1.182e-01   2.045 0.041229 *  
## Overall.Qual4                     3.054e-01  1.159e-01   2.635 0.008575 ** 
## Overall.Qual5                     3.556e-01  1.161e-01   3.062 0.002278 ** 
## Overall.Qual6                     4.201e-01  1.168e-01   3.597 0.000343 ***
## Overall.Qual7                     4.752e-01  1.176e-01   4.041 5.88e-05 ***
## Overall.Qual8                     5.599e-01  1.183e-01   4.734 2.63e-06 ***
## Overall.Qual9                     7.287e-01  1.203e-01   6.058 2.18e-09 ***
## Overall.Qual10                    7.728e-01  1.287e-01   6.007 2.94e-09 ***
## log(Lot.Area)                     5.794e-02  1.329e-02   4.359 1.49e-05 ***
## NeighborhoodBlueste              -2.429e-02  7.493e-02  -0.324 0.745898    
## NeighborhoodBrDale               -9.496e-02  6.252e-02  -1.519 0.129245    
## NeighborhoodBrkSide              -2.987e-02  5.075e-02  -0.589 0.556256    
## NeighborhoodClearCr              -8.320e-03  5.609e-02  -0.148 0.882127    
## NeighborhoodCollgCr              -4.564e-02  4.451e-02  -1.025 0.305457    
## NeighborhoodCrawfor               6.931e-02  4.894e-02   1.416 0.157114    
## NeighborhoodEdwards              -1.225e-01  4.622e-02  -2.651 0.008189 ** 
## NeighborhoodGilbert              -4.822e-02  4.723e-02  -1.021 0.307586    
## NeighborhoodGreens                1.158e-01  6.519e-02   1.776 0.076106 .  
## NeighborhoodGrnHill               5.539e-01  8.444e-02   6.560 9.99e-11 ***
## NeighborhoodIDOTRR               -7.509e-02  5.458e-02  -1.376 0.169303    
## NeighborhoodMeadowV              -1.672e-01  5.332e-02  -3.136 0.001777 ** 
## NeighborhoodMitchel              -6.523e-02  4.712e-02  -1.384 0.166659    
## NeighborhoodNAmes                -5.998e-02  4.534e-02  -1.323 0.186252    
## NeighborhoodNoRidge               2.272e-02  4.813e-02   0.472 0.636989    
## NeighborhoodNPkVill              -3.798e-02  6.660e-02  -0.570 0.568632    
## NeighborhoodNridgHt               4.335e-02  4.475e-02   0.969 0.332942    
## NeighborhoodNWAmes               -4.944e-02  4.733e-02  -1.045 0.296538    
## NeighborhoodOldTown              -1.122e-01  5.018e-02  -2.235 0.025712 *  
## NeighborhoodSawyer               -6.141e-02  4.685e-02  -1.311 0.190346    
## NeighborhoodSawyerW              -7.752e-02  4.617e-02  -1.679 0.093583 .  
## NeighborhoodSomerst               2.756e-02  5.343e-02   0.516 0.606128    
## NeighborhoodStoneBr               1.682e-02  4.946e-02   0.340 0.733892    
## NeighborhoodSWISU                -7.755e-02  5.539e-02  -1.400 0.161946    
## NeighborhoodTimber               -7.003e-03  4.903e-02  -0.143 0.886463    
## NeighborhoodVeenker               6.069e-02  5.460e-02   1.112 0.266632    
## MS.ZoningFV                       2.928e-01  6.446e-02   4.542 6.47e-06 ***
## MS.ZoningI (all)                 -1.120e-01  1.159e-01  -0.967 0.334021    
## MS.ZoningRH                       1.824e-01  6.668e-02   2.735 0.006384 ** 
## MS.ZoningRL                       2.945e-01  5.276e-02   5.582 3.31e-08 ***
## MS.ZoningRM                       2.647e-01  4.965e-02   5.332 1.28e-07 ***
## Total.Bsmt.SF                     1.246e-04  1.990e-05   6.258 6.51e-10 ***
## Garage.Area                       1.247e-04  2.421e-05   5.151 3.31e-07 ***
## Bldg.Type2fmCon                  -3.446e-02  2.684e-02  -1.284 0.199578    
## Bldg.TypeDuplex                  -7.124e-02  2.247e-02  -3.171 0.001581 ** 
## Bldg.TypeTwnhs                   -7.583e-02  2.923e-02  -2.594 0.009659 ** 
## Bldg.TypeTwnhsE                  -2.772e-02  2.070e-02  -1.339 0.180923    
## BsmtFin.Type.1BLQ                -3.778e-02  1.440e-02  -2.623 0.008887 ** 
## BsmtFin.Type.1GLQ                 1.354e-02  1.276e-02   1.061 0.288919    
## BsmtFin.Type.1LwQ                -8.535e-02  1.815e-02  -4.702 3.06e-06 ***
## BsmtFin.Type.1Rec                -4.354e-02  1.403e-02  -3.102 0.001991 ** 
## BsmtFin.Type.1Unf                -8.636e-02  1.256e-02  -6.873 1.31e-11 ***
## BsmtFin.Type.1NB                 -1.924e-01  4.312e-02  -4.463 9.31e-06 ***
## log(area)                         3.925e-01  2.558e-02  15.347  < 2e-16 ***
## Land.SlopeMod                     5.541e-02  2.000e-02   2.771 0.005734 ** 
## Land.SlopeSev                     1.751e-01  6.711e-02   2.609 0.009269 ** 
## House.Style1.5Unf                 1.412e-02  3.990e-02   0.354 0.723484    
## House.Style1Story                 1.031e-02  1.762e-02   0.585 0.558864    
## House.Style2.5Unf                 6.383e-02  3.767e-02   1.695 0.090579 .  
## House.Style2Story                 1.696e-02  1.586e-02   1.069 0.285216    
## House.StyleSFoyer                 9.784e-02  2.588e-02   3.780 0.000169 ***
## House.StyleSLvl                   1.347e-02  2.238e-02   0.602 0.547301    
## Exter.CondFa                     -2.298e-01  5.876e-02  -3.911 1.00e-04 ***
## Exter.CondGd                     -1.074e-01  5.211e-02  -2.061 0.039691 *  
## Exter.CondTA                     -1.057e-01  5.121e-02  -2.064 0.039337 *  
## FoundationCBlock                  3.591e-02  1.515e-02   2.370 0.018055 *  
## FoundationPConc                   6.286e-02  1.624e-02   3.871 0.000118 ***
## FoundationSlab                    2.272e-01  4.872e-02   4.663 3.68e-06 ***
## FoundationStone                   1.196e-01  5.978e-02   2.000 0.045865 *  
## Central.AirY                      1.277e-01  1.927e-02   6.626 6.55e-11 ***
## Wood.Deck.SF                      1.612e-04  3.201e-05   5.035 5.97e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09746 on 757 degrees of freedom
## Multiple R-squared:  0.9408, Adjusted R-squared:  0.9349 
## F-statistic: 158.3 on 76 and 757 DF,  p-value: < 2.2e-16
# BIC penalty: log(n) per parameter instead of the AIC penalty of 2
Full.BIC <- log(nrow(normal_sale_trainset))
FullModel.BIC <- stepAIC(FullModel, k = Full.BIC, trace = FALSE)
summary(FullModel.BIC)
## 
## Call:
## lm(formula = log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + 
##     MS.Zoning + Total.Bsmt.SF + Garage.Area + BsmtFin.Type.1 + 
##     log(area) + Land.Slope + Central.Air + Wood.Deck.SF, data = normal_sale_trainset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52432 -0.06342  0.00539  0.06148  0.51963 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       7.478e+00  1.646e-01  45.437  < 2e-16 ***
## selling_diffwithin a year         1.067e-01  1.872e-02   5.701 1.68e-08 ***
## selling_diffwithin fifteen years  9.580e-02  1.519e-02   6.307 4.69e-10 ***
## selling_diffwithin five years     1.345e-01  1.594e-02   8.438  < 2e-16 ***
## selling_diffwithin ten years      1.257e-01  1.589e-02   7.909 8.60e-15 ***
## selling_diffwithin twenty years   3.799e-02  1.337e-02   2.842 0.004592 ** 
## Overall.Qual2                     1.126e-01  1.244e-01   0.905 0.365702    
## Overall.Qual3                     2.434e-01  1.219e-01   1.997 0.046131 *  
## Overall.Qual4                     3.021e-01  1.182e-01   2.556 0.010766 *  
## Overall.Qual5                     3.807e-01  1.184e-01   3.215 0.001355 ** 
## Overall.Qual6                     4.601e-01  1.193e-01   3.857 0.000124 ***
## Overall.Qual7                     5.499e-01  1.201e-01   4.581 5.37e-06 ***
## Overall.Qual8                     6.734e-01  1.207e-01   5.581 3.27e-08 ***
## Overall.Qual9                     8.548e-01  1.229e-01   6.954 7.40e-12 ***
## Overall.Qual10                    9.028e-01  1.328e-01   6.797 2.08e-11 ***
## log(Lot.Area)                     8.083e-02  1.028e-02   7.859 1.25e-14 ***
## MS.ZoningFV                       3.647e-01  5.491e-02   6.642 5.72e-11 ***
## MS.ZoningI (all)                 -9.704e-02  1.236e-01  -0.785 0.432567    
## MS.ZoningRH                       1.660e-01  6.781e-02   2.447 0.014614 *  
## MS.ZoningRL                       3.219e-01  5.156e-02   6.243 6.98e-10 ***
## MS.ZoningRM                       2.570e-01  5.197e-02   4.946 9.25e-07 ***
## Total.Bsmt.SF                     1.195e-04  1.460e-05   8.186 1.06e-15 ***
## Garage.Area                       1.391e-04  2.499e-05   5.568 3.53e-08 ***
## BsmtFin.Type.1BLQ                -5.074e-02  1.551e-02  -3.272 0.001113 ** 
## BsmtFin.Type.1GLQ                 1.591e-02  1.336e-02   1.191 0.234003    
## BsmtFin.Type.1LwQ                -1.057e-01  1.937e-02  -5.458 6.41e-08 ***
## BsmtFin.Type.1Rec                -5.198e-02  1.518e-02  -3.425 0.000645 ***
## BsmtFin.Type.1Unf                -1.052e-01  1.318e-02  -7.987 4.80e-15 ***
## BsmtFin.Type.1NB                 -7.967e-02  3.227e-02  -2.469 0.013756 *  
## log(area)                         3.623e-01  1.696e-02  21.358  < 2e-16 ***
## Land.SlopeMod                     6.500e-02  2.118e-02   3.068 0.002224 ** 
## Land.SlopeSev                     1.507e-01  6.829e-02   2.206 0.027646 *  
## Central.AirY                      1.566e-01  1.986e-02   7.882 1.06e-14 ***
## Wood.Deck.SF                      1.507e-04  3.442e-05   4.377 1.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1097 on 800 degrees of freedom
## Multiple R-squared:  0.9207, Adjusted R-squared:  0.9174 
## F-statistic: 281.5 on 33 and 800 DF,  p-value: < 2.2e-16
BASModel <- bas.lm(log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
                  Neighborhood + MS.Zoning + Total.Bsmt.SF + Lot.Shape +
                  Full.Bath + Garage.Area + Bldg.Type + 
                  BsmtFin.Type.1 + log(area) + Land.Contour + Land.Slope +
                  House.Style + Exter.Cond + Foundation + Central.Air +
                  Wood.Deck.SF, 
                data = normal_sale_trainset, prior = "BIC",
                modelprior = uniform())

summary(BASModel)
##                                  P(B != 0 | Y)   model 1      model 2
## Intercept                           1.00000000    1.0000    1.0000000
## selling_diffwithin a year           0.99999959    1.0000    1.0000000
## selling_diffwithin fifteen years    0.99999952    1.0000    1.0000000
## selling_diffwithin five years       1.00000000    1.0000    1.0000000
## selling_diffwithin ten years        1.00000000    1.0000    1.0000000
## selling_diffwithin twenty years     0.88609650    1.0000    1.0000000
## Overall.Qual2                       0.04812931    0.0000    0.0000000
## Overall.Qual3                       0.09784014    0.0000    0.0000000
## Overall.Qual4                       0.87769615    1.0000    1.0000000
## Overall.Qual5                       0.99980060    1.0000    1.0000000
## Overall.Qual6                       0.99999805    1.0000    1.0000000
## Overall.Qual7                       0.99999659    1.0000    1.0000000
## Overall.Qual8                       0.99999998    1.0000    1.0000000
## Overall.Qual9                       1.00000000    1.0000    1.0000000
## Overall.Qual10                      1.00000000    1.0000    1.0000000
## log(Lot.Area)                       0.99999963    1.0000    1.0000000
## NeighborhoodBlueste                 0.01033939    0.0000    0.0000000
## NeighborhoodBrDale                  0.06589975    0.0000    0.0000000
## NeighborhoodBrkSide                 0.02852850    0.0000    0.0000000
## NeighborhoodClearCr                 0.07313565    0.0000    0.0000000
## NeighborhoodCollgCr                 0.01548159    0.0000    0.0000000
## NeighborhoodCrawfor                 0.99992619    1.0000    1.0000000
## NeighborhoodEdwards                 0.99560735    1.0000    1.0000000
## NeighborhoodGilbert                 0.01094286    0.0000    0.0000000
## NeighborhoodGreens                  0.19237193    0.0000    0.0000000
## NeighborhoodGrnHill                 1.00000000    1.0000    1.0000000
## NeighborhoodIDOTRR                  0.01682174    0.0000    0.0000000
## NeighborhoodMeadowV                 0.97228657    1.0000    1.0000000
## NeighborhoodMitchel                 0.02030613    0.0000    0.0000000
## NeighborhoodNAmes                   0.01027621    0.0000    0.0000000
## NeighborhoodNoRidge                 0.22530214    0.0000    0.0000000
## NeighborhoodNPkVill                 0.01346348    0.0000    0.0000000
## NeighborhoodNridgHt                 0.15210024    0.0000    0.0000000
## NeighborhoodNWAmes                  0.01088890    0.0000    0.0000000
## NeighborhoodOldTown                 0.95340694    1.0000    1.0000000
## NeighborhoodSawyer                  0.01167435    0.0000    0.0000000
## NeighborhoodSawyerW                 0.16704455    0.0000    0.0000000
## NeighborhoodSomerst                 0.01974906    0.0000    0.0000000
## NeighborhoodStoneBr                 0.01064683    0.0000    0.0000000
## NeighborhoodSWISU                   0.02140927    0.0000    0.0000000
## NeighborhoodTimber                  0.01111017    0.0000    0.0000000
## NeighborhoodVeenker                 0.28523548    0.0000    0.0000000
## MS.ZoningFV                         1.00000000    1.0000    1.0000000
## MS.ZoningI (all)                    0.01641798    0.0000    0.0000000
## MS.ZoningRH                         0.98555459    1.0000    1.0000000
## MS.ZoningRL                         1.00000000    1.0000    1.0000000
## MS.ZoningRM                         0.99999999    1.0000    1.0000000
## Total.Bsmt.SF                       1.00000000    1.0000    1.0000000
## Lot.ShapeIR2                        0.01050720    0.0000    0.0000000
## Lot.ShapeIR3                        0.11962395    0.0000    0.0000000
## Lot.ShapeReg                        0.01071058    0.0000    0.0000000
## Full.Bath                           0.01075173    0.0000    0.0000000
## Garage.Area                         0.99999992    1.0000    1.0000000
## Bldg.Type2fmCon                     0.01897129    0.0000    0.0000000
## Bldg.TypeDuplex                     0.88199605    1.0000    1.0000000
## Bldg.TypeTwnhs                      0.11231027    0.0000    0.0000000
## Bldg.TypeTwnhsE                     0.02225229    0.0000    0.0000000
## BsmtFin.Type.1BLQ                   0.99230412    1.0000    1.0000000
## BsmtFin.Type.1GLQ                   0.02129025    0.0000    0.0000000
## BsmtFin.Type.1LwQ                   0.99999884    1.0000    1.0000000
## BsmtFin.Type.1Rec                   0.99756299    1.0000    1.0000000
## BsmtFin.Type.1Unf                   1.00000000    1.0000    1.0000000
## BsmtFin.Type.1NB                    0.99999372    1.0000    1.0000000
## log(area)                           1.00000000    1.0000    1.0000000
## Land.ContourHLS                     0.01127894    0.0000    0.0000000
## Land.ContourLow                     0.01291185    0.0000    0.0000000
## Land.ContourLvl                     0.01272651    0.0000    0.0000000
## Land.SlopeMod                       0.63619798    1.0000    0.0000000
## Land.SlopeSev                       0.75213440    1.0000    1.0000000
## House.Style1.5Unf                   0.01071035    0.0000    0.0000000
## House.Style1Story                   0.01080992    0.0000    0.0000000
## House.Style2.5Unf                   0.01220615    0.0000    0.0000000
## House.Style2Story                   0.01072292    0.0000    0.0000000
## House.StyleSFoyer                   0.99841348    1.0000    1.0000000
## House.StyleSLvl                     0.01222110    0.0000    0.0000000
## Exter.CondFa                        0.99999052    1.0000    1.0000000
## Exter.CondGd                        0.01374241    0.0000    0.0000000
## Exter.CondTA                        0.01148239    0.0000    0.0000000
## FoundationCBlock                    0.06866467    0.0000    0.0000000
## FoundationPConc                     0.95939245    1.0000    1.0000000
## FoundationSlab                      0.99864573    1.0000    1.0000000
## FoundationStone                     0.03743004    0.0000    0.0000000
## Central.AirY                        0.99999999    1.0000    1.0000000
## Wood.Deck.SF                        0.99999446    1.0000    1.0000000
## BF                                          NA    1.0000    0.5565378
## PostProbs                                   NA    0.0102    0.0057000
## R2                                          NA    0.9351    0.9345000
## dim                                         NA   40.0000   39.0000000
## logmarg                                     NA -995.2711 -995.8571187
##                                       model 3      model 4      model 5
## Intercept                           1.0000000    1.0000000    1.0000000
## selling_diffwithin a year           1.0000000    1.0000000    1.0000000
## selling_diffwithin fifteen years    1.0000000    1.0000000    1.0000000
## selling_diffwithin five years       1.0000000    1.0000000    1.0000000
## selling_diffwithin ten years        1.0000000    1.0000000    1.0000000
## selling_diffwithin twenty years     1.0000000    1.0000000    1.0000000
## Overall.Qual2                       0.0000000    0.0000000    0.0000000
## Overall.Qual3                       0.0000000    0.0000000    0.0000000
## Overall.Qual4                       1.0000000    1.0000000    1.0000000
## Overall.Qual5                       1.0000000    1.0000000    1.0000000
## Overall.Qual6                       1.0000000    1.0000000    1.0000000
## Overall.Qual7                       1.0000000    1.0000000    1.0000000
## Overall.Qual8                       1.0000000    1.0000000    1.0000000
## Overall.Qual9                       1.0000000    1.0000000    1.0000000
## Overall.Qual10                      1.0000000    1.0000000    1.0000000
## log(Lot.Area)                       1.0000000    1.0000000    1.0000000
## NeighborhoodBlueste                 0.0000000    0.0000000    0.0000000
## NeighborhoodBrDale                  0.0000000    0.0000000    0.0000000
## NeighborhoodBrkSide                 0.0000000    0.0000000    0.0000000
## NeighborhoodClearCr                 0.0000000    0.0000000    0.0000000
## NeighborhoodCollgCr                 0.0000000    0.0000000    0.0000000
## NeighborhoodCrawfor                 1.0000000    1.0000000    1.0000000
## NeighborhoodEdwards                 1.0000000    1.0000000    1.0000000
## NeighborhoodGilbert                 0.0000000    0.0000000    0.0000000
## NeighborhoodGreens                  0.0000000    0.0000000    0.0000000
## NeighborhoodGrnHill                 1.0000000    1.0000000    1.0000000
## NeighborhoodIDOTRR                  0.0000000    0.0000000    0.0000000
## NeighborhoodMeadowV                 1.0000000    1.0000000    1.0000000
## NeighborhoodMitchel                 0.0000000    0.0000000    0.0000000
## NeighborhoodNAmes                   0.0000000    0.0000000    0.0000000
## NeighborhoodNoRidge                 0.0000000    1.0000000    0.0000000
## NeighborhoodNPkVill                 0.0000000    0.0000000    0.0000000
## NeighborhoodNridgHt                 0.0000000    0.0000000    0.0000000
## NeighborhoodNWAmes                  0.0000000    0.0000000    0.0000000
## NeighborhoodOldTown                 1.0000000    1.0000000    1.0000000
## NeighborhoodSawyer                  0.0000000    0.0000000    0.0000000
## NeighborhoodSawyerW                 0.0000000    0.0000000    1.0000000
## NeighborhoodSomerst                 0.0000000    0.0000000    0.0000000
## NeighborhoodStoneBr                 0.0000000    0.0000000    0.0000000
## NeighborhoodSWISU                   0.0000000    0.0000000    0.0000000
## NeighborhoodTimber                  0.0000000    0.0000000    0.0000000
## NeighborhoodVeenker                 1.0000000    0.0000000    0.0000000
## MS.ZoningFV                         1.0000000    1.0000000    1.0000000
## MS.ZoningI (all)                    0.0000000    0.0000000    0.0000000
## MS.ZoningRH                         1.0000000    1.0000000    1.0000000
## MS.ZoningRL                         1.0000000    1.0000000    1.0000000
## MS.ZoningRM                         1.0000000    1.0000000    1.0000000
## Total.Bsmt.SF                       1.0000000    1.0000000    1.0000000
## Lot.ShapeIR2                        0.0000000    0.0000000    0.0000000
## Lot.ShapeIR3                        0.0000000    0.0000000    0.0000000
## Lot.ShapeReg                        0.0000000    0.0000000    0.0000000
## Full.Bath                           0.0000000    0.0000000    0.0000000
## Garage.Area                         1.0000000    1.0000000    1.0000000
## Bldg.Type2fmCon                     0.0000000    0.0000000    0.0000000
## Bldg.TypeDuplex                     1.0000000    1.0000000    1.0000000
## Bldg.TypeTwnhs                      0.0000000    0.0000000    0.0000000
## Bldg.TypeTwnhsE                     0.0000000    0.0000000    0.0000000
## BsmtFin.Type.1BLQ                   1.0000000    1.0000000    1.0000000
## BsmtFin.Type.1GLQ                   0.0000000    0.0000000    0.0000000
## BsmtFin.Type.1LwQ                   1.0000000    1.0000000    1.0000000
## BsmtFin.Type.1Rec                   1.0000000    1.0000000    1.0000000
## BsmtFin.Type.1Unf                   1.0000000    1.0000000    1.0000000
## BsmtFin.Type.1NB                    1.0000000    1.0000000    1.0000000
## log(area)                           1.0000000    1.0000000    1.0000000
## Land.ContourHLS                     0.0000000    0.0000000    0.0000000
## Land.ContourLow                     0.0000000    0.0000000    0.0000000
## Land.ContourLvl                     0.0000000    0.0000000    0.0000000
## Land.SlopeMod                       1.0000000    1.0000000    1.0000000
## Land.SlopeSev                       1.0000000    1.0000000    1.0000000
## House.Style1.5Unf                   0.0000000    0.0000000    0.0000000
## House.Style1Story                   0.0000000    0.0000000    0.0000000
## House.Style2.5Unf                   0.0000000    0.0000000    0.0000000
## House.Style2Story                   0.0000000    0.0000000    0.0000000
## House.StyleSFoyer                   1.0000000    1.0000000    1.0000000
## House.StyleSLvl                     0.0000000    0.0000000    0.0000000
## Exter.CondFa                        1.0000000    1.0000000    1.0000000
## Exter.CondGd                        0.0000000    0.0000000    0.0000000
## Exter.CondTA                        0.0000000    0.0000000    0.0000000
## FoundationCBlock                    0.0000000    0.0000000    0.0000000
## FoundationPConc                     1.0000000    1.0000000    1.0000000
## FoundationSlab                      1.0000000    1.0000000    1.0000000
## FoundationStone                     0.0000000    0.0000000    0.0000000
## Central.AirY                        1.0000000    1.0000000    1.0000000
## Wood.Deck.SF                        1.0000000    1.0000000    1.0000000
## BF                                  0.5328753    0.3567992    0.3430534
## PostProbs                           0.0055000    0.0036000    0.0035000
## R2                                  0.9356000    0.9355000    0.9355000
## dim                                41.0000000   41.0000000   41.0000000
## logmarg                          -995.9005665 -996.3016806 -996.3409678

2.3.2 Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


The variables price, Lot.Area, and area are log-transformed because their distributions are strongly skewed. The histograms below compare each variable before and after the transformation.

par(mfrow=c(3,2))
hist(normal_sale_trainset$price, main = "Untransformed price")
hist(log(normal_sale_trainset$price), main = "Transformed price")
hist(normal_sale_trainset$Lot.Area, main = "Untransformed Lot.Area")
hist(log(normal_sale_trainset$Lot.Area), main = "Transformed Lot.Area")
hist(normal_sale_trainset$area, main = "Untransformed area")
hist(log(normal_sale_trainset$area), main = "Transformed area")
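
To put a number on what the histograms show, the sketch below computes the sample skewness of a variable before and after a log transformation. It uses simulated log-normal data rather than the Ames data, and the skewness helper is a hypothetical hand-written moment estimator, not a function from any package loaded in this report.

```r
# Simulated right-skewed variable standing in for a price-like quantity.
set.seed(1)
x <- rlnorm(1000, meanlog = 12, sdlog = 0.5)

# Moment-based sample skewness: third central moment over variance^(3/2).
skewness <- function(v) {
  z <- v - mean(v)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(x)       # clearly positive for right-skewed data
skew_log <- skewness(log(x))  # close to zero after the log transform
print(round(c(raw = skew_raw, log = skew_log), 3))
```

The same helper could be applied to price, Lot.Area, and area in normal_sale_trainset to check whether the transformation brings each one closer to symmetry.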


2.3.3 Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


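An ANOVA on a model containing all pairwise interactions among the selected predictors is used to check for interaction effects. Several interaction terms have p-values below the 0.05 significance level (for example selling_diff:MS.Zoning and selling_diff:Central.Air), which suggests that interaction effects are present. However, I decided not to remove the variables involved, because doing so reduces the strength of the linear model. The final model keeps them as main effects and does not include interaction terms.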

FinalModel.Inter <- lm(log(price) ~ (selling_diff + Overall.Qual + log(Lot.Area) +
                  MS.Zoning + Total.Bsmt.SF + Garage.Area + 
                  BsmtFin.Type.1 + log(area) + Land.Slope +
                  Central.Air +Wood.Deck.SF)^2, data = normal_sale_trainset)

anova(FinalModel.Inter)
## Analysis of Variance Table
## 
## Response: log(price)
##                               Df Sum Sq Mean Sq   F value    Pr(>F)    
## selling_diff                   5 42.280  8.4559  973.6226 < 2.2e-16 ***
## Overall.Qual                   9 45.326  5.0362  579.8696 < 2.2e-16 ***
## log(Lot.Area)                  1 10.011 10.0108 1152.6525 < 2.2e-16 ***
## MS.Zoning                      5  1.710  0.3419   39.3679 < 2.2e-16 ***
## Total.Bsmt.SF                  1  3.384  3.3839  389.6288 < 2.2e-16 ***
## Garage.Area                    1  1.544  1.5444  177.8183 < 2.2e-16 ***
## BsmtFin.Type.1                 6  0.914  0.1524   17.5421 < 2.2e-16 ***
## log(area)                      1  5.526  5.5263  636.3061 < 2.2e-16 ***
## Land.Slope                     2  0.186  0.0931   10.7163 2.725e-05 ***
## Central.Air                    1  0.742  0.7421   85.4425 < 2.2e-16 ***
## Wood.Deck.SF                   1  0.231  0.2307   26.5660 3.570e-07 ***
## selling_diff:Overall.Qual     27  0.449  0.0166    1.9139 0.0039963 ** 
## selling_diff:log(Lot.Area)     5  0.043  0.0085    0.9818 0.4281789    
## selling_diff:MS.Zoning        11  0.590  0.0536    6.1742 1.325e-09 ***
## selling_diff:Total.Bsmt.SF     5  0.186  0.0372    4.2786 0.0007958 ***
## selling_diff:Garage.Area       5  0.050  0.0099    1.1423 0.3367707    
## selling_diff:BsmtFin.Type.1   28  0.645  0.0230    2.6533 1.232e-05 ***
## selling_diff:log(area)         5  0.171  0.0343    3.9468 0.0015926 ** 
## selling_diff:Land.Slope        5  0.089  0.0179    2.0590 0.0690629 .  
## selling_diff:Central.Air       4  0.449  0.1121   12.9129 4.739e-10 ***
## selling_diff:Wood.Deck.SF      5  0.042  0.0083    0.9590 0.4423937    
## Overall.Qual:log(Lot.Area)     8  0.170  0.0213    2.4499 0.0130484 *  
## Overall.Qual:MS.Zoning        12  0.227  0.0189    2.1764 0.0116716 *  
## Overall.Qual:Total.Bsmt.SF     7  0.092  0.0131    1.5107 0.1608973    
## Overall.Qual:Garage.Area       7  0.285  0.0407    4.6881 4.043e-05 ***
## Overall.Qual:BsmtFin.Type.1   23  0.237  0.0103    1.1860 0.2506647    
## Overall.Qual:log(area)         5  0.142  0.0284    3.2750 0.0063551 ** 
## Overall.Qual:Land.Slope        6  0.030  0.0051    0.5838 0.7433993    
## Overall.Qual:Central.Air       3  0.024  0.0080    0.9251 0.4283239    
## Overall.Qual:Wood.Deck.SF      5  0.101  0.0203    2.3332 0.0411095 *  
## log(Lot.Area):MS.Zoning        3  0.081  0.0269    3.0926 0.0266505 *  
## log(Lot.Area):Total.Bsmt.SF    1  0.015  0.0150    1.7271 0.1893338    
## log(Lot.Area):Garage.Area      1  0.018  0.0175    2.0192 0.1558893    
## log(Lot.Area):BsmtFin.Type.1   6  0.063  0.0105    1.2140 0.2972861    
## log(Lot.Area):log(area)        1  0.002  0.0022    0.2512 0.6164683    
## log(Lot.Area):Land.Slope       1  0.009  0.0087    1.0045 0.3166606    
## log(Lot.Area):Central.Air      1  0.040  0.0399    4.5906 0.0325915 *  
## log(Lot.Area):Wood.Deck.SF     1  0.006  0.0063    0.7217 0.3959607    
## MS.Zoning:Total.Bsmt.SF        2  0.146  0.0728    8.3807 0.0002602 ***
## MS.Zoning:Garage.Area          2  0.002  0.0011    0.1250 0.8825388    
## MS.Zoning:BsmtFin.Type.1       9  0.135  0.0150    1.7219 0.0810524 .  
## MS.Zoning:log(area)            2  0.005  0.0026    0.2983 0.7421780    
## MS.Zoning:Land.Slope           1  0.025  0.0248    2.8566 0.0915702 .  
## MS.Zoning:Central.Air          1  0.004  0.0041    0.4720 0.4923493    
## MS.Zoning:Wood.Deck.SF         2  0.003  0.0014    0.1660 0.8471228    
## Total.Bsmt.SF:Garage.Area      1  0.005  0.0045    0.5185 0.4717991    
## Total.Bsmt.SF:BsmtFin.Type.1   5  0.065  0.0129    1.4868 0.1922795    
## Total.Bsmt.SF:log(area)        1  0.000  0.0003    0.0364 0.8487844    
## Total.Bsmt.SF:Land.Slope       1  0.005  0.0049    0.5649 0.4526304    
## Total.Bsmt.SF:Central.Air      1  0.000  0.0003    0.0379 0.8457343    
## Total.Bsmt.SF:Wood.Deck.SF     1  0.001  0.0006    0.0692 0.7925932    
## Garage.Area:BsmtFin.Type.1     6  0.060  0.0100    1.1502 0.3317824    
## Garage.Area:log(area)          1  0.008  0.0079    0.9123 0.3399260    
## Garage.Area:Land.Slope         1  0.000  0.0000    0.0013 0.9707515    
## Garage.Area:Central.Air        1  0.004  0.0035    0.4050 0.5248022    
## Garage.Area:Wood.Deck.SF       1  0.020  0.0203    2.3322 0.1273013    
## BsmtFin.Type.1:log(area)       6  0.034  0.0057    0.6619 0.6805090    
## BsmtFin.Type.1:Land.Slope      5  0.042  0.0084    0.9685 0.4364422    
## BsmtFin.Type.1:Central.Air     5  0.018  0.0037    0.4243 0.8318220    
## BsmtFin.Type.1:Wood.Deck.SF    5  0.030  0.0060    0.6926 0.6292506    
## log(area):Land.Slope           1  0.010  0.0104    1.1951 0.2747862    
## log(area):Central.Air          1  0.000  0.0002    0.0239 0.8770745    
## log(area):Wood.Deck.SF         1  0.002  0.0023    0.2642 0.6074440    
## Land.Slope:Wood.Deck.SF        1  0.024  0.0242    2.7896 0.0954556 .  
## Central.Air:Wood.Deck.SF       1  0.006  0.0057    0.6552 0.4186050    
## Residuals                    544  4.725  0.0087                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.3.4 Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


The variables were selected using AIC, BIC, and Bayesian model averaging (BMA). AIC and BIC both add a penalty for each additional variable, so a variable is retained only if its improvement in fit outweighs the added complexity, which guards against overfitting. BMA averages over many candidate models, weighting each by its posterior probability, and yields a marginal posterior inclusion probability for each variable.
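
As a minimal sketch of how the AIC and BIC penalties differ in practice, the snippet below runs backward elimination with base R's `step()`. The built-in `mtcars` data and the formula are assumptions chosen only for illustration; they are not the Ames data or the model used in this report. The only difference between the two searches is the penalty `k`: 2 for AIC and `log(n)` for BIC.

```r
# Illustrative only: built-in mtcars stands in for the Ames training data.
full_model <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
n <- nrow(mtcars)

# Backward elimination with the AIC penalty (k = 2 per parameter).
aic_model <- step(full_model, direction = "backward", k = 2, trace = 0)

# Same search with the BIC penalty (k = log(n) per parameter).
bic_model <- step(full_model, direction = "backward", k = log(n), trace = 0)

# Compare the terms each criterion retains.
formula(aic_model)
formula(bic_model)
```

Because `log(n)` exceeds 2 once there are eight or more observations, BIC penalizes each extra term more heavily and tends to retain fewer variables than AIC. A BMA fit is not shown here; it would typically come from a function such as `bas.lm()` in the BAS package.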


2.3.5 Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


The coverage probability of the final model was calculated on the testing dataset ames_test: 93.45% of the actual home prices in ames_test fall within the 95% prediction interval produced by the final model.

Testing on out-of-sample data provides a metric other than \(R^2\), \(Adj.R^2\), and p-values for judging whether the model is optimal.

FinalModel <- lm(log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
                  MS.Zoning + Total.Bsmt.SF + Garage.Area + 
                  BsmtFin.Type.1 + log(area) + Land.Slope +
                  Central.Air + Wood.Deck.SF, data = normal_sale_trainset)

summary(FinalModel)
## 
## Call:
## lm(formula = log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + 
##     MS.Zoning + Total.Bsmt.SF + Garage.Area + BsmtFin.Type.1 + 
##     log(area) + Land.Slope + Central.Air + Wood.Deck.SF, data = normal_sale_trainset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52432 -0.06342  0.00539  0.06148  0.51963 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       7.478e+00  1.646e-01  45.437  < 2e-16 ***
## selling_diffwithin a year         1.067e-01  1.872e-02   5.701 1.68e-08 ***
## selling_diffwithin fifteen years  9.580e-02  1.519e-02   6.307 4.69e-10 ***
## selling_diffwithin five years     1.345e-01  1.594e-02   8.438  < 2e-16 ***
## selling_diffwithin ten years      1.257e-01  1.589e-02   7.909 8.60e-15 ***
## selling_diffwithin twenty years   3.799e-02  1.337e-02   2.842 0.004592 ** 
## Overall.Qual2                     1.126e-01  1.244e-01   0.905 0.365702    
## Overall.Qual3                     2.434e-01  1.219e-01   1.997 0.046131 *  
## Overall.Qual4                     3.021e-01  1.182e-01   2.556 0.010766 *  
## Overall.Qual5                     3.807e-01  1.184e-01   3.215 0.001355 ** 
## Overall.Qual6                     4.601e-01  1.193e-01   3.857 0.000124 ***
## Overall.Qual7                     5.499e-01  1.201e-01   4.581 5.37e-06 ***
## Overall.Qual8                     6.734e-01  1.207e-01   5.581 3.27e-08 ***
## Overall.Qual9                     8.548e-01  1.229e-01   6.954 7.40e-12 ***
## Overall.Qual10                    9.028e-01  1.328e-01   6.797 2.08e-11 ***
## log(Lot.Area)                     8.083e-02  1.028e-02   7.859 1.25e-14 ***
## MS.ZoningFV                       3.647e-01  5.491e-02   6.642 5.72e-11 ***
## MS.ZoningI (all)                 -9.704e-02  1.236e-01  -0.785 0.432567    
## MS.ZoningRH                       1.660e-01  6.781e-02   2.447 0.014614 *  
## MS.ZoningRL                       3.219e-01  5.156e-02   6.243 6.98e-10 ***
## MS.ZoningRM                       2.570e-01  5.197e-02   4.946 9.25e-07 ***
## Total.Bsmt.SF                     1.195e-04  1.460e-05   8.186 1.06e-15 ***
## Garage.Area                       1.391e-04  2.499e-05   5.568 3.53e-08 ***
## BsmtFin.Type.1BLQ                -5.074e-02  1.551e-02  -3.272 0.001113 ** 
## BsmtFin.Type.1GLQ                 1.591e-02  1.336e-02   1.191 0.234003    
## BsmtFin.Type.1LwQ                -1.057e-01  1.937e-02  -5.458 6.41e-08 ***
## BsmtFin.Type.1Rec                -5.198e-02  1.518e-02  -3.425 0.000645 ***
## BsmtFin.Type.1Unf                -1.052e-01  1.318e-02  -7.987 4.80e-15 ***
## BsmtFin.Type.1NB                 -7.967e-02  3.227e-02  -2.469 0.013756 *  
## log(area)                         3.623e-01  1.696e-02  21.358  < 2e-16 ***
## Land.SlopeMod                     6.500e-02  2.118e-02   3.068 0.002224 ** 
## Land.SlopeSev                     1.507e-01  6.829e-02   2.206 0.027646 *  
## Central.AirY                      1.566e-01  1.986e-02   7.882 1.06e-14 ***
## Wood.Deck.SF                      1.507e-04  3.442e-05   4.377 1.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1097 on 800 degrees of freedom
## Multiple R-squared:  0.9207, Adjusted R-squared:  0.9174 
## F-statistic: 281.5 on 33 and 800 DF,  p-value: < 2.2e-16
pred_final_model <- predict(FinalModel, newdata = ames_test, 
                           interval = "prediction")

pred_final_model <-exp(pred_final_model)

coverage_prob_final <- mean(ames_test$price > pred_final_model[, "lwr"] &
                              ames_test$price < pred_final_model[, "upr"],
                            na.rm = TRUE)

coverage_prob_final * 100
## [1] 93.45088
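
The coverage figure above relies on `interval = "prediction"`, which accounts for the noise in an individual sale, rather than `interval = "confidence"`, which only covers the mean response. The following sketch, on simulated data with hypothetical names (`sim`, `coverage()`), illustrates how different the two coverages are for a model fitted to `log(price)`.

```r
set.seed(42)

# Simulated data (hypothetical): log(price) is linear in log(area).
n <- 1000
area <- runif(n, 500, 4000)
price <- exp(7 + 0.6 * log(area) + rnorm(n, sd = 0.15))
sim <- data.frame(area = area, price = price)

train <- sim[1:700, ]
test <- sim[701:1000, ]

fit <- lm(log(price) ~ log(area), data = train)

# Share of actual prices inside the back-transformed 95% interval.
coverage <- function(model, newdata, type) {
  bounds <- exp(predict(model, newdata = newdata, interval = type, level = 0.95))
  mean(newdata$price > bounds[, "lwr"] & newdata$price < bounds[, "upr"])
}

coverage_pred <- coverage(fit, test, "prediction")  # interval for a new house
coverage_conf <- coverage(fit, test, "confidence")  # interval for the mean only

c(prediction = coverage_pred, confidence = coverage_conf)
```

When the model assumptions hold, the prediction-interval coverage should sit close to the nominal 95%, while the confidence-interval coverage is far lower because it ignores the variability of individual prices.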

2.4 Part 4 Final Model Assessment

2.4.1 Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.


The residuals of the final model are randomly scattered around zero in the Residuals vs Fitted plot, so linearity can be assumed.

In the Normal Q-Q plot, the residuals follow the straight line fairly closely.

In the Scale-Location plot, the red line is approximately horizontal, which means that the average magnitude of the standardised residuals does not change much. However, the spread of the residuals varies across the range of fitted values, suggesting the presence of heteroskedasticity.

The Residuals vs Leverage plot labels the three most extreme points (90, 560, 611). None of them appears to be influential, as the Cook’s distance lines are not shown on the plot.

Cook’s distance identifies three data points that combine large residuals with high leverage: 53, 339, and 611. The model undervalues the sale price of these houses.

par(mfrow=c(2,2))
plot(FinalModel)
## Warning: not plotting observations with leverage one:
##   472, 763
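
The influence measures read off the diagnostic plots can also be computed directly. The sketch below uses the built-in `mtcars` data and a hypothetical formula as stand-ins for the Ames model, and the 4/n cut-off is a common rule of thumb rather than a formal test.

```r
# Illustrative only: mtcars stands in for the Ames training data.
fit <- lm(log(mpg) ~ log(wt) + hp, data = mtcars)

cooks <- cooks.distance(fit)   # influence of each observation on the fit
lev <- hatvalues(fit)          # leverage (diagonal of the hat matrix)

# Rule-of-thumb cut-off (an assumption, not a formal test): 4 / n.
cutoff <- 4 / nrow(mtcars)
influential <- names(cooks)[cooks > cutoff]

# Three largest Cook's distances, analogous to the points labelled by plot().
head(sort(cooks, decreasing = TRUE), 3)
influential

# Observations with leverage (numerically) equal to one are the ones
# that plot() warns about and skips.
names(lev)[lev > 1 - 1e-8]
```

Leverage-one observations commonly arise when a row is the only one in its factor level, so its fitted value is determined entirely by that row.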


2.4.2 Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


The RMSE for the final model predicting on the testing dataset is USD 23,050.

predict_final_test <-exp(predict(FinalModel, ames_test))

resid_final_test <-ames_test$price - predict_final_test 

rmse_final_test <-sqrt(mean(resid_final_test^2,na.rm=TRUE))

print(paste("The RMSE for the final model predicting on the testing dataset is USD", round(rmse_final_test)))
## [1] "The RMSE for the final model predicting on the testing dataset is USD 23050"
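
Since the same RMSE calculation is repeated for each dataset, it can be wrapped in a small helper. The sketch below uses a hypothetical function name (`rmse_dollars()`) and simulated data in place of the Ames training and testing sets; it assumes the model was fitted to `log(price)`, as the final model is.

```r
# RMSE on the original (dollar) scale for a model fitted to log(price).
rmse_dollars <- function(model, data) {
  predicted <- exp(predict(model, newdata = data))  # undo the log transform
  sqrt(mean((data$price - predicted)^2, na.rm = TRUE))
}

# Hypothetical simulated data in place of the training and testing sets.
set.seed(1)
sim <- data.frame(area = runif(600, 500, 4000))
sim$price <- exp(7 + 0.6 * log(sim$area) + rnorm(600, sd = 0.15))
train <- sim[1:400, ]
test <- sim[401:600, ]

fit <- lm(log(price) ~ log(area), data = train)

rmse_train <- rmse_dollars(fit, train)
rmse_test <- rmse_dollars(fit, test)
c(train = rmse_train, test = rmse_test)
```

Reporting the training and testing RMSE side by side makes it easier to spot overfitting: a testing RMSE far above the training RMSE would suggest the model does not generalise well.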

2.4.3 Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


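The strength of the final model is its acceptable error: the testing RMSE of USD 23,050 is about 14% of the median sale price of USD 159,467 reported in the EDA. The weakness of the model is heteroskedasticity, which may reduce its predictive ability and make its prediction intervals less reliable.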
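
The heteroskedasticity noted above was judged visually from the Scale-Location plot. As a hedged illustration of a numerical check, the sketch below computes a studentized Breusch-Pagan (Koenker) statistic by hand in base R on simulated data; the data and variable names are hypothetical, and this test was not part of the original analysis.

```r
set.seed(7)

# Hypothetical simulated data whose noise grows with x (heteroskedastic).
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.2 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictors.
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan (Koenker) statistic: n * R^2 of the auxiliary fit,
# compared with a chi-squared distribution (df = number of predictors).
bp_stat <- n * summary(aux)$r.squared
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_pvalue)
```

A small p-value indicates that the residual variance changes with the predictors. A packaged version of this test is available as `bptest()` in the lmtest package.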


2.4.4 Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:
  • What is the RMSE of your final model when applied to the validation data?
  • How does this value compare to that of the training data and/or testing data?
  • What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
  • From this result, does your final model properly reflect uncertainty?

load("~/Desktop/R Programming/Statistics_Coursera/Capstone/Peer_Assignment_II/ames_validation.Rdata")

  • The RMSE of the final model on the validation dataset is 21,610 USD.
  • This is about 6% lower than the RMSE on the testing dataset (23,050 USD).
  • The out-of-sample coverage of the final model on ames_validation is 93.63%: that share of the actual home prices in ames_validation falls within the 95% prediction interval produced by the final model, as illustrated in Figure 5.
  • Since the coverage is slightly below the nominal 95%, the intervals are slightly too narrow, which suggests that some model assumptions regarding uncertainty (for example constant variance) may not be fully met.
#create selling_diff variable & convert to factor class
ames_validation$selling_diff <- ames_validation$Yr.Sold - ames_validation$Year.Remod.Add

ames_validation$selling_diff <- case_when(
    ames_validation$selling_diff <= 1 ~ "within a year",
    ames_validation$selling_diff <= 5 ~ "within five years",
    ames_validation$selling_diff <= 10 ~ "within ten years",
    ames_validation$selling_diff <= 25 ~ "within fifteen years",
    ames_validation$selling_diff <= 50 ~ "within twenty years",
    TRUE ~ "more than twenty years")
ames_validation$selling_diff <- as.factor(ames_validation$selling_diff)  

#clean Garage.Area & Total.Bsmt.SF
ames_validation$Garage.Area[is.na(ames_validation$Garage.Area)] <- 0
ames_validation$Total.Bsmt.SF[is.na(ames_validation$Total.Bsmt.SF)] <- 0

#convert Overall.Qual to factor class
ames_validation$Overall.Qual <- as.factor(ames_validation$Overall.Qual)

#remove level "Landmrk" from Neighborhood
ames_validation <- ames_validation %>%
  filter(Neighborhood != "Landmrk")

#remove the MS.Zoning "A (agr)" observation (row 387) from ames_validation
ames_validation <- ames_validation[-c(387), ]
predict_final_valid <-exp(predict(FinalModel, ames_validation))

resid_final_valid <-ames_validation$price - predict_final_valid  

rmse_final_valid <-sqrt(mean(resid_final_valid^2,na.rm=TRUE))

print(paste("The RMSE for the final model predicting on the validating dataset is USD", round(rmse_final_valid)))
## [1] "The RMSE for the final model predicting on the validating dataset is USD 21610"
pred_final_cov_model <- predict(FinalModel, newdata = ames_validation, 
                           interval = "prediction")

pred_final_cov_model <-exp(pred_final_cov_model)

coverage_cov_final <- mean(ames_validation$price > pred_final_cov_model[, "lwr"] &
                             ames_validation$price < pred_final_cov_model[, "upr"],
                           na.rm = TRUE)

coverage_cov_final * 100
## [1] 93.63144
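
The validation data above were cleaned by removing a row by position because its zoning level was not present when the model was fitted. As a sketch of a name-based alternative, the snippet below drops rows whose factor level is missing from the fitted model's `xlevels`. The toy data frames and the `zone` column are hypothetical stand-ins for the Ames data and `MS.Zoning`.

```r
# Hypothetical toy data: "zone" plays the role of MS.Zoning.
train <- data.frame(
  price = c(100, 120, 150, 170, 200, 210),
  zone  = factor(c("RL", "RM", "RL", "FV", "RM", "FV"))
)
valid <- data.frame(
  price = c(110, 160, 90),
  zone  = factor(c("RL", "FV", "A (agr)"))
)

fit <- lm(log(price) ~ zone, data = train)

# Keep only validation rows whose level was seen when fitting;
# fit$xlevels stores the factor levels used by the model.
seen <- as.character(valid$zone) %in% fit$xlevels$zone
valid_clean <- droplevels(valid[seen, ])

# Prediction now runs without a "factor has new levels" error.
predict(fit, newdata = valid_clean)
```

Filtering by level name keeps working if the row order of the validation data changes, which a hard-coded row index does not.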

2.5 Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


  • The final model performs better on the validation dataset ames_validation than on the testing dataset ames_test, as shown by a slightly higher out-of-sample coverage (93.63% versus 93.45%) and a lower RMSE (21,610 USD versus 23,050 USD).
  • The predicted sale prices are clustered at low values, as illustrated in Figure 5.
  • Figure 6 shows the number of houses whose sale prices the model overvalues and undervalues: 366 houses have predicted sale prices higher than the actual ones, while 372 have predicted sale prices lower than the actual ones.
  • I learned how to compare different linear models using out-of-sample coverage and root mean squared error (RMSE), in addition to \(Adj.R^2\) and \(R^2\).
pred_final_cov_model <- as.data.frame(pred_final_cov_model)

saleprice <- ames_validation$price

ModelEval <- cbind(pred_final_cov_model, saleprice) %>%
  mutate(coverage = ifelse(lwr < saleprice &
                               upr > saleprice, "yes", "no"))

ModelEval <- na.omit(ModelEval)

p3 <- ggplot(data = ModelEval, aes(x = fit, y = saleprice, color = coverage)) +
    geom_point() +
  theme_solarized() +
    geom_line(aes(y=lwr), color = "red", linetype = "dashed") +
    geom_line(aes(y=upr), color = "red", linetype = "dashed") +
  labs(title = "Figure 5 - 95% prediction interval for sale price prediction",
       x = "Predicted values",
       y = "Sale price")

ModelBind <- cbind(pred_final_cov_model, saleprice) 

ModelPredict <- ModelBind %>%
  mutate(value = case_when(ModelBind$fit < saleprice ~ "undervalued",
                           ModelBind$fit > saleprice ~ "overvalued",
                           TRUE ~ "fit")) 

ModelPredict <- na.omit(ModelPredict)

countpredict <- count(ModelPredict$value)

p4 <- ggplot(data = countpredict , aes(x = x, y = freq, fill = x)) +
  geom_bar(stat = "identity", width = 0.5) +
  theme_solarized() +
  theme(legend.position="top") +
  geom_text(aes(label = freq), color = "white", vjust=1.6) +
  labs(title = "Figure 6 - Overvalued & Undervalued sale price prediction",
       x = "",
       y = "Number of houses",
       fill = "")

p3

p4