As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.
load("~/Desktop/R Programming/Statistics_Coursera/Capstone/Peer_Assignment_II/ames_train.Rdata")
Use the code block below to load any necessary packages
library(statsr)
library(dplyr)
library(MASS)
library(plyr)
library(BAS)
library(broom)
library(kableExtra)
library(corrplot)
library(ggplot2)
library(ggthemes)
library(graphics)
library(PerformanceAnalytics)
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
Figure 1 shows the distribution of sale prices of homes in Ames, Iowa, with a median price of 159,467 USD. The distribution is strongly right-skewed, with a long tail of high-priced homes that are potential outliers.
Figure 2 compares the distribution of sale price across neighborhoods, ordered by median price. It suggests that location is an important factor in property prices. By median price, StoneBr is the most expensive neighborhood (340,691.5 USD), while MeadowV is the least expensive (85,750 USD).
Overall quality is another important determinant of price, as Figure 3 shows: recently remodeled houses with high quality ratings (8 to 10) fetch higher prices.
Figure 4 shows sale price per square foot of lot area (price divided by Lot.Area), restricted to homes at or above 25 USD per square foot, by neighborhood and zoning. The selling_diff variable records how many years passed between the remodel (or construction, if the house was never remodeled) and the sale; it is not the time the house spent on the market. Houses in the floating village (FV) zone in the Somerst neighborhood are the most expensive per square foot, even though most of them were sold close to ten years after being built or remodeled. Most of the houses sold within a year of being built or remodeled are in the NridgHt neighborhood, a low-density residential (RL) area.
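The right skew noted for Figure 1 is one reason the later models use log(price) as the response. The sketch below illustrates the effect on simulated, log-normally distributed prices (not the Ames data); the `skewness` helper and all simulated values are assumptions made for illustration only.

```r
# Simulated right-skewed "prices"; these are NOT the Ames data.
set.seed(42)
sim_price <- rlnorm(1000, meanlog = log(160000), sdlog = 0.4)

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(sim_price)       # clearly positive for right-skewed data
skew_log <- skewness(log(sim_price))  # close to zero after the log transform

cat("skewness of price:     ", round(skew_raw, 2), "\n")
cat("skewness of log(price):", round(skew_log, 2), "\n")
```

A skewness near zero on the log scale suggests that a linear model for log(price) is less likely to be dominated by a handful of very expensive homes.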
price_data <- ames_train %>%
filter(!is.na(price)) %>%
dplyr::select(price, Neighborhood)
median_price <- median(price_data$price/1000)
line <- data.frame(vlines = 270, labels = "median price = 159,467 USD", stringsAsFactors = FALSE)
p1 <- price_data %>%
ggplot(aes(x = price/1000)) +
geom_histogram() +
geom_vline(xintercept = median_price, linetype = "dashed") +
geom_text(data = line, aes(x = vlines, y = 170, label = labels)) +
theme_solarized() +
labs(title = "Figure 1 - Distribution of sale price",
x = "sale price (thousands)",
y = "count")
p2 <- price_data %>%
ggplot(aes(x = reorder(Neighborhood, price/1000, FUN = median), y = price/1000)) +
geom_boxplot() +
theme_solarized() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Figure 2 - Distribution of sale price by neighborhood",
y = "Sale price (thousands)",
x = "Neighborhood")
p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p2
location_price <- ames_train %>%
dplyr::select(Neighborhood, price)
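# dplyr::summarise is called explicitly: when plyr is attached after dplyr,
# plyr::summarise masks it, ignores group_by(), and returns one overall row
price_summary <- location_price %>%
  group_by(Neighborhood) %>%
  dplyr::summarise(min_price = min(price),
                   max_price = max(price),
                   mean_price = mean(price),
                   median_price = median(price),
                   IQR_price = IQR(price),
                   sd_price = sd(price))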
ex_neighborhood <- price_summary[which(price_summary$median_price == max(price_summary$median_price)),]
leastex_neighborhood <- price_summary[which(price_summary$median_price == min(price_summary$median_price)),]
ex_neighborhood %>%
kbl(caption = "<b>Most expensive neighborhood</b>") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| min_price | max_price | mean_price | median_price | IQR_price | sd_price |
|---|---|---|---|---|---|
| 12789 | 615000 | 181190.1 | 159467 | 83237.5 | 81909.79 |
leastex_neighborhood %>%
kbl(caption = "<b>Least expensive neighborhood</b>") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| min_price | max_price | mean_price | median_price | IQR_price | sd_price |
|---|---|---|---|---|---|
| 12789 | 615000 | 181190.1 | 159467 | 83237.5 | 81909.79 |
ames_train$Overall.Qual <- as.factor(ames_train$Overall.Qual)
ggplot(ames_train, aes(x = Year.Remod.Add, y = price/1000,
color = Overall.Qual)) +
geom_point() +
theme_solarized() +
labs(title = "Figure 3 - Sale price distribution by remodel year and overall quality",
x = "Remodel year",
y = "Sale price (thousands)")
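# price per square foot of lot area
ames_train$price.Sqfeet <- ames_train$price/ames_train$Lot.Area
# years between the remodel (or construction) and the sale
ames_train$selling_diff <- ames_train$Yr.Sold - ames_train$Year.Remod.Add
# NOTE: the last three labels do not match their cutoffs: "within fifteen years"
# covers 11-25 years, "within twenty years" covers 26-50 years, and
# "more than twenty years" covers more than 50 years. The labels are left
# unchanged so that they match the coefficient names in the model output.
ames_train$selling_diff <- case_when(
  ames_train$selling_diff <= 1 ~ "within a year",
  ames_train$selling_diff <= 5 ~ "within five years",
  ames_train$selling_diff <= 10 ~ "within ten years",
  ames_train$selling_diff <= 25 ~ "within fifteen years",
  ames_train$selling_diff <= 50 ~ "within twenty years",
  TRUE ~ "more than twenty years")
ames_train$selling_diff <- as.factor(ames_train$selling_diff)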
price.more25 <- ames_train %>%
filter(price.Sqfeet >= 25) %>%
dplyr::select(price.Sqfeet, Neighborhood, selling_diff, MS.Zoning)
ggplot(price.more25, aes(x = Neighborhood, y = price.Sqfeet,
color = selling_diff, shape = MS.Zoning)) +
geom_point() +
theme_solarized() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Figure 4 - Sale price per square feet by neighborhood & zoning",
x = "Neighborhood",
y = "Sale price per square feet",
color = "House sold",
shape = "House zone")
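The selling_diff bins above are written by hand, which makes it easy for a label to drift away from its cutoff. As an alternative sketch, base R's `cut()` can generate the labels from the cutoffs themselves. The helper name `bin_years` and the example vector are hypothetical; the cutoffs (1, 5, 10, 25, 50) are the ones used in the `case_when()` call.

```r
# Hypothetical helper: bin "years since remodel" with labels built from the cutoffs,
# so a label can never disagree with the interval it describes.
bin_years <- function(years, breaks = c(1, 5, 10, 25, 50)) {
  labels <- c(paste("within", breaks, "years"),
              paste("more than", max(breaks), "years"))
  # right = TRUE gives intervals (-Inf, 1], (1, 5], ..., (50, Inf)
  cut(years, breaks = c(-Inf, breaks, Inf), labels = labels, right = TRUE)
}

years_since_remodel <- c(0, 1, 3, 10, 11, 25, 26, 50, 51)
binned <- bin_years(years_since_remodel)
print(data.frame(years_since_remodel, binned))
```

Unlike the `case_when()` version, `cut()` returns a factor whose levels follow the order of the cutoffs, and missing inputs stay `NA` instead of falling into the last bin.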
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)
Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
selling_diff, Overall.Qual, Lot.Area (log-transformed), Neighborhood, and MS.Zoning are selected as predictors in a linear model for log(price). Stepwise selection using AIC is then applied to check whether any of them can be dropped.
Stepwise AIC retains all five predictors, since removing any one of them raises the AIC. The model has an adjusted \(R^2\) of 0.8208 and a residual standard error of 0.178 on the log scale. Every selling_diff level, log(Lot.Area), and all Overall.Qual levels from 3 upward have p-values below 0.001, whereas a number of individual Neighborhood levels are not significant on their own.
#clean ames_train dataset
ames_train$Garage.Area[is.na(ames_train$Garage.Area)] <- 0
ames_train$Total.Bsmt.SF[is.na(ames_train$Total.Bsmt.SF)] <- 0
initial_model <- lm(log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
Neighborhood + MS.Zoning,
data = ames_train)
initial_model_AIC <- stepAIC(initial_model, k = 2)
## Start: AIC=-3405.64
## log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + Neighborhood +
## MS.Zoning
##
## Df Sum of Sq RSS AIC
## <none> 30.208 -3405.6
## - MS.Zoning 5 0.7560 30.964 -3390.9
## - selling_diff 5 3.2860 33.494 -3312.4
## - Neighborhood 26 4.8295 35.038 -3309.3
## - log(Lot.Area) 1 4.8296 35.038 -3259.3
## - Overall.Qual 9 19.2858 49.494 -2929.9
summary(initial_model_AIC)
##
## Call:
## lm(formula = log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
## Neighborhood + MS.Zoning, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.40748 -0.09052 0.00262 0.09133 0.81040
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.844553 0.236605 37.381 < 2e-16 ***
## selling_diffwithin a year 0.204444 0.027375 7.468 1.83e-13 ***
## selling_diffwithin fifteen years 0.175801 0.024113 7.291 6.47e-13 ***
## selling_diffwithin five years 0.196541 0.025383 7.743 2.47e-14 ***
## selling_diffwithin ten years 0.231974 0.024902 9.316 < 2e-16 ***
## selling_diffwithin twenty years 0.102827 0.022184 4.635 4.06e-06 ***
## Overall.Qual2 0.124814 0.193324 0.646 0.518680
## Overall.Qual3 0.632486 0.190539 3.319 0.000936 ***
## Overall.Qual4 0.808254 0.182785 4.422 1.09e-05 ***
## Overall.Qual5 0.948730 0.181905 5.216 2.25e-07 ***
## Overall.Qual6 1.091936 0.182245 5.992 2.94e-09 ***
## Overall.Qual7 1.203669 0.182888 6.581 7.68e-11 ***
## Overall.Qual8 1.360050 0.183938 7.394 3.12e-13 ***
## Overall.Qual9 1.589812 0.186388 8.530 < 2e-16 ***
## Overall.Qual10 1.650782 0.193904 8.513 < 2e-16 ***
## log(Lot.Area) 0.198656 0.016094 12.344 < 2e-16 ***
## NeighborhoodBlueste -0.077823 0.121803 -0.639 0.523024
## NeighborhoodBrDale -0.263515 0.086477 -3.047 0.002373 **
## NeighborhoodBrkSide -0.254784 0.069076 -3.688 0.000238 ***
## NeighborhoodClearCr -0.150555 0.081591 -1.845 0.065312 .
## NeighborhoodCollgCr -0.162075 0.060711 -2.670 0.007723 **
## NeighborhoodCrawfor -0.009752 0.068078 -0.143 0.886127
## NeighborhoodEdwards -0.249268 0.063991 -3.895 0.000105 ***
## NeighborhoodGilbert -0.191096 0.063946 -2.988 0.002876 **
## NeighborhoodGreens -0.091811 0.107415 -0.855 0.392916
## NeighborhoodGrnHill 0.190387 0.142768 1.334 0.182672
## NeighborhoodIDOTRR -0.324120 0.075876 -4.272 2.13e-05 ***
## NeighborhoodMeadowV -0.192138 0.080614 -2.383 0.017347 *
## NeighborhoodMitchel -0.176658 0.066526 -2.655 0.008052 **
## NeighborhoodNAmes -0.210709 0.061855 -3.407 0.000685 ***
## NeighborhoodNoRidge 0.063023 0.069330 0.909 0.363569
## NeighborhoodNPkVill -0.089052 0.106713 -0.835 0.404207
## NeighborhoodNridgHt 0.059462 0.063103 0.942 0.346282
## NeighborhoodNWAmes -0.119535 0.066457 -1.799 0.072385 .
## NeighborhoodOldTown -0.306890 0.068198 -4.500 7.63e-06 ***
## NeighborhoodSawyer -0.223468 0.065084 -3.434 0.000621 ***
## NeighborhoodSawyerW -0.182844 0.064147 -2.850 0.004461 **
## NeighborhoodSomerst -0.003329 0.071951 -0.046 0.963106
## NeighborhoodStoneBr 0.045517 0.071364 0.638 0.523750
## NeighborhoodSWISU -0.202073 0.079543 -2.540 0.011229 *
## NeighborhoodTimber -0.079227 0.071757 -1.104 0.269825
## NeighborhoodVeenker -0.146865 0.084142 -1.745 0.081232 .
## MS.ZoningFV 0.236980 0.088476 2.678 0.007524 **
## MS.ZoningI (all) -0.100718 0.191190 -0.527 0.598456
## MS.ZoningRH 0.195618 0.100782 1.941 0.052553 .
## MS.ZoningRL 0.297223 0.073254 4.057 5.37e-05 ***
## MS.ZoningRM 0.275782 0.067623 4.078 4.92e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.178 on 953 degrees of freedom
## Multiple R-squared: 0.8291, Adjusted R-squared: 0.8208
## F-statistic: 100.5 on 46 and 953 DF, p-value: < 2.2e-16
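Because the response is log(price), the coefficients in the summary above are on the log scale. The short calculation below, a sketch that copies two estimates from that table, converts them into approximate percentage effects on price. The variable names are made up for the example.

```r
# Coefficients copied from the summary output above (log(price) scale).
coef_within_a_year <- 0.204444  # selling_diff "within a year" vs. baseline "more than twenty years"
coef_log_lot_area  <- 0.198656  # elasticity of price with respect to Lot.Area

# A dummy coefficient b multiplies the expected price by exp(b),
# holding the other predictors fixed.
pct_within_a_year <- (exp(coef_within_a_year) - 1) * 100

# For a log-log term, doubling the predictor multiplies price by 2^b.
pct_double_lot <- (2^coef_log_lot_area - 1) * 100

cat("Sold within a year of remodel vs. baseline: about",
    round(pct_within_a_year, 1), "% higher price\n")
cat("Doubling the lot area: about",
    round(pct_double_lot, 1), "% higher price\n")
```

Both figures are conditional on the other predictors in the model and describe the fitted association, not a causal effect.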
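Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
Garage.Area and Total.Bsmt.SF are added to the initial model. Stepwise selection with the BIC penalty (k = log(n)) drops Neighborhood and keeps selling_diff, Overall.Qual, log(Lot.Area), MS.Zoning, Garage.Area, and Total.Bsmt.SF.
BAS does not arrive at exactly the same model. Its highest-probability model (posterior probability 0.1759) keeps the same six predictors but also includes five individual Neighborhood indicators (Crawfor, GrnHill, NoRidge, NridgHt, and StoneBr) and excludes Overall.Qual2 and MS.ZoningI (all). The difference arises because stepAIC adds or drops a factor as a whole, whereas BAS here treats each dummy variable separately. Both methods agree that selling_diff, Overall.Qual, log(Lot.Area), MS.Zoning, Garage.Area, and Total.Bsmt.SF are useful predictors.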
model_BIC <- lm(log(price) ~ selling_diff + Overall.Qual +
log(Lot.Area) + Neighborhood +
MS.Zoning + Garage.Area + Total.Bsmt.SF,
data = ames_train)
k.BIC <- log(nrow(ames_train))
initial_model_BIC <- stepAIC(model_BIC, k = k.BIC)
## Start: AIC=-3306.64
## log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + Neighborhood +
## MS.Zoning + Garage.Area + Total.Bsmt.SF
##
## Df Sum of Sq RSS AIC
## - Neighborhood 26 3.4018 29.520 -3363.8
## - MS.Zoning 5 0.8139 26.932 -3310.5
## <none> 26.118 -3306.6
## - Garage.Area 1 1.4493 27.568 -3259.5
## - log(Lot.Area) 1 1.7787 27.897 -3247.7
## - Total.Bsmt.SF 1 2.1223 28.241 -3235.4
## - selling_diff 5 3.0673 29.186 -3230.1
## - Overall.Qual 9 12.2628 38.381 -2983.9
##
## Step: AIC=-3363.8
## log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) + MS.Zoning +
## Garage.Area + Total.Bsmt.SF
##
## Df Sum of Sq RSS AIC
## <none> 29.520 -3363.8
## - MS.Zoning 5 1.9613 31.481 -3334.0
## - Garage.Area 1 1.8650 31.385 -3309.5
## - log(Lot.Area) 1 2.2560 31.776 -3297.1
## - Total.Bsmt.SF 1 2.7437 32.264 -3281.8
## - selling_diff 5 3.6956 33.216 -3280.4
## - Overall.Qual 9 20.9165 50.437 -2890.3
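The "AIC" column printed by stepAIC is, for a linear model, n * log(RSS / n) + k * edf, where edf is the number of estimated coefficients and k is the penalty (2 for AIC, log(n) for BIC). The sketch below reproduces three of the values printed above from their RSS. The function name `step_criterion` is hypothetical, and n = 1000 is inferred from the initial model's summary (953 residual degrees of freedom plus 47 coefficients).

```r
# Criterion reported by step()/stepAIC() for lm fits (up to an additive constant).
step_criterion <- function(rss, n, edf, k) n * log(rss / n) + k * edf

n <- 1000  # inferred: 953 residual df + 47 estimated coefficients

# Initial model, AIC penalty (k = 2): 47 coefficients, RSS = 30.208
aic_initial <- step_criterion(rss = 30.208, n = n, edf = 47, k = 2)

# Extended model, BIC penalty (k = log(n)): 49 coefficients, RSS = 26.118
bic_full <- step_criterion(rss = 26.118, n = n, edf = 49, k = log(n))

# Same model without Neighborhood (26 fewer coefficients), RSS = 29.520
bic_no_nbhd <- step_criterion(rss = 29.520, n = n, edf = 49 - 26, k = log(n))

print(round(c(aic_initial = aic_initial,
              bic_full = bic_full,
              bic_no_nbhd = bic_no_nbhd), 1))
```

With k = log(1000), which is about 6.9, each of the 26 Neighborhood coefficients costs more than three times what it costs under AIC. That is why BIC drops Neighborhood even though the RSS rises from 26.118 to 29.520.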
model_bas <- bas.lm(log(price) ~ selling_diff + Overall.Qual +
log(Lot.Area) + Neighborhood +
MS.Zoning + Garage.Area + Total.Bsmt.SF,
data = ames_train, prior = "BIC",
modelprior = uniform())
summary(model_bas)
## P(B != 0 | Y) model 1 model 2
## Intercept 1.00000000 1.0000 1.0000000
## selling_diffwithin a year 1.00000000 1.0000 1.0000000
## selling_diffwithin fifteen years 1.00000000 1.0000 1.0000000
## selling_diffwithin five years 1.00000000 1.0000 1.0000000
## selling_diffwithin ten years 1.00000000 1.0000 1.0000000
## selling_diffwithin twenty years 0.99845238 1.0000 1.0000000
## Overall.Qual2 0.04535582 0.0000 0.0000000
## Overall.Qual3 0.98466910 1.0000 1.0000000
## Overall.Qual4 0.99956592 1.0000 1.0000000
## Overall.Qual5 0.99999258 1.0000 1.0000000
## Overall.Qual6 0.99999987 1.0000 1.0000000
## Overall.Qual7 1.00000000 1.0000 1.0000000
## Overall.Qual8 1.00000000 1.0000 1.0000000
## Overall.Qual9 1.00000000 1.0000 1.0000000
## Overall.Qual10 1.00000000 1.0000 1.0000000
## log(Lot.Area) 1.00000000 1.0000 1.0000000
## NeighborhoodBlueste 0.05539778 0.0000 0.0000000
## NeighborhoodBrDale 0.03583905 0.0000 0.0000000
## NeighborhoodBrkSide 0.03063710 0.0000 0.0000000
## NeighborhoodClearCr 0.15582098 0.0000 0.0000000
## NeighborhoodCollgCr 0.04863838 0.0000 0.0000000
## NeighborhoodCrawfor 0.99999931 1.0000 1.0000000
## NeighborhoodEdwards 0.08070309 0.0000 0.0000000
## NeighborhoodGilbert 0.04811348 0.0000 0.0000000
## NeighborhoodGreens 0.03786438 0.0000 0.0000000
## NeighborhoodGrnHill 0.99672306 1.0000 1.0000000
## NeighborhoodIDOTRR 0.04664861 0.0000 0.0000000
## NeighborhoodMeadowV 0.09176215 0.0000 0.0000000
## NeighborhoodMitchel 0.02934080 0.0000 0.0000000
## NeighborhoodNAmes 0.06308684 0.0000 0.0000000
## NeighborhoodNoRidge 0.99981501 1.0000 1.0000000
## NeighborhoodNPkVill 0.03015002 0.0000 0.0000000
## NeighborhoodNridgHt 0.99880119 1.0000 1.0000000
## NeighborhoodNWAmes 0.12720562 0.0000 0.0000000
## NeighborhoodOldTown 0.25734326 0.0000 1.0000000
## NeighborhoodSawyer 0.03016981 0.0000 0.0000000
## NeighborhoodSawyerW 0.03049529 0.0000 0.0000000
## NeighborhoodSomerst 0.13349009 0.0000 0.0000000
## NeighborhoodStoneBr 0.99457804 1.0000 1.0000000
## NeighborhoodSWISU 0.03916785 0.0000 0.0000000
## NeighborhoodTimber 0.04655959 0.0000 0.0000000
## NeighborhoodVeenker 0.03363327 0.0000 0.0000000
## MS.ZoningFV 0.99902756 1.0000 1.0000000
## MS.ZoningI (all) 0.03200003 0.0000 0.0000000
## MS.ZoningRH 0.92405762 1.0000 1.0000000
## MS.ZoningRL 0.99969750 1.0000 1.0000000
## MS.ZoningRM 0.99891563 1.0000 1.0000000
## Garage.Area 1.00000000 1.0000 1.0000000
## Total.Bsmt.SF 1.00000000 1.0000 1.0000000
## BF NA 1.0000 0.3424334
## PostProbs NA 0.1759 0.0602000
## R2 NA 0.8475 0.8482000
## dim NA 26.0000 27.0000000
## logmarg NA -1736.7529 -1737.8245952
## model 3 model 4 model 5
## Intercept 1.0000000 1.0000000 1.0000000
## selling_diffwithin a year 1.0000000 1.0000000 1.0000000
## selling_diffwithin fifteen years 1.0000000 1.0000000 1.0000000
## selling_diffwithin five years 1.0000000 1.0000000 1.0000000
## selling_diffwithin ten years 1.0000000 1.0000000 1.0000000
## selling_diffwithin twenty years 1.0000000 1.0000000 1.0000000
## Overall.Qual2 0.0000000 0.0000000 0.0000000
## Overall.Qual3 1.0000000 1.0000000 1.0000000
## Overall.Qual4 1.0000000 1.0000000 1.0000000
## Overall.Qual5 1.0000000 1.0000000 1.0000000
## Overall.Qual6 1.0000000 1.0000000 1.0000000
## Overall.Qual7 1.0000000 1.0000000 1.0000000
## Overall.Qual8 1.0000000 1.0000000 1.0000000
## Overall.Qual9 1.0000000 1.0000000 1.0000000
## Overall.Qual10 1.0000000 1.0000000 1.0000000
## log(Lot.Area) 1.0000000 1.0000000 1.0000000
## NeighborhoodBlueste 0.0000000 0.0000000 0.0000000
## NeighborhoodBrDale 0.0000000 0.0000000 0.0000000
## NeighborhoodBrkSide 0.0000000 0.0000000 0.0000000
## NeighborhoodClearCr 1.0000000 0.0000000 0.0000000
## NeighborhoodCollgCr 0.0000000 0.0000000 0.0000000
## NeighborhoodCrawfor 1.0000000 1.0000000 1.0000000
## NeighborhoodEdwards 0.0000000 0.0000000 0.0000000
## NeighborhoodGilbert 0.0000000 0.0000000 0.0000000
## NeighborhoodGreens 0.0000000 0.0000000 0.0000000
## NeighborhoodGrnHill 1.0000000 1.0000000 1.0000000
## NeighborhoodIDOTRR 0.0000000 0.0000000 0.0000000
## NeighborhoodMeadowV 0.0000000 0.0000000 0.0000000
## NeighborhoodMitchel 0.0000000 0.0000000 0.0000000
## NeighborhoodNAmes 0.0000000 0.0000000 0.0000000
## NeighborhoodNoRidge 1.0000000 1.0000000 1.0000000
## NeighborhoodNPkVill 0.0000000 0.0000000 0.0000000
## NeighborhoodNridgHt 1.0000000 1.0000000 1.0000000
## NeighborhoodNWAmes 0.0000000 0.0000000 1.0000000
## NeighborhoodOldTown 0.0000000 0.0000000 0.0000000
## NeighborhoodSawyer 0.0000000 0.0000000 0.0000000
## NeighborhoodSawyerW 0.0000000 0.0000000 0.0000000
## NeighborhoodSomerst 0.0000000 1.0000000 0.0000000
## NeighborhoodStoneBr 1.0000000 1.0000000 1.0000000
## NeighborhoodSWISU 0.0000000 0.0000000 0.0000000
## NeighborhoodTimber 0.0000000 0.0000000 0.0000000
## NeighborhoodVeenker 0.0000000 0.0000000 0.0000000
## MS.ZoningFV 1.0000000 1.0000000 1.0000000
## MS.ZoningI (all) 0.0000000 0.0000000 0.0000000
## MS.ZoningRH 1.0000000 1.0000000 1.0000000
## MS.ZoningRL 1.0000000 1.0000000 1.0000000
## MS.ZoningRM 1.0000000 1.0000000 1.0000000
## Garage.Area 1.0000000 1.0000000 1.0000000
## Total.Bsmt.SF 1.0000000 1.0000000 1.0000000
## BF 0.1934976 0.1528193 0.1459556
## PostProbs 0.0340000 0.0269000 0.0257000
## R2 0.8481000 0.8480000 0.8480000
## dim 27.0000000 27.0000000 27.0000000
## logmarg -1738.3954071 -1738.6314164 -1738.6773701
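The BF row in the BAS summary can be recovered from the logmarg row: each Bayes factor is the exponential of the difference between a model's log marginal likelihood and that of the best model. The sketch below recomputes the BF row from the five logmarg values printed above. With the uniform model prior used here, the same ratios are also the posterior odds against model 1.

```r
# Log marginal likelihoods copied from the summary(model_bas) output above.
logmarg <- c(model1 = -1736.7529,
             model2 = -1737.8245952,
             model3 = -1738.3954071,
             model4 = -1738.6314164,
             model5 = -1738.6773701)

# Bayes factor of each model relative to the best one.
bf_vs_best <- exp(logmarg - max(logmarg))
print(round(bf_vs_best, 4))

# Ratio of the printed posterior probabilities of model 2 and model 1,
# which should be close to the Bayes factor of model 2.
post_ratio <- 0.0602 / 0.1759
print(round(post_ratio, 4))
```

A Bayes factor of roughly 0.34 for model 2 means the data favor model 1 by about three to one, so the top model is preferred but not overwhelmingly.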
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
The residuals of model_BIC (the linear model with all seven predictors) appear approximately normally distributed. The diagnostic plots flag three possible outliers (observations 181, 310, and 428), none of which is influential. However, the Scale-Location plot indicates a violation of the constant-variance assumption.
The BAS model shows the same pattern and flags the same three observations, for which the model is likely to predict a sale price well above or below the actual one. These outliers are associated with unusual sale circumstances, such as sales between family members, houses not completed when assessed, and abnormal sales. Limiting the dataset to houses sold under normal conditions would likely give more accurate results.
par(mfrow=c(2,2))
plot(model_BIC)
## Warning: not plotting observations with leverage one:
## 569, 918
plot(model_bas, which = 1)
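The warning about observations with leverage one typically appears when an observation is the only member of a factor level: its own dummy variable fits it exactly, so its residual is zero and its standardized residual is undefined. The sketch below reproduces that situation on a small simulated data set (the data and names are invented for illustration, and the cause for observations 569 and 918 is an assumption, not something verified here).

```r
# Simulated data: level "c" of the factor g has a single observation (row 21).
set.seed(1)
d <- data.frame(g = factor(c(rep("a", 10), rep("b", 10), "c")),
                x = rnorm(21))
d$y <- 1 + 0.5 * d$x + rnorm(21)

fit <- lm(y ~ g + x, data = d)

# Leverage (diagonal of the hat matrix) for every observation.
h <- unname(hatvalues(fit))
singleton_leverage <- h[21]
singleton_residual <- unname(residuals(fit))[21]

cat("leverage of the singleton row:", round(singleton_leverage, 6), "\n")
cat("its residual:                 ", round(singleton_residual, 6), "\n")
```

Such rows carry no information about how well the model fits them, so they are worth checking by hand, for example with `table()` on each factor predictor.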
You can calculate the RMSE directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
The RMSE of model_BIC on the training dataset is USD 30,809, while the highest-probability model (HPM) from BAS gives a higher training RMSE of USD 31,412.
RMSE <- function(model, dataset) {
pred_price <- exp(predict(model, dataset))
resids <- dataset$price - pred_price
rmse <- round(sqrt(mean(resids^2)))
rmse
}
initial_rmse <- RMSE(model_BIC, ames_train)
print(paste("The RMSE for the model predicting on the training dataset is USD", initial_rmse))
## [1] "The RMSE for the model predicting on the training dataset is USD 30809"
predict_bas <- predict(model_bas, newdata = ames_train,
estimator = "HPM")
predict_bas_HPM <- sqrt(mean((exp(predict_bas$fit)-ames_train$price)^2))
print(paste("The RMSE for the model using HPM (Highest Probability Model) to predict on the training dataset is USD", round(predict_bas_HPM)))
## [1] "The RMSE for the model using HPM (Highest Probability Model) to predict on the training dataset is USD 31412"
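Since the models are fitted to log(price), an RMSE computed from the raw residuals is on the log scale, while the figures reported above come from back-transformed predictions and are in dollars. The sketch below shows both on simulated data; the data-generating values and variable names are assumptions for illustration and have nothing to do with the Ames results.

```r
# Simulated data with a log-linear price relationship (NOT the Ames data).
set.seed(1)
n <- 500
sim <- data.frame(area = runif(n, 800, 3000))
sim$price <- exp(10 + 0.0005 * sim$area + rnorm(n, sd = 0.15))

fit <- lm(log(price) ~ area, data = sim)

# RMSE on the log scale: unitless, roughly a typical relative error.
rmse_log <- sqrt(mean(residuals(fit)^2))

# RMSE in the original units: back-transform predictions with exp() first.
rmse_usd <- sqrt(mean((sim$price - exp(predict(fit, sim)))^2))

cat("RMSE on the log scale:", round(rmse_log, 3), "\n")
cat("RMSE in dollars:      ", round(rmse_usd), "\n")
```

Only the dollar figure can be compared directly with sale prices, which is why the report uses it.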
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("~/Desktop/R Programming/Statistics_Coursera/Capstone/ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
The coverage probability of the initial model on the out-of-sample data (ames_test) is calculated. The result shows that 97.18% of the actual prices in ames_test fall within the model's 95% prediction intervals. The RMSE of the model on the testing dataset is USD 28,853, which is lower than the RMSE on the training dataset (USD 30,809), so the comparison gives no sign of overfitting.
For the BAS model, the test-set RMSE is computed as predict_test_bas_HPM. The USD 31,412 figure is the training-set RMSE (predict_bas_HPM) and should not be read as an out-of-sample result.
#create selling_diff variable & convert to factor class
#(same cutoffs and labels as for ames_train; the last three labels do not
# match their cutoffs of 25 and 50 years but must agree with the training levels)
ames_test$selling_diff <- ames_test$Yr.Sold - ames_test$Year.Remod.Add
ames_test$selling_diff <- case_when(
  ames_test$selling_diff <= 1 ~ "within a year",
  ames_test$selling_diff <= 5 ~ "within five years",
  ames_test$selling_diff <= 10 ~ "within ten years",
  ames_test$selling_diff <= 25 ~ "within fifteen years",
  ames_test$selling_diff <= 50 ~ "within twenty years",
  TRUE ~ "more than twenty years")
ames_test$selling_diff <- as.factor(ames_test$selling_diff)
#clean Garage.Area & Total.Bsmt.SF
ames_test$Garage.Area[is.na(ames_test$Garage.Area)] <- 0
ames_test$Total.Bsmt.SF[is.na(ames_test$Total.Bsmt.SF)] <- 0
#convert Overall.Qual to factor class
ames_test$Overall.Qual <- as.factor(ames_test$Overall.Qual)
#remove level "Landmrk" from Neighborhood
ames_test <- ames_test %>%
filter(Neighborhood != "Landmrk")
pred_test <- exp(predict(model_BIC, ames_test, interval = "prediction"))
coverage_prob_full <- mean(ames_test$price > pred_test[, "lwr"] &
ames_test$price < pred_test[, "upr"])
rmse_model_BIC <- RMSE(model_BIC, ames_test)
print(paste("The out-of-sample coverage for the model is", coverage_prob_full * 100))
## [1] "The out-of-sample coverage for the model is 97.1813725490196"
print(paste("The RMSE for the model predicting on the testing dataset is USD", rmse_model_BIC))
## [1] "The RMSE for the model predicting on the testing dataset is USD 28853"
predict_test_bas <- predict(model_bas, newdata = ames_test,
estimator = "HPM")
predict_test_bas_HPM <- sqrt(mean((exp(predict_test_bas$fit)-ames_test$price)^2))
# report the test-set RMSE (predict_test_bas_HPM), not the training value predict_bas_HPM
print(paste("The RMSE for the model using HPM (Highest Probability Model) to predict on the testing dataset is USD", round(predict_test_bas_HPM)))
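The coverage check used above compares each actual value with its prediction interval and reports the share that falls inside. The sketch below repeats the idea on simulated data where the model assumptions hold, so the coverage should land near the nominal 95%. The `make_data` helper and all simulated values are hypothetical.

```r
# Simulated train/test split from the same linear model (NOT the Ames data).
set.seed(2)
make_data <- function(n) {
  x <- runif(n, 0, 10)
  data.frame(x = x, y = 2 + 0.3 * x + rnorm(n, sd = 0.5))
}
train <- make_data(500)
test  <- make_data(500)

fit <- lm(y ~ x, data = train)

# 95% prediction intervals for the held-out observations.
pred <- predict(fit, newdata = test, interval = "prediction", level = 0.95)

# Share of held-out values that fall inside their interval.
coverage <- mean(test$y > pred[, "lwr"] & test$y < pred[, "upr"])
cat("out-of-sample coverage:", round(coverage * 100, 1), "%\n")
```

Coverage well below 95% on real data would suggest that the intervals are too narrow, for instance because of non-constant variance; coverage above 95%, as with the 97.18% reported here, suggests they are somewhat conservative.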
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
Based on the residual analysis, the outliers appear to be driven by sale conditions other than normal. To obtain more accurate predictions, ames_train is filtered to houses sold under normal conditions, giving the dataset normal_sale_trainset.
Stepwise selection with AIC retains selling_diff, Overall.Qual, log(Lot.Area), Neighborhood, MS.Zoning, Total.Bsmt.SF, Garage.Area, Bldg.Type, BsmtFin.Type.1, log(area), Land.Slope, House.Style, Exter.Cond, Foundation, Central.Air, and Wood.Deck.SF, with an adjusted \(R^2\) of 0.9349 and a residual standard error of 0.09746.
Stepwise selection with BIC retains selling_diff, Overall.Qual, log(Lot.Area), MS.Zoning, Total.Bsmt.SF, Garage.Area, BsmtFin.Type.1, log(area), Land.Slope, Central.Air, and Wood.Deck.SF, with an adjusted \(R^2\) of 0.9174 and a residual standard error of 0.1097.
Using BAS, the model with the highest posterior probability (0.0124) includes selling_diff, Overall.Qual, log(Lot.Area), MS.Zoning, Total.Bsmt.SF, Garage.Area, BsmtFin.Type.1, log(area), Land.Slope, Central.Air, and Wood.Deck.SF.
#filter ames_train to houses sold under normal selling conditions
normal_sale_trainset <- ames_train %>%
filter(Sale.Condition == "Normal")
#clean dataset
levels(normal_sale_trainset$BsmtFin.Type.1) <- c(levels(normal_sale_trainset$BsmtFin.Type.1),"NB")
normal_sale_trainset$BsmtFin.Type.1[is.na(normal_sale_trainset$BsmtFin.Type.1)] <- "NB"
normal_sale_trainset$Garage.Cars[is.na(normal_sale_trainset$Garage.Cars)] <- 0
FullModel <- lm(log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
Neighborhood + MS.Zoning + Total.Bsmt.SF + Lot.Shape +
Full.Bath + Garage.Area + Bldg.Type +
BsmtFin.Type.1 + log(area) + Land.Contour + Land.Slope +
House.Style + Exter.Cond + Foundation + Central.Air +
Wood.Deck.SF,
data = normal_sale_trainset)
FullModel_AIC <- stepAIC(FullModel, trace = FALSE, k = 2)
summary(FullModel_AIC)
##
## Call:
## lm(formula = log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
## Neighborhood + MS.Zoning + Total.Bsmt.SF + Garage.Area +
## Bldg.Type + BsmtFin.Type.1 + log(area) + Land.Slope + House.Style +
## Exter.Cond + Foundation + Central.Air + Wood.Deck.SF, data = normal_sale_trainset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45944 -0.05364 0.00232 0.05816 0.31117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.669e+00 2.129e-01 36.020 < 2e-16 ***
## selling_diffwithin a year 9.321e-02 1.886e-02 4.943 9.48e-07 ***
## selling_diffwithin fifteen years 8.286e-02 1.514e-02 5.473 6.01e-08 ***
## selling_diffwithin five years 1.203e-01 1.557e-02 7.725 3.56e-14 ***
## selling_diffwithin ten years 1.145e-01 1.586e-02 7.217 1.30e-12 ***
## selling_diffwithin twenty years 3.900e-02 1.415e-02 2.757 0.005979 **
## Overall.Qual2 1.676e-01 1.200e-01 1.396 0.163105
## Overall.Qual3 2.417e-01 1.182e-01 2.045 0.041229 *
## Overall.Qual4 3.054e-01 1.159e-01 2.635 0.008575 **
## Overall.Qual5 3.556e-01 1.161e-01 3.062 0.002278 **
## Overall.Qual6 4.201e-01 1.168e-01 3.597 0.000343 ***
## Overall.Qual7 4.752e-01 1.176e-01 4.041 5.88e-05 ***
## Overall.Qual8 5.599e-01 1.183e-01 4.734 2.63e-06 ***
## Overall.Qual9 7.287e-01 1.203e-01 6.058 2.18e-09 ***
## Overall.Qual10 7.728e-01 1.287e-01 6.007 2.94e-09 ***
## log(Lot.Area) 5.794e-02 1.329e-02 4.359 1.49e-05 ***
## NeighborhoodBlueste -2.429e-02 7.493e-02 -0.324 0.745898
## NeighborhoodBrDale -9.496e-02 6.252e-02 -1.519 0.129245
## NeighborhoodBrkSide -2.987e-02 5.075e-02 -0.589 0.556256
## NeighborhoodClearCr -8.320e-03 5.609e-02 -0.148 0.882127
## NeighborhoodCollgCr -4.564e-02 4.451e-02 -1.025 0.305457
## NeighborhoodCrawfor 6.931e-02 4.894e-02 1.416 0.157114
## NeighborhoodEdwards -1.225e-01 4.622e-02 -2.651 0.008189 **
## NeighborhoodGilbert -4.822e-02 4.723e-02 -1.021 0.307586
## NeighborhoodGreens 1.158e-01 6.519e-02 1.776 0.076106 .
## NeighborhoodGrnHill 5.539e-01 8.444e-02 6.560 9.99e-11 ***
## NeighborhoodIDOTRR -7.509e-02 5.458e-02 -1.376 0.169303
## NeighborhoodMeadowV -1.672e-01 5.332e-02 -3.136 0.001777 **
## NeighborhoodMitchel -6.523e-02 4.712e-02 -1.384 0.166659
## NeighborhoodNAmes -5.998e-02 4.534e-02 -1.323 0.186252
## NeighborhoodNoRidge 2.272e-02 4.813e-02 0.472 0.636989
## NeighborhoodNPkVill -3.798e-02 6.660e-02 -0.570 0.568632
## NeighborhoodNridgHt 4.335e-02 4.475e-02 0.969 0.332942
## NeighborhoodNWAmes -4.944e-02 4.733e-02 -1.045 0.296538
## NeighborhoodOldTown -1.122e-01 5.018e-02 -2.235 0.025712 *
## NeighborhoodSawyer -6.141e-02 4.685e-02 -1.311 0.190346
## NeighborhoodSawyerW -7.752e-02 4.617e-02 -1.679 0.093583 .
## NeighborhoodSomerst 2.756e-02 5.343e-02 0.516 0.606128
## NeighborhoodStoneBr 1.682e-02 4.946e-02 0.340 0.733892
## NeighborhoodSWISU -7.755e-02 5.539e-02 -1.400 0.161946
## NeighborhoodTimber -7.003e-03 4.903e-02 -0.143 0.886463
## NeighborhoodVeenker 6.069e-02 5.460e-02 1.112 0.266632
## MS.ZoningFV 2.928e-01 6.446e-02 4.542 6.47e-06 ***
## MS.ZoningI (all) -1.120e-01 1.159e-01 -0.967 0.334021
## MS.ZoningRH 1.824e-01 6.668e-02 2.735 0.006384 **
## MS.ZoningRL 2.945e-01 5.276e-02 5.582 3.31e-08 ***
## MS.ZoningRM 2.647e-01 4.965e-02 5.332 1.28e-07 ***
## Total.Bsmt.SF 1.246e-04 1.990e-05 6.258 6.51e-10 ***
## Garage.Area 1.247e-04 2.421e-05 5.151 3.31e-07 ***
## Bldg.Type2fmCon -3.446e-02 2.684e-02 -1.284 0.199578
## Bldg.TypeDuplex -7.124e-02 2.247e-02 -3.171 0.001581 **
## Bldg.TypeTwnhs -7.583e-02 2.923e-02 -2.594 0.009659 **
## Bldg.TypeTwnhsE -2.772e-02 2.070e-02 -1.339 0.180923
## BsmtFin.Type.1BLQ -3.778e-02 1.440e-02 -2.623 0.008887 **
## BsmtFin.Type.1GLQ 1.354e-02 1.276e-02 1.061 0.288919
## BsmtFin.Type.1LwQ -8.535e-02 1.815e-02 -4.702 3.06e-06 ***
## BsmtFin.Type.1Rec -4.354e-02 1.403e-02 -3.102 0.001991 **
## BsmtFin.Type.1Unf -8.636e-02 1.256e-02 -6.873 1.31e-11 ***
## BsmtFin.Type.1NB -1.924e-01 4.312e-02 -4.463 9.31e-06 ***
## log(area) 3.925e-01 2.558e-02 15.347 < 2e-16 ***
## Land.SlopeMod 5.541e-02 2.000e-02 2.771 0.005734 **
## Land.SlopeSev 1.751e-01 6.711e-02 2.609 0.009269 **
## House.Style1.5Unf 1.412e-02 3.990e-02 0.354 0.723484
## House.Style1Story 1.031e-02 1.762e-02 0.585 0.558864
## House.Style2.5Unf 6.383e-02 3.767e-02 1.695 0.090579 .
## House.Style2Story 1.696e-02 1.586e-02 1.069 0.285216
## House.StyleSFoyer 9.784e-02 2.588e-02 3.780 0.000169 ***
## House.StyleSLvl 1.347e-02 2.238e-02 0.602 0.547301
## Exter.CondFa -2.298e-01 5.876e-02 -3.911 1.00e-04 ***
## Exter.CondGd -1.074e-01 5.211e-02 -2.061 0.039691 *
## Exter.CondTA -1.057e-01 5.121e-02 -2.064 0.039337 *
## FoundationCBlock 3.591e-02 1.515e-02 2.370 0.018055 *
## FoundationPConc 6.286e-02 1.624e-02 3.871 0.000118 ***
## FoundationSlab 2.272e-01 4.872e-02 4.663 3.68e-06 ***
## FoundationStone 1.196e-01 5.978e-02 2.000 0.045865 *
## Central.AirY 1.277e-01 1.927e-02 6.626 6.55e-11 ***
## Wood.Deck.SF 1.612e-04 3.201e-05 5.035 5.97e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09746 on 757 degrees of freedom
## Multiple R-squared: 0.9408, Adjusted R-squared: 0.9349
## F-statistic: 158.3 on 76 and 757 DF, p-value: < 2.2e-16
# BIC penalty: k = log(n) instead of the AIC default k = 2
Full.BIC <- log(nrow(normal_sale_trainset))
FullModel.BIC <- stepAIC(FullModel, k = Full.BIC, trace = FALSE)
summary(FullModel.BIC)
##
## Call:
## lm(formula = log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
## MS.Zoning + Total.Bsmt.SF + Garage.Area + BsmtFin.Type.1 +
## log(area) + Land.Slope + Central.Air + Wood.Deck.SF, data = normal_sale_trainset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52432 -0.06342 0.00539 0.06148 0.51963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.478e+00 1.646e-01 45.437 < 2e-16 ***
## selling_diffwithin a year 1.067e-01 1.872e-02 5.701 1.68e-08 ***
## selling_diffwithin fifteen years 9.580e-02 1.519e-02 6.307 4.69e-10 ***
## selling_diffwithin five years 1.345e-01 1.594e-02 8.438 < 2e-16 ***
## selling_diffwithin ten years 1.257e-01 1.589e-02 7.909 8.60e-15 ***
## selling_diffwithin twenty years 3.799e-02 1.337e-02 2.842 0.004592 **
## Overall.Qual2 1.126e-01 1.244e-01 0.905 0.365702
## Overall.Qual3 2.434e-01 1.219e-01 1.997 0.046131 *
## Overall.Qual4 3.021e-01 1.182e-01 2.556 0.010766 *
## Overall.Qual5 3.807e-01 1.184e-01 3.215 0.001355 **
## Overall.Qual6 4.601e-01 1.193e-01 3.857 0.000124 ***
## Overall.Qual7 5.499e-01 1.201e-01 4.581 5.37e-06 ***
## Overall.Qual8 6.734e-01 1.207e-01 5.581 3.27e-08 ***
## Overall.Qual9 8.548e-01 1.229e-01 6.954 7.40e-12 ***
## Overall.Qual10 9.028e-01 1.328e-01 6.797 2.08e-11 ***
## log(Lot.Area) 8.083e-02 1.028e-02 7.859 1.25e-14 ***
## MS.ZoningFV 3.647e-01 5.491e-02 6.642 5.72e-11 ***
## MS.ZoningI (all) -9.704e-02 1.236e-01 -0.785 0.432567
## MS.ZoningRH 1.660e-01 6.781e-02 2.447 0.014614 *
## MS.ZoningRL 3.219e-01 5.156e-02 6.243 6.98e-10 ***
## MS.ZoningRM 2.570e-01 5.197e-02 4.946 9.25e-07 ***
## Total.Bsmt.SF 1.195e-04 1.460e-05 8.186 1.06e-15 ***
## Garage.Area 1.391e-04 2.499e-05 5.568 3.53e-08 ***
## BsmtFin.Type.1BLQ -5.074e-02 1.551e-02 -3.272 0.001113 **
## BsmtFin.Type.1GLQ 1.591e-02 1.336e-02 1.191 0.234003
## BsmtFin.Type.1LwQ -1.057e-01 1.937e-02 -5.458 6.41e-08 ***
## BsmtFin.Type.1Rec -5.198e-02 1.518e-02 -3.425 0.000645 ***
## BsmtFin.Type.1Unf -1.052e-01 1.318e-02 -7.987 4.80e-15 ***
## BsmtFin.Type.1NB -7.967e-02 3.227e-02 -2.469 0.013756 *
## log(area) 3.623e-01 1.696e-02 21.358 < 2e-16 ***
## Land.SlopeMod 6.500e-02 2.118e-02 3.068 0.002224 **
## Land.SlopeSev 1.507e-01 6.829e-02 2.206 0.027646 *
## Central.AirY 1.566e-01 1.986e-02 7.882 1.06e-14 ***
## Wood.Deck.SF 1.507e-04 3.442e-05 4.377 1.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1097 on 800 degrees of freedom
## Multiple R-squared: 0.9207, Adjusted R-squared: 0.9174
## F-statistic: 281.5 on 33 and 800 DF, p-value: < 2.2e-16
BASModel <- bas.lm(log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
Neighborhood + MS.Zoning + Total.Bsmt.SF + Lot.Shape +
Full.Bath + Garage.Area + Bldg.Type +
BsmtFin.Type.1 + log(area) + Land.Contour + Land.Slope +
House.Style + Exter.Cond + Foundation + Central.Air +
Wood.Deck.SF,
data = normal_sale_trainset, prior = "BIC",
modelprior = uniform())
summary(BASModel)
## P(B != 0 | Y) model 1 model 2
## Intercept 1.00000000 1.0000 1.0000000
## selling_diffwithin a year 0.99999959 1.0000 1.0000000
## selling_diffwithin fifteen years 0.99999952 1.0000 1.0000000
## selling_diffwithin five years 1.00000000 1.0000 1.0000000
## selling_diffwithin ten years 1.00000000 1.0000 1.0000000
## selling_diffwithin twenty years 0.88609650 1.0000 1.0000000
## Overall.Qual2 0.04812931 0.0000 0.0000000
## Overall.Qual3 0.09784014 0.0000 0.0000000
## Overall.Qual4 0.87769615 1.0000 1.0000000
## Overall.Qual5 0.99980060 1.0000 1.0000000
## Overall.Qual6 0.99999805 1.0000 1.0000000
## Overall.Qual7 0.99999659 1.0000 1.0000000
## Overall.Qual8 0.99999998 1.0000 1.0000000
## Overall.Qual9 1.00000000 1.0000 1.0000000
## Overall.Qual10 1.00000000 1.0000 1.0000000
## log(Lot.Area) 0.99999963 1.0000 1.0000000
## NeighborhoodBlueste 0.01033939 0.0000 0.0000000
## NeighborhoodBrDale 0.06589975 0.0000 0.0000000
## NeighborhoodBrkSide 0.02852850 0.0000 0.0000000
## NeighborhoodClearCr 0.07313565 0.0000 0.0000000
## NeighborhoodCollgCr 0.01548159 0.0000 0.0000000
## NeighborhoodCrawfor 0.99992619 1.0000 1.0000000
## NeighborhoodEdwards 0.99560735 1.0000 1.0000000
## NeighborhoodGilbert 0.01094286 0.0000 0.0000000
## NeighborhoodGreens 0.19237193 0.0000 0.0000000
## NeighborhoodGrnHill 1.00000000 1.0000 1.0000000
## NeighborhoodIDOTRR 0.01682174 0.0000 0.0000000
## NeighborhoodMeadowV 0.97228657 1.0000 1.0000000
## NeighborhoodMitchel 0.02030613 0.0000 0.0000000
## NeighborhoodNAmes 0.01027621 0.0000 0.0000000
## NeighborhoodNoRidge 0.22530214 0.0000 0.0000000
## NeighborhoodNPkVill 0.01346348 0.0000 0.0000000
## NeighborhoodNridgHt 0.15210024 0.0000 0.0000000
## NeighborhoodNWAmes 0.01088890 0.0000 0.0000000
## NeighborhoodOldTown 0.95340694 1.0000 1.0000000
## NeighborhoodSawyer 0.01167435 0.0000 0.0000000
## NeighborhoodSawyerW 0.16704455 0.0000 0.0000000
## NeighborhoodSomerst 0.01974906 0.0000 0.0000000
## NeighborhoodStoneBr 0.01064683 0.0000 0.0000000
## NeighborhoodSWISU 0.02140927 0.0000 0.0000000
## NeighborhoodTimber 0.01111017 0.0000 0.0000000
## NeighborhoodVeenker 0.28523548 0.0000 0.0000000
## MS.ZoningFV 1.00000000 1.0000 1.0000000
## MS.ZoningI (all) 0.01641798 0.0000 0.0000000
## MS.ZoningRH 0.98555459 1.0000 1.0000000
## MS.ZoningRL 1.00000000 1.0000 1.0000000
## MS.ZoningRM 0.99999999 1.0000 1.0000000
## Total.Bsmt.SF 1.00000000 1.0000 1.0000000
## Lot.ShapeIR2 0.01050720 0.0000 0.0000000
## Lot.ShapeIR3 0.11962395 0.0000 0.0000000
## Lot.ShapeReg 0.01071058 0.0000 0.0000000
## Full.Bath 0.01075173 0.0000 0.0000000
## Garage.Area 0.99999992 1.0000 1.0000000
## Bldg.Type2fmCon 0.01897129 0.0000 0.0000000
## Bldg.TypeDuplex 0.88199605 1.0000 1.0000000
## Bldg.TypeTwnhs 0.11231027 0.0000 0.0000000
## Bldg.TypeTwnhsE 0.02225229 0.0000 0.0000000
## BsmtFin.Type.1BLQ 0.99230412 1.0000 1.0000000
## BsmtFin.Type.1GLQ 0.02129025 0.0000 0.0000000
## BsmtFin.Type.1LwQ 0.99999884 1.0000 1.0000000
## BsmtFin.Type.1Rec 0.99756299 1.0000 1.0000000
## BsmtFin.Type.1Unf 1.00000000 1.0000 1.0000000
## BsmtFin.Type.1NB 0.99999372 1.0000 1.0000000
## log(area) 1.00000000 1.0000 1.0000000
## Land.ContourHLS 0.01127894 0.0000 0.0000000
## Land.ContourLow 0.01291185 0.0000 0.0000000
## Land.ContourLvl 0.01272651 0.0000 0.0000000
## Land.SlopeMod 0.63619798 1.0000 0.0000000
## Land.SlopeSev 0.75213440 1.0000 1.0000000
## House.Style1.5Unf 0.01071035 0.0000 0.0000000
## House.Style1Story 0.01080992 0.0000 0.0000000
## House.Style2.5Unf 0.01220615 0.0000 0.0000000
## House.Style2Story 0.01072292 0.0000 0.0000000
## House.StyleSFoyer 0.99841348 1.0000 1.0000000
## House.StyleSLvl 0.01222110 0.0000 0.0000000
## Exter.CondFa 0.99999052 1.0000 1.0000000
## Exter.CondGd 0.01374241 0.0000 0.0000000
## Exter.CondTA 0.01148239 0.0000 0.0000000
## FoundationCBlock 0.06866467 0.0000 0.0000000
## FoundationPConc 0.95939245 1.0000 1.0000000
## FoundationSlab 0.99864573 1.0000 1.0000000
## FoundationStone 0.03743004 0.0000 0.0000000
## Central.AirY 0.99999999 1.0000 1.0000000
## Wood.Deck.SF 0.99999446 1.0000 1.0000000
## BF NA 1.0000 0.5565378
## PostProbs NA 0.0102 0.0057000
## R2 NA 0.9351 0.9345000
## dim NA 40.0000 39.0000000
## logmarg NA -995.2711 -995.8571187
## model 3 model 4 model 5
## Intercept 1.0000000 1.0000000 1.0000000
## selling_diffwithin a year 1.0000000 1.0000000 1.0000000
## selling_diffwithin fifteen years 1.0000000 1.0000000 1.0000000
## selling_diffwithin five years 1.0000000 1.0000000 1.0000000
## selling_diffwithin ten years 1.0000000 1.0000000 1.0000000
## selling_diffwithin twenty years 1.0000000 1.0000000 1.0000000
## Overall.Qual2 0.0000000 0.0000000 0.0000000
## Overall.Qual3 0.0000000 0.0000000 0.0000000
## Overall.Qual4 1.0000000 1.0000000 1.0000000
## Overall.Qual5 1.0000000 1.0000000 1.0000000
## Overall.Qual6 1.0000000 1.0000000 1.0000000
## Overall.Qual7 1.0000000 1.0000000 1.0000000
## Overall.Qual8 1.0000000 1.0000000 1.0000000
## Overall.Qual9 1.0000000 1.0000000 1.0000000
## Overall.Qual10 1.0000000 1.0000000 1.0000000
## log(Lot.Area) 1.0000000 1.0000000 1.0000000
## NeighborhoodBlueste 0.0000000 0.0000000 0.0000000
## NeighborhoodBrDale 0.0000000 0.0000000 0.0000000
## NeighborhoodBrkSide 0.0000000 0.0000000 0.0000000
## NeighborhoodClearCr 0.0000000 0.0000000 0.0000000
## NeighborhoodCollgCr 0.0000000 0.0000000 0.0000000
## NeighborhoodCrawfor 1.0000000 1.0000000 1.0000000
## NeighborhoodEdwards 1.0000000 1.0000000 1.0000000
## NeighborhoodGilbert 0.0000000 0.0000000 0.0000000
## NeighborhoodGreens 0.0000000 0.0000000 0.0000000
## NeighborhoodGrnHill 1.0000000 1.0000000 1.0000000
## NeighborhoodIDOTRR 0.0000000 0.0000000 0.0000000
## NeighborhoodMeadowV 1.0000000 1.0000000 1.0000000
## NeighborhoodMitchel 0.0000000 0.0000000 0.0000000
## NeighborhoodNAmes 0.0000000 0.0000000 0.0000000
## NeighborhoodNoRidge 0.0000000 1.0000000 0.0000000
## NeighborhoodNPkVill 0.0000000 0.0000000 0.0000000
## NeighborhoodNridgHt 0.0000000 0.0000000 0.0000000
## NeighborhoodNWAmes 0.0000000 0.0000000 0.0000000
## NeighborhoodOldTown 1.0000000 1.0000000 1.0000000
## NeighborhoodSawyer 0.0000000 0.0000000 0.0000000
## NeighborhoodSawyerW 0.0000000 0.0000000 1.0000000
## NeighborhoodSomerst 0.0000000 0.0000000 0.0000000
## NeighborhoodStoneBr 0.0000000 0.0000000 0.0000000
## NeighborhoodSWISU 0.0000000 0.0000000 0.0000000
## NeighborhoodTimber 0.0000000 0.0000000 0.0000000
## NeighborhoodVeenker 1.0000000 0.0000000 0.0000000
## MS.ZoningFV 1.0000000 1.0000000 1.0000000
## MS.ZoningI (all) 0.0000000 0.0000000 0.0000000
## MS.ZoningRH 1.0000000 1.0000000 1.0000000
## MS.ZoningRL 1.0000000 1.0000000 1.0000000
## MS.ZoningRM 1.0000000 1.0000000 1.0000000
## Total.Bsmt.SF 1.0000000 1.0000000 1.0000000
## Lot.ShapeIR2 0.0000000 0.0000000 0.0000000
## Lot.ShapeIR3 0.0000000 0.0000000 0.0000000
## Lot.ShapeReg 0.0000000 0.0000000 0.0000000
## Full.Bath 0.0000000 0.0000000 0.0000000
## Garage.Area 1.0000000 1.0000000 1.0000000
## Bldg.Type2fmCon 0.0000000 0.0000000 0.0000000
## Bldg.TypeDuplex 1.0000000 1.0000000 1.0000000
## Bldg.TypeTwnhs 0.0000000 0.0000000 0.0000000
## Bldg.TypeTwnhsE 0.0000000 0.0000000 0.0000000
## BsmtFin.Type.1BLQ 1.0000000 1.0000000 1.0000000
## BsmtFin.Type.1GLQ 0.0000000 0.0000000 0.0000000
## BsmtFin.Type.1LwQ 1.0000000 1.0000000 1.0000000
## BsmtFin.Type.1Rec 1.0000000 1.0000000 1.0000000
## BsmtFin.Type.1Unf 1.0000000 1.0000000 1.0000000
## BsmtFin.Type.1NB 1.0000000 1.0000000 1.0000000
## log(area) 1.0000000 1.0000000 1.0000000
## Land.ContourHLS 0.0000000 0.0000000 0.0000000
## Land.ContourLow 0.0000000 0.0000000 0.0000000
## Land.ContourLvl 0.0000000 0.0000000 0.0000000
## Land.SlopeMod 1.0000000 1.0000000 1.0000000
## Land.SlopeSev 1.0000000 1.0000000 1.0000000
## House.Style1.5Unf 0.0000000 0.0000000 0.0000000
## House.Style1Story 0.0000000 0.0000000 0.0000000
## House.Style2.5Unf 0.0000000 0.0000000 0.0000000
## House.Style2Story 0.0000000 0.0000000 0.0000000
## House.StyleSFoyer 1.0000000 1.0000000 1.0000000
## House.StyleSLvl 0.0000000 0.0000000 0.0000000
## Exter.CondFa 1.0000000 1.0000000 1.0000000
## Exter.CondGd 0.0000000 0.0000000 0.0000000
## Exter.CondTA 0.0000000 0.0000000 0.0000000
## FoundationCBlock 0.0000000 0.0000000 0.0000000
## FoundationPConc 1.0000000 1.0000000 1.0000000
## FoundationSlab 1.0000000 1.0000000 1.0000000
## FoundationStone 0.0000000 0.0000000 0.0000000
## Central.AirY 1.0000000 1.0000000 1.0000000
## Wood.Deck.SF 1.0000000 1.0000000 1.0000000
## BF 0.5328753 0.3567992 0.3430534
## PostProbs 0.0055000 0.0036000 0.0035000
## R2 0.9356000 0.9355000 0.9355000
## dim 41.0000000 41.0000000 41.0000000
## logmarg -995.9005665 -996.3016806 -996.3409678
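The BF row in the summary above can be read as each model's Bayes factor relative to the highest-probability model, and it can be recovered from the logmarg row as exp(logmarg_k - logmarg_1). The following is a quick check that uses only the numbers printed above; the vector name `logmarg` is introduced here purely for illustration.

```r
# Log marginal likelihoods copied from the logmarg row of summary(BASModel)
logmarg <- c(model1 = -995.2711, model2 = -995.8571187, model3 = -995.9005665,
             model4 = -996.3016806, model5 = -996.3409678)

# Bayes factor of each model relative to model 1
bf <- exp(logmarg - logmarg[["model1"]])
round(bf, 4)
```

With a uniform model prior the posterior odds between two models equal their Bayes factor, which is consistent with the PostProbs row. The top model holds a posterior probability of only 0.0102, so the posterior mass is spread over many models and the marginal inclusion probabilities in the first column are more informative than any single model.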
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
Yes. The variables price, Lot.Area, and area were log-transformed because their distributions are strongly right-skewed. The histograms below compare each variable before and after the transformation.
par(mfrow=c(3,2))
hist(normal_sale_trainset$price, main = "Untransformed price")
hist(log(normal_sale_trainset$price), main = "Transformed price")
hist(normal_sale_trainset$Lot.Area, main = "Untransformed Lot.Area")
hist(log(normal_sale_trainset$Lot.Area), main = "Transformed Lot.Area")
hist(normal_sale_trainset$area, main = "Untransformed area")
hist(log(normal_sale_trainset$area), main = "Transformed area")
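The effect of the log transformation can also be put into a number. The sketch below uses simulated right-skewed data, not the Ames data, and a hand-written helper `sample_skew` (a hypothetical name) that computes the third standardised moment.

```r
# Simulated right-skewed values standing in for a price-like variable (assumption)
set.seed(1)
x <- rlnorm(1000, meanlog = 12, sdlog = 0.5)

# Sample skewness: third standardised moment (hypothetical helper)
sample_skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Skewness before and after the log transformation
c(raw = sample_skew(x), logged = sample_skew(log(x)))
```

A value near zero after the transformation indicates an approximately symmetric distribution, which is what the histograms are meant to show for price, Lot.Area, and area.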
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
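An ANOVA of the model with all two-way interactions was used to check for interaction effects. Several interaction terms have p-values below the 0.05 significance level (for example selling_diff:MS.Zoning and selling_diff:Central.Air), which suggests that interaction effects are present. However, I did not add interaction terms to the final model, and I decided not to remove the variables involved because doing so reduces the strength of the linear model.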
FinalModel.Inter <- lm(log(price) ~ (selling_diff + Overall.Qual + log(Lot.Area) +
MS.Zoning + Total.Bsmt.SF + Garage.Area +
BsmtFin.Type.1 + log(area) + Land.Slope +
Central.Air +Wood.Deck.SF)^2, data = normal_sale_trainset)
anova(FinalModel.Inter)
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## selling_diff 5 42.280 8.4559 973.6226 < 2.2e-16 ***
## Overall.Qual 9 45.326 5.0362 579.8696 < 2.2e-16 ***
## log(Lot.Area) 1 10.011 10.0108 1152.6525 < 2.2e-16 ***
## MS.Zoning 5 1.710 0.3419 39.3679 < 2.2e-16 ***
## Total.Bsmt.SF 1 3.384 3.3839 389.6288 < 2.2e-16 ***
## Garage.Area 1 1.544 1.5444 177.8183 < 2.2e-16 ***
## BsmtFin.Type.1 6 0.914 0.1524 17.5421 < 2.2e-16 ***
## log(area) 1 5.526 5.5263 636.3061 < 2.2e-16 ***
## Land.Slope 2 0.186 0.0931 10.7163 2.725e-05 ***
## Central.Air 1 0.742 0.7421 85.4425 < 2.2e-16 ***
## Wood.Deck.SF 1 0.231 0.2307 26.5660 3.570e-07 ***
## selling_diff:Overall.Qual 27 0.449 0.0166 1.9139 0.0039963 **
## selling_diff:log(Lot.Area) 5 0.043 0.0085 0.9818 0.4281789
## selling_diff:MS.Zoning 11 0.590 0.0536 6.1742 1.325e-09 ***
## selling_diff:Total.Bsmt.SF 5 0.186 0.0372 4.2786 0.0007958 ***
## selling_diff:Garage.Area 5 0.050 0.0099 1.1423 0.3367707
## selling_diff:BsmtFin.Type.1 28 0.645 0.0230 2.6533 1.232e-05 ***
## selling_diff:log(area) 5 0.171 0.0343 3.9468 0.0015926 **
## selling_diff:Land.Slope 5 0.089 0.0179 2.0590 0.0690629 .
## selling_diff:Central.Air 4 0.449 0.1121 12.9129 4.739e-10 ***
## selling_diff:Wood.Deck.SF 5 0.042 0.0083 0.9590 0.4423937
## Overall.Qual:log(Lot.Area) 8 0.170 0.0213 2.4499 0.0130484 *
## Overall.Qual:MS.Zoning 12 0.227 0.0189 2.1764 0.0116716 *
## Overall.Qual:Total.Bsmt.SF 7 0.092 0.0131 1.5107 0.1608973
## Overall.Qual:Garage.Area 7 0.285 0.0407 4.6881 4.043e-05 ***
## Overall.Qual:BsmtFin.Type.1 23 0.237 0.0103 1.1860 0.2506647
## Overall.Qual:log(area) 5 0.142 0.0284 3.2750 0.0063551 **
## Overall.Qual:Land.Slope 6 0.030 0.0051 0.5838 0.7433993
## Overall.Qual:Central.Air 3 0.024 0.0080 0.9251 0.4283239
## Overall.Qual:Wood.Deck.SF 5 0.101 0.0203 2.3332 0.0411095 *
## log(Lot.Area):MS.Zoning 3 0.081 0.0269 3.0926 0.0266505 *
## log(Lot.Area):Total.Bsmt.SF 1 0.015 0.0150 1.7271 0.1893338
## log(Lot.Area):Garage.Area 1 0.018 0.0175 2.0192 0.1558893
## log(Lot.Area):BsmtFin.Type.1 6 0.063 0.0105 1.2140 0.2972861
## log(Lot.Area):log(area) 1 0.002 0.0022 0.2512 0.6164683
## log(Lot.Area):Land.Slope 1 0.009 0.0087 1.0045 0.3166606
## log(Lot.Area):Central.Air 1 0.040 0.0399 4.5906 0.0325915 *
## log(Lot.Area):Wood.Deck.SF 1 0.006 0.0063 0.7217 0.3959607
## MS.Zoning:Total.Bsmt.SF 2 0.146 0.0728 8.3807 0.0002602 ***
## MS.Zoning:Garage.Area 2 0.002 0.0011 0.1250 0.8825388
## MS.Zoning:BsmtFin.Type.1 9 0.135 0.0150 1.7219 0.0810524 .
## MS.Zoning:log(area) 2 0.005 0.0026 0.2983 0.7421780
## MS.Zoning:Land.Slope 1 0.025 0.0248 2.8566 0.0915702 .
## MS.Zoning:Central.Air 1 0.004 0.0041 0.4720 0.4923493
## MS.Zoning:Wood.Deck.SF 2 0.003 0.0014 0.1660 0.8471228
## Total.Bsmt.SF:Garage.Area 1 0.005 0.0045 0.5185 0.4717991
## Total.Bsmt.SF:BsmtFin.Type.1 5 0.065 0.0129 1.4868 0.1922795
## Total.Bsmt.SF:log(area) 1 0.000 0.0003 0.0364 0.8487844
## Total.Bsmt.SF:Land.Slope 1 0.005 0.0049 0.5649 0.4526304
## Total.Bsmt.SF:Central.Air 1 0.000 0.0003 0.0379 0.8457343
## Total.Bsmt.SF:Wood.Deck.SF 1 0.001 0.0006 0.0692 0.7925932
## Garage.Area:BsmtFin.Type.1 6 0.060 0.0100 1.1502 0.3317824
## Garage.Area:log(area) 1 0.008 0.0079 0.9123 0.3399260
## Garage.Area:Land.Slope 1 0.000 0.0000 0.0013 0.9707515
## Garage.Area:Central.Air 1 0.004 0.0035 0.4050 0.5248022
## Garage.Area:Wood.Deck.SF 1 0.020 0.0203 2.3322 0.1273013
## BsmtFin.Type.1:log(area) 6 0.034 0.0057 0.6619 0.6805090
## BsmtFin.Type.1:Land.Slope 5 0.042 0.0084 0.9685 0.4364422
## BsmtFin.Type.1:Central.Air 5 0.018 0.0037 0.4243 0.8318220
## BsmtFin.Type.1:Wood.Deck.SF 5 0.030 0.0060 0.6926 0.6292506
## log(area):Land.Slope 1 0.010 0.0104 1.1951 0.2747862
## log(area):Central.Air 1 0.000 0.0002 0.0239 0.8770745
## log(area):Wood.Deck.SF 1 0.002 0.0023 0.2642 0.6074440
## Land.Slope:Wood.Deck.SF 1 0.024 0.0242 2.7896 0.0954556 .
## Central.Air:Wood.Deck.SF 1 0.006 0.0057 0.6552 0.4186050
## Residuals 544 4.725 0.0087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
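The table above is a sequential ANOVA, so each term is tested in the order in which it enters the model. A complementary check for a single interaction is a partial F-test between two nested models. The sketch below illustrates the idea on the built-in mtcars data, which is only a stand-in and unrelated to the Ames data; the object names are hypothetical.

```r
# Stand-in data: mtcars ships with R and is unrelated to the Ames data
main_only <- lm(log(mpg) ~ log(wt) + hp, data = mtcars)
with_inter <- update(main_only, . ~ . + log(wt):hp)

# Partial F-test: does the interaction improve on the main-effects model?
cmp <- anova(main_only, with_inter)
cmp
```

The second row of `cmp` reports the F statistic and p-value for the added interaction term, conditional on all main effects already being in the model.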
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
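Three approaches were used to select the variables: stepwise selection by AIC, stepwise selection by BIC, and Bayesian model averaging (BMA). AIC and BIC trade goodness of fit against model size by adding a penalty for each additional parameter, which guards against overfitting; BIC penalises additional parameters more heavily and therefore tends to give a smaller model. BMA averages over many candidate models and reports the marginal posterior inclusion probability of each coefficient. The final model uses the variables retained by the BIC step.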
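The only difference between the AIC and BIC criteria is the penalty per parameter: 2 for AIC and log(n) for BIC. This is why passing k = log(n) to stepAIC gives BIC-based selection. A minimal sketch on the built-in mtcars data (a stand-in, not the Ames model):

```r
# Stand-in model on built-in data (not the Ames model)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
n <- nobs(fit)

# AIC uses a penalty of 2 per parameter; BIC uses log(n) per parameter
aic <- AIC(fit)                # k = 2 by default
bic <- AIC(fit, k = log(n))    # same formula with the BIC penalty
all.equal(bic, BIC(fit))
```

Because log(n) exceeds 2 once n is 8 or more, BIC charges more for every extra coefficient than AIC does on any data set of realistic size.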
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
The coverage probability of the final model was calculated on the testing dataset ames_test: 93.45% of the actual home prices in ames_test fall within the 95% prediction interval produced by the final model.
Testing on out-of-sample data provides a metric other than \(R^2\), adjusted \(R^2\), and p-values for judging whether the model is adequate.
FinalModel <- lm(log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
MS.Zoning + Total.Bsmt.SF + Garage.Area +
BsmtFin.Type.1 + log(area) + Land.Slope +
Central.Air + Wood.Deck.SF, data = normal_sale_trainset)
summary(FinalModel)
##
## Call:
## lm(formula = log(price) ~ selling_diff + Overall.Qual + log(Lot.Area) +
## MS.Zoning + Total.Bsmt.SF + Garage.Area + BsmtFin.Type.1 +
## log(area) + Land.Slope + Central.Air + Wood.Deck.SF, data = normal_sale_trainset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52432 -0.06342 0.00539 0.06148 0.51963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.478e+00 1.646e-01 45.437 < 2e-16 ***
## selling_diffwithin a year 1.067e-01 1.872e-02 5.701 1.68e-08 ***
## selling_diffwithin fifteen years 9.580e-02 1.519e-02 6.307 4.69e-10 ***
## selling_diffwithin five years 1.345e-01 1.594e-02 8.438 < 2e-16 ***
## selling_diffwithin ten years 1.257e-01 1.589e-02 7.909 8.60e-15 ***
## selling_diffwithin twenty years 3.799e-02 1.337e-02 2.842 0.004592 **
## Overall.Qual2 1.126e-01 1.244e-01 0.905 0.365702
## Overall.Qual3 2.434e-01 1.219e-01 1.997 0.046131 *
## Overall.Qual4 3.021e-01 1.182e-01 2.556 0.010766 *
## Overall.Qual5 3.807e-01 1.184e-01 3.215 0.001355 **
## Overall.Qual6 4.601e-01 1.193e-01 3.857 0.000124 ***
## Overall.Qual7 5.499e-01 1.201e-01 4.581 5.37e-06 ***
## Overall.Qual8 6.734e-01 1.207e-01 5.581 3.27e-08 ***
## Overall.Qual9 8.548e-01 1.229e-01 6.954 7.40e-12 ***
## Overall.Qual10 9.028e-01 1.328e-01 6.797 2.08e-11 ***
## log(Lot.Area) 8.083e-02 1.028e-02 7.859 1.25e-14 ***
## MS.ZoningFV 3.647e-01 5.491e-02 6.642 5.72e-11 ***
## MS.ZoningI (all) -9.704e-02 1.236e-01 -0.785 0.432567
## MS.ZoningRH 1.660e-01 6.781e-02 2.447 0.014614 *
## MS.ZoningRL 3.219e-01 5.156e-02 6.243 6.98e-10 ***
## MS.ZoningRM 2.570e-01 5.197e-02 4.946 9.25e-07 ***
## Total.Bsmt.SF 1.195e-04 1.460e-05 8.186 1.06e-15 ***
## Garage.Area 1.391e-04 2.499e-05 5.568 3.53e-08 ***
## BsmtFin.Type.1BLQ -5.074e-02 1.551e-02 -3.272 0.001113 **
## BsmtFin.Type.1GLQ 1.591e-02 1.336e-02 1.191 0.234003
## BsmtFin.Type.1LwQ -1.057e-01 1.937e-02 -5.458 6.41e-08 ***
## BsmtFin.Type.1Rec -5.198e-02 1.518e-02 -3.425 0.000645 ***
## BsmtFin.Type.1Unf -1.052e-01 1.318e-02 -7.987 4.80e-15 ***
## BsmtFin.Type.1NB -7.967e-02 3.227e-02 -2.469 0.013756 *
## log(area) 3.623e-01 1.696e-02 21.358 < 2e-16 ***
## Land.SlopeMod 6.500e-02 2.118e-02 3.068 0.002224 **
## Land.SlopeSev 1.507e-01 6.829e-02 2.206 0.027646 *
## Central.AirY 1.566e-01 1.986e-02 7.882 1.06e-14 ***
## Wood.Deck.SF 1.507e-04 3.442e-05 4.377 1.36e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1097 on 800 degrees of freedom
## Multiple R-squared: 0.9207, Adjusted R-squared: 0.9174
## F-statistic: 281.5 on 33 and 800 DF, p-value: < 2.2e-16
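# 95% prediction intervals on the log scale, back-transformed to USD
pred_final_model <- predict(FinalModel, newdata = ames_test,
interval = "prediction")
pred_final_model <- exp(pred_final_model)
# share of test-set prices that fall inside their prediction interval
coverage_prob_final <- mean(ames_test$price > pred_final_model[, "lwr"] & ames_test$price < pred_final_model[, "upr"], na.rm = TRUE)
coverage_prob_final * 100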
## [1] 93.45088
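The coverage calculation above can be sanity-checked on simulated data where the model is known to be correct, in which case the coverage of a 95% prediction interval should be close to 95%. The sketch below is an illustration under that assumption; the data, coefficients, and object names are invented and are not the Ames data.

```r
set.seed(42)
# Simulated log-linear data (assumption: stands in for price vs. area)
n <- 2000
area <- runif(n, 500, 3000)
price <- exp(8 + 0.5 * log(area) + rnorm(n, sd = 0.15))
dat <- data.frame(area, price)

train <- dat[1:1000, ]
test <- dat[1001:2000, ]

fit <- lm(log(price) ~ log(area), data = train)

# 95% prediction interval on the log scale, back-transformed to dollars;
# exp() is monotone, so coverage is the same on either scale
pi_usd <- exp(predict(fit, newdata = test, interval = "prediction", level = 0.95))

# Share of held-out prices that fall inside their interval
coverage <- mean(test$price > pi_usd[, "lwr"] & test$price < pi_usd[, "upr"])
coverage
```

A coverage noticeably below the nominal level on real data would suggest that the model understates uncertainty, for example because of heteroskedasticity.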
For your final model, create and briefly interpret an informative plot of the residuals.
In the Residuals vs Fitted plot the residuals are scattered randomly around zero, so linearity can be assumed.
In the Normal Q-Q plot the residuals follow the straight line fairly closely.
In the Scale-Location plot the red line is approximately horizontal, which means that the average magnitude of the standardised residuals does not change much. However, the spread of the residuals varies across the range of fitted values, suggesting the presence of heteroskedasticity.
The Residuals vs Leverage plot labels the three most extreme points (90, 560, 611). However, none of them appears to be influential, as the Cook’s distance lines are not shown on the plot.
Cook’s distance identifies three observations that have both large residuals and high leverage: 53, 339, and 611. The model undervalues the sale price of these houses.
par(mfrow=c(2,2))
plot(FinalModel)
## Warning: not plotting observations with leverage one:
## 472, 763
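Leverage and Cook's distance can also be extracted numerically and not just read off the plot. The sketch below uses the built-in mtcars data as a stand-in for the Ames model, and the 4/n cut-off is a common rule of thumb assumed here for illustration, not a formal test.

```r
# Stand-in model on built-in data (not the Ames model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

cd <- cooks.distance(fit)   # influence of each observation on the fitted values
lev <- hatvalues(fit)       # leverage of each observation

# Rule of thumb (an assumption, not a formal test): flag D > 4 / n
flagged <- names(cd)[cd > 4 / nobs(fit)]
head(sort(cd, decreasing = TRUE), 3)
```

An observation with leverage one, as in the warning above, is fitted exactly by the model. This typically happens when it is the only observation at some factor level, and Cook's distance is not defined for such a point.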
For your final model, calculate and briefly comment on the RMSE.
The RMSE for the final model predicting on the testing dataset is USD 23,050.
predict_final_test <-exp(predict(FinalModel, ames_test))
resid_final_test <-ames_test$price - predict_final_test
rmse_final_test <-sqrt(mean(resid_final_test^2,na.rm=TRUE))
print(paste("The RMSE for the final model predicting on the testing dataset is USD", round(rmse_final_test)))
## [1] "The RMSE for the final model predicting on the testing dataset is USD 23050"
What are some strengths and weaknesses of your model?
The main strength of the final model is its acceptable out-of-sample error (an RMSE of USD 23,050 on the testing data). Its main weakness is heteroskedasticity, which makes the standard errors and prediction intervals less reliable and may reduce predictive ability.
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
load("~/Desktop/R Programming/Statistics_Coursera/Capstone/Peer_Assignment_II/ames_validation.Rdata")
The coverage probability of the final model on the validation dataset ames_validation is 93.63%, meaning that 93.63% of the actual home prices in ames_validation fall within the 95% prediction interval produced by the final model, as illustrated in Figure 5.
#create selling_diff variable & convert to factor class
ames_validation$selling_diff <- ames_validation$Yr.Sold - ames_validation$Year.Remod.Add
ames_validation$selling_diff <- case_when(
ames_validation$selling_diff <= 1 ~ "within a year",
ames_validation$selling_diff <= 5 ~ "within five years",
ames_validation$selling_diff <= 10 ~ "within ten years",
ames_validation$selling_diff <= 25 ~ "within fifteen years",
ames_validation$selling_diff <= 50 ~ "within twenty years",
TRUE ~ "more than twenty years")
ames_validation$selling_diff <- as.factor(ames_validation$selling_diff)
#clean Garage.Area & Total.Bsmt.SF (replace NA with 0)
ames_validation$Garage.Area[is.na(ames_validation$Garage.Area)] <- 0
ames_validation$Total.Bsmt.SF[is.na(ames_validation$Total.Bsmt.SF)] <- 0
#convert Overall.Qual to factor class
ames_validation$Overall.Qual <- as.factor(ames_validation$Overall.Qual)
#remove level "Landmrk" from Neighborhood
ames_validation <- ames_validation %>%
filter(Neighborhood != "Landmrk")
#remove the MS.Zoning "A (agr)" observation (row 387) from ames_validation
ames_validation <- ames_validation[-c(387), ]
predict_final_valid <-exp(predict(FinalModel, ames_validation))
resid_final_valid <-ames_validation$price - predict_final_valid
rmse_final_valid <-sqrt(mean(resid_final_valid^2,na.rm=TRUE))
print(paste("The RMSE for the final model predicting on the validating dataset is USD", round(rmse_final_valid)))
## [1] "The RMSE for the final model predicting on the validating dataset is USD 21610"
pred_final_cov_model <- predict(FinalModel, newdata = ames_validation,
interval = "prediction")
pred_final_cov_model <-exp(pred_final_cov_model)
coverage_cov_final <- mean(ames_validation$price > pred_final_cov_model[,"lwr"] & ames_validation$price < pred_final_cov_model [,"upr"], na.rm=TRUE)
coverage_cov_final * 100
## [1] 93.63144
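The cleaning steps above drop one neighborhood by name and one zoning class by a hard-coded row index, because predict() fails on factor levels the model never saw in training. A more general way to find and remove such rows is sketched below on toy data frames; the frames and column names are invented for illustration and are not the Ames data.

```r
# Toy data frames standing in for the training and validation sets (assumption)
train <- data.frame(zone = factor(c("RL", "RM", "RL", "FV")))
valid <- data.frame(zone = factor(c("RL", "A (agr)", "RM", "FV")),
                    price = c(150, 80, 120, 200))

# Levels present in the new data but never seen in training
unseen <- setdiff(levels(valid$zone), levels(train$zone))

# Keep only rows whose level the model has seen, then drop unused levels
valid_clean <- droplevels(valid[!valid$zone %in% unseen, , drop = FALSE])
unseen
nrow(valid_clean)
```

Filtering by level does not depend on row order, so it keeps working if the validation file is re-sorted or re-sampled, unlike a fixed row number.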
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
The final model performs better on the validation dataset ames_validation than on the testing dataset ames_test. This is evidenced by a slightly higher coverage of 93.63% (versus 93.45%) and a lower RMSE of 21,610 USD (versus 23,050 USD).
pred_final_cov_model <- as.data.frame(pred_final_cov_model)
saleprice <- ames_validation$price
ModelEval <- cbind(pred_final_cov_model, saleprice) %>%
mutate(coverage = ifelse(lwr < saleprice &
upr > saleprice, "yes", "no"))
ModelEval <- na.omit(ModelEval)
p3 <- ggplot(data = ModelEval, aes(x = fit, y = saleprice, color = coverage)) +
geom_point() +
theme_solarized() +
geom_line(aes(y=lwr), color = "red", linetype = "dashed") +
geom_line(aes(y=upr), color = "red", linetype = "dashed") +
labs(title = "Figure 5 - 95% prediction interval for sale price",
x = "Predicted values",
y = "Sale price")
ModelBind <- cbind(pred_final_cov_model, saleprice)
ModelPredict <- ModelBind %>%
mutate(value = case_when(ModelBind$fit < saleprice ~ "undervalued",
ModelBind$fit > saleprice ~ "overvalued",
TRUE ~ "fit"))
ModelPredict <- na.omit(ModelPredict)
# plyr::count() on a vector returns a data frame with columns x and freq
countpredict <- plyr::count(ModelPredict$value)
p4 <- ggplot(data = countpredict , aes(x = x, y = freq, fill = x)) +
geom_bar(stat = "identity", width = 0.5) +
theme_solarized() +
theme(legend.position="top") +
geom_text(aes(label = freq), color = "white", vjust=1.6) +
labs(title = "Figure 6 - Overvalued & Undervalued sale price prediction",
x = "",
y = "Number of houses",
fill = "")
p3
p4