As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.
library(statsr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.3
library(BAS)
## Warning: package 'BAS' was built under R version 3.4.4
library(MASS)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
load("ames_train.Rdata")
The first thing to do is to cut down on inventory that may pose too much difficulty or cost to deal with. To this end, include only those homes sold under “normal” sale conditions.
ames_train <- ames_train%>%
filter(Sale.Condition == "Normal")
Use the code block below to load any necessary packages
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
hist(ames_train$price, main = "Ames Price Distribution", labels = TRUE, col = "blue", las = 1, ylim = c(0, 400))
summary(ames_train$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39300 129000 155500 174622 205000 615000
boxplot(ames_train$price, col = "red", main = "Ames House Price Boxplot", ylab = "House Price")
ggplot(data = ames_train, aes(Neighborhood, color = Neighborhood, fill = Neighborhood)) +
geom_bar()+coord_flip()+
labs(title = "Neighborhood Counts")
# House age as of 2018, kept both as a vector and as a column of ames_train
house_age <- 2018 - ames_train$Year.Built
ames_train <- ames_train %>%
mutate(house_age = 2018 - Year.Built)
summary(house_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9 21 46 48 64 146
ggplot(data = ames_train, aes(house_age)) +
geom_histogram(binwidth = 4.7, color = "blue", fill = "yellow") +
geom_vline(xintercept = 43, size = .1, color = "red") +
geom_vline(xintercept = 17, size = .1, color = "green")+
geom_vline(xintercept = 45.8, size = .1, color = "blue")+
labs(title = "Table 1: House Age Histogram", x = "House Age from 2018", y = "Number of Houses")+
theme_linedraw()
ggplot(data = ames_train, aes(price, fill = Neighborhood)) +geom_histogram()+
labs(title = "Counts by Neighborhood and Price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
house_price<-ames_train %>%
group_by(Neighborhood) %>%
summarise(med_price = median(price), mean_price = mean(price), sd_price = sd(price), var_price = var(price),iqr_price = IQR(price), n = n())
house_price
## # A tibble: 27 x 7
## Neighborhood med_price mean_price sd_price var_price iqr_price n
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Blmngtn 192000. 198961. 24912. 620582881. 20130. 7
## 2 Blueste 123900. 125800. 10381. 107770000. 10250. 3
## 3 BrDale 100500. 100557. 13596. 184856190. 11450. 7
## 4 BrkSide 125250. 123733. 38466. 1479636208. 41394. 36
## 5 ClearCr 187500. 198273. 50054. 2505368182. 68000. 11
## 6 CollgCr 195000. 191878. 43178. 1864357269. 58500. 75
## 7 Crawfor 198000. 197296. 64847. 4205186233. 76100. 25
## 8 Edwards 125400. 130975. 53555. 2868098189. 40125. 50
## 9 Gilbert 184000. 191722. 36939. 1364520386. 22900. 36
## 10 Greens 212625. 198562. 29063. 844682292. 16438. 4
## # ... with 17 more rows
a<-ames_train%>%
group_by(Neighborhood)%>%
filter(house_age<=10)
a
## # A tibble: 9 x 82
## # Groups: Neighborhood [5]
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <int> <int> <int> <int> <fct> <int> <int> <fct>
## 1 9.16e8 1346 220000 20 RL 88 11896 Pave
## 2 9.06e8 1226 198900 20 RL 94 10402 Pave
## 3 5.28e8 1547 215200 60 FV 72 8640 Pave
## 4 9.21e8 2018 378500 20 RL 89 13214 Pave
## 5 5.28e8 1808 324000 20 RL 102 13514 Pave
## 6 9.05e8 1212 186000 20 RL 83 10420 Pave
## 7 5.28e8 2007 310000 60 FV 85 11003 Pave
## 8 5.28e8 1743 335000 20 RL 87 10367 Pave
## 9 5.28e8 2020 404000 20 RL 95 12350 Pave
## # ... with 74 more variables: Alley <fct>, Lot.Shape <fct>,
## # Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
## # Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
## # Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
## # Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
## # Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
## # Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
## # Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
## # BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## # Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
## # Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## # Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
## # TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
## # Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
## # Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
## # Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
## # Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>, house_age <dbl>
ggplot(data = a, aes(Neighborhood, fill = Neighborhood))+geom_bar()+
labs(title = "Neighborhood Counts for Houses Younger than Ten Years")
ggplot(data = ames_train, aes(price)) +
geom_histogram(binwidth = 50000, color = "blue", fill = "cyan") +
geom_vline(xintercept = 155500, size = 1.7, color = "black") +
geom_vline(xintercept = 174622, size = 1.4, color = "yellow")+
geom_vline(xintercept = 300000, size = 1.7, color = "red")+
labs(title = "Table 1: Ames Price with Median, Mean and Outliers", x = "Sale Price (dollars)", y = "Number of Houses")+
theme_linedraw()
ggplot(aes(y = price, x = Neighborhood, fill = Neighborhood), data = ames_train ) +
geom_boxplot() +
labs(title = "Table 2: Neighborhood vs Price") +coord_flip() +theme_dark()
ggplot(data = house_price, aes(x = sd_price, y = med_price ))+
geom_point(col = "blue", size = 2.2)+
labs(title = "Table 3: Neighborhood Standard Deviation vs Neighborhood Median Price", x = "Standard Deviation of Price by Neighborhood", y = "Median Price by Neighborhood")
cor(house_price$sd_price, house_price$mean_price)
## [1] 0.7793776
which.min(house_price$med_price)
## [1] 13
which.min(house_price$sd_price)
## [1] 2
which.min(house_price$iqr_price)
## [1] 2
which.max(house_price$med_price)
## [1] 18
which.max(house_price$sd_price)
## [1] 18
which.max(house_price$iqr_price)
## [1] 18
The above graphs bring to light a few things to take notice of. The price distribution is clearly right skewed, implying the mean is greater than the median. This is confirmed by the summary. This indicates the presence of high-priced, and possibly overvalued, homes in the housing pool. The good news is that both the mean and median fall in the same 50,000 dollar bin. They are in fact very close, being about 20,000 dollars apart. It is also evident that the majority, more than half, are within the 100,000 to 200,000 dollar range. This implies there are a good number of potential prospects for purchase.
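As a minimal sketch of the skewness point above, the following uses simulated log-normal prices (not the Ames data; the distribution parameters and the `skewness` helper are assumptions made for illustration) to show the mean exceeding the median under right skew, and a log transformation reducing that skew.

```r
# Simulated right-skewed "prices" (illustrative only, not the Ames data)
set.seed(42)
sim_price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Simple moment-based skewness; hypothetical helper, not from a package
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

mean(sim_price) > median(sim_price)  # right skew pulls the mean above the median
skewness(sim_price)                  # clearly positive
skewness(log(sim_price))             # much closer to zero after the log transform
```

This is one reason a model for log(price) is a natural candidate when the raw price is right skewed.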
The boxplot shows the presence of outliers in the set. Somewhere around 300,000 dollars the houses start moving away from the rest of the sample and become overpriced.
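The usual boxplot rule flags points more than 1.5 interquartile ranges beyond the quartiles. A quick sketch of that arithmetic, using the quartiles from the summary output above (boxplot() works with hinges, which can differ slightly from these quantiles, so the result should be read as approximate):

```r
# Quartiles copied from summary(ames_train$price) above
q1 <- 129000
q3 <- 205000
iqr <- q3 - q1

# Upper fence of the 1.5 * IQR rule
upper_fence <- q3 + 1.5 * iqr
upper_fence
```

The fence comes out at 319,000 dollars, in line with the rough 300,000 dollar boundary read off the plot.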
The first table is essentially a blend of the first two EDA graphs. It shows the overall price distribution and the corresponding values of interest: median, mean and outlier boundary. This points to price levels to consider when accepting or rejecting a property for strategic investment potential.
The second table gives a graphic look at the price spread per neighborhood. This gives a sense of ranges and median prices as well as outliers at both the low and high ends. This is important for deciding whether the mobility of the price in a neighborhood warrants investment.
The third table directly shows the relationship between the median price and the standard deviation of price per neighborhood. The size of the spread ties in with the presence of “deals”, i.e. undervalued properties in an area that may be primed for price increases. The most striking feature of this table is a clear positive linear relationship between a neighborhood's standard deviation and its price level; the correlation computed above, between the standard deviation and the mean price, is about 0.78. In short, the more expensive the neighborhood, the greater the spread of house prices around that neighborhood's average.
The extremes of the median, standard deviation and interquartile range across the various neighborhoods are as follows:
The smallest median price is in Meadow Village (MeadowV). The smallest standard deviation and interquartile range are in Bluestem (Blueste).
The largest of each are all in Northridge Heights (NridgHt).
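The which.min() and which.max() calls above return row positions rather than names. A small sketch of turning a position into a neighborhood name, using a made-up stand-in for the house_price table (the data frame `toy_price` and its values are hypothetical):

```r
# Toy stand-in for the house_price summary table (values are made up)
toy_price <- data.frame(
  Neighborhood = c("A", "B", "C"),
  med_price    = c(150000, 98000, 310000),
  sd_price     = c(40000, 12000, 95000),
  stringsAsFactors = FALSE
)

# Index the name column with the position returned by which.min()/which.max()
lowest_median  <- toy_price$Neighborhood[which.min(toy_price$med_price)]
highest_spread <- toy_price$Neighborhood[which.max(toy_price$sd_price)]

lowest_median
highest_spread
```

Applied to house_price, the same indexing would return the neighborhood codes directly instead of row numbers.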
NOTE: At this stage, for these variables, the presence of NAs did not impact the output, so cleaning was not necessary.
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)
Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
library(MASS)
model_train<-lm(log(price) ~ Overall.Qual + log(Garage.Area + 1) +
Neighborhood +log(area) + Full.Bath + Bedroom.AbvGr + Year.Built +
log(Lot.Area) + Central.Air + Overall.Cond,
data = ames_train)
summary(model_train)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual + log(Garage.Area + 1) +
## Neighborhood + log(area) + Full.Bath + Bedroom.AbvGr + Year.Built +
## log(Lot.Area) + Central.Air + Overall.Cond, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73166 -0.06308 0.00166 0.07012 0.51978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7457973 0.7560257 -2.309 0.021188 *
## Overall.Qual 0.0780147 0.0055068 14.167 < 2e-16 ***
## log(Garage.Area + 1) 0.0051522 0.0038302 1.345 0.178952
## NeighborhoodBlueste -0.0742720 0.0819874 -0.906 0.365265
## NeighborhoodBrDale -0.1907787 0.0646406 -2.951 0.003256 **
## NeighborhoodBrkSide -0.0629624 0.0545630 -1.154 0.248870
## NeighborhoodClearCr -0.0536285 0.0619106 -0.866 0.386627
## NeighborhoodCollgCr -0.1028692 0.0485217 -2.120 0.034308 *
## NeighborhoodCrawfor 0.0293478 0.0549179 0.534 0.593217
## NeighborhoodEdwards -0.1414351 0.0511227 -2.767 0.005796 **
## NeighborhoodGilbert -0.1456983 0.0505066 -2.885 0.004023 **
## NeighborhoodGreens 0.0944600 0.0745767 1.267 0.205662
## NeighborhoodGrnHill 0.2577918 0.0945396 2.727 0.006535 **
## NeighborhoodIDOTRR -0.1711145 0.0560396 -3.053 0.002337 **
## NeighborhoodMeadowV -0.1570883 0.0565996 -2.775 0.005642 **
## NeighborhoodMitchel -0.0694029 0.0508342 -1.365 0.172549
## NeighborhoodNAmes -0.0763612 0.0500166 -1.527 0.127227
## NeighborhoodNoRidge 0.0134845 0.0516248 0.261 0.794005
## NeighborhoodNPkVill -0.0436025 0.0740910 -0.588 0.556364
## NeighborhoodNridgHt 0.0745068 0.0497750 1.497 0.134822
## NeighborhoodNWAmes -0.1185988 0.0514253 -2.306 0.021353 *
## NeighborhoodOldTown -0.1220336 0.0544072 -2.243 0.025173 *
## NeighborhoodSawyer -0.0926175 0.0515403 -1.797 0.072715 .
## NeighborhoodSawyerW -0.1663309 0.0500255 -3.325 0.000925 ***
## NeighborhoodSomerst -0.0138265 0.0479958 -0.288 0.773363
## NeighborhoodStoneBr 0.0361535 0.0566160 0.639 0.523283
## NeighborhoodSWISU -0.0886177 0.0627484 -1.412 0.158260
## NeighborhoodTimber -0.0083022 0.0547359 -0.152 0.879479
## NeighborhoodVeenker -0.0288405 0.0618963 -0.466 0.641379
## log(area) 0.4935606 0.0244240 20.208 < 2e-16 ***
## Full.Bath -0.0011532 0.0123762 -0.093 0.925782
## Bedroom.AbvGr -0.0326804 0.0074310 -4.398 1.24e-05 ***
## Year.Built 0.0041093 0.0003562 11.538 < 2e-16 ***
## log(Lot.Area) 0.1498777 0.0117607 12.744 < 2e-16 ***
## Central.AirY 0.1003716 0.0216710 4.632 4.24e-06 ***
## Overall.Cond 0.0531754 0.0044419 11.971 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1167 on 798 degrees of freedom
## Multiple R-squared: 0.9105, Adjusted R-squared: 0.9066
## F-statistic: 232 on 35 and 798 DF, p-value: < 2.2e-16
In deciding on features for the model I made the following choices. Given the wide range of house ages overall and the fact that a substantial number of new houses are being built, Year.Built seems a reasonable variable to include: younger houses may need less work, and older houses are likely cheaper. In this set Neighborhood also plays a significant role in price, as there are large differences between neighborhoods in both median price and variance. However, the output reveals that quite a few of these neighborhoods and a few other predictors are in fact not statistically significant. The remaining variables were chosen as features it seems reasonable to ask about, such as Full.Bath, Central.Air and Overall.Qual.
Of the ten variables selected for this model, two were not statistically significant: log(Garage.Area + 1) and Full.Bath (which seems surprising). Neighborhood is a mixed bag: 16 of the 26 neighborhood indicators are not statistically significant at the 5% level (15 at the 10% level), each measured relative to the baseline neighborhood, Blmngtn. In a way this is good news, as it narrows the number of areas to be examined, but it could also hide potential for growth in an undervalued area. However, because some of the levels of Neighborhood are significant, we cannot drop the variable.
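Whether a multi-level factor such as Neighborhood belongs in a model is usually judged for the factor as a whole rather than one indicator at a time. A minimal sketch of that partial F-test, using the mtcars data that ships with R as a stand-in (the variable choices are assumptions for illustration, not the Ames model):

```r
# Illustrative data: mtcars ships with R; cyl is treated as a 3-level factor
cars_df <- transform(mtcars, cyl = factor(cyl))

full    <- lm(log(mpg) ~ wt + hp + cyl, data = cars_df)
reduced <- lm(log(mpg) ~ wt + hp, data = cars_df)

# Partial F-test: do the cyl indicators jointly improve the fit?
f_test <- anova(reduced, full)
f_test

# drop1() reports the same kind of test for every term in the model
drop1(full, test = "F")
```

For the model above, the analogous comparison would be the fit with and without Neighborhood, tested on 26 degrees of freedom.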
The R-squared value is quite high at 91%, meaning that about 91% of the variation in log(price) is explained by the variables in this model. As previously noted, however, there is a certain level of unnecessary complexity in the model, as indicated by the number of statistically non-significant variables. The addition of these variables represents an increase in potential overfitting, and the adjusted R-squared penalizes for this by lowering the reported degree of fit. Even with this penalty it remains near 91% (0.9066), so this model is on good footing taken as is.
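The size of the adjusted R-squared penalty can be checked by hand from the summary output above. A short sketch, with the observation count recovered from the reported degrees of freedom:

```r
# Values read from the summary output above
r2       <- 0.9105  # Multiple R-squared
df_resid <- 798     # residual degrees of freedom
p        <- 35      # number of estimated slope coefficients
n        <- df_resid + p + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of coefficients
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 4)
```

The result agrees with the reported adjusted R-squared to rounding, so the penalty for 35 coefficients on 834 observations is small.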
The p-value of the F-statistic is particularly low, meaning that the hypothesis that all slope coefficients are zero is rejected in favor of the alternative that at least one of the regression coefficients does not equal zero.
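The y-intercept is not a realistic figure in this case. It is negative on the log scale (-1.746), and back-transforming gives exp(-1.746), or about 0.17 dollars, which is not a viable value for a house and is far below the minimum observed price of 39,300 dollars. This is to be expected: the intercept describes a home in the baseline neighborhood with every predictor term equal to zero (for example Year.Built = 0), which lies far outside the range of the data, so it should not be interpreted on its own.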
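Because the response is log(price), the slope coefficients are easier to read once converted to percent changes in price. A sketch using three estimates copied from the summary output above (the choice of a 10% change in area is an arbitrary illustration):

```r
# Coefficients copied from the summary output above
b_area <- 0.4935606  # log(area): log-log term, so an elasticity
b_qual <- 0.0780147  # Overall.Qual: per one-point increase
b_air  <- 0.1003716  # Central.AirY: central air versus none

# Percent change in price for a 10% increase in living area
pct_area <- (1.10^b_area - 1) * 100

# Percent change in price for a one-unit change in the predictor
pct_qual <- (exp(b_qual) - 1) * 100
pct_air  <- (exp(b_air) - 1) * 100

round(c(area_10pct = pct_area, qual_1pt = pct_qual, central_air = pct_air), 1)
```

Holding the other predictors fixed, these work out to roughly 4.8%, 8.1% and 10.6% respectively.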
Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
#AIC backward selection
AIC_back<-step(model_train, direction = "backward")
## Start: AIC=-3547.66
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood +
## log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) +
## Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## - Full.Bath 1 0.0001 10.871 -3549.6
## - log(Garage.Area + 1) 1 0.0247 10.896 -3547.8
## <none> 10.871 -3547.7
## - Bedroom.AbvGr 1 0.2635 11.135 -3529.7
## - Central.Air 1 0.2922 11.163 -3527.5
## - Neighborhood 26 2.5528 13.424 -3423.7
## - Year.Built 1 1.8136 12.685 -3421.0
## - Overall.Cond 1 1.9523 12.823 -3411.9
## - log(Lot.Area) 1 2.2125 13.084 -3395.2
## - Overall.Qual 1 2.7342 13.605 -3362.6
## - log(area) 1 5.5632 16.434 -3205.0
##
## Step: AIC=-3549.65
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood +
## log(area) + Bedroom.AbvGr + Year.Built + log(Lot.Area) +
## Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## - log(Garage.Area + 1) 1 0.0250 10.896 -3549.7
## <none> 10.871 -3549.6
## - Bedroom.AbvGr 1 0.2750 11.146 -3530.8
## - Central.Air 1 0.2958 11.167 -3529.3
## - Neighborhood 26 2.5617 13.433 -3425.2
## - Year.Built 1 1.8936 12.765 -3417.7
## - Overall.Cond 1 1.9527 12.824 -3413.9
## - log(Lot.Area) 1 2.2146 13.086 -3397.0
## - Overall.Qual 1 2.7387 13.610 -3364.3
## - log(area) 1 6.3141 17.185 -3169.7
##
## Step: AIC=-3549.73
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
## Year.Built + log(Lot.Area) + Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## <none> 10.896 -3549.7
## - Bedroom.AbvGr 1 0.2838 11.180 -3530.3
## - Central.Air 1 0.3532 11.250 -3525.1
## - Neighborhood 26 2.6129 13.509 -3422.5
## - Overall.Cond 1 1.9391 12.835 -3415.1
## - Year.Built 1 1.9794 12.876 -3412.5
## - log(Lot.Area) 1 2.3282 13.225 -3390.2
## - Overall.Qual 1 2.7611 13.657 -3363.4
## - log(area) 1 6.4673 17.364 -3163.1
#AIC forward selection
AIC_forw<-step(model_train, direction = "forward")
## Start: AIC=-3547.66
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood +
## log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) +
## Central.Air + Overall.Cond
summary(AIC_back)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) +
## Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air +
## Overall.Cond, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72173 -0.06326 0.00098 0.06980 0.53260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.8630744 0.7262336 -2.565 0.010487 *
## Overall.Qual 0.0783002 0.0054995 14.238 < 2e-16 ***
## NeighborhoodBlueste -0.0692021 0.0814672 -0.849 0.395887
## NeighborhoodBrDale -0.1872665 0.0640904 -2.922 0.003577 **
## NeighborhoodBrkSide -0.0603686 0.0543550 -1.111 0.267059
## NeighborhoodClearCr -0.0539172 0.0618208 -0.872 0.383387
## NeighborhoodCollgCr -0.1030812 0.0483872 -2.130 0.033448 *
## NeighborhoodCrawfor 0.0302845 0.0547911 0.553 0.580605
## NeighborhoodEdwards -0.1443631 0.0507947 -2.842 0.004596 **
## NeighborhoodGilbert -0.1475456 0.0504250 -2.926 0.003530 **
## NeighborhoodGreens 0.0967099 0.0742574 1.302 0.193168
## NeighborhoodGrnHill 0.2542540 0.0943951 2.694 0.007218 **
## NeighborhoodIDOTRR -0.1726389 0.0558741 -3.090 0.002072 **
## NeighborhoodMeadowV -0.1642407 0.0558941 -2.938 0.003394 **
## NeighborhoodMitchel -0.0690971 0.0505470 -1.367 0.172014
## NeighborhoodNAmes -0.0748871 0.0495134 -1.512 0.130811
## NeighborhoodNoRidge 0.0127871 0.0514149 0.249 0.803653
## NeighborhoodNPkVill -0.0409370 0.0740100 -0.553 0.580330
## NeighborhoodNridgHt 0.0736187 0.0496769 1.482 0.138748
## NeighborhoodNWAmes -0.1185271 0.0513334 -2.309 0.021199 *
## NeighborhoodOldTown -0.1200146 0.0543088 -2.210 0.027398 *
## NeighborhoodSawyer -0.0918229 0.0510502 -1.799 0.072446 .
## NeighborhoodSawyerW -0.1671598 0.0499781 -3.345 0.000862 ***
## NeighborhoodSomerst -0.0138799 0.0479576 -0.289 0.772335
## NeighborhoodStoneBr 0.0347125 0.0565899 0.613 0.539783
## NeighborhoodSWISU -0.0895647 0.0627063 -1.428 0.153590
## NeighborhoodTimber -0.0100021 0.0547074 -0.183 0.854978
## NeighborhoodVeenker -0.0297807 0.0615042 -0.484 0.628372
## log(area) 0.4959963 0.0227621 21.790 < 2e-16 ***
## Bedroom.AbvGr -0.0332892 0.0072932 -4.564 5.80e-06 ***
## Year.Built 0.0041612 0.0003452 12.055 < 2e-16 ***
## log(Lot.Area) 0.1521561 0.0116379 13.074 < 2e-16 ***
## Central.AirY 0.1071024 0.0210307 5.093 4.41e-07 ***
## Overall.Cond 0.0529576 0.0044383 11.932 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1167 on 800 degrees of freedom
## Multiple R-squared: 0.9103, Adjusted R-squared: 0.9066
## F-statistic: 246 on 33 and 800 DF, p-value: < 2.2e-16
summary(AIC_forw)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual + log(Garage.Area + 1) +
## Neighborhood + log(area) + Full.Bath + Bedroom.AbvGr + Year.Built +
## log(Lot.Area) + Central.Air + Overall.Cond, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73166 -0.06308 0.00166 0.07012 0.51978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7457973 0.7560257 -2.309 0.021188 *
## Overall.Qual 0.0780147 0.0055068 14.167 < 2e-16 ***
## log(Garage.Area + 1) 0.0051522 0.0038302 1.345 0.178952
## NeighborhoodBlueste -0.0742720 0.0819874 -0.906 0.365265
## NeighborhoodBrDale -0.1907787 0.0646406 -2.951 0.003256 **
## NeighborhoodBrkSide -0.0629624 0.0545630 -1.154 0.248870
## NeighborhoodClearCr -0.0536285 0.0619106 -0.866 0.386627
## NeighborhoodCollgCr -0.1028692 0.0485217 -2.120 0.034308 *
## NeighborhoodCrawfor 0.0293478 0.0549179 0.534 0.593217
## NeighborhoodEdwards -0.1414351 0.0511227 -2.767 0.005796 **
## NeighborhoodGilbert -0.1456983 0.0505066 -2.885 0.004023 **
## NeighborhoodGreens 0.0944600 0.0745767 1.267 0.205662
## NeighborhoodGrnHill 0.2577918 0.0945396 2.727 0.006535 **
## NeighborhoodIDOTRR -0.1711145 0.0560396 -3.053 0.002337 **
## NeighborhoodMeadowV -0.1570883 0.0565996 -2.775 0.005642 **
## NeighborhoodMitchel -0.0694029 0.0508342 -1.365 0.172549
## NeighborhoodNAmes -0.0763612 0.0500166 -1.527 0.127227
## NeighborhoodNoRidge 0.0134845 0.0516248 0.261 0.794005
## NeighborhoodNPkVill -0.0436025 0.0740910 -0.588 0.556364
## NeighborhoodNridgHt 0.0745068 0.0497750 1.497 0.134822
## NeighborhoodNWAmes -0.1185988 0.0514253 -2.306 0.021353 *
## NeighborhoodOldTown -0.1220336 0.0544072 -2.243 0.025173 *
## NeighborhoodSawyer -0.0926175 0.0515403 -1.797 0.072715 .
## NeighborhoodSawyerW -0.1663309 0.0500255 -3.325 0.000925 ***
## NeighborhoodSomerst -0.0138265 0.0479958 -0.288 0.773363
## NeighborhoodStoneBr 0.0361535 0.0566160 0.639 0.523283
## NeighborhoodSWISU -0.0886177 0.0627484 -1.412 0.158260
## NeighborhoodTimber -0.0083022 0.0547359 -0.152 0.879479
## NeighborhoodVeenker -0.0288405 0.0618963 -0.466 0.641379
## log(area) 0.4935606 0.0244240 20.208 < 2e-16 ***
## Full.Bath -0.0011532 0.0123762 -0.093 0.925782
## Bedroom.AbvGr -0.0326804 0.0074310 -4.398 1.24e-05 ***
## Year.Built 0.0041093 0.0003562 11.538 < 2e-16 ***
## log(Lot.Area) 0.1498777 0.0117607 12.744 < 2e-16 ***
## Central.AirY 0.1003716 0.0216710 4.632 4.24e-06 ***
## Overall.Cond 0.0531754 0.0044419 11.971 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1167 on 798 degrees of freedom
## Multiple R-squared: 0.9105, Adjusted R-squared: 0.9066
## F-statistic: 232 on 35 and 798 DF, p-value: < 2.2e-16
model_train_AIC<-lm(log(price) ~ Overall.Qual + log(Garage.Area + 1) +
Neighborhood +log(area) + Full.Bath + Bedroom.AbvGr + Year.Built +
log(Lot.Area) + Central.Air + Overall.Cond,
data = ames_train)
# Model selection using AIC
model_train_AIC <- stepAIC(model_train_AIC, k = 2)
## Start: AIC=-3547.66
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood +
## log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + log(Lot.Area) +
## Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## - Full.Bath 1 0.0001 10.871 -3549.6
## - log(Garage.Area + 1) 1 0.0247 10.896 -3547.8
## <none> 10.871 -3547.7
## - Bedroom.AbvGr 1 0.2635 11.135 -3529.7
## - Central.Air 1 0.2922 11.163 -3527.5
## - Neighborhood 26 2.5528 13.424 -3423.7
## - Year.Built 1 1.8136 12.685 -3421.0
## - Overall.Cond 1 1.9523 12.823 -3411.9
## - log(Lot.Area) 1 2.2125 13.084 -3395.2
## - Overall.Qual 1 2.7342 13.605 -3362.6
## - log(area) 1 5.5632 16.434 -3205.0
##
## Step: AIC=-3549.65
## log(price) ~ Overall.Qual + log(Garage.Area + 1) + Neighborhood +
## log(area) + Bedroom.AbvGr + Year.Built + log(Lot.Area) +
## Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## - log(Garage.Area + 1) 1 0.0250 10.896 -3549.7
## <none> 10.871 -3549.6
## - Bedroom.AbvGr 1 0.2750 11.146 -3530.8
## - Central.Air 1 0.2958 11.167 -3529.3
## - Neighborhood 26 2.5617 13.433 -3425.2
## - Year.Built 1 1.8936 12.765 -3417.7
## - Overall.Cond 1 1.9527 12.824 -3413.9
## - log(Lot.Area) 1 2.2146 13.086 -3397.0
## - Overall.Qual 1 2.7387 13.610 -3364.3
## - log(area) 1 6.3141 17.185 -3169.7
##
## Step: AIC=-3549.73
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
## Year.Built + log(Lot.Area) + Central.Air + Overall.Cond
##
## Df Sum of Sq RSS AIC
## <none> 10.896 -3549.7
## - Bedroom.AbvGr 1 0.2838 11.180 -3530.3
## - Central.Air 1 0.3532 11.250 -3525.1
## - Neighborhood 26 2.6129 13.509 -3422.5
## - Overall.Cond 1 1.9391 12.835 -3415.1
## - Year.Built 1 1.9794 12.876 -3412.5
## - log(Lot.Area) 1 2.3282 13.225 -3390.2
## - Overall.Qual 1 2.7611 13.657 -3363.4
## - log(area) 1 6.4673 17.364 -3163.1
model_train_AIC
##
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) +
## Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air +
## Overall.Cond, data = ames_train)
##
## Coefficients:
## (Intercept) Overall.Qual NeighborhoodBlueste
## -1.863074 0.078300 -0.069202
## NeighborhoodBrDale NeighborhoodBrkSide NeighborhoodClearCr
## -0.187266 -0.060369 -0.053917
## NeighborhoodCollgCr NeighborhoodCrawfor NeighborhoodEdwards
## -0.103081 0.030285 -0.144363
## NeighborhoodGilbert NeighborhoodGreens NeighborhoodGrnHill
## -0.147546 0.096710 0.254254
## NeighborhoodIDOTRR NeighborhoodMeadowV NeighborhoodMitchel
## -0.172639 -0.164241 -0.069097
## NeighborhoodNAmes NeighborhoodNoRidge NeighborhoodNPkVill
## -0.074887 0.012787 -0.040937
## NeighborhoodNridgHt NeighborhoodNWAmes NeighborhoodOldTown
## 0.073619 -0.118527 -0.120015
## NeighborhoodSawyer NeighborhoodSawyerW NeighborhoodSomerst
## -0.091823 -0.167160 -0.013880
## NeighborhoodStoneBr NeighborhoodSWISU NeighborhoodTimber
## 0.034712 -0.089565 -0.010002
## NeighborhoodVeenker log(area) Bedroom.AbvGr
## -0.029781 0.495996 -0.033289
## Year.Built log(Lot.Area) Central.AirY
## 0.004161 0.152156 0.107102
## Overall.Cond
## 0.052958
BIC(AIC_back)
## [1] -1015.525
BIC(AIC_forw)
## [1] -1003.996
Forward and backward selection were both assessed with the Akaike Information Criterion (AIC), and the resulting models are nearly identical; both have the same adjusted R-squared (0.9066). Note that step() with direction = "forward" and no scope argument has no terms available to add, so it returns the starting ten-variable model unchanged. To further assess the model, stepAIC() was run; with no scope it searches backward and arrives at the same model as the backward selection. The AIC scores are still fairly close (-3549.73 versus -3547.66), so as a further check the Bayesian Information Criterion (BIC) was computed, and the lower score was obtained by the backward selection model (-1015.5 versus -1004.0).
Since AIC and BIC agree on the backward model, that will be the one selected. It has 8 variables instead of 10, omitting log(Garage.Area + 1) and Full.Bath.
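For comparison, a forward search that can actually add terms needs a scope, and a BIC-style search needs only a different penalty. A minimal sketch on the mtcars data that ships with R (the data set and predictors are stand-ins chosen for illustration, not the Ames variables):

```r
# Illustrative data only: mtcars ships with R
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Forward selection needs a scope to know which terms it may add
fwd_aic <- step(null_fit,
                scope = list(lower = ~ 1, upper = ~ wt + hp + qsec + drat),
                direction = "forward", trace = 0)

# Without a scope there is nothing to add, so the start model is returned
fwd_noscope <- step(full_fit, direction = "forward", trace = 0)

# Setting k = log(n) makes step() penalize like BIC instead of AIC
back_bic <- step(full_fit, direction = "backward",
                 k = log(nrow(mtcars)), trace = 0)

attr(terms(fwd_aic), "term.labels")
attr(terms(fwd_noscope), "term.labels")
attr(terms(back_bic), "term.labels")
```

Starting the forward search from an intercept-only model and the backward search from the full model gives two genuinely different search paths to compare.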
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
model_price_select<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_train)
par(mfrow = c(2,2))
plot(model_price_select)
hist(model_price_select$residuals, col = "maroon", las = 1, main = "Selected Model Residual Histogram", labels = TRUE, ylim = c(0, 325))
The assumptions of the model rest on four properties:
Normality of the residuals is assessed with the histogram and the Normal Q-Q plot. Both show the residuals to be nearly normal; the Q-Q plot shows some deviation from normality in the tails, driven by the labeled outliers.
Linearity is assessed in the first graph, Residuals vs Fitted. The residuals scatter around zero and there is no obvious shape or pattern to the scatter. The nearly flat red line suggests that the linear form of the model is adequate. Again there are a few outliers, but the majority of points agree with the premise of linearity.
Homoscedasticity refers to the constancy of the variance of the residuals. This is demonstrated by the first graph and the scatter of the points. The majority of the points fall between -0.5 and 0.5, with almost nothing outside this region, indicating a fairly constant variance of the residuals. The Scale-Location graph also shows the spread of the residuals against the fitted values. Since there is no pattern and the red line is almost flat, with a little dip, the model appears to have good homoscedasticity.
Independence means that the observations were randomly sampled and that the selection of any particular observation has no bearing on another. Since there is no direct test for this, we have to reason from first principles. The training set is a random partition of the full Ames data set, and the data show no ordering or structure that would indicate meaningful subsetting, so any subsection can be treated as a simple random sample (SRS) of the entire pool. The observations are therefore assumed to be independent.
Residuals vs Leverage. The fourth graph helps clarify the nature of the outliers with respect to their leverage. It has been established that there are outliers present, using a number of different graphical and analytical tools so far. The question remains whether they should be left in or omitted to improve the model and its performance. Analytically this is expressed by the Cook’s distance: if the distance is less than one then the point, outlier or not, is not influential, and its presence does not have a significant impact on the regression. As the graph shows, all of the points are well within this range and can be included in the final model without undue impact. Therefore, no outliers need be deleted from the model.
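The Cook's distance threshold described above can also be checked numerically instead of by eye. A minimal sketch on a stand-in model fitted to mtcars (which ships with R); the same calls would apply to model_price_select:

```r
# Stand-in model on mtcars (illustrative only)
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# One Cook's distance per observation
cd <- cooks.distance(fit)

max(cd)                               # largest influence in the sample
names(cd)[cd > 1]                     # observations above the threshold of 1, if any
head(sort(cd, decreasing = TRUE), 3)  # the three most influential points
```

An empty result from the second line is the numeric counterpart of every point lying inside the Cook's distance contours on the plot.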
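Therefore the diagnostic conditions for the linear model appear to be reasonably met, and the selected model is more parsimonious than the initial one. Note that these plots do not test for multicollinearity, which would need a separate check.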
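Multicollinearity is usually examined with variance inflation factors (VIFs) rather than residual plots. A base R sketch for numeric predictors, using mtcars columns as stand-ins (for a factor such as Neighborhood a generalized VIF would be needed, which this sketch does not cover):

```r
# Illustrative numeric predictors from mtcars (ships with R)
X <- mtcars[, c("wt", "hp", "disp")]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif, 2)
```

Values close to 1 indicate little overlap between predictors; a common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.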
You can calculate the RMSE directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
# Extract Predictions
predict.full <- exp(predict(model_price_select, ames_train))
# Extract Residuals
resid.full <- ames_train$price - predict.full
# Calculate RMSE
rmse.full <- sqrt(mean(resid.full^2))
rmse.full
## [1] 22665.42
The root mean square error (RMSE) is a measure of the standard deviation of the residuals. It is an absolute measure of fit that shows how far the data are spread around the regression line: the smaller it is, the better the fit. For the purpose of prediction this is the better measure.
This model has an RMSE of 22,665.42 dollars. This value is **13% of the mean home price and 15% of the median home price**. This is not an inordinate amount to vary by when purchasing a house, so the fit is relatively good.
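The comparison of RMSE with the mean and median price can be wrapped in a small helper. This is a sketch only: rmse_dollars is a hypothetical function name, and the prices and predictions below are made up for illustration rather than taken from the Ames data.

```r
# Hypothetical helper: RMSE on the original (dollar) scale for a model
# fitted to log(price); log_pred holds predictions on the log scale
rmse_dollars <- function(actual, log_pred) {
  sqrt(mean((actual - exp(log_pred))^2))
}

# Made-up prices and log-scale predictions, for illustration only
actual_demo   <- c(100000, 150000, 200000)
log_pred_demo <- log(c(110000, 140000, 200000))

rmse_demo <- rmse_dollars(actual_demo, log_pred_demo)

# RMSE expressed as a fraction of the mean and of the median price
rmse_demo / mean(actual_demo)
rmse_demo / median(actual_demo)
```

Back-transforming with exp() before taking residuals is what keeps the result in dollars rather than in log units.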
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
model_price_select_test<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_test)
summary(model_price_select_test)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) +
## Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air +
## Overall.Cond, data = ames_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55288 -0.06329 0.00404 0.07388 0.33664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3197284 0.6920407 -0.462 0.644204
## Overall.Qual 0.0788674 0.0054303 14.524 < 2e-16 ***
## NeighborhoodBlueste -0.0506033 0.0640896 -0.790 0.430017
## NeighborhoodBrDale -0.1174224 0.0577298 -2.034 0.042288 *
## NeighborhoodBrkSide -0.1014693 0.0520253 -1.950 0.051486 .
## NeighborhoodClearCr -0.0290978 0.0571392 -0.509 0.610724
## NeighborhoodCollgCr -0.0645388 0.0457340 -1.411 0.158589
## NeighborhoodCrawfor 0.0299842 0.0503906 0.595 0.551992
## NeighborhoodEdwards -0.1592750 0.0488420 -3.261 0.001158 **
## NeighborhoodGilbert -0.1418487 0.0472388 -3.003 0.002760 **
## NeighborhoodIDOTRR -0.1385458 0.0558523 -2.481 0.013326 *
## NeighborhoodLandmrk -0.1396342 0.1251099 -1.116 0.264725
## NeighborhoodMeadowV -0.1028101 0.0580819 -1.770 0.077101 .
## NeighborhoodMitchel -0.0981988 0.0513926 -1.911 0.056400 .
## NeighborhoodNAmes -0.0931867 0.0469399 -1.985 0.047467 *
## NeighborhoodNoRidge 0.0797727 0.0500766 1.593 0.111560
## NeighborhoodNPkVill -0.0297085 0.0583926 -0.509 0.611055
## NeighborhoodNridgHt 0.0743417 0.0485020 1.533 0.125739
## NeighborhoodNWAmes -0.1182678 0.0486513 -2.431 0.015283 *
## NeighborhoodOldTown -0.1464827 0.0507662 -2.885 0.004016 **
## NeighborhoodSawyer -0.1113037 0.0498445 -2.233 0.025829 *
## NeighborhoodSawyerW -0.1188930 0.0483722 -2.458 0.014191 *
## NeighborhoodSomerst -0.0151898 0.0457289 -0.332 0.739849
## NeighborhoodStoneBr 0.0533240 0.0538454 0.990 0.322325
## NeighborhoodSWISU -0.0882204 0.0579430 -1.523 0.128278
## NeighborhoodTimber -0.0679612 0.0539005 -1.261 0.207733
## NeighborhoodVeenker -0.0007065 0.0819207 -0.009 0.993121
## log(area) 0.5112374 0.0228948 22.330 < 2e-16 ***
## Bedroom.AbvGr -0.0362240 0.0069766 -5.192 2.65e-07 ***
## Year.Built 0.0033515 0.0003334 10.054 < 2e-16 ***
## log(Lot.Area) 0.1533616 0.0118716 12.918 < 2e-16 ***
## Central.AirY 0.0723886 0.0204066 3.547 0.000412 ***
## Overall.Cond 0.0481565 0.0044583 10.801 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1176 on 784 degrees of freedom
## Multiple R-squared: 0.9011, Adjusted R-squared: 0.8971
## F-statistic: 223.3 on 32 and 784 DF, p-value: < 2.2e-16
# Extract Predictions
predict_test <- exp(predict(model_price_select_test, ames_test))
# Extract Residuals
resid_test <- ames_test$price - predict_test
# Calculate RMSE
rmse_test<- sqrt(mean(resid_test^2))
rmse_test
## [1] 23197.04
# Predict prices
predict_test_select <- exp(predict(model_price_select_test, ames_test, interval = "prediction"))
# Calculate proportion of observations that fall within prediction intervals
coverage_prob_test_select <- mean(ames_test$price > predict_test_select[,"lwr"] &
ames_test$price < predict_test_select[,"upr"])
coverage_prob_test_select
## [1] 0.9522644
model_price_select_test<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_train)
# Predict prices
predict_train<- exp(predict(model_price_select_test, ames_train, interval = "prediction"))
# Calculate proportion of observations that fall within prediction intervals
coverage_prob_train <- mean(ames_train$price > predict_train[,"lwr"] &
ames_train$price < predict_train[,"upr"])
coverage_prob_train
## [1] 0.9580336
We can see from the application of the model to the test data that there is a difference in how well it fits. The p-value assures us that the model is significant; however, the R-squared and adjusted R-squared values have changed. Both are lower, but not drastically: the R-squared value is still approximately 90% and the adjusted R-squared value has dropped to 89.7%. Neither is a big drop, so the model still fits rather well.
As expected, the root mean square error has gone up, but only by 2.3% (from 22,665.42 to 23,197.04 dollars), so the absolute fit of this model is on par with the previous one. The difference in RMSE is less than 1,000 dollars.
**As we can see, the model performs slightly better on the training data than it does on the test data, but the two results are so close that we do not have to be concerned at this point with the possibility of overfitting. In one sense we can say the model fits both sets “equally well.”**
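One way to make the in-sample versus out-of-sample comparison is to fit the model once on the training data and then score that same fitted object on the test data, without refitting. The sketch below is a minimal illustration on simulated data. The data frame, the variable names and the 50/50 split are all made up for the example and are not the Ames data; rmse_on is a hypothetical helper.

```r
set.seed(1)

# Simulated stand-in data: log(price) depends linearly on log(area)
n_demo <- 200
demo <- data.frame(area = runif(n_demo, 800, 3000))
demo$price <- exp(7 + 0.7 * log(demo$area) + rnorm(n_demo, sd = 0.1))

# Random split into training and test halves
idx_train  <- sample(n_demo, n_demo / 2)
demo_train <- demo[idx_train, ]
demo_test  <- demo[-idx_train, ]

# Fit on the training data only
fit_train <- lm(log(price) ~ log(area), data = demo_train)

# RMSE in original units for any data set, using the same fitted model
rmse_on <- function(fit, data) {
  sqrt(mean((data$price - exp(predict(fit, newdata = data)))^2))
}

rmse_in  <- rmse_on(fit_train, demo_train)
rmse_out <- rmse_on(fit_train, demo_test)

# Coverage of the 95% prediction intervals on the held-out data
pi_out <- exp(predict(fit_train, newdata = demo_test, interval = "prediction"))
coverage_out <- mean(demo_test$price > pi_out[, "lwr"] &
                     demo_test$price < pi_out[, "upr"])

c(rmse_in = rmse_in, rmse_out = rmse_out, coverage_out = coverage_out)
```

A large gap between rmse_in and rmse_out, or held-out coverage well below 95%, would point toward overfitting.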
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
model_final <- lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
Year.Built + log(Lot.Area) + Central.Air + Overall.Cond + Heating + Fireplaces +
Kitchen.Qual + log(Wood.Deck.SF + 1) + log(Open.Porch.SF + 1) + Bldg.Type +
House.Style + Overall.Qual:Neighborhood, data = ames_train)
summary(model_final)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) +
## Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air +
## Overall.Cond + Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF +
## 1) + log(Open.Porch.SF + 1) + Bldg.Type + House.Style + Overall.Qual:Neighborhood,
## data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69822 -0.05617 0.00425 0.06373 0.37444
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5266563 1.0105996 -0.521 0.602427
## Overall.Qual 0.0927874 0.0882472 1.051 0.293388
## NeighborhoodBlueste 0.0428123 0.1372979 0.312 0.755263
## NeighborhoodBrDale 0.7402694 0.8234023 0.899 0.368919
## NeighborhoodBrkSide 0.1541402 0.6529460 0.236 0.813443
## NeighborhoodClearCr 0.7954652 0.6979593 1.140 0.254772
## NeighborhoodCollgCr 0.2750019 0.6512966 0.422 0.672972
## NeighborhoodCrawfor 0.3523530 0.6601897 0.534 0.593696
## NeighborhoodEdwards 0.1583458 0.6520805 0.243 0.808202
## NeighborhoodGilbert 0.1477275 0.6604809 0.224 0.823077
## NeighborhoodGreens 0.0951448 0.0927763 1.026 0.305442
## NeighborhoodGrnHill 0.3719440 0.0901194 4.127 4.08e-05
## NeighborhoodIDOTRR 0.1853424 0.6517930 0.284 0.776214
## NeighborhoodMeadowV 0.3132852 0.6767949 0.463 0.643573
## NeighborhoodMitchel 0.2211560 0.6535294 0.338 0.735154
## NeighborhoodNAmes 0.2772342 0.6500526 0.426 0.669880
## NeighborhoodNoRidge 0.4270951 0.7051283 0.606 0.544897
## NeighborhoodNPkVill 0.9068422 1.0011008 0.906 0.365306
## NeighborhoodNridgHt 0.0253128 0.6627167 0.038 0.969542
## NeighborhoodNWAmes 0.2961887 0.6630522 0.447 0.655216
## NeighborhoodOldTown 0.1704423 0.6489251 0.263 0.792889
## NeighborhoodSawyer 0.4324344 0.6559705 0.659 0.509950
## NeighborhoodSawyerW -0.1968511 0.6623965 -0.297 0.766411
## NeighborhoodSomerst 0.3046157 0.6588591 0.462 0.643972
## NeighborhoodStoneBr 0.0185944 0.7537742 0.025 0.980326
## NeighborhoodSWISU -0.0365519 0.7127408 -0.051 0.959113
## NeighborhoodTimber -0.1839511 0.6779657 -0.271 0.786213
## NeighborhoodVeenker -0.2877899 0.6784808 -0.424 0.671564
## log(area) 0.5301257 0.0262830 20.170 < 2e-16
## Bedroom.AbvGr -0.0101726 0.0075518 -1.347 0.178373
## Year.Built 0.0035099 0.0003594 9.765 < 2e-16
## log(Lot.Area) 0.0919559 0.0134785 6.822 1.84e-11
## Central.AirY 0.1306494 0.0219619 5.949 4.13e-09
## Overall.Cond 0.0509427 0.0045747 11.136 < 2e-16
## HeatingGasW 0.1444190 0.0451145 3.201 0.001426
## HeatingGrav 0.0477921 0.1208976 0.395 0.692725
## HeatingOthW 0.0050873 0.1158124 0.044 0.964974
## HeatingWall 0.1339746 0.1141956 1.173 0.241084
## Fireplaces 0.0317582 0.0075335 4.216 2.79e-05
## Kitchen.QualFa -0.1095302 0.0383594 -2.855 0.004416
## Kitchen.QualGd -0.0760636 0.0246296 -3.088 0.002087
## Kitchen.QualPo 0.0153935 0.1248634 0.123 0.901916
## Kitchen.QualTA -0.1139836 0.0266646 -4.275 2.16e-05
## log(Wood.Deck.SF + 1) 0.0045636 0.0017350 2.630 0.008705
## log(Open.Porch.SF + 1) 0.0031072 0.0021995 1.413 0.158168
## Bldg.Type2fmCon 0.0251457 0.0285940 0.879 0.379461
## Bldg.TypeDuplex -0.0921404 0.0245435 -3.754 0.000187
## Bldg.TypeTwnhs -0.0537371 0.0324561 -1.656 0.098200
## Bldg.TypeTwnhsE -0.0268705 0.0227535 -1.181 0.237998
## House.Style1.5Unf 0.0603442 0.0449435 1.343 0.179783
## House.Style1Story 0.0815747 0.0167729 4.863 1.40e-06
## House.Style2.5Unf 0.0178577 0.0386041 0.463 0.643794
## House.Style2Story -0.0039440 0.0164673 -0.240 0.810780
## House.StyleSFoyer 0.1490007 0.0276012 5.398 9.01e-08
## House.StyleSLvl 0.0660868 0.0238136 2.775 0.005654
## Overall.Qual:NeighborhoodBlueste NA NA NA NA
## Overall.Qual:NeighborhoodBrDale -0.1341061 0.1259888 -1.064 0.287475
## Overall.Qual:NeighborhoodBrkSide -0.0174782 0.0901425 -0.194 0.846310
## Overall.Qual:NeighborhoodClearCr -0.1270657 0.0992514 -1.280 0.200853
## Overall.Qual:NeighborhoodCollgCr -0.0441635 0.0893042 -0.495 0.621077
## Overall.Qual:NeighborhoodCrawfor -0.0378912 0.0910554 -0.416 0.677431
## Overall.Qual:NeighborhoodEdwards -0.0359666 0.0900227 -0.400 0.689617
## Overall.Qual:NeighborhoodGilbert -0.0298331 0.0908822 -0.328 0.742805
## Overall.Qual:NeighborhoodGreens NA NA NA NA
## Overall.Qual:NeighborhoodGrnHill NA NA NA NA
## Overall.Qual:NeighborhoodIDOTRR -0.0454130 0.0900849 -0.504 0.614329
## Overall.Qual:NeighborhoodMeadowV -0.0805197 0.0991498 -0.812 0.416989
## Overall.Qual:NeighborhoodMitchel -0.0315168 0.0900743 -0.350 0.726513
## Overall.Qual:NeighborhoodNAmes -0.0473328 0.0894901 -0.529 0.597019
## Overall.Qual:NeighborhoodNoRidge -0.0467993 0.0955442 -0.490 0.624405
## Overall.Qual:NeighborhoodNPkVill -0.1377268 0.1509558 -0.912 0.361868
## Overall.Qual:NeighborhoodNridgHt 0.0062515 0.0904779 0.069 0.944933
## Overall.Qual:NeighborhoodNWAmes -0.0515192 0.0914319 -0.563 0.573281
## Overall.Qual:NeighborhoodOldTown -0.0328543 0.0891315 -0.369 0.712526
## Overall.Qual:NeighborhoodSawyer -0.0793605 0.0910731 -0.871 0.383816
## Overall.Qual:NeighborhoodSawyerW 0.0218426 0.0914242 0.239 0.811236
## Overall.Qual:NeighborhoodSomerst -0.0323187 0.0901753 -0.358 0.720145
## Overall.Qual:NeighborhoodStoneBr 0.0043993 0.1006984 0.044 0.965165
## Overall.Qual:NeighborhoodSWISU 0.0098192 0.1028172 0.096 0.923942
## Overall.Qual:NeighborhoodTimber 0.0262250 0.0924563 0.284 0.776758
## Overall.Qual:NeighborhoodVeenker 0.0438011 0.0931258 0.470 0.638245
##
## (Intercept)
## Overall.Qual
## NeighborhoodBlueste
## NeighborhoodBrDale
## NeighborhoodBrkSide
## NeighborhoodClearCr
## NeighborhoodCollgCr
## NeighborhoodCrawfor
## NeighborhoodEdwards
## NeighborhoodGilbert
## NeighborhoodGreens
## NeighborhoodGrnHill ***
## NeighborhoodIDOTRR
## NeighborhoodMeadowV
## NeighborhoodMitchel
## NeighborhoodNAmes
## NeighborhoodNoRidge
## NeighborhoodNPkVill
## NeighborhoodNridgHt
## NeighborhoodNWAmes
## NeighborhoodOldTown
## NeighborhoodSawyer
## NeighborhoodSawyerW
## NeighborhoodSomerst
## NeighborhoodStoneBr
## NeighborhoodSWISU
## NeighborhoodTimber
## NeighborhoodVeenker
## log(area) ***
## Bedroom.AbvGr
## Year.Built ***
## log(Lot.Area) ***
## Central.AirY ***
## Overall.Cond ***
## HeatingGasW **
## HeatingGrav
## HeatingOthW
## HeatingWall
## Fireplaces ***
## Kitchen.QualFa **
## Kitchen.QualGd **
## Kitchen.QualPo
## Kitchen.QualTA ***
## log(Wood.Deck.SF + 1) **
## log(Open.Porch.SF + 1)
## Bldg.Type2fmCon
## Bldg.TypeDuplex ***
## Bldg.TypeTwnhs .
## Bldg.TypeTwnhsE
## House.Style1.5Unf
## House.Style1Story ***
## House.Style2.5Unf
## House.Style2Story
## House.StyleSFoyer ***
## House.StyleSLvl **
## Overall.Qual:NeighborhoodBlueste
## Overall.Qual:NeighborhoodBrDale
## Overall.Qual:NeighborhoodBrkSide
## Overall.Qual:NeighborhoodClearCr
## Overall.Qual:NeighborhoodCollgCr
## Overall.Qual:NeighborhoodCrawfor
## Overall.Qual:NeighborhoodEdwards
## Overall.Qual:NeighborhoodGilbert
## Overall.Qual:NeighborhoodGreens
## Overall.Qual:NeighborhoodGrnHill
## Overall.Qual:NeighborhoodIDOTRR
## Overall.Qual:NeighborhoodMeadowV
## Overall.Qual:NeighborhoodMitchel
## Overall.Qual:NeighborhoodNAmes
## Overall.Qual:NeighborhoodNoRidge
## Overall.Qual:NeighborhoodNPkVill
## Overall.Qual:NeighborhoodNridgHt
## Overall.Qual:NeighborhoodNWAmes
## Overall.Qual:NeighborhoodOldTown
## Overall.Qual:NeighborhoodSawyer
## Overall.Qual:NeighborhoodSawyerW
## Overall.Qual:NeighborhoodSomerst
## Overall.Qual:NeighborhoodStoneBr
## Overall.Qual:NeighborhoodSWISU
## Overall.Qual:NeighborhoodTimber
## Overall.Qual:NeighborhoodVeenker
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1054 on 756 degrees of freedom
## Multiple R-squared: 0.9308, Adjusted R-squared: 0.9238
## F-statistic: 132.1 on 77 and 756 DF, p-value: < 2.2e-16
As we can see from the p-value, the model is significant at the 0.05 level of significance. This model accounts for approximately 93% of the variability by R-squared and 92% by adjusted R-squared.
From the model above, it appears that **Neighborhood (only Green Hill), log(area), Year.Built, log(Lot.Area), Central.Air, Overall.Cond, Heating (GasW), Fireplaces, Kitchen.Qual, log(Wood.Deck.SF + 1), Bldg.Type (Duplex) and House.Style are the variables of significance** for the model at the 5% level. For each of these, a hypothesis test rejects the null hypothesis that the coefficient (or, for a categorical variable, at least one of its coefficients) is zero, whereas for the rest we fail to reject it. Note that the main effect of Overall.Qual is not significant in this model (p = 0.29) once its interaction with Neighborhood is included.
The interaction between Neighborhood and Overall Quality will be discussed below.
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
**Yes, I did transform variables: price, area, Lot.Area, Wood.Deck.SF and Open.Porch.SF.** Each was transformed by taking the log, which rescales it. Wood.Deck.SF and Open.Porch.SF are covariates that can equal zero, where the log is undefined, so for these the transformation is to add 1 to the term and take the log of the sum.
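The reason for adding 1 before taking the log can be seen on a small made-up vector. The values in deck_sf_demo are hypothetical and only illustrate a covariate that is zero for some homes.

```r
# Made-up deck sizes in square feet, including homes with no deck
deck_sf_demo <- c(0, 0, 120, 350)

log(deck_sf_demo)        # -Inf for the zeros, which lm() cannot use
log(deck_sf_demo + 1)    # finite everywhere; a zero maps to zero
log1p(deck_sf_demo)      # built-in equivalent of log(x + 1)
```

Either log(x + 1) or log1p(x) could be used in the model formula; the report uses the former.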
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
inter_reg<- lm(log(price)~Overall.Qual:Neighborhood, ames_test)
summary(inter_reg)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual:Neighborhood, data = ames_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83152 -0.10348 0.00566 0.10949 0.63202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.962179 0.040904 267.997 < 2e-16 ***
## Overall.Qual:NeighborhoodBlmngtn 0.169859 0.010737 15.820 < 2e-16 ***
## Overall.Qual:NeighborhoodBlueste 0.135850 0.012028 11.295 < 2e-16 ***
## Overall.Qual:NeighborhoodBrDale 0.110066 0.012388 8.885 < 2e-16 ***
## Overall.Qual:NeighborhoodBrkSide 0.148776 0.009920 14.998 < 2e-16 ***
## Overall.Qual:NeighborhoodClearCr 0.217461 0.009850 22.077 < 2e-16 ***
## Overall.Qual:NeighborhoodCollgCr 0.184400 0.006730 27.398 < 2e-16 ***
## Overall.Qual:NeighborhoodCrawfor 0.200134 0.008040 24.891 < 2e-16 ***
## Overall.Qual:NeighborhoodEdwards 0.154885 0.009209 16.819 < 2e-16 ***
## Overall.Qual:NeighborhoodGilbert 0.179308 0.007455 24.053 < 2e-16 ***
## Overall.Qual:NeighborhoodIDOTRR 0.139787 0.012129 11.525 < 2e-16 ***
## Overall.Qual:NeighborhoodLandmrk 0.144259 0.031285 4.611 4.67e-06 ***
## Overall.Qual:NeighborhoodMeadowV 0.134794 0.015531 8.679 < 2e-16 ***
## Overall.Qual:NeighborhoodMitchel 0.185488 0.009926 18.686 < 2e-16 ***
## Overall.Qual:NeighborhoodNAmes 0.171970 0.007998 21.502 < 2e-16 ***
## Overall.Qual:NeighborhoodNoRidge 0.219160 0.006571 33.350 < 2e-16 ***
## Overall.Qual:NeighborhoodNPkVill 0.147395 0.011778 12.514 < 2e-16 ***
## Overall.Qual:NeighborhoodNridgHt 0.198724 0.006426 30.927 < 2e-16 ***
## Overall.Qual:NeighborhoodNWAmes 0.183963 0.007804 23.574 < 2e-16 ***
## Overall.Qual:NeighborhoodOldTown 0.142932 0.008322 17.175 < 2e-16 ***
## Overall.Qual:NeighborhoodSawyer 0.170038 0.010070 16.886 < 2e-16 ***
## Overall.Qual:NeighborhoodSawyerW 0.184639 0.007875 23.447 < 2e-16 ***
## Overall.Qual:NeighborhoodSomerst 0.176890 0.006647 26.611 < 2e-16 ***
## Overall.Qual:NeighborhoodStoneBr 0.188703 0.007911 23.855 < 2e-16 ***
## Overall.Qual:NeighborhoodSWISU 0.162032 0.011133 14.554 < 2e-16 ***
## Overall.Qual:NeighborhoodTimber 0.196954 0.008751 22.506 < 2e-16 ***
## Overall.Qual:NeighborhoodVeenker 0.220654 0.018706 11.796 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1832 on 790 degrees of freedom
## Multiple R-squared: 0.758, Adjusted R-squared: 0.7501
## F-statistic: 95.18 on 26 and 790 DF, p-value: < 2.2e-16
Yes, I did include a variable interaction. The one interaction that I felt was important was between the overall quality of the house and the neighborhood to which it belongs. The question that could arise is whether the two are related: does Neighborhood matter for the quality of the house?
Taken by itself, the above table seems to make ALL Neighborhoods significant. However, an examination of the R-squared (76%) and adjusted R-squared (75%) shows a relationship that, while strong in itself, is weak overall for price prediction.
Consequently, there are other variables exerting at least as great an influence on the predictive strength of the model as Neighborhood.
This forms the basis for my variable selection.
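It may help to see what an interaction-only formula such as Overall.Qual:Neighborhood estimates compared with a formula that also carries the main effects. The sketch below uses R's built-in mtcars data as a stand-in (an assumption for self-containment), with wt playing the role of the numeric quality score and cyl the role of the neighborhood factor.

```r
# Stand-in data: cyl is converted to a factor, like Neighborhood
cars_demo <- transform(mtcars, cyl = factor(cyl))

# Interaction only: one common intercept and a separate slope per group
fit_slopes <- lm(mpg ~ wt:cyl, data = cars_demo)

# Main effects plus interaction: separate intercepts and separate slopes,
# expressed as differences from the reference group
fit_full <- lm(mpg ~ wt * cyl, data = cars_demo)

names(coef(fit_slopes))
names(coef(fit_full))
```

In the interaction-only fit every group gets its own slope, which is one reason each of those terms can look highly significant on its own.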
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
model_price_select_test<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_test)
summary(model_price_select_test)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) +
## Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air +
## Overall.Cond, data = ames_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55288 -0.06329 0.00404 0.07388 0.33664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3197284 0.6920407 -0.462 0.644204
## Overall.Qual 0.0788674 0.0054303 14.524 < 2e-16 ***
## NeighborhoodBlueste -0.0506033 0.0640896 -0.790 0.430017
## NeighborhoodBrDale -0.1174224 0.0577298 -2.034 0.042288 *
## NeighborhoodBrkSide -0.1014693 0.0520253 -1.950 0.051486 .
## NeighborhoodClearCr -0.0290978 0.0571392 -0.509 0.610724
## NeighborhoodCollgCr -0.0645388 0.0457340 -1.411 0.158589
## NeighborhoodCrawfor 0.0299842 0.0503906 0.595 0.551992
## NeighborhoodEdwards -0.1592750 0.0488420 -3.261 0.001158 **
## NeighborhoodGilbert -0.1418487 0.0472388 -3.003 0.002760 **
## NeighborhoodIDOTRR -0.1385458 0.0558523 -2.481 0.013326 *
## NeighborhoodLandmrk -0.1396342 0.1251099 -1.116 0.264725
## NeighborhoodMeadowV -0.1028101 0.0580819 -1.770 0.077101 .
## NeighborhoodMitchel -0.0981988 0.0513926 -1.911 0.056400 .
## NeighborhoodNAmes -0.0931867 0.0469399 -1.985 0.047467 *
## NeighborhoodNoRidge 0.0797727 0.0500766 1.593 0.111560
## NeighborhoodNPkVill -0.0297085 0.0583926 -0.509 0.611055
## NeighborhoodNridgHt 0.0743417 0.0485020 1.533 0.125739
## NeighborhoodNWAmes -0.1182678 0.0486513 -2.431 0.015283 *
## NeighborhoodOldTown -0.1464827 0.0507662 -2.885 0.004016 **
## NeighborhoodSawyer -0.1113037 0.0498445 -2.233 0.025829 *
## NeighborhoodSawyerW -0.1188930 0.0483722 -2.458 0.014191 *
## NeighborhoodSomerst -0.0151898 0.0457289 -0.332 0.739849
## NeighborhoodStoneBr 0.0533240 0.0538454 0.990 0.322325
## NeighborhoodSWISU -0.0882204 0.0579430 -1.523 0.128278
## NeighborhoodTimber -0.0679612 0.0539005 -1.261 0.207733
## NeighborhoodVeenker -0.0007065 0.0819207 -0.009 0.993121
## log(area) 0.5112374 0.0228948 22.330 < 2e-16 ***
## Bedroom.AbvGr -0.0362240 0.0069766 -5.192 2.65e-07 ***
## Year.Built 0.0033515 0.0003334 10.054 < 2e-16 ***
## log(Lot.Area) 0.1533616 0.0118716 12.918 < 2e-16 ***
## Central.AirY 0.0723886 0.0204066 3.547 0.000412 ***
## Overall.Cond 0.0481565 0.0044583 10.801 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1176 on 784 degrees of freedom
## Multiple R-squared: 0.9011, Adjusted R-squared: 0.8971
## F-statistic: 223.3 on 32 and 784 DF, p-value: < 2.2e-16
I have called back the model to illustrate a point about the interactions and the selection process. In the previous section it seemed that Neighborhood was important but had weak R-squared values. When other variables are introduced, the significance of Neighborhood drops off in favor of other things.
Why is this?
Simply put, people in general do not make purchases rationally. Nowhere is this seen more clearly than in home buying. The issues that matter to people are the ones that provide emotional satisfaction and a sense of place for the future. These are not always variables that at first seem important.
Something like a fireplace or a deck can do more for the desirability of a house than the neighborhood, especially if you can still get a deal on the overall price and quality.
If you look at the additional variables, many of them refer to the creature-comfort aspect of a house. I therefore tried to build a model that included those, and the result was a better predictive fit. In conjunction with the dethroning of the interaction of overall quality and neighborhood, it points to the subjective nature of the decision process when accepting or rejecting a house price.
This is demonstrated below by using stepwise AIC on the model above.
model_final_AIC<-stepAIC(model_final, k =2)
## Start: AIC=-3678.25
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
## Year.Built + log(Lot.Area) + Central.Air + Overall.Cond +
## Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF +
## 1) + log(Open.Porch.SF + 1) + Bldg.Type + House.Style + Overall.Qual:Neighborhood
##
## Df Sum of Sq RSS AIC
## - Overall.Qual:Neighborhood 23 0.4274 8.8322 -3682.9
## - Bedroom.AbvGr 1 0.0202 8.4250 -3678.3
## <none> 8.4049 -3678.3
## - log(Open.Porch.SF + 1) 1 0.0222 8.4270 -3678.1
## - Heating 4 0.1280 8.5329 -3673.6
## - log(Wood.Deck.SF + 1) 1 0.0769 8.4818 -3672.7
## - Bldg.Type 4 0.2057 8.6105 -3666.1
## - Kitchen.Qual 4 0.2559 8.6608 -3661.2
## - Fireplaces 1 0.1976 8.6024 -3660.9
## - Central.Air 1 0.3934 8.7983 -3642.1
## - log(Lot.Area) 1 0.5175 8.9223 -3630.4
## - House.Style 6 0.6566 9.0615 -3627.5
## - Year.Built 1 1.0600 9.4649 -3581.2
## - Overall.Cond 1 1.3786 9.7835 -3553.6
## - log(area) 1 4.5229 12.9278 -3321.2
##
## Step: AIC=-3682.89
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
## Year.Built + log(Lot.Area) + Central.Air + Overall.Cond +
## Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF +
## 1) + log(Open.Porch.SF + 1) + Bldg.Type + House.Style
##
## Df Sum of Sq RSS AIC
## - log(Open.Porch.SF + 1) 1 0.0198 8.8520 -3683.0
## <none> 8.8322 -3682.9
## - Bedroom.AbvGr 1 0.0392 8.8714 -3681.2
## - log(Wood.Deck.SF + 1) 1 0.0713 8.9035 -3678.2
## - Heating 4 0.1437 8.9759 -3677.4
## - Bldg.Type 4 0.2290 9.0612 -3669.5
## - Fireplaces 1 0.1815 9.0137 -3667.9
## - Kitchen.Qual 4 0.4806 9.3128 -3646.7
## - Central.Air 1 0.4349 9.2672 -3644.8
## - log(Lot.Area) 1 0.5785 9.4107 -3632.0
## - House.Style 6 0.6929 9.5252 -3631.9
## - Year.Built 1 1.2252 10.0574 -3576.6
## - Neighborhood 26 1.9989 10.8311 -3564.7
## - Overall.Qual 1 1.3736 10.2058 -3564.3
## - Overall.Cond 1 1.3963 10.2285 -3562.5
## - log(area) 1 4.6351 13.4673 -3333.1
##
## Step: AIC=-3683.02
## log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
## Year.Built + log(Lot.Area) + Central.Air + Overall.Cond +
## Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF +
## 1) + Bldg.Type + House.Style
##
## Df Sum of Sq RSS AIC
## <none> 8.8520 -3683.0
## - Bedroom.AbvGr 1 0.0447 8.8967 -3680.8
## - log(Wood.Deck.SF + 1) 1 0.0668 8.9188 -3678.8
## - Heating 4 0.1386 8.9906 -3678.1
## - Bldg.Type 4 0.2359 9.0880 -3669.1
## - Fireplaces 1 0.1782 9.0302 -3668.4
## - Kitchen.Qual 4 0.4910 9.3430 -3646.0
## - Central.Air 1 0.4331 9.2851 -3645.2
## - log(Lot.Area) 1 0.5735 9.4255 -3632.7
## - House.Style 6 0.6979 9.5499 -3631.7
## - Year.Built 1 1.2827 10.1347 -3572.2
## - Neighborhood 26 1.9874 10.8395 -3566.1
## - Overall.Cond 1 1.3795 10.2315 -3564.2
## - Overall.Qual 1 1.3999 10.2519 -3562.6
## - log(area) 1 4.9036 13.7556 -3317.4
summary(model_final_AIC)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) +
## Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air +
## Overall.Cond + Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF +
## 1) + Bldg.Type + House.Style, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68115 -0.05809 -0.00013 0.06399 0.39079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5450732 0.7297305 -0.747 0.455318
## Overall.Qual 0.0606529 0.0054612 11.106 < 2e-16 ***
## NeighborhoodBlueste 0.0099167 0.0772420 0.128 0.897877
## NeighborhoodBrDale -0.0665733 0.0634705 -1.049 0.294555
## NeighborhoodBrkSide 0.0031492 0.0533136 0.059 0.952912
## NeighborhoodClearCr 0.0210728 0.0595575 0.354 0.723568
## NeighborhoodCollgCr -0.0414695 0.0471163 -0.880 0.379048
## NeighborhoodCrawfor 0.0890713 0.0529027 1.684 0.092643 .
## NeighborhoodEdwards -0.0929931 0.0489076 -1.901 0.057617 .
## NeighborhoodGilbert -0.0727539 0.0493584 -1.474 0.140887
## NeighborhoodGreens 0.1277815 0.0689483 1.853 0.064218 .
## NeighborhoodGrnHill 0.3479941 0.0870026 4.000 6.94e-05 ***
## NeighborhoodIDOTRR -0.1082178 0.0545228 -1.985 0.047515 *
## NeighborhoodMeadowV -0.1244833 0.0533691 -2.332 0.019928 *
## NeighborhoodMitchel -0.0094556 0.0489189 -0.193 0.846782
## NeighborhoodNAmes -0.0347972 0.0479963 -0.725 0.468672
## NeighborhoodNoRidge 0.0767954 0.0497886 1.542 0.123375
## NeighborhoodNPkVill 0.0276138 0.0707151 0.390 0.696278
## NeighborhoodNridgHt 0.0865020 0.0476483 1.815 0.069842 .
## NeighborhoodNWAmes -0.0587472 0.0493722 -1.190 0.234454
## NeighborhoodOldTown -0.0606729 0.0529408 -1.146 0.252125
## NeighborhoodSawyer -0.0452126 0.0492553 -0.918 0.358942
## NeighborhoodSawyerW -0.0910218 0.0482280 -1.887 0.059488 .
## NeighborhoodSomerst 0.0658933 0.0462688 1.424 0.154806
## NeighborhoodStoneBr 0.0659977 0.0526610 1.253 0.210489
## NeighborhoodSWISU -0.0234563 0.0603135 -0.389 0.697451
## NeighborhoodTimber 0.0149122 0.0522014 0.286 0.775210
## NeighborhoodVeenker 0.0114988 0.0574368 0.200 0.841377
## log(area) 0.5358828 0.0257803 20.787 < 2e-16 ***
## Bedroom.AbvGr -0.0144719 0.0072904 -1.985 0.047486 *
## Year.Built 0.0036432 0.0003427 10.631 < 2e-16 ***
## log(Lot.Area) 0.0936459 0.0131734 7.109 2.65e-12 ***
## Central.AirY 0.1333048 0.0215789 6.178 1.05e-09 ***
## Overall.Cond 0.0477292 0.0043292 11.025 < 2e-16 ***
## HeatingGasW 0.1423503 0.0446255 3.190 0.001480 **
## HeatingGrav 0.0501757 0.1190623 0.421 0.673562
## HeatingOthW 0.0184439 0.1127412 0.164 0.870092
## HeatingWall 0.1572899 0.1108588 1.419 0.156348
## Fireplaces 0.0297317 0.0075033 3.963 8.10e-05 ***
## Kitchen.QualFa -0.1417247 0.0364427 -3.889 0.000109 ***
## Kitchen.QualGd -0.1054297 0.0215640 -4.889 1.23e-06 ***
## Kitchen.QualPo 0.0439225 0.1188289 0.370 0.711760
## Kitchen.QualTA -0.1482163 0.0239487 -6.189 9.78e-10 ***
## log(Wood.Deck.SF + 1) 0.0041417 0.0017073 2.426 0.015498 *
## Bldg.Type2fmCon 0.0243767 0.0286304 0.851 0.394794
## Bldg.TypeDuplex -0.0921453 0.0243434 -3.785 0.000165 ***
## Bldg.TypeTwnhs -0.0686228 0.0311169 -2.205 0.027722 *
## Bldg.TypeTwnhsE -0.0339208 0.0216769 -1.565 0.118026
## House.Style1.5Unf 0.0589464 0.0444919 1.325 0.185600
## House.Style1Story 0.0835486 0.0166983 5.003 6.96e-07 ***
## House.Style2.5Unf 0.0207134 0.0380180 0.545 0.586025
## House.Style2Story -0.0023239 0.0163394 -0.142 0.886939
## House.StyleSFoyer 0.1505716 0.0273637 5.503 5.08e-08 ***
## House.StyleSLvl 0.0581141 0.0237033 2.452 0.014435 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1065 on 780 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9222
## F-statistic: 187.3 on 53 and 780 DF, p-value: < 2.2e-16
As we can see, the model with the lowest AIC (-3683) has many “amenities” in the house and, once again, for predicting price Neighborhood is all but irrelevant except for a few areas. The subjective component in this final model is demonstrated by which variables have significance.
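The AIC values in the stepwise trace can be reproduced by hand from the residual sums of squares. The sketch below assumes the form stepAIC reports for a linear model, n * log(RSS / n) + k * (number of estimated coefficients). The helper name step_aic is hypothetical; the RSS values and coefficient counts are read from the output above.

```r
# Sample size implied by the summaries: 756 residual df + 78 coefficients
n_obs <- 834

# AIC in the form shown in the stepAIC trace for a linear model
step_aic <- function(rss, n_coef, n = n_obs, k = 2) {
  n * log(rss / n) + k * n_coef
}

aic_start <- step_aic(rss = 8.4049, n_coef = 78)  # full model_final
aic_final <- step_aic(rss = 8.8520, n_coef = 54)  # after both removals

round(c(aic_start, aic_final), 2)
```

Up to rounding of the printed RSS, these should agree with the starting and final AIC in the trace. Dropping the interaction raises the RSS slightly but saves 23 coefficients, which is why the penalized score improves.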
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
model_final_AIC<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
Year.Built + log(Lot.Area) + Central.Air + Overall.Cond +
Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF +
1) + Bldg.Type + House.Style, data=ames_test)
# Predict prices
predict_final <- exp(predict(model_final_AIC, ames_test, interval = "prediction"))
# Calculate proportion of observations that fall within prediction intervals
coverage_prob_test_final <- mean(ames_test$price > predict_final[,"lwr"] &
ames_test$price < predict_final[,"upr"])
coverage_prob_test_final
## [1] 0.9583843
This final model shows that the out-of-sample coverage has risen to 96%.
To further refine the analysis I’m going to add a Bayesian analysis.
BAYESIAN ASSESSMENT
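# Bayesian model averaging over the same predictors (Zellner-Siow prior, uniform model prior)
model_bayes <- bas.lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
Year.Built + log(Lot.Area) + Central.Air + Overall.Cond +
Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF + 1) +
Bldg.Type + House.Style, data = ames_test,
prior = "ZS-null", modelprior = uniform(), method = "MCMC")
par(mfrow = c(1, 2))
# Compare MCMC and renormalized posterior inclusion probabilities
diagnostics(model_bayes)
# Model rank plot, then marginal inclusion probabilities
image(model_bayes, rotate = TRUE, cex.lab = .6)
plot(model_bayes, which = 4, cex.lab = .8)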
The Bayesian Model Averaging (BMA) used here relies on Markov Chain Monte Carlo (MCMC) sampling. With p predictor variables there are 2^p possible models, too many to enumerate, so the sampler repeatedly proposes a model from that space, evaluates its posterior probability using Bayes factors, and compares it to the current model to decide whether to move.
The prior on the coefficients builds on Zellner's g-prior. Subtracting the sample mean from each of the predictors gives the intercept the same meaning across all models. The prior variance of the coefficients is the ordinary least squares variance scaled by a single scalar parameter g, which simplifies the posterior because prior elicitation reduces to choosing g and b-zero. However, a fixed g has some disadvantages: the information paradox, Bartlett's paradox, and no way to express uncertainty about g or to let the data update it. To avoid those drawbacks, the g-prior is modified by placing a prior on g, as in the hyper-g/n prior and the Zellner-Siow Cauchy prior; the model above uses the latter (prior = "ZS-null").
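For reference, the priors named above can be written out as follows. This is a sketch in standard notation that does not appear in the original: $X_M$ denotes the centered design matrix of model $M$, $\alpha$ the intercept, $n$ the sample size, and the prior mean of the coefficients is assumed to be zero.

```latex
\beta_M \mid \sigma^2, g \;\sim\; \mathcal{N}\!\left(0,\; g\,\sigma^2 \left(X_M^\top X_M\right)^{-1}\right),
\qquad
p(\alpha, \sigma^2) \;\propto\; \frac{1}{\sigma^2}

% Zellner-Siow: a prior on g that induces a Cauchy prior on \beta_M
g \;\sim\; \text{Inv-Gamma}\!\left(\tfrac{1}{2},\; \tfrac{n}{2}\right)
```

With g fixed, the first line is the plain g-prior; the second line is what lets the data inform g.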
We will assume a model prior that is uniform across all models.
The MCMC and normalized posterior inclusion probabilities are in close agreement, as they should be if the MCMC has been run long enough.
The best model in the model rank plot tends to agree with the model selected by AIC.
The inclusion probability plot, however, includes more neighborhoods than the final model does. This is useful as additional detail, but since Neighborhood will most likely not be subset at this point, all areas will be included under that predictor.
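To make the idea of posterior inclusion probabilities concrete, here is a minimal base-R sketch that enumerates all 2^p models for a small simulated data set. It is only an illustration under stated assumptions: the data, the variable names x1 to x3, and the use of the BIC approximation to the marginal likelihood are all hypothetical stand-ins, not the Zellner-Siow prior or MCMC sampler that bas.lm uses above.

```r
# Sketch of Bayesian model averaging by full enumeration (base R only).
# Assumptions: simulated data, uniform model prior, BIC approximation
# to the marginal likelihood (NOT the Zellner-Siow prior used by bas.lm).
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 1 + 2 * X$x1 - 1.5 * X$x2 + rnorm(n)   # x3 is pure noise
dat <- cbind(y = y, X)

vars <- names(X)
p <- length(vars)

# Each row of `inc` switches the p predictors on or off: 2^p models in total
inc <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))
colnames(inc) <- vars

# Fit every model and record its BIC
bic <- apply(inc, 1, function(keep) {
  rhs <- if (any(keep)) paste(vars[keep], collapse = " + ") else "1"
  BIC(lm(as.formula(paste("y ~", rhs)), data = dat))
})

# Approximate posterior model probabilities under a uniform model prior
post_prob <- exp(-0.5 * (bic - min(bic)))
post_prob <- post_prob / sum(post_prob)

# Posterior inclusion probability: total probability of the models
# that contain each variable
pip <- colSums(inc * post_prob)
round(pip, 3)
```

With only three candidate predictors, enumeration is trivial; with the dozens of dummy-coded terms in the Ames model it is not, which is why sampling the model space with MCMC is used instead.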
For your final model, create and briefly interpret an informative plot of the residuals.
par(mfrow = c(2,2))
plot(model_final_AIC)
## Warning: not plotting observations with leverage one:
## 205
## Warning: not plotting observations with leverage one:
## 205
hist(model_final_AIC$residuals, col = "orange", las = 1, main = "Final Model Residual Histogram", labels = TRUE, ylim = c(0, 325))
* * *
The assumptions of the model rest on four properties:
Normality of the residuals is assessed with the histogram and the Normal Q-Q plot. Both show a high degree of normality; the Q-Q plot indicates some deviation in the tails caused by the labeled outliers, but the residuals are nearly normal.
Linearity is assessed in the first graph (Residuals vs Fitted). The residuals scatter around zero and there is no obvious shape or pattern to the scatter.
Homoscedasticity refers to the constancy of the variance of the residuals. This is also assessed in the first graph, through the scatter of the points. The majority of the points fall between the two lines at -0.5 and 0.5, with almost nothing outside this region, which indicates a fairly constant variance of the residuals.
The Scale-Location graph also shows the spread of the residuals against the fitted values. Since there is no pattern and the red line is almost flat, with only a small dip, we can be reasonably assured of homoscedasticity in the model.
Independence means that the observations were randomly selected and that the selection of any particular observation has no bearing on another. Since there is no direct test for this, we have to reason from first principles. The table has no structure that indicates any meaningful subsetting of the data, so any subsection should behave like a random sample of the entire pool.
Residuals vs Leverage. The fourth graph helps clarify the nature of the outliers with respect to their leverage. As the graph shows, all of the plotted points are well within the Cook's distance bounds, so they can be included in the final model without undue impact. Therefore, no outliers need be deleted from the model.
Therefore the conditions for a parsimonious, non-multicollinear model derived from the initial one are met.
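The visual checks above can be backed up with a few numbers. The following is a minimal sketch on simulated data (the data frame d and its columns are hypothetical, not the Ames data) showing how leverage and Cook's distance can be pulled from an lm fit, and how an observation with leverage one, like the one named in the warning above, could be located.

```r
# Sketch: numeric companions to the four diagnostic plots (simulated data).
set.seed(2)
n <- 150
d <- data.frame(x = runif(n, 0, 10))
d$y <- 3 + 0.5 * d$x + rnorm(n, sd = 0.3)
fit <- lm(y ~ x, data = d)

res  <- residuals(fit)
lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # influence of each observation

# Normality: Shapiro-Wilk test on the residuals
shapiro.test(res)$p.value

# Leverage: flag points above the common 2 * (p / n) rule of thumb
p_coef <- length(coef(fit))
which(lev > 2 * p_coef / n)

# Observations with leverage (numerically) equal to one are fit exactly,
# for example the only member of a factor level; plot.lm() skips them
which(lev > 1 - 1e-8)

# Influence: largest Cook's distance in the fit
max(cook)
```

Running the same calls on model_final_AIC would show directly which row triggers the "leverage one" warning and why.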
For your final model, calculate and briefly comment on the RMSE.
# Extract Predictions
predict_test <- exp(predict(model_final_AIC, ames_test))
# Extract Residuals
resid_test <- ames_test$price - predict_test
# Calculate RMSE
rmse_test<- sqrt(mean(resid_test^2))
rmse_test
## [1] 19798.47
The RMSE for the final model is smaller than that of the previous models; as an absolute measure of fit, it is about 12.6% lower.
What are some strengths and weaknesses of your model?
The model I have is more accurate in terms of RMSE, R-squared, and adjusted R-squared: the first is small and the latter two are high. However, there are many subjective components, and the model largely ignores Neighborhood, which is counterintuitive for many people. The model is also small compared to the number of available variables, and it's possible that a larger model would be more effective.
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
load("ames_validation.Rdata")
model_final_AIC_valid<-lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
Year.Built + log(Lot.Area) + Central.Air + Overall.Cond +
Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF +
1) + Bldg.Type + House.Style, data=ames_validation)
summary(model_final_AIC_valid)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual + Neighborhood + log(area) +
## Bedroom.AbvGr + Year.Built + log(Lot.Area) + Central.Air +
## Overall.Cond + Heating + Fireplaces + Kitchen.Qual + log(Wood.Deck.SF +
## 1) + Bldg.Type + House.Style, data = ames_validation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34506 -0.05944 0.00000 0.06180 0.43579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.089313 0.674330 -3.098 0.002023 **
## Overall.Qual 0.072705 0.005119 14.203 < 2e-16 ***
## NeighborhoodBlueste 0.040359 0.121437 0.332 0.739729
## NeighborhoodBrDale -0.032725 0.071272 -0.459 0.646260
## NeighborhoodBrkSide 0.030115 0.060712 0.496 0.620027
## NeighborhoodClearCr 0.013750 0.062292 0.221 0.825361
## NeighborhoodCollgCr -0.033869 0.056241 -0.602 0.547229
## NeighborhoodCrawfor 0.084826 0.060510 1.402 0.161393
## NeighborhoodEdwards -0.081370 0.058632 -1.388 0.165629
## NeighborhoodGilbert -0.093541 0.057764 -1.619 0.105813
## NeighborhoodGreens 0.090050 0.078549 1.146 0.252013
## NeighborhoodIDOTRR -0.055108 0.062495 -0.882 0.378180
## NeighborhoodMeadowV -0.101815 0.069505 -1.465 0.143406
## NeighborhoodMitchel -0.055860 0.058478 -0.955 0.339789
## NeighborhoodNAmes -0.042409 0.056947 -0.745 0.456699
## NeighborhoodNoRidge 0.106421 0.063258 1.682 0.092945 .
## NeighborhoodNPkVill 0.035143 0.066788 0.526 0.598921
## NeighborhoodNridgHt 0.011642 0.057416 0.203 0.839372
## NeighborhoodNWAmes -0.051076 0.058000 -0.881 0.378822
## NeighborhoodOldTown -0.035768 0.059579 -0.600 0.548464
## NeighborhoodSawyer -0.004763 0.058926 -0.081 0.935603
## NeighborhoodSawyerW -0.072336 0.057810 -1.251 0.211253
## NeighborhoodSomerst 0.056174 0.057879 0.971 0.332104
## NeighborhoodStoneBr 0.089628 0.062501 1.434 0.152005
## NeighborhoodSWISU -0.039895 0.063061 -0.633 0.527170
## NeighborhoodTimber 0.006139 0.060261 0.102 0.918885
## NeighborhoodVeenker 0.001529 0.064116 0.024 0.980984
## log(area) 0.518117 0.024798 20.893 < 2e-16 ***
## Bedroom.AbvGr -0.015457 0.006763 -2.285 0.022586 *
## Year.Built 0.004226 0.000313 13.505 < 2e-16 ***
## log(Lot.Area) 0.128404 0.014210 9.036 < 2e-16 ***
## Central.AirY 0.057300 0.018140 3.159 0.001652 **
## Overall.Cond 0.057163 0.004143 13.797 < 2e-16 ***
## HeatingGasA 0.150477 0.106894 1.408 0.159652
## HeatingGasW 0.239269 0.112810 2.121 0.034270 *
## HeatingGrav 0.127009 0.125067 1.016 0.310198
## HeatingOthW 0.004119 0.153851 0.027 0.978646
## HeatingWall 0.164631 0.150046 1.097 0.272925
## Fireplaces 0.027667 0.007585 3.648 0.000284 ***
## Kitchen.QualFa -0.112518 0.033126 -3.397 0.000720 ***
## Kitchen.QualGd -0.089810 0.020619 -4.356 1.52e-05 ***
## Kitchen.QualTA -0.112292 0.022858 -4.913 1.12e-06 ***
## log(Wood.Deck.SF + 1) 0.003180 0.001670 1.904 0.057273 .
## Bldg.Type2fmCon -0.042451 0.024383 -1.741 0.082125 .
## Bldg.TypeDuplex -0.089368 0.023137 -3.863 0.000122 ***
## Bldg.TypeTwnhs -0.036305 0.038390 -0.946 0.344633
## Bldg.TypeTwnhsE 0.002357 0.021867 0.108 0.914199
## House.Style1.5Unf 0.100253 0.055954 1.792 0.073609 .
## House.Style1Story 0.056914 0.015525 3.666 0.000265 ***
## House.Style2.5Fin 0.027220 0.056175 0.485 0.628144
## House.Style2.5Unf 0.069118 0.043429 1.592 0.111941
## House.Style2Story -0.010823 0.015083 -0.718 0.473258
## House.StyleSFoyer 0.127250 0.029984 4.244 2.49e-05 ***
## House.StyleSLvl 0.044750 0.023589 1.897 0.058230 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1034 on 709 degrees of freedom
## Multiple R-squared: 0.9241, Adjusted R-squared: 0.9185
## F-statistic: 162.9 on 53 and 709 DF, p-value: < 2.2e-16
# Extract Predictions
predict_valid <- exp(predict(model_final_AIC_valid, ames_validation))
# Extract Residuals
resid_valid <- ames_validation$price - predict_valid
# Calculate RMSE
rmse_valid<- sqrt(mean(resid_valid^2))
rmse_valid
## [1] 19173.35
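# Refit on the validation data with a reduced formula (first eight predictors only)
# for the prediction-interval check
model_valid <- lm(log(price) ~ Overall.Qual + Neighborhood + log(area) + Bedroom.AbvGr +
Year.Built + log(Lot.Area) + Central.Air + Overall.Cond, data = ames_validation)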
# Predict prices
predict_valid<- exp(predict(model_valid, ames_validation, interval = "prediction"))
# Calculate proportion of observations that fall within prediction intervals
coverage_prob_valid <- mean(ames_validation$price > predict_valid[,"lwr"] &
ames_validation$price < predict_valid[,"upr"])
coverage_prob_valid
## [1] 0.9541284
The final model works even better on the validation set, with an RMSE of 19,173 dollars. Relative to the original RMSE of 22,665, the improvement rises from 12.6% to 15.4%, and the RMSE drops from 14.6% to 12.3% of the median home price.
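The percentage improvements quoted here follow from simple arithmetic on the RMSE values. A short sketch, using the test and validation RMSEs printed above and the earlier RMSE of 22,665 as the baseline (pct_reduction is a hypothetical helper introduced only for this illustration):

```r
# RMSE values taken from the text and the printed output above
rmse_baseline <- 22665      # earlier model
rmse_test     <- 19798.47   # final model, test data
rmse_valid    <- 19173.35   # final model, validation data

# Percentage reduction in RMSE relative to a baseline
pct_reduction <- function(new, old) 100 * (old - new) / old

round(pct_reduction(rmse_test, rmse_baseline), 1)    # 12.6
round(pct_reduction(rmse_valid, rmse_baseline), 1)   # 15.4
```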
The R-squared is high: the model explains 92.4% of the variability in log price, and the adjusted R-squared is 91.8%. Both are slightly worse than the values in the final model summary above (92.7% and 92.2%, respectively), but the RMSE indicates a better absolute fit to the data.
The coverage probability was 95.8% for the training model, 95.2% for the testing model, and 95.8% for the final model, which makes the final model as good a fit as the training model. Since the validation set is out-of-sample relative to the final model, its coverage may be somewhat lower, but if the uncertainty assumptions are met it should not differ significantly from 0.95. As seen above, the coverage probability for the validation set is 95.4%, so the model reflects uncertainty properly at this stage.
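The models above are refit on the same data they are evaluated on. A stricter out-of-sample check would fit once on training rows and score only held-out rows. The sketch below illustrates that workflow on simulated data; the data frame d, its columns, and the 400/200 split are assumptions made for the example, not the Ames files.

```r
# Sketch: fit on training rows only, then compute RMSE and 95% prediction
# interval coverage on held-out rows (simulated data).
set.seed(3)
n <- 600
d <- data.frame(area = exp(rnorm(n, 7.2, 0.3)),
                qual = sample(1:10, n, replace = TRUE))
d$price <- exp(6 + 0.7 * log(d$area) + 0.1 * d$qual + rnorm(n, sd = 0.12))

idx   <- sample(n, 400)
train <- d[idx, ]
hold  <- d[-idx, ]

# Fit once, on the training rows only
fit <- lm(log(price) ~ log(area) + qual, data = train)

# Predict on the held-out rows and back-transform from the log scale
pred <- exp(predict(fit, newdata = hold, interval = "prediction", level = 0.95))

rmse_hold <- sqrt(mean((hold$price - pred[, "fit"])^2))
coverage  <- mean(hold$price > pred[, "lwr"] & hold$price < pred[, "upr"])
c(rmse = rmse_hold, coverage = coverage)
```

If the model reflects uncertainty properly, the held-out coverage should land close to 0.95, as it does for the validation figure reported above.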
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
My final model explains a high share of the variability and fits out-of-sample data closely. It's relatively small for such power, so the model is parsimonious. With so many statistically insignificant neighborhoods, though, I believe it points to variance within a given neighborhood throwing off the predictive model, at least within that section of the city.
Model building is a time-consuming process, plagued as it is by combinatorial and subjective elements. Consequently it is not a linear process, and it's always unfinished. The most pressing consequence of this is change, and what it can do to the elements of a data set and to any previous constraints or conditions. Some will fall, some will rise, and new ones will emerge either organically or through necessity. It's a learning and growing process; sometimes the data helps you and other times it hinders you. I will wonder whether I still have too many or not enough variables, or, quite frankly, even the correct or best ones. My models worked well even from the beginning, and tightening them a bit is always a good feeling, but models are fragile: in five or ten years, or perhaps even two, how good will this one be? Models are always provisional, so it's a good idea not to get attached to them or their performance.
* * *