As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.
load("ames_train.Rdata")
Use the code block below to load any necessary packages
library(statsr)
library(dplyr)
library(BAS)
library(ggplot2)
library(lubridate)
library(MASS)
library(gridExtra)
library(plotly)
library(cowplot)
## Warning: package 'cowplot' was built under R version 3.4.3
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
Before even starting, we have to re-estate the questions we would like to answer with this data set. The question we would like to answer here is “what variables of this dataset as useful to predict a house price in IOWA and what is the contribution of each of the predictors to the overall fitted model if any?”
Based on the provided information and the request analysis from the firm, we want first to explore the data and few potential predictors of house pricing that comes to mind are the house age, the Neighborhood, the quality of the house, the size of the house and more. With that in mind, we will begin by plotting the house count by Year built and see if there is any trend observed.
## Frequency plot of house count by Year.Built
ggplot(data = ames_train, aes(x=year(today())-Year.Built))+
geom_histogram(bins = 30, fill="blue", colour="white")+labs(title="House counts by age", x="House age", y="House count")+
geom_vline(xintercept = median(year(today())-ames_train$Year.Built), colour="red")+
annotate("text", x= median(year(today())-ames_train$Year.Built)-4, y=c(100,-10)
, label=c("Median", median(year(today())-ames_train$Year.Built))
, colour="red", angle=c(90,0))+
geom_vline(xintercept = mean(year(today())-ames_train$Year.Built), colour="#41c42f")+
annotate("text", x= mean(year(today())-ames_train$Year.Built)+4, y=c(100,-10)
, label=c("Mean", round(mean(year(today())-ames_train$Year.Built), 1))
, colour="#41c42f", angle=c(-90,0))+
theme(plot.title = element_text(hjust=0.5)
,title = element_text(hjust = 0.5, colour = "red", size = rel(2))
, axis.title =element_text(size = rel(1), colour = "blue", hjust = .5)
, legend.position = "none"
,panel.background = element_rect(fill = "grey", colour = "black")
, panel.grid = element_line(color = "blue"))
Plotting the above house count by Year.Built shows a multimodel distribution of houses with at least 3 modes. The plot shows also the distribution is extremely right skewed. This tells us that the usage of robust statistics would be more appropriate to discribe this distribution. Let us use below appropriate summary statistics to describe the distribution.
ames_train %>% summarise(Min=min(price), Mean=mean(price), Median=median(price), IQR=IQR(price), Std.dev=sd(price), Max=max(price))
## # A tibble: 1 x 6
## Min Mean Median IQR Std.dev Max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12789 181190.1 159467 83237.5 81909.79 615000
ames_train %>% summarise(Min=min(year(today())-Year.Built), Mean=mean(year(today())-Year.Built), Median=median(year(today())-Year.Built), IQR=IQR(year(today())-Year.Built), Std.dev=sd(year(today())-Year.Built), Max=max(year(today())-Year.Built))
## # A tibble: 1 x 6
## Min Mean Median IQR Std.dev Max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 44.797 42 46 29.63741 145
Above summary statistics shows that median age of this house distribution is 42 Years which means 50% of houses were built less than 42 years ago. The IQR reveals more details by showing that the middle 50% of houses have an age difference of 46 Years.
We can also see on the other summary statistics that the minimum house price is 12789 which is too low fora house built and sold in normal condition. The max price of 615000 may need to be investigated as well.
a1 <- ames_train %>% group_by(Neighborhood) %>% ggplot(aes(x=Neighborhood, y=log(price), fill=factor(Neighborhood))) + geom_boxplot()+labs(title="Log(price) by Neighborhood", xlab="Neighborhood", ylab="Log(Price)", fill="By Neighborhood")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5))
a1
a2 <- ames_train %>% ggplot(aes(x=price/10^5))+geom_histogram(aes(fill=factor(ames_train$Pool.QC)))+labs(title="Price distribution by Pool Quality", xlab="Price in 100K", ylab="House Count", fill="Pool Quality")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5))
a2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
a3 <- ames_train %>% ggplot(aes(x=price/10^5))+geom_histogram(aes(fill=factor(ames_train$Sale.Condition)))+labs(title="Price distribution by Sale.Condition", xlab="Price in 100K", ylab="House Count", fill="Sale.Condition")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5))
a3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
a4 <- ames_train %>% filter(Sale.Condition=="Normal") %>% ggplot(aes(x=price/10^5, fill=Sale.Condition))+geom_histogram()+labs(title="Price distribution by Normal Sale.Condition", xlab="Price in 100K", ylab="House Count", fill="Normal Sale.Condition")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5))
a4
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The boxplot above shows the pricing variation per neighborhood and appears from it that StoneBr is the Neighborhood in IOWA with the highest Inter-Quartile Range (IQR). The plot also shows that the cheapest house in the dataset is in OldTown and the related house price is quite dispersed as compared to other houses in the same neighborhood. This may require further investigations.
This plot also shows that the house price skewness is influence partially by Sales not done in Normal condition.
The most expensive house in the data set is one that ha a swimming pool in good quality.
With this in mind, we will see how to subset the dataset to normallize the house distribution to improve modelling.
## counting all missing values
missing_val <- ames_train %>% summarise_all(funs(sum(is.na(.))))
## sorting the top 3 missing values. Unlist helps converting to vector with variable names as names
head(sort(unlist(missing_val[1,]), decreasing = TRUE), 20)
## Pool.QC Misc.Feature Alley Fence Fireplace.Qu
## 997 971 933 798 491
## Lot.Frontage Garage.Yr.Blt Garage.Qual Garage.Cond Garage.Type
## 167 48 47 47 46
## Garage.Finish Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## 46 21 21 21 21
## BsmtFin.Type.2 Mas.Vnr.Area BsmtFin.SF.1 BsmtFin.SF.2 Bsmt.Unf.SF
## 21 7 1 1 1
ames_train %>% filter(is.na(Garage.Qual) | is.na(Bsmt.Qual)) %>% summarise(Count=n())
## # A tibble: 1 x 1
## Count
## <int>
## 1 64
##
Investigating missing values count and their significance is a critical step in understanding this dataset. The above dataset summary shows that in this train data set of 1000 houses, Only 3 houses are equipped with a swimming pool. This can be justified by the cost of having a swimming pool an operating it. This tells us that the presence of a swimmingpool is likely not a relevant indicator of the house price and might actually not be a relevant predictor of house price. The same analysis applies to Miscellaneous features variable Misc.Feature, Alley, Fence and Fireplace.Qu.
Above summary also shows that there are 47 houses without a garage of with an unfinished garage and 21 houses without basement, filtering both criteria returns 64 houses with NA values related to those variables to be dealt with.
In order to avoid discarding all NAs while fitting the model, all NAs of categorical variables of this dataset will be investigated individually and based on their meaning will be set to another categorical variable that we will call “None”.
P1 <- ames_train %>% ggplot(aes(x=Lot.Area ,y=price))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of house Price by Lot.Area Log Transformed")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))
P1
P2 <- ames_train %>% ggplot(aes(x=log(Lot.Area) ,y=log(price)))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of house Price by Lot.Area Log Transformed")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))
P2
P3 <- ames_train %>% ggplot(aes(x=Overall.Qual ,y=price))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of house Price by Overall quality")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))
P3
P4 <- ames_train %>% ggplot(aes(x=Overall.Qual ,y=log(price)))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of Log of house Price by Overall quality")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))
P4
P5 <- ames_train %>% ggplot(aes(x=Bedroom.AbvGr ,y=log(price)))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of Log of house Price by Bedroom.AvgGr")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))
P5
P6 <- ames_train %>% ggplot(aes(x=log(area) ,y=log(price)))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of Log of house Price by Log of area")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))
P6
P7 <- ames_train %>% ggplot(aes(x=Sale.Condition, y=log(price), fill=Sale.Condition))+geom_boxplot()+labs(title="Scatter plot of Log of house Price by Sale Condition")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))
P7
P8 <- ames_train %>% ggplot(aes(x=Bldg.Type, y=log(price), fill=Bldg.Type))+geom_boxplot()+labs(title="Scatter plot of Log of house Price by Building Type")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))
P8
Below two (02) plots show as well analysis of the relationship of house price and both Lot.Area and Oevrall.Qual. on the left scatter plots are done without log transformation of the price and it shows a high concentration of points around zero (0) making it difficult to really perceive the relationship of the Lot.Area to the price. on the right price and Lot.Area are log transformed, while the overal quality remains the same. we immediately see better the linear relationship between the price and Lot.Area and the line fitted for the Overall.Qual seems to be better as it crosses right in the middle of the dots. These two (02) variables seems to describe some linearity in their relationship to price.
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.
Based on your EDA, select at most 10 predictor variables from âames_trainâ and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
Based on the exploratory data analysis performed, we believe the following 10 variables are the most significant predictor to the house price in this dataset:
Lot.Area, Neighborhood, Area, overall.Qual, Bedroom.AbvGr, Year.Built (that may be transformed into age), Presence of Basement or not (Bsmt.Qual), Garage.Cond, Sale.Condition, Bldg.Type
Some variables selected as part of the following model are based on result of data exploration performed above. For instance, from plotting performed above, it appears that Lot.Area, Neighborhood, Area, Overall.Qual, Year.Built, Sale.Condition seem to form a linear relationship with house price. The remaining variables were added as best guess to this model based on assumptions.
Given also the NA values analysis perfomed above, to avoid a silent discard of the NA values and based on the data dictionary definition of those NA, we decide to convert all NAs in categorical variables to a new factor level called “None” which will be considered while executing the model selection. This is only applicable to the Bsmt.Qual and the Garage.Cond variables which are the only categorical variables with NAs in this model.
NA values on Bsmt.Qual represents the fact that there is no Basement in the house
prepare_data <- function(data=ames_train){
# identifying variables with same values not to be considered in modeling since not significant to the study
var_to_use <- data %>% summarise_all(funs(length(unique(.))))
var_to_use <- names(data.frame(var_to_use)[,var_to_use>1])
# select userful variables only for the model
data <- data %>% select(var_to_use)
# Deal with NA values
## Seperate Numerical from Categorical variables
ames_var_types <- split(names(data), sapply(data, function(x) paste(class(x), collapse = " ")))
## Deal with numerical variables NA values
ames_train_int <- data %>% select(ames_var_types$integer)
#ames_train_int[is.na(ames_train_int)] <- 0
## Dealing with categorical variables NA values
ames_train_fac <- data %>% select(ames_var_types$factor)
ames_train_fac <- sapply(ames_train_fac, as.character)
ames_train_fac[is.na(ames_train_fac)] <- c("None")
ames_train_fac[ames_train_fac==""] <- c("None")
ames_train_fac <- data.frame(ames_train_fac)
## Merging both numerical and categoricals
data <- cbind(ames_train_int, ames_train_fac)
data <- data %>% mutate(House.Age= year(today()) - Year.Built)
data$Year.Built <- NULL
return(data)
}
ames_train <- prepare_data(data=ames_train)
The above script:
creates a new factor level “none” by setting all NA and empty values of all categorical variables in the dataset ames_train to the value “none”.
It also removes all categorical variables with only a single level.
Finally creates a new variable called House.Age which is numerical and removes the variable Year.Built variable
intuitive_model <- lm(formula = log(price) ~ log(Lot.Area) + log(area) +Neighborhood + Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond + Garage.Cond + Sale.Condition + Bldg.Type, data = ames_train)
summary(intuitive_model)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + log(area) + Neighborhood +
## Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond + Garage.Cond +
## Sale.Condition + Bldg.Type, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.37652 -0.07388 0.00082 0.07745 0.59956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.5863353 0.2612093 29.043 < 2e-16 ***
## log(Lot.Area) 0.0881312 0.0159873 5.513 4.56e-08 ***
## log(area) 0.5151260 0.0272478 18.905 < 2e-16 ***
## NeighborhoodBlueste -0.0171610 0.1014559 -0.169 0.865717
## NeighborhoodBrDale -0.1169465 0.0741136 -1.578 0.114913
## NeighborhoodBrkSide -0.0546212 0.0609345 -0.896 0.370271
## NeighborhoodClearCr 0.0307487 0.0690191 0.446 0.656053
## NeighborhoodCollgCr -0.0696761 0.0528234 -1.319 0.187475
## NeighborhoodCrawfor 0.0753758 0.0599847 1.257 0.209213
## NeighborhoodEdwards -0.1093875 0.0550902 -1.986 0.047365 *
## NeighborhoodGilbert -0.1342566 0.0548214 -2.449 0.014506 *
## NeighborhoodGreens 0.1115216 0.0902853 1.235 0.217057
## NeighborhoodGrnHill 0.4105934 0.1189072 3.453 0.000579 ***
## NeighborhoodIDOTRR -0.2110818 0.0628662 -3.358 0.000817 ***
## NeighborhoodMeadowV -0.0986541 0.0639489 -1.543 0.123237
## NeighborhoodMitchel -0.0339515 0.0559659 -0.607 0.544231
## NeighborhoodNAmes -0.0391561 0.0541415 -0.723 0.469725
## NeighborhoodNoRidge 0.0300313 0.0582792 0.515 0.606463
## NeighborhoodNPkVill 0.0017566 0.0924858 0.019 0.984851
## NeighborhoodNridgHt 0.1128681 0.0528356 2.136 0.032917 *
## NeighborhoodNWAmes -0.0831152 0.0564491 -1.472 0.141246
## NeighborhoodOldTown -0.1047175 0.0604668 -1.732 0.083631 .
## NeighborhoodSawyer -0.0433053 0.0561918 -0.771 0.441094
## NeighborhoodSawyerW -0.1169313 0.0546608 -2.139 0.032673 *
## NeighborhoodSomerst -0.0006721 0.0505021 -0.013 0.989384
## NeighborhoodStoneBr 0.1347873 0.0588866 2.289 0.022302 *
## NeighborhoodSWISU -0.1101904 0.0714777 -1.542 0.123503
## NeighborhoodTimber 0.0185630 0.0616569 0.301 0.763427
## NeighborhoodVeenker -0.0428562 0.0701149 -0.611 0.541195
## Overall.Qual 0.0845770 0.0065190 12.974 < 2e-16 ***
## Bedroom.AbvGr -0.0441271 0.0086307 -5.113 3.84e-07 ***
## House.Age -0.0031231 0.0003989 -7.830 1.30e-14 ***
## Bsmt.CondFa -0.2072388 0.1130394 -1.833 0.067066 .
## Bsmt.CondGd -0.0698351 0.1101118 -0.634 0.526090
## Bsmt.CondNone -0.2799906 0.1130547 -2.477 0.013438 *
## Bsmt.CondPo 0.0563993 0.1892885 0.298 0.765803
## Bsmt.CondTA -0.0834522 0.1077999 -0.774 0.439041
## Garage.CondFa -0.5011045 0.1561708 -3.209 0.001378 **
## Garage.CondGd -0.2283402 0.1661071 -1.375 0.169562
## Garage.CondNone -0.3507137 0.1550106 -2.263 0.023891 *
## Garage.CondPo -0.3831565 0.1657232 -2.312 0.020989 *
## Garage.CondTA -0.2934632 0.1539917 -1.906 0.056990 .
## Sale.ConditionAdjLand 0.2268656 0.1160550 1.955 0.050899 .
## Sale.ConditionAlloca 0.2985431 0.0801409 3.725 0.000207 ***
## Sale.ConditionFamily -0.0596883 0.0423659 -1.409 0.159199
## Sale.ConditionNormal 0.1053826 0.0205939 5.117 3.75e-07 ***
## Sale.ConditionPartial 0.1331717 0.0286052 4.656 3.69e-06 ***
## Bldg.Type2fmCon 0.0038722 0.0362826 0.107 0.915030
## Bldg.TypeDuplex -0.0757817 0.0298963 -2.535 0.011410 *
## Bldg.TypeTwnhs -0.1454875 0.0398270 -3.653 0.000273 ***
## Bldg.TypeTwnhsE -0.0804046 0.0255017 -3.153 0.001667 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.151 on 949 degrees of freedom
## Multiple R-squared: 0.8776, Adjusted R-squared: 0.8712
## F-statistic: 136.1 on 50 and 949 DF, p-value: < 2.2e-16
The model summary shows that all variables of the intuitive model are to the model based on the P-Value criteria. This is backed up by the R-squared value which is quite high 0.8723.
Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
We start on the basis of equi-probability of belonging to the final model for each variable of the initial model.
# Using the BIC
bic_model <- bas.lm(log(price) ~ log(Lot.Area) + log(area) +Neighborhood + Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond + Garage.Cond + Sale.Condition + Bldg.Type, data = ames_train, prior = "BIC", modelprior = uniform())
summary(bic_model)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## log(Lot.Area) 1.00000000 1.0000 1.0000000 1.0000000
## log(area) 1.00000000 1.0000 1.0000000 1.0000000
## NeighborhoodBlueste 0.02039545 0.0000 0.0000000 0.0000000
## NeighborhoodBrDale 0.06094873 0.0000 0.0000000 0.0000000
## NeighborhoodBrkSide 0.03184298 0.0000 0.0000000 0.0000000
## NeighborhoodClearCr 0.10514497 0.0000 0.0000000 0.0000000
## NeighborhoodCollgCr 0.42304918 1.0000 1.0000000 0.0000000
## NeighborhoodCrawfor 0.99176547 1.0000 1.0000000 1.0000000
## NeighborhoodEdwards 0.72424873 1.0000 1.0000000 1.0000000
## NeighborhoodGilbert 0.99821304 1.0000 1.0000000 1.0000000
## NeighborhoodGreens 0.09011289 0.0000 0.0000000 0.0000000
## NeighborhoodGrnHill 0.97251381 1.0000 1.0000000 1.0000000
## NeighborhoodIDOTRR 0.99793365 1.0000 1.0000000 1.0000000
## NeighborhoodMeadowV 0.07608799 0.0000 0.0000000 0.0000000
## NeighborhoodMitchel 0.02036033 0.0000 0.0000000 0.0000000
## NeighborhoodNAmes 0.05492988 0.0000 0.0000000 0.0000000
## NeighborhoodNoRidge 0.20083919 0.0000 0.0000000 0.0000000
## NeighborhoodNPkVill 0.02339286 0.0000 0.0000000 0.0000000
## NeighborhoodNridgHt 0.99961186 1.0000 1.0000000 1.0000000
## NeighborhoodNWAmes 0.20339322 0.0000 0.0000000 0.0000000
## NeighborhoodOldTown 0.20675032 0.0000 0.0000000 0.0000000
## NeighborhoodSawyer 0.02100306 0.0000 0.0000000 0.0000000
## NeighborhoodSawyerW 0.95281915 1.0000 1.0000000 1.0000000
## NeighborhoodSomerst 0.05676248 0.0000 0.0000000 0.0000000
## NeighborhoodStoneBr 0.97453751 1.0000 1.0000000 1.0000000
## NeighborhoodSWISU 0.02282877 0.0000 0.0000000 0.0000000
## NeighborhoodTimber 0.04166532 0.0000 0.0000000 0.0000000
## NeighborhoodVeenker 0.02700343 0.0000 0.0000000 0.0000000
## Overall.Qual 1.00000000 1.0000 1.0000000 1.0000000
## Bedroom.AbvGr 0.99999175 1.0000 1.0000000 1.0000000
## House.Age 1.00000000 1.0000 1.0000000 1.0000000
## Bsmt.CondFa 0.95993576 1.0000 1.0000000 1.0000000
## Bsmt.CondGd 0.02180072 0.0000 0.0000000 0.0000000
## Bsmt.CondNone 0.99997311 1.0000 1.0000000 1.0000000
## Bsmt.CondPo 0.02386114 0.0000 0.0000000 0.0000000
## Bsmt.CondTA 0.02667441 0.0000 0.0000000 0.0000000
## Garage.CondFa 0.99981400 1.0000 1.0000000 1.0000000
## Garage.CondGd 0.06148967 0.0000 0.0000000 0.0000000
## Garage.CondNone 0.27741195 0.0000 0.0000000 0.0000000
## Garage.CondPo 0.05564299 0.0000 0.0000000 0.0000000
## Garage.CondTA 0.09129333 0.0000 0.0000000 0.0000000
## Sale.ConditionAdjLand 0.05288864 0.0000 0.0000000 0.0000000
## Sale.ConditionAlloca 0.95959351 1.0000 1.0000000 1.0000000
## Sale.ConditionFamily 0.08199673 0.0000 0.0000000 0.0000000
## Sale.ConditionNormal 0.99999846 1.0000 1.0000000 1.0000000
## Sale.ConditionPartial 0.99998607 1.0000 1.0000000 1.0000000
## Bldg.Type2fmCon 0.02044057 0.0000 0.0000000 0.0000000
## Bldg.TypeDuplex 0.34352837 0.0000 1.0000000 0.0000000
## Bldg.TypeTwnhs 0.94048871 1.0000 1.0000000 1.0000000
## Bldg.TypeTwnhsE 0.68216647 1.0000 1.0000000 1.0000000
## BF NA 1.0000 0.7988378 0.7216875
## PostProbs NA 0.0115 0.0092000 0.0083000
## R2 NA 0.8709 0.8717000 0.8699000
## dim NA 23.0000 24.0000000 22.0000000
## logmarg NA -1643.3158 -1643.5404087 -1643.6419743
## model 4 model 5
## Intercept 1.0000000 1.0000000
## log(Lot.Area) 1.0000000 1.0000000
## log(area) 1.0000000 1.0000000
## NeighborhoodBlueste 0.0000000 0.0000000
## NeighborhoodBrDale 0.0000000 0.0000000
## NeighborhoodBrkSide 0.0000000 0.0000000
## NeighborhoodClearCr 0.0000000 0.0000000
## NeighborhoodCollgCr 1.0000000 1.0000000
## NeighborhoodCrawfor 1.0000000 1.0000000
## NeighborhoodEdwards 1.0000000 1.0000000
## NeighborhoodGilbert 1.0000000 1.0000000
## NeighborhoodGreens 0.0000000 0.0000000
## NeighborhoodGrnHill 1.0000000 1.0000000
## NeighborhoodIDOTRR 1.0000000 1.0000000
## NeighborhoodMeadowV 0.0000000 0.0000000
## NeighborhoodMitchel 0.0000000 0.0000000
## NeighborhoodNAmes 0.0000000 0.0000000
## NeighborhoodNoRidge 0.0000000 0.0000000
## NeighborhoodNPkVill 0.0000000 0.0000000
## NeighborhoodNridgHt 1.0000000 1.0000000
## NeighborhoodNWAmes 1.0000000 1.0000000
## NeighborhoodOldTown 0.0000000 0.0000000
## NeighborhoodSawyer 0.0000000 0.0000000
## NeighborhoodSawyerW 1.0000000 1.0000000
## NeighborhoodSomerst 0.0000000 0.0000000
## NeighborhoodStoneBr 1.0000000 1.0000000
## NeighborhoodSWISU 0.0000000 0.0000000
## NeighborhoodTimber 0.0000000 0.0000000
## NeighborhoodVeenker 0.0000000 0.0000000
## Overall.Qual 1.0000000 1.0000000
## Bedroom.AbvGr 1.0000000 1.0000000
## House.Age 1.0000000 1.0000000
## Bsmt.CondFa 1.0000000 1.0000000
## Bsmt.CondGd 0.0000000 0.0000000
## Bsmt.CondNone 1.0000000 1.0000000
## Bsmt.CondPo 0.0000000 0.0000000
## Bsmt.CondTA 0.0000000 0.0000000
## Garage.CondFa 1.0000000 1.0000000
## Garage.CondGd 0.0000000 0.0000000
## Garage.CondNone 0.0000000 0.0000000
## Garage.CondPo 0.0000000 0.0000000
## Garage.CondTA 0.0000000 0.0000000
## Sale.ConditionAdjLand 0.0000000 0.0000000
## Sale.ConditionAlloca 1.0000000 1.0000000
## Sale.ConditionFamily 0.0000000 0.0000000
## Sale.ConditionNormal 1.0000000 1.0000000
## Sale.ConditionPartial 1.0000000 1.0000000
## Bldg.Type2fmCon 0.0000000 0.0000000
## Bldg.TypeDuplex 1.0000000 0.0000000
## Bldg.TypeTwnhs 1.0000000 1.0000000
## Bldg.TypeTwnhsE 1.0000000 1.0000000
## BF 0.5883681 0.5055826
## PostProbs 0.0068000 0.0058000
## R2 0.8725000 0.8716000
## dim 25.0000000 24.0000000
## logmarg -1643.8462137 -1643.9978551
image(bic_model, rotate = FALSE)
Using the BIC model selection, all variables from the initial model are included in the best model selected except the Garage.Cond variable. Selected variables are indicated by the value 1 of the Bayesian Factor and the posterior probability of belonging to the model is as well displayed. on average, the variable Bldg.Type appears to have the lowest posterior probability in the best model selected.
bas_model_AIC <- bas.lm(log(price) ~ log(Lot.Area) + log(area) +Neighborhood + Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond + Garage.Cond + Sale.Condition + Bldg.Type, data = ames_train, prior = "AIC", modelprior = uniform())
summary(bas_model_AIC)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## log(Lot.Area) 1.00000000 1.0000 1.0000000 1.0000000
## log(area) 1.00000000 1.0000 1.0000000 1.0000000
## NeighborhoodBlueste 0.12846933 0.0000 0.0000000 0.0000000
## NeighborhoodBrDale 0.67297799 1.0000 1.0000000 1.0000000
## NeighborhoodBrkSide 0.14410069 0.0000 0.0000000 0.0000000
## NeighborhoodClearCr 0.47423069 1.0000 1.0000000 1.0000000
## NeighborhoodCollgCr 0.99010283 1.0000 1.0000000 1.0000000
## NeighborhoodCrawfor 0.99959374 1.0000 1.0000000 1.0000000
## NeighborhoodEdwards 0.99895678 1.0000 1.0000000 1.0000000
## NeighborhoodGilbert 0.99999529 1.0000 1.0000000 1.0000000
## NeighborhoodGreens 0.70807406 1.0000 1.0000000 1.0000000
## NeighborhoodGrnHill 0.99981682 1.0000 1.0000000 1.0000000
## NeighborhoodIDOTRR 0.99999765 1.0000 1.0000000 1.0000000
## NeighborhoodMeadowV 0.69025188 1.0000 1.0000000 1.0000000
## NeighborhoodMitchel 0.14986329 0.0000 0.0000000 0.0000000
## NeighborhoodNAmes 0.19839750 0.0000 0.0000000 0.0000000
## NeighborhoodNoRidge 0.46099683 1.0000 0.0000000 1.0000000
## NeighborhoodNPkVill 0.07347804 0.0000 0.0000000 0.0000000
## NeighborhoodNridgHt 0.99999582 1.0000 1.0000000 1.0000000
## NeighborhoodNWAmes 0.95471751 1.0000 1.0000000 1.0000000
## NeighborhoodOldTown 0.92324730 1.0000 1.0000000 1.0000000
## NeighborhoodSawyer 0.12904437 0.0000 0.0000000 0.0000000
## NeighborhoodSawyerW 0.99986213 1.0000 1.0000000 1.0000000
## NeighborhoodSomerst 0.20027242 0.0000 0.0000000 0.0000000
## NeighborhoodStoneBr 0.99980097 1.0000 1.0000000 1.0000000
## NeighborhoodSWISU 0.36458743 1.0000 1.0000000 0.0000000
## NeighborhoodTimber 0.15043017 0.0000 0.0000000 0.0000000
## NeighborhoodVeenker 0.15642752 0.0000 0.0000000 0.0000000
## Overall.Qual 1.00000000 1.0000 1.0000000 1.0000000
## Bedroom.AbvGr 0.99999982 1.0000 1.0000000 1.0000000
## House.Age 1.00000000 1.0000 1.0000000 1.0000000
## Bsmt.CondFa 0.99885843 1.0000 1.0000000 1.0000000
## Bsmt.CondGd 0.15158859 0.0000 0.0000000 0.0000000
## Bsmt.CondNone 0.99995206 1.0000 1.0000000 1.0000000
## Bsmt.CondPo 0.19458851 0.0000 0.0000000 0.0000000
## Bsmt.CondTA 0.21737386 0.0000 0.0000000 0.0000000
## Garage.CondFa 0.99961403 1.0000 1.0000000 1.0000000
## Garage.CondGd 0.31311583 0.0000 0.0000000 0.0000000
## Garage.CondNone 0.98189573 1.0000 1.0000000 1.0000000
## Garage.CondPo 0.95173539 1.0000 1.0000000 1.0000000
## Garage.CondTA 0.80772557 1.0000 1.0000000 1.0000000
## Sale.ConditionAdjLand 0.79540444 1.0000 1.0000000 1.0000000
## Sale.ConditionAlloca 0.99982165 1.0000 1.0000000 1.0000000
## Sale.ConditionFamily 0.68989393 1.0000 1.0000000 1.0000000
## Sale.ConditionNormal 0.99999978 1.0000 1.0000000 1.0000000
## Sale.ConditionPartial 0.99999804 1.0000 1.0000000 1.0000000
## Bldg.Type2fmCon 0.05890929 0.0000 0.0000000 0.0000000
## Bldg.TypeDuplex 0.99082333 1.0000 1.0000000 1.0000000
## Bldg.TypeTwnhs 0.99964138 1.0000 1.0000000 1.0000000
## Bldg.TypeTwnhsE 0.99764979 1.0000 1.0000000 1.0000000
## BF NA 1.0000 0.9891761 0.9888365
## PostProbs NA 0.0002 0.0002000 0.0002000
## R2 NA 0.8767 0.8764000 0.8764000
## dim NA 37.0000 36.0000000 36.0000000
## logmarg NA -1577.7603 -1577.7711587 -1577.7715020
## model 4 model 5
## Intercept 1.0000000 1.0000000
## log(Lot.Area) 1.0000000 1.0000000
## log(area) 1.0000000 1.0000000
## NeighborhoodBlueste 0.0000000 0.0000000
## NeighborhoodBrDale 1.0000000 1.0000000
## NeighborhoodBrkSide 0.0000000 0.0000000
## NeighborhoodClearCr 1.0000000 1.0000000
## NeighborhoodCollgCr 1.0000000 1.0000000
## NeighborhoodCrawfor 1.0000000 1.0000000
## NeighborhoodEdwards 1.0000000 1.0000000
## NeighborhoodGilbert 1.0000000 1.0000000
## NeighborhoodGreens 1.0000000 1.0000000
## NeighborhoodGrnHill 1.0000000 1.0000000
## NeighborhoodIDOTRR 1.0000000 1.0000000
## NeighborhoodMeadowV 1.0000000 1.0000000
## NeighborhoodMitchel 0.0000000 0.0000000
## NeighborhoodNAmes 0.0000000 0.0000000
## NeighborhoodNoRidge 0.0000000 1.0000000
## NeighborhoodNPkVill 0.0000000 0.0000000
## NeighborhoodNridgHt 1.0000000 1.0000000
## NeighborhoodNWAmes 1.0000000 1.0000000
## NeighborhoodOldTown 1.0000000 1.0000000
## NeighborhoodSawyer 0.0000000 0.0000000
## NeighborhoodSawyerW 1.0000000 1.0000000
## NeighborhoodSomerst 0.0000000 0.0000000
## NeighborhoodStoneBr 1.0000000 1.0000000
## NeighborhoodSWISU 0.0000000 0.0000000
## NeighborhoodTimber 0.0000000 0.0000000
## NeighborhoodVeenker 0.0000000 0.0000000
## Overall.Qual 1.0000000 1.0000000
## Bedroom.AbvGr 1.0000000 1.0000000
## House.Age 1.0000000 1.0000000
## Bsmt.CondFa 1.0000000 1.0000000
## Bsmt.CondGd 0.0000000 0.0000000
## Bsmt.CondNone 1.0000000 1.0000000
## Bsmt.CondPo 0.0000000 0.0000000
## Bsmt.CondTA 0.0000000 0.0000000
## Garage.CondFa 1.0000000 1.0000000
## Garage.CondGd 0.0000000 0.0000000
## Garage.CondNone 1.0000000 1.0000000
## Garage.CondPo 1.0000000 1.0000000
## Garage.CondTA 1.0000000 1.0000000
## Sale.ConditionAdjLand 1.0000000 1.0000000
## Sale.ConditionAlloca 1.0000000 1.0000000
## Sale.ConditionFamily 1.0000000 0.0000000
## Sale.ConditionNormal 1.0000000 1.0000000
## Sale.ConditionPartial 1.0000000 1.0000000
## Bldg.Type2fmCon 0.0000000 0.0000000
## Bldg.TypeDuplex 1.0000000 1.0000000
## Bldg.TypeTwnhs 1.0000000 1.0000000
## Bldg.TypeTwnhsE 1.0000000 1.0000000
## BF 0.9562441 0.9562177
## PostProbs 0.0002000 0.0002000
## R2 0.8762000 0.8762000
## dim 35.0000000 35.0000000
## logmarg -1577.8050178 -1577.8050454
image(bas_model_AIC, rotate = FALSE)
Using the AIC model selection, the model 1 selected contains all the variables from the initial model and looking at the image plot, we can see that the step AIC is more inclusive of the new variables created as a result of the categorical variables and their levels combinations. This is justified by the fact that the AIC model penalty for additional \(n\) parameters is set to \(k=2\) which in most cases is less than the one of the BIC for which \(k=log(n)\) with \(n\) being the number of observations. BIC is more parsimonious. We will stick to the BIC method during this analysis.
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
Let us first refit the model selected by the BIC model and proceed with residuals analysis.
Plotting residuals helps verifying conditions for MLR. We will go throught the conditions for Multilinear Regressions.
selected_model_bic <- lm(formula = log(price) ~ log(Lot.Area) + log(area) +Neighborhood + Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond + Sale.Condition + Bldg.Type, data = ames_train)
hist(selected_model_bic$residuals, breaks = 100)
plot(selected_model_bic$residuals, type="p")
The scatter plot of residuals shows that observations are scattered around 0. We can also see that the histogram of residuals is centered around zero (0) which shows nearly normality. We can however see a left skewness of the plot showing presence of potential outliers. The theoretical quantiles plot shows some irregular linearlity showing as well the presence of outliers.
dev.off()
## null device
## 1
plot(selected_model_bic)
## Warning: not plotting observations with leverage one:
## 516
## Warning: not plotting observations with leverage one:
## 516
plot(abs(selected_model_bic$residuals) ~ selected_model_bic$fitted.values)
q1 <- ggplot(data = ames_train, aes(x=selected_model_bic$fitted.values, y=log(ames_train$price)))+geom_point(aes(color=ames_train$Sale.Condition)) + geom_smooth(method = "lm", aes(color="red"))
q1
q2 <- ggplot(data = ames_train, aes(x=selected_model_bic$fitted.values, y=log(ames_train$price)))+geom_point(aes(color=ames_train$Pool.QC)) + geom_smooth(method = "lm", aes(color="red"))
q2
q3 <- ggplot(data = ames_train, aes(x=selected_model_bic$fitted.values, y=log(ames_train$price)))+geom_point(aes(color=ames_train$Bsmt.Cond)) + geom_smooth(method = "lm", aes(color="red"))
q3
q4 <- ggplot(data = ames_train, aes(x=selected_model_bic$fitted.values, y=log(ames_train$price)))+geom_point(aes(color=ames_train$Garage.Cond)) + geom_smooth(method = "lm", aes(color="red"))
q4
ames_train[names(head(sort(selected_model_bic$residuals), 10)),c("price","House.Age", "Sale.Condition", "Neighborhood", "Bsmt.Cond", "Garage.Cond", "Overall.Qual", "Overall.Cond", "Lot.Area", "Lot.Frontage")]
## price House.Age Sale.Condition Neighborhood Bsmt.Cond Garage.Cond
## 428 12789 94 Abnorml OldTown Fa Fa
## 310 184750 10 Partial Edwards TA TA
## 741 40000 97 Normal IDOTRR TA Fa
## 276 150000 40 Family Veenker TA TA
## 181 82500 40 Family NWAmes TA TA
## 559 34900 97 Abnorml IDOTRR TA None
## 511 63000 83 Normal Edwards TA Po
## 80 67500 97 Normal SawyerW TA TA
## 233 97500 69 Abnorml NAmes TA TA
## 820 104000 66 Normal Crawfor TA TA
## Overall.Qual Overall.Cond Lot.Area Lot.Frontage
## 428 2 2 9656 68
## 310 10 5 40094 130
## 741 4 4 8500 50
## 276 9 3 24572 NA
## 181 7 5 11900 85
## 559 4 5 7879 60
## 511 5 3 9780 60
## 80 5 5 9800 70
## 233 6 5 17503 78
## 820 6 5 7801 79
From the above plot, we can see that several observations are appearing to be outliers to the rest of observation. Investigating them deeper taking into account their relationship to variables within the current model and we observed that among the top 10 observations with the highest residuals:
7/10 are more than 65 years old and the 5 cheapest are more than 80 years old.
You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
The RMSE is calculated below in dollar for our model. We use the predict function for in-sample prediction based on our initial inituive model. Since the price was log transformed, we use exponential function to return the prediction to dollar values before calculating the residuals and then calculate the RMSE of the model.
predict_intuitive <- exp(predict(selected_model_bic))
resid_intuitive <- ames_train$price - predict_intuitive
rmse_intuitive <- sqrt(mean(resid_intuitive^2))
rmse_intuitive
## [1] 29984.45
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called âoverfitting.â To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
While attempting to calculate the RMSE, we noticed that the test dataset provided has some differences which prevented the calculation to be straight forward. For instance, all observations in the ames_test dataset have:
The new variable House.Age is not present in the test dataset and the Year.Built Variable was still present
To address all the above we:
We treated the variables with the function prepare_data above which removes NA and replaced with th new level none, replaced the variable Year.Built by the variable House.Age as in the trian dataset
The out-sample RMSE was then calculated for this linear regression and the value \(191810.5\) was found for it. Comparing it to the in-sample RMSE, we can clearly see that the out-sample prediction RMSE & the in-sample prediction RMSE \(29984.45\) difference is too large for this model.
This may be due to possible outliers in the dataset. Several potential outliying observations were taken note of in the previous model analysis.
Additionally, the model generally fits better within the sample it was built on (in-sample) and poorly with a different sample (out-sample). Further investigations and finetuning are required to improve this model.
# Calculating the RMSE value of the intuitive model based on the ames_test dataset
ames_test_new <- prepare_data(data=ames_test)
# restoring the Sale.Condition removed in the variable treatment
ames_test_new$Sale.Condition <- ames_test$Sale.Condition
# Reconcilling the levels of the model with those of the train dataset to avoid errors while running the predict function.
var_list_bic <- names(selected_model_bic$xlevels)
selected_model_bic$xlevels <- sapply(var_list_bic, function(x) union(selected_model_bic$xlevels[[x]], levels(ames_test_new[[x]])))
predict_outsample <- exp(predict(selected_model_bic, newdata=ames_test_new))
## Warning in predict.lm(selected_model_bic, newdata = ames_test_new):
## prediction from a rank-deficient fit may be misleading
resid_outsample <- ames_test$price - predict_outsample
rmse_outsample <- sqrt(mean(resid_outsample^2, na.rm = TRUE))
rmse_outsample
## [1] 191810.5
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Provide the summary table for your model.
The final model extracted from the full model using the BIC method returns a single variable Overall.Qual with an R-square of roughtly 0.70 which would mean the Overall.Qual on its own explains 70% of the house price in this dataset. However, based on our EDA, we know that other variables such as the Neiborhood, the Lot.Area and others are statistically significant predictors of the house price. Hence we will use the AIC method in this case which is more inclusive to selecte a model with a little bit more variables than the currently returned one.
full_model_lm <- lm(formula = log(price)~., data = na.omit(ames_train))
n <- nrow(na.omit(ames_train))
final_model_BIC <- stepAIC(full_model_lm, direction = "backward", k=n, trace = 0, steps = 1000)
print("Final model based on BIC")
## [1] "Final model based on BIC"
summary(final_model_BIC)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual, data = na.omit(ames_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.53757 -0.11989 0.01186 0.13492 0.94285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.494269 0.037090 282.94 <2e-16 ***
## Overall.Qual 0.249820 0.005859 42.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2338 on 779 degrees of freedom
## Multiple R-squared: 0.7001, Adjusted R-squared: 0.6997
## F-statistic: 1818 on 1 and 779 DF, p-value: < 2.2e-16
# This function will do the automatic model selection and yield the best model based on selection criteria
final_model_AIC <- stepAIC(full_model_lm, direction = "backward", k=2, trace = 0, steps = 1000)
print("Final model based on AIC")
## [1] "Final model based on AIC"
summary(final_model_AIC)
##
## Call:
## lm(formula = log(price) ~ area + Lot.Frontage + Overall.Qual +
## Overall.Cond + Year.Remod.Add + BsmtFin.SF.1 + BsmtFin.SF.2 +
## X1st.Flr.SF + Bsmt.Full.Bath + Kitchen.AbvGr + Fireplaces +
## Garage.Cars + Enclosed.Porch + X3Ssn.Porch + Screen.Porch +
## MS.Zoning + Lot.Shape + Lot.Config + Land.Slope + Neighborhood +
## Condition.1 + Condition.2 + Bldg.Type + Exterior.2nd + Exter.Qual +
## Exter.Cond + Foundation + Bsmt.Qual + Bsmt.Exposure + Heating +
## Central.Air + Kitchen.Qual + Functional + Garage.Type + Garage.Qual +
## Garage.Cond + Sale.Type + Sale.Condition + House.Age, data = na.omit(ames_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81141 -0.05332 0.00198 0.04965 0.66353
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.930e+00 7.274e-01 10.901 < 2e-16 ***
## area 2.903e-04 1.553e-05 18.688 < 2e-16 ***
## Lot.Frontage -5.980e-04 2.732e-04 -2.189 0.028980 *
## Overall.Qual 4.060e-02 6.312e-03 6.431 2.47e-10 ***
## Overall.Cond 4.451e-02 5.639e-03 7.892 1.28e-14 ***
## Year.Remod.Add 8.696e-04 3.512e-04 2.476 0.013529 *
## BsmtFin.SF.1 8.113e-05 1.497e-05 5.420 8.45e-08 ***
## BsmtFin.SF.2 4.458e-05 2.926e-05 1.523 0.128129
## X1st.Flr.SF 7.318e-05 1.895e-05 3.863 0.000124 ***
## Bsmt.Full.Bath 2.586e-02 1.166e-02 2.217 0.026979 *
## Kitchen.AbvGr 1.304e-01 4.717e-02 2.765 0.005860 **
## Fireplaces 1.656e-02 8.909e-03 1.859 0.063511 .
## Garage.Cars 3.261e-02 1.036e-02 3.147 0.001728 **
## Enclosed.Porch 1.414e-04 8.342e-05 1.694 0.090656 .
## X3Ssn.Porch 2.846e-04 1.285e-04 2.214 0.027152 *
## Screen.Porch 1.704e-04 7.717e-05 2.209 0.027553 *
## MS.ZoningFV 3.616e-01 6.987e-02 5.175 3.05e-07 ***
## MS.ZoningRH 4.993e-01 7.965e-02 6.269 6.68e-10 ***
## MS.ZoningRL 4.566e-01 6.044e-02 7.553 1.46e-13 ***
## MS.ZoningRM 3.617e-01 5.548e-02 6.520 1.42e-10 ***
## Lot.ShapeIR2 2.876e-02 2.793e-02 1.030 0.303529
## Lot.ShapeIR3 3.369e-01 7.783e-02 4.329 1.73e-05 ***
## Lot.ShapeReg -3.381e-03 1.068e-02 -0.317 0.751656
## Lot.ConfigCulDSac -3.945e-02 2.542e-02 -1.552 0.121121
## Lot.ConfigFR2 -5.461e-02 2.646e-02 -2.064 0.039412 *
## Lot.ConfigFR3 -1.396e-01 6.661e-02 -2.095 0.036536 *
## Lot.ConfigInside -3.436e-02 1.183e-02 -2.904 0.003815 **
## Land.SlopeMod -1.417e-02 2.464e-02 -0.575 0.565489
## Land.SlopeSev -2.065e-01 9.578e-02 -2.156 0.031482 *
## NeighborhoodBlueste 6.961e-02 8.468e-02 0.822 0.411377
## NeighborhoodBrDale 7.097e-02 7.033e-02 1.009 0.313254
## NeighborhoodBrkSide 1.326e-01 6.125e-02 2.165 0.030721 *
## NeighborhoodClearCr 1.007e-01 7.317e-02 1.377 0.169015
## NeighborhoodCollgCr -5.692e-03 4.752e-02 -0.120 0.904687
## NeighborhoodCrawfor 1.815e-01 5.721e-02 3.172 0.001583 **
## NeighborhoodEdwards -4.248e-03 5.157e-02 -0.082 0.934381
## NeighborhoodGilbert 5.458e-03 4.960e-02 0.110 0.912414
## NeighborhoodGreens 9.118e-02 8.161e-02 1.117 0.264313
## NeighborhoodIDOTRR 7.521e-02 6.823e-02 1.102 0.270766
## NeighborhoodMeadowV -4.861e-02 7.184e-02 -0.677 0.498854
## NeighborhoodMitchel 3.263e-02 5.206e-02 0.627 0.530971
## NeighborhoodNAmes 2.700e-03 5.119e-02 0.053 0.957945
## NeighborhoodNoRidge 6.638e-02 5.465e-02 1.215 0.224984
## NeighborhoodNPkVill 1.007e-01 1.185e-01 0.849 0.395919
## NeighborhoodNridgHt 1.170e-01 4.685e-02 2.496 0.012795 *
## NeighborhoodNWAmes -2.596e-02 5.336e-02 -0.487 0.626707
## NeighborhoodOldTown 1.619e-02 6.243e-02 0.259 0.795415
## NeighborhoodSawyer 7.070e-02 5.316e-02 1.330 0.183993
## NeighborhoodSawyerW -1.085e-02 4.948e-02 -0.219 0.826487
## NeighborhoodSomerst 1.796e-01 5.343e-02 3.361 0.000823 ***
## NeighborhoodStoneBr 1.730e-01 5.168e-02 3.348 0.000862 ***
## NeighborhoodSWISU 3.746e-02 6.515e-02 0.575 0.565494
## NeighborhoodTimber 5.207e-02 5.404e-02 0.964 0.335627
## NeighborhoodVeenker 1.007e-01 6.162e-02 1.634 0.102739
## Condition.1Feedr -3.234e-02 3.992e-02 -0.810 0.418148
## Condition.1Norm 5.170e-02 3.352e-02 1.543 0.123437
## Condition.1PosA 9.648e-02 5.832e-02 1.654 0.098573 .
## Condition.1PosN 1.059e-01 5.822e-02 1.819 0.069392 .
## Condition.1RRAe 1.244e-02 5.371e-02 0.232 0.816940
## Condition.1RRAn -2.675e-02 5.184e-02 -0.516 0.606011
## Condition.2Feedr 2.724e-01 1.478e-01 1.843 0.065856 .
## Condition.2Norm 3.870e-01 1.395e-01 2.774 0.005703 **
## Condition.2PosA 3.196e-01 1.868e-01 1.711 0.087596 .
## Condition.2PosN -5.597e-01 1.745e-01 -3.207 0.001407 **
## Condition.2RRNn 3.974e-01 1.830e-01 2.172 0.030241 *
## Bldg.Type2fmCon -1.569e-02 4.244e-02 -0.370 0.711766
## Bldg.TypeDuplex -1.703e-01 4.793e-02 -3.553 0.000409 ***
## Bldg.TypeTwnhs -1.589e-01 3.275e-02 -4.852 1.54e-06 ***
## Bldg.TypeTwnhsE -9.328e-02 2.261e-02 -4.126 4.18e-05 ***
## Exterior.2ndBrk Cmn 2.037e-01 1.382e-01 1.474 0.141029
## Exterior.2ndBrkFace 3.468e-01 5.114e-02 6.781 2.71e-11 ***
## Exterior.2ndCmentBd 2.493e-01 5.390e-02 4.625 4.53e-06 ***
## Exterior.2ndHdBoard 2.538e-01 4.651e-02 5.458 6.87e-08 ***
## Exterior.2ndImStucc 2.195e-01 6.187e-02 3.548 0.000416 ***
## Exterior.2ndMetalSd 3.001e-01 4.487e-02 6.687 4.95e-11 ***
## Exterior.2ndPlywood 2.429e-01 4.804e-02 5.058 5.55e-07 ***
## Exterior.2ndStucco 2.968e-01 5.731e-02 5.178 3.00e-07 ***
## Exterior.2ndVinylSd 2.798e-01 4.586e-02 6.102 1.81e-09 ***
## Exterior.2ndWd Sdng 3.021e-01 4.597e-02 6.572 1.03e-10 ***
## Exterior.2ndWd Shng 2.706e-01 4.989e-02 5.424 8.26e-08 ***
## Exter.QualFa 1.002e-01 5.848e-02 1.713 0.087264 .
## Exter.QualGd -3.819e-02 2.893e-02 -1.320 0.187315
## Exter.QualTA -6.860e-02 3.345e-02 -2.051 0.040715 *
## Exter.CondFa -1.436e-01 7.869e-02 -1.825 0.068393 .
## Exter.CondGd 3.875e-04 6.688e-02 0.006 0.995379
## Exter.CondTA 2.412e-02 6.584e-02 0.366 0.714229
## FoundationCBlock 1.469e-02 2.012e-02 0.730 0.465493
## FoundationPConc 5.204e-02 2.168e-02 2.400 0.016679 *
## FoundationSlab 4.752e-02 7.795e-02 0.610 0.542357
## FoundationStone 2.274e-02 6.921e-02 0.328 0.742646
## Bsmt.QualFa -1.756e-01 3.913e-02 -4.487 8.54e-06 ***
## Bsmt.QualGd -5.573e-02 2.061e-02 -2.704 0.007041 **
## Bsmt.QualNone -1.492e-01 1.261e-01 -1.184 0.237041
## Bsmt.QualPo 9.760e-01 1.844e-01 5.293 1.65e-07 ***
## Bsmt.QualTA -8.146e-02 2.665e-02 -3.057 0.002330 **
## Bsmt.ExposureGd 5.316e-02 1.922e-02 2.766 0.005831 **
## Bsmt.ExposureMn -3.019e-02 1.842e-02 -1.639 0.101605
## Bsmt.ExposureNo -2.574e-02 1.284e-02 -2.004 0.045448 *
## Bsmt.ExposureNone -7.072e-02 1.083e-01 -0.653 0.513844
## HeatingGasW 1.701e-01 5.564e-02 3.056 0.002334 **
## HeatingGrav 3.124e-01 1.576e-01 1.983 0.047818 *
## HeatingWall 1.269e-01 1.252e-01 1.014 0.311036
## Central.AirY 1.692e-01 2.992e-02 5.656 2.34e-08 ***
## Kitchen.QualFa -9.575e-02 4.358e-02 -2.197 0.028382 *
## Kitchen.QualGd -3.295e-02 2.350e-02 -1.402 0.161397
## Kitchen.QualPo 2.561e-01 1.451e-01 1.764 0.078136 .
## Kitchen.QualTA -5.048e-02 2.663e-02 -1.896 0.058441 .
## FunctionalMaj2 2.617e-02 1.247e-01 0.210 0.833866
## FunctionalMin1 4.461e-02 8.103e-02 0.550 0.582197
## FunctionalMin2 8.289e-02 7.833e-02 1.058 0.290368
## FunctionalMod 5.267e-02 8.112e-02 0.649 0.516385
## FunctionalSal -2.650e-01 1.558e-01 -1.701 0.089496 .
## FunctionalTyp 1.222e-01 7.334e-02 1.666 0.096249 .
## Garage.TypeAttchd 3.516e-02 4.443e-02 0.791 0.429055
## Garage.TypeBasment -2.603e-02 6.703e-02 -0.388 0.697870
## Garage.TypeBuiltIn -3.429e-02 4.833e-02 -0.710 0.478217
## Garage.TypeDetchd 1.337e-02 4.555e-02 0.293 0.769302
## Garage.QualFa -1.400e-01 1.149e-01 -1.219 0.223333
## Garage.QualGd -9.882e-02 1.274e-01 -0.776 0.438054
## Garage.QualPo -7.601e-01 1.756e-01 -4.328 1.75e-05 ***
## Garage.QualTA -1.485e-01 1.126e-01 -1.318 0.187845
## Garage.CondFa -1.207e-01 3.542e-02 -3.407 0.000698 ***
## Garage.CondGd 8.017e-02 7.719e-02 1.039 0.299404
## Garage.CondPo 2.909e-01 6.710e-02 4.336 1.69e-05 ***
## Garage.CondTA NA NA NA NA
## Sale.TypeCon 3.059e-02 5.939e-02 0.515 0.606642
## Sale.TypeConLD 1.235e-01 6.102e-02 2.024 0.043427 *
## Sale.TypeConLI -7.701e-02 7.441e-02 -1.035 0.301046
## Sale.TypeConLw 1.622e-02 5.447e-02 0.298 0.766008
## Sale.TypeCWD 6.867e-02 7.013e-02 0.979 0.327828
## Sale.TypeNew -6.809e-02 7.743e-02 -0.879 0.379531
## Sale.TypeOth 1.952e-01 8.477e-02 2.303 0.021583 *
## Sale.TypeVWD -2.004e-02 1.181e-01 -0.170 0.865330
## Sale.TypeWD -4.421e-03 2.731e-02 -0.162 0.871457
## Sale.ConditionAlloca 1.446e-01 8.545e-02 1.692 0.091059 .
## Sale.ConditionFamily 5.949e-03 3.708e-02 0.160 0.872575
## Sale.ConditionNormal 7.821e-02 1.881e-02 4.157 3.66e-05 ***
## Sale.ConditionPartial 1.608e-01 7.317e-02 2.198 0.028320 *
## House.Age -2.179e-03 4.728e-04 -4.608 4.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1064 on 643 degrees of freedom
## Multiple R-squared: 0.9487, Adjusted R-squared: 0.9378
## F-statistic: 86.81 on 137 and 643 DF, p-value: < 2.2e-16
From the StepAIC model we can see that the following variables Condition.1, Exter.Qual, Kitchen.Qual, Functional, Garage.Type, Lot.Frontage are statiscally insignificant based on their P-value and the default used significant level of 5%.
By removing the insignificant variables and refitting the model several times we get to the following model we end up with the below model that has an R-squared of 0.8947 which tells us all variables included explains 89.47% of the house price in this datase.
full_model_lm2 <- lm(formula = log(price) ~ log(area) + Overall.Qual +
X1st.Flr.SF + Kitchen.AbvGr +
Garage.Cars + Enclosed.Porch + Screen.Porch +
MS.Zoning + Land.Slope + Neighborhood +
Bldg.Type + Exterior.2nd +
Bsmt.Qual + Central.Air + Sale.Condition + House.Age, data = na.omit(ames_train))
final_model_AIC2 <- stepAIC(full_model_lm2, direction = "backward", k=2, trace = 0, steps = 1000)
summary(final_model_AIC2)
##
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF +
## Kitchen.AbvGr + Garage.Cars + Enclosed.Porch + Screen.Porch +
## MS.Zoning + Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual +
## Central.Air + Sale.Condition + House.Age, data = na.omit(ames_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19931 -0.06467 -0.00153 0.07681 0.60390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.015e+00 1.954e-01 41.019 < 2e-16 ***
## log(area) 3.581e-01 2.581e-02 13.877 < 2e-16 ***
## Overall.Qual 6.522e-02 7.468e-03 8.733 < 2e-16 ***
## X1st.Flr.SF 1.350e-04 2.058e-05 6.559 1.04e-10 ***
## Kitchen.AbvGr 8.835e-02 5.786e-02 1.527 0.127204
## Garage.Cars 4.299e-02 1.234e-02 3.484 0.000524 ***
## Enclosed.Porch 2.745e-04 1.034e-04 2.654 0.008141 **
## Screen.Porch 3.121e-04 9.708e-05 3.214 0.001366 **
## MS.ZoningFV 2.511e-01 8.501e-02 2.954 0.003238 **
## MS.ZoningRH 2.246e-01 9.905e-02 2.268 0.023631 *
## MS.ZoningRL 2.660e-01 7.325e-02 3.632 0.000302 ***
## MS.ZoningRM 1.947e-01 6.743e-02 2.887 0.004001 **
## NeighborhoodBlueste 8.588e-02 1.086e-01 0.791 0.429272
## NeighborhoodBrDale 7.253e-02 8.923e-02 0.813 0.416611
## NeighborhoodBrkSide 1.105e-01 7.559e-02 1.462 0.144207
## NeighborhoodClearCr 1.523e-01 8.993e-02 1.693 0.090863 .
## NeighborhoodCollgCr 4.454e-02 6.156e-02 0.724 0.469558
## NeighborhoodCrawfor 2.423e-01 7.139e-02 3.395 0.000725 ***
## NeighborhoodEdwards 2.050e-02 6.480e-02 0.316 0.751883
## NeighborhoodGilbert 2.608e-02 6.356e-02 0.410 0.681688
## NeighborhoodGreens 2.453e-01 1.026e-01 2.392 0.016995 *
## NeighborhoodIDOTRR 1.559e-02 8.320e-02 0.187 0.851395
## NeighborhoodMeadowV -3.530e-02 9.255e-02 -0.381 0.703044
## NeighborhoodMitchel 8.910e-02 6.597e-02 1.351 0.177276
## NeighborhoodNAmes 5.776e-02 6.432e-02 0.898 0.369470
## NeighborhoodNoRidge 1.672e-01 6.951e-02 2.405 0.016404 *
## NeighborhoodNPkVill 7.785e-02 1.570e-01 0.496 0.620156
## NeighborhoodNridgHt 1.884e-01 6.027e-02 3.126 0.001844 **
## NeighborhoodNWAmes 1.324e-02 6.748e-02 0.196 0.844460
## NeighborhoodOldTown 7.102e-02 7.753e-02 0.916 0.360008
## NeighborhoodSawyer 8.171e-02 6.727e-02 1.215 0.224849
## NeighborhoodSawyerW 2.703e-02 6.366e-02 0.425 0.671259
## NeighborhoodSomerst 1.218e-01 6.812e-02 1.788 0.074248 .
## NeighborhoodStoneBr 2.382e-01 6.626e-02 3.595 0.000347 ***
## NeighborhoodSWISU 3.565e-02 8.204e-02 0.435 0.663973
## NeighborhoodTimber 1.280e-01 6.966e-02 1.837 0.066595 .
## NeighborhoodVeenker 1.860e-01 7.797e-02 2.385 0.017333 *
## Bldg.Type2fmCon -1.973e-03 5.174e-02 -0.038 0.969597
## Bldg.TypeDuplex -1.791e-01 6.008e-02 -2.981 0.002968 **
## Bldg.TypeTwnhs -1.732e-01 3.854e-02 -4.494 8.14e-06 ***
## Bldg.TypeTwnhsE -1.019e-01 2.550e-02 -3.997 7.09e-05 ***
## Exterior.2ndBrk Cmn 2.602e-01 1.812e-01 1.436 0.151531
## Exterior.2ndBrkFace 3.553e-01 6.463e-02 5.497 5.37e-08 ***
## Exterior.2ndCmentBd 3.020e-01 6.674e-02 4.526 7.05e-06 ***
## Exterior.2ndHdBoard 2.788e-01 5.794e-02 4.811 1.83e-06 ***
## Exterior.2ndImStucc 2.708e-01 7.892e-02 3.432 0.000634 ***
## Exterior.2ndMetalSd 3.243e-01 5.583e-02 5.809 9.46e-09 ***
## Exterior.2ndPlywood 2.471e-01 5.962e-02 4.144 3.82e-05 ***
## Exterior.2ndStucco 4.174e-01 7.059e-02 5.913 5.19e-09 ***
## Exterior.2ndVinylSd 2.996e-01 5.688e-02 5.267 1.84e-07 ***
## Exterior.2ndWd Sdng 3.110e-01 5.752e-02 5.407 8.75e-08 ***
## Exterior.2ndWd Shng 3.173e-01 6.212e-02 5.108 4.18e-07 ***
## Bsmt.QualFa -2.077e-01 4.643e-02 -4.473 8.96e-06 ***
## Bsmt.QualGd -9.612e-02 2.366e-02 -4.062 5.41e-05 ***
## Bsmt.QualNone -2.993e-01 5.225e-02 -5.729 1.49e-08 ***
## Bsmt.QualPo 2.508e-01 1.618e-01 1.550 0.121482
## Bsmt.QualTA -1.466e-01 3.158e-02 -4.644 4.07e-06 ***
## Central.AirY 2.411e-01 2.998e-02 8.041 3.66e-15 ***
## Sale.ConditionAlloca 1.238e-01 1.096e-01 1.130 0.258966
## Sale.ConditionFamily -6.395e-02 4.560e-02 -1.403 0.161192
## Sale.ConditionNormal 7.469e-02 2.294e-02 3.255 0.001187 **
## Sale.ConditionPartial 9.074e-02 3.026e-02 2.998 0.002807 **
## House.Age -2.582e-03 4.994e-04 -5.170 3.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1431 on 718 degrees of freedom
## Multiple R-squared: 0.8965, Adjusted R-squared: 0.8875
## F-statistic: 100.3 on 62 and 718 DF, p-value: < 2.2e-16
The above is the summary table of the final model selected
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
Variables price, area and Lot.Area were log transformed to allow a better fit of the model.
This is well illustrated below using simple linear regression model between the response variable price and the transformed variable and see the goodness of fit of the models (Adjusted R-squared) compaired to the model without transformation.
Looking at the plots below, we clearly see that for the relationship between price & area, the linearity of the relationship is more visible when calculating the log of both the price and the area. The same applies to the linearity between price and Lot.Area.
Plot variation from left to right shows linearity improvement. This means the Lot.Area that was removed from the automaticmodel slection will be reinserted during the next step of this analysis.
# This is done by testing the Adjusted R-squared value of linear models with and without the log transformation of area
# Justifying plotting of price, are and Lot.Area
dev.off()
## null device
## 1
plot(price ~area, data = ames_train, col="blue")
plot(log(price) ~area, data = ames_train, col="red")
plot(log(price) ~log(area), data = ames_train, col="green")
plot(price ~Lot.Area, data = ames_train, col="blue")
plot(log(price) ~Lot.Area, data = ames_train, col="red")
plot(log(price) ~log(Lot.Area), data = ames_train, col="green")
summary(lm(formula= log(price)~area, data = ames_train))$adj.r.squared < summary(lm(formula= log(price)~log(area), data = ames_train))$adj.r.squared
## [1] TRUE
print("TRUE. It shows that log transformation of area is necessary for both price and area.")
## [1] "TRUE. It shows that log transformation of area is necessary for both price and area."
summary(lm(formula= log(price)~Lot.Area, data = ames_train))$adj.r.squared < summary(lm(formula= log(price)~log(Lot.Area), data = ames_train))$adj.r.squared
## [1] TRUE
print("TRUE. It shows that log transformation of area is necessary for both price and Lot.Area.")
## [1] "TRUE. It shows that log transformation of area is necessary for both price and Lot.Area."
ames_train %>% filter(price>500000)
## PID area price MS.SubClass Lot.Frontage Lot.Area Overall.Qual
## 1 528164060 2470 615000 20 106 12720 10
## 2 528118050 2290 500067 20 59 17169 10
## 3 527214060 2698 535000 60 82 16052 10
## 4 528150070 2364 611657 20 100 12919 9
## 5 527216080 2338 591587 20 52 51974 9
## 6 527216070 3279 538000 60 47 53504 8
## Overall.Cond Year.Remod.Add Mas.Vnr.Area BsmtFin.SF.1 BsmtFin.SF.2
## 1 5 2003 680 2257 0
## 2 5 2007 970 1684 0
## 3 5 2006 734 1206 0
## 4 5 2010 760 2188 0
## 5 5 2007 710 1101 0
## 6 5 2003 603 1416 0
## Bsmt.Unf.SF Total.Bsmt.SF X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF
## 1 278 2535 2470 0 0
## 2 636 2320 2290 0 0
## 3 644 1850 1850 848 0
## 4 142 2330 2364 0 0
## 5 1559 2660 2338 0 0
## 6 234 1650 1690 1589 0
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1 2 0 1 1 1
## 2 2 0 2 1 2
## 3 1 0 2 1 4
## 4 1 0 2 1 2
## 5 1 0 2 1 4
## 6 1 0 3 1 4
## Kitchen.AbvGr TotRms.AbvGrd Fireplaces Garage.Yr.Blt Garage.Cars
## 1 1 7 2 2003 3
## 2 1 7 1 2007 3
## 3 1 11 1 2006 3
## 4 1 11 2 2009 3
## 5 1 8 2 2005 3
## 6 1 12 1 2003 3
## Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
## 1 789 154 65 0 0
## 2 1174 192 30 0 0
## 3 736 250 0 0 0
## 4 820 0 67 0 0
## 5 1110 0 135 0 0
## 6 841 503 36 0 0
## Screen.Porch Pool.Area Misc.Val Mo.Sold Yr.Sold MS.Zoning Street Alley
## 1 216 144 0 2 2008 RL Pave None
## 2 0 0 0 8 2007 RL Pave None
## 3 0 0 0 7 2006 RL Pave None
## 4 0 0 0 3 2010 RL Pave None
## 5 322 0 0 6 2007 RL Pave None
## 6 210 0 0 6 2010 RL Pave None
## Lot.Shape Land.Contour Lot.Config Land.Slope Neighborhood Condition.1
## 1 Reg HLS Inside Mod NridgHt Norm
## 2 IR2 Lvl CulDSac Gtl NridgHt Norm
## 3 IR1 Lvl CulDSac Gtl StoneBr Norm
## 4 IR1 Lvl Inside Gtl NridgHt Norm
## 5 IR1 Lvl CulDSac Gtl StoneBr PosN
## 6 IR2 HLS CulDSac Mod StoneBr Norm
## Condition.2 Bldg.Type House.Style Roof.Style Roof.Matl Exterior.1st
## 1 Norm 1Fam 1Story Hip CompShg MetalSd
## 2 Norm 1Fam 1Story Hip CompShg CemntBd
## 3 Norm 1Fam 2Story Hip CompShg VinylSd
## 4 Norm 1Fam 1Story Hip CompShg VinylSd
## 5 Norm 1Fam 1Story Hip CompShg VinylSd
## 6 Norm 1Fam 2Story Hip CompShg CemntBd
## Exterior.2nd Mas.Vnr.Type Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1 MetalSd Stone Ex TA PConc Ex
## 2 CmentBd BrkFace Ex TA PConc Ex
## 3 VinylSd Stone Ex TA PConc Ex
## 4 VinylSd Stone Ex TA PConc Ex
## 5 VinylSd BrkFace Ex TA PConc Ex
## 6 Wd Shng BrkFace Ex TA PConc Gd
## Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.Type.2 Heating Heating.QC
## 1 TA Gd GLQ Unf GasA Ex
## 2 TA Av GLQ Unf GasA Ex
## 3 TA No GLQ Unf GasA Ex
## 4 TA Gd GLQ Unf GasA Ex
## 5 TA Av GLQ Unf GasA Ex
## 6 TA Gd ALQ Unf GasA Ex
## Central.Air Electrical Kitchen.Qual Functional Fireplace.Qu Garage.Type
## 1 Y SBrkr Ex Typ Gd Attchd
## 2 Y SBrkr Ex Typ Gd Attchd
## 3 Y SBrkr Ex Typ Gd Attchd
## 4 Y SBrkr Ex Typ Gd Attchd
## 5 Y SBrkr Gd Typ Gd Attchd
## 6 Y SBrkr Ex Mod Gd BuiltIn
## Garage.Finish Garage.Qual Garage.Cond Paved.Drive Pool.QC Fence
## 1 Fin TA TA Y Ex None
## 2 Fin TA TA Y None None
## 3 RFn TA TA Y None None
## 4 Fin TA TA Y None None
## 5 Fin Gd TA Y None None
## 6 Fin TA TA Y None None
## Misc.Feature Sale.Type Sale.Condition House.Age
## 1 None WD Normal 14
## 2 None New Partial 10
## 3 None New Partial 11
## 4 None New Partial 8
## 5 None New Partial 11
## 6 None WD Normal 14
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
Instead of dealing with the Year.Built as a categorical variable, it was found easier to correlate between the price and another numerical variable called House.Age which would compute the age of the house today base on the Year.Built. Since both variable would be representing the same phenomenon, we decided to discard the variable Year.Built from the dataset. This interaction is constructed within the function prepare_data().
We recognize limitation of the variable House.Age as it was supposed to be calculated with the Year.Sold instead of using Today date. However, the year sold varies between 2006-2010 and a difference of 4 years would not significantly change the current House-Age variation.
# Below is the formula that was used to perform this interaction as part of the prepare_data formula.
# ames_train <- ames_train %>% mutate(House.Age = year(today())-Year.Built)
summary(lm(formula=log(price)~House.Age, data = ames_train))
##
## Call:
## lm(formula = log(price) ~ House.Age, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.12310 -0.18669 -0.03868 0.16837 1.29541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.4181869 0.0187605 661.93 <2e-16 ***
## House.Age -0.0089228 0.0003493 -25.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3272 on 998 degrees of freedom
## Multiple R-squared: 0.3953, Adjusted R-squared: 0.3947
## F-statistic: 652.5 on 1 and 998 DF, p-value: < 2.2e-16
ggplot(data=ames_train, aes(x=price, y=House.Age))+geom_point(aes(colour="red"))+geom_smooth(method = "lm")
The negative relationship is expressed year by the descending slope from left to right which means that the older the house, the lower the price as per this relationship.
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
The method choosen to select variables is AIC to allow inclusion of more variables in the initial model before proceeding to finetuning. As demonstrate above, the BIC model would only return the Overall.Qual as significant predictor, completely ignoring all other variables. Hence we decided to use the AIC and proceed by elimination based on the return P-values until we reach this final model.
This is due to the fact that AIC calculates the likelihood of the model and penalize it by a fixed values of K=2 for penalty calculation for each an every new parameter added.
We will also take into account the fact that several observations distorting house price should be removed from the dataet in order to improve model prediction. In this case we will filter in only observation for which the Sale.Condition was Normal and without a Swimming pool.
# Model selection was performed above. But for the purposed of responding to this question, we are refitting the model in the same variable.
ames_train_noout <- ames_train %>% filter(is.na(Pool.QC) | tolower(Sale.Condition) =="normal")
full_model_lm3 <- lm(formula = log(price) ~ log(area) + log(Lot.Area) + Overall.Qual +
X1st.Flr.SF + Kitchen.AbvGr +
Garage.Cars + Enclosed.Porch + Screen.Porch +
MS.Zoning + Land.Slope + Neighborhood +
Bldg.Type + Exterior.2nd +
Bsmt.Qual + Central.Air + House.Age, data = na.omit(ames_train_noout))
final_model_AIC3 <- stepAIC(full_model_lm3, direction = "backward", k=2, trace = 0, steps = 1000)
summary(final_model_AIC3)
##
## Call:
## lm(formula = log(price) ~ log(area) + log(Lot.Area) + Overall.Qual +
## X1st.Flr.SF + Garage.Cars + Screen.Porch + MS.Zoning + Land.Slope +
## Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual + Central.Air +
## House.Age, data = na.omit(ames_train_noout))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56808 -0.05691 0.00061 0.06547 0.54444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.620e+00 2.266e-01 33.632 < 2e-16 ***
## log(area) 3.814e-01 2.265e-02 16.840 < 2e-16 ***
## log(Lot.Area) 5.953e-02 1.886e-02 3.156 0.001681 **
## Overall.Qual 6.455e-02 6.694e-03 9.643 < 2e-16 ***
## X1st.Flr.SF 1.248e-04 1.957e-05 6.379 3.63e-10 ***
## Garage.Cars 5.582e-02 1.063e-02 5.253 2.11e-07 ***
## Screen.Porch 2.123e-04 8.564e-05 2.478 0.013476 *
## MS.ZoningFV 3.753e-01 8.151e-02 4.604 5.10e-06 ***
## MS.ZoningRH 2.390e-01 8.974e-02 2.663 0.007948 **
## MS.ZoningRL 3.480e-01 6.745e-02 5.160 3.40e-07 ***
## MS.ZoningRM 2.549e-01 6.334e-02 4.025 6.44e-05 ***
## Land.SlopeMod 2.581e-02 2.550e-02 1.013 0.311719
## Land.SlopeSev 2.643e-01 1.336e-01 1.979 0.048308 *
## NeighborhoodBlueste 1.659e-01 1.021e-01 1.625 0.104607
## NeighborhoodBrDale 1.070e-01 9.202e-02 1.163 0.245400
## NeighborhoodBrkSide 1.568e-01 8.243e-02 1.902 0.057659 .
## NeighborhoodClearCr 1.491e-01 9.148e-02 1.630 0.103689
## NeighborhoodCollgCr 4.402e-02 7.324e-02 0.601 0.548088
## NeighborhoodCrawfor 2.485e-01 8.042e-02 3.090 0.002099 **
## NeighborhoodEdwards 3.850e-02 7.560e-02 0.509 0.610784
## NeighborhoodGilbert 2.393e-02 7.592e-02 0.315 0.752707
## NeighborhoodGreens 2.647e-01 9.625e-02 2.750 0.006140 **
## NeighborhoodIDOTRR 1.316e-01 8.797e-02 1.496 0.135291
## NeighborhoodMeadowV -3.925e-02 9.459e-02 -0.415 0.678363
## NeighborhoodMitchel 9.939e-02 7.602e-02 1.308 0.191543
## NeighborhoodNAmes 9.382e-02 7.509e-02 1.249 0.211992
## NeighborhoodNoRidge 1.665e-01 7.696e-02 2.164 0.030904 *
## NeighborhoodNPkVill 1.199e-01 1.349e-01 0.889 0.374596
## NeighborhoodNridgHt 1.609e-01 7.260e-02 2.216 0.027082 *
## NeighborhoodNWAmes 5.394e-02 7.714e-02 0.699 0.484680
## NeighborhoodOldTown 1.271e-01 8.392e-02 1.515 0.130330
## NeighborhoodSawyer 9.464e-02 7.699e-02 1.229 0.219463
## NeighborhoodSawyerW 3.620e-02 7.466e-02 0.485 0.627986
## NeighborhoodSomerst 9.884e-02 8.060e-02 1.226 0.220565
## NeighborhoodStoneBr 1.475e-01 7.913e-02 1.864 0.062866 .
## NeighborhoodSWISU 4.366e-02 8.661e-02 0.504 0.614395
## NeighborhoodTimber 8.091e-02 7.869e-02 1.028 0.304223
## NeighborhoodVeenker 1.700e-01 8.392e-02 2.025 0.043312 *
## Bldg.Type2fmCon -2.499e-03 3.547e-02 -0.070 0.943868
## Bldg.TypeDuplex -1.214e-01 3.195e-02 -3.799 0.000161 ***
## Bldg.TypeTwnhs -9.335e-02 3.823e-02 -2.441 0.014924 *
## Bldg.TypeTwnhsE -4.119e-02 2.666e-02 -1.545 0.122914
## Exterior.2ndBrk Cmn 4.782e-02 1.474e-01 0.324 0.745723
## Exterior.2ndBrkFace 1.364e-01 5.885e-02 2.318 0.020816 *
## Exterior.2ndCmentBd 1.912e-01 6.437e-02 2.970 0.003105 **
## Exterior.2ndHdBoard 1.028e-01 5.250e-02 1.959 0.050567 .
## Exterior.2ndImStucc 9.243e-02 6.896e-02 1.340 0.180660
## Exterior.2ndMetalSd 1.245e-01 5.096e-02 2.443 0.014853 *
## Exterior.2ndPlywood 5.421e-02 5.377e-02 1.008 0.313820
## Exterior.2ndStucco 2.122e-01 6.194e-02 3.426 0.000655 ***
## Exterior.2ndVinylSd 1.134e-01 5.204e-02 2.179 0.029744 *
## Exterior.2ndWd Sdng 1.229e-01 5.174e-02 2.375 0.017854 *
## Exterior.2ndWd Shng 1.374e-01 5.618e-02 2.447 0.014717 *
## Bsmt.QualFa -1.577e-01 4.305e-02 -3.664 0.000271 ***
## Bsmt.QualGd -9.764e-02 2.377e-02 -4.107 4.58e-05 ***
## Bsmt.QualNone -2.849e-01 4.493e-02 -6.342 4.55e-10 ***
## Bsmt.QualPo 2.276e-01 1.349e-01 1.687 0.092124 .
## Bsmt.QualTA -1.320e-01 2.943e-02 -4.485 8.79e-06 ***
## Central.AirY 1.700e-01 2.514e-02 6.763 3.29e-11 ***
## House.Age -2.728e-03 4.306e-04 -6.336 4.71e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1142 on 582 degrees of freedom
## Multiple R-squared: 0.9184, Adjusted R-squared: 0.9101
## F-statistic: 111 on 59 and 582 DF, p-value: < 2.2e-16
By doing all the changes described above and by adding back the log transformed log(Lot.Area) in the model, we see a jump of Adjusted R-squared from \(0.8856\) to \(0.9101\) which is a great improvement of the goodness of fit of our model. Also we see the R-square value goes from 0.8947 to 0.9184 which means out model explanatory variables are able to explain nearly 92% of the house price.
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
While testing the model on the out-of-sample data, the RMSE value returned was \(191810.5\) which shows that the model predicts very poorly the house price. Issues noticed while calculatingRMSE value were the following:
The new variable House.Age is not present in the test dataset and the Year.Built Variable was still present
To address all the above we:
We treated the variables with the function prepare_data above which removes NA and replaced with th new level none, replaced the variable Year.Built by the variable House.Age as in the trian dataset
The out-sample RMSE was then calculated for this linear regression and the value \(191810.5\) was found for it. Comparing it to the in-sample RMSE, we can clearly see that the out-sample prediction RMSE & the in-sample prediction RMSE \(29984.45\) difference is too large for this model.
# Reconcilling the levels of the model with those of the train dataset to avoid errors while running the predict function.
cat_var_list <- names(final_model_AIC3$xlevels)
final_model_AIC3$xlevels <- sapply(cat_var_list, function(x) union(final_model_AIC3$xlevels[[x]], levels(ames_test_new[[x]])))
# Performing out-sample prediction
predict_out_final <- suppressWarnings(exp(predict(final_model_AIC3, newdata=ames_test_new)))
resid_out_final <- ames_test_new$price - predict_out_final
rmse_out_final <- sqrt(mean(resid_out_final^2, na.rm = TRUE))
rmse_out_final
## [1] 38401.83
We can also clearly see an improvement of the RMSE value after subsetting the dataset to Sale.Condition=“Normal” only and reintroduing the log(Lot.Area). RMSE goes from \(191810.5\) in our initial model out-of-sample test to \(38401.83\) for this final model. The final model predicts way better the house price based on this output.
For your final model, create and briefly interpret an informative plot of the residuals.
Below plots show the model plot including the Theoretical quantile plot that shows that:
The dot plot of actual vs fitted price values shows quite reliable fit which means that the model is quite good at predicting the out-of-sample data.
suppressWarnings(plot(final_model_AIC3))
residual_data <- data.frame(price=ames_test_new$price, predicted=predict_out_final, residuals=resid_out_final)
ggplot()+geom_point(data = residual_data, aes(x=price, y=predicted, color="blue")) + geom_smooth(data = residual_data, aes(x=price, y=predicted), method = "lm")
ggplot(data = residual_data, aes(x=residuals))+geom_histogram(fill="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For your final model, calculate and briefly comment on the RMSE.
# Performing out-sample prediction
predict_out_final2 <- suppressWarnings(exp(predict(final_model_AIC3, interval = "prediction" ,newdata=ames_test_new)))
resid_out_final2 <- ames_test_new$price - predict_out_final2[,"fit"]
rmse_out_final2 <- sqrt(mean(resid_out_final2^2, na.rm = TRUE))
paste("Final model RMSE is :", round(rmse_out_final2,2))
## [1] "Final model RMSE is : 38401.83"
# Comparing both out-sample rmse for both intuitive model and refined final model
paste("Final model RMSE is lower than intuitive model:", rmse_out_final2 < rmse_outsample)
## [1] "Final model RMSE is lower than intuitive model: TRUE"
coverage.prob.final <- mean(ames_test_new$price > predict_out_final2[,"lwr"] &
ames_test_new$price < predict_out_final2[,"upr"], na.rm = TRUE)
paste("This model covers up to:", round(coverage.prob.final*100, 2), ("% of the data in the modified test dataset"))
## [1] "This model covers up to: 82.5 % of the data in the modified test dataset"
RMSE for the intuitive model compared to the final model indicates that the final model predicts better the price than the initial model. However, the high value of the RMSE shows that some high leverage points are influencing the entire calculation.
What are some strengths and weaknesses of your model?
The model was built by using the combination of automated model selection and manual.
Coverage of the modelon test data is equal to 82.5% which is still to improve
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the âames_validationâ dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
load("ames_validation.Rdata")
From this result we can conclude that even though the final model accuracy is not as high as expected, it does an acceptable approximation of the house price based on the provided dataset.
## Ensuring that the NA are dealt with and that the new variable Houe.Age is created
ames_validation_new <- prepare_data(ames_validation)
# Reconcilling the levels of the model with those of the train dataset to avoid errors while running the predict function.
cat_var_list3 <- names(final_model_AIC3$xlevels)
final_model_AIC3$xlevels <- sapply(cat_var_list3, function(x) union(final_model_AIC3$xlevels[[x]], levels(ames_validation_new[[x]])))
# Performing out-sample prediction
predict_out_final3 <- suppressWarnings(exp(predict(final_model_AIC3 ,newdata=ames_validation_new)))
resid_out_final3 <- ames_validation_new$price - predict_out_final3["fit"]
rmse_out_final3 <- sqrt(mean(resid_out_final3^2, na.rm = TRUE))
paste("Final model RMSE on validation data is:", round(rmse_out_final3,2))
## [1] "Final model RMSE on validation data is: NaN"
paste("Compared to the initial model RMSE ", round(rmse_intuitive,2) ,"is", ifelse(rmse_out_final3 < rmse_intuitive, "lower", "higher"))
## [1] "Compared to the initial model RMSE 29984.45 is NA"
paste("Compared to the test model RMSE ", round(rmse_out_final,2), "is" ,ifelse(rmse_out_final3 < rmse_out_final, "lower", " higher"))
## [1] "Compared to the test model RMSE 38401.83 is NA"
## Calculating the coverage
predict_out_final4 <- suppressWarnings(exp(predict(final_model_AIC3, interval = "prediction" ,newdata=ames_validation_new)))
resid_out_final4 <- ames_validation_new$price - predict_out_final4["fit"]
rmse_out_final4 <- sqrt(mean(resid_out_final4^2, na.rm = TRUE))
coverage.prob.validation <- mean(ames_validation_new$price > predict_out_final4[,"lwr"] &
ames_validation_new$price < predict_out_final4[,"upr"], na.rm = TRUE)
paste("This model covers up to:", round(coverage.prob.validation*100, 2), ("% of the observations in the validation data"))
## [1] "This model covers up to: 75.88 % of the observations in the validation data"
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
This analysis has been full of lessons and through it I learn several concepts and methods including the concept of model testing and validation both in-smple and out-sample. I also learned that the that house price of analysing and validating my analysis using seperate portions of the dataset.
House price main 3 predictors per this model are area, Lot.Area * Overall quality.
If not transformed, some variables such as “area” and “Lot.Area” linear relationship to the house “price” is simply discarded during the model selection and marked as statistically insignificant. However, when log transformed they start presenting linearity.
Conditions for inference were unfortunately not fully respected as the nearly normallity and the linearity conditions were not completely verified due to the skeness of the residual distribution. Hence, performance of the predictions yield may vary greatly from one dataset to another.