Emiliano La Rocca
As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.
load("ames_train.Rdata")
Use the code block below to load any necessary packages.
library(statsr)
library(devtools)
library(MASS)
library(dplyr)
library(ggplot2)
library(BAS)
library(modes)
library(Hmisc)
library(stats)
library(xtable)
library(gridExtra)
library(corrplot)
library(GGally)
library(VIF)
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
2.1.1 Price or log(price)?
I evaluated the distribution of price in order to decide whether it needed some kind of transformation. The two graphs below show that log(price) is much closer to a normal distribution than price itself, so I decided to model log(price).
summary(ames_train$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12790 129800 159500 181200 213000 615000
summary(log(ames_train$price)+1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.46 12.77 12.98 13.02 13.27 14.33
plot1 <- ggplot(ames_train,aes(price)) + geom_histogram() + ggtitle('Distribution of price')
plot2 <- ggplot(ames_train,aes(log(price)+1)) + geom_histogram() + ggtitle('Distribution of log(price)')
grid.arrange(plot1,plot2)
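As a quick numeric complement to the histograms above (not part of the original output), the sample skewness of price before and after the transformation can be computed; skew() below is a hypothetical helper function.
# Sample skewness: large positive values indicate a long right tail
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(ames_train$price)       # strongly right-skewed on the dollar scale
skew(log(ames_train$price))  # much closer to symmetric after the log transform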
* * *
2.1.2 Numerical Correlations
Next, we check the correlations between the numeric variables and their correlation with the dependent variable log(price). First, we check which columns contain NAs and the percentage of NAs in each.
integer.cols.index <- which(sapply(ames_train,class) == 'integer')
ames_train.numeric <- ames_train[,integer.cols.index]
# Share of missing values in each numeric column
NApercentage <- sapply(ames_train.numeric, function(col) sum(is.na(col)) / length(col))
perc.missing <- NApercentage[NApercentage != 0]
perc.missing
## Lot.Frontage Mas.Vnr.Area BsmtFin.SF.1 BsmtFin.SF.2 Bsmt.Unf.SF
## 0.167 0.007 0.001 0.001 0.001
## Total.Bsmt.SF Bsmt.Full.Bath Bsmt.Half.Bath Garage.Yr.Blt Garage.Cars
## 0.001 0.001 0.001 0.048 0.001
## Garage.Area
## 0.001
Since the column with the most NAs has only about 17% missing values, I did not remove any predictor variable. Instead, in order to create a numerical correlation matrix, I removed the rows that had any NAs in them.
* * *
na.rows <- which(apply(ames_train.numeric,1,function(row){
any(is.na(row))
}))
ames_train.numeric <- ames_train.numeric[-na.rows,]
ames_train.numeric$log.price <- log(ames_train.numeric$price)
corplot <- corrplot(cor(ames_train.numeric), title = 'Correlation between numerical variables')
I then extracted each variable's correlation with log.price and kept only the variables with an absolute correlation greater than 0.4.
corplot <- corplot[-which(rownames(corplot) == 'price'),-which(colnames(corplot) == 'price')]
price.corr <- corplot['log.price',]
corplot.df <- data.frame(price.corr = price.corr, var = names(price.corr))
reorderedLevels <- corplot.df$var[order(corplot.df$price.corr,decreasing = TRUE)]
corplot.df$var <- factor(corplot.df$var, levels = reorderedLevels)
corplot.df.final <- corplot.df %>% filter(abs(price.corr) > 0.4)
ggplot(corplot.df.final, aes(var, price.corr)) +
geom_bar(stat = 'identity') +
ggtitle('Numeric variable correlation with log(price)') +
theme(axis.text.x=element_text(angle=45,hjust=1,vjust=0.5))
2.1.3 Plot 1: Graphical display of the relationship between three categorical variables (MS.Zoning, Neighborhood & House.Style).
In terms of MS.Zoning, the zoning class with the highest number of sales (RL, low-density residential) is selected for further analysis of its relationship with the style of dwelling and the physical location within the Ames city limits.
MS.Zoning, Neighborhood & House.Style are all categorical variables, so a bar graph with facets is used to examine the distribution. In the ggplot call, labs() changes the legend title, ggtitle() adds the main title (split over two lines), and theme(plot.title = element_text()) sets the font size and bold face.
summary(ames_train$MS.Zoning)
## A (agr) C (all) FV I (all) RH RL RM
## 0 9 56 1 7 772 155
Low <- ames_train %>%
filter(MS.Zoning =="RL")
Low %>%
group_by(MS.Zoning, Neighborhood, House.Style)%>%
summarise(count= n())%>%
arrange(desc(count))
## Source: local data frame [75 x 4]
## Groups: MS.Zoning, Neighborhood [22]
##
## MS.Zoning Neighborhood House.Style count
## <fctr> <fctr> <fctr> <int>
## 1 RL NAmes 1Story 119
## 2 RL CollgCr 1Story 50
## 3 RL Sawyer 1Story 40
## 4 RL NridgHt 1Story 37
## 5 RL Edwards 1Story 35
## 6 RL Gilbert 2Story 35
## 7 RL CollgCr 2Story 28
## 8 RL SawyerW 2Story 26
## 9 RL NoRidge 2Story 23
## 10 RL Mitchel 1Story 20
## # ... with 65 more rows
ggplot(Low, aes(x=Neighborhood, fill= House.Style)) +geom_bar() + facet_grid(MS.Zoning ~ .)+ labs(x= "Neighborhood",y= "Count", fill="House Style") + ggtitle("Fig 1: Distribution of relationship of low density residential houses \n with style of dwelling and physical locations within Ames city.") + theme (plot.title= element_text(size= 12, face="bold")) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 10))
From the above chart it is observed that the largest number of low-density residential houses are located in North Ames (NAmes) and have a one-story style of dwelling.
2.1.4 Plot 2: Graphical display of the relationship between two numerical variables (Year.Built & price) and one categorical variable (Lot.Config).
The price of a house is greatly influenced by its age, which is determined by its construction date, as well as by its lot configuration. That is why these variables have been selected for examining their relationship.
Year.Built and price are numerical variables while Lot.Config is categorical, so a scatter plot is used to examine their relationship. In the ggplot call, labs() changes the legend title, ylim() sets the y-axis range, ggtitle() adds the main title (split over two lines), theme(plot.title = element_text()) sets the font size and bold face, and scale_color_brewer() sets the colors.
ggplot(data=ames_train, aes(x=Year.Built, y=price, group=Lot.Config, colour=Lot.Config)) + geom_point(na.rm = TRUE) + ylim(0, 1000000)+scale_color_brewer(palette="Accent") +labs(x= "Construction Year",y= "Price", fill="Lot Config") + ggtitle("Fig 2: Distribution of relationship between sale price, age and lot \n configuration of houses.") + theme (plot.title= element_text(size= 12, face="bold"))
The scatter plot shows that the more recently built houses have the highest sale prices, and the price range widens after the year 2000. Most houses have an inside or corner lot configuration; the recently built houses in particular show an increased frequency of inside lots, with most prices ranging between $150,000 and $375,000.
2.1.5 Plot 3: Graphical display of the relationship between two categorical variables (Overall.Qual & Foundation).
The decision to buy a house depends heavily on the quality of the materials used, the overall finish, and the foundation on which the structure is raised. These two variables have therefore been selected to analyze whether houses with high-quality material and finishing also tend to have high-quality foundations.
Overall.Qual & Foundation are both categorical variables, so a segmented bar plot is used to show their distribution. In the ggplot call, labs() changes the legend title, ggtitle() adds the main title (split over two lines), and theme(plot.title = element_text()) sets the font size and bold face.
ames_train$Overall.Qual<- as.factor(ames_train$Overall.Qual)
ggplot(ames_train, aes(x=Overall.Qual, fill= Foundation)) + geom_bar(position = "fill") + scale_fill_brewer(palette="Set2") + labs(x= "Overall Finish and Material of Houses",y= "Count", fill="Foundation") + ggtitle("Fig 3: Distribution of relationship between overall quality \n and foundation of houses.") + theme (plot.title= element_text(size= 12, face="bold"))
From the plot it is observed that houses with excellent overall finish and material quality used poured concrete for their foundations. More generally, it can be inferred that houses rated good or better for overall quality and finish also tend to have poured-concrete foundations.
* * *
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the "best" possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)
Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for price
(or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
Here, seven predictor variables, namely Lot.Area, Lot.Shape, Condition.1, House.Style, Roof.Style, Foundation and Bedroom.AbvGr, have been selected.
The variable price is the dependent (response) variable.
ames_new<- ames_train %>%
mutate(Lot.Shape= as.numeric(Lot.Shape), Condition.1= as.numeric(Condition.1), House.Style= as.numeric(House.Style), Roof.Style=as.numeric(Roof.Style), Foundation= as.numeric(Foundation))
ggpairs(ames_new[,c(7,10,16,19,24,32,53)], upper = list(continuous = wrap(ggally_cor, size = 5)), lower = list(continuous = 'smooth'), axisLabels = "none")
Reason for selection of variables
To avoid multicollinearity, only explanatory variables that have no or very weak correlation with each other have been selected. The ggpairs() command is used to check the correlation between the explanatory variables (shown in the plot above). The plot shows that all the selected explanatory variables have either no or very weak correlation with one another.
Adding collinear variables inflates the variance of the regression parameter estimates, and including more than one of a set of collinear variables would not add much value to the model. However, it is impossible to completely avoid collinearity arising from observational data.
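As a complementary, programmatic check (a sketch, not part of the original analysis), the pairwise correlations of the recoded predictors can be scanned for any pair exceeding a chosen threshold; the 0.7 cut-off below is an arbitrary illustration.
# Flag predictor pairs whose absolute correlation exceeds 0.7
cors <- cor(ames_new[, c(7, 10, 16, 19, 24, 32, 53)])
high <- which(abs(cors) > 0.7 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           correlation = cors[high])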
colSums(is.na(ames_train))
## PID area price MS.SubClass
## 0 0 0 0
## MS.Zoning Lot.Frontage Lot.Area Street
## 0 167 0 0
## Alley Lot.Shape Land.Contour Utilities
## 933 0 0 0
## Lot.Config Land.Slope Neighborhood Condition.1
## 0 0 0 0
## Condition.2 Bldg.Type House.Style Overall.Qual
## 0 0 0 0
## Overall.Cond Year.Built Year.Remod.Add Roof.Style
## 0 0 0 0
## Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type
## 0 0 0 0
## Mas.Vnr.Area Exter.Qual Exter.Cond Foundation
## 7 0 0 0
## Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## 21 21 21 21
## BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF
## 1 21 1 1
## Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 0 0
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF
## 0 0 0 0
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
## 1 1 0 0
## Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd
## 0 0 0 0
## Functional Fireplaces Fireplace.Qu Garage.Type
## 0 0 491 46
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
## 48 46 1 1
## Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 47 47 0 0
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch
## 0 0 0 0
## Pool.Area Pool.QC Fence Misc.Feature
## 0 997 798 971
## Misc.Val Mo.Sold Yr.Sold Sale.Type
## 0 0 0 0
## Sale.Condition
## 0
The variables Pool.QC, Misc.Feature, Alley, Fence, Fireplace.Qu and Lot.Frontage have been excluded as they have more than 15% missing values. Furthermore, they are not an important consideration when deciding to buy a house, which may also be one of the reasons that so many values are missing. Keeping or imputing columns with that many missing values would add little information to the model, so they can be excluded.
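The same exclusion rule can be expressed programmatically; the short sketch below (not in the original code) lists every column whose share of missing values exceeds 15%.
# Columns with more than 15% missing values
na.share <- colMeans(is.na(ames_train))
sort(na.share[na.share > 0.15], decreasing = TRUE)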
R code and the summary output table for Initial Model
Make sure to tell R which variables are categorical by converting them to factors, so that R does not treat them as numeric. This is especially important when categories are coded as integers. The factor() function can be used directly inside lm(), which saves the step of creating the factor variables first.
Start_Model<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(House.Style)+ factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = ames_train)
summary(Start_Model)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + factor(Lot.Shape) + factor(Condition.1) +
## factor(House.Style) + factor(Roof.Style) + factor(Foundation) +
## Bedroom.AbvGr, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11642 -0.16708 0.01068 0.17351 1.08621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.146e+01 1.336e-01 85.761 < 2e-16 ***
## Lot.Area 7.965e-06 1.184e-06 6.727 2.94e-11 ***
## factor(Lot.Shape)IR2 -6.605e-02 6.090e-02 -1.085 0.27837
## factor(Lot.Shape)IR3 -2.490e-01 2.014e-01 -1.236 0.21672
## factor(Lot.Shape)Reg -1.346e-01 2.134e-02 -6.306 4.33e-10 ***
## factor(Condition.1)Feedr -5.875e-02 7.784e-02 -0.755 0.45057
## factor(Condition.1)Norm 8.483e-02 6.669e-02 1.272 0.20365
## factor(Condition.1)PosA 3.951e-01 1.290e-01 3.062 0.00226 **
## factor(Condition.1)PosN 1.923e-01 1.141e-01 1.685 0.09228 .
## factor(Condition.1)RRAe -1.024e-01 1.131e-01 -0.905 0.36558
## factor(Condition.1)RRAn 8.715e-02 1.055e-01 0.826 0.40907
## factor(Condition.1)RRNe -7.990e-02 2.259e-01 -0.354 0.72363
## factor(Condition.1)RRNn 8.772e-02 1.892e-01 0.464 0.64298
## factor(House.Style)1.5Unf -3.644e-02 1.136e-01 -0.321 0.74840
## factor(House.Style)1Story 1.003e-02 3.748e-02 0.268 0.78908
## factor(House.Style)2.5Unf 5.729e-02 1.035e-01 0.554 0.57985
## factor(House.Style)2Story 5.081e-02 3.880e-02 1.310 0.19067
## factor(House.Style)SFoyer 1.914e-02 6.254e-02 0.306 0.75965
## factor(House.Style)SLvl 4.989e-02 5.873e-02 0.849 0.39586
## factor(Roof.Style)Gable -7.048e-02 1.043e-01 -0.676 0.49937
## factor(Roof.Style)Gambrel -6.004e-02 1.510e-01 -0.398 0.69108
## factor(Roof.Style)Hip 1.180e-01 1.057e-01 1.117 0.26446
## factor(Roof.Style)Mansard -8.725e-02 1.904e-01 -0.458 0.64684
## factor(Foundation)CBlock 1.186e-01 3.679e-02 3.224 0.00131 **
## factor(Foundation)PConc 5.301e-01 3.722e-02 14.240 < 2e-16 ***
## factor(Foundation)Slab -8.143e-02 9.521e-02 -0.855 0.39262
## factor(Foundation)Stone 2.454e-01 1.867e-01 1.314 0.18902
## Bedroom.AbvGr 7.256e-02 1.350e-02 5.374 9.62e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3037 on 972 degrees of freedom
## Multiple R-squared: 0.4927, Adjusted R-squared: 0.4786
## F-statistic: 34.96 on 27 and 972 DF, p-value: < 2.2e-16
Here, the categorical predictors are automatically dummy-coded; R chooses the first level alphabetically as the reference group. The reference level can also be changed; however, changing the reference level has no effect on the predictions. Whichever reference level is chosen for each categorical predictor, the same predictions are obtained, as the short check sketched below illustrates.
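The sketch below (object names are illustrative, not part of the original analysis) relevels Foundation so that PConc becomes the reference group and verifies that the fitted values are unchanged.
# Refit with 'PConc' as the reference level of Foundation; fitted values are identical
ames_relevel <- ames_train
ames_relevel$Foundation <- relevel(factor(ames_relevel$Foundation), ref = "PConc")
Start_Model_relevel <- lm(log(price) ~ Lot.Area + factor(Lot.Shape) + factor(Condition.1) +
                            factor(House.Style) + factor(Roof.Style) + Foundation +
                            Bedroom.AbvGr, data = ames_relevel)
all.equal(fitted(Start_Model), fitted(Start_Model_relevel))
* * *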
Now, either using BAS or another stepwise selection procedure, choose the "best" model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
Here, a new dataset Selected is first created, containing the response variable and the initial set of explanatory variables.
Selected<- ames_train %>%
select(price, Lot.Area, Lot.Shape, Condition.1, House.Style, Roof.Style, Foundation, Bedroom.AbvGr)
model_no_na= na.omit(Selected)
The na.omit() function removes the rows with NA entries from Selected and saves the result as model_no_na. One option suggested by the prompt is Bayesian Model Averaging (BMA), in which multiple models are averaged to obtain posteriors of coefficients and predictions, typically with a BIC prior on the coefficients and a uniform prior over the models; a sketch of this is given below. The model selection actually carried out here uses two stepwise procedures, compared in turn.
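The BMA step itself is not run in this report; a minimal sketch with the BAS package (already loaded above), using the BIC prior on coefficients and a uniform prior over models, would look like the following (MCMC sampling of the model space is an added assumption, since the factors expand into many dummy variables).
# Bayesian Model Averaging over the initial predictors (sketch, not run here)
bma_price <- bas.lm(log(price) ~ ., data = model_no_na,
                    prior = "BIC", modelprior = uniform(),
                    method = "MCMC")
summary(bma_price)   # posterior inclusion probabilities and the top models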
1. First Method for Model Selection is Backward Elimination with adjusted R-squared selection
full_model<- lm(log(price) ~ ., data=model_no_na)
summary(full_model)$adj.r.squared
## [1] 0.4785613
STEP 1
Drop each explanatory variable one by one and record the adjusted R-squared of each smaller model (a compact loop performing the same step is sketched after these individual fits).
## Excluded Lot.Area
model_no_LotArea<- lm(log(price) ~ factor(Lot.Shape)+ factor(Condition.1) + factor(House.Style)+ factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model_no_LotArea)$adj.r.squared
## [1] 0.4548431
## Excluded Lot.shape
model_no_LotShape<- lm(log(price) ~ Lot.Area + factor(Condition.1) + factor(House.Style)+ factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model_no_LotShape)$adj.r.squared
## [1] 0.4580318
## Excluded Condition.1
model_no_Condition1<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(House.Style)+ factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model_no_Condition1)$adj.r.squared
## [1] 0.4690173
## Excluded House.Style
model_no_HouseStyle<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model_no_HouseStyle)$adj.r.squared
## [1] 0.4797977
## Excluded Roof.Style
model_no_RoofStyle<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(House.Style) + factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model_no_RoofStyle)$adj.r.squared
## [1] 0.4515077
## Excluded Foundation
model_no_Foundation<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(House.Style)+ factor(Roof.Style) + Bedroom.AbvGr, data = model_no_na)
summary(model_no_Foundation)$adj.r.squared
## [1] 0.2467267
## Excluded Bedroom.AbvGr
model_no_BedroomAbvGr <- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(House.Style)+ factor(Roof.Style)+ factor(Foundation), data = model_no_na)
summary(model_no_BedroomAbvGr)$adj.r.squared
## [1] 0.4636176
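The seven refits above can also be produced with a single loop; this compact sketch (equivalent to the chunks above, not part of the original code) reports the adjusted R-squared after dropping each predictor in turn.
# One backward-elimination step over all predictors at once
predictors <- c("Lot.Area", "Lot.Shape", "Condition.1", "House.Style",
                "Roof.Style", "Foundation", "Bedroom.AbvGr")
sapply(predictors, function(v) {
  f <- as.formula(paste("log(price) ~", paste(setdiff(predictors, v), collapse = " + ")))
  summary(lm(f, data = model_no_na))$adj.r.squared
})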
STEP 2
Excluding House.Style from full_model gives the highest adjusted R-squared among all the smaller models (0.4797977), which is also higher than that of full_model (0.4785613). Therefore model_no_HouseStyle is selected, and we again drop each remaining explanatory variable one by one and record the adjusted R-squared of each smaller model.
## Excluded House.Style
model_no_HouseStyle<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model_no_HouseStyle)$adj.r.squared
## [1] 0.4797977
## Excluded House.Style + Lot.Area
model1<- lm(log(price) ~ factor(Lot.Shape)+ factor(Condition.1) + factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model1)$adj.r.squared
## Excluded House.Style + Lot.Shape
model2<- lm(log(price) ~ Lot.Area + factor(Condition.1) + factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model2)$adj.r.squared
## Excluded House.Style + Condition.1
model3<- lm(log(price) ~ Lot.Area + factor(Lot.Shape) + factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model3)$adj.r.squared
## Excluded House.Style + Roof.Style
model4<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(model4)$adj.r.squared
## Excluded House.Style + Foundation
model5<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(Roof.Style) + Bedroom.AbvGr, data = model_no_na)
summary(model5)$adj.r.squared
## Excluded House.Style + Bedroom.AbvGr
model6<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(Roof.Style)+ factor(Foundation), data = model_no_na)
summary(model6)$adj.r.squared
Finally, the adjusted R-squared of each of these smaller models is lower than that of model_no_HouseStyle (i.e. model_no_HouseStyle has the highest adjusted R-squared), so the backward elimination stops here. model_no_HouseStyle includes all the explanatory variables present in full_model except House.Style.
Full Initial Model
Full_Initial_Model<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(Full_Initial_Model)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + factor(Lot.Shape) + factor(Condition.1) +
## factor(Roof.Style) + factor(Foundation) + Bedroom.AbvGr,
## data = model_no_na)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.11678 -0.16592 0.01017 0.17239 1.08913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.145e+01 1.293e-01 88.535 < 2e-16 ***
## Lot.Area 7.817e-06 1.178e-06 6.636 5.31e-11 ***
## factor(Lot.Shape)IR2 -6.623e-02 6.069e-02 -1.091 0.275460
## factor(Lot.Shape)IR3 -2.455e-01 2.008e-01 -1.222 0.221885
## factor(Lot.Shape)Reg -1.366e-01 2.117e-02 -6.455 1.70e-10 ***
## factor(Condition.1)Feedr -5.632e-02 7.683e-02 -0.733 0.463703
## factor(Condition.1)Norm 9.319e-02 6.565e-02 1.420 0.156055
## factor(Condition.1)PosA 4.036e-01 1.281e-01 3.150 0.001682 **
## factor(Condition.1)PosN 2.057e-01 1.132e-01 1.816 0.069617 .
## factor(Condition.1)RRAe -1.117e-01 1.123e-01 -0.994 0.320249
## factor(Condition.1)RRAn 9.015e-02 1.049e-01 0.860 0.390241
## factor(Condition.1)RRNe -5.952e-02 2.250e-01 -0.265 0.791409
## factor(Condition.1)RRNn 1.127e-01 1.882e-01 0.599 0.549510
## factor(Roof.Style)Gable -7.292e-02 1.041e-01 -0.701 0.483622
## factor(Roof.Style)Gambrel -4.558e-02 1.501e-01 -0.304 0.761411
## factor(Roof.Style)Hip 1.101e-01 1.054e-01 1.045 0.296310
## factor(Roof.Style)Mansard -7.124e-02 1.897e-01 -0.376 0.707354
## factor(Foundation)CBlock 1.212e-01 3.468e-02 3.493 0.000498 ***
## factor(Foundation)PConc 5.414e-01 3.477e-02 15.573 < 2e-16 ***
## factor(Foundation)Slab -8.171e-02 9.341e-02 -0.875 0.381907
## factor(Foundation)Stone 2.508e-01 1.858e-01 1.350 0.177481
## Bedroom.AbvGr 8.124e-02 1.179e-02 6.890 9.96e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3034 on 978 degrees of freedom
## Multiple R-squared: 0.4907, Adjusted R-squared: 0.4798
## F-statistic: 44.88 on 21 and 978 DF, p-value: < 2.2e-16
From the summary output it is observed that Lot.Area, Lot.Shape, Condition.1, Roof.Style, Foundation and Bedroom.AbvGr are significant predictors of log(price).
We can also check the significance of the predictors of log(price) using the anova() function.
anova(Full_Initial_Model)
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.3923 112.9281 < 2.2e-16 ***
## factor(Lot.Shape) 3 11.040 3.6799 39.9875 < 2.2e-16 ***
## factor(Condition.1) 8 5.822 0.7277 7.9076 2.381e-10 ***
## factor(Roof.Style) 4 7.490 1.8725 20.3479 4.067e-16 ***
## factor(Foundation) 4 47.613 11.9033 129.3468 < 2.2e-16 ***
## Bedroom.AbvGr 1 4.369 4.3691 47.4770 9.957e-12 ***
## Residuals 978 90.002 0.0920
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2. Second Method for Model Selection is Backward Selection based on AIC
stepAIC(Full_Initial_Model,direction = "backward", trace = FALSE, k=2)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + factor(Lot.Shape) + factor(Condition.1) +
## factor(Roof.Style) + factor(Foundation) + Bedroom.AbvGr,
## data = model_no_na)
##
## Coefficients:
## (Intercept) Lot.Area
## 1.145e+01 7.817e-06
## factor(Lot.Shape)IR2 factor(Lot.Shape)IR3
## -6.623e-02 -2.455e-01
## factor(Lot.Shape)Reg factor(Condition.1)Feedr
## -1.366e-01 -5.632e-02
## factor(Condition.1)Norm factor(Condition.1)PosA
## 9.319e-02 4.036e-01
## factor(Condition.1)PosN factor(Condition.1)RRAe
## 2.057e-01 -1.117e-01
## factor(Condition.1)RRAn factor(Condition.1)RRNe
## 9.015e-02 -5.952e-02
## factor(Condition.1)RRNn factor(Roof.Style)Gable
## 1.127e-01 -7.292e-02
## factor(Roof.Style)Gambrel factor(Roof.Style)Hip
## -4.558e-02 1.101e-01
## factor(Roof.Style)Mansard factor(Foundation)CBlock
## -7.124e-02 1.212e-01
## factor(Foundation)PConc factor(Foundation)Slab
## 5.414e-01 -8.171e-02
## factor(Foundation)Stone Bedroom.AbvGr
## 2.508e-01 8.124e-02
The stepAIC method also confirms that Lot.Area, Lot.Shape, Condition.1, Roof.Style, Foundation and Bedroom.AbvGr are significant predictors of log(price).
Hence, both model selection methods, backward selection based on AIC and backward elimination with adjusted R-squared, arrive at a model with the same set of predictors.
Therefore, the Final Initial Model is:
Initial_Model<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = model_no_na)
summary(Initial_Model)$adj.r.squared
## [1] 0.4797977
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
## Residual Plot
plot(Initial_Model, which=1)
## Outliers
##outlierTest(Initial_Model)
## The home which has largest squared residual
which.max(abs(residuals(Initial_Model)))
## 428
## 428
ames_train[c(428), c(3, 7, 15, 20, 21,53)]
## # A tibble: 1 × 6
## price Lot.Area Neighborhood Overall.Qual Overall.Cond Bedroom.AbvGr
## <int> <int> <fctr> <fctr> <int> <int>
## 1 12789 9656 OldTown 2 2 2
min(ames_train$price)
## [1] 12789
Observation 428 is an outlier with an extreme value of the response: it has the lowest price in the data set ($12,789). Other factors contributing to its large squared residual are its location in the OldTown neighborhood, poor overall material quality and finish (Overall.Qual = 2) and poor overall condition (Overall.Cond = 2). The very low price may also be related to its small number of bedrooms (only two) and its Lot.Area (9,656 square feet).
Outliers that influence the slope of the regression line are called influential points.
## Highest Influential observation
which.max(lm.influence(Initial_Model)$hat)
## 356
## 356
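Leverage alone does not capture influence on the fitted line; as a complementary check not included in the original output, Cook's distance combines leverage with residual size.
# Observations with the largest Cook's distance in the initial model
cooks <- cooks.distance(Initial_Model)
head(sort(cooks, decreasing = TRUE), 5)
## plot(Initial_Model, which = 4)  # optional Cook's distance plot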
The observation with the highest leverage (hat value) is 356, with PID 916176125. The filter() function is used to exclude this observation, and the result is saved in a new dataset ames_2nd. Using this dataset, a new linear model, New_Initial_Model, is created.
ames_2nd<- ames_train %>%
filter(PID!= 916176125)
New_Initial_Model<- lm(log(price) ~ Lot.Area + factor(Lot.Shape)+ factor(Condition.1) + factor(Roof.Style)+ factor(Foundation) + Bedroom.AbvGr, data = ames_2nd)
summary(New_Initial_Model)$adj.r.squared
## [1] 0.4830558
Proportion of observations that fall within the 95% prediction interval
# The proportion of observations that fall within prediction intervals
predict1 <- exp(predict(Initial_Model, ames_train, interval = "prediction"))
coverage.prob1 <- mean(ames_train$price > predict1[,"lwr"] & ames_train$price < predict1[,"upr"])
coverage.prob1
## [1] 0.949
From the output above, the proportion of observations that fall within the 95% prediction intervals is about 95% (0.949). Hence, the assumptions regarding uncertainty are met and the initial model adequately reflects uncertainty.
Next, calculate the RMSE of the initial model. You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
predict.full<- exp(predict(Initial_Model, ames_train))
resid.full<- ames_train$price - predict.full
rmse.full<- sqrt(mean(resid.full^2))
rmse.full
## [1] 61144.23
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called "overfitting." To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
A common metric for comparing the accuracy of predictions, and in particular the out-of-sample performance of multiple models, is the Root Mean Squared Error (RMSE): the square root of the mean of the squared residuals.
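Because the same RMSE computation is repeated below for the training, test and validation sets, it can be wrapped in a small helper; rmse_dollars() is a hypothetical convenience function that assumes the model predicts on the log scale.
# RMSE in dollars for a model fitted on log(price)
rmse_dollars <- function(model, data) {
  pred <- exp(predict(model, newdata = data))
  sqrt(mean((data$price - pred)^2))
}
# e.g. rmse_dollars(Initial_Model, ames_train)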
The two factor variables Foundation and Roof.Style have levels in the test data ("Wood" and "Shed", respectively) that do not occur in the training data, which causes an error when predicting on the test data. The rows containing these levels are therefore removed from the test data.
which(ames_test$Foundation=="Wood")
## [1] 352 704
which(ames_test$Roof.Style=="Shed")
## [1] 627
test_data<- ames_test[-c(352,627,704),]
# For Training Data
predict.full<- exp(predict(Initial_Model, ames_train))
resid.full<- ames_train$price - predict.full
rmse.full<- sqrt(mean(resid.full^2))
rmse.full
## [1] 61144.23
# For Test Data
predict.test<- exp(predict(Initial_Model, test_data))
resid.test<- test_data$price - predict.test
rmse.test<- sqrt(mean(resid.test^2))
rmse.test
## [1] 54631.68
From the output above, the RMSE for the test data is lower than that for the training data. In general, the lower the RMSE, the better the model's predictions. Therefore, the predictions are slightly more accurate for the test data than for the training data, which suggests the initial model is not overfitting the training data.
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Here, fifteen (15) predictor variables have been selected: Lot.Area, Lot.Shape, Land.Contour, Condition.1, House.Style, Roof.Style, Foundation, Bsmt.Exposure, Bedroom.AbvGr, Paved.Drive, Wood.Deck.SF, X3Ssn.Porch, Mo.Sold, Yr.Sold and Sale.Condition.
The variable price is the dependent (response) variable.
ames_new2<- ames_train %>%
mutate(Lot.Shape= as.numeric(Lot.Shape), Land.Contour=as.numeric(Land.Contour),Condition.1= as.numeric(Condition.1), House.Style= as.numeric(House.Style), Roof.Style=as.numeric(Roof.Style), Foundation= as.numeric(Foundation), Bsmt.Exposure= as.numeric(Bsmt.Exposure), Fireplaces= as.numeric(Fireplaces), Paved.Drive= as.numeric(Paved.Drive), Sale.Condition= as.numeric(Sale.Condition))
df <- ames_new2[, c(7,10,11,16,19,24,32,35,53,67,68,71,78,79,81)]
names(df)
## [1] "Lot.Area" "Lot.Shape" "Land.Contour" "Condition.1"
## [5] "House.Style" "Roof.Style" "Foundation" "Bsmt.Exposure"
## [9] "Bedroom.AbvGr" "Paved.Drive" "Wood.Deck.SF" "X3Ssn.Porch"
## [13] "Mo.Sold" "Yr.Sold" "Sale.Condition"
# Correlation plot
ggcorr(df, palette = "RdBu", label = TRUE, nbreaks=7, size= 3.2, hjust=0.85, vjust= 0.8)
The ggcorr() command is used to check the correlation between the explanatory variables (shown in the plot above). The plot shows that all the selected explanatory variables have either no or very weak correlation with one another.
R code and the summary output table for Full_model2
Selected2<- ames_train %>%
select(c(3,7,10,11,16,19,24,32,35,53,67,68,71,78,79,81))
names(Selected2)
## [1] "price" "Lot.Area" "Lot.Shape" "Land.Contour"
## [5] "Condition.1" "House.Style" "Roof.Style" "Foundation"
## [9] "Bsmt.Exposure" "Bedroom.AbvGr" "Paved.Drive" "Wood.Deck.SF"
## [13] "X3Ssn.Porch" "Mo.Sold" "Yr.Sold" "Sale.Condition"
model_no_na2= na.omit(Selected2)
Full_model2<- lm(log(price) ~ ., data=model_no_na2)
summary(Full_model2)
##
## Call:
## lm(formula = log(price) ~ ., data = model_no_na2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7323 -0.1339 -0.0053 0.1615 0.8875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.837e+00 1.313e+01 0.673 0.501176
## Lot.Area 6.729e-06 1.134e-06 5.936 4.11e-09 ***
## Lot.ShapeIR2 -7.485e-02 5.091e-02 -1.470 0.141814
## Lot.ShapeIR3 -1.709e-01 1.709e-01 -1.000 0.317377
## Lot.ShapeReg -8.456e-02 1.837e-02 -4.604 4.72e-06 ***
## Land.ContourHLS -5.224e-02 6.466e-02 -0.808 0.419353
## Land.ContourLow -2.799e-01 8.341e-02 -3.356 0.000823 ***
## Land.ContourLvl -7.113e-02 4.753e-02 -1.497 0.134846
## Condition.1Feedr -1.103e-01 6.630e-02 -1.664 0.096390 .
## Condition.1Norm 2.851e-02 5.633e-02 0.506 0.612885
## Condition.1PosA 2.236e-01 1.091e-01 2.049 0.040780 *
## Condition.1PosN 1.077e-02 9.651e-02 0.112 0.911163
## Condition.1RRAe -1.220e-01 9.819e-02 -1.243 0.214345
## Condition.1RRAn 3.516e-02 8.894e-02 0.395 0.692738
## Condition.1RRNe -8.883e-02 1.914e-01 -0.464 0.642748
## Condition.1RRNn 1.180e-01 1.583e-01 0.746 0.456139
## House.Style1.5Unf -1.221e-01 9.555e-02 -1.278 0.201610
## House.Style1Story -5.155e-02 3.204e-02 -1.609 0.107910
## House.Style2.5Unf 3.818e-02 8.702e-02 0.439 0.660898
## House.Style2Story -6.017e-03 3.306e-02 -0.182 0.855601
## House.StyleSFoyer -2.042e-01 5.754e-02 -3.549 0.000406 ***
## House.StyleSLvl -1.182e-01 5.138e-02 -2.301 0.021626 *
## Roof.StyleGable -5.653e-02 9.569e-02 -0.591 0.554821
## Roof.StyleGambrel 3.826e-02 1.342e-01 0.285 0.775683
## Roof.StyleHip 8.179e-02 9.637e-02 0.849 0.396274
## Roof.StyleMansard 1.065e-01 1.661e-01 0.641 0.521398
## FoundationCBlock 8.765e-03 3.282e-02 0.267 0.789448
## FoundationPConc 3.221e-01 3.407e-02 9.454 < 2e-16 ***
## FoundationStone 4.338e-02 1.565e-01 0.277 0.781619
## Bsmt.ExposureAv 3.927e-01 1.830e-01 2.146 0.032160 *
## Bsmt.ExposureGd 5.603e-01 1.840e-01 3.044 0.002397 **
## Bsmt.ExposureMn 3.174e-01 1.833e-01 1.731 0.083723 .
## Bsmt.ExposureNo 2.817e-01 1.819e-01 1.549 0.121747
## Bedroom.AbvGr 7.276e-02 1.192e-02 6.104 1.52e-09 ***
## Paved.DriveP 1.430e-01 5.991e-02 2.387 0.017186 *
## Paved.DriveY 3.059e-01 3.783e-02 8.088 1.87e-15 ***
## Wood.Deck.SF 5.793e-04 7.178e-05 8.070 2.15e-15 ***
## X3Ssn.Porch 5.121e-04 2.810e-04 1.823 0.068691 .
## Mo.Sold 4.663e-03 3.071e-03 1.518 0.129272
## Yr.Sold 1.028e-03 6.544e-03 0.157 0.875147
## Sale.ConditionAdjLand -2.765e-01 1.920e-01 -1.440 0.150198
## Sale.ConditionAlloca 2.640e-01 1.523e-01 1.733 0.083364 .
## Sale.ConditionFamily -2.527e-02 7.087e-02 -0.357 0.721498
## Sale.ConditionNormal 1.731e-01 3.452e-02 5.014 6.37e-07 ***
## Sale.ConditionPartial 3.984e-01 4.598e-02 8.665 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2526 on 934 degrees of freedom
## Multiple R-squared: 0.6429, Adjusted R-squared: 0.6261
## F-statistic: 38.22 on 44 and 934 DF, p-value: < 2.2e-16
Reason for transforming the dependent variable
## Dependent variable (price) without log transformation
Model_no_log<- lm(price ~ Lot.Area + factor(Lot.Shape) + factor(Land.Contour) + factor(Condition.1) + factor(House.Style) + factor(Roof.Style) + factor(Foundation) + factor(Bsmt.Exposure) + Bedroom.AbvGr + factor(Paved.Drive) + Wood.Deck.SF + X3Ssn.Porch + Mo.Sold + Yr.Sold, data = model_no_na2)
plot(fitted(Model_no_log), residuals(Model_no_log))
## Dependent variable (price) with log transformation
Model_log<- lm(log(price) ~ Lot.Area + factor(Lot.Shape) + factor(Land.Contour) + factor(Condition.1) + factor(House.Style) + factor(Roof.Style) + factor(Foundation) + factor(Bsmt.Exposure) + Bedroom.AbvGr + factor(Paved.Drive) + Wood.Deck.SF + X3Ssn.Porch + Mo.Sold + Yr.Sold, data = model_no_na2)
## Normal Probability plots for comparison of both models
par(las=1, mfrow= c(1,1))
qqnorm(Model_no_log$residuals, main = "Normal Q-Q Plot \n with price")
qqline(Model_no_log$residuals)
qqnorm(Model_log$residuals, main = "Normal Q-Q Plot \n with log(price)")
qqline(Model_log$residuals)
From the plots above, the model with log(price) shows little deviation from the reference line compared with the model using untransformed price. In the untransformed model, more points bend upward to the left of the line, indicating right skewness of the residuals. The log transformation of price makes the residuals approximately normally distributed, so it is preferable to use log(price) as the dependent variable.
Reason for transforming both the dependent and independent variables
Note: When log-transforming explanatory variables that can take the value zero, a useful trick is to add 1 before transforming.
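Base R also provides this directly: log1p(x) computes log(x + 1) and is numerically more accurate for values of x near zero.
# log1p() is equivalent to adding 1 before taking the log
all.equal(log1p(ames_train$Wood.Deck.SF), log(ames_train$Wood.Deck.SF + 1))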
## Dependent variable (price)and Independent variables (Lot.Area, Bedroom.AbvGr, Mo.Sold & Yr.Sold) without log transformation
No_log_model<- lm(price ~ Lot.Area + factor(Lot.Shape) + factor(Land.Contour) + factor(Condition.1) + factor(House.Style) + factor(Roof.Style) + factor(Foundation) + factor(Bsmt.Exposure) + Bedroom.AbvGr + factor(Paved.Drive) + Wood.Deck.SF + X3Ssn.Porch + Mo.Sold + Yr.Sold, data = model_no_na2)
## Dependent variable (price) and Independent variables (Lot.Area, Bedroom.AbvGr, Mo.Sold & Yr.Sold) with log transformation
Log_model<- lm(log(price) ~ log(Lot.Area) + factor(Lot.Shape) + factor(Land.Contour) + factor(Condition.1) + factor(House.Style) + factor(Roof.Style) + factor(Foundation) + factor(Bsmt.Exposure) + log(Bedroom.AbvGr + 1) + factor(Paved.Drive) + log(Wood.Deck.SF + 1) + log(X3Ssn.Porch + 1) + log(Mo.Sold + 1) + log(Yr.Sold + 1), data = model_no_na2)
## Residuals vs Predicted Plots for comparison of both models
par(las=1, mfrow= c(1,1))
plot(fitted(No_log_model), residuals(No_log_model), xlab = "Fitted", ylab = "Residuals", main = "Model without Log transformation of explanatory variables")
abline(h=0,col="red")
plot(fitted(Log_model), residuals(Log_model), xlab = "Fitted", ylab = "Residuals", main = "Model with Log transformation of explanatory variables")
abline(h=0,col="red")
From the plots above, the model with log-transformed variables shows residuals scattered with roughly constant width around 0 (i.e. homoscedastic), compared with the model without log transformations. The log transformation stabilizes the variance of the residuals, so the log-transformed variables give a better-fitting model.
Final_Full_Model
Final_Full_Model<- lm(log(price) ~ log(Lot.Area) + factor(Lot.Shape) + factor(Land.Contour) + factor(Condition.1) + factor(House.Style) + factor(Roof.Style) + factor(Foundation) + factor(Bsmt.Exposure) + log(Bedroom.AbvGr + 1) + factor(Paved.Drive) + log(Wood.Deck.SF + 1) + log(X3Ssn.Porch + 1) + log(Mo.Sold + 1) + log(Yr.Sold + 1), data = model_no_na2)
The variance inflation factor (VIF) indicates the extent to which multicollinearity is present in a regression model; a VIF of 5 or greater is generally associated with problematic multicollinearity.
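VIFs are not computed in the output above; a minimal sketch using car::vif is shown below (an assumption: the car package is not loaded in this report, and the VIF package attached earlier has a different interface).
# Generalized VIFs for the final full model (requires the car package)
library(car)
vif(Final_Full_Model)
# Rule of thumb: GVIF^(1/(2*Df)) above about sqrt(5) ~ 2.2 suggests problematic collinearity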
I also examined the individual predictor variables to determine whether further log or exponential transformations were warranted. Although some transformations produced more normal-looking distributions, they did not improve the final fit of the model, so I decided not to include additional variable transformations, given the amount of work needed to evaluate each predictor individually.
1. First Method for Model Selection is Backward Selection based on AIC
stepAIC<- stepAIC(Final_Full_Model,direction = "backward", trace = FALSE, k=2)
stepAIC
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + factor(Lot.Shape) +
## factor(Land.Contour) + factor(Condition.1) + factor(House.Style) +
## factor(Roof.Style) + factor(Foundation) + factor(Bsmt.Exposure) +
## log(Bedroom.AbvGr + 1) + factor(Paved.Drive) + log(Wood.Deck.SF +
## 1) + log(X3Ssn.Porch + 1) + log(Mo.Sold + 1), data = model_no_na2)
##
## Coefficients:
## (Intercept) log(Lot.Area)
## 8.93005 0.20211
## factor(Lot.Shape)IR2 factor(Lot.Shape)IR3
## -0.07984 0.09419
## factor(Lot.Shape)Reg factor(Land.Contour)HLS
## -0.05788 -0.03281
## factor(Land.Contour)Low factor(Land.Contour)Lvl
## -0.19059 -0.04747
## factor(Condition.1)Feedr factor(Condition.1)Norm
## -0.07413 0.05920
## factor(Condition.1)PosA factor(Condition.1)PosN
## 0.24829 0.10082
## factor(Condition.1)RRAe factor(Condition.1)RRAn
## -0.12835 0.04441
## factor(Condition.1)RRNe factor(Condition.1)RRNn
## -0.02702 0.06601
## factor(House.Style)1.5Unf factor(House.Style)1Story
## -0.12395 -0.06349
## factor(House.Style)2.5Unf factor(House.Style)2Story
## 0.13710 0.03456
## factor(House.Style)SFoyer factor(House.Style)SLvl
## -0.21604 -0.14338
## factor(Roof.Style)Gable factor(Roof.Style)Gambrel
## 0.11399 0.17801
## factor(Roof.Style)Hip factor(Roof.Style)Mansard
## 0.24331 0.15917
## factor(Foundation)CBlock factor(Foundation)PConc
## 0.02863 0.37120
## factor(Foundation)Stone factor(Bsmt.Exposure)Av
## 0.15186 0.54580
## factor(Bsmt.Exposure)Gd factor(Bsmt.Exposure)Mn
## 0.71757 0.43917
## factor(Bsmt.Exposure)No log(Bedroom.AbvGr + 1)
## 0.42965 0.09289
## factor(Paved.Drive)P factor(Paved.Drive)Y
## 0.17733 0.34074
## log(Wood.Deck.SF + 1) log(X3Ssn.Porch + 1)
## 0.01617 0.02770
## log(Mo.Sold + 1)
## 0.02764
BIC(stepAIC)
## [1] 371.7278
The stepAIC method retains Lot.Area, Lot.Shape, Land.Contour, Condition.1, House.Style, Roof.Style, Foundation, Bsmt.Exposure, Bedroom.AbvGr, Paved.Drive, Wood.Deck.SF, X3Ssn.Porch and Mo.Sold as predictors of log(price).
2. Second Method for Model Selection is Stepwise Selection with BIC
n = nrow(model_no_na2)   # number of observations, used in the BIC penalty k = log(n)
stepBIC = stepAIC(Final_Full_Model,k=log(n),trace = FALSE )
summary(stepBIC)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + factor(Lot.Shape) +
## factor(Condition.1) + factor(House.Style) + factor(Roof.Style) +
## factor(Foundation) + factor(Bsmt.Exposure) + log(Bedroom.AbvGr +
## 1) + factor(Paved.Drive) + log(Wood.Deck.SF + 1) + log(X3Ssn.Porch +
## 1), data = model_no_na2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91451 -0.14009 0.00883 0.15727 0.75978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.939424 0.278035 32.152 < 2e-16 ***
## log(Lot.Area) 0.192507 0.019198 10.027 < 2e-16 ***
## factor(Lot.Shape)IR2 -0.081803 0.051550 -1.587 0.112877
## factor(Lot.Shape)IR3 0.074799 0.156182 0.479 0.632103
## factor(Lot.Shape)Reg -0.058432 0.019079 -3.063 0.002256 **
## factor(Condition.1)Feedr -0.073046 0.067882 -1.076 0.282170
## factor(Condition.1)Norm 0.061407 0.057715 1.064 0.287617
## factor(Condition.1)PosA 0.279682 0.111415 2.510 0.012230 *
## factor(Condition.1)PosN 0.118731 0.098141 1.210 0.226661
## factor(Condition.1)RRAe -0.125805 0.100791 -1.248 0.212276
## factor(Condition.1)RRAn 0.050605 0.090834 0.557 0.577585
## factor(Condition.1)RRNe -0.032525 0.196758 -0.165 0.868740
## factor(Condition.1)RRNn 0.064349 0.162404 0.396 0.692027
## factor(House.Style)1.5Unf -0.121108 0.097916 -1.237 0.216450
## factor(House.Style)1Story -0.064074 0.032805 -1.953 0.051097 .
## factor(House.Style)2.5Unf 0.128743 0.088443 1.456 0.145816
## factor(House.Style)2Story 0.030548 0.034074 0.897 0.370211
## factor(House.Style)SFoyer -0.207243 0.058443 -3.546 0.000410 ***
## factor(House.Style)SLvl -0.145410 0.052480 -2.771 0.005702 **
## factor(Roof.Style)Gable 0.164435 0.091701 1.793 0.073265 .
## factor(Roof.Style)Gambrel 0.250412 0.131648 1.902 0.057458 .
## factor(Roof.Style)Hip 0.295541 0.092273 3.203 0.001406 **
## factor(Roof.Style)Mansard 0.213831 0.165445 1.292 0.196514
## factor(Foundation)CBlock 0.022404 0.033556 0.668 0.504520
## factor(Foundation)PConc 0.368668 0.034414 10.713 < 2e-16 ***
## factor(Foundation)Stone 0.149001 0.160734 0.927 0.354160
## factor(Bsmt.Exposure)Av 0.556251 0.186972 2.975 0.003004 **
## factor(Bsmt.Exposure)Gd 0.729649 0.187732 3.887 0.000109 ***
## factor(Bsmt.Exposure)Mn 0.453808 0.187484 2.421 0.015686 *
## factor(Bsmt.Exposure)No 0.442820 0.185867 2.382 0.017395 *
## log(Bedroom.AbvGr + 1) 0.108277 0.045165 2.397 0.016706 *
## factor(Paved.Drive)P 0.178657 0.061012 2.928 0.003491 **
## factor(Paved.Drive)Y 0.342205 0.038810 8.817 < 2e-16 ***
## log(Wood.Deck.SF + 1) 0.016276 0.003625 4.490 8e-06 ***
## log(X3Ssn.Porch + 1) 0.028207 0.013573 2.078 0.037960 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.26 on 944 degrees of freedom
## Multiple R-squared: 0.6177, Adjusted R-squared: 0.6039
## F-statistic: 44.86 on 34 and 944 DF, p-value: < 2.2e-16
BIC(stepBIC)
## [1] 352.8393
The stepBIC method retains Lot.Area, Lot.Shape, Condition.1, House.Style, Roof.Style, Foundation, Bsmt.Exposure, Bedroom.AbvGr, Paved.Drive, Wood.Deck.SF and X3Ssn.Porch as predictors of log(price).
Reasons for Model of Choice
The comparison of BIC of the stepAIC model with the stepBIC model reveals that BIC of stepBIC model is lower. Generally, the model with the lowest BIC is preferred, because lower BIC implies either fewer explanatory variables, better fit, or both.
BIC of stepAIC model = 371.7278
BIC of stepBIC model= 352.8393
Another reason for selecting the stepBIC model as the final model is the preference for the simplest adequate model, i.e. a parsimonious model with the smallest number of variables; the stepBIC model has fewer predictors. A direct tabulation of both information criteria is sketched below.
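The two criteria can be assembled into one small table from the fitted objects created above (a sketch, not part of the original output).
# Information criteria for the two candidate models
data.frame(model = c("stepAIC", "stepBIC"),
           AIC = c(AIC(stepAIC), AIC(stepBIC)),
           BIC = c(BIC(stepAIC), BIC(stepBIC)))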
Hence, the Final_Model is:
Final_Model<- lm(log(price) ~ log(Lot.Area)+ factor(Lot.Shape) + factor(Condition.1)+ factor(House.Style) + factor(Roof.Style)+ factor(Foundation) + factor(Bsmt.Exposure) + log(Bedroom.AbvGr + 1) + factor(Paved.Drive)+ log(Wood.Deck.SF +1) +log(X3Ssn.Porch + 1) , data= model_no_na2)
summary(Final_Model)$adj.r.squared
## [1] 0.6039156
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
When a model performs well on the existing data (training data) but poorly on new data (testing or validation data), it is said to be overfitting.
The metrics used to check which model fits best are: lower BIC, lower AIC and lower RMSE.
The most commonly used metric for comparing the out-of-sample performance of multiple models is the Root Mean Squared Error (RMSE).
The out-of-sample analysis above showed no evidence of overfitting: the model does not overreact to minor fluctuations in the training data and has reasonable predictive performance, so no changes to the variable selection were needed.
For your final model, create and briefly interpret an informative plot of the residuals.
Residual Plots
## Plots for checking constant variability of Residuals
par(las=1, mfrow= c(1,1))
plot(fitted(Final_Model), residuals(Final_Model), xlab = "Fitted", ylab = "Residuals", main = "Plot1: Residual Plot \n with Training Data")
abline(h=0, col= "red")
## Plots for checking normality of Residuals
qqnorm(Final_Model$residuals, main= "Plot2: Normal Probability Plot \n of Residuals for Training Data")
qqline(Final_Model$residuals)
plot(Final_Model, which=1)
For your final model, calculate and briefly comment on the RMSE.
The metric for comparing out-of-sample performance of multiple models is called Root Mean Squared Error (RMSE). It involves the square root of the mean of squared residuals.
Another factor variable, House.Style, has a level in the test data ("2.5Fin") that does not occur in the training data, which causes an error when predicting on the test data; those rows are removed as well. The NA entries are also removed from the test data using the na.omit() command, because they too cause problems when predicting.
which(ames_test$House.Style=="2.5Fin")
## [1] 163 204
test_data<- test_data[-c(163,204),]
# For Training Data
predict.train<- exp(predict(Final_Model, model_no_na2))
resid.train<- model_no_na2$price - predict.train
rmse.train<- sqrt(mean(resid.train^2))
rmse.train
## [1] 51806.17
# For Test Data
selected_test<- test_data %>%
select(price,Lot.Area, Lot.Shape,Condition.1,House.Style,Roof.Style,Foundation,Bsmt.Exposure,Bedroom.AbvGr,Paved.Drive,Wood.Deck.SF,X3Ssn.Porch)
model_no_na_test= na.omit(selected_test)
predict.test2<- exp(predict(Final_Model, model_no_na_test))
resid.test2<- model_no_na_test$price - predict.test2
rmse.test2<- sqrt(mean(resid.test2^2))
rmse.test2
Hence, the RMSE for the training data is lower than that for the test data.
What are some strengths and weaknesses of your model?
Strengths of Final Model
1. During the variable selection procedure, only explanatory variables with no or very weak correlation with each other were selected, so multicollinearity has been largely avoided; collinear variables lead to unstable estimates of the regression parameters.
2. During the model selection procedure, the simplest adequate model (the parsimonious model with the smallest number of variables) was preferred.
3. The model properly reflects uncertainty, as confirmed by the coverage probability: on the validation data, the true price falls within the 95% prediction interval about 97% of the time.
4. The finalized model has the lowest RMSE among the models considered; in general, the lower the RMSE, the better the model fits.
Weaknesses of Final Model
1. Although due care has been taken to avoid multicollinearity, it is impossible to completely avoid collinearity arising from observational data.
2. The normal probability plot of the residuals shows some outlying observations, i.e. the residual distribution has somewhat heavy tails.
3. The location of the properties (Neighborhood) has not been included in the analysis, although location can significantly affect the fair market value of a house. It was excluded because some levels of the Neighborhood variable have counts of 10 or fewer, which could lead to unreliable parameter estimates.
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the ames_validation dataset to do some additional assessment of your final model. Discuss your findings; be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
load("ames_validation.Rdata")
Section 4.4(a): RMSE of Final Model when applied to the validation data
The three factor variables Foundation, Roof.Style and House.Style have levels in the validation data that do not occur in the training data, which causes an error when predicting on the validation data. The rows with these levels are therefore removed from the validation data. The NA entries are also removed from the validation data using the na.omit() command, because they too cause problems when predicting.
which(ames_validation$House.Style=="2.5Fin")
## [1] 140 188 299 631
which(ames_validation$Roof.Style=="Shed")
## [1] 144 290 716
which(ames_validation$Foundation=="Wood")
## [1] 440 520 563
validation_data<- ames_validation[-c(140,144,188,290,299,440,520,563,631,716),]
## For Validation Data
selected_val<- validation_data %>%
select(price,Lot.Area, Lot.Shape,Condition.1,House.Style,Roof.Style,Foundation,Bsmt.Exposure,Bedroom.AbvGr,Paved.Drive,Wood.Deck.SF,X3Ssn.Porch)
model_no_na_val= na.omit(selected_val)
predict.val<- exp(predict(Final_Model, model_no_na_val))
resid.val<- model_no_na_val$price - predict.val
rmse.val<- sqrt(mean(resid.val^2))
rmse.val
## [1] 41486.07
Section 4.4(b): Comparison of the RMSE for the validation data with that for the training and testing data
From the output above, the RMSE for the validation data (41486.07) is lower than for both the training data (51806.17) and the testing data (54381.98). In general, the lower the RMSE, the better the fit, so the final model predicts the validation data at least as well as the data it was built on.
Section 4.4(c): The percentage of the 95% predictive confidence (or credible) intervals that contain the true price of the house in the validation data set
To assess whether the model reflects uncertainty properly, the coverage probability is calculated. If the assumptions regarding uncertainty are met, the 95% prediction intervals for price should contain roughly 95% of the observed prices.
# The proportion of observations that fall within prediction intervals
predict.full <- exp(predict(Final_Model, model_no_na_val, interval = "prediction", level = 0.95))
coverage.prob.full <- mean(model_no_na_val$price > predict.full[,"lwr"] & model_no_na_val$price < predict.full[,"upr"])
coverage.prob.full
## [1] 0.9725275
Section 4.4(d): Check whether final model properly reflect uncertainty
From the output above, the 95% prediction intervals for price contain the true price approximately 97% of the time (0.9725). This means the final model properly reflects the uncertainty; if anything, the intervals are slightly conservative.
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
The model shows that, with a few well-selected variables, we can already provide a reasonable explanation of house prices in Ames. Variables such as lot area, foundation type, basement exposure and paved driveway explain a good share of the variation around the mean price, and could already help real-estate investors and employees see whether they are charging (or being charged) more than the expected price. This dataset can be used to perform an analysis that gives deep insight into the key characteristics that influence the price of a home.
First, we used a variable selection procedure and kept only predictors that are not strongly associated with each other. We then used the Ames data to find the regression model that predicts housing prices most accurately, using different model selection methods. These methods eliminated weak predictors and produced a parsimonious model, i.e. a model with a small number of significant predictors. The final model also properly reflects uncertainty. The adjusted R-squared of the final model is 0.60, which means about 60% of the variability in log(price) is explained by the model. Moreover, a model with a modest R-squared but statistically significant predictors can still be used to draw conclusions about the association between changes in the predictor values and changes in the response.
Characteristics such as lot area, shape of the lot, proximity to various conditions, style of dwelling, roof style, type of foundation, basement exposure, bedrooms above grade, paved driveway, wood deck area and three-season porch area (in square feet) play an important role in determining house prices.
In the future, the model could be refined by adding additional informative variables such as tax status, prior sale price, buyer demographics (age, education, income), and the latitude and longitude of the houses.