This document contains an analysis of the Ames housing data set and is used for the associated Kaggle competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The training data consists of 1460 observations of house sales from Ames, Iowa. Each sold house has up to 79 associated predictors.
With 79 predictors the feature space is quite large, and we likely do not need to include all of them; multicollinearity between predictors is also likely.
The method of evaluation is the Root Mean Squared Error (RMSE) between the log of the predicted sale price and the log of the actual sale price. Clearly, we wish to minimize this quantity.
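To make the target concrete, here is a small sketch of the evaluation metric, an RMSE computed on log prices (the helper name rmse_log is our own, not something supplied by the competition).
#Sketch of the evaluation metric: RMSE between log predictions and log actual prices.
rmse_log=function(predicted,actual){
sqrt(mean((log(predicted)-log(actual))^2))
}
#Example: rmse_log(c(105000,210000),c(100000,200000)) is about 0.049.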
We also need to scale the data (center and scale the columns). Scaling puts our predictors on a similar scale and speeds up learning. Luckily, the glmnet function does this for us.
We first try to build a simple model and add complexity over time.
We will see that sale price is not normally distributed. To remedy this, we will model log(SalePrice) and exponentiate our predictions to return to the original scale.
We also take inspiration and tips from the high ranking notebook: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
In the end, we will use the technique of model averaging/blending to combine models.
We will see that this results in a score in the top 15% of submissions on Kaggle.
suppressMessages(library(tidyverse))
suppressMessages(library(caret))
suppressMessages(library(caTools))
suppressMessages(library(ggplot2))
suppressMessages(library(MASS))
suppressMessages(library(glmnet))
suppressMessages(library(plotly))
suppressMessages(library(reshape2))
suppressMessages(library(gridExtra))
suppressMessages(library(gbm))
setwd("C:/Users/User/Desktop/kaggle_housing") #I saved the csv files to a folder on my desktop.
housing=read.csv(file="housing_train.csv")
housing_testing=read.csv(file="housing_testing.csv")
housing_testing$SalePrice=NA #Fill in NA for missing values of test sale price.
housing_full=rbind(housing,housing_testing) #Create a full dataset to transform predictors later on.
# We note we can remove some outliers from the dataset. In particular, only one test house is larger than 4000 sqft, so we won't train on houses that large (we drop those that also sold for under $800,000). This is noted in the dataset documentation.
housing_full=housing_full[-which(housing_full$GrLivArea>4000 & housing_full$SalePrice < 800000),]
housing=housing[-which(housing$GrLivArea>4000 & housing$SalePrice < 800000),]
par(mfrow=c(1,2))
hist(housing$SalePrice,xlab='housing price',ylab='Frequency',main="Histogram of Sale Price",col='red')
hist(log(housing$SalePrice),xlab='Log(housing price)',ylab='Frequency',main="Histogram of Log Sale Price",col='red')
#Looks better, don't forget we will have to back transform before we finish.
We take a look at how some of the predictors relate to log(SalePrice). There are too many predictors to consider all of them directly, so we choose some that we think may give insight.
A heat map allows us to examine the correlations among the predictors and with sale price. We include a heat map for our numeric predictors.
housing_numeric=housing%>%
dplyr::select_if(is.numeric)%>%
drop_na()
cc2=cor(housing_numeric)
cc2_melt=melt(cc2)
gz=ggplot(cc2_melt,mapping=aes(x=Var1,y=Var2,fill=value))+
geom_tile()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(text = element_text(size=8))+
ggtitle("Heat Map for Housing Data: Numeric Predictors")+
ylab("")+
xlab("")+
scale_fill_distiller(palette = "RdPu")
ggplotly(gz, tooltip="text")
#Try Zooming In!!
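As a complement to the heat map, we can also sort the correlations with SalePrice directly; a quick sketch using the cc2 correlation matrix computed above.
#Which numeric predictors correlate most strongly with SalePrice?
#(SalePrice itself will appear first with correlation 1.)
sale_cor=sort(cc2[,'SalePrice'],decreasing=TRUE)
head(sale_cor,10)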
We now need to clean up missing data and do some transformations. For factors, we impute missing values using the most common occurrence method. We first deal with those predictors which had the most missing data: instead of NA we assign them the factor level 'None'. For example, an NA Fence means no fence.
housing_full$Fence=addNA(housing_full$Fence)
levels(housing_full$Fence)=c(levels(housing_full$Fence)[-length(levels(housing_full$Fence))], 'None')
housing_full$Alley=addNA(housing_full$Alley)
levels(housing_full$Alley)=c(levels(housing_full$Alley)[-length(levels(housing_full$Alley))], 'None')
housing_full$FireplaceQu=addNA(housing_full$FireplaceQu)
levels(housing_full$FireplaceQu)=c(levels(housing_full$FireplaceQu)[-length(levels(housing_full$FireplaceQu))], 'None')
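The addNA/relevel pattern above is repeated for many more factors below; as a sketch (our own helper, not part of the original pipeline), the same idea can be wrapped in a small function.
#Sketch of a helper that turns NA in a factor into an explicit 'None' level.
add_none_level=function(f){
f=addNA(f)
levels(f)[is.na(levels(f))]='None'
f
}
#e.g. housing_full$Fence=add_none_level(housing_full$Fence)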
For houses in which lot frontage is missing, we assign the observation the median lot frontage for that specific neighborhood.
#impute lot frontage based off median for that neighborhood
for(i in 1:nrow(housing_full)){
if(is.na(housing_full[i,'LotFrontage'])){
select_area=housing_full[i,'Neighborhood']
s=housing_full %>%
dplyr::filter(Neighborhood==select_area)
con_med=median(s$LotFrontage,na.rm=TRUE)
housing_full[i,'LotFrontage']=con_med
}
}
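As an aside, the same imputation can be written without an explicit loop. A hedged dplyr sketch is shown below, writing to a throwaway copy (housing_full_alt) so the loop above remains the method actually used.
#Vectorized alternative: replace missing LotFrontage with the neighborhood median.
housing_full_alt=housing_full %>%
dplyr::group_by(Neighborhood) %>%
dplyr::mutate(LotFrontage=ifelse(is.na(LotFrontage),median(LotFrontage,na.rm=TRUE),LotFrontage)) %>%
dplyr::ungroup()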
Some predictors are in the form of characters while we need them to be factors. We also fill in factor values that were missing or NA, using the method of most common occurrence.
housing_full$MSSubClass=as.factor(housing_full$MSSubClass)
housing_full$MSZoning=as.factor(housing_full$MSZoning)
housing_full$YrSold=as.factor(housing_full$YrSold)
housing_full$MoSold=as.factor(housing_full$MoSold)
housing_full[which(is.na(housing_full$MSZoning)),'MSZoning']='RL'
housing_full[which(is.na(housing_full$Functional)),'Functional']='Typ'
housing_full[which(is.na(housing_full$Electrical)),'Electrical']='SBrkr'
housing_full[which(is.na(housing_full$KitchenQual)),'KitchenQual']='TA'
housing_full[which(is.na(housing_full$SaleType)),'SaleType']="WD"
housing_full[which(is.na(housing_full$Exterior1st)),'Exterior1st']='VinylSd'
housing_full[which(is.na(housing_full$Exterior2nd)),'Exterior2nd']="VinylSd"
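The imputed levels above ('RL', 'Typ', 'SBrkr', and so on) are the most common values of each factor; they can also be found programmatically. A small sketch (the most_common helper is our own):
#Modal (most frequent) level of a factor, ignoring NA.
most_common=function(x) names(which.max(table(x)))
most_common(housing_full$MSZoning) #should return 'RL', the most common zoning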
#For missing basement factors, NA means no basement, so we add an explicit 'None' level.
housing_full$BsmtQual=addNA(housing_full$BsmtQual)
levels(housing_full$BsmtQual) <- c(levels(housing_full$BsmtQual)[-length(levels(housing_full$BsmtQual))], 'None')
housing_full$BsmtCond=addNA(housing_full$BsmtCond)
levels(housing_full$BsmtCond)=c(levels(housing_full$BsmtCond)[-length(levels(housing_full$BsmtCond))], 'None')
housing_full$BsmtExposure=addNA(housing_full$BsmtExposure)
levels(housing_full$BsmtExposure)=c(levels(housing_full$BsmtExposure)[-length(levels(housing_full$BsmtExposure))], 'None')
housing_full$BsmtFinType1=addNA(housing_full$BsmtFinType1)
levels(housing_full$BsmtFinType1)=c(levels(housing_full$BsmtFinType1)[-length(levels(housing_full$BsmtFinType1))], 'None')
housing_full$BsmtFinType2=addNA(housing_full$BsmtFinType2)
levels(housing_full$BsmtFinType2)=c(levels(housing_full$BsmtFinType2)[-length(levels(housing_full$BsmtFinType2))], 'None')
#For missing numeric predictors, no basement is equivalent to zero square feet.
for(i in 1:nrow(housing_full)){
if(is.na(housing_full[i,'BsmtFinSF1'])){
housing_full[i,'BsmtFinSF1']=0
}
if(is.na(housing_full[i,'BsmtFinSF2'])){
housing_full[i,'BsmtFinSF2']=0
}
if(is.na(housing_full[i,'BsmtUnfSF'])){
housing_full[i,'BsmtUnfSF']=0
}
if(is.na(housing_full[i,'TotalBsmtSF'])){
housing_full[i,'TotalBsmtSF']=0
}
if(is.na(housing_full[i,'BsmtFullBath'])){
housing_full[i,'BsmtFullBath']=0
}
if(is.na(housing_full[i,'BsmtHalfBath'])){
housing_full[i,'BsmtHalfBath']=0
}
}
housing_full$GarageType=addNA(housing_full$GarageType)
levels(housing_full$GarageType) <- c(levels(housing_full$GarageType)[-length(levels(housing_full$GarageType))], 'None')
housing_full$GarageFinish=addNA(housing_full$GarageFinish)
levels(housing_full$GarageFinish) <- c(levels(housing_full$GarageFinish)[-length(levels(housing_full$GarageFinish))], 'None')
housing_full$GarageQual=addNA(housing_full$GarageQual)
levels(housing_full$GarageQual) <- c(levels(housing_full$GarageQual)[-length(levels(housing_full$GarageQual))], 'None')
housing_full$GarageCond=addNA(housing_full$GarageCond)
levels(housing_full$GarageCond) <- c(levels(housing_full$GarageCond)[-length(levels(housing_full$GarageCond))], 'None')
for(i in 1:nrow(housing_full)){
if(is.na(housing_full[i,'GarageYrBlt'])){
housing_full[i,'GarageYrBlt']=0
}
if(is.na(housing_full[i,'GarageArea'])){
housing_full[i,'GarageArea']=0
}
if(is.na(housing_full[i,'GarageCars'])){
housing_full[i,'GarageCars']=0
}
}
Some predictors, such as Id, have no value in prediction. Others have too many missing values or almost no variability in the values they take. Hence, we remove these predictors. We also create a new predictor, TotalSQ: the total square footage of the basement, first and second floors.
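Low-variability predictors can also be flagged programmatically; a small sketch using caret's nearZeroVar (the removals below were chosen by inspecting the data, not by this check).
#Flag predictors with near-zero variance; Utilities, for example, is nearly constant.
nzv_cols=caret::nearZeroVar(housing_full,names=TRUE)
nzv_cols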
housing_full=housing_full%>% # Remove Id, Utilities, MiscFeature and PoolQC
dplyr::select(-Utilities,-Id,-MiscFeature,-PoolQC)
housing_full$TotalSQ=housing_full$TotalBsmtSF+housing_full$X1stFlrSF+housing_full$X2ndFlrSF #New feature: total square footage
housing_full$MasVnrType=addNA(housing_full$MasVnrType)
levels(housing_full$MasVnrType) <- c(levels(housing_full$MasVnrType)[-length(levels(housing_full$MasVnrType))], 'None')
for(i in 1:nrow(housing_full)){
if(is.na(housing_full[i,'MasVnrArea'])){
housing_full[i,'MasVnrArea']=0
}}
for(i in 1:ncol(housing_full)){
if(is.character(housing_full[,i])){
housing_full[,i]=as.factor(housing_full[,i])
}
}
first_test_row=nrow(housing)+1 #Index at which the test rows start in housing_full
housing=housing_full[1:nrow(housing),]
housing_predictors=dplyr::select(housing,-SalePrice) #Predictors only
housing_sale=housing$SalePrice #Response
housing_testing=housing_full[first_test_row:nrow(housing_full),]
In order to use glmnet in R we need our data in the form of 'model matrices':
housing_predictors_Matrix=housing%>%
dplyr::select(-SalePrice)%>%
data.matrix()
housing_testing_Matrix=housing_testing%>%
dplyr::select(-SalePrice)%>%
data.matrix()
housing_Matrix=housing%>%
data.matrix()
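One thing to be aware of: data.matrix() replaces each factor with its integer level codes, so the penalized regressions below treat factors as if they were ordered numeric variables. An alternative (a sketch, not what we do here) is model.matrix(), which expands factors into 0/1 dummy columns.
#Dummy-encoding sketch: expand factors into indicator columns (intercept dropped).
housing_dummy_Matrix=model.matrix(~.-SalePrice,data=housing)[,-1]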
Algorithms such as linear regression and ridge/lasso/elastic net work best on scaled predictors, and scaling also speeds up learning in high-dimensional datasets. Luckily, glmnet standardizes the predictors by default (standardize = TRUE), so we need not do it manually.
Our training sum of squared errors (on the log scale) for ridge regression is 18.9.
cv_ridge<-cv.glmnet(x=housing_predictors_Matrix,y=log(housing_Matrix[,'SalePrice']),alpha=0)
fit_ridge<-glmnet(x=housing_predictors_Matrix, y=log(housing$SalePrice), alpha = 0, lambda = cv_ridge$lambda.min)
pridge=exp(predict(fit_ridge,newx=housing_testing_Matrix,type='response',s=cv_ridge$lambda.min))
# Train Error
ptrain_ridge=predict(fit_ridge,newx=housing_predictors_Matrix,type='response',s=cv_ridge$lambda.min)
MSE_ridge=sum((log(housing$SalePrice)-ptrain_ridge)^2) #Train Error
Our training sum of squared errors (on the log scale) for lasso regression is 19.95.
cv_lasso<-cv.glmnet(x=housing_predictors_Matrix,y=log(housing_Matrix[,'SalePrice']),alpha=1)
fit_lasso<-glmnet(x=housing_predictors_Matrix, y=log(housing$SalePrice), alpha = 1, lambda = cv_lasso$lambda.min)
plasso=exp(predict(fit_lasso,newx=housing_testing_Matrix,type='response',s=cv_lasso$lambda.min))
# Train Error
ptrain_lasso=predict(fit_lasso,newx=housing_predictors_Matrix,type='response',s=cv_lasso$lambda.min)
MSE_lasso=sum((log(housing$SalePrice)-ptrain_lasso)^2) #Train Error
Our training sum of squared errors (on the log scale) for elastic net regression is 19.29.
cv_en<-cv.glmnet(x=housing_predictors_Matrix,y=log(housing_Matrix[,'SalePrice']),alpha=0.5)
fit_elastic<-glmnet(x=housing_predictors_Matrix, y=log(housing$SalePrice), alpha = 0.5, lambda = cv_en$lambda.min)
pelastic=exp(predict(fit_elastic,newx=housing_testing_Matrix,type='response',s=cv_en$lambda.min))
# Train Error
ptrain_en=predict(fit_elastic,newx=housing_predictors_Matrix,type='response',s=cv_en$lambda.min)
MSE_elastic=sum((log(housing$SalePrice)-ptrain_en)^2) #Train Error
Our training sum of squared errors (on the log scale) for gradient boosting is 14.5.
model_gbm=gbm(log(SalePrice)~.,data=housing,cv.folds = 5,distribution ="gaussian",shrinkage=0.3,n.trees = 300)
pgbm=exp(predict(model_gbm,newdata=housing_testing))
## Using 190 trees...
# Train Error
ptrain_gbm=predict(model_gbm,newdata=housing_predictors)
## Using 190 trees...
MSE_gbm=sum((log(housing$SalePrice)-ptrain_gbm)^2) #Train Error
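The '## Using 190 trees...' messages appear because predict.gbm picks an iteration count when n.trees is not supplied. The cross-validated optimum can also be requested explicitly; a small sketch using gbm.perf from the gbm package.
#Estimate the CV-optimal number of boosting iterations and predict with it.
best_iter=gbm.perf(model_gbm,method="cv",plot.it=FALSE)
pgbm_cv=exp(predict(model_gbm,newdata=housing_testing,n.trees=best_iter))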
Our training sum of squared errors (on the log scale) for the random forest is less than 6. This is very small and likely indicates that the random forest overfits the training data.
fitControl <- trainControl(method = "cv", # Define a train control that will do 5 fold cross validation for us.
number = 5)
#model_rf=caret::train(log(SalePrice)~.,data=housing,trControl=fitControl)
#prf=exp(predict(model_rf,newdata=housing_testing))
# Train Error
#ptrain_rf=predict(model_rf,newdata=housing_predictors)
#MSE_rf=sum((log(housing$SalePrice)-ptrain_rf)^2) #Train Error is too low, overfit.
We combine the three models with intermediate training errors: Ridge, Elastic Net and GBM. This will hopefully avoid including overfit models. We simply take the average of the three sets of predictions.
pav3=(1/3)*(pgbm+pelastic+pridge)
df.sub=data.frame(1461:2919,pav3)
colnames(df.sub)<-c('Id','SalePrice')
write.csv(df.sub,'housingprice1.csv',row.names=FALSE) # 0.1208
head(df.sub,n=15)
## Id SalePrice
## 1461 1461 120270.1
## 1462 1462 155085.1
## 1463 1463 181559.5
## 1464 1464 198565.5
## 1465 1465 191234.3
## 1466 1466 173190.7
## 1467 1467 179243.3
## 1468 1468 163654.1
## 1469 1469 188012.3
## 1470 1470 123393.6
## 1471 1471 193196.3
## 1472 1472 102580.2
## 1473 1473 100625.5
## 1474 1474 147995.1
## 1475 1475 120258.5
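As a possible extension of the equal-weight blend, the weights need not be fixed at 1/3 each. A small sketch with purely illustrative weights (they would normally be chosen on a held-out validation set, not guessed):
#Hypothetical weighted blend; the weights are illustrative and should sum to 1.
w=c(gbm=0.4,elastic=0.3,ridge=0.3)
pav_weighted=w['gbm']*pgbm+w['elastic']*pelastic+w['ridge']*pridge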