This document contains an analysis of the Ames housing data set and is used for the associated Kaggle competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
The training data consists of 1460 observations of house sales from Ames, Iowa. Each sold house has up to 79 associated predictors.
With 79 predictors the feature space is quite large, and we likely do not need to include all of them; multicollinearity between predictors is also likely.
The method of evaluation is the Root Mean Squared Error (RMSE) between the log of the predicted sale price and the log of the actual sale price. Clearly, we wish to minimize this quantity.
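To make the target concrete, here is a small sketch of the evaluation metric, an RMSE computed on log prices (the helper name rmse_log is our own, not something supplied by the competition).
#Sketch of the evaluation metric: RMSE between log predictions and log actual prices.
rmse_log=function(predicted,actual){
sqrt(mean((log(predicted)-log(actual))^2))
}
#Example: rmse_log(c(105000,210000),c(100000,200000)) is about 0.049.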
We also need to scale the data (center and scale the columns). Scaling puts our predictors on a similar scale and speeds up learning. Luckily, the glmnet function does this for us.
We first try to build a simple model and add complexity over time.
We will see that sale price is not normally distributed. To remedy this, we will model log(SalePrice) and exponentiate our predictions to return to the original scale.
We also take inspiration and tips from the high ranking notebook: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
In the end, we will use the technique of model averaging/blending to combine models.
We will see that this results in a score in the top 15% of submissions on Kaggle.
suppressMessages(library(tidyverse))
suppressMessages(library(caret))
suppressMessages(library(caTools))
suppressMessages(library(ggplot2))
suppressMessages(library(MASS))
suppressMessages(library(glmnet))
suppressMessages(library(plotly))
suppressMessages(library(reshape2))
suppressMessages(library(gridExtra))
suppressMessages(library(gbm))
setwd("C:/Users/User/Desktop/kaggle_housing") #I saved the csv files to a folder on my desktop.
housing=read.csv(file="housing_train.csv")
housing_testing=read.csv(file="housing_testing.csv")
housing_testing$SalePrice=NA #Fill in NA for missing values of test sale price.
housing_full=rbind(housing,housing_testing) #Create a full dataset to transform predictors later on.
# We note we can remove some outliers from the dataset. In particular, only one test house is larger than 4000 sqft, so we won't train on houses that large (we drop those that also sold for under $800,000). This is noted in the dataset documentation.
housing_full=housing_full[-which(housing_full$GrLivArea>4000 & housing_full$SalePrice < 800000),]
housing=housing[-which(housing$GrLivArea>4000 & housing$SalePrice < 800000),]
par(mfrow=c(1,2))
hist(housing$SalePrice,xlab='housing price',ylab='Frequency',main="Histogram of Sale Price",col='red')
hist(log(housing$SalePrice),xlab='Log(housing price)',ylab='Frequency',main="Histogram of Log Sale Price",col='red')
#Looks better, don't forget we will have to back transform before we finish.
We take a look at how some of the predictors relate to log(SalePrice). There are too many predictors to consider all of them directly, so we choose some that we think may give insight.
A heat map allows us to examine the correlations among the predictors and with sale price. We include a heat map for our numeric predictors.
housing_numeric=housing%>%
dplyr::select_if(is.numeric)%>%
drop_na()
cc2=cor(housing_numeric)
cc2_melt=melt(cc2)
gz=ggplot(cc2_melt,mapping=aes(x=Var1,y=Var2,fill=value))+
geom_tile()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(text = element_text(size=8))+
ggtitle("Heat Map for Housing Data: Numeric Predictors")+
ylab("")+
xlab("")+
scale_fill_distiller(palette = "RdPu")
ggplotly(gz, tooltip="text")
#Try Zooming In!!
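As a complement to the heat map, we can also sort the correlations with SalePrice directly; a quick sketch using the cc2 correlation matrix computed above.
#Which numeric predictors correlate most strongly with SalePrice?
#(SalePrice itself will appear first with correlation 1.)
sale_cor=sort(cc2[,'SalePrice'],decreasing=TRUE)
head(sale_cor,10)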
We now need to clean up missing data and do some transformations. For factors, we impute missing values using the most common occurrence method. We first deal with those predictors which had the most missing data: instead of NA we assign them the factor level 'None'. For example, an NA Fence means no fence.
housing_full$Fence=addNA(housing_full$Fence)
levels(housing_full$Fence)=c(levels(housing_full$Fence)[-length(levels(housing_full$Fence))], 'None')
housing_full$Alley=addNA(housing_full$Alley)
levels(housing_full$Alley)=c(levels(housing_full$Alley)[-length(levels(housing_full$Alley))], 'None')
housing_full$FireplaceQu=addNA(housing_full$FireplaceQu)
levels(housing_full$FireplaceQu)=c(levels(housing_full$FireplaceQu)[-length(levels(housing_full$FireplaceQu))], 'None')
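The addNA/relevel pattern above is repeated for many more factors below; as a sketch (our own helper, not part of the original pipeline), the same idea can be wrapped in a small function.
#Sketch of a helper that turns NA in a factor into an explicit 'None' level.
add_none_level=function(f){
f=addNA(f)
levels(f)[is.na(levels(f))]='None'
f
}
#e.g. housing_full$Fence=add_none_level(housing_full$Fence)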
For houses in which lot frontage is missing, we assign the observation the median lot frontage for that specific neighborhood.
#impute lot frontage based off median for that neighborhood
for(i in 1:nrow(housing_full)){
if(is.na(housing_full[i,'LotFrontage'])){
select_area=housing_full[i,'Neighborhood']
s=housing_full %>%
dplyr::filter(Neighborhood==select_area)
con_med=median(s$LotFrontage,na.rm=TRUE)
housing_full[i,'LotFrontage']=con_med
}
}
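As an aside, the same imputation can be written without an explicit loop. A hedged dplyr sketch is shown below, writing to a throwaway copy (housing_full_alt) so the loop above remains the method actually used.
#Vectorized alternative: replace missing LotFrontage with the neighborhood median.
housing_full_alt=housing_full %>%
dplyr::group_by(Neighborhood) %>%
dplyr::mutate(LotFrontage=ifelse(is.na(LotFrontage),median(LotFrontage,na.rm=TRUE),LotFrontage)) %>%
dplyr::ungroup()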
Some predictors are in the form of characters while we need them to be factors. We also fill in factor values that were missing or NA, using the method of most common occurrence.
housing_full$MSSubClass=as.factor(housing_full$MSSubClass)
housing_full$MSZoning=as.factor(housing_full$MSZoning)
housing_full$YrSold=as.factor(housing_full$YrSold)
housing_full$MoSold=as.factor(housing_full$MoSold)
housing_full[which(is.na(housing_full$MSZoning)),'MSZoning']='RL'
housing_full[which(is.na(housing_full$Functional)),'Functional']='Typ'
housing_full[which(is.na(housing_full$Electrical)),'Electrical']='SBrkr'
housing_full[which(is.na(housing_full$KitchenQual)),'KitchenQual']='TA'
housing_full[which(is.na(housing_full$SaleType)),'SaleType']="WD"
housing_full[which(is.na(housing_full$Exterior1st)),'Exterior1st']='VinylSd'
housing_full[which(is.na(housing_full$Exterior2nd)),'Exterior2nd']="VinylSd"
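The imputed levels above ('RL', 'Typ', 'SBrkr', and so on) are the most common values of each factor; they can also be found programmatically. A small sketch (the most_common helper is our own):
#Modal (most frequent) level of a factor, ignoring NA.
most_common=function(x) names(which.max(table(x)))
most_common(housing_full$MSZoning) #should return 'RL', the most common zoning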
#For missing basement factors, NA means no basement, so we add an explicit 'None' level.
housing_full$BsmtQual=addNA(housing_full$BsmtQual)
levels(housing_full$BsmtQual) <- c(levels(housing_full$BsmtQual)[-length(levels(housing_full$BsmtQual))], 'None')
housing_full$BsmtCond=addNA(housing_full$BsmtCond)
levels(housing_full$BsmtCond)=c(levels(housing_full$BsmtCond)[-length(levels(housing_full$BsmtCond))], 'None')
housing_full$BsmtExposure=addNA(housing_full$BsmtExposure)
levels(housing_full$BsmtExposure)=c(levels(housing_full$BsmtExposure)[-length(levels(housing_full$BsmtExposure))], 'None')
housing_full$BsmtFinType1=addNA(housing_full$BsmtFinType1)
levels(housing_full$BsmtFinType1)=c(levels(housing_full$BsmtFinType1)[-length(levels(housing_full$BsmtFinType1))], 'None')
housing_full$BsmtFinType2=addNA(housing_full$BsmtFinType2)
levels(housing_full$BsmtFinType2)=c(levels(housing_full$BsmtFinType2)[-length(levels(housing_full$BsmtFinType2))], 'None')
#For missing numeric predictors, no basement is equivalent to zero square feet.
for(i in 1:nrow(housing_full)){
if(is.na(housing_full[i,'BsmtFinSF1'])){
housing_full[i,'BsmtFinSF1']=0
}
if(is.na(housing_full[i,'BsmtFinSF2'])){
housing_full[i,'BsmtFinSF2']=0
}
if(is.na(housing_full[i,'BsmtUnfSF'])){
housing_full[i,'BsmtUnfSF']=0
}
if(is.na(housing_full[i,'TotalBsmtSF'])){
housing_full[i,'TotalBsmtSF']=0
}
if(is.na(housing_full[i,'BsmtFullBath'])){
housing_full[i,'BsmtFullBath']=0
}
if(is.na(housing_full[i,'BsmtHalfBath'])){
housing_full[i,'BsmtHalfBath']=0
}
}
housing_full$GarageType=addNA(housing_full$GarageType)
levels(housing_full$GarageType) <- c(levels(housing_full$GarageType)[-length(levels(housing_full$GarageType))], 'None')
housing_full$GarageFinish=addNA(housing_full$GarageFinish)
levels(housing_full$GarageFinish) <- c(levels(housing_full$GarageFinish)[-length(levels(housing_full$GarageFinish))], 'None')
housing_full$GarageQual=addNA(housing_full$GarageQual)
levels(housing_full$GarageQual) <- c(levels(housing_full$GarageQual)[-length(levels(housing_full$GarageQual))], 'None')
housing_full$GarageCond=addNA(housing_full$GarageCond)
levels(housing_full$GarageCond) <- c(levels(housing_full$GarageCond)[-length(levels(housing_full$GarageCond))], 'None')
for(i in 1:nrow(housing_full)){
if(is.na(housing_full[i,'GarageYrBlt'])){
housing_full[i,'GarageYrBlt']=0
}
if(is.na(housing_full[i,'GarageArea'])){
housing_full[i,'GarageArea']=0
}
if(is.na(housing_full[i,'GarageCars'])){
housing_full[i,'GarageCars']=0
}
}
Some predictors, such as Id, have no value in prediction. Others have too many missing values or almost no variability in the values they take. Hence, we remove these predictors. We also create a new predictor, TotalSQ: the total square footage of the basement, first and second floors.
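Low-variability predictors can also be flagged programmatically; a small sketch using caret's nearZeroVar (the removals below were chosen by inspecting the data, not by this check).
#Flag predictors with near-zero variance; Utilities, for example, is nearly constant.
nzv_cols=caret::nearZeroVar(housing_full,names=TRUE)
nzv_cols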
housing_full=housing_full%>% # Remove Id, Utilities, MiscFeature and PoolQC
dplyr::select(-Utilities,-Id,-MiscFeature,-PoolQC)
housing_full$TotalSQ=housing_full$TotalBsmtSF+housing_full$X1stFlrSF+housing_full$X2ndFlrSF #New feature: total square footage
housing_full$MasVnrType=addNA(housing_full$MasVnrType)
levels(housing_full$MasVnrType) <- c(levels(housing_full$MasVnrType)[-length(levels(housing_full$MasVnrType))], 'None')
for(i in 1:nrow(housing_full)){
if(is.na(housing_full[i,'MasVnrArea'])){
housing_full[i,'MasVnrArea']=0
}}
for(i in 1:ncol(housing_full)){
if(is.character(housing_full[,i])){
housing_full[,i]=as.factor(housing_full[,i])
}
}
first_test_row=nrow(housing)+1 #Index at which the test rows start in housing_full
housing=housing_full[1:nrow(housing),]
housing_predictors=dplyr::select(housing,-SalePrice) #Predictors only
housing_sale=housing$SalePrice #Response
housing_testing=housing_full[first_test_row:nrow(housing_full),]
In order to use glmnet in R we need our data in the form of 'model matrices':
housing_predictors_Matrix=housing%>%
dplyr::select(-SalePrice)%>%
data.matrix()
housing_testing_Matrix=housing_testing%>%
dplyr::select(-SalePrice)%>%
data.matrix()
housing_Matrix=housing%>%
data.matrix()
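One thing to be aware of: data.matrix() replaces each factor with its integer level codes, so the penalized regressions below treat factors as if they were ordered numeric variables. An alternative (a sketch, not what we do here) is model.matrix(), which expands factors into 0/1 dummy columns.
#Dummy-encoding sketch: expand factors into indicator columns (intercept dropped).
housing_dummy_Matrix=model.matrix(~.-SalePrice,data=housing)[,-1]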
Algorithms such as linear regression and ridge/lasso/elastic net work best on scaled predictors, and scaling also speeds up learning in high-dimensional datasets. Luckily, glmnet standardizes the predictors by default (standardize = TRUE), so we need not do it manually.
Our training sum of squared errors (on the log scale) for ridge regression is 18.9.
cv_ridge<-cv.glmnet(x=housing_predictors_Matrix,y=log(housing_Matrix[,'SalePrice']),alpha=0)
fit_ridge<-glmnet(x=housing_predictors_Matrix, y=log(housing$SalePrice), alpha = 0, lambda = cv_ridge$lambda.min)
pridge=exp(predict(fit_ridge,newx=housing_testing_Matrix,type='response',s=cv_ridge$lambda.min))
# Train Error
ptrain_ridge=predict(fit_ridge,newx=housing_predictors_Matrix,type='response',s=cv_ridge$lambda.min)
MSE_ridge=sum((log(housing$SalePrice)-ptrain_ridge)^2) #Train Error
Our training sum of squared errors (on the log scale) for lasso regression is 19.95.
cv_lasso<-cv.glmnet(x=housing_predictors_Matrix,y=log(housing_Matrix[,'SalePrice']),alpha=1)
fit_lasso<-glmnet(x=housing_predictors_Matrix, y=log(housing$SalePrice), alpha = 1, lambda = cv_lasso$lambda.min)
plasso=exp(predict(fit_lasso,newx=housing_testing_Matrix,type='response',s=cv_lasso$lambda.min))
# Train Error
ptrain_lasso=predict(fit_lasso,newx=housing_predictors_Matrix,type='response',s=cv_lasso$lambda.min)
MSE_lasso=sum((log(housing$SalePrice)-ptrain_lasso)^2) #Train Error
Our training sum of squared errors (on the log scale) for elastic net regression is 19.29.
cv_en<-cv.glmnet(x=housing_predictors_Matrix,y=log(housing_Matrix[,'SalePrice']),alpha=0.5)
fit_elastic<-glmnet(x=housing_predictors_Matrix, y=log(housing$SalePrice), alpha = 0.5, lambda = cv_en$lambda.min)
pelastic=exp(predict(fit_elastic,newx=housing_testing_Matrix,type='response',s=cv_en$lambda.min))
# Train Error
ptrain_en=predict(fit_elastic,newx=housing_predictors_Matrix,type='response',s=cv_en$lambda.min)
MSE_elastic=sum((log(housing$SalePrice)-ptrain_en)^2) #Train Error
Our training sum of squared errors (on the log scale) for gradient boosting is 14.5.
model_gbm=gbm(log(SalePrice)~.,data=housing,cv.folds = 5,distribution ="gaussian",shrinkage=0.3,n.trees = 300)
pgbm=exp(predict(model_gbm,newdata=housing_testing))
## Using 190 trees...
# Train Error
ptrain_gbm=predict(model_gbm,newdata=housing_predictors)
## Using 190 trees...
MSE_gbm=sum((log(housing$SalePrice)-ptrain_gbm)^2) #Train Error
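The '## Using 190 trees...' messages appear because predict.gbm picks an iteration count when n.trees is not supplied. The cross-validated optimum can also be requested explicitly; a small sketch using gbm.perf from the gbm package.
#Estimate the CV-optimal number of boosting iterations and predict with it.
best_iter=gbm.perf(model_gbm,method="cv",plot.it=FALSE)
pgbm_cv=exp(predict(model_gbm,newdata=housing_testing,n.trees=best_iter))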
Our training sum of squared errors (on the log scale) for the random forest is less than 6. This is very small and likely indicates that the random forest overfits the training data.
fitControl <- trainControl(method = "cv", # Define a train control that will do 5 fold cross validation for us.
number = 5)
#model_rf=caret::train(log(SalePrice)~.,data=housing,trControl=fitControl)
#prf=exp(predict(model_rf,newdata=housing_testing))
# Train Error
#ptrain_rf=predict(model_rf,newdata=housing_predictors)
#MSE_rf=sum((log(housing$SalePrice)-ptrain_rf)^2) #Train Error is too low, overfit.
We combine the three models with intermediate training errors: Ridge, Elastic Net and GBM. This will hopefully avoid including overfit models. We simply take the average of the three sets of predictions.
pav3=(1/3)*(pgbm+pelastic+pridge)
df.sub=data.frame(1461:2919,pav3)
colnames(df.sub)<-c('Id','SalePrice')
write.csv(df.sub,'housingprice1.csv',row.names=FALSE) # 0.1208
head(df.sub,n=15)
## Id SalePrice
## 1461 1461 120270.1
## 1462 1462 155085.1
## 1463 1463 181559.5
## 1464 1464 198565.5
## 1465 1465 191234.3
## 1466 1466 173190.7
## 1467 1467 179243.3
## 1468 1468 163654.1
## 1469 1469 188012.3
## 1470 1470 123393.6
## 1471 1471 193196.3
## 1472 1472 102580.2
## 1473 1473 100625.5
## 1474 1474 147995.1
## 1475 1475 120258.5
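As a possible extension of the equal-weight blend, the weights need not be fixed at 1/3 each. A small sketch with purely illustrative weights (they would normally be chosen on a held-out validation set, not guessed):
#Hypothetical weighted blend; the weights are illustrative and should sum to 1.
w=c(gbm=0.4,elastic=0.3,ridge=0.3)
pav_weighted=w['gbm']*pgbm+w['elastic']*pelastic+w['ridge']*pridge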