Introduction

The data for this project comes from the beginner Kaggle competition, “House Prices: Advanced Regression Techniques”. Since this is one of my first data science projects, many of the ideas in this write-up are inspired by other write-ups, including:

https://www.kaggle.com/tannercarbonati/detailed-data-analysis-ensemble-modeling
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python/notebook
https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset

The goal of this project is to predict housing prices using the Ames Housing dataset. The dataset provides 79 explanatory variables describing 2919 houses in total: 1460 houses with sale prices included for training and 1459 houses for testing. The data is cleaned and transformed before being fed into XGBoost, a gradient-boosted tree method. The error is measured using root mean squared logarithmic error (RMSLE).
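
For reference, RMSLE is just RMSE computed on log-transformed prices, so expensive and inexpensive houses contribute to the error on the same relative scale. A minimal sketch of the metric (the rmsle helper and the use of log1p are my own; the competition scores log-scale prices in the same way):

rmsle <- function(actual, predicted) {
  # RMSE on log(1 + price): relative errors, not absolute dollar errors
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}
rmsle(200000, 220000)  # a 10% over-prediction scores about 0.095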

Data Cleaning

I first combine the training and testing datasets to find the variables that have missing values.
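
The chunks below lean on a couple of packages, caret for dummyVars and nearZeroVar and xgboost for the model, and on train and test data frames read from the competition files. A minimal setup sketch (file names assumed to match the Kaggle download):

library(caret)    # dummyVars(), nearZeroVar()
library(xgboost)  # xgb.DMatrix(), xgb.train(), xgb.importance()

train <- read.csv("train.csv", stringsAsFactors = TRUE)
test  <- read.csv("test.csv", stringsAsFactors = TRUE)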

# Drop SalePrice from train (the test set doesn't have it), then stack the two sets
combined <- rbind(within(train, rm('SalePrice')), test)
# Count the NAs in every column that has at least one
na_cols <- which(colSums(is.na(combined)) > 0)
sort(colSums(sapply(combined[na_cols], is.na)), decreasing = TRUE)
##       PoolQC  MiscFeature        Alley        Fence  FireplaceQu 
##         2909         2814         2721         2348         1420 
##  LotFrontage  GarageYrBlt GarageFinish   GarageQual   GarageCond 
##          486          159          159          159          159 
##   GarageType     BsmtCond BsmtExposure     BsmtQual BsmtFinType2 
##          157           82           82           81           80 
## BsmtFinType1   MasVnrType   MasVnrArea     MSZoning    Utilities 
##           79           24           23            4            2 
## BsmtFullBath BsmtHalfBath   Functional  Exterior1st  Exterior2nd 
##            2            2            2            1            1 
##   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF   Electrical 
##            1            1            1            1            1 
##  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1

Some variables are missing only a few values, while others are missing the vast majority of them. Usually, the presence of an NA means that the house simply doesn’t have the feature.

For example, the PoolQC feature has a whopping 2909 NAs, signifying that those 2909 houses do not have a pool. Other features such as Utilities and MSZoning have only a few missing values with no explanation as to why, so it’s likely just an input error. For those features, the most common value is filled in.
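
The “most common” values hard-coded in the function below can be double-checked by tabulating each column; a quick sketch (the most_common helper is mine, not part of the original cleaning code):

# Most frequent non-NA level of a column
most_common <- function(x) names(which.max(table(x)))
most_common(combined$MSZoning)    # "RL"
most_common(combined$Electrical)  # "SBrkr"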

fill_missing <- function(df){
  #NA for PoolQC means no pool
  df[["PoolQC"]][is.na(df[["PoolQC"]])] <- "None"
  #NA for MiscFeature means none
  df[["MiscFeature"]][is.na(df[["MiscFeature"]])] <- "None"
  #NA for Alley means no alley access
  df[["Alley"]][is.na(df[["Alley"]])] <- "None"
  #NA for Fence means no fence
  df[["Fence"]][is.na(df[["Fence"]])] <- "None"
  #NA for FireplaceQu means no fireplace
  df[["FireplaceQu"]][is.na(df[["FireplaceQu"]])] <- "None"
  #For NAs in LotFrontage, we fill in the median lot frontage (69) from the rest of the data
  df[["LotFrontage"]][is.na(df[["LotFrontage"]])] <- 69
  #For NAs in the garage factors, they mean that there is no garage
  df[["GarageYrBlt"]][is.na(df[["GarageYrBlt"]])] <- 0
  df[["GarageFinish"]][is.na(df[["GarageFinish"]])] <- "None"
  df[["GarageQual"]][is.na(df[["GarageQual"]])] <- "None"
  df[["GarageCond"]][is.na(df[["GarageCond"]])] <- "None"
  df[["GarageType"]][is.na(df[["GarageType"]])] <- "None"
  df[["GarageCars"]][is.na(df[["GarageCars"]])] <- 0
  df[["GarageArea"]][is.na(df[["GarageArea"]])] <- 0
  #For NAs in the basement factors, they mean that there is no basement
  df[["BsmtCond"]][is.na(df[["BsmtCond"]])] <- "None"
  df[["BsmtExposure"]][is.na(df[["BsmtExposure"]])] <- "None"
  df[["BsmtQual"]][is.na(df[["BsmtQual"]])] <- "None"
  df[["BsmtFinType2"]][is.na(df[["BsmtFinType2"]])] <- "None"
  df[["BsmtFinType1"]][is.na(df[["BsmtFinType1"]])] <- "None"
  df[["BsmtFullBath"]][is.na(df[["BsmtFullBath"]])] <- 0
  df[["BsmtHalfBath"]][is.na(df[["BsmtHalfBath"]])] <- 0
  df[["BsmtFinSF1"]][is.na(df[["BsmtFinSF1"]])] <- 0
  df[["BsmtFinSF2"]][is.na(df[["BsmtFinSF2"]])] <- 0
  df[["BsmtUnfSF"]][is.na(df[["BsmtUnfSF"]])] <- 0
  df[["TotalBsmtSF"]][is.na(df[["TotalBsmtSF"]])] <- 0
  #NA for MasVnrType and MasVnrArea means no masonry veneer
  df[["MasVnrType"]][is.na(df[["MasVnrType"]])] <- "None"
  df[["MasVnrArea"]][is.na(df[["MasVnrArea"]])] <- 0
  #For NAs in MSZoning, we fill in the most common type, RL
  df[["MSZoning"]][is.na(df[["MSZoning"]])] <- "RL"
  #For NAs in Utilities, we fill in the most common type, AllPub
  df[["Utilities"]][is.na(df[["Utilities"]])] <- "AllPub"
  #For NAs in Functional, we fill in the most common type, Typ
  df[["Functional"]][is.na(df[["Functional"]])] <- "Typ"
  #For NAs in exteriors, we will fill in "Other"
  df[["Exterior1st"]][is.na(df[["Exterior1st"]])] <- "Other"
  df[["Exterior2nd"]][is.na(df[["Exterior2nd"]])] <- "Other"
  #For NAs in Electrical, we fill in the most common type, SBrkr
  df[["Electrical"]][is.na(df[["Electrical"]])] <- "SBrkr"
  #For NAs in KitchenQual, we fill in the most common type, TA
  df[["KitchenQual"]][is.na(df[["KitchenQual"]])] <- "TA"
  #For NAs in SaleType, we fill in the most common type, WD
  df[["SaleType"]][is.na(df[["SaleType"]])] <- "WD"
  return(df)
}

Some of the numeric variables should really be categorical and vice versa. For example, MSSubClass is stored as a number, but the numbers are codes for different dwelling types and the codes themselves have no numeric meaning, so we change it to categorical. Many variables rate features of the house from poor to excellent, which can be represented on a numeric scale, so we convert those features to numeric.

clean <- function(df){
  #Numeric factors that should be categorical
  df[["MoSold"]] <- factor(df[["MoSold"]])
  df[["GarageYrBlt"]] <- factor(df[["GarageYrBlt"]])
  df[["MSSubClass"]] <- factor(df[["MSSubClass"]])
  
  #Categorical factors that should be numeric
  df[["ExterQual"]] <- as.numeric(factor(df[["ExterQual"]], levels=c("None","Po","Fa", "TA", "Gd", "Ex")))
  df[["ExterCond"]] <- as.numeric(factor(df[["ExterCond"]], levels=c("None","Po","Fa", "TA", "Gd", "Ex")))
  df[["BsmtQual"]] <- as.numeric(factor(df[["BsmtQual"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
  df[["BsmtCond"]] <- as.numeric(factor(df[["BsmtCond"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
  df[["BsmtExposure"]] <- as.numeric(factor(df[["BsmtExposure"]], levels=c("None","No", "Mn", "Av", "Gd")))
  df[["BsmtFinType1"]] <- as.numeric(factor(df[["BsmtFinType1"]], levels=c("None","Unf","LwQ","Rec","BLQ","ALQ","GLQ")))
  df[["BsmtFinType2"]] <- as.numeric(factor(df[["BsmtFinType2"]], levels=c("None","Unf","LwQ","Rec","BLQ","ALQ","GLQ")))
  df[["HeatingQC"]] <- as.numeric(factor(df[["HeatingQC"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
  df[["KitchenQual"]] <- as.numeric(factor(df[["KitchenQual"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
  df[["FireplaceQu"]] <- as.numeric(factor(df[["FireplaceQu"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
  df[["GarageQual"]] <- as.numeric(factor(df[["GarageQual"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
  df[["GarageCond"]] <- as.numeric(factor(df[["GarageCond"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
  df[["PoolQC"]] <- as.numeric(factor(df[["PoolQC"]], levels=c("None", "Fa", "TA", "Gd", "Ex")))

  return(df)
}

Feature Engineering

Now that we filled in the missing data and cleaned it up a bit, we look to create some new variables from the existing ones.

The data set gives the number of half baths and full baths in the basement and above ground, but not the total number of baths for the entire house, so we combine them (counting each half bath as 0.5) into the TotalBath variable. The data also gives the square footage of the basement and of each floor, but not the total living square footage, so we add those together to create the TotalSF variable.

Some features have both a quality rating and a condition rating, so we multiply the two together to give each feature a single grade.

featureEng <- function(df){

  ###Combining existing features###
  df[["OverallGrade"]] <- df[["OverallQual"]] * df[["OverallCond"]]
  df[["GarageGrade"]] <- df[["GarageQual"]] * df[["GarageCond"]]
  df[["ExterGrade"]] <- df[["ExterQual"]] * df[["ExterCond"]]
  df[["KitchenScore"]] <- df[["KitchenAbvGr"]] * df[["KitchenQual"]]
  df[["FireplaceScore"]] <- df[["Fireplaces"]] * df[["FireplaceQu"]]
  df[["GarageScore"]] <- df[["GarageArea"]] * df[["GarageQual"]]
  df[["PoolScore"]] <- df[["PoolArea"]] * df[["PoolQC"]]

  df[["TotalBath"]] <- df[["BsmtFullBath"]] + (0.5 * df[["BsmtHalfBath"]]) + df[["FullBath"]] + (0.5 * df[["HalfBath"]])
  df[["TotalSF"]] <- df[["TotalBsmtSF"]] + df[["X1stFlrSF"]] + df[["X2ndFlrSF"]]
  df[["TotalPorchSF"]] <- df[["OpenPorchSF"]] + df[["EnclosedPorch"]] + df[["X3SsnPorch"]] + df[["ScreenPorch"]]
  
  return(df) 
}

We use one-hot encoding to turn the categorical variables into numeric ones, creating an indicator variable for each category that marks whether it applies to the house. This increases the number of variables to 387.

combined <- fill_missing(combined)
combined <- clean(combined)
combined <- featureEng(combined)

dmy <- dummyVars(" ~ .", data = combined)
combined2 <- data.frame(predict(dmy, newdata = combined))
dim(combined2)
## [1] 2919  387

Data Exploration

Now that we have cleaned the data and added some new features, we explore it to see the distribution of our dependent variable, the sale price, and look at some of the important variables that influence it.

From the histogram of sale prices, we can see that the data is skewed to the right. It makes sense that most houses sell for around the same price, with a smaller number being much more expensive. Since we prefer the distribution to be normal, we log transform the sale price to try to make it more normal.

After the log transform, the distribution is a lot closer to the normal curve. A roughly normal dependent variable usually works better for regression, so we train and predict on the log of the sale price and transform it back at the end.
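
The later chunks work with a training data frame that has the log-transformed sale price attached; that re-split isn’t shown explicitly, so here is a minimal sketch of what it would look like (I use log1p, i.e. log(1 + price); whether the original used log or log1p makes essentially no difference at these price levels, and the name testing is my own):

# First 1460 rows of the one-hot encoded data are the training houses
training <- combined2[1:1460, ]
training$SalePrice <- log1p(train$SalePrice)

# Remaining 1459 rows are the houses whose prices we need to predict
testing <- combined2[1461:2919, ]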

Now that we’ve transformed sale prices to be close to normal, we look at the variables that are highly correlated with sale price. These variables could be important in predicting housing prices and give more insight as to what determines a house’s price.

#look at correlations for factors to sale price
correlations <- cor(training)
## Warning in cor(training): the standard deviation is zero
correlations <- as.data.frame(correlations)
names <- row.names(correlations)
priceCorrelation <- correlations$SalePrice
correlations <- cbind(names, priceCorrelation)
correlations <- correlations[order(-priceCorrelation),]

correlations[1:11, 1:2]
##       names         priceCorrelation   
##  [1,] "SalePrice"   "1"                
##  [2,] "OverallQual" "0.817184417921683"
##  [3,] "TotalSF"     "0.777296214853448"
##  [4,] "GrLivArea"   "0.70092665254413" 
##  [5,] "GarageCars"  "0.680624807436047"
##  [6,] "ExterQual"   "0.678839834864308"
##  [7,] "TotalBath"   "0.673010594043434"
##  [8,] "KitchenQual" "0.667893025602166"
##  [9,] "GarageScore" "0.653893005412548"
## [10,] "GarageArea"  "0.650887555902007"
## [11,] "BsmtQual"    "0.615803610631313"

As you would expect, the overall quality and total square footage of the house are highly correlated with the sale price. Other highly correlated factors include the number of cars the garage can hold, the number of bathrooms, and the quality of the kitchen and basement. These ten variables are plotted against the sale price to take a closer look and to try to find some outliers.

From the above ground living area (GrLivArea) plot, we can see that only 4 houses have a living area of more than 4000 square feet. The two high-priced points aren’t too far from the general curve of the data and could help the model predict the price of more expensive houses, so those points are left alone. The two houses with the largest living areas, however, don’t fit the general curve at all. A possible explanation for their sale prices being so low relative to their size is location, for example being in the middle of nowhere. It is safe to remove these two points from the training set so that they don’t skew our predictions.

training[which(training$GrLivArea > 4000), c("GrLivArea", "SalePrice")]
##      GrLivArea SalePrice
## 524       4676  12.12676
## 692       4316  13.53447
## 1183      4476  13.52114
## 1299      5642  11.98293
training <- training[-c(524,1299),]

Using caret’s nearZeroVar function, we find variables that have near-zero variance (almost all of their inputs share the same value) and remove them to make training more efficient, reducing the number of variables to 124.

# Find near-zero variance factors using caret
nzv.data <- nearZeroVar(training, saveMetrics = TRUE)

# Remove near-zero variance factors
drop.cols <- rownames(nzv.data)[nzv.data$nzv == TRUE]
training <- training[,!names(training) %in% drop.cols]
dim(training)
## [1] 1458  124

Predicting

The data is trained using XGBoost, creating a model from which the sale prices of the testing data can be predicted. The parameters for XGBoost are tuned using XGBoost’s cross-validation function so that the model isn’t overfitting or underfitting the data.
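
xgb.train consumes xgb.DMatrix objects rather than data frames. The trainD and testD objects used below aren’t built in the code shown, so here is a minimal sketch of how they could be constructed from the cleaned data (train_features and testing are my own names; the test features are restricted to the same predictor columns that survived the near-zero-variance filter):

# Predictors and log sale price for the training houses
train_features <- as.matrix(training[, names(training) != "SalePrice"])
trainD <- xgb.DMatrix(data = train_features, label = training$SalePrice)

# Same predictor columns for the test houses
testD <- xgb.DMatrix(data = as.matrix(testing[, colnames(train_features)]))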

param <- list(colsample_bytree = 1,
              subsample = .6,
              booster = "gbtree",
              max_depth = 8,
              eta = 0.05,
              min_child_weight = 2,
              eval_metric = "rmse",
              objective="reg:linear",
              gamma = 0.01)

# cross validation used to tune parameters for xgboost
#  xgb.cv(params = param,
#            data = trainD,
#            nrounds = 10000,
#            watchlist = list(train = trainD),
#            verbose = TRUE,
#            print_every_n = 20,
#            nthread = 2,
#            nfold = 5)

bstSparse <-
  xgb.train(params = param,
            data = trainD,
            nrounds = 400,
            watchlist = list(train = trainD),
            verbose = TRUE,
            print_every_n = 50,
            nthread = 2)
## [1]  train-rmse:10.954814 
## [51] train-rmse:0.868689 
## [101]    train-rmse:0.106929 
## [151]    train-rmse:0.053322 
## [201]    train-rmse:0.041159 
## [251]    train-rmse:0.035464 
## [301]    train-rmse:0.032978 
## [351]    train-rmse:0.031696 
## [400]    train-rmse:0.031037
prediction <- predict(bstSparse, testD)
prediction <- as.data.frame(as.matrix(prediction))
colnames(prediction) <- "prediction"

model_output <- cbind(test, prediction)
submission <- data.frame(Id = test$Id, SalePrice = model_output$prediction)
submission$SalePrice <- expm1(submission$SalePrice) # undo the log transform (exp(x) - 1, assuming log1p was used)

head(submission)
##     Id SalePrice
## 1 1461  125605.3
## 2 1462  171912.2
## 3 1463  187967.7
## 4 1464  193506.9
## 5 1465  182866.4
## 6 1466  172233.8
summary(submission$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   38629  127090  156913  178813  208235  658751

Using this model, the predicted sale prices got an RMSLE of 0.12291. The top possible scores are around 0.112.

The median sale price in the data is $163,000. With an RMSLE of 0.12291, my model’s predictions are off by roughly plus or minus $21,317, about a 13% error. At the top score of 0.112, the error would be roughly plus or minus $19,318, about an 11.9% error, so my model’s error is only a bit over one percentage point worse than the best possible models. The top scores implement many more advanced techniques, such as stacking several regression models (XGBoost among them) and extensive hyperparameter tuning.
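
The dollar figures above treat the RMSLE as a typical relative error: exp(RMSLE) - 1 gives the multiplicative error, which is then applied to the median price. A quick check of that arithmetic:

median_price <- 163000
median_price * (exp(0.12291) - 1)  # ~21,317 for this model
median_price * (exp(0.112) - 1)    # ~19,318 for the top scores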

Using XGBoost’s importance function, we can look at the features that had the most importance in the model.

importance <- xgb.importance(colnames(training), model = bstSparse)
head(importance, 20)
##                  Feature       Gain       Cover   Frequency
##  1:            TotalBath 0.24114600 0.051132262 0.037841290
##  2:     HouseStyle2Story 0.10863093 0.023977507 0.016605461
##  3: SaleConditionPartial 0.05931603 0.028865785 0.025227527
##  4:          GarageScore 0.05847638 0.025148107 0.015328118
##  5:         BedroomAbvGr 0.04594880 0.009528447 0.005748044
##  6:          OverallCond 0.04208100 0.037894715 0.038160626
##  7:      GarageFinishUnf 0.03643211 0.004821632 0.003991697
##  8:           Fireplaces 0.03032596 0.004493019 0.003512694
##  9:          LotFrontage 0.02747414 0.062601372 0.060035127
## 10:            YearBuilt 0.02421270 0.022573314 0.023471180
## 11:      FoundationPConc 0.02373667 0.005937463 0.003193358
## 12:       FireplaceScore 0.02194458 0.022509967 0.013092767
## 13:           BsmtFinSF1 0.01967137 0.069165047 0.063867156
## 14:            X2ndFlrSF 0.01843681 0.044271648 0.037521954
## 15:           MasVnrArea 0.01647510 0.001391655 0.001277343
## 16:      ElectricalSBrkr 0.01395112 0.028367587 0.033210921
## 17:         BsmtFinType1 0.01374305 0.036439713 0.039437969
## 18:            HeatingQC 0.01275097 0.003957869 0.003512694
## 19:              TotalSF 0.01269224 0.038673355 0.042311991
## 20:           MSZoningRM 0.01129096 0.034960956 0.037681622

Out of the 10 features that were created, 4 were among the top 20 most important features and 2 made the top 10. TotalBath, the feature created for the total number of bathrooms, was the most important feature overall. GarageScore also made the top 5, with TotalSF and FireplaceScore reaching the top 20. Creating TotalBath clearly paid off: none of the individual bathroom counts appear in the top 20, suggesting that on their own they did not present the information in a way the model could use well. Likewise, GarageScore ranks higher in importance than the individual garage features it was built from.

It makes sense that these features are important in deciding the price of a house. The total number of bathrooms says a lot about the size and quality of the house, as do the garage and fireplace scores.

Most of the important features make a lot of sense. The number of bedrooms, the overall condition, and the year built are all things a home buyer would consider. A sale condition of Partial means the house wasn’t completed when bought, i.e. it is brand new, which would of course raise the price. Houses with two or more stories cost more than single-story houses, as shown by the 2-story house type ranking 2nd on the list and 2nd-floor square footage making the top 20. Interestingly, 2nd-floor square footage is more important than total square footage, suggesting that the number of stories matters a great deal to home buyers.

Surprisingly, some of the features that are highly correlated with sale price are not very important to the model. GarageArea and above ground living area (GrLivArea) rank low, and nothing about kitchens appears in the top 20 either, despite KitchenQual being correlated with sale price. This could be because those features have high variability, so the model largely ignores them.

Overall, this model predicted sale prices quite accurately. The results give a lot of insight into what home buyers view as most important, and because they are quantified rather than speculative, they could help home builders design houses that better fit the consumer and make more profit.