The data used in this project comes from the beginner Kaggle competition, “House Prices: Advanced Regression Techniques”. As this is a beginner competition and one of my first data science projects, many of the ideas in this write-up are inspired by other write-ups, including:
https://www.kaggle.com/tannercarbonati/detailed-data-analysis-ensemble-modeling
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python/notebook
https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset
The goal of this project is to predict housing prices using data from the Ames Housing dataset. The dataset provides 79 explanatory variables describing 2919 houses in total: 1460 houses with sale prices included for training and 1459 houses for testing. The data is cleaned and transformed before being fed into XGBoost, a gradient boosting machine learning method. The error is measured using root mean squared logarithmic error (RMSLE).
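For reference, a small helper that computes RMSLE might look like the following (this is just an illustrative sketch; the actual metric is computed by Kaggle on submission):
# Root mean squared logarithmic error between predicted and actual prices
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}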
I first combine the training and testing data sets to find the variables that have missing values.
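The loading step isn't shown here; a minimal setup sketch, assuming the standard Kaggle train.csv and test.csv files and the caret and xgboost packages used later, might look like this:
# Assumed setup: read the Kaggle CSVs and load the libraries used below
library(caret)    # dummyVars, nearZeroVar
library(xgboost)  # xgb.DMatrix, xgb.train, xgb.cv, xgb.importance
train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)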
combined <- rbind(within(train, rm('SalePrice')), test)
na_cols <- which(colSums(is.na(combined)) > 0)
sort(colSums(sapply(combined[na_cols], is.na)), decreasing = TRUE)
## PoolQC MiscFeature Alley Fence FireplaceQu
## 2909 2814 2721 2348 1420
## LotFrontage GarageYrBlt GarageFinish GarageQual GarageCond
## 486 159 159 159 159
## GarageType BsmtCond BsmtExposure BsmtQual BsmtFinType2
## 157 82 82 81 80
## BsmtFinType1 MasVnrType MasVnrArea MSZoning Utilities
## 79 24 23 4 2
## BsmtFullBath BsmtHalfBath Functional Exterior1st Exterior2nd
## 2 2 2 1 1
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Electrical
## 1 1 1 1 1
## KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1
There are some variables which have just a few inputs missing, but also some variables that have the vast majority of their inputs missing. Usually, the presence of an NA means that the house doesn’t have the feature.
For example, the PoolQC feature has a whopping 2909 NAs, signifying that those 2909 houses do not have a pool. Other features such as Utilities and MSZoning have only a few missing values with no explanation as to why they are missing, so they are likely just input errors. For those features, the most common value is filled in.
fill_missing <- function(df){
#NA for PoolQC means no pool
df[["PoolQC"]][is.na(df[["PoolQC"]])] <- "None"
#NA for MiscFeature means none
df[["MiscFeature"]][is.na(df[["MiscFeature"]])] <- "None"
#NA for Alley means no alley access
df[["Alley"]][is.na(df[["Alley"]])] <- "None"
#NA for Fence means no fence
df[["Fence"]][is.na(df[["Fence"]])] <- "None"
#NA for FireplaceQu means no fireplace
df[["FireplaceQu"]][is.na(df[["FireplaceQu"]])] <- "None"
#For NAs in LotFrontage, fill in the median lot frontage from the rest of the data (69)
df[["LotFrontage"]][is.na(df[["LotFrontage"]])] <- 69
#For NAs in the garage factors, they mean that there is no garage
df[["GarageYrBlt"]][is.na(df[["GarageYrBlt"]])] <- 0
df[["GarageFinish"]][is.na(df[["GarageFinish"]])] <- "None"
df[["GarageQual"]][is.na(df[["GarageQual"]])] <- "None"
df[["GarageCond"]][is.na(df[["GarageCond"]])] <- "None"
df[["GarageType"]][is.na(df[["GarageType"]])] <- "None"
df[["GarageCars"]][is.na(df[["GarageCars"]])] <- 0
df[["GarageArea"]][is.na(df[["GarageArea"]])] <- 0
#For NAs in the basement factors, they mean that there is no basement
df[["BsmtCond"]][is.na(df[["BsmtCond"]])] <- "None"
df[["BsmtExposure"]][is.na(df[["BsmtExposure"]])] <- "None"
df[["BsmtQual"]][is.na(df[["BsmtQual"]])] <- "None"
df[["BsmtFinType2"]][is.na(df[["BsmtFinType2"]])] <- "None"
df[["BsmtFinType1"]][is.na(df[["BsmtFinType1"]])] <- "None"
df[["BsmtFullBath"]][is.na(df[["BsmtFullBath"]])] <- 0
df[["BsmtHalfBath"]][is.na(df[["BsmtHalfBath"]])] <- 0
df[["BsmtFinSF1"]][is.na(df[["BsmtFinSF1"]])] <- 0
df[["BsmtFinSF2"]][is.na(df[["BsmtFinSF2"]])] <- 0
df[["BsmtUnfSF"]][is.na(df[["BsmtUnfSF"]])] <- 0
df[["TotalBsmtSF"]][is.na(df[["TotalBsmtSF"]])] <- 0
#NA for MasVnrType and MasVnrArea means no masonry veneer
df[["MasVnrType"]][is.na(df[["MasVnrType"]])] <- "None"
df[["MasVnrArea"]][is.na(df[["MasVnrArea"]])] <- 0
#For NAs in MSZoning, we fill in the most common type, RL
df[["MSZoning"]][is.na(df[["MSZoning"]])] <- "RL"
#For NAs in Utilities, we fill in the most common type, AllPub
df[["Utilities"]][is.na(df[["Utilities"]])] <- "AllPub"
#For NAs in Functional, we fill in the most common type, Typ
df[["Functional"]][is.na(df[["Functional"]])] <- "Typ"
#For NAs in exteriors, we will fill in "Other"
df[["Exterior1st"]][is.na(df[["Exterior1st"]])] <- "Other"
df[["Exterior2nd"]][is.na(df[["Exterior2nd"]])] <- "Other"
#For NAs in Electrical, we fill in the most common type, SBrkr
df[["Electrical"]][is.na(df[["Electrical"]])] <- "SBrkr"
#For NAs in KitchenQual, we fill in the most common type, TA
df[["KitchenQual"]][is.na(df[["KitchenQual"]])] <- "TA"
#For NAs in SaleType, we fill in the most common type, WD
df[["SaleType"]][is.na(df[["SaleType"]])] <- "WD"
return(df)
}
Some of the numeric variables should really be categorical variables and vice versa. For example, MSSubClass is stored as a numeric variable, but the numbers are codes for different dwelling types and the codes themselves have no numeric meaning, so we change it to categorical. There are also many variables that rate features of the house from poor to excellent, which can be represented on a numeric scale, so we convert those features to numeric.
clean <- function(df){
#Numeric factors that should be categorical
df[["MoSold"]] <- factor(df[["MoSold"]])
df[["GarageYrBlt"]] <- factor(df[["GarageYrBlt"]])
df[["MSSubClass"]] <- factor(df[["MSSubClass"]])
#Categorical factors that should be numeric
df[["ExterQual"]] <- as.numeric(factor(df[["ExterQual"]], levels=c("None","Po","Fa", "TA", "Gd", "Ex")))
df[["ExterCond"]] <- as.numeric(factor(df[["ExterCond"]], levels=c("None","Po","Fa", "TA", "Gd", "Ex")))
df[["BsmtQual"]] <- as.numeric(factor(df[["BsmtQual"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
df[["BsmtCond"]] <- as.numeric(factor(df[["BsmtCond"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
df[["BsmtExposure"]] <- as.numeric(factor(df[["BsmtExposure"]], levels=c("None","No", "Mn", "Av", "Gd")))
df[["BsmtFinType1"]] <- as.numeric(factor(df[["BsmtFinType1"]], levels=c("None","Unf","LwQ","Rec","BLQ","ALQ","GLQ")))
df[["BsmtFinType2"]] <- as.numeric(factor(df[["BsmtFinType2"]], levels=c("None","Unf","LwQ","Rec","BLQ","ALQ","GLQ")))
df[["HeatingQC"]] <- as.numeric(factor(df[["HeatingQC"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
df[["KitchenQual"]] <- as.numeric(factor(df[["KitchenQual"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
df[["FireplaceQu"]] <- as.numeric(factor(df[["FireplaceQu"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
df[["GarageQual"]] <- as.numeric(factor(df[["GarageQual"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
df[["GarageCond"]] <- as.numeric(factor(df[["GarageCond"]], levels=c("None","Po", "Fa", "TA", "Gd", "Ex")))
df[["PoolQC"]] <- as.numeric(factor(df[["PoolQC"]], levels=c("None", "Fa", "TA", "Gd", "Ex")))
return(df)
}
Now that we filled in the missing data and cleaned it up a bit, we look to create some new variables from the existing ones.
The data set gives the number of half baths and full baths in the basements and above ground, but doesn’t give the total number of baths for the entire house, so we add those variables together to get the TotalBath variable. The data also gives us the square footage of the basement and each floor, but doesn’t give the total living square footage, so we add those variables to create the TotalSF variable.
Some features have a quality rating and condition rating, so we combine those two variables together to give each feature a grade.
featureEng <- function(df){
###Combining existing features###
df[["OverallGrade"]] <- df[["OverallQual"]] * df[["OverallCond"]]
df[["GarageGrade"]] <- df[["GarageQual"]] * df[["GarageCond"]]
df[["ExterGrade"]] <- df[["ExterQual"]] * df[["ExterCond"]]
df[["KitchenScore"]] <- df[["KitchenAbvGr"]] * df[["KitchenQual"]]
df[["FireplaceScore"]] <- df[["Fireplaces"]] * df[["FireplaceQu"]]
df[["GarageScore"]] <- df[["GarageArea"]] * df[["GarageQual"]]
df[["PoolScore"]] <- df[["PoolArea"]] * df[["PoolQC"]]
df[["TotalBath"]] <- df[["BsmtFullBath"]] + (0.5 * df[["BsmtHalfBath"]]) + df[["FullBath"]] + (0.5 * df[["HalfBath"]])
df[["TotalSF"]] <- df[["TotalBsmtSF"]] + df[["X1stFlrSF"]] + df[["X2ndFlrSF"]]
df[["TotalPorchSF"]] <- df[["OpenPorchSF"]] + df[["EnclosedPorch"]] + df[["X3SsnPorch"]] + df[["ScreenPorch"]]
return(df)
}
We use one-hot encoding (caret's dummyVars) to convert the categorical variables into numeric ones by creating indicator variables for each category, increasing the number of variables to 387.
combined <- fill_missing(combined)
combined <- clean(combined)
combined <- featureEng(combined)
dmy <- dummyVars(" ~ .", data = combined)
combined2 <- data.frame(predict(dmy, newdata = combined))
dim(combined2)
## [1] 2919 387
Now that we have cleaned the data and added some new features, we explore the data to see the distribution of our dependent variable, the sales price, and look at some of the important variables that influence the sales price.
From the histogram of sale prices, we can see that the data is skewed to the right. It makes sense that most of the houses would be around the same price with some houses being much more expensive. Since we prefer the distribution to be normal, we log transform the sale price to try to make it more normal.
After the sale prices are log transformed, the distribution is much closer to the normal curve. A roughly normal dependent variable usually works better for regression, so we train and predict on the log of the sale price and transform it back at the end.
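This step isn't shown explicitly in the code; a minimal sketch, assuming the one-hot encoded data from above and a testing data frame name of my own choosing, might look like this:
# Sketch: split the encoded data back into training and testing sets and
# attach the log-transformed sale price (the original may have used log1p)
training <- combined2[1:nrow(train), ]
training$SalePrice <- log(train$SalePrice)
testing <- combined2[(nrow(train) + 1):nrow(combined2), ]
hist(training$SalePrice, breaks = 30, xlab = "log(SalePrice)", main = "Log-transformed sale price")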
Now that we’ve transformed sale prices to be close to normal, we look at the variables that are highly correlated with sale price. These variables could be important in predicting housing prices and give more insight as to what determines a house’s price.
#look at correlations for factors to sale price
correlations <- cor(training)
## Warning in cor(training): the standard deviation is zero
correlations <- as.data.frame(correlations)
names <- row.names(correlations)
priceCorrelation <- correlations$SalePrice
correlations <- cbind(names, priceCorrelation)
correlations <- correlations[order(-priceCorrelation),]
correlations[1:11, 1:2]
## names priceCorrelation
## [1,] "SalePrice" "1"
## [2,] "OverallQual" "0.817184417921683"
## [3,] "TotalSF" "0.777296214853448"
## [4,] "GrLivArea" "0.70092665254413"
## [5,] "GarageCars" "0.680624807436047"
## [6,] "ExterQual" "0.678839834864308"
## [7,] "TotalBath" "0.673010594043434"
## [8,] "KitchenQual" "0.667893025602166"
## [9,] "GarageScore" "0.653893005412548"
## [10,] "GarageArea" "0.650887555902007"
## [11,] "BsmtQual" "0.615803610631313"
As you would expect, the overall quality and total square footage of the house are highly correlated with the sale price. Some of the other highly correlated factors are the number of cars the garage can hold, the number of bathrooms, and the quality of the kitchen and basement. These ten variables are plotted against the sale price to take a closer look and to try to find some outliers.
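The scatter plots themselves are not reproduced here; a minimal sketch of the GrLivArea plot, assuming ggplot2, would be:
# Sketch (ggplot2 assumed): above ground living area vs. log sale price,
# used to spot the outliers discussed below
library(ggplot2)
ggplot(training, aes(x = GrLivArea, y = SalePrice)) +
  geom_point() +
  labs(x = "Above ground living area (sq ft)", y = "log(SalePrice)")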
From the above ground living area (GrLivArea) plot, we can see that only 4 houses have a living area of more than 4000 square feet. The two points with high prices aren't too far from the general curve of the data and could help the model predict the price of more expensive houses, so those points are left alone. The two houses with the largest living areas, however, don't fit the general curve of the plot. A possible explanation for their sale prices being low for their living areas is that they are in the middle of nowhere. It is safe to remove these two points from the training set so that they don't skew our predictions.
training[which(training$GrLivArea > 4000), c("GrLivArea", "SalePrice")]
## GrLivArea SalePrice
## 524 4676 12.12676
## 692 4316 13.53447
## 1183 4476 13.52114
## 1299 5642 11.98293
training <- training[-c(524,1299),]
Using caret's nearZeroVar function, we find variables that have near-zero variance (nearly all of their inputs are the same) and remove them to make training more efficient, reducing the number of variables down to 124.
# Find near-zero variance factors using caret
nzv.data <- nearZeroVar(training, saveMetrics = TRUE)
# Remove near-zero variance factors
drop.cols <- rownames(nzv.data)[nzv.data$nzv == TRUE]
training <- training[,!names(training) %in% drop.cols]
dim(training)
## [1] 1458 124
The model is trained using XGBoost, producing a model with which the sale prices of the testing data can be predicted. The parameters for XGBoost are tuned using XGBoost's cross validation function so that the model neither overfits nor underfits the data.
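The construction of trainD and testD isn't shown above; a plausible sketch, assuming the training and testing data frames from earlier, is:
# Sketch (assumed): build the xgboost matrices used below; the label is the
# log-transformed sale price, and the test matrix keeps only the surviving features
feature_names <- setdiff(names(training), "SalePrice")
trainD <- xgb.DMatrix(data = as.matrix(training[, feature_names]), label = training$SalePrice)
testD <- xgb.DMatrix(data = as.matrix(testing[, feature_names]))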
param <- list(colsample_bytree = 1,
subsample = .6,
booster = "gbtree",
max_depth = 8,
eta = 0.05,
min_child_weight = 2,
eval_metric = "rmse",
objective="reg:linear",
gamma = 0.01)
# cross validation used to tune parameters for xgboost
# xgb.cv(params = param,
# data = trainD,
# nrounds = 10000,
# watchlist = list(train = trainD),
# verbose = TRUE,
# print_every_n = 20,
# nthread = 2,
# nfold = 5)
bstSparse <-
xgb.train(params = param,
data = trainD,
nrounds = 400,
watchlist = list(train = trainD),
verbose = TRUE,
print_every_n = 50,
nthread = 2)
## [1] train-rmse:10.954814
## [51] train-rmse:0.868689
## [101] train-rmse:0.106929
## [151] train-rmse:0.053322
## [201] train-rmse:0.041159
## [251] train-rmse:0.035464
## [301] train-rmse:0.032978
## [351] train-rmse:0.031696
## [400] train-rmse:0.031037
prediction <- predict(bstSparse, testD)
prediction <- as.data.frame(as.matrix(prediction))
colnames(prediction) <- "prediction"
model_output <- cbind(test, prediction)
submission <- data.frame(Id = test$Id, SalePrice = model_output$prediction)
submission$SalePrice <- exp(submission$SalePrice) + 1
head(submission)
## Id SalePrice
## 1 1461 125605.3
## 2 1462 171912.2
## 3 1463 187967.7
## 4 1464 193506.9
## 5 1465 182866.4
## 6 1466 172233.8
summary(submission$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 38629 127090 156913 178813 208235 658751
Using this model, the predicted sale prices achieved an RMSLE of 0.12291. The top scores are around 0.112.
The median sale price in the data is $163,000. With an RMSLE of 0.12291, my model's predictions are off by roughly plus or minus $21,317 at the median, about a 13% error. At the top score of 0.112, the median would be off by plus or minus $19,043, about an 11.7% error. So my model's error is only about 1.3 percentage points higher than that of the best models. The top scores implement many more advanced techniques, such as stacking regression models (XGBoost along with various other prediction models) and more extensive hyperparameter tuning.
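As a rough back-of-the-envelope check on those dollar figures (my own arithmetic, assuming the error is treated as multiplicative on the price scale):
# An RMSLE of r corresponds to a typical multiplicative error of about exp(r) - 1
median_price <- 163000
median_price * (exp(0.12291) - 1)  # roughly 21,300
median_price * (exp(0.112) - 1)    # roughly 19,300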
Using XGBoost’s importance function, we can look at the features that had the most importance in the model.
importance <- xgb.importance(colnames(training), model = bstSparse)
head(importance, 20)
## Feature Gain Cover Frequency
## 1: TotalBath 0.24114600 0.051132262 0.037841290
## 2: HouseStyle2Story 0.10863093 0.023977507 0.016605461
## 3: SaleConditionPartial 0.05931603 0.028865785 0.025227527
## 4: GarageScore 0.05847638 0.025148107 0.015328118
## 5: BedroomAbvGr 0.04594880 0.009528447 0.005748044
## 6: OverallCond 0.04208100 0.037894715 0.038160626
## 7: GarageFinishUnf 0.03643211 0.004821632 0.003991697
## 8: Fireplaces 0.03032596 0.004493019 0.003512694
## 9: LotFrontage 0.02747414 0.062601372 0.060035127
## 10: YearBuilt 0.02421270 0.022573314 0.023471180
## 11: FoundationPConc 0.02373667 0.005937463 0.003193358
## 12: FireplaceScore 0.02194458 0.022509967 0.013092767
## 13: BsmtFinSF1 0.01967137 0.069165047 0.063867156
## 14: X2ndFlrSF 0.01843681 0.044271648 0.037521954
## 15: MasVnrArea 0.01647510 0.001391655 0.001277343
## 16: ElectricalSBrkr 0.01395112 0.028367587 0.033210921
## 17: BsmtFinType1 0.01374305 0.036439713 0.039437969
## 18: HeatingQC 0.01275097 0.003957869 0.003512694
## 19: TotalSF 0.01269224 0.038673355 0.042311991
## 20: MSZoningRM 0.01129096 0.034960956 0.037681622
Out of the 10 features that were created, 4 are in the top 20 most important features and 2 are in the top 5. TotalBath, the feature created for the total number of bathrooms, was the most important feature. GarageScore was also in the top 5, with FireplaceScore and TotalSF reaching the top 20. Creating the TotalBath feature clearly increased model accuracy: none of the individual bathroom features appear in the top 20, meaning their information was not presented in a way that benefited the model until they were combined. Likewise, GarageScore is more important than the individual garage features it was built from.
It makes sense that these features are important in deciding the price of a house. The total number of bathrooms says a lot about the size and quality of the house, as do the garage score and fireplace score.
Most of the important features make a lot of sense. The number of bedrooms, the overall condition, and the year built are all things a home buyer would consider. A partial sale condition, meaning the house was not yet completed when bought and is therefore brand new, would of course raise the price. Houses with two or more stories cost more than single-story houses, as evidenced by the 2-story house type being 2nd on the list and 2nd floor square footage being in the top 20. Interestingly, 2nd floor square footage is more important than total square footage, suggesting that the number of stories matters a great deal to home buyers.
Surprisingly, some of the features that are highly correlated with sale price are not very important to the model. Garage area and above ground living area are among the less important features, and nothing about kitchens appears in the top 20 either, despite KitchenQual being correlated with sale price. This could be because those features are highly variable, or because much of their information is already captured by engineered features such as GarageScore and TotalSF, so the model largely ignores them.
Overall, this model predicted sale prices quite accurately. The results give a lot of insight into what home buyers view as most important, and because they are quantified rather than speculative, they could help home builders design houses that better fit consumers and earn more profit.