library(data.table)
library(ggplot2)
library(randomForest)
library(dplyr)
library(corrplot)
library(knitr)
library(kableExtra)
library(easyGgplot2)
library(caret)
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.
Visit this website if you want to know more about the data.
# File download
if (!file.exists("./Initial Data/test.csv")) {
    dir.create("./Initial Data")
    download.file("https://www.kaggle.com/c/house-prices-advanced-regression-techniques/download/test.csv",
                  "./Initial Data/test.csv")
    download.file("https://www.kaggle.com/c/house-prices-advanced-regression-techniques/download/train.csv",
                  "./Initial Data/train.csv")
}
# Data import
raw.test <- fread(input = "./Initial Data/test.csv", sep = ",", stringsAsFactors = F, data.table = F)
raw.train <- fread(input = "./Initial Data/train.csv", sep = ",", stringsAsFactors = F, data.table = F)
# Bind and label both train and test sets
fulldt <- rbind(raw.train[,-81], raw.test)
fulldt <- cbind(fulldt, Set = c(rep("Train", times = dim(raw.train)[1]),
rep("Test", times = dim(raw.test)[1])))
After some exploratory analysis, we notice that some of the variables contain NA values.
# Check for missing values
x <- colSums(sapply(fulldt, is.na))
# Set table
x <- data.frame(Variables = names(x), NA.Count = x); rownames(x) <- c()
# Remove variables that don't have missing values
x <- x %>% filter(NA.Count > 0)
kable(x, "html") %>%
kable_styling(full_width = F)
| Variables | NA.Count |
|---|---|
| MSZoning | 4 |
| LotFrontage | 486 |
| Alley | 2721 |
| Utilities | 2 |
| Exterior1st | 1 |
| Exterior2nd | 1 |
| MasVnrType | 24 |
| MasVnrArea | 23 |
| BsmtQual | 81 |
| BsmtCond | 82 |
| BsmtExposure | 82 |
| BsmtFinType1 | 79 |
| BsmtFinSF1 | 1 |
| BsmtFinType2 | 80 |
| BsmtFinSF2 | 1 |
| BsmtUnfSF | 1 |
| TotalBsmtSF | 1 |
| Electrical | 1 |
| BsmtFullBath | 2 |
| BsmtHalfBath | 2 |
| KitchenQual | 1 |
| Functional | 2 |
| FireplaceQu | 1420 |
| GarageType | 157 |
| GarageYrBlt | 159 |
| GarageFinish | 159 |
| GarageCars | 1 |
| GarageArea | 1 |
| GarageQual | 159 |
| GarageCond | 159 |
| PoolQC | 2909 |
| Fence | 2348 |
| MiscFeature | 2814 |
| SaleType | 1 |
Before going any further, we want to deal with the NA values and clean the data. There is a range of ways of dealing with missing values. Here we will apply simple methods in order to focus on other aspects of the project.
Method 1 - Replacing missing values with “0”. These numeric variables use NA to represent the absence of what is being measured. For instance, if a house has an NA in the BsmtFullBath variable, it means it doesn’t have any full bathrooms in the basement. So replacing NA with “0” seems to be the right approach here.
y <- c("LotFrontage", "MasVnrArea", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath")
fulldt[,y] <- apply(fulldt[,y], 2,
function(x) {
replace(x, is.na(x), 0)
}
)
Method 2 - Replacing missing values with “None”. Same as before, but with categorical (factor) variables.
y <- c("Alley", "BsmtQual", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "FireplaceQu", "PoolQC", "Fence", "MiscFeature", "GarageType", "GarageFinish", "GarageQual", "GarageCond", "BsmtCond")
fulldt[,y] <- apply(fulldt[,y], 2,
function(x) {
replace(x, is.na(x), "None")
}
)
Method 3 - Replacing missing values with the most common value. As stated before, there might be better approaches, but this one will save us time. Since we use the most frequent level of each factor variable, there is a higher probability that we are not making a mistake by assigning the most common value as the replacement for NAs.
y <- c("MSZoning", "Utilities", "Exterior1st", "Exterior2nd", "MasVnrType", "Electrical", "KitchenQual", "Functional", "SaleType")
fulldt[,y] <- apply(fulldt[,y], 2,
function(x) {
replace(x, is.na(x), names(which.max(table(x))))
}
)
Method 4 - Replacing missing values with the median. Same as before, but now with numerical variables.
y <- c("GarageCars", "GarageArea", "BsmtFinSF1")
fulldt[,y] <- apply(fulldt[,y], 2,
function(x) {
replace(x, is.na(x), median(x, na.rm = T))
}
)
Obs: the one remaining variable with missing values is GarageYrBlt. If there is a missing value in the GarageYrBlt variable, we assume that the garage was built in the same year as the house.
fulldt$GarageYrBlt[is.na(fulldt$GarageYrBlt)] <- fulldt$YearBuilt[is.na(fulldt$GarageYrBlt)]
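With these replacements in place, no column should be left with missing values. A quick optional check:
# Total count of NAs across the combined table should now be zero
sum(is.na(fulldt))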
Once the missing data has been sorted out, we want to ensure that each variable is properly classified.
table(sapply(fulldt, class))
##
## character factor integer numeric
## 43 1 27 10
It seems most of the variables are classified as character rather than factor. Also, a closer look shows that MSSubClass is read in as a number while it should be a factor.
# Collect the names of variables that are character
class.list <- sapply(fulldt, class)
class.list.character <- names(class.list[which(class.list=="character")])
# Convert to factor
fulldt[class.list.character] <- lapply(fulldt[class.list.character], factor)
# Fix MSSubClass class
fulldt$MSSubClass <- factor(fulldt$MSSubClass)
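Re-running the class count from above (a small optional check) should now show only factor, integer and numeric columns:
# Re-check variable classes after the conversion
table(sapply(fulldt, class))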
Now that the data is clean, it’s time to try some feature engineering. The idea here is to create new features that might have stronger predictive power. We won’t remove any of the original variables, so they can be compared to the newly created features.
# Create a "total area" feature by adding the basement area and ground living area
fulldt$TotalArea <- fulldt$GrLivArea + fulldt$TotalBsmtSF
# Create a "total number of baths" feature by adding all bathroom features
fulldt$TotalBaths <- fulldt$BsmtFullBath +
fulldt$BsmtHalfBath +
fulldt$FullBath +
fulldt$HalfBath
# Create a "area aboveground" feature by adding the areas of the first and second floor
fulldt$AreaAbvground <- fulldt$`1stFlrSF` + fulldt$`2ndFlrSF`
It’s time to focus on our machine learning algorithm. And the first step might be to pick which variables, of the vast number that we have, we are going to use.
To do this we will take two different approaches. The first will focus solely on the numerical variables (the ones classified as “integer” and “numeric”).
# Subset numerical variables that are from the "train" set
fulldt.num.train <- fulldt %>% filter(Set == "Train") %>%
select(which(sapply(.,is.integer)), which(sapply(., is.numeric))) %>%
mutate(SalePrice = raw.train$SalePrice) #Add the "SalePrice" variable
Now we can use this new table to check the correlation among its variables. We want to pick the variables that are most strongly correlated with “SalePrice”, because they are the ones with the strongest predictive power.
correlation <- round(cor(fulldt.num.train),2)
corrplot(correlation, method = "circle")
We can already see that some pairs are positively correlated (dark blue), while negative correlations are harder to find. Also, it seems that the newly created features are performing well. But are they performing any better than their original features? We will have to take a closer look to be sure.
# Set a table with "SalePrice" correlation
x <- data.frame(Variables = rownames(correlation),
Cor = correlation[, "SalePrice"])
# Order it by correlation
x <- x[order(x$Cor, decreasing = T),]
# Pick only values that have strong positive and negative correlation
x <- x[which(x$Cor > 0.5 | x$Cor < -0.5),]
rownames(x) <- c()
kable(x, "html") %>%
kable_styling(full_width = F)
| Variables | Cor |
|---|---|
| SalePrice | 1.00 |
| OverallQual | 0.79 |
| TotalArea | 0.78 |
| AreaAbvground | 0.72 |
| GrLivArea | 0.71 |
| GarageCars | 0.64 |
| GarageArea | 0.62 |
| 1stFlrSF | 0.61 |
| TotalBsmtSF | 0.61 |
| TotalBaths | 0.61 |
| FullBath | 0.56 |
| TotRmsAbvGrd | 0.53 |
| YearBuilt | 0.52 |
| YearRemodAdd | 0.51 |
| GarageYrBlt | 0.51 |
Our new features really outperformed the original ones! We will keep them and discard the originals. We also have to choose between “GarageCars” and “GarageArea” since both represent the same thing. The final feature list should look like this:
kable(x[c(2,3,4,7,10,13),], "html") %>%
kable_styling(full_width = F)
|  | Variables | Cor |
|---|---|---|
| 2 | OverallQual | 0.79 |
| 3 | TotalArea | 0.78 |
| 4 | AreaAbvground | 0.72 |
| 7 | GarageArea | 0.62 |
| 10 | TotalBaths | 0.61 |
| 13 | YearBuilt | 0.52 |
The second approach to picking features will be applied to the factor variables.
We will use a random forest algorithm to rank each variable by importance. Then we choose the most important features and add them to our final model.
# Subset factor variables from the "train" and "test" sets
fulldt.fac.train <- fulldt %>% filter(Set == "Train") %>%
select(Id, which(sapply(., is.factor))) %>%
mutate(SalePrice = raw.train$SalePrice) # Add SalePrice variable
fulldt.fac.test <- fulldt %>% filter(Set == "Test") %>%
select(Id, which(sapply(., is.factor)))
# Run RF algorithm with all factor variables
rf <- randomForest(SalePrice ~ ., data = fulldt.fac.train, importance = T)
# Create Table with importance values
importance.table <- data.frame(Names = rownames(importance(rf)), '%IncMSE' = importance(rf)[,1])
# Order table
importance.table <- importance.table[order(importance.table[,2], decreasing = T),]
rownames(importance.table) <- c()
# Subset first 10 values
kable(importance.table[1:10,], "html") %>%
kable_styling(full_width = F)
| Names | %IncMSE |
|---|---|
| Neighborhood | 42.76906 |
| MSSubClass | 36.16564 |
| FireplaceQu | 27.85005 |
| KitchenQual | 22.30418 |
| ExterQual | 21.81155 |
| BsmtQual | 18.64066 |
| GarageType | 15.57402 |
| BsmtFinType1 | 15.46062 |
| MSZoning | 14.99804 |
| HouseStyle | 13.94404 |
Interesting. As expected, the neighborhood and quality features show high predictive power for SalePrice. What might be surprising is the “Rating of basement finished area” (BsmtFinType1).
On another note, as “HouseStyle” and “BldgType” describe the same thing, we will discard the latter.
Now that the features have been selected, let’s take a closer look at the “SalePrice” variable.
As we can see from the graph, the variable seems to be positively skewed and is not normally distributed. It also seems to show some heteroscedasticity, with the spread increasing as the values grow:
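A minimal sketch of how these checks could be plotted with the packages already loaded (the bin count and the use of GrLivArea for the spread check are illustrative choices, not the original figures):
# Distribution of SalePrice: the long right tail shows the positive skew
ggplot(fulldt.num.train, aes(x = SalePrice)) +
  geom_histogram(bins = 50)
# Normality check: points would follow the line if SalePrice were normal
qqnorm(fulldt.num.train$SalePrice)
qqline(fulldt.num.train$SalePrice)
# Spread of SalePrice widening as living area grows hints at heteroscedasticity
ggplot(fulldt.num.train, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.4)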
Luckily, all of this can be resolved by applying a log transformation to the “SalePrice” variable. This is key because normality, homoscedasticity and linearity (which will be checked soon) are critical assumptions that enable the use of multivariate techniques.
fulldt.num.train$SalePrice <- log(fulldt.num.train$SalePrice)
fulldt.fac.train$SalePrice <- log(fulldt.fac.train$SalePrice)
Let’s see how our chosen features relate to our target variable, “SalePrice”. Starting with the numerical ones:
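One way these plots could be produced (a sketch looping over the numeric features chosen earlier; the layout of the original figure is not reproduced):
# Scatter plots of each selected numeric feature against Log(SalePrice)
num.features <- c("OverallQual", "TotalArea", "AreaAbvground", "GarageArea", "TotalBaths", "YearBuilt")
for (v in num.features) {
  print(
    ggplot(fulldt.num.train, aes_string(x = v, y = "SalePrice")) +
      geom_point(alpha = 0.4) +
      geom_smooth(method = "lm") +
      labs(y = "Log(SalePrice)")
  )
}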
Linearity seems to hold true. Also, the heteroscedasticity is resolved. As expected, all features have a positive correlation with SalePrice. We can also notice some outliers, especially in the “Total Home Area” and “Area above ground” graphs, but we won’t mind them in this exercise.
Let’s take a look at our factor variables:
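These comparisons could be drawn in a similar way (again a sketch), with box plots of Log(SalePrice) per level of each selected factor:
# Box plots of Log(SalePrice) for each selected factor feature
fac.features <- c("Neighborhood", "MSSubClass", "FireplaceQu", "ExterQual", "KitchenQual", "BsmtQual", "HouseStyle")
for (v in fac.features) {
  print(
    ggplot(fulldt.fac.train, aes_string(x = v, y = "SalePrice")) +
      geom_boxplot() +
      labs(y = "Log(SalePrice)") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  )
}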
Notice that all y-axes correspond to Log(SalePrice) values. We can already see the effect of the log transformation in the graphs. It is also important to notice the impact that the neighborhood has on SalePrice, which is as expected.
Once all features have been selected, it is time to bring them together and use a random forest algorithm to predict the test results.
# Subset the train rows and selected features
dt.train <- fulldt %>% filter(Set == "Train") %>%
select("Id", "OverallQual", "TotalArea", "AreaAbvground", "GarageArea", "TotalBaths", "YearBuilt",
"Neighborhood", "MSSubClass", "FireplaceQu", "ExterQual", "KitchenQual", "BsmtQual", "HouseStyle") %>%
mutate(SalePrice = log(raw.train$SalePrice)) # Don't forget to do the log transformation
# Same for the test features
dt.test <- fulldt %>% filter(Set == "Test") %>%
select("Id", "OverallQual", "TotalArea", "AreaAbvground", "GarageArea", "TotalBaths", "YearBuilt",
"Neighborhood", "MSSubClass", "FireplaceQu", "ExterQual", "KitchenQual", "BsmtQual", "HouseStyle")
# Random Forest model
fit <- randomForest(SalePrice ~ ., data = dt.train, importance = T)
# Use new model to predict SalePrice values from the test set
pred <- exp(predict(fit , newdata = dt.test))
# Export Result
write.csv(x = data.frame(Id = raw.test$Id, SalePrice = pred), row.names = F, file = "./submission.csv")
After submitting the exported file, we discover that our RMSE is 0.15019.
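For reference, a rough local estimate of this score could have been obtained before submitting by holding out part of the training data (a sketch using caret’s createDataPartition, already loaded above; the 80/20 split and the seed are arbitrary choices):
# Hold-out validation: fit on 80% of the training rows, score on the remaining 20%
set.seed(1)
in.train <- createDataPartition(dt.train$SalePrice, p = 0.8, list = FALSE)
fit.val <- randomForest(SalePrice ~ ., data = dt.train[in.train, ], importance = T)
pred.val <- predict(fit.val, newdata = dt.train[-in.train, ])
# Kaggle scores RMSE on log(SalePrice), which dt.train$SalePrice already is
sqrt(mean((pred.val - dt.train$SalePrice[-in.train])^2))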
In this House Price Kaggle challenge I tried to focus on feature selection in order to deliver the best accuracy possible. Tools like the log transformation and some feature engineering also came in handy for the final model.