This dataset was initially compiled together by Dr. Dean De Cock at Truman State University. It consists of 79 different variables for almost 3000 different houses sold in Ames, Iowa between 2006-2010. The feature variables for each house observation include quantitative and qualitative(nominal and ordinal) values. We will be using these variables to predict the SalePrice of the given house using Regression techniques.

In this analysis we will be using Mice (Multivariate Imputation by Chained Equations) in order to fill in the various NA values before running any models. Then we will use moments in order to calculate the skewness of particular variables. Finally, we will utilize GBM (Generalized Boosted Regression Models) in order to create a statistical model to predict the SalePrice

library(gbm)
library(mice)
library(moments)
library(ggplot2)

train <- read.csv("/Users/lancefernando/Desktop/DataMining/RProjects/IowaHousing/Data/train.csv", header = TRUE)
test <- read.csv("/Users/lancefernando/Desktop/DataMining/RProjects/IowaHousing/Data/test.csv", header = TRUE)
head(train)

1 Data Preprocessing

Before creating any plots or conducting analysis, it is important to understand the dataset you are working with and what the variables mean.

I’ve created a quick function that will check for the amount of NA values and their proportion in each of our datasets. Lets call it and analyze before any preprocessing.

checkNAplot <- function(tr, te){
  
  library(ggplot2)
  vars <- ""
  perc <- 0
  type <- ""

  for(i in 1: ncol(te)){
    if(anyNA(tr[,i])){
      vars <- c(vars, names(tr)[i])
      type <- c(type, "train")
      perc <- c(perc, length(which(is.na(tr[,i]))) / (nrow(tr) + nrow(te)))
      
    }
    if(anyNA(te[,i])){
      vars <- c(vars, names(te)[i])
      type <- c(type, "test")
      perc <- c(perc, length(which(is.na(te[,i]))) / (nrow(tr) + nrow(te)))
      
    }
  }
  
  if(length(vars) > 1){
    
    vars <- vars[-1]
    type <- type[-1]
    perc <- perc[-1]
    
    vars <- factor(vars)
    
    naData <- data.frame(variables = vars, type = type, percentage = perc)
    naData$variables <- factor(naData$variables, 
                               levels = naData$variables[order(naData$perc)])
    plot <- ggplot(naData,
        aes(x=variables, y=percentage, 
            fill = type)) + geom_bar(stat = "identity") + xlab("Variable") + ylab("Percent missing") + ggtitle("Percentage of Missing Values in Training/Test Sets") + coord_flip() + geom_hline(yintercept = 0.05)
    print("Checking NA process completed.")
    return(plot)
  }
  else
    print("Checking NA process completed.")
    
}
checkNAplot(train,test)

## [1] "Checking NA process completed."

Now lets get to the preprocessing!

Looking through the ‘data_description.txt’ file provided, some variables have string values that have an ordering (i.e, ExterQual : ‘poor’, ‘fair’,…‘excellent’). For these variables, lets change them to ordinal values starting from 1 up. For variables like PoolQC that are ordinal and contain missing (NA) values, we will encode a 0 since to indicate there is no pool. For other variables like Alley that are categorical strings, we will encode a “No” to indicate there is no alley. Here are some examples of how this will be done. The rest will all be behind the scenes since the code is a bit lengthy.

train$LotShape <- as.character(train$LotShape)
test$LotShape <- as.character(test$LotShape)
train$LotShape[which(train$LotShape == "IR3")] <- "1"
train$LotShape[which(train$LotShape == "IR2")] <- "2"
train$LotShape[which(train$LotShape == "IR1")] <- "3"
train$LotShape[which(train$LotShape == "Reg")] <- "4"
test$LotShape[which(test$LotShape == "IR3")] <- "1"
test$LotShape[which(test$LotShape == "IR2")] <- "2"
test$LotShape[which(test$LotShape == "IR1")] <- "3"
test$LotShape[which(test$LotShape == "Reg")] <- "4"
train$LotShape <- as.numeric(train$LotShape)
test$LotShape <- as.numeric(test$LotShape)

Take a look at the new output from the checkNA() function! So much different right?

## [1] "Checking NA process completed."

Now some numeric variables must be converted into factor form since they must be characterized as categorical. In addition, the creator of the dataset suggests we should drop observations whose GrLivArea is greater than 4000.

train$MSSubClass <- factor(train$MSSubClass)
test$MSSubClass <- factor(test$MSSubClass)

train$YrSold <- factor(train$YrSold)
test$YrSold <- factor(test$YrSold)

train$MoSold <- factor(train$MoSold)
test$MoSold <- factor(test$MoSold)

train <- train[-which(train$GrLivArea > 4000),]

2 Data Imputation Using MICE

Before imputing values, look again at the output of the checkNA() function. I explicitly provided the percentage of missing values proportional to the number of observations because this is important in imputing values. There are two types of missing values: MCAR and MNAR.

MCAR (Missing Completely At Random)
MNAR(Missing Not At Random)

The values that we fixed above were considered MNAR and we fixed that! Now there are a few values that are MNAR and we will fix those now. A good threshold to go by is to only impute values where the number of missing values is under 5%. Therefore we are going to drop LotFrontage. Lets go ahead and drop Id and Utilities since Id will not effect pricing and almost all values fall under ‘AllPub’ in Utilities. Check out the table to see how many values fall under ‘AllPub’ in Utilities

table(train$Utilities)

## 
## AllPub NoSeWa 
##   1455      1

drop <- c("LotFrontage", "Id", "Utilities")
train <- train[,!names(train) %in% drop]
test <- test[,!names(test) %in% drop]

Now lets get into the imputation! We will be using the simple mice() function to impute values. The parameters used are as follows:

m : number of different datasets to produce with new imputed values
method : the type of method used to predict missing values
maxit : number of iterations
seed : to replicate process
printFlag : prints process to console

We will be using randomForests (‘rf’) and 5 iterations to impute the missing values.

In order for mice to use randomForests on numeric ordinal variables with less than 5 levels like BsmtFullBath, we must convert it first to a factor.

After creating the datasets, we will rename our train and testing set to new.Train and new.Test respectively. Then we convert those factor variables back to numeric. Take a look at the output of checkNA() now.

Now that we no longer have anymore missing values we’re good to go!

impute.train <- mice(data = train,
                     m = 1,
                     method = "rf",
                     maxit = 1,
                     seed = 1,
                     printFlag = FALSE)

impute.test <- mice(data = test,
                    m = 1,
                    method = "rf",
                    maxit = 1,
                    seed = 1,
                    printFlag = FALSE)

new.Train <- complete(impute.train, 1)

new.Test <- complete(impute.test, 1)

checkNAplot(new.Train, new.Test)

## [1] "Checking NA process completed."

3 Data Transformation

When working with skewed data in the supervised regression setting it is necessary to conduct feature transformation to improve interpretability in analysis and linearity in the model. We will be conducting Log Transformations on the skewed independent variables.

First lets calculate the skewness of each variable.

A variable is considered ‘highly skewed’ if its absolute value is greater than 1.
A variable is considered ‘moderately’ if its absolute value is greater than 0.5.

We will only transform variables that are ‘moderate’ to ‘highly’ skewed. We will also only transform non-categorical variables.

skewedVars <- NA

for(i in names(new.Train)){
  if(is.numeric(new.Train[,i])){
    if(i != "SalePrice"){
      if(length(levels(as.factor(new.Train[,i]))) > 10){
        # Enters this block if variable is non-categorical
        skewVal <- skewness(new.Train[,i])
        print(paste(i, skewVal, sep = ": "))
        if(abs(skewVal) > 0.5){
          skewedVars <- c(skewedVars, i)
        }
      }
    }
  }
}

## [1] "LotArea: 12.5745898057302"
## [1] "YearBuilt: -0.609457810078801"
## [1] "YearRemodAdd: -0.499315878757071"
## [1] "MasVnrArea: 2.65433255666407"
## [1] "BsmtFinSF1: 0.744087901070892"
## [1] "BsmtFinSF2: 4.24420866908465"
## [1] "BsmtUnfSF: 0.920808873459641"
## [1] "TotalBsmtSF: 0.485894099381748"
## [1] "X1stFlrSF: 0.866187325506939"
## [1] "X2ndFlrSF: 0.777064755287085"
## [1] "LowQualFinSF: 8.98929107000649"
## [1] "GrLivArea: 0.834331710586013"
## [1] "TotRmsAbvGrd: 0.660734867252366"
## [1] "GarageYrBlt: -3.85925042349603"
## [1] "GarageArea: 0.132854112343635"
## [1] "WoodDeckSF: 1.54967200222241"
## [1] "OpenPorchSF: 2.33743492707325"
## [1] "EnclosedPorch: 3.08127508380921"
## [1] "X3SsnPorch: 10.2792617974462"
## [1] "ScreenPorch: 4.11139989138831"
## [1] "MiscVal: 24.4181751077856"

Check out the plots of variables before and after transformation.

Now lets perform transformations. We will also be creating new variables that indicate whether or not the value was greater than 0. We do this because having a value of 0 may prove to be significant.

In addition, we perform “log(1+new.Train[,i])” because log(0) = -inf.

Our new training and testing sets will be log.train and log.test respectively.

log.train <- new.Train
log.test <- new.Test

for(i in skewedVars){
  if(0 %in% new.Train[, i]){
    log.train[,i] <- log(1+new.Train[,i])
    log.test[,i] <- log(1+new.Test[,i])
  }
  else{
    log.train[,i] <- log(new.Train[,i])
    log.test[,i] <- log(new.Test[,i])
  }
}

for(i in skewedVars){
  if(0 %in% new.Train[, i]){
    dummyVector <- ifelse(log.train[,i] > 0, 1, 0)
    log.train <- data.frame(log.train, factor(dummyVector))
    colnames(log.train)[ncol(log.train)] <- paste(i, "ZERO", sep="")

    dummyVector <- ifelse(log.test[,i] > 0, 1, 0)
    log.test <- data.frame(log.test, factor(dummyVector))
    colnames(log.test)[ncol(log.test)] <- paste(i, "ZERO", sep="")
  }
}

4 Data Modeling

We have made it to the modeling stage!

We will utilize Generalized Boosted Modeling. The parameters involved are as follows.

distribution : The type of method for distribution
- “gaussian” is used for regression modeling
- “bernoulli” is used for binary categorical modeling
- “multinomial” is used for multi-class categorical modeling etc.
depth : Maximum depth of variable interaction
n.trees : Number of trees to fit
shrinkage : Learning rate applied to each tree in expansion other parameters will work fine as their default values.
- A smaller value makes the model more robust to overfitting.

I have created a method that could easily be altered that will produce the output along with the training RMSE. This will make it easy to tune the parameters.

runBoost <- function(trees, depth, learn, file, trainset, testset){
  set.seed(1)
  boost.price <- gbm(SalePrice~.,
                     data = trainset,
                     distribution = "gaussian",
                     n.trees = trees,
                     interaction.depth = depth,
                     shrinkage = learn)
  
  print(summary(boost.price))
  
  yhat.boost <- predict(boost.price, newdata = testset,
                        n.trees = trees)
  
  rmse.boost <- sqrt(mean((train$SalePrice - 
                             predict(boost.price, newdata = trainset,
                                     n.trees = trees))^2))
  
  print(paste("Training RMSE : ", rmse.boost))
  
  write.csv(data.frame(Id = c(1461:2919), SalePrice = yhat.boost), file, row.names = FALSE)
}

# runBoost(trees = 20000,
#          depth = 10,
#          learn = 0.001,
#          file = "/Users/lancefernando/Desktop/DataMining/RProjects/IowaHousing/Submission42.csv",
#          trainset = log.train,
#          testset = log.test)

5 Recap/Conclusion

By getting familiar with the dataset we are using we can easily fill in many of those variables that were ‘missing not at random’ (MNAR). This in turn gives us a lot more data to work with. Using MICE we were able to impute values that were missing in small amounts (less than 5%). We then transformed skewed variables using log and added dummy variables for those that contained a value of 0. Finally, we fit a GBM to predict the SalePrice using all our variables.

Submitting to the leaderboard we can land a score around the 37% spot. More feature engineering will be necessary to increase our score. With that, I will also re-tune the GBM parameters to consider the Bias-Variance tradeoff.

Predicting Housing Prices in Ames, Iowa

Lance J. Fernando

1 Data Preprocessing

2 Data Imputation Using MICE

3 Data Transformation

4 Data Modeling

5 Recap/Conclusion