1. Introduction

In this post, I will describe how I solved the Housing Prices Competition from Kaggle. As Kaggle says, the data set has 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. The Ames Housing dataset was compiled by Dean De Cock for use in data science education, and it is an excellent alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.

The goal is to predict the sale price of each house. Submissions are evaluated on the root mean squared error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price.
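
To make the metric concrete, here is a minimal sketch of it in R (the function name "rmsle" and the toy prices are mine, for illustration only):

# competition metric: RMSE between the logs of predicted and observed prices
rmsle <- function(predicted, observed) {
  sqrt(mean((log(predicted) - log(observed))^2))
}

# toy example with made-up prices
rmsle(c(200000, 150000), c(210000, 140000))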

Before we start, it is worth saying that I submitted my predictions on March 29th, 2021, and was ranked 2,771st among 6,829 competitors.

2. Get Data

First, let’s clean our R workspace:

# clean everything done before
rm(list=ls())

After downloading the data sets (train and test), we need to load them into the R workspace:

# read the training and testing data sets (comma-separated, so read.csv is the natural choice)
train <- read.csv("./train.csv", stringsAsFactors = TRUE)
test <- read.csv("./test.csv", stringsAsFactors = TRUE)

You should read the Data Description.txt file that Kaggle provides to better understand the variables. We will not walk through it here, because that is beyond the scope of this post. Instead, let's just look at the data structure:

# view data structure
dim(train)
## [1] 1460   81
str(train)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

"Id" is just an identification variable. Let's store it separately and remove it from both the training and testing data sets for now:

# store and remove variable Id
trainId <- train$Id
testId <- test$Id
train$Id <- NULL
test$Id <- NULL

"SalePrice" is the outcome variable. It belongs to the training data set but, obviously, not to the testing data set. Let's do the same with "SalePrice" as we did with "Id": store it separately and remove it from the data set:

# store and remove variable SalePrice
trainSalePrice <- train$SalePrice
train$SalePrice <- NULL

Now we have matching data sets with the same number of columns (n = 79):

# view data dimensions
cbind(c("Training", "Testing"),
        rbind(dim(train), dim(test)))
##      [,1]       [,2]   [,3]
## [1,] "Training" "1460" "79"
## [2,] "Testing"  "1459" "79"

36 variables are integers and 43 are factors:

# count variables by class
sum(sapply(train, is.integer))
## [1] 36
sum(sapply(train, is.factor))
## [1] 43

3. Explore Data

Let's make a brief visual exploration of the data. First, we will plot the frequency distributions of the integer variables:

# histograms of integer variables
library(tidyr); library(ggplot2); library(purrr)
train %>%
  keep(is.numeric) %>%   
  gather() %>%                  
  ggplot(aes(value)) + 
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

As we can see, some variables are right-skewed (e.g., "BsmtFinSF1"), many have outliers (e.g., "X1stFlrSF"), and some have near-zero variance (e.g., "BsmtFinSF2"). Some models require transforming distributions like these: the Box-Cox transformation (or the similar Yeo-Johnson transformation) is indicated for skewness, and the spatial sign transformation is indicated for outliers.
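
If we want a numeric check rather than an eyeball judgment, sample skewness can be computed directly (a quick sketch; it assumes the e1071 package, which caret also uses internally):

# sample skewness of a right-skewed predictor (values well above 0 indicate right skew)
library(e1071)
skewness(train$BsmtFinSF1, na.rm = TRUE)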

As for the factor variables, the plots below show that some of them are highly imbalanced (e.g., "SaleType"). Some models also require converting factor variables to numeric ones, so we will deal with that by transforming all of them into dummy variables.

# frequency distributions of factor variables
library(dplyr)
train %>%
  keep(is.factor) %>%
  gather() %>%
  ggplot(aes(value)) + 
  facet_wrap(~ key, scales = "free") +
  geom_bar()

4. Data Pre-processing

We will run the following pre-processing:

  1. Convert factor to dummy variables;
  2. Inspect and remove variables with lots of NAs, if necessary;
  3. Impute missing values in variables that have NAs and were not excluded;
  4. Remove variables with zero or near zero variance;
  5. Remove highly correlated variables;
  6. Yeo-Johnson transformation: to deal with skewness;
  7. Spatial Sign transformation: to deal with outliers.

Everything we do to the training data set must also be done to the testing data set. There's a lot of work to be done, so let's go.

# convert factor variables into dummy variables
library(caret)
dummies <- dummyVars(~ ., data = train)

# apply transformation to the training data set
trainDummy <- as.data.frame(predict(dummies, newdata = train))

# view dimensions
cbind(c("Training with Factor Variables","Training with Dummy Variables"),
      rbind(dim(train), dim(trainDummy)))
##      [,1]                             [,2]   [,3] 
## [1,] "Training with Factor Variables" "1460" "79" 
## [2,] "Training with Dummy Variables"  "1460" "287"

As there were many factor variables, the number of columns jumped from 79 to 287. Let's do the same on the testing data set. In this case, however, we will not fit the transformation again: we will apply the transformation already fitted on the training data set (the "dummies" object we created) to the testing data set. This is very important, because the testing data set must not be used to make preprocessing decisions. In every step above (1 to 7), we will fit transformations on the training data set and only apply them to the testing data set.

# apply the transformation fitted on the training data set to the testing data set
testDummy <- as.data.frame(predict(dummies, newdata = test))

# view dimensions
cbind(c("Testing with Factor Variables","Testing with Dummy Variables"),
      rbind(dim(test), dim(testDummy)))
##      [,1]                            [,2]   [,3] 
## [1,] "Testing with Factor Variables" "1459" "79" 
## [2,] "Testing with Dummy Variables"  "1459" "287"

After creating the dummy variables, some columns have names that could cause trouble later (e.g., names containing spaces or special characters). We need to rename them:

# return column names
colnames(trainDummy)
##   [1] "MSZoning.C (all)"      "MSZoning.FV"           "MSZoning.RH"          
##   [4] "MSZoning.RL"           "MSZoning.RM"           "LotFrontage"          
##   [7] "LotArea"               "Street.Grvl"           "Street.Pave"          
##  [10] "Alley.Grvl"            "Alley.Pave"            "LotShape.IR1"         
##  [13] "LotShape.IR2"          "LotShape.IR3"          "LotShape.Reg"         
##  [16] "LandContour.Bnk"       "LandContour.HLS"       "LandContour.Low"      
##  [19] "LandContour.Lvl"       "Utilities.AllPub"      "Utilities.NoSeWa"     
##  [22] "LotConfig.Corner"      "LotConfig.CulDSac"     "LotConfig.FR2"        
##  [25] "LotConfig.FR3"         "LotConfig.Inside"      "LandSlope.Gtl"        
##  [28] "LandSlope.Mod"         "LandSlope.Sev"         "Neighborhood.Blmngtn" 
##  [31] "Neighborhood.Blueste"  "Neighborhood.BrDale"   "Neighborhood.BrkSide" 
##  [34] "Neighborhood.ClearCr"  "Neighborhood.CollgCr"  "Neighborhood.Crawfor" 
##  [37] "Neighborhood.Edwards"  "Neighborhood.Gilbert"  "Neighborhood.IDOTRR"  
##  [40] "Neighborhood.MeadowV"  "Neighborhood.Mitchel"  "Neighborhood.NAmes"   
##  [43] "Neighborhood.NoRidge"  "Neighborhood.NPkVill"  "Neighborhood.NridgHt" 
##  [46] "Neighborhood.NWAmes"   "Neighborhood.OldTown"  "Neighborhood.Sawyer"  
##  [49] "Neighborhood.SawyerW"  "Neighborhood.Somerst"  "Neighborhood.StoneBr" 
##  [52] "Neighborhood.SWISU"    "Neighborhood.Timber"   "Neighborhood.Veenker" 
##  [55] "Condition1.Artery"     "Condition1.Feedr"      "Condition1.Norm"      
##  [58] "Condition1.PosA"       "Condition1.PosN"       "Condition1.RRAe"      
##  [61] "Condition1.RRAn"       "Condition1.RRNe"       "Condition1.RRNn"      
##  [64] "Condition2.Artery"     "Condition2.Feedr"      "Condition2.Norm"      
##  [67] "Condition2.PosA"       "Condition2.PosN"       "Condition2.RRAe"      
##  [70] "Condition2.RRAn"       "Condition2.RRNn"       "BldgType.1Fam"        
##  [73] "BldgType.2fmCon"       "BldgType.Duplex"       "BldgType.Twnhs"       
##  [76] "BldgType.TwnhsE"       "HouseStyle.1.5Fin"     "HouseStyle.1.5Unf"    
##  [79] "HouseStyle.1Story"     "HouseStyle.2.5Fin"     "HouseStyle.2.5Unf"    
##  [82] "HouseStyle.2Story"     "HouseStyle.SFoyer"     "HouseStyle.SLvl"      
##  [85] "OverallQual"           "OverallCond"           "YearBuilt"            
##  [88] "YearRemodAdd"          "RoofStyle.Flat"        "RoofStyle.Gable"      
##  [91] "RoofStyle.Gambrel"     "RoofStyle.Hip"         "RoofStyle.Mansard"    
##  [94] "RoofStyle.Shed"        "RoofMatl.ClyTile"      "RoofMatl.CompShg"     
##  [97] "RoofMatl.Membran"      "RoofMatl.Metal"        "RoofMatl.Roll"        
## [100] "RoofMatl.Tar&Grv"      "RoofMatl.WdShake"      "RoofMatl.WdShngl"     
## [103] "Exterior1st.AsbShng"   "Exterior1st.AsphShn"   "Exterior1st.BrkComm"  
## [106] "Exterior1st.BrkFace"   "Exterior1st.CBlock"    "Exterior1st.CemntBd"  
## [109] "Exterior1st.HdBoard"   "Exterior1st.ImStucc"   "Exterior1st.MetalSd"  
## [112] "Exterior1st.Plywood"   "Exterior1st.Stone"     "Exterior1st.Stucco"   
## [115] "Exterior1st.VinylSd"   "Exterior1st.Wd Sdng"   "Exterior1st.WdShing"  
## [118] "Exterior2nd.AsbShng"   "Exterior2nd.AsphShn"   "Exterior2nd.Brk Cmn"  
## [121] "Exterior2nd.BrkFace"   "Exterior2nd.CBlock"    "Exterior2nd.CmentBd"  
## [124] "Exterior2nd.HdBoard"   "Exterior2nd.ImStucc"   "Exterior2nd.MetalSd"  
## [127] "Exterior2nd.Other"     "Exterior2nd.Plywood"   "Exterior2nd.Stone"    
## [130] "Exterior2nd.Stucco"    "Exterior2nd.VinylSd"   "Exterior2nd.Wd Sdng"  
## [133] "Exterior2nd.Wd Shng"   "MasVnrType.BrkCmn"     "MasVnrType.BrkFace"   
## [136] "MasVnrType.None"       "MasVnrType.Stone"      "MasVnrArea"           
## [139] "ExterQual.Ex"          "ExterQual.Fa"          "ExterQual.Gd"         
## [142] "ExterQual.TA"          "ExterCond.Ex"          "ExterCond.Fa"         
## [145] "ExterCond.Gd"          "ExterCond.Po"          "ExterCond.TA"         
## [148] "Foundation.BrkTil"     "Foundation.CBlock"     "Foundation.PConc"     
## [151] "Foundation.Slab"       "Foundation.Stone"      "Foundation.Wood"      
## [154] "BsmtQual.Ex"           "BsmtQual.Fa"           "BsmtQual.Gd"          
## [157] "BsmtQual.TA"           "BsmtCond.Fa"           "BsmtCond.Gd"          
## [160] "BsmtCond.Po"           "BsmtCond.TA"           "BsmtExposure.Av"      
## [163] "BsmtExposure.Gd"       "BsmtExposure.Mn"       "BsmtExposure.No"      
## [166] "BsmtFinType1.ALQ"      "BsmtFinType1.BLQ"      "BsmtFinType1.GLQ"     
## [169] "BsmtFinType1.LwQ"      "BsmtFinType1.Rec"      "BsmtFinType1.Unf"     
## [172] "BsmtFinSF1"            "BsmtFinType2.ALQ"      "BsmtFinType2.BLQ"     
## [175] "BsmtFinType2.GLQ"      "BsmtFinType2.LwQ"      "BsmtFinType2.Rec"     
## [178] "BsmtFinType2.Unf"      "BsmtFinSF2"            "BsmtUnfSF"            
## [181] "TotalBsmtSF"           "Heating.Floor"         "Heating.GasA"         
## [184] "Heating.GasW"          "Heating.Grav"          "Heating.OthW"         
## [187] "Heating.Wall"          "HeatingQC.Ex"          "HeatingQC.Fa"         
## [190] "HeatingQC.Gd"          "HeatingQC.Po"          "HeatingQC.TA"         
## [193] "CentralAir.N"          "CentralAir.Y"          "Electrical.FuseA"     
## [196] "Electrical.FuseF"      "Electrical.FuseP"      "Electrical.Mix"       
## [199] "Electrical.SBrkr"      "X1stFlrSF"             "X2ndFlrSF"            
## [202] "LowQualFinSF"          "GrLivArea"             "BsmtFullBath"         
## [205] "BsmtHalfBath"          "FullBath"              "HalfBath"             
## [208] "BedroomAbvGr"          "KitchenAbvGr"          "KitchenQual.Ex"       
## [211] "KitchenQual.Fa"        "KitchenQual.Gd"        "KitchenQual.TA"       
## [214] "TotRmsAbvGrd"          "Functional.Maj1"       "Functional.Maj2"      
## [217] "Functional.Min1"       "Functional.Min2"       "Functional.Mod"       
## [220] "Functional.Sev"        "Functional.Typ"        "Fireplaces"           
## [223] "FireplaceQu.Ex"        "FireplaceQu.Fa"        "FireplaceQu.Gd"       
## [226] "FireplaceQu.Po"        "FireplaceQu.TA"        "GarageType.2Types"    
## [229] "GarageType.Attchd"     "GarageType.Basment"    "GarageType.BuiltIn"   
## [232] "GarageType.CarPort"    "GarageType.Detchd"     "GarageYrBlt"          
## [235] "GarageFinish.Fin"      "GarageFinish.RFn"      "GarageFinish.Unf"     
## [238] "GarageCars"            "GarageArea"            "GarageQual.Ex"        
## [241] "GarageQual.Fa"         "GarageQual.Gd"         "GarageQual.Po"        
## [244] "GarageQual.TA"         "GarageCond.Ex"         "GarageCond.Fa"        
## [247] "GarageCond.Gd"         "GarageCond.Po"         "GarageCond.TA"        
## [250] "PavedDrive.N"          "PavedDrive.P"          "PavedDrive.Y"         
## [253] "WoodDeckSF"            "OpenPorchSF"           "EnclosedPorch"        
## [256] "X3SsnPorch"            "ScreenPorch"           "PoolArea"             
## [259] "PoolQC.Ex"             "PoolQC.Fa"             "PoolQC.Gd"            
## [262] "Fence.GdPrv"           "Fence.GdWo"            "Fence.MnPrv"          
## [265] "Fence.MnWw"            "MiscFeature.Gar2"      "MiscFeature.Othr"     
## [268] "MiscFeature.Shed"      "MiscFeature.TenC"      "MiscVal"              
## [271] "MoSold"                "YrSold"                "SaleType.COD"         
## [274] "SaleType.Con"          "SaleType.ConLD"        "SaleType.ConLI"       
## [277] "SaleType.ConLw"        "SaleType.CWD"          "SaleType.New"         
## [280] "SaleType.Oth"          "SaleType.WD"           "SaleCondition.Abnorml"
## [283] "SaleCondition.AdjLand" "SaleCondition.Alloca"  "SaleCondition.Family" 
## [286] "SaleCondition.Normal"  "SaleCondition.Partial"
# change column names from the training data set
names(trainDummy)[names(trainDummy) == "MSZoning.C (all)"] <- "MSZoning.C"
names(trainDummy)[names(trainDummy) == "Exterior1st.Wd Sdng"] <- "Exterior1st.WdSdng"
names(trainDummy)[names(trainDummy) == "Exterior2nd.Wd Sdng"] <- "Exterior2nd.WdSdng"
names(trainDummy)[names(trainDummy) == "Exterior2nd.Brk Cmn"] <- "Exterior2nd.BrkComm"
names(trainDummy)[names(trainDummy) == "RoofMatl.Tar&Grv"] <- "RoofMatl.Tar.Grv"
names(trainDummy)[names(trainDummy) == "Exterior2nd.Wd Shng"] <- "Exterior2nd.WdShing"

# change column names from the testing data set
names(testDummy)[names(testDummy) == "MSZoning.C (all)"] <- "MSZoning.C"
names(testDummy)[names(testDummy) == "Exterior1st.Wd Sdng"] <- "Exterior1st.WdSdng"
names(testDummy)[names(testDummy) == "Exterior2nd.Wd Sdng"] <- "Exterior2nd.WdSdng"
names(testDummy)[names(testDummy) == "Exterior2nd.Brk Cmn"] <- "Exterior2nd.BrkComm"
names(testDummy)[names(testDummy) == "RoofMatl.Tar&Grv"] <- "RoofMatl.Tar.Grv"
names(testDummy)[names(testDummy) == "Exterior2nd.Wd Shng"] <- "Exterior2nd.WdShing"
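
For the record, the same cleanup could be done in one pass with base R's make.names(), though it produces slightly different names (e.g., "MSZoning.C..all." instead of "MSZoning.C"). A sketch of that alternative, not what was run here:

# alternative: sanitize every column name at once (not run in this analysis)
names(trainDummy) <- make.names(names(trainDummy))
names(testDummy) <- make.names(names(testDummy))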

Now we will inspect missing values:

# variables with high percentage of missing values
library(naniar)
trainDummy %>%
  miss_var_summary() %>%
  arrange(desc(pct_miss)) %>%
  filter(pct_miss > 20)
## # A tibble: 18 x 3
##    variable         n_miss pct_miss
##    <chr>             <int>    <dbl>
##  1 PoolQC.Ex          1453     99.5
##  2 PoolQC.Fa          1453     99.5
##  3 PoolQC.Gd          1453     99.5
##  4 MiscFeature.Gar2   1406     96.3
##  5 MiscFeature.Othr   1406     96.3
##  6 MiscFeature.Shed   1406     96.3
##  7 MiscFeature.TenC   1406     96.3
##  8 Alley.Grvl         1369     93.8
##  9 Alley.Pave         1369     93.8
## 10 Fence.GdPrv        1179     80.8
## 11 Fence.GdWo         1179     80.8
## 12 Fence.MnPrv        1179     80.8
## 13 Fence.MnWw         1179     80.8
## 14 FireplaceQu.Ex      690     47.3
## 15 FireplaceQu.Fa      690     47.3
## 16 FireplaceQu.Gd      690     47.3
## 17 FireplaceQu.Po      690     47.3
## 18 FireplaceQu.TA      690     47.3
# variables with low percentage of missing values
trainDummy %>%
  miss_var_summary() %>%
  arrange(desc(pct_miss)) %>%
  filter(pct_miss < 5)
## # A tibble: 248 x 3
##    variable         n_miss pct_miss
##    <chr>             <int>    <dbl>
##  1 BsmtExposure.Av      38     2.60
##  2 BsmtExposure.Gd      38     2.60
##  3 BsmtExposure.Mn      38     2.60
##  4 BsmtExposure.No      38     2.60
##  5 BsmtFinType2.ALQ     38     2.60
##  6 BsmtFinType2.BLQ     38     2.60
##  7 BsmtFinType2.GLQ     38     2.60
##  8 BsmtFinType2.LwQ     38     2.60
##  9 BsmtFinType2.Rec     38     2.60
## 10 BsmtFinType2.Unf     38     2.60
## # ... with 238 more rows

18 variables have more than 40% missing values, which is too much to impute. We will simply remove them from both data sets:

# remove variables with high NAs from the training data set
trainNoHighNA <- trainDummy
trainNoHighNA$PoolQC.Ex <- NULL
trainNoHighNA$PoolQC.Fa <- NULL
trainNoHighNA$PoolQC.Gd <- NULL
trainNoHighNA$MiscFeature.Gar2 <- NULL
trainNoHighNA$MiscFeature.Othr <- NULL
trainNoHighNA$MiscFeature.Shed <- NULL
trainNoHighNA$MiscFeature.TenC <- NULL
trainNoHighNA$Alley.Grvl <- NULL
trainNoHighNA$Alley.Pave <- NULL
trainNoHighNA$Fence.GdPrv <- NULL
trainNoHighNA$Fence.GdWo <- NULL
trainNoHighNA$Fence.MnPrv <- NULL
trainNoHighNA$Fence.MnWw <- NULL
trainNoHighNA$FireplaceQu.Ex <- NULL
trainNoHighNA$FireplaceQu.Fa <- NULL
trainNoHighNA$FireplaceQu.Gd <- NULL
trainNoHighNA$FireplaceQu.Po <- NULL
trainNoHighNA$FireplaceQu.TA <- NULL

# view dimensions
cbind(c("Training with High NA Variables","Training without High NA Variables"),
      rbind(dim(trainDummy), dim(trainNoHighNA)))
##      [,1]                                 [,2]   [,3] 
## [1,] "Training with High NA Variables"    "1460" "287"
## [2,] "Training without High NA Variables" "1460" "269"
# remove variables with high NAs from the testing data set
testNoHighNA <- testDummy
testNoHighNA$PoolQC.Ex <- NULL
testNoHighNA$PoolQC.Fa <- NULL
testNoHighNA$PoolQC.Gd <- NULL
testNoHighNA$MiscFeature.Gar2 <- NULL
testNoHighNA$MiscFeature.Othr <- NULL
testNoHighNA$MiscFeature.Shed <- NULL
testNoHighNA$MiscFeature.TenC <- NULL
testNoHighNA$Alley.Grvl <- NULL
testNoHighNA$Alley.Pave <- NULL
testNoHighNA$Fence.GdPrv <- NULL
testNoHighNA$Fence.GdWo <- NULL
testNoHighNA$Fence.MnPrv <- NULL
testNoHighNA$Fence.MnWw <- NULL
testNoHighNA$FireplaceQu.Ex <- NULL
testNoHighNA$FireplaceQu.Fa <- NULL
testNoHighNA$FireplaceQu.Gd <- NULL
testNoHighNA$FireplaceQu.Po <- NULL
testNoHighNA$FireplaceQu.TA <- NULL

# view dimensions
cbind(c("Testing with High NA Variables","Testing without High NA Variables"),
      rbind(dim(testDummy), dim(testNoHighNA)))
##      [,1]                                [,2]   [,3] 
## [1,] "Testing with High NA Variables"    "1459" "287"
## [2,] "Testing without High NA Variables" "1459" "269"

As we saw, there are still variables with NAs, but we will not exclude any more of them; instead, we will impute their missing values.

# run imputation
preProcImpute <- preProcess(trainNoHighNA, method="bagImpute")

# apply imputation to the training data set
trainImpute <- predict(preProcImpute, trainNoHighNA)

# apply imputation to the testing data set
testImpute <- predict(preProcImpute, testNoHighNA)
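
As a quick sanity check (a one-line sketch), we can confirm that no missing values remain after imputation:

# sanity check: both calls should return FALSE after imputation
anyNA(trainImpute)
anyNA(testImpute)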

The next step is to remove variables that have zero or near-zero variance:

# remove near zero variance
nzv <- nearZeroVar(trainImpute)
trainNoNzv <- trainImpute[,-nzv]
testNoNzv <- testImpute[,-nzv]

# training dimensions
cbind(c("Training Before NZV","Training After NZV"),
      rbind(dim(trainImpute), dim(trainNoNzv)))
##      [,1]                  [,2]   [,3] 
## [1,] "Training Before NZV" "1460" "269"
## [2,] "Training After NZV"  "1460" "114"
# testing dimensions
cbind(c("Testing Before NZV","Testing After NZV"),
      rbind(dim(testImpute), dim(testNoNzv)))
##      [,1]                 [,2]   [,3] 
## [1,] "Testing Before NZV" "1459" "269"
## [2,] "Testing After NZV"  "1459" "114"

155 variables were removed by this step. Let's continue compressing the data. Now we will remove highly correlated variables:

# create a correlation matrix
trainMatrixCor <- cor(trainNoNzv)

# find highly correlated variables
trainHighlyCor <- findCorrelation(trainMatrixCor, cutoff = .75)

# remove the highly correlated variables
trainLowCor <- trainNoNzv[,-trainHighlyCor]

# training dimensions
cbind(c("Training with High Correlation","Training without High Correlation"),
      rbind(dim(trainNoNzv), dim(trainLowCor)))
##      [,1]                                [,2]   [,3] 
## [1,] "Training with High Correlation"    "1460" "114"
## [2,] "Training without High Correlation" "1460" "87"
# apply removal to the testing data set
testLowCor <- testNoNzv[,-trainHighlyCor]

# testing dimensions
cbind(c("Testing with High Correlation","Testing without High Correlation"),
      rbind(dim(testNoNzv), dim(testLowCor)))
##      [,1]                               [,2]   [,3] 
## [1,] "Testing with High Correlation"    "1459" "114"
## [2,] "Testing without High Correlation" "1459" "87"

27 more variables were removed, leaving us with 87 predictors. Finally, we will run the Yeo-Johnson and spatial sign transformations.

# fit the remaining transformations
preProcNorm <- preProcess(trainLowCor, method=c("YeoJohnson","spatialSign"))

# apply the transformations to the training data set
trainT <- predict(preProcNorm, trainLowCor)

# apply the transformations to the testing data set
testT <- predict(preProcNorm, testLowCor)

5. Modeling

We will fit 3 predictive models: a support vector machine, boosted trees, and a random forest. We will evaluate their performance using resampling, choose the best one, and use it to predict on the testing data set, which is by now also tidy. First, the support vector machine:

# tuning parameters
library(kernlab)
set.seed(231)
sigDist <- sigest(trainSalePrice ~ ., data = trainT, frac = 1)
svmTuneGrid <- data.frame(sigma = as.vector(sigDist)[1], C = 2^(-2:7))

# fit support vector machine (SVM) model
set.seed(1056)
modSVM <- train(x = trainT,
                y = log(trainSalePrice),
                method = "svmRadial",
                tuneGrid = svmTuneGrid,
                trControl = trainControl(method = "boot", number = 50))
modSVM
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1460 samples
##   87 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (50 reps) 
## Summary of sample sizes: 1460, 1460, 1460, 1460, 1460, 1460, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE       
##     0.25  0.1440003  0.8757342  0.09824123
##     0.50  0.1378084  0.8829720  0.09434368
##     1.00  0.1353978  0.8855670  0.09289465
##     2.00  0.1356662  0.8844418  0.09369782
##     4.00  0.1380251  0.8804267  0.09623070
##     8.00  0.1413722  0.8749687  0.09942430
##    16.00  0.1435333  0.8713495  0.10160059
##    32.00  0.1436990  0.8710303  0.10190317
##    64.00  0.1436990  0.8710303  0.10190317
##   128.00  0.1436990  0.8710303  0.10190317
## 
## Tuning parameter 'sigma' was held constant at a value of 0.004774938
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.004774938 and C = 1.

Now let’s fit a Gradient Boosting Machine:

# tuning parameters
gbmGrid <-  expand.grid(interaction.depth = c(1, 5, 9), 
                        n.trees = (1:30)*50, 
                        shrinkage = 0.1,
                        n.minobsinnode = 20)

# fit model
set.seed(1056)
modGBM <- train(x = trainT,
                y = log(trainSalePrice),
                method = "gbm",
                trControl = trainControl(method = "boot", number = 50),
                verbose = FALSE,
                tuneGrid = gbmGrid)
modGBM
## Stochastic Gradient Boosting 
## 
## 1460 samples
##   87 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (50 reps) 
## Summary of sample sizes: 1460, 1460, 1460, 1460, 1460, 1460, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE      
##   1                    50     0.1896843  0.7978743  0.1350812
##   1                   100     0.1628582  0.8360615  0.1171258
##   1                   150     0.1555150  0.8482303  0.1127036
##   1                   200     0.1521498  0.8544884  0.1100794
##   1                   250     0.1500371  0.8584168  0.1082428
##   1                   300     0.1488215  0.8606786  0.1071327
##   1                   350     0.1478866  0.8624235  0.1062296
##   1                   400     0.1472136  0.8636670  0.1055500
##   1                   450     0.1466909  0.8646312  0.1050525
##   1                   500     0.1463869  0.8651802  0.1047382
##   1                   550     0.1461018  0.8657075  0.1044149
##   1                   600     0.1460036  0.8658972  0.1042330
##   1                   650     0.1458046  0.8662765  0.1039911
##   1                   700     0.1457592  0.8663705  0.1038609
##   1                   750     0.1456476  0.8665987  0.1037917
##   1                   800     0.1455989  0.8666782  0.1037428
##   1                   850     0.1456259  0.8666319  0.1036915
##   1                   900     0.1455890  0.8667142  0.1036442
##   1                   950     0.1456278  0.8666805  0.1036869
##   1                  1000     0.1456678  0.8665815  0.1036653
##   1                  1050     0.1456252  0.8666530  0.1036202
##   1                  1100     0.1456041  0.8667292  0.1036199
##   1                  1150     0.1456047  0.8667235  0.1035675
##   1                  1200     0.1456471  0.8666310  0.1035833
##   1                  1250     0.1456240  0.8666771  0.1035384
##   1                  1300     0.1456344  0.8666684  0.1034991
##   1                  1350     0.1456899  0.8665757  0.1035364
##   1                  1400     0.1457388  0.8665122  0.1036052
##   1                  1450     0.1456977  0.8665790  0.1035622
##   1                  1500     0.1457372  0.8665104  0.1035853
##   5                    50     0.1518811  0.8562104  0.1079560
##   5                   100     0.1454783  0.8668424  0.1026264
##   5                   150     0.1443612  0.8689228  0.1015340
##   5                   200     0.1441420  0.8693459  0.1011605
##   5                   250     0.1438411  0.8698813  0.1007959
##   5                   300     0.1438434  0.8699272  0.1007881
##   5                   350     0.1438885  0.8698611  0.1007853
##   5                   400     0.1439400  0.8697737  0.1007965
##   5                   450     0.1440719  0.8695728  0.1009243
##   5                   500     0.1441248  0.8695148  0.1009501
##   5                   550     0.1442402  0.8693273  0.1010626
##   5                   600     0.1443288  0.8691701  0.1011139
##   5                   650     0.1444276  0.8690098  0.1011817
##   5                   700     0.1444469  0.8689695  0.1012085
##   5                   750     0.1444940  0.8688944  0.1012501
##   5                   800     0.1445398  0.8688034  0.1012757
##   5                   850     0.1446111  0.8686853  0.1013308
##   5                   900     0.1446897  0.8685548  0.1013989
##   5                   950     0.1447559  0.8684392  0.1014588
##   5                  1000     0.1448040  0.8683583  0.1015107
##   5                  1050     0.1448693  0.8682379  0.1015772
##   5                  1100     0.1449159  0.8681594  0.1016170
##   5                  1150     0.1449530  0.8680975  0.1016548
##   5                  1200     0.1449954  0.8680207  0.1016869
##   5                  1250     0.1450206  0.8679842  0.1017212
##   5                  1300     0.1450721  0.8678926  0.1017574
##   5                  1350     0.1451014  0.8678447  0.1017928
##   5                  1400     0.1451559  0.8677471  0.1018416
##   5                  1450     0.1451912  0.8676853  0.1018688
##   5                  1500     0.1452131  0.8676450  0.1018873
##   9                    50     0.1490779  0.8606933  0.1048732
##   9                   100     0.1453883  0.8670864  0.1017148
##   9                   150     0.1448618  0.8680743  0.1012029
##   9                   200     0.1449111  0.8680200  0.1011660
##   9                   250     0.1450621  0.8677825  0.1011991
##   9                   300     0.1451356  0.8676500  0.1012852
##   9                   350     0.1452810  0.8674146  0.1013912
##   9                   400     0.1453768  0.8672449  0.1014888
##   9                   450     0.1454443  0.8671506  0.1015379
##   9                   500     0.1454828  0.8670887  0.1015683
##   9                   550     0.1455536  0.8669717  0.1016413
##   9                   600     0.1456087  0.8668836  0.1016883
##   9                   650     0.1456748  0.8667706  0.1017670
##   9                   700     0.1457079  0.8667146  0.1018031
##   9                   750     0.1457389  0.8666659  0.1018400
##   9                   800     0.1457575  0.8666335  0.1018628
##   9                   850     0.1457873  0.8665819  0.1018907
##   9                   900     0.1458130  0.8665375  0.1019237
##   9                   950     0.1458313  0.8665070  0.1019369
##   9                  1000     0.1458536  0.8664674  0.1019588
##   9                  1050     0.1458659  0.8664460  0.1019674
##   9                  1100     0.1458845  0.8664165  0.1019795
##   9                  1150     0.1458953  0.8663987  0.1019899
##   9                  1200     0.1459087  0.8663743  0.1020006
##   9                  1250     0.1459195  0.8663558  0.1020116
##   9                  1300     0.1459282  0.8663410  0.1020181
##   9                  1350     0.1459312  0.8663365  0.1020216
##   9                  1400     0.1459383  0.8663244  0.1020284
##   9                  1450     0.1459421  0.8663180  0.1020321
##   9                  1500     0.1459467  0.8663099  0.1020359
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 250, interaction.depth =
##  5, shrinkage = 0.1 and n.minobsinnode = 20.

And finally, we will fit a Random Forest model:

# tuning parameters
tg <- data.frame(mtry = seq(2, 10, by =2))

# fit model
set.seed(1056)
modRF <- train(x = trainT,
               y = log(trainSalePrice),
               method = "rf", 
               trControl = trainControl(method = "boot", number = 50),
               prox=TRUE,
               tuneGrid = tg)
modRF
## Random Forest 
## 
## 1460 samples
##   87 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (50 reps) 
## Summary of sample sizes: 1460, 1460, 1460, 1460, 1460, 1460, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##    2    0.1927070  0.8213333  0.1352599
##    4    0.1730671  0.8443651  0.1194643
##    6    0.1652054  0.8524292  0.1134408
##    8    0.1609182  0.8563949  0.1101898
##   10    0.1585318  0.8582168  0.1085345
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 10.

6. Evaluate Models’ Performance

Now we will evaluate the models' performance. First, we will collect the resampling results:

# resampling
resamps <- resamples(list(SVM = modSVM,
                          GBM = modGBM,
                          RF = modRF))
resamps
## 
## Call:
## resamples.default(x = list(SVM = modSVM, GBM = modGBM, RF = modRF))
## 
## Models: SVM, GBM, RF 
## Number of resamples: 50 
## Performance metrics: MAE, RMSE, Rsquared 
## Time estimates for: everything, final model fit
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: SVM, GBM, RF 
## Number of resamples: 50 
## 
## MAE 
##           Min.    1st Qu.     Median       Mean    3rd Qu.      Max. NA's
## SVM 0.08696610 0.09065311 0.09269723 0.09289465 0.09473787 0.1004699    0
## GBM 0.09324981 0.09842706 0.10048341 0.10079589 0.10300642 0.1082278    0
## RF  0.09696122 0.10592288 0.10879866 0.10853447 0.11195312 0.1182282    0
## 
## RMSE 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SVM 0.1215586 0.1291939 0.1344441 0.1353978 0.1414358 0.1530272    0
## GBM 0.1296998 0.1380851 0.1429401 0.1438411 0.1475543 0.1654836    0
## RF  0.1372883 0.1524000 0.1589761 0.1585318 0.1634998 0.1786683    0
## 
## Rsquared 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SVM 0.8407383 0.8790968 0.8867549 0.8855670 0.8949495 0.9073738    0
## GBM 0.8139848 0.8635301 0.8731588 0.8698813 0.8791183 0.8910537    0
## RF  0.7984019 0.8531286 0.8587676 0.8582168 0.8663959 0.8896606    0

The support vector machine has the lowest mean and median RMSE. Let's see if there are statistically significant differences between the models:

# compute the differences
difValues <- diff(resamps)

# the actual paired t-test:
difValues$statistics$RMSE
## $SVM.diff.GBM
## 
##  One Sample t-test
## 
## data:  x
## t = -14.085, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 98.33333 percent confidence interval:
##  -0.009929334 -0.006957341
## sample estimates:
##    mean of x 
## -0.008443337 
## 
## 
## $SVM.diff.RF
## 
##  One Sample t-test
## 
## data:  x
## t = -26.99, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 98.33333 percent confidence interval:
##  -0.02525879 -0.02100926
## sample estimates:
##   mean of x 
## -0.02313403 
## 
## 
## $GBM.diff.RF
## 
##  One Sample t-test
## 
## data:  x
## t = -17.8, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 98.33333 percent confidence interval:
##  -0.01673657 -0.01264481
## sample estimates:
##   mean of x 
## -0.01469069

How can we visualize these differences? There are several ways; I will show just one of them:

# plot models' differences
bwplot(difValues, layout = c(1, 3))
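
Another option among caret's lattice-based plots (a sketch) is a dot plot of the estimated differences with their confidence intervals:

# alternative view: estimated differences with confidence intervals
dotplot(difValues)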

It is clear, both visually and numerically, that the support vector machine (SVM) performed better than the random forest and the gradient boosting machine on this data set.

It is also important to inspect the models' performance visually. We will do that by plotting observed vs. predicted values on the training data set:

# get predicted values on the training data set
trainPredSVM <- predict(modSVM, trainT)

# visual performance evaluation
qplot(trainPredSVM, log(trainSalePrice))

Attention here: the graph does not show realistic performance. It shows an optimistic, unrealistic picture, because we are predicting on the very data the model was trained on. For a realistic visual inspection we should have used a validation data set, but we opted not to, to keep things simple.
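
To put a number on that gap, we can compare the apparent error on the training data with the bootstrap estimate stored in the model object (a sketch using caret's RMSE helper):

# apparent (resubstitution) RMSE: optimistic
RMSE(trainPredSVM, log(trainSalePrice))

# bootstrap RMSE of the selected model: the more realistic estimate (about 0.135 above)
min(modSVM$results$RMSE)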

7. Conclusion

Just out of curiosity, let's see which variables are most important for predicting house prices:

# variable importance
roc_imp <- varImp(modSVM, scale = FALSE)
roc_imp
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 87)
## 
##                   Overall
## OverallQual        0.6377
## GarageArea         0.4257
## FullBath           0.3558
## X1stFlrSF          0.3511
## GarageYrBlt        0.3431
## YearRemodAdd       0.3070
## GarageFinish.Unf   0.3046
## Fireplaces         0.2988
## TotRmsAbvGrd       0.2853
## ExterQual.Gd       0.2650
## HeatingQC.Ex       0.2345
## OpenPorchSF        0.2172
## BsmtFinType1.GLQ   0.2100
## LotArea            0.2075
## GarageFinish.Fin   0.2071
## LotFrontage        0.1760
## KitchenQual.Gd     0.1693
## MasVnrArea         0.1624
## OverallCond        0.1537
## GarageType.Attchd  0.1470
plot(roc_imp, top = 20)

"OverallQual" rates the overall material and finish of the house, ranging from "very poor" (1) to "very excellent" (10). "FullBath" counts full bathrooms above grade. "X1stFlrSF" is the first-floor area in square feet. That makes sense!

Now, the final step is to apply the SVM model to the testing data set, get the predicted values, exponentiate them (remember, we used log(SalePrice) as the outcome), prepare the data frame, export the csv file, and submit the results.

# get predicted values on the testing data set
testPredSVM <- predict(modSVM, testT)

# build submission file
submission <- as.data.frame(cbind(testId, testPredSVM))

# change column names
colnames(submission)[1] <- "Id"
colnames(submission)[2] <- "SalePrice"

# exponentiate predicted values
submission$SalePrice <- exp(submission$SalePrice)

# export file
write.csv(submission,"./submission.csv", row.names = FALSE)
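
Before uploading, a quick look at the file contents never hurts (a sketch):

# quick sanity check of the submission frame
head(submission)
summary(submission$SalePrice)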