Hey there! Welcome to another R project of mine. This time, I compete in a just-for-fun Kaggle competition designed to help build my skills. Kaggle.com is a website that houses a bunch of datasets; the site also hosts data science competitions that range from just-for-fun to cash prizes (currently there is a $120,000 prize competition).

The competition I’m working on is predicting the prices of homes in Ames, Iowa based on factors like number of beds, baths, area, neighborhood, etc. Kaggle supplies two relatively clean datasets: the training and testing sets. I need to produce a predicted price for every house in the test dataset, and those predictions are compared with the actual prices. I will be graded on how far off my predictions are.

The methods I employed were linear regression and random forest. Ultimately, I chose random forest as my champion method.

This analysis is useful beyond the competition for anyone looking to buy or sell a house who wants to know which variables matter for appreciation or depreciation. Since a home is a big chunk of most people’s wealth, that’s worth knowing.

The process for this project is as follows:

  1. Import, adjust, and inspect the data
  2. Apply and evaluate a linear regression attempt
  3. Apply and evaluate a random forest attempt
  4. Apply models to test set
  5. Compare results

The main packages used for this project are esquisse, tidyverse, randomForest, and caret.

And with that…

The “Watcher House” from NJ, just Google this story.

  1. Import, adjust, and inspect the data

So, here is where I import and adjust the data locally into train_set.

A goal when importing and cleaning the data is to drop as few records as possible, since in the end I need a prediction for every record in the test dataset. I tackled this by running code that searched the training set for NAs and returned the offending column names. I would then inspect each column and usually drop it using the subset() function. Lastly, I inspect the set as a whole using str(). As you can see, the dataset is still pretty wide with the variables I did keep.

I also used the esquisse package (not shown) to explore the data visually.

# Importing the training set
train_set <- read.csv("train.csv", stringsAsFactors = TRUE)

# Dropping variables with NA values and variables I feel are less relevant
train_set <- subset(train_set, select = -c(Alley,PoolQC, Fence,
                                           MiscFeature, FireplaceQu,
                                           GarageFinish, BsmtFinType2,
                                           BsmtFinType1,BsmtQual,
                                           BsmtCond, BsmtExposure,
                                           LotFrontage, MasVnrType,
                                           MasVnrArea, Electrical,
                                           GarageType, GarageYrBlt,
                                           GarageQual, GarageCond))

# This just makes sure I got rid of the columns with NAs in them
colnames(train_set)[apply(train_set, 2, anyNA)]
## character(0)
# This removes any remaining rows with NAs (hopefully none)
train_set <- na.omit(train_set)

# Check the structure of the training set                    
str(train_set)
## 'data.frame':    1460 obs. of  62 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
  2. Apply and evaluate a linear regression attempt

Let’s dive into the actual analytics.

Linear regression is useful for this problem since it is strong at modeling relationships between numeric variables. Factor data and time data get a little messier.

Below we run a regression, first looking at the OverallQual variable as a predictor. OverallQual is a somewhat subjective number that is supposed to encompass all the other factors into a single rating of the house. The regression pulled an adjusted R-squared of 0.6254, which means that roughly 62% of the variation in the sale price is explained by OverallQual alone. Not bad. The way this can be used for prediction is by looking at the Estimate column of the summary, which gives 45435.8: for every 1-point increase in OverallQual, the predicted sale price goes up by $45,435.80. In equation form, which includes the intercept of -96206.1, this looks like:

sale price = 45435.8(OverallQual) - 96206.1 ± error
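
To make this concrete, here is a quick hand-check (my own arithmetic, not part of the competition output) for a house rated OverallQual = 7:

# Plugging OverallQual = 7 into the fitted equation
45435.8 * 7 - 96206.1
## [1] 221844.5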

We run a second regression, now using LotArea (total area of the lot), TotalBsmtSF (square footage of the basement), FullBath (number of full baths), and GarageArea (total area of the garage). I’ll be honest: there is some overlap between LotArea and the other area variables like GarageArea, which generally risks multicollinearity. In this case, however, we are talking about how a home is divided among functional areas like basement, garage, etc. These areas carry different values, so I’m comfortable including them together. The adjusted R-squared for this regression is 0.6002, which is less than the first regression and therefore less helpful. Practically, though, a homeowner can measure the inputs of this second regression much more easily since they are less subjective, which makes this analysis useful. In similar fashion to the first regression, this multiple linear regression’s equation looks like:

sale price = 0.6301(LotArea) + 60.99(TotalBsmtSF) + 45060(FullBath) + 118.30(GarageArea) - 16680 ± error
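
As a sanity check (again, my own arithmetic), let’s plug in the second house from the str() output above, which has LotArea = 9600, TotalBsmtSF = 1262, FullBath = 2, and GarageArea = 460:

# Hand-check of the multiple regression for house 2
0.6301*9600 + 60.99*1262 + 45060*2 + 118.30*460 - 16680
## [1] 210876.3

That house actually sold for $181,500, so the model overshoots by about $29,000 here; errors of that size are what the ± error term absorbs.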

So now we have something we can use to predict house prices.

# Linear regression with just overall quality as a predictor
# (note: putting train_set$ inside the formula works here, but it
# will bite us when predicting on the test set; see section 5)
fit1 <- lm(train_set$SalePrice ~ train_set$OverallQual)

summary(fit1)
## 
## Call:
## lm(formula = train_set$SalePrice ~ train_set$OverallQual)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -198152  -29409   -1845   21463  396848 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -96206.1     5756.4  -16.71   <2e-16 ***
## train_set$OverallQual  45435.8      920.4   49.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48620 on 1458 degrees of freedom
## Multiple R-squared:  0.6257, Adjusted R-squared:  0.6254 
## F-statistic:  2437 on 1 and 1458 DF,  p-value: < 2.2e-16
plot(train_set$OverallQual, train_set$SalePrice)

# Let's try raw dimensions (continuous values) as predictors
fit2 <- lm(train_set$SalePrice ~
             train_set$LotArea +
             train_set$TotalBsmtSF +
             train_set$FullBath +
             train_set$GarageArea)

summary(fit2)
## 
## Call:
## lm(formula = train_set$SalePrice ~ train_set$LotArea + train_set$TotalBsmtSF + 
##     train_set$FullBath + train_set$GarageArea)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -494114  -26830   -2830   19489  375434 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.668e+04  4.549e+03  -3.668 0.000254 ***
## train_set$LotArea      6.301e-01  1.368e-01   4.605 4.48e-06 ***
## train_set$TotalBsmtSF  6.099e+01  3.543e+00  17.212  < 2e-16 ***
## train_set$FullBath     4.506e+04  2.646e+03  17.029  < 2e-16 ***
## train_set$GarageArea   1.183e+02  7.391e+00  16.011  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50230 on 1455 degrees of freedom
## Multiple R-squared:  0.6013, Adjusted R-squared:  0.6002 
## F-statistic: 548.5 on 4 and 1455 DF,  p-value: < 2.2e-16
  3. Apply and evaluate a random forest attempt

Stay with me on this; this algorithm is a little complicated to understand by following the code alone. How this algorithm works, using an oversimplified example, is like asking eight friends to guess the sale price of a house after giving them information such as the number of beds, baths, the location, time of year, etc. After collecting those eight guesses, we average them. If we change the question to something like “is the house two stories?”, where the answer is yes or no, we take majority rule instead of the average. The algorithm is therefore versatile, predicting both continuous values such as price and categorical values such as yes/no. The basic building block of the algorithm is a decision tree, which is like one friend asking themselves a series of questions to narrow down a guess. A friend might ask “What state are we in?” Since we are in Iowa and not New Jersey, we can expect sale prices to be lower than our frame of reference. They might then ask “How old is the house?” and adjust their guess even further. At the end of the questions there should be a relatively informed estimate. Growing many such trees makes a forest, and it is a random forest because each tree sees a random sample of the houses and, at each question, only a random subset of the available clues.
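
Here is a toy version of the analogy (my own illustrative sketch, using the randomForest package directly; the real tuned model comes below). It grows eight “friends” and shows each one’s guess next to the averaged answer:

# Eight friends (trees), each trained on a bootstrap sample of the houses
library(randomForest)
set.seed(42)
toy_rf <- randomForest(SalePrice ~ ., data = train_set, ntree = 8)

# predict.all = TRUE exposes each tree's individual guess
guesses <- predict(toy_rf, train_set[1, , drop = FALSE], predict.all = TRUE)
guesses$individual   # the eight friends' guesses for the first house
guesses$aggregate    # the forest's answer: their average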

For our model, the tuning grid tries letting each tree consider 8, 16, or 32 randomly chosen predictors at each split (the mtry parameter; think of it as how many clues each friend may look at per question). Even though mtry = 32 scored slightly better on R-squared, the improvement was small enough that the one-SE selection rule settled on eight; the slight gain is not worth the extra complexity.

# Setting seed so we get the same "randomness"
set.seed(1234)

# Cross-validation settings: 10-fold CV, with the one-SE rule,
# which picks the simplest model within one standard error of the best
ctrl <-
  trainControl(method = "cv",
               number = 10,
               selectionFunction = "oneSE")

# Candidate values of mtry (predictors tried at each split)
grid_rf <- expand.grid(.mtry = c(8, 16, 32))

# Running the model
rf_mod <-
  train(
    SalePrice ~ .,
    data = train_set,
    method = "rf",
    metric = "Rsquared",
    trControl = ctrl,
    tuneGrid = grid_rf)

# Summary of results
rf_mod
## Random Forest 
## 
## 1460 samples
##   61 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1315, 1313, 1314, 1313, 1314, 1315, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    8    30886.50  0.8688719  18111.24
##   16    29309.95  0.8729830  17065.96
##   32    28324.57  0.8765083  16587.97
## 
## Rsquared was used to select the optimal model using  the one SE rule.
## The final value used for the model was mtry = 8.
  4. Apply models to test set

The next step is applying our trained models to the test dataset, i.e., unseen data. Similarly to the training set, we import and clean the test set by dropping the variables that had too many NA values or seemed insignificant. This time I skip dropping observations (rows) with NA values, since I need to keep every observation to get graded. After importing and cleaning the data, we apply the random forest model and the first linear regression model.

Next, we compare the results.

# Importing test data
test_set <- read.csv("test.csv", stringsAsFactors = TRUE)

# Dropping the same variables we dropped from the training set
test_set <- subset(test_set, select = -c(Alley, PoolQC, Fence,
                                         MiscFeature, FireplaceQu,
                                         GarageFinish, BsmtFinType2,
                                         BsmtFinType1, BsmtQual,
                                         BsmtCond, BsmtExposure,
                                         LotFrontage, MasVnrType,
                                         MasVnrArea, Electrical,
                                         GarageType, GarageYrBlt,
                                         GarageQual, GarageCond))

# Checking for columns with NA values
colnames(test_set)[ apply(test_set, 2, anyNA)]
##  [1] "MSZoning"     "Utilities"    "Exterior1st"  "Exterior2nd" 
##  [5] "BsmtFinSF1"   "BsmtFinSF2"   "BsmtUnfSF"    "TotalBsmtSF" 
##  [9] "BsmtFullBath" "BsmtHalfBath" "KitchenQual"  "Functional"  
## [13] "GarageCars"   "GarageArea"   "SaleType"
# Applying the RF model to the test set to predict SalePrice
# (predict() already knows the model type, so no method argument is needed)
PredictHousePrice <- predict(rf_mod, newdata = test_set)

# Applying the first regression model to the test set
# (the warning below is a red flag; more on it in the next section)
RegressionHousePrice <- predict(fit1, newdata = test_set)
## Warning: 'newdata' had 1459 rows but variables found have 1460 rows
  5. Compare results

Lastly, let’s compare results with simple metrics to see if our predictions fall within the realm of possibility. First we will look at the distribution of the SalePrice variable in the training set, and then compare it to the distributions of predictions from the random forest model and the regression model.

Weirdly enough, the regression predictions have exactly the same average as the training set. There is actually a good answer for this: because fit1 was built with train_set$ inside the formula, predict() could not find those variables in newdata, threw the warning above, and quietly returned the 1460 fitted values from the training set instead of predictions for the 1459 test houses. The regression also has a negative minimum value, which is impractical since homes rarely have negative sale prices; a linear formula, however, can go negative depending on the weights.
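
A minimal sketch of the fix (my own suggestion, not run in the original analysis): refit the model with a data argument so the formula refers to column names, which lets predict() actually use newdata.

# Refit with a data argument so predict() respects newdata
fit1_fixed <- lm(SalePrice ~ OverallQual, data = train_set)
RegressionHousePriceFixed <- predict(fit1_fixed, newdata = test_set)
length(RegressionHousePriceFixed)  # now 1459, one per test house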

The random forest model behaves more realistically, avoiding negative values and not echoing the training set’s average. Both the regression and random forest predictions have tighter ranges (max minus min) than the training set. This may be because I dropped predictors that are rare but important (like a helicopter landing pad or something). By doing this, I hurt my chances of predicting outliers, but I should increase my general accuracy.

# Training set
summary(train_set$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
# Random Forest (test set)
summary(PredictHousePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   76892  133835  160939  179624  208830  428084
# Regression (test set)
summary(RegressionHousePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -50770  130973  176409  180921  221845  358152
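
One more comparison, the width of each range (the numbers in the comments are computed by hand from the rounded summaries above):

# Range width (max minus min) for each set of prices
diff(range(train_set$SalePrice))    # ~720100
diff(range(PredictHousePrice))      # ~351192
diff(range(RegressionHousePrice))   # ~408922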

So, this was my attempt at predicting house prices for a Kaggle competition. Thanks!