8 December 2016

Housing Data

For this presentation I am using housing data from a Kaggle competition. For details, see https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data.

Some of the data is shown below:

head(train)
##   SalePrice LotArea GrLivArea GarageYrBlt LotFrontage
## 1    208500    8450      1710        2003          65
## 2    181500    9600      1262        1976          80
## 3    223500   11250      1786        2001          68
## 4    140000    9550      1717        1998          60
## 5    250000   14260      2198        2000          84
## 6    143000   14115      1362        1993          85

Outliers

Outliers cause problems when building the prediction model. The variable to predict is SalePrice.

Outlier Detection

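Outliers need to be detected without using SalePrice. We try to find them by computing the Mahalanobis distance of each observation from the mean of the remaining variables. Note that R's mahalanobis() returns the squared distance.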

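# Distance of each row from the column means, using the predictors only
# (column 1, SalePrice, is dropped).
m_dist <- mahalanobis(train[, -1],
                      colMeans(train[, -1]),
                      cov(train[, -1]))
train$m_dist <- round(m_dist, 1)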
head(train)
##   SalePrice LotArea GrLivArea GarageYrBlt LotFrontage m_dist
## 1    208500    8450      1710        2003          65    1.1
## 2    181500    9600      1262        1976          80    0.7
## 3    223500   11250      1786        2001          68    1.0
## 4    140000    9550      1717        1998          60    1.0
## 5    250000   14260      2198        2000          84    1.9
## 6    143000   14115      1362        1993          85    1.4
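
To make the calculation less of a black box, here is a minimal sketch of what mahalanobis() computes for each row x: (x - mu)' S^-1 (x - mu), where mu is the vector of column means and S is the covariance matrix. The matrix X is synthetic stand-in data, not the Kaggle set, and the variable names are illustrative only.

```r
# Synthetic stand-in for the four predictor columns (not the Kaggle data).
set.seed(1)
X <- matrix(rnorm(200), ncol = 4)

mu    <- colMeans(X)      # centre of the data
S_inv <- solve(cov(X))    # inverse of the covariance matrix

# Squared distance of each row: (x - mu)' S^-1 (x - mu)
centered  <- sweep(X, 2, mu)    # subtract the column means from every row
d2_manual <- rowSums((centered %*% S_inv) * centered)

# The built-in function should give the same squared distances.
d2_builtin <- mahalanobis(X, mu, cov(X))
all.equal(d2_manual, d2_builtin)
```

Because the covariance matrix is part of the formula, variables on very different scales (such as LotArea and GarageYrBlt) do not need to be standardised first.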

Outlier Detection (cont.)

We define a threshold for the Mahalanobis distance. An observation at or above the threshold is flagged as an outlier.

train$Outlier <- 0
train$Outlier[train$m_dist >= 35] <- 1
sum(train$Outlier)
## [1] 7

With a threshold of 35, 7 observations are flagged as outliers. Choosing the threshold takes some experimenting to keep the number of false positives low.
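
One possible starting point for that experimenting, which is not the method used to arrive at 35 here: if the predictors were roughly multivariate normal, the squared distances would approximately follow a chi-squared distribution with as many degrees of freedom as there are variables, so a high quantile of that distribution can serve as a first cutoff. The sketch below uses synthetic normal data. Housing variables are typically skewed, so treat the resulting cutoff as a rough guide only.

```r
p <- 4                             # number of predictor columns used above
cutoff <- qchisq(0.999, df = p)    # 99.9% quantile of chi-squared with p df
cutoff

# Synthetic, normally distributed stand-in data (an assumption, not the
# Kaggle data): only a handful of rows should land above the cutoff.
set.seed(42)
X  <- matrix(rnorm(4000), ncol = p)
d2 <- mahalanobis(X, colMeans(X), cov(X))

flagged <- d2 >= cutoff
sum(flagged)
```

If far more rows are flagged on real data than the chosen quantile suggests, that points to non-normal predictors rather than to that many genuine outliers.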

Outlier Plot

Conclusion

The Mahalanobis distance can be used to detect outliers without using the variable that needs to be predicted. However, keep the following in mind:

  • Do not use too many variables
  • Experiment with the threshold
  • It does not work with non-linear data
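
The last point can be illustrated with a small, made-up example. The points below lie on a circle, and one extra point sits at the centre. That point clearly breaks the pattern, yet it coincides with the mean of the data and therefore gets the smallest distance of all. The data and names are hypothetical.

```r
# Hypothetical data: 200 points evenly spaced on a unit circle, plus one
# point at the centre that does not follow the circular pattern.
theta <- seq(0, 2 * pi, length.out = 201)[-201]
ring  <- cbind(x = cos(theta), y = sin(theta))
pts   <- rbind(ring, c(0, 0))    # row 201 is the odd one out

d2 <- mahalanobis(pts, colMeans(pts), cov(pts))

which.min(d2)    # index of the row with the smallest distance
range(d2[1:200]) # distances of the points on the circle
```

Any threshold on the distance would flag the points on the circle before it flagged the centre point.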