This project is part of the “House Prices: Advanced Regression Techniques” contest for Kaggle.
The aim is to use machine learning techniques to predict house prices based on specific features.
https://www.kaggle.com/datasets
My model was processed against the test data, and came back with a Root Mean Square Logarithmic Error of 0.29416.
This document outlines the steps taken in order to create the most accurate prediction model. In order to avoid overfitting, I’ve decided to keep thing simple and rely on a linear model with the fewest possible predictors necessary for an accurate outcome.
But before we do anything else, we’ll first need to load all necessary libraries and define functions that I’ll be using for various aspects of the project.
# Load Libraries
library(mice)
## Loading required package: Rcpp
## mice 2.25 2015-11-09
library(Metrics)
## Warning: package 'Metrics' was built under R version 3.3.3
set.seed(1234)
# Load working data
x <- read.csv("train.csv")
test <- read.csv("test.csv")
# A function to limit number of NA values in any given variable
naThreshold <- function(x)
{
threshold <- 0.9
totalRows <- length(x)
naRows <- length(x[-is.na(x)])
# If over 90% are non-na, then return TRUE
if((naRows / totalRows) > threshold )
{
TRUE
}
else
{
FALSE
}
}
# Filter out columns with long text descriptions.
# (Thankfully, there are none to be found)
longText <- function(x)
{
x <- as.character(x)
x <- x[!is.na(x)]
# Return True if contains long text
if(length(x[nchar(x) > 25]) > 0) { TRUE } else { FALSE }
}
# Find integer columns that have non-zero values for at least 90% of values
# Needs to be done after factor reclassification because 0 is a common factor variable
zeroThreshold <- function(x)
{
threshold <- 0.90
totalRows <- length(x)
valueRows <- length(x[x > 0])
# If over 90% are non-na, then return TRUE
if((valueRows / totalRows) > threshold )
{
TRUE
}
else
{
FALSE
}
}
# Correlation Plots against sale price
plotCorr <- function(tmpColumns, tmpBenchmark)
{
par(mfrow = c(3, 3), mar=c(1, 1, 2, 4))
lapply(names(tmpColumns), function(x) plot(tmpColumns[[x]], y=tmpBenchmark/10000, main=x))
}
# Find the difference between the lowest and highest average sale price for levels of a given factor vaiable
rangeFinder <- function(x, tmpCompare)
{
tmpMeanList <- as.vector(cbind(by(tmpCompare, x, mean)))
max(tmpMeanList) - min(tmpMeanList)
}
# Plot the factors
factorPlot <- function(x, tmpCompare)
{
tmpMeans <- cbind(by(tmpCompare, x, mean))
tmpMeans <- data.frame(factorName = rownames(tmpMeans), factorValue = as.vector(tmpMeans))
tmpMeans <- tmpMeans[order(tmpMeans$factorValue),]
plot(tmpMeans, levels=sort(as.numeric(tmpMeans$factorValue)))
}
# Find root mean squared error
rmse <- function(error)
{
sqrt(mean(error^2))
}
Before exploring, we’ll clean irrelevant data from the dataset
# Filter down to columns that have data for at least 90% of houses
naThresholdList <- data.frame(colNames = names(x), qualify = apply(x, MARGIN = 2, FUN= naThreshold))
colsToDelete <- naThresholdList[naThresholdList$qualify == FALSE,]$colNames
x <- x[,colsToDelete]
# Filter zero-variance columns
# They all seem to have at least one value.
nonzeroVariance <- data.frame(colNames = names(x), lowVarValue = apply(x, MARGIN=2, FUN= function(x){ if(length(unique(x)) < 2) { FALSE } else { TRUE } }))
# Find columns with long text, such as descriptions or notes that aren't useful for our exploration. Thankfully, there are none.
longTextCols <- data.frame(colNames = names(x), longText = apply(x, MARGIN=2, FUN=longText))
# Get the number of unique value for each column, as some nueric columns might actually be coded factors
sub25UniqueCols <-
data.frame(
colNames = names(x),
uniqueVals = as.vector(apply(x, MARGIN=2, FUN=
function(x){ length(unique(x)) }
)))
# If a column has fewer than 25 unique values, declare it a factor
colsToFactor <- sub25UniqueCols[sub25UniqueCols$uniqueVals < 25, ]$colNames
x[, as.vector(colsToFactor)] <- sapply(x[, as.vector(colsToFactor)], as.factor)
# Identify which columns are integer values
integerCols <- data.frame(colname = names(x), colclass = sapply(x, FUN=class))
integerCols <- as.vector(integerCols[integerCols$colclass %in% "integer", ]$colname)
# Remove irrelevant columns
colsToRemove <- data.frame(colnames = integerCols, remval = as.vector(apply(x[,integerCols], MARGIN=2, FUN=zeroThreshold)))
colsToRemove <- as.vector(colsToRemove[colsToRemove$remval == FALSE,]$colnames)
x <- x[,!(names(x) %in% colsToRemove)]
# And now the data set is down to 43 columns
names(x)
## [1] "HeatingQC" "Functional" "Fireplaces" "LowQualFinSF"
## [5] "BsmtHalfBath" "X1stFlrSF" "Fence" "GrLivArea"
## [9] "FireplaceQu" "Condition2" "BldgType" "MSZoning"
## [13] "Heating" "GarageFinish" "GarageYrBlt" "SaleType"
## [17] "SaleCondition" "PavedDrive" "GarageCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "OverallCond" "MasVnrType"
## [25] "Street" "Alley" "Neighborhood" "PoolArea"
## [29] "BsmtUnfSF" "TotalBsmtSF" "Condition1" "MiscVal"
## [33] "MoSold" "FullBath" "BsmtFinType2" "Utilities"
## [37] "LotConfig" "MSSubClass" "CentralAir" "Electrical"
## [41] "PoolQC" "ExterQual" "Exterior2nd" "Foundation"
## [45] "ExterCond" "GarageCars" "MiscFeature" "GarageType"
## [49] "OverallQual" "YrSold" "X3SsnPorch" "GarageArea"
## [53] "KitchenQual" "TotRmsAbvGrd" "SalePrice"
Let’s explore the data to see some of its characteristics and decide which variables we’d want to keep.
# Split column names into Character and Integer
xColumns <- data.frame(colName = names(x), colClass = as.vector(sapply(x, class)))
intNames <- as.vector(xColumns[xColumns$colClass %in% "integer", ]$colName)
factorNames <- as.vector(xColumns[xColumns$colClass %in% c("factor", "character"), ]$colName)
# Plot all integer columns against sale price, to determine which would be better off as factors
plotCorr(x[,intNames[1:length(intNames)]], x$SalePrice)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
# This plot shows a few really strong correlations.
# Now let's look at the actual correlation numbers
sapply(x[,intNames[1:length(intNames)]], cor, y=x$SalePrice)
## X1stFlrSF GrLivArea GarageYrBlt YearBuilt YearRemodAdd
## 0.6058522 0.7086245 NA 0.5228973 0.5071010
## BsmtUnfSF TotalBsmtSF GarageArea SalePrice
## 0.2144791 0.6135806 0.6234314 1.0000000
# Run a correlation matrix between all the numeric columns to prevent varryance inflation'
cor(x[,intNames[1:length(intNames)]], x[,intNames[1:length(intNames)]])
## X1stFlrSF GrLivArea GarageYrBlt YearBuilt YearRemodAdd
## X1stFlrSF 1.0000000 0.5660240 NA 0.2819859 0.2403793
## GrLivArea 0.5660240 1.0000000 NA 0.1990097 0.2873885
## GarageYrBlt NA NA NA NA NA
## YearBuilt 0.2819859 0.1990097 NA 1.0000000 0.5928550
## YearRemodAdd 0.2403793 0.2873885 NA 0.5928550 1.0000000
## BsmtUnfSF 0.3179874 0.2402573 NA 0.1490404 0.1811331
## TotalBsmtSF 0.8195300 0.4548682 NA 0.3914520 0.2910656
## GarageArea 0.4897817 0.4689975 NA 0.4789538 0.3715998
## SalePrice 0.6058522 0.7086245 NA 0.5228973 0.5071010
## BsmtUnfSF TotalBsmtSF GarageArea SalePrice
## X1stFlrSF 0.3179874 0.8195300 0.4897817 0.6058522
## GrLivArea 0.2402573 0.4548682 0.4689975 0.7086245
## GarageYrBlt NA NA NA NA
## YearBuilt 0.1490404 0.3914520 0.4789538 0.5228973
## YearRemodAdd 0.1811331 0.2910656 0.3715998 0.5071010
## BsmtUnfSF 1.0000000 0.4153596 0.1833027 0.2144791
## TotalBsmtSF 0.4153596 1.0000000 0.4866655 0.6135806
## GarageArea 0.1833027 0.4866655 1.0000000 0.6234314
## SalePrice 0.2144791 0.6135806 0.6234314 1.0000000
# Remove unnecessary integer variables
numToRemove <- c("TotalBsmtSF", "GarageYearBlt", "YearRemodAdd", "X1stFlrSF")
TotalBsmtSF seems to be correlated with X1stFlrSF, which makes sense since they should both be about equal unless there is no basement. GarageYearBlt is almost identical to YearBuilt. So I can just remove it.
YearRemodAdd is highly corelated with YearBuilt.
GrLivArea is also highly correlated with X1stFlrSF. I’ll remove X1stFlrSF, since the living area seems more descriptive.
Having done this, let’s look at the factors.
# Get the ranges of the averages for each factor
factorRanges <- data.frame(colNames = names(x[,factorNames]), colRange = as.vector(apply(x[,factorNames], MARGIN=2, FUN=rangeFinder, tmpCompare = x$SalePrice)))
# Filter down to factor where the minumum and maxumum are separated by at least 200K
factorRanges <- factorRanges[factorRanges$colRange > 200000, ]
factorSubset <- x[,names(x) %in% factorRanges$colNames]
# Plot each remaining factor variable against the sale price
par(mfrow = c(4, 4), mar=c(1, 1, 2, 4))
apply(factorSubset[,1:length(names(factorSubset))], MARGIN=2, FUN=factorPlot, tmpCompare=x$SalePrice)
## $LowQualFinSF
## $LowQualFinSF$stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 181433.7 124900 135000 79500 122000 133000 84500 108750 104000 191000
## [2,] 181433.7 124900 135000 79500 122000 133000 84500 108750 104000 191000
## [3,] 181433.7 124900 135000 79500 122000 133000 84500 108750 104000 191000
## [4,] 181433.7 124900 135000 79500 122000 133000 84500 108750 104000 191000
## [5,] 181433.7 124900 135000 79500 122000 133000 84500 108750 104000 191000
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 130000 235000 197000 179500 115000 200500 118500 85000 102000 2e+05
## [2,] 130000 235000 197000 179500 115000 200500 118500 85000 102000 2e+05
## [3,] 130000 235000 197000 179500 115000 200500 118500 85000 102000 2e+05
## [4,] 130000 235000 197000 179500 115000 200500 118500 85000 102000 2e+05
## [5,] 130000 235000 197000 179500 115000 200500 118500 85000 102000 2e+05
## [,21] [,22] [,23] [,24]
## [1,] 128500 169000 475000 147517.3
## [2,] 128500 169000 475000 147517.3
## [3,] 128500 169000 475000 147517.3
## [4,] 128500 169000 475000 147517.3
## [5,] 128500 169000 475000 147517.3
##
## $LowQualFinSF$n
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $LowQualFinSF$conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 181433.7 124900 135000 79500 122000 133000 84500 108750 104000 191000
## [2,] 181433.7 124900 135000 79500 122000 133000 84500 108750 104000 191000
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 130000 235000 197000 179500 115000 200500 118500 85000 102000 2e+05
## [2,] 130000 235000 197000 179500 115000 200500 118500 85000 102000 2e+05
## [,21] [,22] [,23] [,24]
## [1,] 128500 169000 475000 147517.3
## [2,] 128500 169000 475000 147517.3
##
## $LowQualFinSF$out
## numeric(0)
##
## $LowQualFinSF$group
## numeric(0)
##
## $LowQualFinSF$names
## [1] "0" "120" "144" "156" "205" "232" "234" "360" "371" "384" "390"
## [12] "392" "397" "420" "473" "479" "481" "513" "514" "515" "528" "53"
## [23] "572" "80"
##
##
## $FireplaceQu
## $FireplaceQu$stats
## [,1] [,2] [,3] [,4] [,5]
## [1,] 337712.5 167298.5 226351.4 129764.1 205723.5
## [2,] 337712.5 167298.5 226351.4 129764.1 205723.5
## [3,] 337712.5 167298.5 226351.4 129764.1 205723.5
## [4,] 337712.5 167298.5 226351.4 129764.1 205723.5
## [5,] 337712.5 167298.5 226351.4 129764.1 205723.5
##
## $FireplaceQu$n
## [1] 1 1 1 1 1
##
## $FireplaceQu$conf
## [,1] [,2] [,3] [,4] [,5]
## [1,] 337712.5 167298.5 226351.4 129764.1 205723.5
## [2,] 337712.5 167298.5 226351.4 129764.1 205723.5
##
## $FireplaceQu$out
## numeric(0)
##
## $FireplaceQu$group
## numeric(0)
##
## $FireplaceQu$names
## [1] "Ex" "Fa" "Gd" "Po" "TA"
##
##
## $Condition2
## $Condition2$stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 106500 121166.7 181169.4 325000 284875 190000 136905 96750
## [2,] 106500 121166.7 181169.4 325000 284875 190000 136905 96750
## [3,] 106500 121166.7 181169.4 325000 284875 190000 136905 96750
## [4,] 106500 121166.7 181169.4 325000 284875 190000 136905 96750
## [5,] 106500 121166.7 181169.4 325000 284875 190000 136905 96750
##
## $Condition2$n
## [1] 1 1 1 1 1 1 1 1
##
## $Condition2$conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 106500 121166.7 181169.4 325000 284875 190000 136905 96750
## [2,] 106500 121166.7 181169.4 325000 284875 190000 136905 96750
##
## $Condition2$out
## numeric(0)
##
## $Condition2$group
## numeric(0)
##
## $Condition2$names
## [1] "Artery" "Feedr" "Norm" "PosA" "PosN" "RRAe" "RRAn" "RRNn"
##
##
## $Neighborhood
## $Neighborhood$stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 194870.9 137500 104493.8 124834.1 212565.4 197965.8 210624.7 128219.7
## [2,] 194870.9 137500 104493.8 124834.1 212565.4 197965.8 210624.7 128219.7
## [3,] 194870.9 137500 104493.8 124834.1 212565.4 197965.8 210624.7 128219.7
## [4,] 194870.9 137500 104493.8 124834.1 212565.4 197965.8 210624.7 128219.7
## [5,] 194870.9 137500 104493.8 124834.1 212565.4 197965.8 210624.7 128219.7
## [,9] [,10] [,11] [,12] [,13] [,14] [,15]
## [1,] 192854.5 100123.8 98576.47 156270.1 145847.1 335295.3 142694.4
## [2,] 192854.5 100123.8 98576.47 156270.1 145847.1 335295.3 142694.4
## [3,] 192854.5 100123.8 98576.47 156270.1 145847.1 335295.3 142694.4
## [4,] 192854.5 100123.8 98576.47 156270.1 145847.1 335295.3 142694.4
## [5,] 192854.5 100123.8 98576.47 156270.1 145847.1 335295.3 142694.4
## [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## [1,] 316270.6 189050.1 128225.3 136793.1 186555.8 225379.8 310499 142591.4
## [2,] 316270.6 189050.1 128225.3 136793.1 186555.8 225379.8 310499 142591.4
## [3,] 316270.6 189050.1 128225.3 136793.1 186555.8 225379.8 310499 142591.4
## [4,] 316270.6 189050.1 128225.3 136793.1 186555.8 225379.8 310499 142591.4
## [5,] 316270.6 189050.1 128225.3 136793.1 186555.8 225379.8 310499 142591.4
## [,24] [,25]
## [1,] 242247.4 238772.7
## [2,] 242247.4 238772.7
## [3,] 242247.4 238772.7
## [4,] 242247.4 238772.7
## [5,] 242247.4 238772.7
##
## $Neighborhood$n
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $Neighborhood$conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 194870.9 137500 104493.8 124834.1 212565.4 197965.8 210624.7 128219.7
## [2,] 194870.9 137500 104493.8 124834.1 212565.4 197965.8 210624.7 128219.7
## [,9] [,10] [,11] [,12] [,13] [,14] [,15]
## [1,] 192854.5 100123.8 98576.47 156270.1 145847.1 335295.3 142694.4
## [2,] 192854.5 100123.8 98576.47 156270.1 145847.1 335295.3 142694.4
## [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## [1,] 316270.6 189050.1 128225.3 136793.1 186555.8 225379.8 310499 142591.4
## [2,] 316270.6 189050.1 128225.3 136793.1 186555.8 225379.8 310499 142591.4
## [,24] [,25]
## [1,] 242247.4 238772.7
## [2,] 242247.4 238772.7
##
## $Neighborhood$out
## numeric(0)
##
## $Neighborhood$group
## numeric(0)
##
## $Neighborhood$names
## [1] "Blmngtn" "Blueste" "BrDale" "BrkSide" "ClearCr" "CollgCr" "Crawfor"
## [8] "Edwards" "Gilbert" "IDOTRR" "MeadowV" "Mitchel" "NAmes" "NoRidge"
## [15] "NPkVill" "NridgHt" "NWAmes" "OldTown" "Sawyer" "SawyerW" "Somerst"
## [22] "StoneBr" "SWISU" "Timber" "Veenker"
##
##
## $PoolArea
## $PoolArea$stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 180404.7 160000 235000 250000 745000 171000 181000 274970
## [2,] 180404.7 160000 235000 250000 745000 171000 181000 274970
## [3,] 180404.7 160000 235000 250000 745000 171000 181000 274970
## [4,] 180404.7 160000 235000 250000 745000 171000 181000 274970
## [5,] 180404.7 160000 235000 250000 745000 171000 181000 274970
##
## $PoolArea$n
## [1] 1 1 1 1 1 1 1 1
##
## $PoolArea$conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 180404.7 160000 235000 250000 745000 171000 181000 274970
## [2,] 180404.7 160000 235000 250000 745000 171000 181000 274970
##
## $PoolArea$out
## numeric(0)
##
## $PoolArea$group
## numeric(0)
##
## $PoolArea$names
## [1] "0" "480" "512" "519" "555" "576" "648" "738"
##
##
## $MiscVal
## $MiscVal$stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 181964.7 256000 122750 160000 163000 151500 202450 266500 2e+05 55000
## [2,] 181964.7 256000 122750 160000 163000 151500 202450 266500 2e+05 55000
## [3,] 181964.7 256000 122750 160000 163000 151500 202450 266500 2e+05 55000
## [4,] 181964.7 256000 122750 160000 163000 151500 202450 266500 2e+05 55000
## [5,] 181964.7 256000 122750 160000 163000 151500 202450 266500 2e+05 55000
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] 148940.9 116437.5 152500 169300 55993 81000 123350 88000 169890
## [2,] 148940.9 116437.5 152500 169300 55993 81000 123350 88000 169890
## [3,] 148940.9 116437.5 152500 169300 55993 81000 123350 88000 169890
## [4,] 148940.9 116437.5 152500 169300 55993 81000 123350 88000 169890
## [5,] 148940.9 116437.5 152500 169300 55993 81000 123350 88000 169890
## [,20] [,21]
## [1,] 110000 190000
## [2,] 110000 190000
## [3,] 110000 190000
## [4,] 110000 190000
## [5,] 110000 190000
##
## $MiscVal$n
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $MiscVal$conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 181964.7 256000 122750 160000 163000 151500 202450 266500 2e+05 55000
## [2,] 181964.7 256000 122750 160000 163000 151500 202450 266500 2e+05 55000
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] 148940.9 116437.5 152500 169300 55993 81000 123350 88000 169890
## [2,] 148940.9 116437.5 152500 169300 55993 81000 123350 88000 169890
## [,20] [,21]
## [1,] 110000 190000
## [2,] 110000 190000
##
## $MiscVal$out
## numeric(0)
##
## $MiscVal$group
## numeric(0)
##
## $MiscVal$names
## [1] "0" "1150" "1200" "1300" "1400" "15500" "2000" "2500"
## [9] "350" "3500" "400" "450" "480" "500" "54" "560"
## [17] "600" "620" "700" "800" "8300"
##
##
## $FullBath
## $FullBath$stats
## [,1] [,2] [,3] [,4]
## [1,] 165200.9 134751.4 213009.8 347822.9
## [2,] 165200.9 134751.4 213009.8 347822.9
## [3,] 165200.9 134751.4 213009.8 347822.9
## [4,] 165200.9 134751.4 213009.8 347822.9
## [5,] 165200.9 134751.4 213009.8 347822.9
##
## $FullBath$n
## [1] 1 1 1 1
##
## $FullBath$conf
## [,1] [,2] [,3] [,4]
## [1,] 165200.9 134751.4 213009.8 347822.9
## [2,] 165200.9 134751.4 213009.8 347822.9
##
## $FullBath$out
## numeric(0)
##
## $FullBath$group
## numeric(0)
##
## $FullBath$names
## [1] "0" "1" "2" "3"
##
##
## $PoolQC
## $PoolQC$stats
## [,1] [,2] [,3]
## [1,] 490000 215500 201990
## [2,] 490000 215500 201990
## [3,] 490000 215500 201990
## [4,] 490000 215500 201990
## [5,] 490000 215500 201990
##
## $PoolQC$n
## [1] 1 1 1
##
## $PoolQC$conf
## [,1] [,2] [,3]
## [1,] 490000 215500 201990
## [2,] 490000 215500 201990
##
## $PoolQC$out
## numeric(0)
##
## $PoolQC$group
## numeric(0)
##
## $PoolQC$names
## [1] "Ex" "Fa" "Gd"
##
##
## $ExterQual
## $ExterQual$stats
## [,1] [,2] [,3] [,4]
## [1,] 367361 87985.21 231633.5 144341.3
## [2,] 367361 87985.21 231633.5 144341.3
## [3,] 367361 87985.21 231633.5 144341.3
## [4,] 367361 87985.21 231633.5 144341.3
## [5,] 367361 87985.21 231633.5 144341.3
##
## $ExterQual$n
## [1] 1 1 1 1
##
## $ExterQual$conf
## [,1] [,2] [,3] [,4]
## [1,] 367361 87985.21 231633.5 144341.3
## [2,] 367361 87985.21 231633.5 144341.3
##
## $ExterQual$out
## numeric(0)
##
## $ExterQual$group
## numeric(0)
##
## $ExterQual$names
## [1] "Ex" "Fa" "Gd" "TA"
##
##
## $Exterior2nd
## $Exterior2nd$stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 114060.6 138000 126714.3 195818 105000 230093.8 167661.6 252070
## [2,] 114060.6 138000 126714.3 195818 105000 230093.8 167661.6 252070
## [3,] 114060.6 138000 126714.3 195818 105000 230093.8 167661.6 252070
## [4,] 114060.6 138000 126714.3 195818 105000 230093.8 167661.6 252070
## [5,] 114060.6 138000 126714.3 195818 105000 230093.8 167661.6 252070
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## [1,] 149803.2 319000 168112.4 158224.8 155905.2 214432.5 148386.1 161328.9
## [2,] 149803.2 319000 168112.4 158224.8 155905.2 214432.5 148386.1 161328.9
## [3,] 149803.2 319000 168112.4 158224.8 155905.2 214432.5 148386.1 161328.9
## [4,] 149803.2 319000 168112.4 158224.8 155905.2 214432.5 148386.1 161328.9
## [5,] 149803.2 319000 168112.4 158224.8 155905.2 214432.5 148386.1 161328.9
##
## $Exterior2nd$n
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $Exterior2nd$conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 114060.6 138000 126714.3 195818 105000 230093.8 167661.6 252070
## [2,] 114060.6 138000 126714.3 195818 105000 230093.8 167661.6 252070
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## [1,] 149803.2 319000 168112.4 158224.8 155905.2 214432.5 148386.1 161328.9
## [2,] 149803.2 319000 168112.4 158224.8 155905.2 214432.5 148386.1 161328.9
##
## $Exterior2nd$out
## numeric(0)
##
## $Exterior2nd$group
## numeric(0)
##
## $Exterior2nd$names
## [1] "AsbShng" "AsphShn" "Brk Cmn" "BrkFace" "CBlock" "CmentBd" "HdBoard"
## [8] "ImStucc" "MetalSd" "Other" "Plywood" "Stone" "Stucco" "VinylSd"
## [15] "Wd Sdng" "Wd Shng"
##
##
## $GarageCars
## $GarageCars$stats
## [,1] [,2] [,3] [,4] [,5]
## [1,] 103317.3 128116.7 183851.7 309636.1 192655.8
## [2,] 103317.3 128116.7 183851.7 309636.1 192655.8
## [3,] 103317.3 128116.7 183851.7 309636.1 192655.8
## [4,] 103317.3 128116.7 183851.7 309636.1 192655.8
## [5,] 103317.3 128116.7 183851.7 309636.1 192655.8
##
## $GarageCars$n
## [1] 1 1 1 1 1
##
## $GarageCars$conf
## [,1] [,2] [,3] [,4] [,5]
## [1,] 103317.3 128116.7 183851.7 309636.1 192655.8
## [2,] 103317.3 128116.7 183851.7 309636.1 192655.8
##
## $GarageCars$out
## numeric(0)
##
## $GarageCars$group
## numeric(0)
##
## $GarageCars$names
## [1] "0" "1" "2" "3" "4"
##
##
## $OverallQual
## $OverallQual$stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 50150 438588.4 51770.33 87473.75 108420.7 133523.3 161603 207716.4
## [2,] 50150 438588.4 51770.33 87473.75 108420.7 133523.3 161603 207716.4
## [3,] 50150 438588.4 51770.33 87473.75 108420.7 133523.3 161603 207716.4
## [4,] 50150 438588.4 51770.33 87473.75 108420.7 133523.3 161603 207716.4
## [5,] 50150 438588.4 51770.33 87473.75 108420.7 133523.3 161603 207716.4
## [,9] [,10]
## [1,] 274735.5 367513
## [2,] 274735.5 367513
## [3,] 274735.5 367513
## [4,] 274735.5 367513
## [5,] 274735.5 367513
##
## $OverallQual$n
## [1] 1 1 1 1 1 1 1 1 1 1
##
## $OverallQual$conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 50150 438588.4 51770.33 87473.75 108420.7 133523.3 161603 207716.4
## [2,] 50150 438588.4 51770.33 87473.75 108420.7 133523.3 161603 207716.4
## [,9] [,10]
## [1,] 274735.5 367513
## [2,] 274735.5 367513
##
## $OverallQual$out
## numeric(0)
##
## $OverallQual$group
## numeric(0)
##
## $OverallQual$names
## [1] "1" "10" "2" "3" "4" "5" "6" "7" "8" "9"
##
##
## $X3SsnPorch
## $X3SsnPorch$stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 180448.8 180000 135000 211500 392500 149500 238000 135000 222000
## [2,] 180448.8 180000 135000 211500 392500 149500 238000 135000 222000
## [3,] 180448.8 180000 135000 211500 392500 149500 238000 135000 222000
## [4,] 180448.8 180000 135000 211500 392500 149500 238000 135000 222000
## [5,] 180448.8 180000 135000 211500 392500 149500 238000 135000 222000
## [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] 228500 184500 169990 194500 231500 262500 394617 143000 180500 180500
## [2,] 228500 184500 169990 194500 231500 262500 394617 143000 180500 180500
## [3,] 228500 184500 169990 194500 231500 262500 394617 143000 180500 180500
## [4,] 228500 184500 169990 194500 231500 262500 394617 143000 180500 180500
## [5,] 228500 184500 169990 194500 231500 262500 394617 143000 180500 180500
## [,20]
## [1,] 179900
## [2,] 179900
## [3,] 179900
## [4,] 179900
## [5,] 179900
##
## $X3SsnPorch$n
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $X3SsnPorch$conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 180448.8 180000 135000 211500 392500 149500 238000 135000 222000
## [2,] 180448.8 180000 135000 211500 392500 149500 238000 135000 222000
## [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] 228500 184500 169990 194500 231500 262500 394617 143000 180500 180500
## [2,] 228500 184500 169990 194500 231500 262500 394617 143000 180500 180500
## [,20]
## [1,] 179900
## [2,] 179900
##
## $X3SsnPorch$out
## numeric(0)
##
## $X3SsnPorch$group
## numeric(0)
##
## $X3SsnPorch$names
## [1] "0" "130" "140" "144" "153" "162" "168" "180" "182" "196" "216"
## [12] "23" "238" "245" "290" "304" "320" "407" "508" "96"
##
##
## $KitchenQual
## $KitchenQual$stats
## [,1] [,2] [,3] [,4]
## [1,] 328554.7 105565.2 212116 139962.5
## [2,] 328554.7 105565.2 212116 139962.5
## [3,] 328554.7 105565.2 212116 139962.5
## [4,] 328554.7 105565.2 212116 139962.5
## [5,] 328554.7 105565.2 212116 139962.5
##
## $KitchenQual$n
## [1] 1 1 1 1
##
## $KitchenQual$conf
## [,1] [,2] [,3] [,4]
## [1,] 328554.7 105565.2 212116 139962.5
## [2,] 328554.7 105565.2 212116 139962.5
##
## $KitchenQual$out
## numeric(0)
##
## $KitchenQual$group
## numeric(0)
##
## $KitchenQual$names
## [1] "Ex" "Fa" "Gd" "TA"
##
##
## $TotRmsAbvGrd
## $TotRmsAbvGrd$stats
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 296279.2 318022 280971.5 2e+05 39300 111217.6 122844.6 141550.7
## [2,] 296279.2 318022 280971.5 2e+05 39300 111217.6 122844.6 141550.7
## [3,] 296279.2 318022 280971.5 2e+05 39300 111217.6 122844.6 141550.7
## [4,] 296279.2 318022 280971.5 2e+05 39300 111217.6 122844.6 141550.7
## [5,] 296279.2 318022 280971.5 2e+05 39300 111217.6 122844.6 141550.7
## [,9] [,10] [,11] [,12]
## [1,] 161303.3 196666.8 213427.5 252988.2
## [2,] 161303.3 196666.8 213427.5 252988.2
## [3,] 161303.3 196666.8 213427.5 252988.2
## [4,] 161303.3 196666.8 213427.5 252988.2
## [5,] 161303.3 196666.8 213427.5 252988.2
##
## $TotRmsAbvGrd$n
## [1] 1 1 1 1 1 1 1 1 1 1 1 1
##
## $TotRmsAbvGrd$conf
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 296279.2 318022 280971.5 2e+05 39300 111217.6 122844.6 141550.7
## [2,] 296279.2 318022 280971.5 2e+05 39300 111217.6 122844.6 141550.7
## [,9] [,10] [,11] [,12]
## [1,] 161303.3 196666.8 213427.5 252988.2
## [2,] 161303.3 196666.8 213427.5 252988.2
##
## $TotRmsAbvGrd$out
## numeric(0)
##
## $TotRmsAbvGrd$group
## numeric(0)
##
## $TotRmsAbvGrd$names
## [1] "10" "11" "12" "14" "2" "3" "4" "5" "6" "7" "8" "9"
From the plot, it seems that the following columns have the most useful distribution of means:
I’ve ignored OverallQual, since this is probably highly correlated with ExterQual and KitchenQual
Neighborhood seems to have some really great data, but it contains so many different factors that overfitting may occur. Because of this, I think it would be useful to create a new variable that contains the average price per square foot for each neighborhood.
# Indicate which factors to keep
factorsToKeep <- c("FireplaceQu", "FullBath", "Exterior2nd", "PoolQC", "ExterCond", "KitchenQual", "TotRmsAbvGrd", "Neighborhood")
# Compile a new data frame containing only the desired numeric variables, and impute the missing numeric data
# Also, set the integers to numeric, and characters to factors
xNumeric <- x[,!(names(x) %in% numToRemove)]
xNumeric <- xNumeric[, names(xNumeric) %in% integerCols]
# GarageCars must be processed as numeric in order to get around a problem with the model predictions
xNumeric$GarageCars <- as.numeric(x$GarageCars)
# Impute missing data
tempData <- mice(xNumeric, m=2, maxit=50, meth='pmm', seed=500, printFlag=FALSE)
xNumeric <- as.data.frame(complete(tempData, 1))
xNumeric <- data.frame(sapply(xNumeric, as.numeric))
xNumeric <- apply(xNumeric, MARGIN=2, FUN=as.numeric)
xFactors <- x[,names(x) %in% factorsToKeep]
xFactors <- data.frame(sapply(xFactors, as.factor))
xFinal <- cbind(xNumeric, xFactors)
# Add a new variable that is the average price per square foot for each neighbourhood
tmpPrceFTNeighbourhood <- data.frame(aggregate(x[, c("SalePrice")] / x[, c("GrLivArea")], list(x$Neighborhood), mean))
names(tmpPrceFTNeighbourhood) <- c("Neighborhood", "sqfPrice")
xFinal <- merge(x = xFinal, y = tmpPrceFTNeighbourhood, by = "Neighborhood", all = TRUE)
xFinal$sqfPrice <- as.numeric(xFinal$sqfPrice)
# Fix issues with the xFinal data frame
xFinal$YrSold <- as.numeric(x$YrSold)
xFinal$BldgType <- as.factor(x$BldgType)
xFinal$SaleCondition <- as.factor(x$SaleCondition)
# Split the columns by class
columnClasses <- data.frame(colNames = names(xFinal), colClass = sapply(xFinal, class))
Prepare the test set for processing.
#Split the test set into factors and numeric
testNumeric <- test[,names(test) %in% c(as.vector(columnClasses[columnClasses$colClass %in% "numeric",]$colNames), "Id")]
testFactor <- test[,names(test) %in% as.vector(columnClasses[columnClasses$colClass %in% "factor",]$colNames)]
# Apply factor and numeric data types
testNumeric <- as.data.frame(lapply(testNumeric, as.numeric))
testFactor <- as.data.frame(lapply(testFactor, as.factor))
# Impute missing numeric values
tempData <- mice(testNumeric, m=2, maxit=50, meth='pmm', seed=500, printFlag=FALSE)
testNumeric <- as.data.frame(complete(tempData, 1))
testNumeric <- data.frame(sapply(testNumeric, as.numeric))
testNumeric <- apply(testNumeric, MARGIN=2, FUN=as.numeric)
# Add a new variable that is the average price per square foot for each neighbourhood. This is based on X, since there is no saleprice data in the test set
testFactor <- merge(x = testFactor, y = tmpPrceFTNeighbourhood, by = "Neighborhood", all = TRUE)
testFactor$sqfPrice <- as.numeric(testFactor$sqfPrice)
# Recompile final test set
testFinal <- as.data.frame(cbind(testNumeric, testFactor))
# Check the details
sapply(testFinal, class)
## Id YearBuilt BsmtUnfSF GrLivArea GarageYrBlt
## "numeric" "numeric" "numeric" "numeric" "numeric"
## GarageCars GarageArea YrSold Neighborhood BldgType
## "numeric" "numeric" "numeric" "factor" "factor"
## Exterior2nd ExterCond FullBath KitchenQual TotRmsAbvGrd
## "factor" "factor" "factor" "factor" "factor"
## FireplaceQu PoolQC SaleCondition sqfPrice
## "factor" "factor" "factor" "numeric"
Now that we’ve cleaned up and analyzed the data, we’ll create our linear model.
Instead of basing the price on the neighbourhood, we base the slope on the interaction between the GrLivArea and sqfPrice. Other critical values include YearBuilt, GarageCars, ExterQual, YrSold, PoolQC and KitchenQual.
Once this model is fitted, we’ll examine diagnostic data and make predictions against the test data.
# Modify test and training data to deal with prediction issues.
testFinal$GarageCars <- as.numeric(testFinal$GarageCars)
xFinal$GarageCars <- as.numeric(xFinal$GarageCars)
# Find factors not present in the test set (these will cause NA records)
testFinal[!(testFinal$ExterCond %in% xFinal$ExterCond),]$ExterCond
## factor(0)
## Levels: Ex Fa Gd Po TA
testFinal[!(testFinal$KitchenQual %in% xFinal$KitchenQual),]$KitchenQual
## [1] <NA>
## Levels: Ex Fa Gd TA
# Set N/A Kitchen to Typical/Average.
testFinal[is.na(testFinal$KitchenQual),]$KitchenQual <- "TA"
# Plot the model
linearFit <- lm(SalePrice ~ GrLivArea:sqfPrice + YearBuilt + GarageCars + ExterCond + YrSold + KitchenQual, data = xFinal)
# Model summary
summary(linearFit)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea:sqfPrice + YearBuilt + GarageCars +
## ExterCond + YrSold + KitchenQual, data = xFinal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -335127 -17672 469 15209 250874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.447e+06 1.418e+06 1.726 0.08464 .
## YearBuilt 2.227e+02 4.099e+01 5.432 6.54e-08 ***
## GarageCars 1.444e+04 1.701e+03 8.487 < 2e-16 ***
## ExterCondFa -7.084e+04 2.182e+04 -3.247 0.00119 **
## ExterCondGd -5.142e+04 2.086e+04 -2.465 0.01382 *
## ExterCondPo -8.886e+04 4.171e+04 -2.130 0.03333 *
## ExterCondTA -5.913e+04 2.074e+04 -2.851 0.00442 **
## YrSold -1.361e+03 7.051e+02 -1.931 0.05373 .
## KitchenQualFa -6.682e+04 7.704e+03 -8.674 < 2e-16 ***
## KitchenQualGd -5.260e+04 4.079e+03 -12.896 < 2e-16 ***
## KitchenQualTA -6.699e+04 4.533e+03 -14.778 < 2e-16 ***
## GrLivArea:sqfPrice 6.344e-01 1.668e-02 38.038 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35620 on 1448 degrees of freedom
## Multiple R-squared: 0.8005, Adjusted R-squared: 0.799
## F-statistic: 528.1 on 11 and 1448 DF, p-value: < 2.2e-16
# Plot residuals and diagnostics
par(mfrow = c(2, 2))
plot(linearFit)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
# Run preditions
predicted = predict(linearFit,testFinal)
# Check for missing values
length(predicted[is.na(predicted)])
## [1] 0
finalSubmission <- data.frame(Id = testFinal$Id, SalePrice = predicted)
testFinal[testFinal$Id == finalSubmission[!complete.cases(finalSubmission),]$ID, ]
## [1] Id YearBuilt BsmtUnfSF GrLivArea GarageYrBlt
## [6] GarageCars GarageArea YrSold Neighborhood BldgType
## [11] Exterior2nd ExterCond FullBath KitchenQual TotRmsAbvGrd
## [16] FireplaceQu PoolQC SaleCondition sqfPrice
## <0 rows> (or 0-length row.names)
# Create submission data frame
write.csv(finalSubmission, file = "finalsubmission.csv", row.names=FALSE)
# testFinal[,c("GrLivArea", "sqfPrice", "YearBuilt", "GarageCars", "ExterCond", "YrSold", "KitchenQual")]
This is about as close as I was able to get with the current data.
The flaring around the residuals vs fitted indicate that there may be another unknown variable somewhere that might possibly affect the slope of the regression line.
The Normal Q-Q Plot indicates that residuals are not completely normally distributed. This is in line with the hypothesis proposed by the Residuals vs. Fitted plot.
The scale-location plot indicates that residuals increase as the predicted value gets larger. Once again, this is as if some other variable was influencing the slope of the regression line.
The residuals vs Leverage indicates no major outliers.
The sale price appears to have a linear explanation, however there appears to be some unknown value that may possibly influence the slope of the regression line. other variables might include things such as condo fees, taxation, selling agent or crime rate.