In this project, we are presented with a train and test set of the Ames Iowa house data from Kaggle. The goal is to create a model that is capable of predicting house prices using the available variables in the dataset. The Mulitple Linear Regression model will be explored and utilized in this analysis.

Data Exploration & Cleaning

# Importing the necessary libraries and packages
library(ggplot2)
library(data.table)
library(mice)
library(caTools)
library(dplyr)
library(MASS)
# Importing the data
data <- read.csv("training.csv")
# Exploring basic information about the data
colnames(data)
 [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
 [5] "LotArea"       "Street"        "Alley"         "LotShape"     
 [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
[13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
[17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
[21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
[25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
[29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
[33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
[37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
[41] "HeatingQC"     "CentralAir"    "Electrical"    "X1stFlrSF"    
[45] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
[49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
[53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
[57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
[61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
[65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
[69] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
[73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
[77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
[81] "SalePrice"    
str(data)
'data.frame':   1060 obs. of  81 variables:
 $ Id           : int  1 3 4 5 8 10 11 12 14 15 ...
 $ MSSubClass   : int  60 60 70 60 60 190 20 60 20 20 ...
 $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ LotFrontage  : int  65 68 60 84 NA 50 70 85 91 NA ...
 $ LotArea      : int  8450 11250 9550 14260 10382 7420 11200 11924 10652 10920 ...
 $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
 $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 1 1 1 1 4 4 1 1 1 ...
 $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Utilities    : Factor w/ 1 level "AllPub": 1 1 1 1 1 1 1 1 1 1 ...
 $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 5 1 3 1 1 5 5 5 1 ...
 $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
 $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 6 7 14 17 4 19 16 6 13 ...
 $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 5 1 3 3 3 3 ...
 $ Condition2   : Factor w/ 6 levels "Artery","Feedr",..: 3 3 3 3 3 1 3 3 3 3 ...
 $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 2 1 1 1 1 ...
 $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 6 6 6 6 2 3 6 3 3 ...
 $ OverallQual  : int  7 7 7 8 7 5 5 9 7 6 ...
 $ OverallCond  : int  5 5 5 5 6 6 5 5 5 5 ...
 $ YearBuilt    : int  2003 2001 1915 2000 1973 1939 1965 2005 2006 1960 ...
 $ YearRemodAdd : int  2003 2002 1970 2000 1973 1950 1965 2006 2007 1960 ...
 $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 4 4 2 4 ...
 $ RoofMatl     : Factor w/ 7 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Exterior1st  : Factor w/ 14 levels "AsbShng","AsphShn",..: 12 12 13 12 7 8 7 14 12 8 ...
 $ Exterior2nd  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 13 15 13 7 9 7 15 13 9 ...
 $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 2 3 2 4 3 3 4 4 2 ...
 $ MasVnrArea   : int  196 162 0 350 240 0 0 286 306 212 ...
 $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 4 3 4 4 4 1 3 4 ...
 $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 3 1 3 2 1 2 3 3 2 ...
 $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 4 3 3 4 4 1 3 4 ...
 $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 2 4 4 4 4 4 4 4 ...
 $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 3 4 1 3 4 4 4 1 4 ...
 $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 3 1 3 1 3 5 3 6 2 ...
 $ BsmtFinSF1   : int  706 486 216 655 859 851 906 998 0 733 ...
 $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 2 6 6 6 6 6 ...
 $ BsmtFinSF2   : int  0 0 0 0 32 0 0 0 0 0 ...
 $ BsmtUnfSF    : int  150 434 540 490 216 140 134 177 1494 520 ...
 $ TotalBsmtSF  : int  856 920 756 1145 1107 991 1040 1175 1494 1253 ...
 $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 3 1 1 1 1 1 1 5 ...
 $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
 $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ X1stFlrSF    : int  856 920 961 1145 1107 1077 1040 1182 1494 1253 ...
 $ X2ndFlrSF    : int  854 866 756 1053 983 0 0 1142 0 0 ...
 $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
 $ GrLivArea    : int  1710 1786 1717 2198 2090 1077 1040 2324 1494 1253 ...
 $ BsmtFullBath : int  1 1 1 1 1 1 1 1 0 1 ...
 $ BsmtHalfBath : int  0 0 0 0 0 0 0 0 0 0 ...
 $ FullBath     : int  2 2 1 2 2 1 1 3 2 1 ...
 $ HalfBath     : int  1 1 0 1 1 0 0 0 0 1 ...
 $ BedroomAbvGr : int  3 3 3 4 3 2 3 4 3 2 ...
 $ KitchenAbvGr : int  1 1 1 1 1 2 1 1 1 1 ...
 $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 3 4 4 4 1 3 4 ...
 $ TotRmsAbvGrd : int  8 6 7 9 7 5 5 11 7 5 ...
 $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ Fireplaces   : int  0 1 1 1 2 2 0 2 1 1 ...
 $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 3 5 5 5 NA 3 3 2 ...
 $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 6 2 2 2 6 4 2 2 ...
 $ GarageYrBlt  : int  2003 2001 1998 2000 1973 1939 1965 2005 2006 1960 ...
 $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 3 2 2 2 3 1 2 2 ...
 $ GarageCars   : int  2 2 3 3 2 1 1 3 3 1 ...
 $ GarageArea   : int  548 608 642 836 484 205 384 736 840 352 ...
 $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 3 5 5 5 5 ...
 $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
 $ WoodDeckSF   : int  0 0 0 192 235 0 0 147 160 0 ...
 $ OpenPorchSF  : int  61 42 35 84 204 4 0 21 33 213 ...
 $ EnclosedPorch: int  0 0 272 0 228 0 0 0 0 176 ...
 $ X3SsnPorch   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
 $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA 2 ...
 $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA 3 NA NA NA NA NA ...
 $ MiscVal      : int  0 0 0 0 350 0 0 0 0 0 ...
 $ MoSold       : int  2 9 2 12 11 1 2 7 8 5 ...
 $ YrSold       : int  2008 2008 2006 2008 2009 2008 2008 2006 2007 2008 ...
 $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 7 7 9 ...
 $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 1 5 5 5 5 6 6 5 ...
 $ SalePrice    : int  208500 223500 140000 250000 200000 118000 129500 345000 279500 157000 ...

It looks like we have 1060 observations and 80 variables, excluding target variables ‘SalePrice’; 23 nominal, 23 ordinal, 14 discrete, and 20 continuous. Also, it is clear that we are missing values in our dataset. Let’s determine the distribution of the mising data.

# Checking what prcent of all the data is composed of missing values 
(sum(is.na(data))/(nrow(data)*ncol(data)))*100
[1] 5.885162
# Let's get a break down of the number of missing values in each column
missing.col <- (colSums(is.na(data))[colSums(is.na(data)) > 0]/nrow(data))*100
missing.col.names <- row.names(data.frame(missing.col))
missing.col.percent <- data.frame(missing.col)
missing.col.percent <- setorder(data.frame(missing.col), missing.col)
colnames(missing.col.percent) <- 'percent.missing'
missing.col.percent

It seems that ‘LotFrontage’, ‘FireplaceQu’, ‘Fence’, ‘Alley’, ‘MiscFeature’, and ‘PoolQC’ possess significant proportions of missing data, over 5%. More pressing is the fact that some of the variables, such as ‘PoolQC’, contain no data at all and even ‘LotFrontage’, which has the least missing data of the variables previously mentioned, is missing almost 20% of its data.

After consulting the data documentation, it seems that many of the variables use ‘NA’ as respresenting the absence of a future. For example, ‘FireplaceQu’ represents fireplace quality of a home. Now, although this variable is missing almost half of it’s data, that is due to the fact that a missing value denotes the absence of a fireplace in the home and not a missing piece of infromation of data. So, before continuing, let’s add ‘NA’ as a feature level where it is appropriate based on the data documentation.

# Adding 'NA' as a factor level
col_list <- c('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature')
data[col_list] <- lapply(data[col_list], addNA)

Now, let’s see how much missing data remains in the dataset.

(sum(is.na(data))/(nrow(data)*ncol(data)))*100
[1] 0.3004892
missing.col <- (colSums(is.na(data))[colSums(is.na(data)) > 0]/nrow(data))*100
missing.col.names <- row.names(data.frame(missing.col))
missing.col.percent <- data.frame(missing.col)
missing.col.percent <- setorder(data.frame(missing.col), missing.col)
colnames(missing.col.percent) <- 'percent.missing'
missing.col.percent
str(data[,c('MasVnrType','MasVnrArea','GarageYrBlt','LotFrontage')])
'data.frame':   1060 obs. of  4 variables:
 $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 2 3 2 4 3 3 4 4 2 ...
 $ MasVnrArea : int  196 162 0 350 240 0 0 286 306 212 ...
 $ GarageYrBlt: int  2003 2001 1998 2000 1973 1939 1965 2005 2006 1960 ...
 $ LotFrontage: int  65 68 60 84 NA 50 70 85 91 NA ...

Wow, what a difference. We have reduced the total proportion of missing data by almost half, ~6% versus ~3%, and where we had 18 variables with missing values, we now only have 4.

Now, we will use the ‘mice’ package to impute any missing values that remain. In order to prevent the injection of any personal bias, we will preemptively split the data into a train and test set first and impute the two data sets independently. Before that, let’s see if there are any columns we can, and should, remove all together because of a lack of additional predictability or information.

# Checking which variables have only a single value for all observations
for(x in colnames(data)){
  if(length(unique(data[,x])) == 1){
    print(x)
  }
}
[1] "Utilities"

We see that ‘Utilities’ gives us no additional predicitve power as the only value present for all observations is ‘AllPub’. Further more, the ‘Id’ variable serves no purpose other than identification and should o be included in the model. For these reasons, both of these columns will be removed from the dataset.

# Dropping the 'Id' and 'Utilities' columns
data <- data[, !(names(data) %in% c('Id','Utilities'))]

Let’s continue with the imputation of the reamining missing values.

# Confirming that there are no more missing values in the dataset
train.imp.miss <- (sum(is.na(train.imputed))/(nrow(train.imputed)*ncol(train.imputed)))*100
test.imp.miss <- (sum(is.na(test.imputed))/(nrow(test.imputed)*ncol(test.imputed)))*100
print("Percent missing data in imputed train set: ")
[1] "Percent missing data in imputed train set: "
print(train.imp.miss)
[1] 0
print("Percent missing data in imputed test set: ")
[1] "Percent missing data in imputed test set: "
print(test.imp.miss)
[1] 0

Now, let’s continue by exploring the correlations between the variables in the dataset

# Creating a dataframe of correlations between SalePrice and numeric vars.
train.numeric <- dplyr::select_if(data, is.numeric)
corr.matrix <- cor(train.numeric, use = 'pairwise.complete.obs')
corr.df <- data.frame(row=rownames(corr.matrix)[row(corr.matrix)[upper.tri(corr.matrix)]], col=colnames(corr.matrix)[col(corr.matrix)[upper.tri(corr.matrix)]],corr=corr.matrix[upper.tri(corr.matrix)])
corr.df <- corr.df[order(corr.df$'corr'),]
sale.price.corr <- corr.df[corr.df$col == 'SalePrice',]
head(sale.price.corr)
tail(sale.price.corr)

It looks like ‘OverallQual’, ‘GrLivArea’, ‘GarageCars’, and ‘GarageArea’ all have “significant” correlations with the response varaible, SalePrice. By “significant”, we mean a correlation above 0.6. These may be important predictors of SalePrice. Let us also visualize the distribution of SalePrice.

# Plot of SalePrice distribution
sp.plot <- ggplot(data = train.imputed, aes(x=train.imputed$SalePrice)) + geom_histogram(bins = 50, color = 'white', fill = 'salmon')
log.sp.plot <- ggplot(data = train.imputed, aes(x=log(train.imputed$SalePrice))) + geom_histogram(bins = 50, color = 'white', fill = 'salmon') 
sp.plot

log.sp.plot

The distribution does seem to become more normal when a long transform is applied. This may improve the performance of a linear model. Will need to come back later and determine if the log trandform really has a beneficial effect.

Let’s now utilize methodical approaches to select the features that will provide us with the most predictability in our model. We will explore forward, backward, and stepwise selection as well as Lasso Regression.

# Some levels present in test set that are not accounted for in train set
 train.imputed <- train.imputed[ ,!(names(train.imputed) %in% c("Exterior1st"))]
 test.imputed <- test.imputed[ ,!(names(test.imputed) %in% c("Exterior1st"))]
# Forward Selection
forward.fit <- lm(SalePrice~1, data = train.imputed)
step.f <- stepAIC(forward.fit, direction="forward", scope = formula(lm(SalePrice~., data = train.imputed)))

# Backward Selection
backward.fit <- lm(SalePrice~.,data = train.imputed)
step.b <- stepAIC(backward.fit, direction="backward")

# Stepwise Selection
stepwise.fit <- lm(SalePrice~.,data = train.imputed)
step.s <- stepAIC(stepwise.fit, direction="both")
# Selection results
step.f$call # Forward results
lm(formula = SalePrice ~ OverallQual + GrLivArea + Neighborhood + 
    RoofMatl + BsmtFinSF1 + BsmtQual + Condition2 + MSSubClass + 
    BsmtExposure + PoolQC + KitchenQual + OverallCond + GarageArea + 
    LotArea + YearBuilt + SaleCondition + TotalBsmtSF + SaleType + 
    Functional + MSZoning + Fireplaces + ExterQual + ScreenPorch + 
    YearRemodAdd + BldgType + GarageCars + MoSold + BedroomAbvGr + 
    EnclosedPorch + WoodDeckSF + MasVnrArea + TotRmsAbvGrd + 
    LandSlope + BsmtFinType1 + X3SsnPorch, data = train.imputed)
print("######################")
[1] "######################"
step.b$call # Backward results
lm(formula = SalePrice ~ MSSubClass + MSZoning + LotArea + LandContour + 
    LandSlope + Neighborhood + Condition1 + Condition2 + BldgType + 
    OverallQual + OverallCond + YearBuilt + YearRemodAdd + RoofMatl + 
    MasVnrArea + ExterQual + BsmtQual + BsmtExposure + BsmtFinType1 + 
    BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + 
    LowQualFinSF + BedroomAbvGr + KitchenQual + TotRmsAbvGrd + 
    Functional + Fireplaces + GarageCars + WoodDeckSF + EnclosedPorch + 
    X3SsnPorch + ScreenPorch + PoolQC + MoSold + SaleType + SaleCondition, 
    data = train.imputed)
print("######################")
[1] "######################"
step.s$call # Stepwise results
lm(formula = SalePrice ~ MSSubClass + MSZoning + LotArea + LandContour + 
    LandSlope + Neighborhood + Condition1 + Condition2 + BldgType + 
    OverallQual + OverallCond + YearBuilt + YearRemodAdd + RoofMatl + 
    MasVnrArea + ExterQual + BsmtQual + BsmtExposure + BsmtFinType1 + 
    BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + 
    LowQualFinSF + BedroomAbvGr + KitchenQual + TotRmsAbvGrd + 
    Functional + Fireplaces + GarageCars + WoodDeckSF + EnclosedPorch + 
    X3SsnPorch + ScreenPorch + PoolQC + MoSold + SaleType + SaleCondition, 
    data = train.imputed)

Model Building

Let’s predict the ‘SalePrice’, of each test observation, to be the average of all the train ‘SalePrice’ values, and use the resulting metrics as a benchmark against which all following model performances will be compared.

# Let's predict the average for each test observation and use it as a benchmark
sale.price.mean <- mean(train.imputed$SalePrice)
avg.vector <- rep(c(sale.price.mean), times = 212)
cat("Bias: ", mean(avg.vector-test.y))
Bias:  7299.447
cat("\nMaximum Deviation: ", max(avg.vector-test.y))

Maximum Deviation:  146626.7
cat("\nMean Absolute Deviation: ", mean(abs(avg.vector-test.y)))

Mean Absolute Deviation:  53682.19
cat("\nMean Square Error: ", mean((avg.vector-test.y)**2))

Mean Square Error:  4959006479
cat("\nRoot Mean Square Error: ", sqrt(mean((avg.vector-test.y)**2)))

Root Mean Square Error:  70420.21

Moving on to actually building the models.

# Linear model with forward selected features
forward.lm <- lm(formula(step.f$call), data = train.imputed)
forward.predictions <- predict(forward.lm, test.imputed)
prediction from a rank-deficient fit may be misleading
summary(forward.lm)$adj.r.squared
[1] 0.9130982
cat("Bias: ", mean(forward.predictions-test.y))
Bias:  2181.083
cat("\nMaximum Deviation: ", max(forward.predictions-test.y))

Maximum Deviation:  95389.27
cat("\nMean Absolute Deviation: ", mean(abs(forward.predictions-test.y)))

Mean Absolute Deviation:  17577.36
cat("\nMean Square Error: ", mean((forward.predictions-test.y)**2))

Mean Square Error:  685348829
cat("\nRoot Mean Square Error: ", sqrt(mean((forward.predictions-test.y)**2)))

Root Mean Square Error:  26179.17

Linear model with backward selected features

# Linear model with backward selected features
backward.lm <- lm(formula(step.b$call), data = train.imputed)
backward.predictions <- predict(backward.lm, test.imputed)
prediction from a rank-deficient fit may be misleading
summary(backward.lm)$adj.r.squared
[1] 0.9139194
cat("Bias: ", mean(backward.predictions-test.y))
Bias:  2065.778
cat("\nMaximum Deviation: ", max(backward.predictions-test.y))

Maximum Deviation:  101633.9
cat("\nMean Absolute Deviation: ", mean(abs(backward.predictions-test.y)))

Mean Absolute Deviation:  17648.72
cat("\nMean Square Error: ", mean((backward.predictions-test.y)**2))

Mean Square Error:  679802238
cat("\nRoot Mean Square Error: ", sqrt(mean((backward.predictions-test.y)**2)))

Root Mean Square Error:  26073.02

Linear model with stepwise selected features

# Linear model with stepwise selected features
stepwise.lm <- lm(formula(step.s$call), data = train.imputed)
stepwise.predictions <- predict(stepwise.lm, test.imputed)
prediction from a rank-deficient fit may be misleading
summary(stepwise.lm)$adj.r.squared
[1] 0.9139194
cat("Bias: ", mean(stepwise.predictions-test.y))
Bias:  2065.778
cat("\nMaximum Deviation: ", max(stepwise.predictions-test.y))

Maximum Deviation:  101633.9
cat("\nMean Absolute Deviation: ", mean(abs(stepwise.predictions-test.y)))

Mean Absolute Deviation:  17648.72
cat("\nMean Square Error: ", mean((stepwise.predictions-test.y)**2))

Mean Square Error:  679802238
cat("\nRoot Mean Square Error: ", sqrt(mean((stepwise.predictions-test.y)**2)))

Root Mean Square Error:  26073.02

LASSO Regression

lr.test <- as.matrix(data = test.imputed)
Error in as.matrix.data.frame(data = test.imputed) : 
  argument "x" is missing, with no default
fit.lasso <- glmnet(lr.train, train.imputed$SalePrice, alpha = 1, lambda = 7564.633)
lasso.predict <- predict(fit.lasso, lr.test)
cat("Bias: ", mean(lasso.predict-test.y))
Bias:  -786.1837
cat("\nMaximum Deviation: ", max(lasso.predict-test.y))

Maximum Deviation:  76364.84
cat("\nMean Absolute Deviation: ", mean(abs(lasso.predict-test.y)))

Mean Absolute Deviation:  21316.57
cat("\nMean Square Error: ", mean((lasso.predict-test.y)**2))

Mean Square Error:  902434514
cat("\nRoot Mean Square Error: ", sqrt(mean((lasso.predict-test.y)**2)))

Root Mean Square Error:  30040.55

Results

Now let’s check the diagnostics for the forward selected linear model as it seems to be the ‘best’ model because it contains the variable with the strongest correlation with the response variable.

plot(forward.lm)
not plotting observations with leverage one:
  6, 76, 330, 351, 357, 397, 454, 559, 592, 689, 739, 755, 807, 831

not plotting observations with leverage one:
  6, 76, 330, 351, 357, 397, 454, 559, 592, 689, 739, 755, 807, 831

plot(test.y,forward.predictions) +
abline(0,1, col = 'red');
integer(0)

Conclusions

---
title: "GLM Project (Preliminary Findings)"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

In this project, we are presented with a train and test set of the Ames Iowa house data from Kaggle. The goal is to create a model that is capable of predicting house prices using the available variables in the dataset. The Mulitple Linear Regression model will be explored and utilized in this analysis. 

* [Data Exploration & Cleaning](#dataexplorationandcleaning)
* [Model Building](#modelbuilding)
* [Results](#results)
* [Conclusions](#conclusions)

## Data Exploration & Cleaning {#dataexplorationandcleaning}

```{r,message=FALSE}
# Importing the necessary libraries and packages
library(ggplot2)
library(data.table)
library(mice)
library(caTools)
library(dplyr)
library(MASS)
library(glmnet)
```

```{r}
# Importing the data
data <- read.csv("training.csv")
```

```{r}
# Exploring basic information about the data
colnames(data)
str(data)
```

It looks like we have 1060 observations and 80 variables, excluding target variables 'SalePrice'; 23 nominal, 23 ordinal, 14 discrete, and 20 continuous. Also, it is clear that we are missing values in our dataset. Let's determine the distribution of the mising data.

```{r}
# Checking what prcent of all the data is composed of missing values 
(sum(is.na(data))/(nrow(data)*ncol(data)))*100
```

```{r}
# Let's get a break down of the number of missing values in each column

missing.col <- (colSums(is.na(data))[colSums(is.na(data)) > 0]/nrow(data))*100
missing.col.names <- row.names(data.frame(missing.col))
missing.col.percent <- data.frame(missing.col)
missing.col.percent <- setorder(data.frame(missing.col), missing.col)
colnames(missing.col.percent) <- 'percent.missing'
missing.col.percent
```

It seems that 'LotFrontage', 'FireplaceQu', 'Fence', 'Alley', 'MiscFeature', and 'PoolQC' possess significant proportions of missing data, over 5%. More pressing is the fact that some of the variables, such as 'PoolQC', contain no data at all and even 'LotFrontage', which has the least missing data of the variables previously mentioned, is missing almost 20% of its data.

After consulting the data documentation, it seems that many of the variables use 'NA' as respresenting the absence of a future. For example, 'FireplaceQu' represents fireplace quality of a home. Now, although this variable is missing almost half of it's data, that is due to the fact that a missing value denotes the absence of a fireplace in the home and not a missing piece of infromation of data. So, before continuing, let's add 'NA' as a feature level where it is appropriate based on the data documentation.

```{r}
# Adding 'NA' as a factor level

col_list <- c('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature')

data[col_list] <- lapply(data[col_list], addNA)
```

Now, let's see how much missing data remains in the dataset.

```{r}
(sum(is.na(data))/(nrow(data)*ncol(data)))*100
```

```{r}
missing.col <- (colSums(is.na(data))[colSums(is.na(data)) > 0]/nrow(data))*100
missing.col.names <- row.names(data.frame(missing.col))
missing.col.percent <- data.frame(missing.col)
missing.col.percent <- setorder(data.frame(missing.col), missing.col)
colnames(missing.col.percent) <- 'percent.missing'
missing.col.percent
```

```{r}
str(data[,c('MasVnrType','MasVnrArea','GarageYrBlt','LotFrontage')])
```

Wow, what a difference. We have reduced the total proportion of missing data by almost half, ~6% versus ~3%, and where we had 18 variables with missing values, we now only have 4. 

Now, we will use the 'mice' package to impute any missing values that remain. In order to prevent the injection of any personal bias, we will preemptively split the data into a train and test set first and impute the two data sets independently. Before that, let's see if there are any columns we can, and should, remove all together because of a lack of additional predictability or information.

```{r}
# Checking which variables have only a single value for all observations
for(x in colnames(data)){
  if(length(unique(data[,x])) == 1){
    print(x)
  }
}
```

We see that 'Utilities' gives us no additional predicitve power as the only value present for all observations is 'AllPub'. Further more, the 'Id' variable serves no purpose other than identification and should o be included in the model. For these reasons, both of these columns will be removed from the dataset.

```{r}
# Dropping the 'Id' and 'Utilities' columns
data <- data[, !(names(data) %in% c('Id','Utilities'))]
```

Let's continue with the imputation of the reamining missing values.

```{r,results='hide'}
# Creating a train and test split
set.seed(2019)
split <- sample.split(data$SalePrice, SplitRatio = 0.8)
train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)
test.x <- test[ ,!(names(test) %in% c("SalePrice"))]
test.y <- test$SalePrice

# Imputing missing values
train.mids <- mice(train, m=1, method='cart', seed=2019)
test.mids <- mice(test.x, m=1, method='cart', seed=2019)

train.imputed <- complete(train.mids,1)
test.imputed <- complete(test.mids,1)
``` 

```{r}
# Confirming that there are no more missing values in the dataset
train.imp.miss <- (sum(is.na(train.imputed))/(nrow(train.imputed)*ncol(train.imputed)))*100
test.imp.miss <- (sum(is.na(test.imputed))/(nrow(test.imputed)*ncol(test.imputed)))*100
print("Percent missing data in imputed train set: ")
print(train.imp.miss)
print("Percent missing data in imputed test set: ")
print(test.imp.miss)
```

Now, let's continue by exploring the correlations between the variables in the dataset 

```{r}
# Creating a dataframe of correlations between SalePrice and numeric vars.
train.numeric <- dplyr::select_if(data, is.numeric)
corr.matrix <- cor(train.numeric, use = 'pairwise.complete.obs')
corr.df <- data.frame(row=rownames(corr.matrix)[row(corr.matrix)[upper.tri(corr.matrix)]], col=colnames(corr.matrix)[col(corr.matrix)[upper.tri(corr.matrix)]],corr=corr.matrix[upper.tri(corr.matrix)])
corr.df <- corr.df[order(corr.df$'corr'),]
sale.price.corr <- corr.df[corr.df$col == 'SalePrice',]
head(sale.price.corr)
tail(sale.price.corr)
```

It looks like 'OverallQual', 'GrLivArea', 'GarageCars', and 'GarageArea' all have "significant" correlations with the response varaible, SalePrice. By "significant", we mean a correlation above 0.6. These may be important predictors of SalePrice. Let us also visualize the distribution of SalePrice.

```{r}
# Plot of SalePrice distribution
sp.plot <- ggplot(data = train.imputed, aes(x=train.imputed$SalePrice)) + geom_histogram(bins = 50, color = 'white', fill = 'salmon')

log.sp.plot <- ggplot(data = train.imputed, aes(x=log(train.imputed$SalePrice))) + geom_histogram(bins = 50, color = 'white', fill = 'salmon') 

sp.plot
log.sp.plot
```

The distribution does seem to become more normal when a long transform is applied. This may improve the performance of a linear model. Will need to come back later and determine if the log trandform really has a beneficial effect. 

Let's now utilize methodical approaches to select the features that will provide us with the most predictability in our model. We will explore forward, backward, and stepwise selection as well as Lasso Regression.

```{r}
# Some levels present in test set that are not accounted for in train set
 train.imputed <- train.imputed[ ,!(names(train.imputed) %in% c("Exterior1st"))]
 test.imputed <- test.imputed[ ,!(names(test.imputed) %in% c("Exterior1st"))]
```

```{r}
# Forward Selection
forward.fit <- lm(SalePrice~1, data = train.imputed)
step.f <- stepAIC(forward.fit, direction="forward", scope = formula(lm(SalePrice~., data = train.imputed)))

# Backward Selection
backward.fit <- lm(SalePrice~.,data = train.imputed)
step.b <- stepAIC(backward.fit, direction="backward")

# Stepwise Selection
stepwise.fit <- lm(SalePrice~.,data = train.imputed)
step.s <- stepAIC(stepwise.fit, direction="both")
```

```{r}
# Selection results
step.f$call # Forward results
print("######################")
step.b$call # Backward results
print("######################")
step.s$call # Stepwise results
```


## Model Building {#modelbuilding}

Let's predict the 'SalePrice', of each test observation, to be the average of all the train 'SalePrice' values, and use the resulting metrics as a benchmark against which all following model performances will be compared.

```{r}
# Let's predict the average for each test observation and use it as a benchmark
sale.price.mean <- mean(train.imputed$SalePrice)
avg.vector <- rep(c(sale.price.mean), times = 212)

cat("Bias: ", mean(avg.vector-test.y))
cat("\nMaximum Deviation: ", max(avg.vector-test.y))
cat("\nMean Absolute Deviation: ", mean(abs(avg.vector-test.y)))
cat("\nMean Square Error: ", mean((avg.vector-test.y)**2))
cat("\nRoot Mean Square Error: ", sqrt(mean((avg.vector-test.y)**2)))
```

Moving on to actually building the models.

```{r}
# Linear model with forward selected features
forward.lm <- lm(formula(step.f$call), data = train.imputed)
forward.predictions <- predict(forward.lm, test.imputed)

summary(forward.lm)$adj.r.squared
cat("Bias: ", mean(forward.predictions-test.y))
cat("\nMaximum Deviation: ", max(forward.predictions-test.y))
cat("\nMean Absolute Deviation: ", mean(abs(forward.predictions-test.y)))
cat("\nMean Square Error: ", mean((forward.predictions-test.y)**2))
cat("\nRoot Mean Square Error: ", sqrt(mean((forward.predictions-test.y)**2)))
```

Linear model with backward selected features

```{r}
# Linear model with backward selected features
backward.lm <- lm(formula(step.b$call), data = train.imputed)
backward.predictions <- predict(backward.lm, test.imputed)

summary(backward.lm)$adj.r.squared
cat("Bias: ", mean(backward.predictions-test.y))
cat("\nMaximum Deviation: ", max(backward.predictions-test.y))
cat("\nMean Absolute Deviation: ", mean(abs(backward.predictions-test.y)))
cat("\nMean Square Error: ", mean((backward.predictions-test.y)**2))
cat("\nRoot Mean Square Error: ", sqrt(mean((backward.predictions-test.y)**2)))
```

Linear model with stepwise selected features

```{r}
# Linear model with stepwise selected features
stepwise.lm <- lm(formula(step.s$call), data = train.imputed)
stepwise.predictions <- predict(stepwise.lm, test.imputed)

summary(stepwise.lm)$adj.r.squared
cat("Bias: ", mean(stepwise.predictions-test.y))
cat("\nMaximum Deviation: ", max(stepwise.predictions-test.y))
cat("\nMean Absolute Deviation: ", mean(abs(stepwise.predictions-test.y)))
cat("\nMean Square Error: ", mean((stepwise.predictions-test.y)**2))
cat("\nRoot Mean Square Error: ", sqrt(mean((stepwise.predictions-test.y)**2)))
```

LASSO Regression

```{r}
# LASSO Regression
lr.train <- model.matrix(SalePrice~., data = train.imputed)
test.imputed$SalePrice <- test.y
lr.test <- model.matrix(SalePrice~., data = test.imputed)
grid <- 10^seq(4, -2, length = 100)

set.seed (2019)
cv.lasso <- cv.glmnet(lr.train, train$SalePrice, alpha = 1, lambda = grid)
bestlam <- cv.lasso$lambda.min
bestlam # Best lambda = 7564.633
```

```{r}
fit.lasso <- glmnet(lr.train, train.imputed$SalePrice, alpha = 1, lambda = 7564.633)
lasso.predict <- predict(fit.lasso, lr.test)

cat("Bias: ", mean(lasso.predict-test.y))
cat("\nMaximum Deviation: ", max(lasso.predict-test.y))
cat("\nMean Absolute Deviation: ", mean(abs(lasso.predict-test.y)))
cat("\nMean Square Error: ", mean((lasso.predict-test.y)**2))
cat("\nRoot Mean Square Error: ", sqrt(mean((lasso.predict-test.y)**2)))
```


## Results {#results}

Now let's check the diagnostics for the forward selected linear model as it seems to be the 'best' model because it contains the variable with the strongest correlation with the response variable. 

```{r}
#Residual Plots
plot(forward.lm)
```

```{r}
# Plot of predicted versus true reponse test values
plot(test.y,forward.predictions) +
abline(0,1, col = 'red')
```

## Conclusions {#conclusions}

...

