#suppress warnings globally
options(warn=-1)
options(scipen = 999) #suppress the appearance of scientific notation throughout the analysis
rm(list=ls())
In order to put my newly acquired data science skills to the test, I decided to the explore the well-known Ames Housing dataset. This particular dataset documents 2006-2010 actual house prices in the City of Ames, Iowa. The dataset details the price of houses and is accompanied by a broad range of variables. The type of variables contained in this dataset primarily pertain to the quality, dimensions and quantity of physical characteristics associated with the respective properties. Generally, these covariates are the sort of features that prospective buyers are likely to be interested in, features such as the age of the property, the quality of the exterior and how many bathrooms the property holds. I refer the curious reader to the following article, where an in-depth rundown of the various features can be obtained.
EDA:EXPLORATORY DATA ANALYSIS
MODELLING
Employing an alpha level of 0.05, Overall Quality (0.86), Overall area (78.8), External Quality (69.4) and the number of garage cars (0.66) turned out to have the highest significant correlations with Sale Price. The age of the house (-0.56) and time since remodelled (-0.53) had the highest significant negative correlations with Sale Price. Pearson’s correlation coefficient (r) was employed- this metric quantifies linear association. Of course, Pearson’s correlation is unsuitable for adequately capturing curvilinear relationships between variables. The house prices ranged from $12,789 to $755,000, with the average price equating to $ 180,796 (standard deviation= $79,886.69).
I employed a range of predictive algorithms in order to discover an algorithm that would yield optimally low residual error. Of the algorithms tested, Extreme Gradient Boosting and Random Forest algorithms demonstrated the best predictive performances. I was able to obtain a mean absolute error (MAE) of $15,045 using extreme gradient boosting on unseen data (segregated validation set) and a MAE of $15,981 by employing the random forest algorithm. In terms of the features with the most important global/overall importance, both the random forest model and the extreme gradient boosting model agreed that Overall Area (of the interior house) and Overall Quality were the top two primary drivers responsible for estimating Sale Price. Additionally, I ran a classic multiple linear regression: the most important features for prediction turned out to be Overall Quality in first place, with Overall Area being the fourth most important feature. Thus, given the consensus between these differing algorithms, our confidence increases that these variables are indeed important predictors for determining sale price. As sale price increased, all of the models tested struggled to consistently and accurately predict the respective house prices. One potential hypothesis - admittedly, a tentative one- that could account for the deficit in predictive capacity in these upper price regions, is that such prices are less likely to be shaped by ordinary consumer behaviour. Buyers operating in the higher price region are not constrained by the usual economic push/pull factors; in other words, the liberated movement of human whim is introducing volatility.
AVENUES FOR FURTHER STUDY
Consider experimentation with polynomial terms/log transformations/ splines to improve modelling.
Consider reintroducing some of the discarded variables.
I recruited the following libraries at various stages throughout the modelling process
library(ggplot2)
library(caret)
library(tidyr)
library(DMwR)
library(missForest)
library(randomForest)
library(tidymodels)
library(ranger)
library(recipes)
library(workflows)
library(themis)
library(xgboost)
library(ggdark)
library(viridis)
library(DALEXtra)
library(ggthemes)
library(lares)
library(kableExtra)
library(tidyr)
library(tidyverse)
library(reshape)
library(dplyr)
library(modelStudio)
library(DALEXtra)
library(DALEX)
library(tidymodels)
library(vip)
I decided to define a couple of customised visualisation themes at the outset of the project; I wanted to be able to easily format the visual output of the analysis in a consistent manner. I feel that visual consistency enhances the fluidity of the reading process.
ISAAC.THEME <- theme(plot.title = element_text(hjust = 0.5), # title centralised
plot.background = element_rect(fill="black"), #background to be black
panel.background = element_rect(fill="gray20"), # panel background to be dark grey
panel.grid.minor = element_line(color="black"), # grid lines to be removed
panel.grid.major = element_line(color="black"), # grid lines to be removed
axis.text = element_text(color="white"), # axis text to appear white
title = element_text(color="white", face="bold"), # title to be white and bold.
legend.background = element_rect(fill="black"), # legend background appear as black
legend.text = element_text(color="white", size=20), # legend text to appear white
legend.key = element_rect(fill="black", color="black"),
legend.position = "top")
ISAAC.THEME2 <- theme(plot.title = element_text(hjust = 0.5), # title centralised
plot.background = element_rect(fill="black"), # background to be black
panel.background = element_rect(fill="gray20"), # panel background to be grey
panel.grid.minor = element_line(color="black"), # grid lines to be removed
panel.grid.major = element_line(color="black"), # grid lines to be removed
axis.text = element_text(color="white"), # axis text to be white
title = element_text(color="white", face="bold"), # title to appear white and bold.
legend.background = element_rect(fill="black"), # legend of background should appear black
legend.text = element_text(color=NULL, size=12 ), # legend text should appear white
legend.key = element_rect(fill=NULL),
legend.position = 'none')
I had previously saved the dataset in on my computer. I commenced the analysis by loading the respective dataset in its raw form.
load("ames.RData")
In accordance with tradition, I kicked-off the analysis by inspecting the broad structure of the data provided. I wanted to glance the contents of the dataset.
df_str(ames, return = "plot")
## Other available 'return' options: "skimr", "numbers", "names", "distr"
RAW DATA
I wished to obtain more insight into the contents of the data in its unprocessed form; I decided to inspect the first fifteen rows:
| Order | PID | MS.SubClass | MS.Zoning | Lot.Frontage | Lot.Area | Street | Alley | Lot.Shape | Land.Contour | Utilities | Lot.Config | Land.Slope | Neighborhood | Condition.1 | Condition.2 | Bldg.Type | House.Style | Overall.Qual | Overall.Cond | Year.Built | Year.Remod.Add | Roof.Style | Roof.Matl | Exterior.1st | Exterior.2nd | Mas.Vnr.Type | Mas.Vnr.Area | Exter.Qual | Exter.Cond | Foundation | Bsmt.Qual | Bsmt.Cond | Bsmt.Exposure | BsmtFin.Type.1 | BsmtFin.SF.1 | BsmtFin.Type.2 | BsmtFin.SF.2 | Bsmt.Unf.SF | Total.Bsmt.SF | Heating | Heating.QC | Central.Air | Electrical | X1st.Flr.SF | X2nd.Flr.SF | Low.Qual.Fin.SF | Gr.Liv.Area | Bsmt.Full.Bath | Bsmt.Half.Bath | Full.Bath | Half.Bath | Bedroom.AbvGr | Kitchen.AbvGr | Kitchen.Qual | TotRms.AbvGrd | Functional | Fireplaces | Fireplace.Qu | Garage.Type | Garage.Yr.Blt | Garage.Finish | Garage.Cars | Garage.Area | Garage.Qual | Garage.Cond | Paved.Drive | Wood.Deck.SF | Open.Porch.SF | Enclosed.Porch | X3Ssn.Porch | Screen.Porch | Pool.Area | Pool.QC | Fence | Misc.Feature | Misc.Val | Mo.Sold | Yr.Sold | Sale.Type | Sale.Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 526301100 | 20 | RL | 141 | 31770 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 5 | 1960 | 1960 | Hip | CompShg | BrkFace | Plywood | Stone | 112 | TA | TA | CBlock | TA | Gd | Gd | BLQ | 639 | Unf | 0 | 441 | 1080 | GasA | Fa | Y | SBrkr | 1656 | 0 | 0 | 1656 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 7 | Typ | 2 | Gd | Attchd | 1960 | Fin | 2 | 528 | TA | TA | P | 210 | 62 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2010 | WD | Normal | 215000 |
| 2 | 526350040 | 20 | RH | 80 | 11622 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | 1Fam | 1Story | 5 | 6 | 1961 | 1961 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | CBlock | TA | TA | No | Rec | 468 | LwQ | 144 | 270 | 882 | GasA | TA | Y | SBrkr | 896 | 0 | 0 | 896 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1961 | Unf | 1 | 730 | TA | TA | Y | 140 | 0 | 0 | 0 | 120 | 0 | NA | MnPrv | NA | 0 | 6 | 2010 | WD | Normal | 105000 |
| 3 | 526351010 | 20 | RL | 81 | 14267 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 6 | 1958 | 1958 | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | 108 | TA | TA | CBlock | TA | TA | No | ALQ | 923 | Unf | 0 | 406 | 1329 | GasA | TA | Y | SBrkr | 1329 | 0 | 0 | 1329 | 0 | 0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | NA | Attchd | 1958 | Unf | 1 | 312 | TA | TA | Y | 393 | 36 | 0 | 0 | 0 | 0 | NA | NA | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 4 | 526353030 | 20 | RL | 93 | 11160 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1968 | 1968 | Hip | CompShg | BrkFace | BrkFace | None | 0 | Gd | TA | CBlock | TA | TA | No | ALQ | 1065 | Unf | 0 | 1045 | 2110 | GasA | Ex | Y | SBrkr | 2110 | 0 | 0 | 2110 | 1 | 0 | 2 | 1 | 3 | 1 | Ex | 8 | Typ | 2 | TA | Attchd | 1968 | Fin | 2 | 522 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2010 | WD | Normal | 244000 |
| 5 | 527105010 | 60 | RL | 74 | 13830 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 5 | 5 | 1997 | 1998 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | Gd | TA | No | GLQ | 791 | Unf | 0 | 137 | 928 | GasA | Gd | Y | SBrkr | 928 | 701 | 0 | 1629 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1997 | Fin | 2 | 482 | TA | TA | Y | 212 | 34 | 0 | 0 | 0 | 0 | NA | MnPrv | NA | 0 | 3 | 2010 | WD | Normal | 189900 |
| 6 | 527105030 | 60 | RL | 78 | 9978 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 6 | 1998 | 1998 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 20 | TA | TA | PConc | TA | TA | No | GLQ | 602 | Unf | 0 | 324 | 926 | GasA | Ex | Y | SBrkr | 926 | 678 | 0 | 1604 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 1998 | Fin | 2 | 470 | TA | TA | Y | 360 | 36 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 6 | 2010 | WD | Normal | 195500 |
| 7 | 527127150 | 120 | RL | 41 | 4920 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 2001 | 2001 | Gable | CompShg | CemntBd | CmentBd | None | 0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 616 | Unf | 0 | 722 | 1338 | GasA | Ex | Y | SBrkr | 1338 | 0 | 0 | 1338 | 1 | 0 | 2 | 0 | 2 | 1 | Gd | 6 | Typ | 0 | NA | Attchd | 2001 | Fin | 2 | 582 | TA | TA | Y | 0 | 0 | 170 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2010 | WD | Normal | 213500 |
| 8 | 527145080 | 120 | RL | 43 | 5005 | Pave | NA | IR1 | HLS | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 1992 | 1992 | Gable | CompShg | HdBoard | HdBoard | None | 0 | Gd | TA | PConc | Gd | TA | No | ALQ | 263 | Unf | 0 | 1017 | 1280 | GasA | Ex | Y | SBrkr | 1280 | 0 | 0 | 1280 | 0 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 0 | NA | Attchd | 1992 | RFn | 2 | 506 | TA | TA | Y | 0 | 82 | 0 | 0 | 144 | 0 | NA | NA | NA | 0 | 1 | 2010 | WD | Normal | 191500 |
| 9 | 527146030 | 120 | RL | 39 | 5389 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 1995 | 1996 | Gable | CompShg | CemntBd | CmentBd | None | 0 | Gd | TA | PConc | Gd | TA | No | GLQ | 1180 | Unf | 0 | 415 | 1595 | GasA | Ex | Y | SBrkr | 1616 | 0 | 0 | 1616 | 1 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 1 | TA | Attchd | 1995 | RFn | 2 | 608 | TA | TA | Y | 237 | 152 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 3 | 2010 | WD | Normal | 236500 |
| 10 | 527162130 | 60 | RL | 60 | 7500 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1999 | 1999 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | TA | TA | No | Unf | 0 | Unf | 0 | 994 | 994 | GasA | Gd | Y | SBrkr | 1028 | 776 | 0 | 1804 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | TA | Attchd | 1999 | Fin | 2 | 442 | TA | TA | Y | 140 | 60 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 6 | 2010 | WD | Normal | 189000 |
| 11 | 527163010 | 60 | RL | 75 | 10000 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 5 | 1993 | 1994 | Gable | CompShg | HdBoard | HdBoard | None | 0 | TA | TA | PConc | Gd | TA | No | Unf | 0 | Unf | 0 | 763 | 763 | GasA | Gd | Y | SBrkr | 763 | 892 | 0 | 1655 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 1 | TA | Attchd | 1993 | Fin | 2 | 440 | TA | TA | Y | 157 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2010 | WD | Normal | 175900 |
| 12 | 527165230 | 20 | RL | NA | 7980 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 1Story | 6 | 7 | 1992 | 2007 | Gable | CompShg | HdBoard | HdBoard | None | 0 | TA | Gd | PConc | Gd | TA | No | ALQ | 935 | Unf | 0 | 233 | 1168 | GasA | Ex | Y | SBrkr | 1187 | 0 | 0 | 1187 | 1 | 0 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 0 | NA | Attchd | 1992 | Fin | 2 | 420 | TA | TA | Y | 483 | 21 | 0 | 0 | 0 | 0 | NA | GdPrv | Shed | 500 | 3 | 2010 | WD | Normal | 185000 |
| 13 | 527166040 | 60 | RL | 63 | 8402 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 5 | 1998 | 1998 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | Gd | TA | No | Unf | 0 | Unf | 0 | 789 | 789 | GasA | Gd | Y | SBrkr | 789 | 676 | 0 | 1465 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 1 | Gd | Attchd | 1998 | Fin | 2 | 393 | TA | TA | Y | 0 | 75 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2010 | WD | Normal | 180400 |
| 14 | 527180040 | 20 | RL | 85 | 10176 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1990 | 1990 | Gable | CompShg | HdBoard | HdBoard | None | 0 | TA | TA | PConc | Gd | TA | Gd | GLQ | 637 | Unf | 0 | 663 | 1300 | GasA | Gd | Y | SBrkr | 1341 | 0 | 0 | 1341 | 1 | 0 | 1 | 1 | 2 | 1 | Gd | 5 | Typ | 1 | Po | Attchd | 1990 | Unf | 2 | 506 | TA | TA | Y | 192 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2010 | WD | Normal | 171500 |
| 15 | 527182190 | 120 | RL | NA | 6820 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 1985 | 1985 | Gable | CompShg | HdBoard | HdBoard | None | 0 | Gd | TA | PConc | Gd | TA | Av | GLQ | 368 | BLQ | 1120 | 0 | 1488 | GasA | TA | Y | SBrkr | 1502 | 0 | 0 | 1502 | 1 | 0 | 1 | 1 | 1 | 1 | Gd | 4 | Typ | 0 | NA | Attchd | 1985 | RFn | 2 | 528 | TA | TA | Y | 0 | 54 | 0 | 0 | 140 | 0 | NA | NA | NA | 0 | 6 | 2010 | WD | Normal | 212000 |
CLEARER OVERVIEW USING PSYCH
I recruited the psych package to obtain summary statistics regarding the variability of the data, including the kurtosis and skew of the respective data.
psych::describe(ames) %>%
kable() %>% kable_styling(bootstrap_options = "striped") %>%
scroll_box(width = "100%", height = "300px")
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Order | 1 | 2930 | 1465.5000000 | 845.9624696 | 1465.5 | 1465.5000000 | 1086.0045 | 1 | 2930 | 2929 | 0.0000000 | -1.2012287 | 15.6284996 |
| PID | 2 | 2930 | 714464496.9887372 | 188730844.6493901 | 535453620.0 | 712423200.4014505 | 12373193.9730 | 526301100 | 1007100110 | 480799010 | 0.0558287 | -1.9944772 | 3486655.7767905 |
| MS.SubClass | 3 | 2930 | 57.3873720 | 42.6380246 | 50.0 | 49.7632253 | 44.4780 | 20 | 190 | 170 | 1.3561897 | 1.3793717 | 0.7877044 |
| MS.Zoning* | 4 | 2930 | 5.9672355 | 0.8656517 | 6.0 | 6.0720990 | 0.0000 | 1 | 7 | 6 | -2.6139550 | 8.4051820 | 0.0159922 |
| Lot.Frontage | 5 | 2440 | 69.2245902 | 23.3653350 | 68.0 | 68.3524590 | 17.7912 | 21 | 313 | 292 | 1.4972247 | 11.1977390 | 0.4730174 |
| Lot.Area | 6 | 2930 | 10147.9218430 | 7880.0177594 | 9436.5 | 9481.0499147 | 3024.5040 | 1300 | 215245 | 213945 | 12.8077740 | 264.3869706 | 145.5772081 |
| Street* | 7 | 2930 | 1.9959044 | 0.0638763 | 2.0 | 2.0000000 | 0.0000 | 1 | 2 | 1 | -15.5217245 | 239.0055030 | 0.0011801 |
| Alley* | 8 | 198 | 1.3939394 | 0.4898603 | 1.0 | 1.3687500 | 0.0000 | 1 | 2 | 1 | 0.4308369 | -1.8235128 | 0.0348129 |
| Lot.Shape* | 9 | 2930 | 2.9402730 | 1.4121054 | 4.0 | 3.0503413 | 0.0000 | 1 | 4 | 3 | -0.6062653 | -1.6026962 | 0.0260876 |
| Land.Contour* | 10 | 2930 | 3.7778157 | 0.7031991 | 4.0 | 3.9982935 | 0.0000 | 1 | 4 | 3 | -3.1226507 | 8.4362234 | 0.0129911 |
| Utilities* | 11 | 2930 | 1.0017065 | 0.0554058 | 1.0 | 1.0000000 | 0.0000 | 1 | 3 | 2 | 34.0201767 | 1187.9571398 | 0.0010236 |
| Lot.Config* | 12 | 2930 | 4.0552901 | 1.6039218 | 5.0 | 4.3191126 | 0.0000 | 1 | 5 | 4 | -1.1937666 | -0.4447109 | 0.0296312 |
| Land.Slope* | 13 | 2930 | 1.0535836 | 0.2483042 | 1.0 | 1.0000000 | 0.0000 | 1 | 3 | 2 | 4.9830595 | 26.6246706 | 0.0045872 |
| Neighborhood* | 14 | 2930 | 15.2952218 | 7.0220750 | 16.0 | 15.3993174 | 8.8956 | 1 | 28 | 27 | -0.1960192 | -1.1853548 | 0.1297274 |
| Condition.1* | 15 | 2930 | 3.0402730 | 0.8724077 | 3.0 | 3.0000000 | 0.0000 | 1 | 9 | 8 | 2.9876854 | 15.7385319 | 0.0161171 |
| Condition.2* | 16 | 2930 | 3.0020478 | 0.2090376 | 3.0 | 3.0000000 | 0.0000 | 1 | 8 | 7 | 12.0767140 | 308.9729348 | 0.0038618 |
| Bldg.Type* | 17 | 2930 | 1.5170648 | 1.2188973 | 1.0 | 1.1719283 | 0.0000 | 1 | 5 | 4 | 2.1518391 | 3.0038936 | 0.0225182 |
| House.Style* | 18 | 2930 | 4.0238908 | 1.9106293 | 3.0 | 4.0102389 | 0.0000 | 1 | 8 | 7 | 0.3211783 | -0.9501428 | 0.0352974 |
| Overall.Qual | 19 | 2930 | 6.0948805 | 1.4110261 | 6.0 | 6.0755119 | 1.4826 | 1 | 10 | 9 | 0.1904388 | 0.0481942 | 0.0260676 |
| Overall.Cond | 20 | 2930 | 5.5631399 | 1.1115366 | 5.0 | 5.4680034 | 0.0000 | 1 | 9 | 8 | 0.5738415 | 1.4837965 | 0.0205348 |
| Year.Built | 21 | 2930 | 1971.3563140 | 30.2453606 | 1973.0 | 1974.2474403 | 37.0650 | 1872 | 2010 | 138 | -0.6038435 | -0.5046106 | 0.5587595 |
| Year.Remod.Add | 22 | 2930 | 1984.2665529 | 20.8602859 | 1993.0 | 1985.6284130 | 20.7564 | 1950 | 2010 | 60 | -0.4514001 | -1.3424069 | 0.3853776 |
| Roof.Style* | 23 | 2930 | 2.3948805 | 0.8197217 | 2.0 | 2.2431741 | 0.0000 | 1 | 6 | 5 | 1.5584134 | 0.8882255 | 0.0151437 |
| Roof.Matl* | 24 | 2930 | 2.0627986 | 0.5382105 | 2.0 | 2.0000000 | 0.0000 | 1 | 8 | 7 | 8.7203343 | 76.9796054 | 0.0099430 |
| Exterior.1st* | 25 | 2930 | 11.1604096 | 3.6506082 | 14.0 | 11.4718430 | 1.4826 | 1 | 16 | 15 | -0.5885754 | -0.7610933 | 0.0674422 |
| Exterior.2nd* | 26 | 2930 | 11.8709898 | 3.9980891 | 15.0 | 12.1889932 | 2.9652 | 1 | 17 | 16 | -0.5636232 | -0.8976905 | 0.0738616 |
| Mas.Vnr.Type* | 27 | 2930 | 4.4269625 | 1.0775448 | 5.0 | 4.4577645 | 0.0000 | 1 | 6 | 5 | -0.6932065 | -0.6198667 | 0.0199068 |
| Mas.Vnr.Area | 28 | 2907 | 101.8968008 | 179.1126106 | 0.0 | 61.1388053 | 0.0000 | 0 | 1600 | 1600 | 2.6042950 | 9.2601663 | 3.3220307 |
| Exter.Qual* | 29 | 2930 | 3.5290102 | 0.7016622 | 4.0 | 3.6424915 | 0.0000 | 1 | 4 | 3 | -1.7925145 | 3.6662698 | 0.0129627 |
| Exter.Cond* | 30 | 2930 | 4.7098976 | 0.7723926 | 5.0 | 4.9261945 | 0.0000 | 1 | 5 | 4 | -2.5031881 | 5.1134440 | 0.0142694 |
| Foundation* | 31 | 2930 | 2.3924915 | 0.7261949 | 2.0 | 2.4539249 | 1.4826 | 1 | 6 | 5 | 0.0107612 | 0.7550324 | 0.0134159 |
| Bsmt.Qual* | 32 | 2851 | 4.6878288 | 1.3131144 | 4.0 | 4.8487505 | 2.9652 | 1 | 6 | 5 | -0.4618164 | -0.8243922 | 0.0245926 |
| Bsmt.Cond* | 33 | 2851 | 5.7972641 | 0.6970662 | 6.0 | 6.0000000 | 0.0000 | 1 | 6 | 5 | -3.3552418 | 10.0737610 | 0.0130550 |
| Bsmt.Exposure* | 34 | 2851 | 4.2714837 | 1.1375368 | 5.0 | 4.4660237 | 0.0000 | 1 | 5 | 4 | -1.1653915 | -0.3047203 | 0.0213043 |
| BsmtFin.Type.1* | 35 | 2851 | 4.7551736 | 1.8096342 | 4.0 | 4.8193775 | 2.9652 | 1 | 7 | 6 | -0.0393708 | -1.3628952 | 0.0338916 |
| BsmtFin.SF.1 | 36 | 2929 | 442.6295664 | 455.5908391 | 370.0 | 384.0840085 | 548.5620 | 0 | 5644 | 5644 | 1.4147320 | 6.8388117 | 8.4181236 |
| BsmtFin.Type.2* | 37 | 2851 | 6.6720449 | 1.0167651 | 7.0 | 6.9706269 | 0.0000 | 1 | 7 | 6 | -3.3854719 | 10.8381116 | 0.0190424 |
| BsmtFin.SF.2 | 38 | 2929 | 49.7224309 | 169.1684756 | 0.0 | 2.0366738 | 0.0000 | 0 | 1526 | 1526 | 4.1357391 | 18.7325405 | 3.1257897 |
| Bsmt.Unf.SF | 39 | 2929 | 559.2625469 | 439.4941528 | 466.0 | 510.7735608 | 415.1280 | 0 | 2336 | 2336 | 0.9221075 | 0.4044530 | 8.1206990 |
| Total.Bsmt.SF | 40 | 2929 | 1051.6145442 | 440.6150670 | 990.0 | 1035.0537313 | 349.8936 | 0 | 6110 | 6110 | 1.1550204 | 9.1097038 | 8.1414106 |
| Heating* | 41 | 2930 | 2.0252560 | 0.2452214 | 2.0 | 2.0000000 | 0.0000 | 1 | 6 | 5 | 12.0957649 | 168.4515977 | 0.0045303 |
| Heating.QC* | 42 | 2930 | 2.5389078 | 1.7439418 | 1.0 | 2.4236348 | 0.0000 | 1 | 5 | 4 | 0.4808449 | -1.5194533 | 0.0322180 |
| Central.Air* | 43 | 2930 | 1.9331058 | 0.2498813 | 2.0 | 2.0000000 | 0.0000 | 1 | 2 | 1 | -3.4653089 | 10.0117831 | 0.0046164 |
| Electrical* | 44 | 2930 | 5.6846416 | 1.0492497 | 6.0 | 6.0000000 | 0.0000 | 1 | 6 | 5 | -3.0814461 | 7.6505468 | 0.0193841 |
| X1st.Flr.SF | 45 | 2930 | 1159.5576792 | 391.8908853 | 1084.0 | 1127.1689420 | 349.8936 | 334 | 5095 | 4761 | 1.4679244 | 6.9480810 | 7.2398797 |
| X2nd.Flr.SF | 46 | 2930 | 335.4559727 | 428.3957150 | 0.0 | 272.8959044 | 0.0000 | 0 | 2065 | 2065 | 0.8655698 | -0.4179641 | 7.9142781 |
| Low.Qual.Fin.SF | 47 | 2930 | 4.6767918 | 46.3105100 | 0.0 | 0.0000000 | 0.0000 | 0 | 1064 | 1064 | 12.1057567 | 175.1836856 | 0.8555507 |
| Gr.Liv.Area | 48 | 2930 | 1499.6904437 | 505.5088875 | 1442.0 | 1452.2453072 | 461.0886 | 334 | 5642 | 5308 | 1.2728055 | 4.1238681 | 9.3388841 |
| Bsmt.Full.Bath | 49 | 2928 | 0.4313525 | 0.5248202 | 0.0 | 0.3963311 | 0.0000 | 0 | 3 | 3 | 0.6160073 | -0.7502980 | 0.0096990 |
| Bsmt.Half.Bath | 50 | 2928 | 0.0611339 | 0.2452536 | 0.0 | 0.0000000 | 0.0000 | 0 | 2 | 2 | 3.9367587 | 14.8820022 | 0.0045324 |
| Full.Bath | 51 | 2930 | 1.5665529 | 0.5529406 | 2.0 | 1.5575939 | 0.0000 | 0 | 4 | 4 | 0.1717761 | -0.5442378 | 0.0102151 |
| Half.Bath | 52 | 2930 | 0.3795222 | 0.5026293 | 0.0 | 0.3387372 | 0.0000 | 0 | 2 | 2 | 0.6969988 | -1.0319744 | 0.0092857 |
| Bedroom.AbvGr | 53 | 2930 | 2.8542662 | 0.8277311 | 3.0 | 2.8323379 | 0.0000 | 0 | 8 | 8 | 0.3053813 | 1.8828128 | 0.0152917 |
| Kitchen.AbvGr | 54 | 2930 | 1.0443686 | 0.2140762 | 1.0 | 1.0000000 | 0.0000 | 0 | 3 | 3 | 4.3094087 | 19.8182212 | 0.0039549 |
| Kitchen.Qual* | 55 | 2930 | 3.8563140 | 1.2692368 | 5.0 | 4.0251706 | 0.0000 | 1 | 5 | 4 | -0.6206625 | -0.6780245 | 0.0234482 |
| TotRms.AbvGrd | 56 | 2930 | 6.4430034 | 1.5729644 | 6.0 | 6.3336177 | 1.4826 | 2 | 15 | 13 | 0.7527712 | 1.1477391 | 0.0290593 |
| Functional* | 57 | 2930 | 7.6918089 | 1.1754144 | 8.0 | 8.0000000 | 0.0000 | 1 | 8 | 7 | -3.8253633 | 13.7939856 | 0.0217149 |
| Fireplaces | 58 | 2930 | 0.5993174 | 0.6479209 | 1.0 | 0.5183447 | 1.4826 | 0 | 4 | 4 | 0.7384585 | 0.0971731 | 0.0119698 |
| Fireplace.Qu* | 59 | 1508 | 3.7194960 | 1.1264883 | 3.0 | 3.7831126 | 1.4826 | 1 | 5 | 4 | -0.1217511 | -1.0149437 | 0.0290086 |
| Garage.Type* | 60 | 2773 | 3.2830869 | 1.7903034 | 2.0 | 3.1144660 | 0.0000 | 1 | 6 | 5 | 0.7476409 | -1.3113362 | 0.0339979 |
| Garage.Yr.Blt | 61 | 2771 | 1978.1324432 | 25.5284113 | 1979.0 | 1980.7108705 | 31.1346 | 1895 | 2207 | 312 | -0.3842554 | 1.8176404 | 0.4849596 |
| Garage.Finish* | 62 | 2773 | 3.1799495 | 0.8229231 | 3.0 | 3.2257774 | 1.4826 | 1 | 4 | 3 | -0.3510946 | -1.4163756 | 0.0156273 |
| Garage.Cars | 63 | 2929 | 1.7668146 | 0.7605664 | 2.0 | 1.7680171 | 0.0000 | 0 | 5 | 5 | -0.2196113 | 0.2402900 | 0.0140533 |
| Garage.Area | 64 | 2929 | 472.8197337 | 215.0465485 | 480.0 | 468.3458422 | 182.3598 | 0 | 1488 | 1488 | 0.2417464 | 0.9446576 | 3.9734961 |
| Garage.Qual* | 65 | 2772 | 5.8405483 | 0.6634018 | 6.0 | 6.0000000 | 0.0000 | 1 | 6 | 5 | -4.0371873 | 14.7948678 | 0.0126003 |
| Garage.Cond* | 66 | 2772 | 5.8979076 | 0.5319486 | 6.0 | 6.0000000 | 0.0000 | 1 | 6 | 5 | -5.2866785 | 27.1445647 | 0.0101035 |
| Paved.Drive* | 67 | 2930 | 2.8313993 | 0.5363888 | 3.0 | 3.0000000 | 0.0000 | 1 | 3 | 2 | -2.9849322 | 7.1493968 | 0.0099094 |
| Wood.Deck.SF | 68 | 2930 | 93.7518771 | 126.3615619 | 0.0 | 71.2141638 | 0.0000 | 0 | 1424 | 1424 | 1.8407918 | 6.7337405 | 2.3344317 |
| Open.Porch.SF | 69 | 2930 | 47.5334471 | 67.4834001 | 27.0 | 33.8724403 | 40.0302 | 0 | 742 | 742 | 2.5327906 | 10.9241024 | 1.2467034 |
| Enclosed.Porch | 70 | 2930 | 23.0116041 | 64.1390592 | 0.0 | 4.8297782 | 0.0000 | 0 | 1012 | 1012 | 4.0103363 | 28.4151135 | 1.1849193 |
| X3Ssn.Porch | 71 | 2930 | 2.5924915 | 25.1413310 | 0.0 | 0.0000000 | 0.0000 | 0 | 508 | 508 | 11.3921213 | 149.6265877 | 0.4644666 |
| Screen.Porch | 72 | 2930 | 16.0020478 | 56.0873702 | 0.0 | 0.0000000 | 0.0000 | 0 | 576 | 576 | 3.9534162 | 17.8124277 | 1.0361706 |
| Pool.Area | 73 | 2930 | 2.2433447 | 35.5971806 | 0.0 | 0.0000000 | 0.0000 | 0 | 800 | 800 | 16.9218026 | 299.0552932 | 0.6576303 |
| Pool.QC* | 74 | 13 | 2.4615385 | 1.1982894 | 3.0 | 2.4545455 | 1.4826 | 1 | 4 | 3 | -0.0507909 | -1.6761110 | 0.3323457 |
| Fence* | 75 | 572 | 2.4125874 | 0.8355387 | 3.0 | 2.4890830 | 0.0000 | 1 | 4 | 3 | -0.6760973 | -0.8887827 | 0.0349356 |
| Misc.Feature* | 76 | 106 | 3.8490566 | 0.5484602 | 4.0 | 4.0000000 | 0.0000 | 1 | 5 | 4 | -3.1640389 | 10.3679528 | 0.0532712 |
| Misc.Val | 77 | 2930 | 50.6351536 | 566.3442883 | 0.0 | 0.0000000 | 0.0000 | 0 | 17000 | 17000 | 21.9772674 | 564.8476877 | 10.4627709 |
| Mo.Sold | 78 | 2930 | 6.2160410 | 2.7144924 | 6.0 | 6.1569966 | 2.9652 | 1 | 12 | 11 | 0.1923989 | -0.4576397 | 0.0501481 |
| Yr.Sold | 79 | 2930 | 2007.7904437 | 1.3166129 | 2008.0 | 2007.7380546 | 1.4826 | 2006 | 2010 | 4 | 0.1347244 | -1.1589064 | 0.0243234 |
| Sale.Type* | 80 | 2930 | 9.3587031 | 1.8781600 | 10.0 | 9.8745734 | 0.0000 | 1 | 10 | 9 | -3.3213129 | 10.7566598 | 0.0346975 |
| Sale.Condition* | 81 | 2930 | 4.7798635 | 1.0762992 | 5.0 | 5.0000000 | 0.0000 | 1 | 6 | 5 | -2.7934757 | 7.2515027 | 0.0198838 |
| SalePrice | 82 | 2930 | 180796.0600683 | 79886.6923567 | 160000.0 | 170429.1510239 | 54856.2000 | 12789 | 755000 | 742211 | 1.7417153 | 5.1025881 | 1475.8445972 |
I was keen to establish how complete the dataset was; I used the following stream of code to provide clear insight into this question
ames%>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(Absent.data = n())%>%
filter(is.missing == TRUE, Absent.data > 1) %>%
select(-is.missing) %>%
arrange(desc(Absent.data)) %>%
ggplot(aes(x = reorder(key, Absent.data), y = Absent.data, fill = key)) +
geom_col() +
coord_flip() +
xlab("Variable") +
ylab("Absent values")+
theme(legend.position='none')+
geom_text( aes( label = Absent.data), colour ="white", hjust = 1.0, size = 4.0)+
coord_flip()+
ISAAC.THEME2
## `summarise()` has grouped output by 'key'. You can override using the `.groups` argument.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
The primary documentation relating to the contents of the Ames Housing dataset is revealing: surprisingly, the designation ‘NA’ for some of the variables provided did not signal the standard not available meaning . For instance, in relation to Pool.QC, the designation NA translates as No Pool, NA means No Basement for Bsmt.Qual and No Alley in relation to Alley.
Of course, I could not be entirely sure whether some of these NA entries indicate the absence of certain variables like pools and fireplaces were the result of systematic labelling or accidental omissions. However, in the case of the pool and fireplace variables, I could reduce the likelihood that the NA entries were caused by accidental omissions. Before I permitted myself to recode with distasteful haste, I engaged in a little verification. I simply reasoned that if a house truly does not have a pool, clearly the variable Pool.Area (denoting area of pool) should not exceed 0.
POOL
ames%>%
select(Pool.Area, Pool.QC)%>%
filter(Pool.Area > 0 & is.na(Pool.QC))
## [1] Pool.Area Pool.QC
## <0 rows> (or 0-length row.names)
The 0 result started suggested that there were no issues there.
FIREPLACES
Similarly, I reasoned that if a house truly does not have a fireplace, then naturally it should not have an accompanying rating.
ames %>%
select(Fireplaces, Fireplace.Qu) %>% filter(Fireplaces > 0 & is.na(Fireplace.Qu))
## [1] Fireplaces Fireplace.Qu
## <0 rows> (or 0-length row.names)
Again, I was satisfied that there were no issues here either.
I decided to relabel the NA variables appropriately. This deficit in the respective data entries can be classified as MNAR: missing not at random. I was now in a position to confidently perform some blanket coding.
# create temporary subset for recoding
x<- ames[c('Alley','Bsmt.Exposure', 'BsmtFin.Type.1','BsmtFin.Type.2', 'Fireplace.Qu', 'Garage.Type', 'Garage.Finish', 'Garage.Qual', 'Garage.Cond','Pool.QC','Fence','Misc.Feature')] #list of variables to be recoded
# recode specified variables
x=as.matrix(x)
x[is.na(x)] <-"None"
x=as.data.frame(x)
myvars <- names(ames) %in% c("Alley", "Bsmt.Exposure", "BsmtFin.Type.1", "BsmtFin.Type.2", "Fireplace.Qu", "Garage.Type", "Garage.Finish", "Garage.Qual", "Garage.Cond","Pool.QC","Fence","Misc.Feature")
ames.reduced <- ames[!myvars] #remove specified variables in their original format
ames.Updated<- bind_cols(ames.reduced, x )
# append updated variables to reduced dataset
In aforementioned code chunk, I extracted the variables where the designation NA indicates the absence of the respective feature(s) into a separate data frame, before recoding NA as ‘None’. Thereafter, I removed the original variables, before using dplyr’s ‘bind_cols’ function to append the revised variables back to the original dataset. These manoeuvres significantly reduced my NA burden - the following code block and resultant graphical display illustrates the revised level of missingness.
ames.Updated %>%
gather(key = "key", value = "val") %>%
mutate(is.missing = is.na(val)) %>%
group_by(key, is.missing) %>%
summarise(Absent.data = n())%>%
filter(is.missing == TRUE, Absent.data > 1) %>%
select(-is.missing) %>%
arrange(desc(Absent.data)) %>%
ggplot(aes(x = reorder(key, Absent.data), y = Absent.data, fill = key)) +
geom_col() +
coord_flip() +
xlab("Variable") +
ylab("Absent values")+
theme(legend.position='none')+
geom_text( aes( label = Absent.data), colour ="white", hjust = 1.0, size = 4.0)+
coord_flip()+
ISAAC.THEME2
## `summarise()` has grouped output by 'key'. You can override using the `.groups` argument.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
The coding provided great improvement! My NA burden was reduced considerably.
DETECT EMPTY STRINGS
The is.na function does not detect the presence of missing strings, so I decided to do an independent check for their presence in the dataset and then tabulate the results
EmptyStrings <- colSums(ames.Updated == "", na.rm =TRUE)
rev(stack(EmptyStrings[EmptyStrings > 0]))%>%kable()%>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", full_width = F))
| ind | values |
|---|---|
| Mas.Vnr.Type | 23 |
| Bsmt.Qual | 1 |
| Bsmt.Cond | 1 |
| Electrical | 1 |
| Bsmt.Exposure | 4 |
| BsmtFin.Type.1 | 1 |
| BsmtFin.Type.2 | 2 |
| Garage.Finish | 2 |
| Garage.Qual | 1 |
| Garage.Cond | 1 |
I decided to recode all the empty strings as NA- if you happen to have Saint-like patience, I’ll show you how I addressed these entries alongside the other NAs in the imputation phase!
ames.Updated <- ames.Updated %>% mutate_all(na_if,"")
I decided to remove the features Order and Parcel identification(PID) from the dataset as these are effectively arbitrary and unlikely to have been useful for any form of meaningful generalisation.
#remove Order, PID variables
ames.Updated<-ames.Updated%>%select(-c(Order,PID))
A close inspection of the dataset revealed that the variable MS.SubClass had been incorrectly labelled as an integer. I converted this variable to a factor variable. MS SubClass is a nominal variable that characterises the type of dwelling, from a 1-story building to a duplex style of house . I converted this variable to a factor before recoding the various levels simply using the letters of the alphabet at this stage.
class(ames.Updated$MS.SubClass)
## [1] "integer"
ames.Updated$MS.SubClass<-as.factor(ames.Updated$MS.SubClass)
levels(ames.Updated$MS.SubClass) <- c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P")
MONTH SOLD
Month sold (Mo.Sold) was entered as a an integer, however I reclassified it as a factor variable
ames.Updated[["Mo.Sold"]] <- factor(ames.Updated[["Mo.Sold"]])
ames.Updated$Month.Sold <- month.abb[ames.Updated$Mo.Sold]
ames.Updated$Mo.Sold<-NULL
ames.Updated$Month.Sold<-as.factor(ames.Updated$Month.Sold)
I decided to modify the variables Year.Built, Garage.Yr.Blt and Year.Remod.Add slightly: instead of defining these variables by their respective year, I wanted to extract the number of years elapsed. I accomplished this by simply subtracting the year from 2010 (the year in which the sale prices were recorded). I felt that this would make the data friendlier for usage with the various algorithms that were to be employed in due course.
ames.Updated$Age.of.house <- 2010 - ames.Updated$Year.Built
ames.Updated$Year.Built<-NULL
ames.Updated$Age.of.garage <- 2010 - ames.Updated$Garage.Yr.Blt
ames.Updated$Garage.Yr.Blt <-NULL
ames.Updated$Yrs.since.Remod<-2010-ames.Updated$Year.Remod.Add
ames.Updated$Year.Remod.Add<-NULL
The original dataset contained four variables quantifying the types of bathrooms; I decided to aggregate these variables into a united variable simply entitled ‘Bathroom.Tally’.
ames.Updated$Bathroom.Tally <- 0.5 * ames.Updated$Half.Bath + 0.5 * ames.Updated$Bsmt.Half.Bath + ames.Updated$Bsmt.Full.Bath + ames.Updated$Full.Bath
I decided to combine the interior area related variables, namely Total.Bsmt.SF (total square feet of basement area) and Gr.Liv.Area (above ground living area of the house in square feet) .
ames.Updated$Overall.AREA <- ames.Updated$Gr.Liv.Area + ames.Updated$Total.Bsmt.SF
I then removed the constituent variables of Bathroom.tally, Overall.area.size and Porch area:
ames.Updated<-ames.Updated%>%select(-c(Half.Bath, Bsmt.Half.Bath,Bsmt.Full.Bath,Full.Bath,Gr.Liv.Area,Total.Bsmt.SF))
ELIMINATE NEAR ZERO VARIANCE FACTORS
The nearZeroVar functions evaluates the ratio of the highest frequency value to the next most frequent value(freqCut) . In addition, the function evaluates the percentage of unique values in relation to the overall quantity of samples (uniqueCut). The default settings are freqCut = 19 and uniqueCut = 10. Prior to inspecting the data for near-zero variance predictors , I decided to use dplyr’s drop_na function so that I was examining the dataset using only ‘complete cases’.
ames.Updated.comp<-ames.Updated%>%drop_na()
nzv.data <- nearZeroVar(ames.Updated.comp, saveMetrics = TRUE)
drop.cols <- rownames(nzv.data)[nzv.data$nzv == TRUE]
ames.Updated.2<- ames.Updated[,!names(ames.Updated)%in% drop.cols]
Applying the near zero variance function flagged the following variables: Alley,Street, Land.Contour, Utilities, Land.Slope, Misc.Val, Pool.Area,Screen.Porch, Enclosed.Porch, Open.Porch.SF,Garage.Cond,Kitchen.Abv.Gr,Low.Qual.Fin.SF,Bsmt.Fin.SF.2, Condition.2, Bsmt.Cond, BsmtFin.Type.2,Roof.Matl for removal. All of the suggested variables were removed.
When independent variables are highly correlated, changes in one variable are associated with changes in the other correlated variables. When conducting multiple linear regression, it becomes challenging to change one feature without simultaneously changing another independent variable, thus the algorithm struggles to discern the unique relationship that exists between particular individual independent variables upon the dependent variable. Unfortunately, the coefficient estimates become unreliable, they are sensitive to small changes in the model. Whilst the presence of multicollinearity influences the coefficient and p-values, it will not impair predictive performance.
I extracted a subset consisting of all complete cases.
ames_complete <-
ames.Updated.2[complete.cases(ames.Updated.2), ]
NUMERIC AND CATEGORICAL COMPLETE CASES
I then extracted a new subset containing all numeric variables:
amesnumeric.comp <- dplyr::select_if(ames_complete, is.numeric)
I then decided to find out what numeric variables were most correlated with Sale Price
corr_var(amesnumeric.comp, # name of dataset
SalePrice, # name of variable to focus on
max_pvalue = 0.05
,top = 10)
I removed the dependent variable SalePrice before examining the pairwise correlations:
amesnumeric2 <-amesnumeric.comp[c(-19)]
PAIRWISE CORRELATIONS ACROSS COVARIATES
The caret findCorrelation evaluates pair-wise correlations across all variables, flagging variables that are highly correlated. Of the identified pairs, the function recommends the removal of the variable with the highest average absolute correlation across the dataset. However, I decided to retain the variables that have the highest correlation with the dependent variable i.e. SalePrice.I proceeded to find out which of the variables had the highest correlation with the dependent variable. Firstly, I extracted a subset with the suggested variables for removal, I was then able to easily arrange the correlations with SalePrice in descending order.
library(caret)
identify<-caret::findCorrelation(cor(amesnumeric2),cutoff = 0.75,names = T, verbose = T)
## Compare row 23 and column 9 with corr 0.832
## Means: 0.419 vs 0.236 so flagging column 23
## Compare row 14 and column 15 with corr 0.852
## Means: 0.351 vs 0.218 so flagging column 14
## Compare row 19 and column 20 with corr 0.84
## Means: 0.286 vs 0.207 so flagging column 19
## All correlations <= 0.75
ames.suggestions<- amesnumeric.comp[c('Garage.Cars','Garage.Area','Overall.AREA', 'X1st.Flr.SF','Age.of.garage', 'Age.of.house','SalePrice')]
#evaluate correlations against dependent variable
ames.suggestions %>% corr_var(SalePrice)
I removed the three variables that had the lowest correlation with SalePrice from the amesnumeric.comp subset.
#variable to be removed
minus.vars<-c("Age.of.garage","Age.of.house","X1st.Flr.SF")
#create numeric subset devoid of variables specified in minus.vars
amesnumeric.3 <- amesnumeric.comp[,!names(amesnumeric.comp) %in% minus.vars]
At this stage, I removed the same variables from the main dataset
ames.Updated.3<-ames.Updated.2[,!names(ames.Updated.2)%in% minus.vars]
I decided to evaluate which variables had the highest correlations (using Pearson’s r) with SalePrice. I shall took a subset featuring only complete cases from the previously updated primary dataset:
ames.comp2<-ames.Updated.3[complete.cases(ames.Updated.3),]
I took a further subset containing only the numeric variables from the subset containing only complete cases before evaluating the respective correlations.
amesnumeric.4<-dplyr::select_if(ames.comp2, is.numeric)
corr_var(amesnumeric.4, # name of dataset
SalePrice, # name of variable to focus on
max_pvalue = 0.05
,top = 10)
I wanted a fairly quick way to identify influential factor variables. Here, I decided to the enlist the aid of the Random forest algorithm to help identify influential categorical factors. I created a subset of the factor/character variables with Sale Price. Afterwards, I ran the random forest algorithm.
Factor_subset <- ames_complete[sapply(ames_complete, is.character)|sapply(ames_complete, is.factor)|names(ames_complete)=='SalePrice']
set.seed(174)
Influential.Factors <- ranger(SalePrice ~ ., data = Factor_subset, importance = "permutation")
Extract.influential <- vip(Influential.Factors)
plot(Extract.influential)
I decided to produce a range of potentially informative visualisations; I was particularly interested in seeing whether I could spot any discernible relationships between the dependent variable and the respective exploratory variables. The adage of ‘a picture speaks a thousand words’ is very often the case in the field of data analysis. Up until this point, throughout the EDA (Exploratory Data Analysis) section, I had relied heavily upon the guidance of the Pearson correlation coefficient. However, as I stated earlier, the Pearson coefficient is a measure of linear correlation, it does not capture non-linear relationships. Informative patterns may well exist between the independent variable(s) and the dependent variable that the Pearson coefficient fails to capture.
Here, I plotted all the numeric variables against Sale Price
amesnumeric.4%>%
pivot_longer(-SalePrice, names_to = "Feature", values_to = "Value") %>%
ggplot() +
geom_point(mapping=aes(x = Value, y = SalePrice, color = Feature)) +
scale_y_continuous(labels = scales::comma)+
facet_wrap(~ Feature, scales = "free", ncol = 4) +
scale_x_continuous(n.breaks = 2)+
theme(legend.position = "",
plot.title.position = "plot")+
labs(x = "Numeric Feature Value",
title = "Ames Housing Numeric Variable versus Sale Price")+
ISAAC.THEME
Overall Area is clearly moving in tandem with Sale Price, whereas Years since remodeled appears to exhibit a clearly discernible negative correlation with Sale Price.
ggplot(data = ames_complete,
mapping = aes(x = Yrs.since.Remod, y = SalePrice)) +
geom_line() +
facet_wrap(vars(Neighborhood)) +
labs(title = "Price vs Time since remodeled/Neighbourhood")+
theme_economist()
Please note that the variable time since remodeled ( Yrs.since.Remod) translates as the time since construction if remodelling/additions have not been carried out.
Factor_subset%>%gather(-SalePrice, key = "var", value = "value") %>%ggplot(aes(x = value, y = SalePrice)) +
geom_violin() +
facet_wrap(~ var, scales = "free") +
theme_bw()+
ISAAC.THEME2
I decided to expore the distribution densities of the exploratory variables.
melt(ames_complete) %>%
ggplot(aes(x= value)) +
geom_density(fill='#fdae6b') + facet_wrap(~variable, scales = 'free')+
ISAAC.THEME
## Using MS.SubClass, MS.Zoning, Lot.Shape, Lot.Config, Neighborhood, Condition.1, Bldg.Type, House.Style, Roof.Style, Exterior.1st, Exterior.2nd, Mas.Vnr.Type, Exter.Qual, Exter.Cond, Foundation, Bsmt.Qual, Heating.QC, Electrical, Kitchen.Qual, Paved.Drive, Sale.Type, Sale.Condition, Bsmt.Exposure, BsmtFin.Type.1, Fireplace.Qu, Garage.Type, Garage.Finish, Garage.Qual, Fence, Month.Sold as id variables
ggplot(ames.Updated.3, aes(x=SalePrice, y=..density..)) +
geom_histogram(bins=50, fill="#8F0C90", col="#DAF7A6") +
geom_density(size = 1, colour = "#FFFAE5") +
labs(title = "Houses prices", x = 'Price', y = "Frequency") +
scale_x_continuous(labels=scales::dollar_format())+
theme(plot.title = element_text(hjust = 0.5))+
ISAAC.THEME
The distribution of house prices is clearly right-skewed. I felt that it may be worth considering logarithmic transformations of sale price (the dependent variable), especially for linear modelling.
My hunch was that prices would vary considerably across neighbourhoods, indeed the visualisation featured below clearly indicates that this was the case.
library(ggridges)
ggplot(ames.comp2, aes(x = `SalePrice`, y = Neighborhood, fill = stat(x))) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_viridis_c(name = "Prices", option = "C") +
labs(title = 'Prices across Neighbourhoods')
## Picking joint bandwidth of 15500
I was keen to obtain an overview of some basic descriptive summary statistics, my primary motivation was to identify factor levels containing a low number of entries. Factor levels with a low level of entries be problematic for running many algorithms.
ames.Updated.3%>%group_by(Bsmt.Qual)%>%summarize(n=n())%>%mutate(percent = round(n / sum(n) * 100, 2))%>%kable()%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Bsmt.Qual | n | percent |
|---|---|---|
| Ex | 258 | 8.81 |
| Fa | 88 | 3.00 |
| Gd | 1219 | 41.60 |
| Po | 2 | 0.07 |
| TA | 1283 | 43.79 |
| NA | 80 | 2.73 |
ames.Updated.3%>%group_by(Exter.Qual)%>%summarize(n=n())%>%mutate(prop=n/sum(n))%>%kable()%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Exter.Qual | n | prop |
|---|---|---|
| Ex | 107 | 0.0365188 |
| Fa | 35 | 0.0119454 |
| Gd | 989 | 0.3375427 |
| TA | 1799 | 0.6139932 |
ames.Updated.3%>%group_by(Kitchen.Qual)%>%summarize(n=n())%>%mutate(percent = round(n / sum(n) * 100, 2))%>%knitr::kable(
format = "html",
booktabs = TRUE, # For LaTeX
col.names = c("Quality", "Count", "Percentage"),
caption = "Kitchen Quality Proportions",
digits = 1
) %>%kable_styling(bootstrap_options = "striped")
| Quality | Count | Percentage |
|---|---|---|
| Ex | 205 | 7.0 |
| Fa | 70 | 2.4 |
| Gd | 1160 | 39.6 |
| Po | 1 | 0.0 |
| TA | 1494 | 51.0 |
ames.Updated.3%>%group_by(Fireplace.Qu)%>%summarize(n=n())%>%mutate(percent = round(n / sum(n) * 100, 2))%>%knitr::kable(
format = "html",
booktabs = TRUE, # For LaTeX
col.names = c("Quality", "Count", "Percentage"),
caption = "Distribution of Fireplace Quality",
digits = 1
) %>%kable_styling(bootstrap_options = "striped")
| Quality | Count | Percentage |
|---|---|---|
| Ex | 43 | 1.5 |
| Fa | 75 | 2.6 |
| Gd | 744 | 25.4 |
| None | 1422 | 48.5 |
| Po | 46 | 1.6 |
| TA | 600 | 20.5 |
library(kableExtra)
ames.Updated.3%>%
summarise(average = mean(SalePrice), median=median(SalePrice), minimum = min(SalePrice), maximum=max(SalePrice), standard.deviation=sd(SalePrice)) %>%
kable()%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| average | median | minimum | maximum | standard.deviation |
|---|---|---|---|---|
| 180796.1 | 160000 | 12789 | 755000 | 79886.69 |
A number of the variables in this dataset were of the ordinal type; I decided to recode these as numeric variables, primarily to ensure that they would be suitable for the modelling process.
EXTERNAL QUALITY
ames.Updated.3$Exter.Qual <- as.numeric(factor(ames.Updated.3$Exter.Qual,
levels = c("Fa","TA", "Gd", "Ex"),
labels = c(1,2,3,4) ,ordered = TRUE))
BASEMENT QUALITY
ames.Updated.3$Bsmt.Qual <- as.numeric(factor(ames.Updated.3$Bsmt.Qual,
levels = c("Po","Fa", "TA", "Gd", "Ex"), labels = c(1,2,3,4,5) ,ordered = TRUE))
KITCHEN QUALITY
When I was perusing through the distribution of levels, I realized that there is only one entry for a classification of Poor(Po), therefore I shall assimilate this into the nearest rank of Fair(Fa). :
ames.Updated.3[ames.Updated.3$Kitchen.Qual=="Po",]$Kitchen.Qual <- "Fa"
ames.Updated.3$Kitchen.Qual <- as.numeric(factor(ames.Updated.3$Kitchen.Qual,
levels = c("Fa","TA", "Gd", "Ex"),
labels = c(1,2,3,4) ,ordered = TRUE))
FIREPLACE QUALITY
ames.Updated.3$Fireplace.Qu <- as.numeric(factor(ames.Updated.3$Fireplace.Qu,
levels = c("None","Po","Fa", "TA", "Gd", "Ex"),
labels = c(1,2,3,4,5,6) ,ordered = TRUE))
Now that I had introduced some new numeric variables into the dataset, I decided to examine their correlation with sale price.
ames.ordinal<- ames.Updated.3[c('Exter.Qual','Fireplace.Qu','Bsmt.Qual', 'Kitchen.Qual','SalePrice')]
ames.ordinal%>%drop_na()%>%corr_var(SalePrice, max_pvalue = 0.05)
At this stage, I decided to reduce the evolving dataset considerably by selecting the twenty exploratory variables that had thus far indicated clear predictive promise. I made the decision to include the top ten variables with the highest correlation with Sale Price, in addition to the top ten most important features extracted from the random forest algorithm that I ran earlier using the factor variables to predict house prices. The model was entitled Influential.Factors. I extracted the relevant subset from ames.Updated.3.
ames.parsimonious6<-ames.Updated.3%>%select(Overall.Qual,Overall.AREA,Garage.Cars, Garage.Area, Bathroom.Tally, Mas.Vnr.Area, Yrs.since.Remod,TotRms.AbvGrd,Fireplaces,BsmtFin.SF.1,Bsmt.Qual,Foundation,Exter.Qual,Kitchen.Qual,Neighborhood,Fireplace.Qu, Garage.Type, Garage.Finish, MS.SubClass, Exterior.1st,SalePrice)
I soon realized that I still needed to do a little more preprocessing before the data was ready for the modelling process. A number of the independent variables had levels with few entries. I decided to recruit the Forcats library in order to relevel levels with fewer than 50 entries into a new category simply labelled ‘other’. I initially attempted to run a number of the models without including this step, however a number of the models failed to function as intended.
library(forcats)
ames.parsimonious6 <- ames.parsimonious6 %>%mutate(Neighborhood = fct_lump_min(Neighborhood, min=50, other_level = "other"))
ames.parsimonious6 <- ames.parsimonious6 %>%mutate(MS.SubClass = fct_lump_min(MS.SubClass, min=50, other_level = "other"))
ames.parsimonious6 <- ames.parsimonious6 %>%mutate(Exterior.1st = fct_lump_min(Exterior.1st, min=50, other_level = "other"))
ames.parsimonious6 <- ames.parsimonious6 %>%mutate(Garage.Type = fct_lump_min(Garage.Type, min=50, other_level = "other"))
In the basement quality variable, only two of the entries were marked as Poor(Po); I decided to change these entries to Fair(Fa). This would increase the likelihood of algorithms running smoothly.
#convert few entries marked '1' as '2'
ames.parsimonious6$Bsmt.Qual[ames.parsimonious6$Bsmt.Qual==1]<-2
It was time to begin the modelling process. Firstly, I needed to divide the primary dataset into a training and testing dataset.
library(doParallel)
## Loading required package: parallel
# doParallel
cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cores = cl)
#split data into training/testing sets
set.seed(174)
data_split <- initial_split(ames.parsimonious6, strata = "SalePrice", prop = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
I decided to recruit the recipes-package for some final preprocessing of the evolving dataset. A recipe allows one to craft a sequence of preprocessing operations on specified datasets. Thereafter, the recipe can be independently applied to other sets, including validation sets. One of the primary benefits of using the tidymodels/recipe framework is that it allows a data analyst to easily avoid the leakage of data between sets. A classic mistake can occur when certain preprocessing steps like centering and scaling are applied to the entire dataset before it is split into train/test or cross-validation folds. It is important to apply certain data preparation steps after splitting the data into respective train/test/folds in order to avoid biasing predictive performance. An example of this form of data leakage would occur if the analyst attempted to normalize columns of a dataset before splitting it into train/test subsets. If the analyst decided to apply min-max normalization, the process would entail calculating the global maximum(s)/minimum(s) in order to apply to every calculation. If this was done before splitting, the resultant training would have some insight into the distribution of the test data set. This type of issue can also arise with various forms of imputation.
In the following recipe I imputed missing data with K-nearest neighbours(KNN)-this algorithm, using Euclidean distance, recruits the k nearest samples in the sample space and imputes the mean of the specified neighbours. In addition, I converted all categorical variables to dummy variables in order to make them machine learning friendly.
model_rec <- recipe(
SalePrice ~ .,
data= ames_train)%>%step_dummy(all_nominal())%>%step_knnimpute(all_predictors(), neighbors = 10)%>%prep(training=ames_train, retain=TRUE, verbose=TRUE)
## oper 1 step dummy [training]
## oper 2 step knnimpute [training]
## The retained training set is ~ 1.01 Mb in memory.
Here, I applied the recipe model_rec to the ames_train data
trainSet.prep<-bake(model_rec,new_data=ames_train, composition='matrix')
trainSet = as.data.frame((trainSet.prep))
Similarly, I used the function bake to apply the same recipe to the ames_test set.
testSet.prep<-bake(model_rec,new_data=ames_test, composition='matrix')
testSet = as.data.frame((testSet.prep))
I decided to run a simple linear model to begin with. Linear regression is a good place to start testing models. Having visually examined the scatter plots of the independent variables against the dependent variable in the visualisation section, my hunch was that linear modelling would prove to be an inappropriate method. The scatterplots suggested that some of the continuous variables appeared to be violating one of the cardinal assumptions of linear regression- Linearity! This assumption posits that the dependent variable has a roughly linear relationship with each of the covariates.
Here, I constructed a simple linear model using all of the variables in the model.
```r
set.seed(741)
Linear.model<-lm(SalePrice~., data=trainSet)
I wanted to test the linear model on the test set.
predictions <- predict(Linear.model, newdata = testSet)
I calculated the residuals using the formula residuals = actual - prediction.
residuals <- testSet$SalePrice - predictions
I was keen to find out how well the model performed. Here, I simply wanted to see the root-mean-square error(RMSE).
sqrt(mean(Linear.model$residuals^2))
## [1] 30651.1
The adage “A picture is worth a thousand words” is an apt expression when it comes assessing the residuals of linear models.
par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(Linear.model)
The linear model appeared to be poorly predicting prices in the upper price ranges: the residuals versus fitted plot indicates increased variance as the price ascends. The quantile-quantile(Q-Q) plot was particularly revealing: the residuals are overdispersed. The distribution is heavier tailed than a normal distribution: this generally signifies that there are more extreme values than would be expected if the underlying distribution was normal. The p-values and confidence intervals will be potentially biased and too narrow, which may result in an increase in type I errors (claiming that x affects y, when it does not). I decided to seek some corroboration via some trusted allies: the Shapiro, Breusch-Pagan and Anderson-Darling tests were all brought to the fore.
SHAPIRO TEST
I commenced my investigation with the Shapiro test. I wanted to find out whether the residuals/errors were normally distributed. Whilst it is not a prerequisite that the residuals are normally distributed for ordinary least squares to generate unbiased estimates of the underlying population parameters, meeting this condition facilitates the production of trustworthy prediction and confidence intervals.
p_val <- shapiro.test(Linear.model$residuals)$p.value;
if(p_val>=0.05) {
print("The data is likely to be normally distributed")
} else {
print("Given that p<0.05, the data does not appear to be normally distributed. The calculated
coeffient estimates cannot be trusted")
}
## [1] "Given that p<0.05, the data does not appear to be normally distributed. The calculated\ncoeffient estimates cannot be trusted"
The null hypothesis of the shapiro test is that the observed data is normally distributed. As p<0.05, in keeping with custom, I concluded that the residuals were not normally distributed.
BREUSCH-PAGAN TEST
I decided to perform a homoscedastic test. The Breusch-Pagan test evaluates the homoscedasticity of the residuals. Linear regression is particularly suitable when the error/residuals are homoscedastic across all the predicted values of the dependent variable. This type of assessment gauges whether a regression model’s ability to predict the dependent variable is consistent across all ranges of the respective variable.
library(lmtest)
bptest(Linear.model)
##
## studentized Breusch-Pagan test
##
## data: Linear.model
## BP = 879.11, df = 60, p-value < 0.00000000000000022
if(p_val>=0.05) {
print("The data can be considered homoscesdastic")
} else {
print("Given that p<0.05, heteroskedasticity appears to be present.")
}
## [1] "Given that p<0.05, heteroskedasticity appears to be present."
ANDERSON DARLING TEST
# Anderson-Darling normality test
library(nortest)
ad.test(Linear.model$residuals)
##
## Anderson-Darling normality test
##
## data: Linear.model$residuals
## A = 64.433, p-value < 0.00000000000000022
if(p_val>=0.05) {
print("The data can be considered to be normally distributed")
} else {
print("The result suggests that the residuals are NOT normally distibuted ")
}
## [1] "The result suggests that the residuals are NOT normally distibuted "
Non-normality and heteroscedasticity were clearly present in the distribution of the residuals. If heteroscedasticity is present and is the only principal assumption of linear regression that is not met, this does not bias the regression coefficients. However, it does bias the standard errors and test-statistics. Unfortunately, this can produce inaccurate standard errors that result in confidence intervals that cannot be relied on.
OUTLIERS
I decided to look a little further into the presence of the outliers that were showing up in the diagnostic plots. Outliers can be found by inspecting the standardized residuals. Standardized residuals indicate the number of standard errors removed from the regression line. It is customary to consider standardized residuals exceeding 3 as possible outliers.
Whilst not all outliers are deemed to be influential, they are generally worth exploring further. Cook’s distance is a metric that seeks to capture the influence of a value: it is governed by a mixture of leverage and residual size.
plot(Linear.model, 4, id.n = 6)
plot(Linear.model, 5, id.n = 6)
extra.metrics <- augment(Linear.model)
# Add observations indices and
# drop some columns (.se.fit, .sigma) for simplification
extra.metrics <- extra.metrics %>%
mutate(index = 1:nrow(extra.metrics))
extra.metrics %>%
top_n(6, wt = .cooksd)
## # A tibble: 6 x 69
## SalePrice Overall.Qual Overall.AREA Garage.Cars Garage.Area Bathroom.Tally
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 160000 10 11752 2 1418 4.5
## 2 745000 10 6872 3 813 4.5
## 3 755000 10 6760 3 832 4
## 4 183850 10 10190 3 1154 4
## 5 184750 10 7814 3 884 4.5
## 6 475000 10 4715 3 840 2.5
## # ... with 63 more variables: Mas.Vnr.Area <dbl>, Yrs.since.Remod <dbl>,
## # TotRms.AbvGrd <dbl>, Fireplaces <dbl>, BsmtFin.SF.1 <dbl>, Bsmt.Qual <dbl>,
## # Exter.Qual <dbl>, Kitchen.Qual <dbl>, Fireplace.Qu <dbl>,
## # Foundation_CBlock <dbl>, Foundation_PConc <dbl>, Foundation_Slab <dbl>,
## # Foundation_Stone <dbl>, Foundation_Wood <dbl>, Neighborhood_CollgCr <dbl>,
## # Neighborhood_Crawfor <dbl>, Neighborhood_Edwards <dbl>,
## # Neighborhood_Gilbert <dbl>, Neighborhood_IDOTRR <dbl>,
## # Neighborhood_Mitchel <dbl>, Neighborhood_NAmes <dbl>,
## # Neighborhood_NoRidge <dbl>, Neighborhood_NridgHt <dbl>,
## # Neighborhood_NWAmes <dbl>, Neighborhood_OldTown <dbl>,
## # Neighborhood_Sawyer <dbl>, Neighborhood_SawyerW <dbl>,
## # Neighborhood_Somerst <dbl>, Neighborhood_StoneBr <dbl>,
## # Neighborhood_Timber <dbl>, Neighborhood_other <dbl>,
## # Garage.Type_BuiltIn <dbl>, Garage.Type_Detchd <dbl>,
## # Garage.Type_None <dbl>, Garage.Type_other <dbl>, Garage.Finish_None <dbl>,
## # Garage.Finish_RFn <dbl>, Garage.Finish_Unf <dbl>, MS.SubClass_B <dbl>,
## # MS.SubClass_E <dbl>, MS.SubClass_F <dbl>, MS.SubClass_G <dbl>,
## # MS.SubClass_I <dbl>, MS.SubClass_K <dbl>, MS.SubClass_L <dbl>,
## # MS.SubClass_N <dbl>, MS.SubClass_P <dbl>, MS.SubClass_other <dbl>,
## # Exterior.1st_CemntBd <dbl>, Exterior.1st_HdBoard <dbl>,
## # Exterior.1st_MetalSd <dbl>, Exterior.1st_Plywood <dbl>,
## # Exterior.1st_VinylSd <dbl>, Exterior.1st_Wd.Sdng <dbl>,
## # Exterior.1st_WdShing <dbl>, Exterior.1st_other <dbl>, .fitted <dbl>,
## # .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>,
## # index <int>
It was noteworthy that the top five outliers with high Cook’s distance scores shared a common Overall.Quality score of 10. I decided that I would consider revisiting these observations at a later stage.
PREDICTED/ACTUAL PRICES
I wanted to visualise the consistency of the predictions across the sample space.
library(auditor)
lm_audit <- audit(Linear.model, data = trainSet, y =trainSet$SalePrice)
## Preparation of a new explainer is initiated
## -> model label : lm ( [33m default [39m )
## -> data : 2199 rows 62 cols
## -> target variable : 2199 values
## -> predict function : yhat.lm will be used ( [33m default [39m )
## -> predicted values : No value for predict function target column. ( [33m default [39m )
## -> model_info : package stats , ver. 4.0.4 , task regression ( [33m default [39m )
## -> predicted values : numerical, min = 7432.059 , mean = 181401.5 , max = 572141.6
## -> residual function : difference between y and yhat ( [33m default [39m )
## -> residuals : numerical, min = -412141.6 , mean = -0.000000001694649 , max = 275067.8
## [32m A new explainer has been created! [39m
# validate a model with auditor
mr_lm <- model_residual(lm_audit)
# plot results
plot_prediction(mr_lm, abline = TRUE)
Here, it was evident that as price increases, the model’s predictive accuracy diminishes.
I was keen to explore the performance of a variety of algorithms that are known to work well with non-linear data. However, before I abandoned my exploration of linear regression, I wanted to find out what variables the linear model suggested were the most important in determining sale price.
library(vip)
vip(Linear.model)
I decided to experiment with a variety of algorithms. I simply wanted to get a sense of what algorithm(s) may be well-suited to modelling Sale Price in relation to the variables I had at my disposal.
library(parallel)
library(doParallel)
Mycluster =makeCluster(detectCores()-1)
registerDoParallel(Mycluster)
myControl = trainControl(method = 'cv', number = 10,
verboseIter = FALSE, savePredictions = TRUE,allowParallel = T)
set.seed(174)
Linear.Model = train(SalePrice ~., data = trainSet, metric = 'RMSE', method = 'lm',preProcess = c('center', 'scale'),trControl = myControl)
set.seed(174)
Glmnet.Model = train(SalePrice ~ ., data = trainSet , metric = 'RMSE', method = 'glmnet',preProcess = c('center', 'scale'), trControl = myControl)
set.seed(174)
Rapid.Ranger = train(SalePrice ~ ., data = trainSet, metric = 'RMSE', method = 'ranger',preProcess = c('center', 'scale'),trControl = myControl)
set.seed(174)
Basic.Knn <- train(SalePrice ~ .,
method = "knn",
tuneGrid = expand.grid(k =1:3), trControl = myControl, metric= "RMSE", data = trainSet)
set.seed(174)
Xgb.Super <- train(SalePrice~.,method = "xgbTree", tuneLength = 4,trControl = myControl,metric= "RMSE", data = trainSet)
## [17:45:46] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
suite.of.models = list("LINEAR.MODEL"=Linear.Model,"GLMNET.MODEL"=Glmnet.Model, "RANGER.QUEST"= Rapid.Ranger, "KNN.SIMPLE" = Basic.Knn, "XGB.SUPER"= Xgb.Super)
resamps = resamples(suite.of.models)
dotplot(resamps, metric = 'RMSE')
Evaluate.Prediction <- function(model, model.label, testData, ytest, grid = NULL) {
#capture prediction time
ptm <- proc.time()
# use test data to make predictions
pred <- predict(model, testData)
tm <- proc.time() - ptm
Pred.metric<- postResample(pred = pred, obs = ytest)
RMSE.test <- c(Pred.metric[[1]])
RSquared.test <- c(Pred.metric[[2]])
MAE.test <- c(Pred.metric[[3]])
Summarised.results = NULL
if (is.null(grid)) {
Summarised.results = data.frame(predictor = c(model.label) , RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]))
} else {
.grid = data.frame(predictor = c(model.label) , RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]))
Summarised.results = rbind(grid, .grid)}
Summarised.results }
METRIC.GRID <- Evaluate.Prediction (Rapid.Ranger, "RAPID.QUEST", testSet, testSet$SalePrice, grid=NULL)
METRIC.GRID <- Evaluate.Prediction (Glmnet.Model, "GLMNET.MODEL", testSet, testSet$SalePrice, grid=METRIC.GRID)
METRIC.GRID <- Evaluate.Prediction (Basic.Knn, "KNN.SIMPLE", testSet, testSet$SalePrice, grid=METRIC.GRID)
METRIC.GRID <- Evaluate.Prediction (Linear.Model, "LINEAR.MODEL", testSet, testSet$SalePrice, grid=METRIC.GRID)
METRIC.GRID <- Evaluate.Prediction (Xgb.Super, "XGB.SUPER", testSet, testSet$SalePrice, grid=METRIC.GRID)
kable(METRIC.GRID[order(METRIC.GRID$RMSE, decreasing=F),])%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| predictor | RMSE | RSquared | MAE | time | |
|---|---|---|---|---|---|
| 1 | RAPID.QUEST | 22946.09 | 0.9042345 | 14757.51 | 0.17 |
| 5 | XGB.SUPER | 23845.53 | 0.8992469 | 15554.32 | 0.16 |
| 2 | GLMNET.MODEL | 24783.40 | 0.8883612 | 17368.92 | 0.06 |
| 4 | LINEAR.MODEL | 24826.47 | 0.8883348 | 17432.18 | 0.03 |
| 3 | KNN.SIMPLE | 37875.20 | 0.7548226 | 26322.29 | 0.19 |
The random forest (Rapid.Ranger) model and extreme gradient boosting (XGB.SUPER) models showed the most promise with the lowest average predictive errors (RMSE). I decided to explore these models further.
I decided to construct the random forest model using the R’s Tidymodels framework. This is a superb package that beautifully streamlines the predictive modelling process.
To divide the available data, I recruited the initial_split() function: it is designed to split a data set into a training and testing set. In the default stance, 0.75 of the data is allocated to the training set with the remainder being reserved for testing. The default setting can be modified by changing the prop argument. Instead of a data frame, this function produces an rsplit object.
set.seed(741)
ames_split <- initial_split(ames.parsimonious6, prop = 0.80, strata = SalePrice)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
I opted to perform cross validation on the train set. Cross-validation, also known as rotation estimation, is a validation strategy used for evaluating how well the results of a statistical analysis are likely to generalize to an unseen dataset. Many data analysts employ k-fold cross-validation by employing k=5 or k=10, these numbers have demonstrated an ideal balance between minimising bias and variance.
set.seed(174)
random.f_cv <- vfold_cv(ames_train,v=5)
The following object simple_rec is a recipe/sequence of instructions that can been applied to datasets. The specified operations will be applied to the train and test sets that are used throughout this particular exploration of the random forest algorithm
simple_rec <-
recipe(SalePrice ~ .,
data= ames_train)%>%
step_dummy(all_nominal())%>%step_knnimpute(all_predictors(), neighbors=10)
In the code chunk specified below, the argument engine indicates what machine learning algorithm will be used to fit the model. Tidymodels is a wrapper for various machine learning packages. This is simply a specification, no data is being fed to the algorithm at this stage .
```r
ranger_model <- rand_forest(
trees = tune(),
mtry = tune()) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("regression")
Here, I recruited a workflow: this is a container that integrates a preprocessing recipe and a model specification.
random.f_workflow <- workflow()%>% add_recipe(simple_rec)%>%
add_model(ranger_model)
In the following code chunk, I specified a simple grid to be used in the quest for optimal parameters. I used the following grid numerous times with varying values. However, as I progressively experimented with varying values, I did not re-run the previously tried values in order to minimise computational cost. Thus, the following grid is a curtailed version of the values I have experimented with.
random.f_grid <- expand_grid(mtry = 33:36, trees = seq(1400, 1700, by = 100))
It was time to initialise the exploration of the specified hyperparameter space.
tidy_grid_results <- random.f_workflow %>% tune_grid(
resamples = random.f_cv,
grid = random.f_grid)
Here, I visualised the movement of RMSE throughout the search
autoplot(tidy_grid_results, metric = "rmse")
I extracted the parameters that generated the lowest RMSE
param_final <- tidy_grid_results %>%select_best(metric = "rmse")
It was time to update the workflow: I added the optimised parameters to the workflow using the finalize_workflow() function.
random.f_workflow <- random.f_workflow %>%
finalize_workflow(param_final)
It was time to finally fit the model
rf_fit <- random.f_workflow %>%
fit(ames_train)
I decided to find out which features demonstrated the most importance according to the random forest model
library(vip)
rf_fit %>%
pull_workflow_fit() %>%
vip::vip()
It was time to test the fitted model on the validation/test set. The predict function` in the tidymodels framework applies the same preprocessing operations that were defined in the recipe and previously applied to the training sets (cross-validation splits). However, this time these operations are applied to the test set before the newly preprocessed data is passed to the predict function.
predict(rf_fit, new_data = ames_test)
## # A tibble: 584 x 1
## .pred
## <dbl>
## 1 123570.
## 2 176195.
## 3 175797.
## 4 179798.
## 5 152288.
## 6 122690.
## 7 356318.
## 8 236323.
## 9 313051.
## 10 164239.
## # ... with 574 more rows
Below, I plotted the predicted versus observed values produced by the random forest model
ranger_pred <- predict(rf_fit, new_data = ames_test %>% select(-SalePrice))
ranger_pred <- bind_cols(ranger_pred, ames_test %>% select(SalePrice))
ggplot(ranger_pred, aes(x = SalePrice, y = .pred)) +
# Create a diagonal line:
geom_abline(lty = 2) +
geom_point(alpha = 0.5) +
labs(y = "Predicted Sale Price", x = "SalePrice") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
I was now in a position to trial the constructed model on the test set.
ames_metrics <- metric_set(yardstick::rmse, yardstick:: mae)
ames_metrics(ranger_pred, truth = SalePrice, estimate = .pred)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 25416.
## 2 mae standard 15981.
It was time to construct a model using the extreme gradient boosting algorithm.
Again, I started by dividing the dataset into train/test sets
# divide dataset, stratify by SalePrice
set.seed(174)
ames_split.2 <-initial_split(ames.parsimonious6, prop = 0.75, strata = SalePrice)
ames_train2 <- training(ames_split.2)
ames_test2 <- testing(ames_split.2)
Again, I called a recipe to perform some final preprocessing.
XGB_rec <-
recipe(SalePrice ~ .,
data= ames_train2)%>%
step_dummy(all_nominal())%>%step_knnimpute(all_predictors(), neighbors=10)%>%prep()
Here, I applied cross-validation to randomly divide the training data into more training and test subsets.
set.seed(174)
cv_folds <-recipes::bake(
XGB_rec,
new_data = ames_train2)%>%
rsample::vfold_cv(v = 5)
In the following code block, I transformed the train/tests using the recipe (XGB_rec). I did not include this step when working with the previous random forest model. This step would not have been necessary if I had not gone on to construct an explainer with modelStudio (featured at the end of this document).
train.ready<-juice(XGB_rec)
test.ready<-bake(XGB_rec, new_data = ames_test2)
Here, I defined the XGBoost model specification, alongside the hyperparameters to be tuned
Model.XGB <-
boost_tree(
mode = "regression",
trees = 1000,
min_n = tune(),
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune()
) %>%set_engine("xgboost", objective = "reg:squarederror")
I employed the tidymodels dials package to specify the parameters.
# grid specification
XGB.aspects <-
dials::parameters(
min_n(),
tree_depth(),
learn_rate(),
loss_reduction())
At this stage, I set up the grid space of exploration by recruiting the dials::grid_max_entropy() grid function which covers the specified hyperparameter space.
xgboost_grid <-
dials::grid_max_entropy(
XGB.aspects, size = 200)
knitr::kable(head(xgboost_grid))
| min_n | tree_depth | learn_rate | loss_reduction |
|---|---|---|---|
| 40 | 7 | 0.0000001 | 0.0077783 |
| 37 | 4 | 0.0000537 | 0.0009884 |
| 10 | 13 | 0.0000000 | 0.0000000 |
| 17 | 3 | 0.0000046 | 17.8336015 |
| 37 | 3 | 0.0037961 | 0.0000094 |
| 16 | 7 | 0.0013587 | 0.0000000 |
At this stage, I recruited the tidymodels workflows package in order to apply a basic formula to the XGBoost model specified above.
xgboost_wf <-
workflows::workflow() %>%
add_model(Model.XGB) %>%
add_formula(SalePrice ~ .)
I used the following tune_grid() to specify the search space for optimal hyperparameters
# hyperparameter tuning
TUNE.XGB <- tune::tune_grid(
object = xgboost_wf,
resamples = cv_folds,
grid = xgboost_grid,
metrics = yardstick::metric_set(yardstick::rmse, yardstick::rsq, yardstick::rsq_trad, yardstick::mae),
control = tune::control_grid(verbose = FALSE))
Here, I extracted the parameters that yielded the minimal “rmse”
param_final <- TUNE.XGB %>%select_best(metric = "rmse")
I finalized the XGBoost model to use the tuning parameters that produced the lowest RMSE.
xgboost_wf2 <- xgboost_wf%>%
finalize_workflow(param_final)
It was time to fit the final extreme gradient boosting model
XGB.model <- xgboost_wf2 %>%
fit(train.ready)
Here, I extracted influential variables
library(vip)
XGB.model %>%
pull_workflow_fit() %>%
vip::vip()
XGB PREDICT ON TEST SET:
I was now ready to make predictions on the unseen test set. Thereafter, I wanted to inspect the consistency of the predictions produced by the XGB.model:
XGB_res <- predict(XGB.model, new_data = test.ready %>% select(-SalePrice))
XGB_res <- bind_cols(XGB_res, test.ready %>% select(SalePrice))
XGB_metrics <- metric_set(yardstick::rmse, yardstick:: mae)
XGB_metrics(XGB_res, truth = SalePrice, estimate = .pred)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 23724.
## 2 mae standard 15045.
ggplot(XGB_res, aes(x = SalePrice, y = .pred)) +
# Create a diagonal line:
geom_abline(lty = 2) +
geom_point(alpha = 0.5) +
labs(y = "Predicted Sale Price", x = "SalePrice") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
Finally, I constructed an explainer using the fitted extreme gradient boosting model. The explainer provides detailed descriptions of what direction particular features are likely to be impacting the prices of particular houses. In the following interactive display of boxes, explanations for the first two houses in the test set are offered . For a thorough explanation of the types of plots used below, I would recommend the curious reader to consult this excellent resource.
explainer<- explain_tidymodels(XGB.model,
data = test.ready,
y = test.ready$SalePrice,
label = "tidymodels")
## Preparation of a new explainer is initiated
## -> model label : tidymodels
## -> data : 731 rows 62 cols
## -> data : tibble converted into a data.frame
## -> target variable : 731 values
## -> predict function : yhat.workflow will be used ( [33m default [39m )
## -> predicted values : No value for predict function target column. ( [33m default [39m )
## -> model_info : package tidymodels , ver. 0.1.2 , task regression ( [33m default [39m )
## -> predicted values : numerical, min = 56540.02 , mean = 177666.1 , max = 524786.9
## -> residual function : difference between y and yhat ( [33m default [39m )
## -> residuals : numerical, min = -193005 , mean = 1308.632 , max = 168033.1
## [32m A new explainer has been created! [39m
new_obs <- test.ready[1:2,]
rownames(new_obs) <- c("case 1", "case 2")
modelStudio(explainer, new_obs)