#suppress warnings globally
options(warn=-1)
options(scipen = 999)              #suppress the appearance of scientific notation throughout the analysis
rm(list=ls())

0.0.1 BACKGROUND


In order to put my newly acquired data science skills to the test, I decided to the explore the well-known Ames Housing dataset. This particular dataset documents 2006-2010 actual house prices in the City of Ames, Iowa. The dataset details the price of houses and is accompanied by a broad range of variables. The type of variables contained in this dataset primarily pertain to the quality, dimensions and quantity of physical characteristics associated with the respective properties. Generally, these covariates are the sort of features that prospective buyers are likely to be interested in, features such as the age of the property, the quality of the exterior and how many bathrooms the property holds. I refer the curious reader to the following article, where an in-depth rundown of the various features can be obtained.


0.0.2 STRUCTURE AND AIMS OF ANALYSIS


EDA:EXPLORATORY DATA ANALYSIS

MODELLING


0.0.3 KEY INSIGHTS


Employing an alpha level of 0.05, Overall Quality (0.86), Overall area (78.8), External Quality (69.4) and the number of garage cars (0.66) turned out to have the highest significant correlations with Sale Price. The age of the house (-0.56) and time since remodelled (-0.53) had the highest significant negative correlations with Sale Price. Pearson’s correlation coefficient (r) was employed- this metric quantifies linear association. Of course, Pearson’s correlation is unsuitable for adequately capturing curvilinear relationships between variables. The house prices ranged from $12,789 to $755,000, with the average price equating to $ 180,796 (standard deviation= $79,886.69).

I employed a range of predictive algorithms in order to discover an algorithm that would yield optimally low residual error. Of the algorithms tested, Extreme Gradient Boosting and Random Forest algorithms demonstrated the best predictive performances. I was able to obtain a mean absolute error (MAE) of $15,045 using extreme gradient boosting on unseen data (segregated validation set) and a MAE of $15,981 by employing the random forest algorithm. In terms of the features with the most important global/overall importance, both the random forest model and the extreme gradient boosting model agreed that Overall Area (of the interior house) and Overall Quality were the top two primary drivers responsible for estimating Sale Price. Additionally, I ran a classic multiple linear regression: the most important features for prediction turned out to be Overall Quality in first place, with Overall Area being the fourth most important feature. Thus, given the consensus between these differing algorithms, our confidence increases that these variables are indeed important predictors for determining sale price. As sale price increased, all of the models tested struggled to consistently and accurately predict the respective house prices. One potential hypothesis - admittedly, a tentative one- that could account for the deficit in predictive capacity in these upper price regions, is that such prices are less likely to be shaped by ordinary consumer behaviour. Buyers operating in the higher price region are not constrained by the usual economic push/pull factors; in other words, the liberated movement of human whim is introducing volatility.


AVENUES FOR FURTHER STUDY


0.0.4 COMMENCE OF THE ANALYSIS


I recruited the following libraries at various stages throughout the modelling process


library(ggplot2)
library(caret)
library(tidyr)
library(DMwR)
library(missForest)
library(randomForest)
library(tidymodels)
library(ranger) 
library(recipes) 
library(workflows) 
library(themis)
library(xgboost)
library(ggdark)
library(viridis)
library(DALEXtra)
library(ggthemes)
library(lares)
library(kableExtra)
library(tidyr)
library(tidyverse)
library(reshape)
library(dplyr)
library(modelStudio)
library(DALEXtra)
library(DALEX)
library(tidymodels)
library(vip)


I decided to define a couple of customised visualisation themes at the outset of the project; I wanted to be able to easily format the visual output of the analysis in a consistent manner. I feel that visual consistency enhances the fluidity of the reading process.


ISAAC.THEME <- theme(plot.title = element_text(hjust = 0.5), #  title centralised
        plot.background = element_rect(fill="black"), #background to be black
        panel.background = element_rect(fill="gray20"), # panel background to be dark grey
        panel.grid.minor = element_line(color="black"), # grid lines to be removed
        panel.grid.major = element_line(color="black"), # grid lines to be removed
        axis.text = element_text(color="white"), # axis text to appear white
        title = element_text(color="white", face="bold"), # title to be white and bold.
        legend.background = element_rect(fill="black"), # legend background appear as black
        legend.text = element_text(color="white", size=20), # legend text to appear white
        legend.key = element_rect(fill="black", color="black"), 
        legend.position = "top")
ISAAC.THEME2 <- theme(plot.title = element_text(hjust = 0.5), # title centralised
        plot.background = element_rect(fill="black"), #  background to be black
        panel.background = element_rect(fill="gray20"), #  panel background to be grey
        panel.grid.minor = element_line(color="black"), #  grid lines to be removed
        panel.grid.major = element_line(color="black"), # grid lines to be removed
        axis.text = element_text(color="white"), # axis text to be white
        title = element_text(color="white", face="bold"), # title to appear white and bold.
        legend.background = element_rect(fill="black"), # legend of background should appear black
        legend.text = element_text(color=NULL, size=12 ), # legend text should appear white
        legend.key = element_rect(fill=NULL), 
        legend.position = 'none')


I had previously saved the dataset in on my computer. I commenced the analysis by loading the respective dataset in its raw form.


load("ames.RData")


In accordance with tradition, I kicked-off the analysis by inspecting the broad structure of the data provided. I wanted to glance the contents of the dataset.


df_str(ames, return = "plot")
## Other available 'return' options: "skimr", "numbers", "names", "distr"


RAW DATA


I wished to obtain more insight into the contents of the data in its unprocessed form; I decided to inspect the first fifteen rows:


Order PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition SalePrice
1 526301100 20 RL 141 31770 Pave NA IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 5 1960 1960 Hip CompShg BrkFace Plywood Stone 112 TA TA CBlock TA Gd Gd BLQ 639 Unf 0 441 1080 GasA Fa Y SBrkr 1656 0 0 1656 1 0 1 0 3 1 TA 7 Typ 2 Gd Attchd 1960 Fin 2 528 TA TA P 210 62 0 0 0 0 NA NA NA 0 5 2010 WD Normal 215000
2 526350040 20 RH 80 11622 Pave NA Reg Lvl AllPub Inside Gtl NAmes Feedr Norm 1Fam 1Story 5 6 1961 1961 Gable CompShg VinylSd VinylSd None 0 TA TA CBlock TA TA No Rec 468 LwQ 144 270 882 GasA TA Y SBrkr 896 0 0 896 0 0 1 0 2 1 TA 5 Typ 0 NA Attchd 1961 Unf 1 730 TA TA Y 140 0 0 0 120 0 NA MnPrv NA 0 6 2010 WD Normal 105000
3 526351010 20 RL 81 14267 Pave NA IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 6 1958 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace 108 TA TA CBlock TA TA No ALQ 923 Unf 0 406 1329 GasA TA Y SBrkr 1329 0 0 1329 0 0 1 1 3 1 Gd 6 Typ 0 NA Attchd 1958 Unf 1 312 TA TA Y 393 36 0 0 0 0 NA NA Gar2 12500 6 2010 WD Normal 172000
4 526353030 20 RL 93 11160 Pave NA Reg Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 7 5 1968 1968 Hip CompShg BrkFace BrkFace None 0 Gd TA CBlock TA TA No ALQ 1065 Unf 0 1045 2110 GasA Ex Y SBrkr 2110 0 0 2110 1 0 2 1 3 1 Ex 8 Typ 2 TA Attchd 1968 Fin 2 522 TA TA Y 0 0 0 0 0 0 NA NA NA 0 4 2010 WD Normal 244000
5 527105010 60 RL 74 13830 Pave NA IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 5 5 1997 1998 Gable CompShg VinylSd VinylSd None 0 TA TA PConc Gd TA No GLQ 791 Unf 0 137 928 GasA Gd Y SBrkr 928 701 0 1629 0 0 2 1 3 1 TA 6 Typ 1 TA Attchd 1997 Fin 2 482 TA TA Y 212 34 0 0 0 0 NA MnPrv NA 0 3 2010 WD Normal 189900
6 527105030 60 RL 78 9978 Pave NA IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 6 6 1998 1998 Gable CompShg VinylSd VinylSd BrkFace 20 TA TA PConc TA TA No GLQ 602 Unf 0 324 926 GasA Ex Y SBrkr 926 678 0 1604 0 0 2 1 3 1 Gd 7 Typ 1 Gd Attchd 1998 Fin 2 470 TA TA Y 360 36 0 0 0 0 NA NA NA 0 6 2010 WD Normal 195500
7 527127150 120 RL 41 4920 Pave NA Reg Lvl AllPub Inside Gtl StoneBr Norm Norm TwnhsE 1Story 8 5 2001 2001 Gable CompShg CemntBd CmentBd None 0 Gd TA PConc Gd TA Mn GLQ 616 Unf 0 722 1338 GasA Ex Y SBrkr 1338 0 0 1338 1 0 2 0 2 1 Gd 6 Typ 0 NA Attchd 2001 Fin 2 582 TA TA Y 0 0 170 0 0 0 NA NA NA 0 4 2010 WD Normal 213500
8 527145080 120 RL 43 5005 Pave NA IR1 HLS AllPub Inside Gtl StoneBr Norm Norm TwnhsE 1Story 8 5 1992 1992 Gable CompShg HdBoard HdBoard None 0 Gd TA PConc Gd TA No ALQ 263 Unf 0 1017 1280 GasA Ex Y SBrkr 1280 0 0 1280 0 0 2 0 2 1 Gd 5 Typ 0 NA Attchd 1992 RFn 2 506 TA TA Y 0 82 0 0 144 0 NA NA NA 0 1 2010 WD Normal 191500
9 527146030 120 RL 39 5389 Pave NA IR1 Lvl AllPub Inside Gtl StoneBr Norm Norm TwnhsE 1Story 8 5 1995 1996 Gable CompShg CemntBd CmentBd None 0 Gd TA PConc Gd TA No GLQ 1180 Unf 0 415 1595 GasA Ex Y SBrkr 1616 0 0 1616 1 0 2 0 2 1 Gd 5 Typ 1 TA Attchd 1995 RFn 2 608 TA TA Y 237 152 0 0 0 0 NA NA NA 0 3 2010 WD Normal 236500
10 527162130 60 RL 60 7500 Pave NA Reg Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 7 5 1999 1999 Gable CompShg VinylSd VinylSd None 0 TA TA PConc TA TA No Unf 0 Unf 0 994 994 GasA Gd Y SBrkr 1028 776 0 1804 0 0 2 1 3 1 Gd 7 Typ 1 TA Attchd 1999 Fin 2 442 TA TA Y 140 60 0 0 0 0 NA NA NA 0 6 2010 WD Normal 189000
11 527163010 60 RL 75 10000 Pave NA IR1 Lvl AllPub Corner Gtl Gilbert Norm Norm 1Fam 2Story 6 5 1993 1994 Gable CompShg HdBoard HdBoard None 0 TA TA PConc Gd TA No Unf 0 Unf 0 763 763 GasA Gd Y SBrkr 763 892 0 1655 0 0 2 1 3 1 TA 7 Typ 1 TA Attchd 1993 Fin 2 440 TA TA Y 157 84 0 0 0 0 NA NA NA 0 4 2010 WD Normal 175900
12 527165230 20 RL NA 7980 Pave NA IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 1Story 6 7 1992 2007 Gable CompShg HdBoard HdBoard None 0 TA Gd PConc Gd TA No ALQ 935 Unf 0 233 1168 GasA Ex Y SBrkr 1187 0 0 1187 1 0 2 0 3 1 TA 6 Typ 0 NA Attchd 1992 Fin 2 420 TA TA Y 483 21 0 0 0 0 NA GdPrv Shed 500 3 2010 WD Normal 185000
13 527166040 60 RL 63 8402 Pave NA IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 6 5 1998 1998 Gable CompShg VinylSd VinylSd None 0 TA TA PConc Gd TA No Unf 0 Unf 0 789 789 GasA Gd Y SBrkr 789 676 0 1465 0 0 2 1 3 1 TA 7 Typ 1 Gd Attchd 1998 Fin 2 393 TA TA Y 0 75 0 0 0 0 NA NA NA 0 5 2010 WD Normal 180400
14 527180040 20 RL 85 10176 Pave NA Reg Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 1Story 7 5 1990 1990 Gable CompShg HdBoard HdBoard None 0 TA TA PConc Gd TA Gd GLQ 637 Unf 0 663 1300 GasA Gd Y SBrkr 1341 0 0 1341 1 0 1 1 2 1 Gd 5 Typ 1 Po Attchd 1990 Unf 2 506 TA TA Y 192 0 0 0 0 0 NA NA NA 0 2 2010 WD Normal 171500
15 527182190 120 RL NA 6820 Pave NA IR1 Lvl AllPub Corner Gtl StoneBr Norm Norm TwnhsE 1Story 8 5 1985 1985 Gable CompShg HdBoard HdBoard None 0 Gd TA PConc Gd TA Av GLQ 368 BLQ 1120 0 1488 GasA TA Y SBrkr 1502 0 0 1502 1 0 1 1 1 1 Gd 4 Typ 0 NA Attchd 1985 RFn 2 528 TA TA Y 0 54 0 0 140 0 NA NA NA 0 6 2010 WD Normal 212000


CLEARER OVERVIEW USING PSYCH


I recruited the psych package to obtain summary statistics regarding the variability of the data, including the kurtosis and skew of the respective data.


psych::describe(ames) %>% 
    kable()  %>% kable_styling(bootstrap_options = "striped") %>%
    scroll_box(width = "100%", height = "300px")
vars n mean sd median trimmed mad min max range skew kurtosis se
Order 1 2930 1465.5000000 845.9624696 1465.5 1465.5000000 1086.0045 1 2930 2929 0.0000000 -1.2012287 15.6284996
PID 2 2930 714464496.9887372 188730844.6493901 535453620.0 712423200.4014505 12373193.9730 526301100 1007100110 480799010 0.0558287 -1.9944772 3486655.7767905
MS.SubClass 3 2930 57.3873720 42.6380246 50.0 49.7632253 44.4780 20 190 170 1.3561897 1.3793717 0.7877044
MS.Zoning* 4 2930 5.9672355 0.8656517 6.0 6.0720990 0.0000 1 7 6 -2.6139550 8.4051820 0.0159922
Lot.Frontage 5 2440 69.2245902 23.3653350 68.0 68.3524590 17.7912 21 313 292 1.4972247 11.1977390 0.4730174
Lot.Area 6 2930 10147.9218430 7880.0177594 9436.5 9481.0499147 3024.5040 1300 215245 213945 12.8077740 264.3869706 145.5772081
Street* 7 2930 1.9959044 0.0638763 2.0 2.0000000 0.0000 1 2 1 -15.5217245 239.0055030 0.0011801
Alley* 8 198 1.3939394 0.4898603 1.0 1.3687500 0.0000 1 2 1 0.4308369 -1.8235128 0.0348129
Lot.Shape* 9 2930 2.9402730 1.4121054 4.0 3.0503413 0.0000 1 4 3 -0.6062653 -1.6026962 0.0260876
Land.Contour* 10 2930 3.7778157 0.7031991 4.0 3.9982935 0.0000 1 4 3 -3.1226507 8.4362234 0.0129911
Utilities* 11 2930 1.0017065 0.0554058 1.0 1.0000000 0.0000 1 3 2 34.0201767 1187.9571398 0.0010236
Lot.Config* 12 2930 4.0552901 1.6039218 5.0 4.3191126 0.0000 1 5 4 -1.1937666 -0.4447109 0.0296312
Land.Slope* 13 2930 1.0535836 0.2483042 1.0 1.0000000 0.0000 1 3 2 4.9830595 26.6246706 0.0045872
Neighborhood* 14 2930 15.2952218 7.0220750 16.0 15.3993174 8.8956 1 28 27 -0.1960192 -1.1853548 0.1297274
Condition.1* 15 2930 3.0402730 0.8724077 3.0 3.0000000 0.0000 1 9 8 2.9876854 15.7385319 0.0161171
Condition.2* 16 2930 3.0020478 0.2090376 3.0 3.0000000 0.0000 1 8 7 12.0767140 308.9729348 0.0038618
Bldg.Type* 17 2930 1.5170648 1.2188973 1.0 1.1719283 0.0000 1 5 4 2.1518391 3.0038936 0.0225182
House.Style* 18 2930 4.0238908 1.9106293 3.0 4.0102389 0.0000 1 8 7 0.3211783 -0.9501428 0.0352974
Overall.Qual 19 2930 6.0948805 1.4110261 6.0 6.0755119 1.4826 1 10 9 0.1904388 0.0481942 0.0260676
Overall.Cond 20 2930 5.5631399 1.1115366 5.0 5.4680034 0.0000 1 9 8 0.5738415 1.4837965 0.0205348
Year.Built 21 2930 1971.3563140 30.2453606 1973.0 1974.2474403 37.0650 1872 2010 138 -0.6038435 -0.5046106 0.5587595
Year.Remod.Add 22 2930 1984.2665529 20.8602859 1993.0 1985.6284130 20.7564 1950 2010 60 -0.4514001 -1.3424069 0.3853776
Roof.Style* 23 2930 2.3948805 0.8197217 2.0 2.2431741 0.0000 1 6 5 1.5584134 0.8882255 0.0151437
Roof.Matl* 24 2930 2.0627986 0.5382105 2.0 2.0000000 0.0000 1 8 7 8.7203343 76.9796054 0.0099430
Exterior.1st* 25 2930 11.1604096 3.6506082 14.0 11.4718430 1.4826 1 16 15 -0.5885754 -0.7610933 0.0674422
Exterior.2nd* 26 2930 11.8709898 3.9980891 15.0 12.1889932 2.9652 1 17 16 -0.5636232 -0.8976905 0.0738616
Mas.Vnr.Type* 27 2930 4.4269625 1.0775448 5.0 4.4577645 0.0000 1 6 5 -0.6932065 -0.6198667 0.0199068
Mas.Vnr.Area 28 2907 101.8968008 179.1126106 0.0 61.1388053 0.0000 0 1600 1600 2.6042950 9.2601663 3.3220307
Exter.Qual* 29 2930 3.5290102 0.7016622 4.0 3.6424915 0.0000 1 4 3 -1.7925145 3.6662698 0.0129627
Exter.Cond* 30 2930 4.7098976 0.7723926 5.0 4.9261945 0.0000 1 5 4 -2.5031881 5.1134440 0.0142694
Foundation* 31 2930 2.3924915 0.7261949 2.0 2.4539249 1.4826 1 6 5 0.0107612 0.7550324 0.0134159
Bsmt.Qual* 32 2851 4.6878288 1.3131144 4.0 4.8487505 2.9652 1 6 5 -0.4618164 -0.8243922 0.0245926
Bsmt.Cond* 33 2851 5.7972641 0.6970662 6.0 6.0000000 0.0000 1 6 5 -3.3552418 10.0737610 0.0130550
Bsmt.Exposure* 34 2851 4.2714837 1.1375368 5.0 4.4660237 0.0000 1 5 4 -1.1653915 -0.3047203 0.0213043
BsmtFin.Type.1* 35 2851 4.7551736 1.8096342 4.0 4.8193775 2.9652 1 7 6 -0.0393708 -1.3628952 0.0338916
BsmtFin.SF.1 36 2929 442.6295664 455.5908391 370.0 384.0840085 548.5620 0 5644 5644 1.4147320 6.8388117 8.4181236
BsmtFin.Type.2* 37 2851 6.6720449 1.0167651 7.0 6.9706269 0.0000 1 7 6 -3.3854719 10.8381116 0.0190424
BsmtFin.SF.2 38 2929 49.7224309 169.1684756 0.0 2.0366738 0.0000 0 1526 1526 4.1357391 18.7325405 3.1257897
Bsmt.Unf.SF 39 2929 559.2625469 439.4941528 466.0 510.7735608 415.1280 0 2336 2336 0.9221075 0.4044530 8.1206990
Total.Bsmt.SF 40 2929 1051.6145442 440.6150670 990.0 1035.0537313 349.8936 0 6110 6110 1.1550204 9.1097038 8.1414106
Heating* 41 2930 2.0252560 0.2452214 2.0 2.0000000 0.0000 1 6 5 12.0957649 168.4515977 0.0045303
Heating.QC* 42 2930 2.5389078 1.7439418 1.0 2.4236348 0.0000 1 5 4 0.4808449 -1.5194533 0.0322180
Central.Air* 43 2930 1.9331058 0.2498813 2.0 2.0000000 0.0000 1 2 1 -3.4653089 10.0117831 0.0046164
Electrical* 44 2930 5.6846416 1.0492497 6.0 6.0000000 0.0000 1 6 5 -3.0814461 7.6505468 0.0193841
X1st.Flr.SF 45 2930 1159.5576792 391.8908853 1084.0 1127.1689420 349.8936 334 5095 4761 1.4679244 6.9480810 7.2398797
X2nd.Flr.SF 46 2930 335.4559727 428.3957150 0.0 272.8959044 0.0000 0 2065 2065 0.8655698 -0.4179641 7.9142781
Low.Qual.Fin.SF 47 2930 4.6767918 46.3105100 0.0 0.0000000 0.0000 0 1064 1064 12.1057567 175.1836856 0.8555507
Gr.Liv.Area 48 2930 1499.6904437 505.5088875 1442.0 1452.2453072 461.0886 334 5642 5308 1.2728055 4.1238681 9.3388841
Bsmt.Full.Bath 49 2928 0.4313525 0.5248202 0.0 0.3963311 0.0000 0 3 3 0.6160073 -0.7502980 0.0096990
Bsmt.Half.Bath 50 2928 0.0611339 0.2452536 0.0 0.0000000 0.0000 0 2 2 3.9367587 14.8820022 0.0045324
Full.Bath 51 2930 1.5665529 0.5529406 2.0 1.5575939 0.0000 0 4 4 0.1717761 -0.5442378 0.0102151
Half.Bath 52 2930 0.3795222 0.5026293 0.0 0.3387372 0.0000 0 2 2 0.6969988 -1.0319744 0.0092857
Bedroom.AbvGr 53 2930 2.8542662 0.8277311 3.0 2.8323379 0.0000 0 8 8 0.3053813 1.8828128 0.0152917
Kitchen.AbvGr 54 2930 1.0443686 0.2140762 1.0 1.0000000 0.0000 0 3 3 4.3094087 19.8182212 0.0039549
Kitchen.Qual* 55 2930 3.8563140 1.2692368 5.0 4.0251706 0.0000 1 5 4 -0.6206625 -0.6780245 0.0234482
TotRms.AbvGrd 56 2930 6.4430034 1.5729644 6.0 6.3336177 1.4826 2 15 13 0.7527712 1.1477391 0.0290593
Functional* 57 2930 7.6918089 1.1754144 8.0 8.0000000 0.0000 1 8 7 -3.8253633 13.7939856 0.0217149
Fireplaces 58 2930 0.5993174 0.6479209 1.0 0.5183447 1.4826 0 4 4 0.7384585 0.0971731 0.0119698
Fireplace.Qu* 59 1508 3.7194960 1.1264883 3.0 3.7831126 1.4826 1 5 4 -0.1217511 -1.0149437 0.0290086
Garage.Type* 60 2773 3.2830869 1.7903034 2.0 3.1144660 0.0000 1 6 5 0.7476409 -1.3113362 0.0339979
Garage.Yr.Blt 61 2771 1978.1324432 25.5284113 1979.0 1980.7108705 31.1346 1895 2207 312 -0.3842554 1.8176404 0.4849596
Garage.Finish* 62 2773 3.1799495 0.8229231 3.0 3.2257774 1.4826 1 4 3 -0.3510946 -1.4163756 0.0156273
Garage.Cars 63 2929 1.7668146 0.7605664 2.0 1.7680171 0.0000 0 5 5 -0.2196113 0.2402900 0.0140533
Garage.Area 64 2929 472.8197337 215.0465485 480.0 468.3458422 182.3598 0 1488 1488 0.2417464 0.9446576 3.9734961
Garage.Qual* 65 2772 5.8405483 0.6634018 6.0 6.0000000 0.0000 1 6 5 -4.0371873 14.7948678 0.0126003
Garage.Cond* 66 2772 5.8979076 0.5319486 6.0 6.0000000 0.0000 1 6 5 -5.2866785 27.1445647 0.0101035
Paved.Drive* 67 2930 2.8313993 0.5363888 3.0 3.0000000 0.0000 1 3 2 -2.9849322 7.1493968 0.0099094
Wood.Deck.SF 68 2930 93.7518771 126.3615619 0.0 71.2141638 0.0000 0 1424 1424 1.8407918 6.7337405 2.3344317
Open.Porch.SF 69 2930 47.5334471 67.4834001 27.0 33.8724403 40.0302 0 742 742 2.5327906 10.9241024 1.2467034
Enclosed.Porch 70 2930 23.0116041 64.1390592 0.0 4.8297782 0.0000 0 1012 1012 4.0103363 28.4151135 1.1849193
X3Ssn.Porch 71 2930 2.5924915 25.1413310 0.0 0.0000000 0.0000 0 508 508 11.3921213 149.6265877 0.4644666
Screen.Porch 72 2930 16.0020478 56.0873702 0.0 0.0000000 0.0000 0 576 576 3.9534162 17.8124277 1.0361706
Pool.Area 73 2930 2.2433447 35.5971806 0.0 0.0000000 0.0000 0 800 800 16.9218026 299.0552932 0.6576303
Pool.QC* 74 13 2.4615385 1.1982894 3.0 2.4545455 1.4826 1 4 3 -0.0507909 -1.6761110 0.3323457
Fence* 75 572 2.4125874 0.8355387 3.0 2.4890830 0.0000 1 4 3 -0.6760973 -0.8887827 0.0349356
Misc.Feature* 76 106 3.8490566 0.5484602 4.0 4.0000000 0.0000 1 5 4 -3.1640389 10.3679528 0.0532712
Misc.Val 77 2930 50.6351536 566.3442883 0.0 0.0000000 0.0000 0 17000 17000 21.9772674 564.8476877 10.4627709
Mo.Sold 78 2930 6.2160410 2.7144924 6.0 6.1569966 2.9652 1 12 11 0.1923989 -0.4576397 0.0501481
Yr.Sold 79 2930 2007.7904437 1.3166129 2008.0 2007.7380546 1.4826 2006 2010 4 0.1347244 -1.1589064 0.0243234
Sale.Type* 80 2930 9.3587031 1.8781600 10.0 9.8745734 0.0000 1 10 9 -3.3213129 10.7566598 0.0346975
Sale.Condition* 81 2930 4.7798635 1.0762992 5.0 5.0000000 0.0000 1 6 5 -2.7934757 7.2515027 0.0198838
SalePrice 82 2930 180796.0600683 79886.6923567 160000.0 170429.1510239 54856.2000 12789 755000 742211 1.7417153 5.1025881 1475.8445972


0.0.5 EXAMINING MISSING VALUES


I was keen to establish how complete the dataset was; I used the following stream of code to provide clear insight into this question

  ames%>%
  gather(key = "key", value = "val") %>% 
  mutate(is.missing = is.na(val)) %>% 
  group_by(key, is.missing) %>%
  summarise(Absent.data = n())%>%
  filter(is.missing == TRUE, Absent.data > 1) %>% 
  select(-is.missing) %>%
  arrange(desc(Absent.data)) %>% 
  ggplot(aes(x = reorder(key, Absent.data), y = Absent.data, fill = key)) +
  geom_col() +
  coord_flip() +
  xlab("Variable") +
  ylab("Absent values")+
  theme(legend.position='none')+
  geom_text( aes( label = Absent.data), colour ="white", hjust = 1.0, size = 4.0)+
  coord_flip()+
  ISAAC.THEME2
## `summarise()` has grouped output by 'key'. You can override using the `.groups` argument.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.


The primary documentation relating to the contents of the Ames Housing dataset is revealing: surprisingly, the designation ‘NA’ for some of the variables provided did not signal the standard not available meaning . For instance, in relation to Pool.QC, the designation NA translates as No Pool, NA means No Basement for Bsmt.Qual and No Alley in relation to Alley.

Of course, I could not be entirely sure whether some of these NA entries indicate the absence of certain variables like pools and fireplaces were the result of systematic labelling or accidental omissions. However, in the case of the pool and fireplace variables, I could reduce the likelihood that the NA entries were caused by accidental omissions. Before I permitted myself to recode with distasteful haste, I engaged in a little verification. I simply reasoned that if a house truly does not have a pool, clearly the variable Pool.Area (denoting area of pool) should not exceed 0.

POOL


ames%>%
  select(Pool.Area, Pool.QC)%>% 
  filter(Pool.Area > 0 & is.na(Pool.QC))
## [1] Pool.Area Pool.QC  
## <0 rows> (or 0-length row.names)

The 0 result started suggested that there were no issues there.


FIREPLACES


Similarly, I reasoned that if a house truly does not have a fireplace, then naturally it should not have an accompanying rating.


ames %>% 
  select(Fireplaces, Fireplace.Qu) %>% filter(Fireplaces > 0 & is.na(Fireplace.Qu))
## [1] Fireplaces   Fireplace.Qu
## <0 rows> (or 0-length row.names)

Again, I was satisfied that there were no issues here either.

I decided to relabel the NA variables appropriately. This deficit in the respective data entries can be classified as MNAR: missing not at random. I was now in a position to confidently perform some blanket coding.


# create temporary subset for recoding 

x<- ames[c('Alley','Bsmt.Exposure', 'BsmtFin.Type.1','BsmtFin.Type.2', 'Fireplace.Qu', 'Garage.Type', 'Garage.Finish', 'Garage.Qual', 'Garage.Cond','Pool.QC','Fence','Misc.Feature')] #list of variables to be recoded

# recode specified variables 

x=as.matrix(x)
x[is.na(x)] <-"None"
x=as.data.frame(x)


myvars <- names(ames) %in% c("Alley", "Bsmt.Exposure", "BsmtFin.Type.1", "BsmtFin.Type.2", "Fireplace.Qu", "Garage.Type", "Garage.Finish", "Garage.Qual", "Garage.Cond","Pool.QC","Fence","Misc.Feature")

ames.reduced <- ames[!myvars] #remove specified variables in their original format

ames.Updated<- bind_cols(ames.reduced, x )
# append updated variables to reduced dataset


In aforementioned code chunk, I extracted the variables where the designation NA indicates the absence of the respective feature(s) into a separate data frame, before recoding NA as ‘None’. Thereafter, I removed the original variables, before using dplyr’s ‘bind_cols’ function to append the revised variables back to the original dataset. These manoeuvres significantly reduced my NA burden - the following code block and resultant graphical display illustrates the revised level of missingness.


ames.Updated %>%
  gather(key = "key", value = "val") %>% 
  mutate(is.missing = is.na(val)) %>% 
  group_by(key, is.missing) %>%
  summarise(Absent.data = n())%>%
  filter(is.missing == TRUE, Absent.data > 1) %>% 
  select(-is.missing) %>%
  arrange(desc(Absent.data)) %>% 
  ggplot(aes(x = reorder(key, Absent.data), y = Absent.data, fill = key)) +
  geom_col() +
  coord_flip() +
  xlab("Variable") +
  ylab("Absent values")+
  theme(legend.position='none')+
  geom_text( aes( label = Absent.data), colour ="white", hjust = 1.0, size = 4.0)+
  coord_flip()+
  ISAAC.THEME2
## `summarise()` has grouped output by 'key'. You can override using the `.groups` argument.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.


The coding provided great improvement! My NA burden was reduced considerably.


DETECT EMPTY STRINGS


The is.na function does not detect the presence of missing strings, so I decided to do an independent check for their presence in the dataset and then tabulate the results

EmptyStrings <- colSums(ames.Updated == "", na.rm =TRUE)
rev(stack(EmptyStrings[EmptyStrings > 0]))%>%kable()%>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", full_width = F))
ind values
Mas.Vnr.Type 23
Bsmt.Qual 1
Bsmt.Cond 1
Electrical 1
Bsmt.Exposure 4
BsmtFin.Type.1 1
BsmtFin.Type.2 2
Garage.Finish 2
Garage.Qual 1
Garage.Cond 1


I decided to recode all the empty strings as NA- if you happen to have Saint-like patience, I’ll show you how I addressed these entries alongside the other NAs in the imputation phase!

ames.Updated <- ames.Updated %>% mutate_all(na_if,"")


I decided to remove the features Order and Parcel identification(PID) from the dataset as these are effectively arbitrary and unlikely to have been useful for any form of meaningful generalisation.

#remove Order, PID variables
ames.Updated<-ames.Updated%>%select(-c(Order,PID))


0.0.6 ADDRESS INCORRECTLY LABELLED VARIABLES


A close inspection of the dataset revealed that the variable MS.SubClass had been incorrectly labelled as an integer. I converted this variable to a factor variable. MS SubClass is a nominal variable that characterises the type of dwelling, from a 1-story building to a duplex style of house . I converted this variable to a factor before recoding the various levels simply using the letters of the alphabet at this stage.

class(ames.Updated$MS.SubClass)
## [1] "integer"
ames.Updated$MS.SubClass<-as.factor(ames.Updated$MS.SubClass)
levels(ames.Updated$MS.SubClass) <- c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P")


MONTH SOLD


Month sold (Mo.Sold) was entered as a an integer, however I reclassified it as a factor variable


ames.Updated[["Mo.Sold"]] <- factor(ames.Updated[["Mo.Sold"]])
ames.Updated$Month.Sold <- month.abb[ames.Updated$Mo.Sold]
ames.Updated$Mo.Sold<-NULL
ames.Updated$Month.Sold<-as.factor(ames.Updated$Month.Sold)


0.0.7 FEATURE ENGINEERING


I decided to modify the variables Year.Built, Garage.Yr.Blt and Year.Remod.Add slightly: instead of defining these variables by their respective year, I wanted to extract the number of years elapsed. I accomplished this by simply subtracting the year from 2010 (the year in which the sale prices were recorded). I felt that this would make the data friendlier for usage with the various algorithms that were to be employed in due course.


ames.Updated$Age.of.house <- 2010 - ames.Updated$Year.Built
ames.Updated$Year.Built<-NULL
ames.Updated$Age.of.garage <- 2010 - ames.Updated$Garage.Yr.Blt
ames.Updated$Garage.Yr.Blt <-NULL
ames.Updated$Yrs.since.Remod<-2010-ames.Updated$Year.Remod.Add
ames.Updated$Year.Remod.Add<-NULL


0.0.8 REDUCTION OF EXTRANEOUS CLUTTER


The original dataset contained four variables quantifying the types of bathrooms; I decided to aggregate these variables into a united variable simply entitled ‘Bathroom.Tally’.


ames.Updated$Bathroom.Tally <- 0.5 * ames.Updated$Half.Bath + 0.5 * ames.Updated$Bsmt.Half.Bath + ames.Updated$Bsmt.Full.Bath + ames.Updated$Full.Bath 


I decided to combine the interior area related variables, namely Total.Bsmt.SF (total square feet of basement area) and Gr.Liv.Area (above ground living area of the house in square feet) .

ames.Updated$Overall.AREA <- ames.Updated$Gr.Liv.Area + ames.Updated$Total.Bsmt.SF


I then removed the constituent variables of Bathroom.tally, Overall.area.size and Porch area:


ames.Updated<-ames.Updated%>%select(-c(Half.Bath, Bsmt.Half.Bath,Bsmt.Full.Bath,Full.Bath,Gr.Liv.Area,Total.Bsmt.SF))


ELIMINATE NEAR ZERO VARIANCE FACTORS


The nearZeroVar functions evaluates the ratio of the highest frequency value to the next most frequent value(freqCut) . In addition, the function evaluates the percentage of unique values in relation to the overall quantity of samples (uniqueCut). The default settings are freqCut = 19 and uniqueCut = 10. Prior to inspecting the data for near-zero variance predictors , I decided to use dplyr’s drop_na function so that I was examining the dataset using only ‘complete cases’.


ames.Updated.comp<-ames.Updated%>%drop_na()

nzv.data <- nearZeroVar(ames.Updated.comp, saveMetrics = TRUE)
drop.cols <- rownames(nzv.data)[nzv.data$nzv == TRUE]
ames.Updated.2<- ames.Updated[,!names(ames.Updated)%in% drop.cols]


Applying the near zero variance function flagged the following variables: Alley,Street, Land.Contour, Utilities, Land.Slope, Misc.Val, Pool.Area,Screen.Porch, Enclosed.Porch, Open.Porch.SF,Garage.Cond,Kitchen.Abv.Gr,Low.Qual.Fin.SF,Bsmt.Fin.SF.2, Condition.2, Bsmt.Cond, BsmtFin.Type.2,Roof.Matl for removal. All of the suggested variables were removed.


0.0.9 MULTICOLLINEARITY


When independent variables are highly correlated, changes in one variable are associated with changes in the other correlated variables. When conducting multiple linear regression, it becomes challenging to change one feature without simultaneously changing another independent variable, thus the algorithm struggles to discern the unique relationship that exists between particular individual independent variables upon the dependent variable. Unfortunately, the coefficient estimates become unreliable, they are sensitive to small changes in the model. Whilst the presence of multicollinearity influences the coefficient and p-values, it will not impair predictive performance.

I extracted a subset consisting of all complete cases.

ames_complete <- 
ames.Updated.2[complete.cases(ames.Updated.2), ]


NUMERIC AND CATEGORICAL COMPLETE CASES


I then extracted a new subset containing all numeric variables:

amesnumeric.comp <- dplyr::select_if(ames_complete, is.numeric)


I then decided to find out what numeric variables were most correlated with Sale Price

corr_var(amesnumeric.comp, # name of dataset
         SalePrice, # name of variable to focus on
         max_pvalue = 0.05
        ,top = 10)


0.0.10 CORRELATIONS


I removed the dependent variable SalePrice before examining the pairwise correlations:

amesnumeric2 <-amesnumeric.comp[c(-19)]


PAIRWISE CORRELATIONS ACROSS COVARIATES


The caret findCorrelation evaluates pair-wise correlations across all variables, flagging variables that are highly correlated. Of the identified pairs, the function recommends the removal of the variable with the highest average absolute correlation across the dataset. However, I decided to retain the variables that have the highest correlation with the dependent variable i.e. SalePrice.I proceeded to find out which of the variables had the highest correlation with the dependent variable. Firstly, I extracted a subset with the suggested variables for removal, I was then able to easily arrange the correlations with SalePrice in descending order.

library(caret)
identify<-caret::findCorrelation(cor(amesnumeric2),cutoff = 0.75,names = T, verbose = T)
## Compare row 23  and column  9 with corr  0.832 
##   Means:  0.419 vs 0.236 so flagging column 23 
## Compare row 14  and column  15 with corr  0.852 
##   Means:  0.351 vs 0.218 so flagging column 14 
## Compare row 19  and column  20 with corr  0.84 
##   Means:  0.286 vs 0.207 so flagging column 19 
## All correlations <= 0.75


ames.suggestions<- amesnumeric.comp[c('Garage.Cars','Garage.Area','Overall.AREA',  'X1st.Flr.SF','Age.of.garage', 'Age.of.house','SalePrice')]
#evaluate correlations against dependent variable
ames.suggestions %>% corr_var(SalePrice)


I removed the three variables that had the lowest correlation with SalePrice from the amesnumeric.comp subset.

#variable to be removed
minus.vars<-c("Age.of.garage","Age.of.house","X1st.Flr.SF")

#create numeric subset devoid of variables specified in minus.vars
amesnumeric.3 <- amesnumeric.comp[,!names(amesnumeric.comp) %in% minus.vars]


At this stage, I removed the same variables from the main dataset

ames.Updated.3<-ames.Updated.2[,!names(ames.Updated.2)%in% minus.vars]


0.0.11 HIGHEST SIGNIFICANT CORRELATIONS


I decided to evaluate which variables had the highest correlations (using Pearson’s r) with SalePrice. I shall took a subset featuring only complete cases from the previously updated primary dataset:

ames.comp2<-ames.Updated.3[complete.cases(ames.Updated.3),]


I took a further subset containing only the numeric variables from the subset containing only complete cases before evaluating the respective correlations.

amesnumeric.4<-dplyr::select_if(ames.comp2, is.numeric)
corr_var(amesnumeric.4, # name of dataset
         SalePrice, # name of variable to focus on
         max_pvalue = 0.05
        ,top = 10)


0.0.12 FACTOR SELECTION


I wanted a fairly quick way to identify influential factor variables. Here, I decided to the enlist the aid of the Random forest algorithm to help identify influential categorical factors. I created a subset of the factor/character variables with Sale Price. Afterwards, I ran the random forest algorithm.

Factor_subset <- ames_complete[sapply(ames_complete, is.character)|sapply(ames_complete, is.factor)|names(ames_complete)=='SalePrice']
set.seed(174)
Influential.Factors <- ranger(SalePrice ~ ., data = Factor_subset, importance = "permutation")
Extract.influential <- vip(Influential.Factors)
plot(Extract.influential)


0.1 ILLUMINATING VISUALSATIONS


I decided to produce a range of potentially informative visualisations; I was particularly interested in seeing whether I could spot any discernible relationships between the dependent variable and the respective exploratory variables. The adage of ‘a picture speaks a thousand words’ is very often the case in the field of data analysis. Up until this point, throughout the EDA (Exploratory Data Analysis) section, I had relied heavily upon the guidance of the Pearson correlation coefficient. However, as I stated earlier, the Pearson coefficient is a measure of linear correlation, it does not capture non-linear relationships. Informative patterns may well exist between the independent variable(s) and the dependent variable that the Pearson coefficient fails to capture.


0.1.1 NUMERIC VARIABLES


Here, I plotted all the numeric variables against Sale Price

amesnumeric.4%>%
    pivot_longer(-SalePrice, names_to = "Feature", values_to = "Value") %>%
    ggplot() +
    geom_point(mapping=aes(x = Value, y = SalePrice, color = Feature)) + 
    scale_y_continuous(labels = scales::comma)+
    facet_wrap(~ Feature, scales = "free", ncol = 4) +
    scale_x_continuous(n.breaks = 2)+
    theme(legend.position = "",
          plot.title.position = "plot")+
    labs(x = "Numeric Feature Value",
         title = "Ames Housing Numeric Variable versus Sale Price")+
  ISAAC.THEME


Overall Area is clearly moving in tandem with Sale Price, whereas Years since remodeled appears to exhibit a clearly discernible negative correlation with Sale Price.

0.1.2 PRICE VARIATION OVER TIME


ggplot(data = ames_complete, 
       mapping = aes(x = Yrs.since.Remod, y = SalePrice)) +
    geom_line() +
    facet_wrap(vars(Neighborhood)) +
    labs(title = "Price vs Time since remodeled/Neighbourhood")+
  theme_economist()


Please note that the variable time since remodeled ( Yrs.since.Remod) translates as the time since construction if remodelling/additions have not been carried out.


0.1.3 FACTOR VARIABLES (Boxplots)


Factor_subset%>%gather(-SalePrice, key = "var", value = "value") %>%ggplot(aes(x = value, y = SalePrice)) +
    geom_violin() +
    facet_wrap(~ var, scales = "free") +
    theme_bw()+
  ISAAC.THEME2


0.1.4 DENSITY PLOTS


I decided to expore the distribution densities of the exploratory variables.

melt(ames_complete) %>%
  ggplot(aes(x= value)) +
    geom_density(fill='#fdae6b') + facet_wrap(~variable, scales = 'free')+
ISAAC.THEME
## Using MS.SubClass, MS.Zoning, Lot.Shape, Lot.Config, Neighborhood, Condition.1, Bldg.Type, House.Style, Roof.Style, Exterior.1st, Exterior.2nd, Mas.Vnr.Type, Exter.Qual, Exter.Cond, Foundation, Bsmt.Qual, Heating.QC, Electrical, Kitchen.Qual, Paved.Drive, Sale.Type, Sale.Condition, Bsmt.Exposure, BsmtFin.Type.1, Fireplace.Qu, Garage.Type, Garage.Finish, Garage.Qual, Fence, Month.Sold as id variables


0.1.5 DISTRIBUTION OF SALE PRICE


ggplot(ames.Updated.3, aes(x=SalePrice, y=..density..)) + 
  geom_histogram(bins=50, fill="#8F0C90", col="#DAF7A6") +
  geom_density(size = 1, colour = "#FFFAE5") +
  labs(title = "Houses prices", x = 'Price', y = "Frequency") +
  scale_x_continuous(labels=scales::dollar_format())+
  theme(plot.title = element_text(hjust = 0.5))+
  ISAAC.THEME


The distribution of house prices is clearly right-skewed. I felt that it may be worth considering logarithmic transformations of sale price (the dependent variable), especially for linear modelling.


0.1.6 NEIGHBOURHOOD VS SALE PRICE


My hunch was that prices would vary considerably across neighbourhoods, indeed the visualisation featured below clearly indicates that this was the case.

library(ggridges)
ggplot(ames.comp2, aes(x = `SalePrice`, y = Neighborhood, fill = stat(x))) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_viridis_c(name = "Prices", option = "C") +
  labs(title = 'Prices across Neighbourhoods')
## Picking joint bandwidth of 15500




0.2 SUMMARY TABLES/DESCRIPTIVE STATS


I was keen to obtain an overview of some basic descriptive summary statistics, my primary motivation was to identify factor levels containing a low number of entries. Factor levels with a low level of entries be problematic for running many algorithms.


0.2.1 BASEMENT QUALITY


ames.Updated.3%>%group_by(Bsmt.Qual)%>%summarize(n=n())%>%mutate(percent = round(n / sum(n) * 100, 2))%>%kable()%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Bsmt.Qual n percent
Ex 258 8.81
Fa 88 3.00
Gd 1219 41.60
Po 2 0.07
TA 1283 43.79
NA 80 2.73

0.2.2 EXTERNAL QUALITY


ames.Updated.3%>%group_by(Exter.Qual)%>%summarize(n=n())%>%mutate(prop=n/sum(n))%>%kable()%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Exter.Qual n prop
Ex 107 0.0365188
Fa 35 0.0119454
Gd 989 0.3375427
TA 1799 0.6139932

0.2.3 KITCHEN QUALITY


ames.Updated.3%>%group_by(Kitchen.Qual)%>%summarize(n=n())%>%mutate(percent = round(n / sum(n) * 100, 2))%>%knitr::kable(
    format = "html",
    booktabs = TRUE, # For LaTeX
    col.names = c("Quality", "Count", "Percentage"),
    caption = "Kitchen Quality Proportions",
    digits = 1
  ) %>%kable_styling(bootstrap_options = "striped")
Kitchen Quality Proportions
Quality Count Percentage
Ex 205 7.0
Fa 70 2.4
Gd 1160 39.6
Po 1 0.0
TA 1494 51.0


0.2.4 FIREPLACE QUALITY


ames.Updated.3%>%group_by(Fireplace.Qu)%>%summarize(n=n())%>%mutate(percent = round(n / sum(n) * 100, 2))%>%knitr::kable(
    format = "html",
    booktabs = TRUE, # For LaTeX
    col.names = c("Quality", "Count", "Percentage"),
    caption = "Distribution of Fireplace Quality",
    digits = 1
  ) %>%kable_styling(bootstrap_options = "striped")
Distribution of Fireplace Quality
Quality Count Percentage
Ex 43 1.5
Fa 75 2.6
Gd 744 25.4
None 1422 48.5
Po 46 1.6
TA 600 20.5


0.2.5 SALE PRICE SUMMARIES


library(kableExtra)
ames.Updated.3%>%
   summarise(average = mean(SalePrice), median=median(SalePrice), minimum = min(SalePrice), maximum=max(SalePrice), standard.deviation=sd(SalePrice)) %>% 
    kable()%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
average median minimum maximum standard.deviation
180796.1 160000 12789 755000 79886.69


0.2.6 ESSENTIAL RECODING


A number of the variables in this dataset were of the ordinal type; I decided to recode these as numeric variables, primarily to ensure that they would be suitable for the modelling process.

EXTERNAL QUALITY


ames.Updated.3$Exter.Qual <- as.numeric(factor(ames.Updated.3$Exter.Qual, 
                                  levels = c("Fa","TA", "Gd", "Ex"),
                                  labels = c(1,2,3,4) ,ordered = TRUE))


BASEMENT QUALITY


ames.Updated.3$Bsmt.Qual <- as.numeric(factor(ames.Updated.3$Bsmt.Qual, 
levels = c("Po","Fa", "TA", "Gd", "Ex"), labels = c(1,2,3,4,5) ,ordered = TRUE))


KITCHEN QUALITY


When I was perusing through the distribution of levels, I realized that there is only one entry for a classification of Poor(Po), therefore I shall assimilate this into the nearest rank of Fair(Fa). :

ames.Updated.3[ames.Updated.3$Kitchen.Qual=="Po",]$Kitchen.Qual <- "Fa"
ames.Updated.3$Kitchen.Qual <- as.numeric(factor(ames.Updated.3$Kitchen.Qual, 
                                  levels = c("Fa","TA", "Gd", "Ex"),
                                  labels = c(1,2,3,4) ,ordered = TRUE))


FIREPLACE QUALITY


ames.Updated.3$Fireplace.Qu <- as.numeric(factor(ames.Updated.3$Fireplace.Qu, 
                                  levels = c("None","Po","Fa", "TA", "Gd", "Ex"),
                                  labels = c(1,2,3,4,5,6) ,ordered = TRUE))


Now that I had introduced some new numeric variables into the dataset, I decided to examine their correlation with sale price.

ames.ordinal<- ames.Updated.3[c('Exter.Qual','Fireplace.Qu','Bsmt.Qual',  'Kitchen.Qual','SalePrice')]
ames.ordinal%>%drop_na()%>%corr_var(SalePrice, max_pvalue = 0.05)


0.2.7 REDUCED DATASET


At this stage, I decided to reduce the evolving dataset considerably by selecting the twenty exploratory variables that had thus far indicated clear predictive promise. I made the decision to include the top ten variables with the highest correlation with Sale Price, in addition to the top ten most important features extracted from the random forest algorithm that I ran earlier using the factor variables to predict house prices. The model was entitled Influential.Factors. I extracted the relevant subset from ames.Updated.3.

ames.parsimonious6<-ames.Updated.3%>%select(Overall.Qual,Overall.AREA,Garage.Cars, Garage.Area, Bathroom.Tally, Mas.Vnr.Area, Yrs.since.Remod,TotRms.AbvGrd,Fireplaces,BsmtFin.SF.1,Bsmt.Qual,Foundation,Exter.Qual,Kitchen.Qual,Neighborhood,Fireplace.Qu, Garage.Type, Garage.Finish, MS.SubClass, Exterior.1st,SalePrice)


0.2.8 ENSURE FACTOR VARIABLES ARE SUITABLE FOR MODELLING:


I soon realized that I still needed to do a little more preprocessing before the data was ready for the modelling process. A number of the independent variables had levels with few entries. I decided to recruit the Forcats library in order to relevel levels with fewer than 50 entries into a new category simply labelled ‘other’. I initially attempted to run a number of the models without including this step, however a number of the models failed to function as intended.

library(forcats)

ames.parsimonious6 <- ames.parsimonious6 %>%mutate(Neighborhood = fct_lump_min(Neighborhood, min=50, other_level = "other"))

ames.parsimonious6 <- ames.parsimonious6 %>%mutate(MS.SubClass = fct_lump_min(MS.SubClass, min=50, other_level = "other"))

ames.parsimonious6 <- ames.parsimonious6 %>%mutate(Exterior.1st = fct_lump_min(Exterior.1st, min=50, other_level = "other"))

ames.parsimonious6 <- ames.parsimonious6 %>%mutate(Garage.Type = fct_lump_min(Garage.Type, min=50, other_level = "other"))


In the basement quality variable, only two of the entries were marked as Poor(Po); I decided to change these entries to Fair(Fa). This would increase the likelihood of algorithms running smoothly.

#convert few entries marked '1' as '2'

ames.parsimonious6$Bsmt.Qual[ames.parsimonious6$Bsmt.Qual==1]<-2


0.2.9 RECRUITING TIDYMODELS


It was time to begin the modelling process. Firstly, I needed to divide the primary dataset into a training and testing dataset.


library(doParallel)
## Loading required package: parallel
# doParallel
cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cores = cl)
#split data into training/testing sets
set.seed(174)
data_split <- initial_split(ames.parsimonious6, strata = "SalePrice", prop = 0.75)

ames_train <- training(data_split)
ames_test  <- testing(data_split)


I decided to recruit the recipes-package for some final preprocessing of the evolving dataset. A recipe allows one to craft a sequence of preprocessing operations on specified datasets. Thereafter, the recipe can be independently applied to other sets, including validation sets. One of the primary benefits of using the tidymodels/recipe framework is that it allows a data analyst to easily avoid the leakage of data between sets. A classic mistake can occur when certain preprocessing steps like centering and scaling are applied to the entire dataset before it is split into train/test or cross-validation folds. It is important to apply certain data preparation steps after splitting the data into respective train/test/folds in order to avoid biasing predictive performance. An example of this form of data leakage would occur if the analyst attempted to normalize columns of a dataset before splitting it into train/test subsets. If the analyst decided to apply min-max normalization, the process would entail calculating the global maximum(s)/minimum(s) in order to apply to every calculation. If this was done before splitting, the resultant training would have some insight into the distribution of the test data set. This type of issue can also arise with various forms of imputation.

In the following recipe I imputed missing data with K-nearest neighbours(KNN)-this algorithm, using Euclidean distance, recruits the k nearest samples in the sample space and imputes the mean of the specified neighbours. In addition, I converted all categorical variables to dummy variables in order to make them machine learning friendly.

model_rec <- recipe(
  SalePrice ~ .,
  data= ames_train)%>%step_dummy(all_nominal())%>%step_knnimpute(all_predictors(), neighbors = 10)%>%prep(training=ames_train, retain=TRUE, verbose=TRUE)
## oper 1 step dummy [training] 
## oper 2 step knnimpute [training] 
## The retained training set is ~ 1.01 Mb  in memory.

Here, I applied the recipe model_rec to the ames_train data

trainSet.prep<-bake(model_rec,new_data=ames_train, composition='matrix')
trainSet = as.data.frame((trainSet.prep))

Similarly, I used the function bake to apply the same recipe to the ames_test set.

testSet.prep<-bake(model_rec,new_data=ames_test, composition='matrix')
testSet = as.data.frame((testSet.prep))

0.2.10 LINEAR MODELLING


I decided to run a simple linear model to begin with. Linear regression is a good place to start testing models. Having visually examined the scatter plots of the independent variables against the dependent variable in the visualisation section, my hunch was that linear modelling would prove to be an inappropriate method. The scatterplots suggested that some of the continuous variables appeared to be violating one of the cardinal assumptions of linear regression- Linearity! This assumption posits that the dependent variable has a roughly linear relationship with each of the covariates.

Here, I constructed a simple linear model using all of the variables in the model.


```r
set.seed(741)
Linear.model<-lm(SalePrice~., data=trainSet)

I wanted to test the linear model on the test set.

predictions <- predict(Linear.model, newdata = testSet)


I calculated the residuals using the formula residuals = actual - prediction.

residuals <- testSet$SalePrice - predictions


I was keen to find out how well the model performed. Here, I simply wanted to see the root-mean-square error(RMSE).


sqrt(mean(Linear.model$residuals^2))
## [1] 30651.1

0.2.11 ASSESSMENT OF RESIDUALS


The adage “A picture is worth a thousand words” is an apt expression when it comes assessing the residuals of linear models.

par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(Linear.model)

The linear model appeared to be poorly predicting prices in the upper price ranges: the residuals versus fitted plot indicates increased variance as the price ascends. The quantile-quantile(Q-Q) plot was particularly revealing: the residuals are overdispersed. The distribution is heavier tailed than a normal distribution: this generally signifies that there are more extreme values than would be expected if the underlying distribution was normal. The p-values and confidence intervals will be potentially biased and too narrow, which may result in an increase in type I errors (claiming that x affects y, when it does not). I decided to seek some corroboration via some trusted allies: the Shapiro, Breusch-Pagan and Anderson-Darling tests were all brought to the fore.


SHAPIRO TEST


I commenced my investigation with the Shapiro test. I wanted to find out whether the residuals/errors were normally distributed. Whilst it is not a prerequisite that the residuals are normally distributed for ordinary least squares to generate unbiased estimates of the underlying population parameters, meeting this condition facilitates the production of trustworthy prediction and confidence intervals.

p_val <- shapiro.test(Linear.model$residuals)$p.value;
if(p_val>=0.05) {
  print("The data is likely to be normally distributed")
} else {
  print("Given that p<0.05, the data does not appear to be normally distributed. The calculated
coeffient estimates cannot be trusted")
}
## [1] "Given that p<0.05, the data does not appear to be normally distributed. The calculated\ncoeffient estimates cannot be trusted"

The null hypothesis of the shapiro test is that the observed data is normally distributed. As p<0.05, in keeping with custom, I concluded that the residuals were not normally distributed.


BREUSCH-PAGAN TEST


I decided to perform a homoscedastic test. The Breusch-Pagan test evaluates the homoscedasticity of the residuals. Linear regression is particularly suitable when the error/residuals are homoscedastic across all the predicted values of the dependent variable. This type of assessment gauges whether a regression model’s ability to predict the dependent variable is consistent across all ranges of the respective variable.


library(lmtest)
bptest(Linear.model)
## 
##  studentized Breusch-Pagan test
## 
## data:  Linear.model
## BP = 879.11, df = 60, p-value < 0.00000000000000022
if(p_val>=0.05) {
  print("The data can be considered homoscesdastic")
} else {
  print("Given that p<0.05,  heteroskedasticity appears to be present.")
}
## [1] "Given that p<0.05,  heteroskedasticity appears to be present."


ANDERSON DARLING TEST


# Anderson-Darling normality test
library(nortest)
ad.test(Linear.model$residuals)
## 
##  Anderson-Darling normality test
## 
## data:  Linear.model$residuals
## A = 64.433, p-value < 0.00000000000000022
if(p_val>=0.05) {
  print("The data can be considered to be normally distributed")
} else {
  print("The result suggests that the residuals are NOT normally distibuted ")
}
## [1] "The result suggests that the residuals are NOT normally distibuted "

Non-normality and heteroscedasticity were clearly present in the distribution of the residuals. If heteroscedasticity is present and is the only principal assumption of linear regression that is not met, this does not bias the regression coefficients. However, it does bias the standard errors and test-statistics. Unfortunately, this can produce inaccurate standard errors that result in confidence intervals that cannot be relied on.


OUTLIERS


I decided to look a little further into the presence of the outliers that were showing up in the diagnostic plots. Outliers can be found by inspecting the standardized residuals. Standardized residuals indicate the number of standard errors removed from the regression line. It is customary to consider standardized residuals exceeding 3 as possible outliers.

Whilst not all outliers are deemed to be influential, they are generally worth exploring further. Cook’s distance is a metric that seeks to capture the influence of a value: it is governed by a mixture of leverage and residual size.

plot(Linear.model, 4, id.n = 6)

plot(Linear.model, 5, id.n = 6)

extra.metrics <- augment(Linear.model)
# Add observations indices and
# drop some columns (.se.fit, .sigma) for simplification
extra.metrics <- extra.metrics %>%
  mutate(index = 1:nrow(extra.metrics))
extra.metrics %>%
  top_n(6, wt = .cooksd)
## # A tibble: 6 x 69
##   SalePrice Overall.Qual Overall.AREA Garage.Cars Garage.Area Bathroom.Tally
##       <dbl>        <dbl>        <dbl>       <dbl>       <dbl>          <dbl>
## 1    160000           10        11752           2        1418            4.5
## 2    745000           10         6872           3         813            4.5
## 3    755000           10         6760           3         832            4  
## 4    183850           10        10190           3        1154            4  
## 5    184750           10         7814           3         884            4.5
## 6    475000           10         4715           3         840            2.5
## # ... with 63 more variables: Mas.Vnr.Area <dbl>, Yrs.since.Remod <dbl>,
## #   TotRms.AbvGrd <dbl>, Fireplaces <dbl>, BsmtFin.SF.1 <dbl>, Bsmt.Qual <dbl>,
## #   Exter.Qual <dbl>, Kitchen.Qual <dbl>, Fireplace.Qu <dbl>,
## #   Foundation_CBlock <dbl>, Foundation_PConc <dbl>, Foundation_Slab <dbl>,
## #   Foundation_Stone <dbl>, Foundation_Wood <dbl>, Neighborhood_CollgCr <dbl>,
## #   Neighborhood_Crawfor <dbl>, Neighborhood_Edwards <dbl>,
## #   Neighborhood_Gilbert <dbl>, Neighborhood_IDOTRR <dbl>,
## #   Neighborhood_Mitchel <dbl>, Neighborhood_NAmes <dbl>,
## #   Neighborhood_NoRidge <dbl>, Neighborhood_NridgHt <dbl>,
## #   Neighborhood_NWAmes <dbl>, Neighborhood_OldTown <dbl>,
## #   Neighborhood_Sawyer <dbl>, Neighborhood_SawyerW <dbl>,
## #   Neighborhood_Somerst <dbl>, Neighborhood_StoneBr <dbl>,
## #   Neighborhood_Timber <dbl>, Neighborhood_other <dbl>,
## #   Garage.Type_BuiltIn <dbl>, Garage.Type_Detchd <dbl>,
## #   Garage.Type_None <dbl>, Garage.Type_other <dbl>, Garage.Finish_None <dbl>,
## #   Garage.Finish_RFn <dbl>, Garage.Finish_Unf <dbl>, MS.SubClass_B <dbl>,
## #   MS.SubClass_E <dbl>, MS.SubClass_F <dbl>, MS.SubClass_G <dbl>,
## #   MS.SubClass_I <dbl>, MS.SubClass_K <dbl>, MS.SubClass_L <dbl>,
## #   MS.SubClass_N <dbl>, MS.SubClass_P <dbl>, MS.SubClass_other <dbl>,
## #   Exterior.1st_CemntBd <dbl>, Exterior.1st_HdBoard <dbl>,
## #   Exterior.1st_MetalSd <dbl>, Exterior.1st_Plywood <dbl>,
## #   Exterior.1st_VinylSd <dbl>, Exterior.1st_Wd.Sdng <dbl>,
## #   Exterior.1st_WdShing <dbl>, Exterior.1st_other <dbl>, .fitted <dbl>,
## #   .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>,
## #   index <int>

It was noteworthy that the top five outliers with high Cook’s distance scores shared a common Overall.Quality score of 10. I decided that I would consider revisiting these observations at a later stage.


PREDICTED/ACTUAL PRICES


I wanted to visualise the consistency of the predictions across the sample space.

library(auditor)
lm_audit <- audit(Linear.model, data = trainSet, y =trainSet$SalePrice)
## Preparation of a new explainer is initiated
##   -> model label       :  lm  (  default  )
##   -> data              :  2199  rows  62  cols 
##   -> target variable   :  2199  values 
##   -> predict function  :  yhat.lm  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package stats , ver. 4.0.4 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  7432.059 , mean =  181401.5 , max =  572141.6  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -412141.6 , mean =  -0.000000001694649 , max =  275067.8  
##   A new explainer has been created! 
# validate a model with auditor
mr_lm <- model_residual(lm_audit)

# plot results
plot_prediction(mr_lm, abline = TRUE)


Here, it was evident that as price increases, the model’s predictive accuracy diminishes.

I was keen to explore the performance of a variety of algorithms that are known to work well with non-linear data. However, before I abandoned my exploration of linear regression, I wanted to find out what variables the linear model suggested were the most important in determining sale price.

library(vip)
vip(Linear.model)


0.2.12 EXPERIMENTING WITH VARIOUS MODELS


I decided to experiment with a variety of algorithms. I simply wanted to get a sense of what algorithm(s) may be well-suited to modelling Sale Price in relation to the variables I had at my disposal.


library(parallel)      
library(doParallel)

Mycluster =makeCluster(detectCores()-1)
registerDoParallel(Mycluster)

myControl = trainControl(method = 'cv', number = 10, 
verboseIter = FALSE, savePredictions = TRUE,allowParallel = T)



set.seed(174)
Linear.Model = train(SalePrice ~., data = trainSet, metric = 'RMSE', method = 'lm',preProcess = c('center', 'scale'),trControl = myControl)


set.seed(174)
Glmnet.Model = train(SalePrice ~ ., data = trainSet , metric = 'RMSE', method = 'glmnet',preProcess = c('center', 'scale'), trControl = myControl)


set.seed(174)
Rapid.Ranger = train(SalePrice ~ ., data = trainSet, metric = 'RMSE', method = 'ranger',preProcess = c('center', 'scale'),trControl = myControl)


set.seed(174)
Basic.Knn <- train(SalePrice ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k =1:3), trControl  = myControl, metric= "RMSE", data = trainSet)


set.seed(174)
Xgb.Super <- train(SalePrice~.,method = "xgbTree", tuneLength = 4,trControl = myControl,metric= "RMSE", data = trainSet)
## [17:45:46] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.


suite.of.models = list("LINEAR.MODEL"=Linear.Model,"GLMNET.MODEL"=Glmnet.Model, "RANGER.QUEST"= Rapid.Ranger, "KNN.SIMPLE" = Basic.Knn, "XGB.SUPER"= Xgb.Super)
resamps = resamples(suite.of.models) 
dotplot(resamps, metric = 'RMSE')


0.2.12.1 TESTING MODELS ON TEST SET


Evaluate.Prediction <- function(model, model.label, testData, ytest, grid = NULL) {
 
  #capture prediction time
  ptm <- proc.time()
  # use test data to make predictions
  pred <- predict(model, testData)
  tm <- proc.time() - ptm
  
  Pred.metric<- postResample(pred = pred, obs = ytest)
  RMSE.test <- c(Pred.metric[[1]])
  RSquared.test <- c(Pred.metric[[2]])
  MAE.test <- c(Pred.metric[[3]])
  
  
  Summarised.results = NULL
  if (is.null(grid)) { 
    Summarised.results = data.frame(predictor = c(model.label) ,  RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]))
  } else {
    .grid = data.frame(predictor = c(model.label) , RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]))
    Summarised.results = rbind(grid, .grid)}
  
  
  Summarised.results }


METRIC.GRID <- Evaluate.Prediction (Rapid.Ranger, "RAPID.QUEST", testSet, testSet$SalePrice, grid=NULL)

METRIC.GRID <- Evaluate.Prediction (Glmnet.Model, "GLMNET.MODEL", testSet, testSet$SalePrice, grid=METRIC.GRID)

METRIC.GRID <- Evaluate.Prediction (Basic.Knn, "KNN.SIMPLE", testSet, testSet$SalePrice, grid=METRIC.GRID)

METRIC.GRID <- Evaluate.Prediction (Linear.Model, "LINEAR.MODEL", testSet, testSet$SalePrice, grid=METRIC.GRID)

METRIC.GRID <- Evaluate.Prediction (Xgb.Super, "XGB.SUPER", testSet, testSet$SalePrice, grid=METRIC.GRID)


kable(METRIC.GRID[order(METRIC.GRID$RMSE, decreasing=F),])%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
predictor RMSE RSquared MAE time
1 RAPID.QUEST 22946.09 0.9042345 14757.51 0.17
5 XGB.SUPER 23845.53 0.8992469 15554.32 0.16
2 GLMNET.MODEL 24783.40 0.8883612 17368.92 0.06
4 LINEAR.MODEL 24826.47 0.8883348 17432.18 0.03
3 KNN.SIMPLE 37875.20 0.7548226 26322.29 0.19


The random forest (Rapid.Ranger) model and extreme gradient boosting (XGB.SUPER) models showed the most promise with the lowest average predictive errors (RMSE). I decided to explore these models further.


0.2.13 TUNING A RANDOM FOREST MODEL USING TIDYMODELS FRAME WORK


I decided to construct the random forest model using the R’s Tidymodels framework. This is a superb package that beautifully streamlines the predictive modelling process.

To divide the available data, I recruited the initial_split() function: it is designed to split a data set into a training and testing set. In the default stance, 0.75 of the data is allocated to the training set with the remainder being reserved for testing. The default setting can be modified by changing the prop argument. Instead of a data frame, this function produces an rsplit object.

set.seed(741)
ames_split <- initial_split(ames.parsimonious6, prop = 0.80, strata = SalePrice)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)


I opted to perform cross validation on the train set. Cross-validation, also known as rotation estimation, is a validation strategy used for evaluating how well the results of a statistical analysis are likely to generalize to an unseen dataset. Many data analysts employ k-fold cross-validation by employing k=5 or k=10, these numbers have demonstrated an ideal balance between minimising bias and variance.

set.seed(174)
random.f_cv <- vfold_cv(ames_train,v=5)


The following object simple_rec is a recipe/sequence of instructions that can been applied to datasets. The specified operations will be applied to the train and test sets that are used throughout this particular exploration of the random forest algorithm

simple_rec <-
recipe(SalePrice ~ .,
  data= ames_train)%>%
  step_dummy(all_nominal())%>%step_knnimpute(all_predictors(), neighbors=10)


In the code chunk specified below, the argument engine indicates what machine learning algorithm will be used to fit the model. Tidymodels is a wrapper for various machine learning packages. This is simply a specification, no data is being fed to the algorithm at this stage .


```r
ranger_model <- rand_forest(
    trees = tune(),
    mtry = tune()) %>%
  set_engine("ranger", importance = "permutation") %>% 
  set_mode("regression")


Here, I recruited a workflow: this is a container that integrates a preprocessing recipe and a model specification.

random.f_workflow <- workflow()%>% add_recipe(simple_rec)%>% 
add_model(ranger_model)


In the following code chunk, I specified a simple grid to be used in the quest for optimal parameters. I used the following grid numerous times with varying values. However, as I progressively experimented with varying values, I did not re-run the previously tried values in order to minimise computational cost. Thus, the following grid is a curtailed version of the values I have experimented with.

random.f_grid <- expand_grid(mtry = 33:36, trees = seq(1400, 1700, by = 100))


It was time to initialise the exploration of the specified hyperparameter space.

tidy_grid_results <- random.f_workflow %>% tune_grid(
    resamples = random.f_cv,
    grid = random.f_grid)

0.2.14 COLLECT METRICS OF RANGER MODEL


Here, I visualised the movement of RMSE throughout the search

autoplot(tidy_grid_results, metric = "rmse")

I extracted the parameters that generated the lowest RMSE

param_final <- tidy_grid_results %>%select_best(metric = "rmse")

It was time to update the workflow: I added the optimised parameters to the workflow using the finalize_workflow() function.

random.f_workflow <- random.f_workflow %>%
finalize_workflow(param_final)

It was time to finally fit the model

rf_fit <- random.f_workflow %>%
fit(ames_train)

I decided to find out which features demonstrated the most importance according to the random forest model

library(vip)
rf_fit %>% 
  pull_workflow_fit() %>% 
  vip::vip()


0.2.15 PERFORMANCE ON TEST SET:


It was time to test the fitted model on the validation/test set. The predict function` in the tidymodels framework applies the same preprocessing operations that were defined in the recipe and previously applied to the training sets (cross-validation splits). However, this time these operations are applied to the test set before the newly preprocessed data is passed to the predict function.

predict(rf_fit, new_data = ames_test)
## # A tibble: 584 x 1
##      .pred
##      <dbl>
##  1 123570.
##  2 176195.
##  3 175797.
##  4 179798.
##  5 152288.
##  6 122690.
##  7 356318.
##  8 236323.
##  9 313051.
## 10 164239.
## # ... with 574 more rows


Below, I plotted the predicted versus observed values produced by the random forest model

ranger_pred <- predict(rf_fit, new_data = ames_test %>% select(-SalePrice))
ranger_pred <- bind_cols(ranger_pred, ames_test %>% select(SalePrice))


ggplot(ranger_pred, aes(x = SalePrice, y = .pred)) + 
  # Create a diagonal line:
  geom_abline(lty = 2) + 
  geom_point(alpha = 0.5) + 
  labs(y = "Predicted Sale Price", x = "SalePrice") +
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()


0.2.16 TEST RESULTS:


I was now in a position to trial the constructed model on the test set.

ames_metrics <- metric_set(yardstick::rmse, yardstick:: mae)

ames_metrics(ranger_pred, truth = SalePrice, estimate = .pred)
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      25416.
## 2 mae     standard      15981.


0.3 XGBOOST MODELLING:


It was time to construct a model using the extreme gradient boosting algorithm.

Again, I started by dividing the dataset into train/test sets

# divide dataset, stratify by SalePrice
set.seed(174)
ames_split.2 <-initial_split(ames.parsimonious6, prop = 0.75, strata = SalePrice)
ames_train2 <- training(ames_split.2)
ames_test2  <- testing(ames_split.2)


Again, I called a recipe to perform some final preprocessing.

XGB_rec <-
recipe(SalePrice ~ .,
  data= ames_train2)%>%
  step_dummy(all_nominal())%>%step_knnimpute(all_predictors(), neighbors=10)%>%prep()


Here, I applied cross-validation to randomly divide the training data into more training and test subsets.

set.seed(174)
cv_folds <-recipes::bake(
    XGB_rec, 
    new_data = ames_train2)%>%  
  rsample::vfold_cv(v = 5)


In the following code block, I transformed the train/tests using the recipe (XGB_rec). I did not include this step when working with the previous random forest model. This step would not have been necessary if I had not gone on to construct an explainer with modelStudio (featured at the end of this document).

train.ready<-juice(XGB_rec)
test.ready<-bake(XGB_rec, new_data = ames_test2)


Here, I defined the XGBoost model specification, alongside the hyperparameters to be tuned

Model.XGB <- 
  boost_tree(
    mode = "regression",
    trees = 1000,
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune()
  ) %>%set_engine("xgboost", objective = "reg:squarederror")


I employed the tidymodels dials package to specify the parameters.

# grid specification 
XGB.aspects <- 
  dials::parameters(
    min_n(),
    tree_depth(),
    learn_rate(),
    loss_reduction())


At this stage, I set up the grid space of exploration by recruiting the dials::grid_max_entropy() grid function which covers the specified hyperparameter space.

xgboost_grid <- 
dials::grid_max_entropy(
XGB.aspects, size = 200)
knitr::kable(head(xgboost_grid))
min_n tree_depth learn_rate loss_reduction
40 7 0.0000001 0.0077783
37 4 0.0000537 0.0009884
10 13 0.0000000 0.0000000
17 3 0.0000046 17.8336015
37 3 0.0037961 0.0000094
16 7 0.0013587 0.0000000


At this stage, I recruited the tidymodels workflows package in order to apply a basic formula to the XGBoost model specified above.

xgboost_wf <- 
workflows::workflow() %>%
add_model(Model.XGB) %>% 
add_formula(SalePrice ~ .)


I used the following tune_grid() to specify the search space for optimal hyperparameters

# hyperparameter tuning
TUNE.XGB <- tune::tune_grid(
  object = xgboost_wf,
  resamples = cv_folds,
  grid = xgboost_grid,
  metrics = yardstick::metric_set(yardstick::rmse, yardstick::rsq, yardstick::rsq_trad, yardstick::mae),
  control = tune::control_grid(verbose = FALSE)) 


Here, I extracted the parameters that yielded the minimal “rmse”

param_final <- TUNE.XGB %>%select_best(metric = "rmse")


I finalized the XGBoost model to use the tuning parameters that produced the lowest RMSE.

xgboost_wf2 <- xgboost_wf%>%
finalize_workflow(param_final)


It was time to fit the final extreme gradient boosting model

XGB.model <- xgboost_wf2 %>%
fit(train.ready)


Here, I extracted influential variables

library(vip)
XGB.model %>% 
  pull_workflow_fit() %>% 
  vip::vip()


XGB PREDICT ON TEST SET:

I was now ready to make predictions on the unseen test set. Thereafter, I wanted to inspect the consistency of the predictions produced by the XGB.model:


XGB_res <- predict(XGB.model, new_data = test.ready %>% select(-SalePrice))

XGB_res <- bind_cols(XGB_res, test.ready %>% select(SalePrice))

XGB_metrics <- metric_set(yardstick::rmse, yardstick:: mae)

XGB_metrics(XGB_res, truth = SalePrice, estimate = .pred)
## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      23724.
## 2 mae     standard      15045.
ggplot(XGB_res, aes(x = SalePrice, y = .pred)) + 
    # Create a diagonal line:
    geom_abline(lty = 2) + 
    geom_point(alpha = 0.5) + 
    labs(y = "Predicted Sale Price", x = "SalePrice") +
    # Scale and size the x- and y-axis uniformly:
    coord_obs_pred()


0.3.1 EXPLAINER:


Finally, I constructed an explainer using the fitted extreme gradient boosting model. The explainer provides detailed descriptions of what direction particular features are likely to be impacting the prices of particular houses. In the following interactive display of boxes, explanations for the first two houses in the test set are offered . For a thorough explanation of the types of plots used below, I would recommend the curious reader to consult this excellent resource.


explainer<- explain_tidymodels(XGB.model,
                                data = test.ready,
                                y = test.ready$SalePrice,
                                label = "tidymodels")
## Preparation of a new explainer is initiated
##   -> model label       :  tidymodels 
##   -> data              :  731  rows  62  cols 
##   -> data              :  tibble converted into a data.frame 
##   -> target variable   :  731  values 
##   -> predict function  :  yhat.workflow  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package tidymodels , ver. 0.1.2 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  56540.02 , mean =  177666.1 , max =  524786.9  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -193005 , mean =  1308.632 , max =  168033.1  
##   A new explainer has been created! 
new_obs <- test.ready[1:2,]
rownames(new_obs) <- c("case 1", "case 2")
modelStudio(explainer, new_obs)