AMES SIMPLIFIED

0.0.1 BACKGROUND

In order to put my newly acquired data science skills to the test, I decided to the explore the well-known Ames Housing dataset. This particular dataset documents 2006-2010 actual house prices in the City of Ames, Iowa. The dataset details the price of houses and is accompanied by a broad range of variables. The type of variables contained in this dataset primarily pertain to the quality, dimensions and quantity of physical characteristics associated with the respective properties. Generally, these covariates are the sort of features that prospective buyers are likely to be interested in, features such as the age of the property, the quality of the exterior and how many bathrooms the property holds. I refer the curious reader to the following article, where an in-depth rundown of the various features can be obtained.

0.0.2 STRUCTURE AND AIMS OF ANALYSIS

EDA:EXPLORATORY DATA ANALYSIS

Investigate missing data
Clean respective data
Feature selection
Ensure data is suitable for modelling
- numeric conversions
- decluttering categories

MODELLING

Survey range of algorithms
Tune parameters/hyperparameters of promising algorithms
Construct explainer in order to derive key insights
- global explanations
- local explanations
Convert optimised model to a functional application for the use of front-end users/Shiny application (this stage of the process is not included in this document).

0.0.3 KEY INSIGHTS

Employing an alpha level of 0.05, Overall Quality (0.86), Overall area (78.8), External Quality (69.4) and the number of garage cars (0.66) turned out to have the highest significant correlations with Sale Price. The age of the house (-0.56) and time since remodelled (-0.53) had the highest significant negative correlations with Sale Price. Pearson’s correlation coefficient (r) was employed- this metric quantifies linear association. Of course, Pearson’s correlation is unsuitable for adequately capturing curvilinear relationships between variables. The house prices ranged from $12,789 to $755,000, with the average price equating to $ 180,796 (standard deviation= $79,886.69).

I employed a range of predictive algorithms in order to discover an algorithm that would yield optimally low residual error. Of the algorithms tested, Extreme Gradient Boosting and Random Forest algorithms demonstrated the best predictive performances. I was able to obtain a mean absolute error (MAE) of $15,045 using extreme gradient boosting on unseen data (segregated validation set) and a MAE of $15,981 by employing the random forest algorithm. In terms of the features with the most important global/overall importance, both the random forest model and the extreme gradient boosting model agreed that Overall Area (of the interior house) and Overall Quality were the top two primary drivers responsible for estimating Sale Price. Additionally, I ran a classic multiple linear regression: the most important features for prediction turned out to be Overall Quality in first place, with Overall Area being the fourth most important feature. Thus, given the consensus between these differing algorithms, our confidence increases that these variables are indeed important predictors for determining sale price. As sale price increased, all of the models tested struggled to consistently and accurately predict the respective house prices. One potential hypothesis - admittedly, a tentative one- that could account for the deficit in predictive capacity in these upper price regions, is that such prices are less likely to be shaped by ordinary consumer behaviour. Buyers operating in the higher price region are not constrained by the usual economic push/pull factors; in other words, the liberated movement of human whim is introducing volatility.

AVENUES FOR FURTHER STUDY

Consider experimentation with polynomial terms/log transformations/ splines to improve modelling.
Consider reintroducing some of the discarded variables.

0.0.4 COMMENCE OF THE ANALYSIS

I recruited the following libraries at various stages throughout the modelling process

library(ggplot2)
library(caret)
library(tidyr)
library(DMwR)
library(missForest)
library(randomForest)
library(tidymodels)
library(ranger) 
library(recipes) 
library(workflows) 
library(themis)
library(xgboost)
library(ggdark)
library(viridis)
library(DALEXtra)
library(ggthemes)
library(lares)
library(kableExtra)
library(tidyr)
library(tidyverse)
library(reshape)
library(dplyr)
library(modelStudio)
library(DALEXtra)
library(DALEX)
library(tidymodels)
library(vip)

I decided to define a couple of customised visualisation themes at the outset of the project; I wanted to be able to easily format the visual output of the analysis in a consistent manner. I feel that visual consistency enhances the fluidity of the reading process.

ISAAC.THEME <- theme(plot.title = element_text(hjust = 0.5), #  title centralised
        plot.background = element_rect(fill="black"), #background to be black
        panel.background = element_rect(fill="gray20"), # panel background to be dark grey
        panel.grid.minor = element_line(color="black"), # grid lines to be removed
        panel.grid.major = element_line(color="black"), # grid lines to be removed
        axis.text = element_text(color="white"), # axis text to appear white
        title = element_text(color="white", face="bold"), # title to be white and bold.
        legend.background = element_rect(fill="black"), # legend background appear as black
        legend.text = element_text(color="white", size=20), # legend text to appear white
        legend.key = element_rect(fill="black", color="black"), 
        legend.position = "top")

ISAAC.THEME2 <- theme(plot.title = element_text(hjust = 0.5), # title centralised
        plot.background = element_rect(fill="black"), #  background to be black
        panel.background = element_rect(fill="gray20"), #  panel background to be grey
        panel.grid.minor = element_line(color="black"), #  grid lines to be removed
        panel.grid.major = element_line(color="black"), # grid lines to be removed
        axis.text = element_text(color="white"), # axis text to be white
        title = element_text(color="white", face="bold"), # title to appear white and bold.
        legend.background = element_rect(fill="black"), # legend of background should appear black
        legend.text = element_text(color=NULL, size=12 ), # legend text should appear white
        legend.key = element_rect(fill=NULL), 
        legend.position = 'none')

I had previously saved the dataset in on my computer. I commenced the analysis by loading the respective dataset in its raw form.

load("ames.RData")

In accordance with tradition, I kicked-off the analysis by inspecting the broad structure of the data provided. I wanted to glance the contents of the dataset.

df_str(ames, return = "plot")

## Other available 'return' options: "skimr", "numbers", "names", "distr"

RAW DATA

I wished to obtain more insight into the contents of the data in its unprocessed form; I decided to inspect the first fifteen rows:

Order	PID	MS.SubClass	MS.Zoning	Lot.Frontage	Lot.Area	Street	Alley	Lot.Shape	Land.Contour	Utilities	Lot.Config	Land.Slope	Neighborhood	Condition.1	Condition.2	Bldg.Type	House.Style	Overall.Qual	Overall.Cond	Year.Built	Year.Remod.Add	Roof.Style	Roof.Matl	Exterior.1st	Exterior.2nd	Mas.Vnr.Type	Mas.Vnr.Area	Exter.Qual	Exter.Cond	Foundation	Bsmt.Qual	Bsmt.Cond	Bsmt.Exposure	BsmtFin.Type.1	BsmtFin.SF.1	BsmtFin.Type.2	BsmtFin.SF.2	Bsmt.Unf.SF	Total.Bsmt.SF	Heating	Heating.QC	Central.Air	Electrical	X1st.Flr.SF	X2nd.Flr.SF	Gr.Liv.Area	Bsmt.Full.Bath	Full.Bath	Half.Bath	Bedroom.AbvGr	Kitchen.AbvGr	Kitchen.Qual	TotRms.AbvGrd	Functional	Fireplaces	Fireplace.Qu	Garage.Type	Garage.Yr.Blt	Garage.Finish	Garage.Cars	Garage.Area	Garage.Qual	Garage.Cond	Paved.Drive	Wood.Deck.SF	Open.Porch.SF	Enclosed.Porch	Screen.Porch	Pool.QC	Fence	Misc.Feature	Misc.Val	Mo.Sold	Yr.Sold	Sale.Type	Sale.Condition	SalePrice
1	526301100	20	RL	141	31770	Pave	NA	IR1	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	6	5	1960	1960	Hip	CompShg	BrkFace	Plywood	Stone	112	TA	TA	CBlock	TA	Gd	Gd	BLQ	639	Unf	0	441	1080	GasA	Fa	Y	SBrkr	1656	0	1656	1	1	0	3	1	TA	7	Typ	2	Gd	Attchd	1960	Fin	2	528	TA	TA	P	210	62	0	0	NA	NA	NA	0	5	2010	WD	Normal	215000
2	526350040	20	RH	80	11622	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	NAmes	Feedr	Norm	1Fam	1Story	5	6	1961	1961	Gable	CompShg	VinylSd	VinylSd	None	0	TA	TA	CBlock	TA	TA	No	Rec	468	LwQ	144	270	882	GasA	TA	Y	SBrkr	896	0	896	0	1	0	2	1	TA	5	Typ	0	NA	Attchd	1961	Unf	1	730	TA	TA	Y	140	0	0	120	NA	MnPrv	NA	0	6	2010	WD	Normal	105000
3	526351010	20	RL	81	14267	Pave	NA	IR1	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	6	6	1958	1958	Hip	CompShg	Wd Sdng	Wd Sdng	BrkFace	108	TA	TA	CBlock	TA	TA	No	ALQ	923	Unf	0	406	1329	GasA	TA	Y	SBrkr	1329	0	1329	0	1	1	3	1	Gd	6	Typ	0	NA	Attchd	1958	Unf	1	312	TA	TA	Y	393	36	0	0	NA	NA	Gar2	12500	6	2010	WD	Normal	172000
4	526353030	20	RL	93	11160	Pave	NA	Reg	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	7	5	1968	1968	Hip	CompShg	BrkFace	BrkFace	None	0	Gd	TA	CBlock	TA	TA	No	ALQ	1065	Unf	0	1045	2110	GasA	Ex	Y	SBrkr	2110	0	2110	1	2	1	3	1	Ex	8	Typ	2	TA	Attchd	1968	Fin	2	522	TA	TA	Y	0	0	0	0	NA	NA	NA	0	4	2010	WD	Normal	244000
5	527105010	60	RL	74	13830	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	Gilbert	Norm	Norm	1Fam	2Story	5	5	1997	1998	Gable	CompShg	VinylSd	VinylSd	None	0	TA	TA	PConc	Gd	TA	No	GLQ	791	Unf	0	137	928	GasA	Gd	Y	SBrkr	928	701	1629	0	2	1	3	1	TA	6	Typ	1	TA	Attchd	1997	Fin	2	482	TA	TA	Y	212	34	0	0	NA	MnPrv	NA	0	3	2010	WD	Normal	189900
6	527105030	60	RL	78	9978	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	Gilbert	Norm	Norm	1Fam	2Story	6	6	1998	1998	Gable	CompShg	VinylSd	VinylSd	BrkFace	20	TA	TA	PConc	TA	TA	No	GLQ	602	Unf	0	324	926	GasA	Ex	Y	SBrkr	926	678	1604	0	2	1	3	1	Gd	7	Typ	1	Gd	Attchd	1998	Fin	2	470	TA	TA	Y	360	36	0	0	NA	NA	NA	0	6	2010	WD	Normal	195500
7	527127150	120	RL	41	4920	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	StoneBr	Norm	Norm	TwnhsE	1Story	8	5	2001	2001	Gable	CompShg	CemntBd	CmentBd	None	0	Gd	TA	PConc	Gd	TA	Mn	GLQ	616	Unf	0	722	1338	GasA	Ex	Y	SBrkr	1338	0	1338	1	2	0	2	1	Gd	6	Typ	0	NA	Attchd	2001	Fin	2	582	TA	TA	Y	0	0	170	0	NA	NA	NA	0	4	2010	WD	Normal	213500
8	527145080	120	RL	43	5005	Pave	NA	IR1	HLS	AllPub	Inside	Gtl	StoneBr	Norm	Norm	TwnhsE	1Story	8	5	1992	1992	Gable	CompShg	HdBoard	HdBoard	None	0	Gd	TA	PConc	Gd	TA	No	ALQ	263	Unf	0	1017	1280	GasA	Ex	Y	SBrkr	1280	0	1280	0	2	0	2	1	Gd	5	Typ	0	NA	Attchd	1992	RFn	2	506	TA	TA	Y	0	82	0	144	NA	NA	NA	0	1	2010	WD	Normal	191500
9	527146030	120	RL	39	5389	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	StoneBr	Norm	Norm	TwnhsE	1Story	8	5	1995	1996	Gable	CompShg	CemntBd	CmentBd	None	0	Gd	TA	PConc	Gd	TA	No	GLQ	1180	Unf	0	415	1595	GasA	Ex	Y	SBrkr	1616	0	1616	1	2	0	2	1	Gd	5	Typ	1	TA	Attchd	1995	RFn	2	608	TA	TA	Y	237	152	0	0	NA	NA	NA	0	3	2010	WD	Normal	236500
10	527162130	60	RL	60	7500	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	Gilbert	Norm	Norm	1Fam	2Story	7	5	1999	1999	Gable	CompShg	VinylSd	VinylSd	None	0	TA	TA	PConc	TA	TA	No	Unf	0	Unf	0	994	994	GasA	Gd	Y	SBrkr	1028	776	1804	0	2	1	3	1	Gd	7	Typ	1	TA	Attchd	1999	Fin	2	442	TA	TA	Y	140	60	0	0	NA	NA	NA	0	6	2010	WD	Normal	189000
11	527163010	60	RL	75	10000	Pave	NA	IR1	Lvl	AllPub	Corner	Gtl	Gilbert	Norm	Norm	1Fam	2Story	6	5	1993	1994	Gable	CompShg	HdBoard	HdBoard	None	0	TA	TA	PConc	Gd	TA	No	Unf	0	Unf	0	763	763	GasA	Gd	Y	SBrkr	763	892	1655	0	2	1	3	1	TA	7	Typ	1	TA	Attchd	1993	Fin	2	440	TA	TA	Y	157	84	0	0	NA	NA	NA	0	4	2010	WD	Normal	175900
12	527165230	20	RL	NA	7980	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	Gilbert	Norm	Norm	1Fam	1Story	6	7	1992	2007	Gable	CompShg	HdBoard	HdBoard	None	0	TA	Gd	PConc	Gd	TA	No	ALQ	935	Unf	0	233	1168	GasA	Ex	Y	SBrkr	1187	0	1187	1	2	0	3	1	TA	6	Typ	0	NA	Attchd	1992	Fin	2	420	TA	TA	Y	483	21	0	0	NA	GdPrv	Shed	500	3	2010	WD	Normal	185000
13	527166040	60	RL	63	8402	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	Gilbert	Norm	Norm	1Fam	2Story	6	5	1998	1998	Gable	CompShg	VinylSd	VinylSd	None	0	TA	TA	PConc	Gd	TA	No	Unf	0	Unf	0	789	789	GasA	Gd	Y	SBrkr	789	676	1465	0	2	1	3	1	TA	7	Typ	1	Gd	Attchd	1998	Fin	2	393	TA	TA	Y	0	75	0	0	NA	NA	NA	0	5	2010	WD	Normal	180400
14	527180040	20	RL	85	10176	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	Gilbert	Norm	Norm	1Fam	1Story	7	5	1990	1990	Gable	CompShg	HdBoard	HdBoard	None	0	TA	TA	PConc	Gd	TA	Gd	GLQ	637	Unf	0	663	1300	GasA	Gd	Y	SBrkr	1341	0	1341	1	1	1	2	1	Gd	5	Typ	1	Po	Attchd	1990	Unf	2	506	TA	TA	Y	192	0	0	0	NA	NA	NA	0	2	2010	WD	Normal	171500
15	527182190	120	RL	NA	6820	Pave	NA	IR1	Lvl	AllPub	Corner	Gtl	StoneBr	Norm	Norm	TwnhsE	1Story	8	5	1985	1985	Gable	CompShg	HdBoard	HdBoard	None	0	Gd	TA	PConc	Gd	TA	Av	GLQ	368	BLQ	1120	0	1488	GasA	TA	Y	SBrkr	1502	0	1502	1	1	1	1	1	Gd	4	Typ	0	NA	Attchd	1985	RFn	2	528	TA	TA	Y	0	54	0	140	NA	NA	NA	0	6	2010	WD	Normal	212000

CLEARER OVERVIEW USING PSYCH

I recruited the psych package to obtain summary statistics regarding the variability of the data, including the kurtosis and skew of the respective data.

psych::describe(ames) %>% 
    kable()  %>% kable_styling(bootstrap_options = "striped") %>%
    scroll_box(width = "100%", height = "300px")

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Order	1	2930	1465.5000000	845.9624696	1465.5	1465.5000000	1086.0045	1	2930	2929	0.0000000	-1.2012287	15.6284996
PID	2	2930	714464496.9887372	188730844.6493901	535453620.0	712423200.4014505	12373193.9730	526301100	1007100110	480799010	0.0558287	-1.9944772	3486655.7767905
MS.SubClass	3	2930	57.3873720	42.6380246	50.0	49.7632253	44.4780	20	190	170	1.3561897	1.3793717	0.7877044
MS.Zoning*	4	2930	5.9672355	0.8656517	6.0	6.0720990	0.0000	1	7	6	-2.6139550	8.4051820	0.0159922
Lot.Frontage	5	2440	69.2245902	23.3653350	68.0	68.3524590	17.7912	21	313	292	1.4972247	11.1977390	0.4730174
Lot.Area	6	2930	10147.9218430	7880.0177594	9436.5	9481.0499147	3024.5040	1300	215245	213945	12.8077740	264.3869706	145.5772081
Street*	7	2930	1.9959044	0.0638763	2.0	2.0000000	0.0000	1	2	1	-15.5217245	239.0055030	0.0011801
Alley*	8	198	1.3939394	0.4898603	1.0	1.3687500	0.0000	1	2	1	0.4308369	-1.8235128	0.0348129
Lot.Shape*	9	2930	2.9402730	1.4121054	4.0	3.0503413	0.0000	1	4	3	-0.6062653	-1.6026962	0.0260876
Land.Contour*	10	2930	3.7778157	0.7031991	4.0	3.9982935	0.0000	1	4	3	-3.1226507	8.4362234	0.0129911
Utilities*	11	2930	1.0017065	0.0554058	1.0	1.0000000	0.0000	1	3	2	34.0201767	1187.9571398	0.0010236
Lot.Config*	12	2930	4.0552901	1.6039218	5.0	4.3191126	0.0000	1	5	4	-1.1937666	-0.4447109	0.0296312
Land.Slope*	13	2930	1.0535836	0.2483042	1.0	1.0000000	0.0000	1	3	2	4.9830595	26.6246706	0.0045872
Neighborhood*	14	2930	15.2952218	7.0220750	16.0	15.3993174	8.8956	1	28	27	-0.1960192	-1.1853548	0.1297274
Condition.1*	15	2930	3.0402730	0.8724077	3.0	3.0000000	0.0000	1	9	8	2.9876854	15.7385319	0.0161171
Condition.2*	16	2930	3.0020478	0.2090376	3.0	3.0000000	0.0000	1	8	7	12.0767140	308.9729348	0.0038618
Bldg.Type*	17	2930	1.5170648	1.2188973	1.0	1.1719283	0.0000	1	5	4	2.1518391	3.0038936	0.0225182
House.Style*	18	2930	4.0238908	1.9106293	3.0	4.0102389	0.0000	1	8	7	0.3211783	-0.9501428	0.0352974
Overall.Qual	19	2930	6.0948805	1.4110261	6.0	6.0755119	1.4826	1	10	9	0.1904388	0.0481942	0.0260676
Overall.Cond	20	2930	5.5631399	1.1115366	5.0	5.4680034	0.0000	1	9	8	0.5738415	1.4837965	0.0205348
Year.Built	21	2930	1971.3563140	30.2453606	1973.0	1974.2474403	37.0650	1872	2010	138	-0.6038435	-0.5046106	0.5587595
Year.Remod.Add	22	2930	1984.2665529	20.8602859	1993.0	1985.6284130	20.7564	1950	2010	60	-0.4514001	-1.3424069	0.3853776
Roof.Style*	23	2930	2.3948805	0.8197217	2.0	2.2431741	0.0000	1	6	5	1.5584134	0.8882255	0.0151437
Roof.Matl*	24	2930	2.0627986	0.5382105	2.0	2.0000000	0.0000	1	8	7	8.7203343	76.9796054	0.0099430
Exterior.1st*	25	2930	11.1604096	3.6506082	14.0	11.4718430	1.4826	1	16	15	-0.5885754	-0.7610933	0.0674422
Exterior.2nd*	26	2930	11.8709898	3.9980891	15.0	12.1889932	2.9652	1	17	16	-0.5636232	-0.8976905	0.0738616
Mas.Vnr.Type*	27	2930	4.4269625	1.0775448	5.0	4.4577645	0.0000	1	6	5	-0.6932065	-0.6198667	0.0199068
Mas.Vnr.Area	28	2907	101.8968008	179.1126106	0.0	61.1388053	0.0000	0	1600	1600	2.6042950	9.2601663	3.3220307
Exter.Qual*	29	2930	3.5290102	0.7016622	4.0	3.6424915	0.0000	1	4	3	-1.7925145	3.6662698	0.0129627
Exter.Cond*	30	2930	4.7098976	0.7723926	5.0	4.9261945	0.0000	1	5	4	-2.5031881	5.1134440	0.0142694
Foundation*	31	2930	2.3924915	0.7261949	2.0	2.4539249	1.4826	1	6	5	0.0107612	0.7550324	0.0134159
Bsmt.Qual*	32	2851	4.6878288	1.3131144	4.0	4.8487505	2.9652	1	6	5	-0.4618164	-0.8243922	0.0245926
Bsmt.Cond*	33	2851	5.7972641	0.6970662	6.0	6.0000000	0.0000	1	6	5	-3.3552418	10.0737610	0.0130550
Bsmt.Exposure*	34	2851	4.2714837	1.1375368	5.0	4.4660237	0.0000	1	5	4	-1.1653915	-0.3047203	0.0213043
BsmtFin.Type.1*	35	2851	4.7551736	1.8096342	4.0	4.8193775	2.9652	1	7	6	-0.0393708	-1.3628952	0.0338916
BsmtFin.SF.1	36	2929	442.6295664	455.5908391	370.0	384.0840085	548.5620	0	5644	5644	1.4147320	6.8388117	8.4181236
BsmtFin.Type.2*	37	2851	6.6720449	1.0167651	7.0	6.9706269	0.0000	1	7	6	-3.3854719	10.8381116	0.0190424
BsmtFin.SF.2	38	2929	49.7224309	169.1684756	0.0	2.0366738	0.0000	0	1526	1526	4.1357391	18.7325405	3.1257897
Bsmt.Unf.SF	39	2929	559.2625469	439.4941528	466.0	510.7735608	415.1280	0	2336	2336	0.9221075	0.4044530	8.1206990
Total.Bsmt.SF	40	2929	1051.6145442	440.6150670	990.0	1035.0537313	349.8936	0	6110	6110	1.1550204	9.1097038	8.1414106
Heating*	41	2930	2.0252560	0.2452214	2.0	2.0000000	0.0000	1	6	5	12.0957649	168.4515977	0.0045303
Heating.QC*	42	2930	2.5389078	1.7439418	1.0	2.4236348	0.0000	1	5	4	0.4808449	-1.5194533	0.0322180
Central.Air*	43	2930	1.9331058	0.2498813	2.0	2.0000000	0.0000	1	2	1	-3.4653089	10.0117831	0.0046164
Electrical*	44	2930	5.6846416	1.0492497	6.0	6.0000000	0.0000	1	6	5	-3.0814461	7.6505468	0.0193841
X1st.Flr.SF	45	2930	1159.5576792	391.8908853	1084.0	1127.1689420	349.8936	334	5095	4761	1.4679244	6.9480810	7.2398797
X2nd.Flr.SF	46	2930	335.4559727	428.3957150	0.0	272.8959044	0.0000	0	2065	2065	0.8655698	-0.4179641	7.9142781
Low.Qual.Fin.SF	47	2930	4.6767918	46.3105100	0.0	0.0000000	0.0000	0	1064	1064	12.1057567	175.1836856	0.8555507
Gr.Liv.Area	48	2930	1499.6904437	505.5088875	1442.0	1452.2453072	461.0886	334	5642	5308	1.2728055	4.1238681	9.3388841
Bsmt.Full.Bath	49	2928	0.4313525	0.5248202	0.0	0.3963311	0.0000	0	3	3	0.6160073	-0.7502980	0.0096990
Bsmt.Half.Bath	50	2928	0.0611339	0.2452536	0.0	0.0000000	0.0000	0	2	2	3.9367587	14.8820022	0.0045324
Full.Bath	51	2930	1.5665529	0.5529406	2.0	1.5575939	0.0000	0	4	4	0.1717761	-0.5442378	0.0102151
Half.Bath	52	2930	0.3795222	0.5026293	0.0	0.3387372	0.0000	0	2	2	0.6969988	-1.0319744	0.0092857
Bedroom.AbvGr	53	2930	2.8542662	0.8277311	3.0	2.8323379	0.0000	0	8	8	0.3053813	1.8828128	0.0152917
Kitchen.AbvGr	54	2930	1.0443686	0.2140762	1.0	1.0000000	0.0000	0	3	3	4.3094087	19.8182212	0.0039549
Kitchen.Qual*	55	2930	3.8563140	1.2692368	5.0	4.0251706	0.0000	1	5	4	-0.6206625	-0.6780245	0.0234482
TotRms.AbvGrd	56	2930	6.4430034	1.5729644	6.0	6.3336177	1.4826	2	15	13	0.7527712	1.1477391	0.0290593
Functional*	57	2930	7.6918089	1.1754144	8.0	8.0000000	0.0000	1	8	7	-3.8253633	13.7939856	0.0217149
Fireplaces	58	2930	0.5993174	0.6479209	1.0	0.5183447	1.4826	0	4	4	0.7384585	0.0971731	0.0119698
Fireplace.Qu*	59	1508	3.7194960	1.1264883	3.0	3.7831126	1.4826	1	5	4	-0.1217511	-1.0149437	0.0290086
Garage.Type*	60	2773	3.2830869	1.7903034	2.0	3.1144660	0.0000	1	6	5	0.7476409	-1.3113362	0.0339979
Garage.Yr.Blt	61	2771	1978.1324432	25.5284113	1979.0	1980.7108705	31.1346	1895	2207	312	-0.3842554	1.8176404	0.4849596
Garage.Finish*	62	2773	3.1799495	0.8229231	3.0	3.2257774	1.4826	1	4	3	-0.3510946	-1.4163756	0.0156273
Garage.Cars	63	2929	1.7668146	0.7605664	2.0	1.7680171	0.0000	0	5	5	-0.2196113	0.2402900	0.0140533
Garage.Area	64	2929	472.8197337	215.0465485	480.0	468.3458422	182.3598	0	1488	1488	0.2417464	0.9446576	3.9734961
Garage.Qual*	65	2772	5.8405483	0.6634018	6.0	6.0000000	0.0000	1	6	5	-4.0371873	14.7948678	0.0126003
Garage.Cond*	66	2772	5.8979076	0.5319486	6.0	6.0000000	0.0000	1	6	5	-5.2866785	27.1445647	0.0101035
Paved.Drive*	67	2930	2.8313993	0.5363888	3.0	3.0000000	0.0000	1	3	2	-2.9849322	7.1493968	0.0099094
Wood.Deck.SF	68	2930	93.7518771	126.3615619	0.0	71.2141638	0.0000	0	1424	1424	1.8407918	6.7337405	2.3344317
Open.Porch.SF	69	2930	47.5334471	67.4834001	27.0	33.8724403	40.0302	0	742	742	2.5327906	10.9241024	1.2467034
Enclosed.Porch	70	2930	23.0116041	64.1390592	0.0	4.8297782	0.0000	0	1012	1012	4.0103363	28.4151135	1.1849193
X3Ssn.Porch	71	2930	2.5924915	25.1413310	0.0	0.0000000	0.0000	0	508	508	11.3921213	149.6265877	0.4644666
Screen.Porch	72	2930	16.0020478	56.0873702	0.0	0.0000000	0.0000	0	576	576	3.9534162	17.8124277	1.0361706
Pool.Area	73	2930	2.2433447	35.5971806	0.0	0.0000000	0.0000	0	800	800	16.9218026	299.0552932	0.6576303
Pool.QC*	74	13	2.4615385	1.1982894	3.0	2.4545455	1.4826	1	4	3	-0.0507909	-1.6761110	0.3323457
Fence*	75	572	2.4125874	0.8355387	3.0	2.4890830	0.0000	1	4	3	-0.6760973	-0.8887827	0.0349356
Misc.Feature*	76	106	3.8490566	0.5484602	4.0	4.0000000	0.0000	1	5	4	-3.1640389	10.3679528	0.0532712
Misc.Val	77	2930	50.6351536	566.3442883	0.0	0.0000000	0.0000	0	17000	17000	21.9772674	564.8476877	10.4627709
Mo.Sold	78	2930	6.2160410	2.7144924	6.0	6.1569966	2.9652	1	12	11	0.1923989	-0.4576397	0.0501481
Yr.Sold	79	2930	2007.7904437	1.3166129	2008.0	2007.7380546	1.4826	2006	2010	4	0.1347244	-1.1589064	0.0243234
Sale.Type*	80	2930	9.3587031	1.8781600	10.0	9.8745734	0.0000	1	10	9	-3.3213129	10.7566598	0.0346975
Sale.Condition*	81	2930	4.7798635	1.0762992	5.0	5.0000000	0.0000	1	6	5	-2.7934757	7.2515027	0.0198838
SalePrice	82	2930	180796.0600683	79886.6923567	160000.0	170429.1510239	54856.2000	12789	755000	742211	1.7417153	5.1025881	1475.8445972

0.0.5 EXAMINING MISSING VALUES

I was keen to establish how complete the dataset was; I used the following stream of code to provide clear insight into this question

  ames%>%
  gather(key = "key", value = "val") %>% 
  mutate(is.missing = is.na(val)) %>% 
  group_by(key, is.missing) %>%
  summarise(Absent.data = n())%>%
  filter(is.missing == TRUE, Absent.data > 1) %>% 
  select(-is.missing) %>%
  arrange(desc(Absent.data)) %>% 
  ggplot(aes(x = reorder(key, Absent.data), y = Absent.data, fill = key)) +
  geom_col() +
  coord_flip() +
  xlab("Variable") +
  ylab("Absent values")+
  theme(legend.position='none')+
  geom_text( aes( label = Absent.data), colour ="white", hjust = 1.0, size = 4.0)+
  coord_flip()+
  ISAAC.THEME2

## `summarise()` has grouped output by 'key'. You can override using the `.groups` argument.

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

The primary documentation relating to the contents of the Ames Housing dataset is revealing: surprisingly, the designation ‘NA’ for some of the variables provided did not signal the standard not available meaning . For instance, in relation to Pool.QC, the designation NA translates as No Pool, NA means No Basement for Bsmt.Qual and No Alley in relation to Alley.

Of course, I could not be entirely sure whether some of these NA entries indicate the absence of certain variables like pools and fireplaces were the result of systematic labelling or accidental omissions. However, in the case of the pool and fireplace variables, I could reduce the likelihood that the NA entries were caused by accidental omissions. Before I permitted myself to recode with distasteful haste, I engaged in a little verification. I simply reasoned that if a house truly does not have a pool, clearly the variable Pool.Area (denoting area of pool) should not exceed 0.

POOL

ames%>%
  select(Pool.Area, Pool.QC)%>% 
  filter(Pool.Area > 0 & is.na(Pool.QC))

## [1] Pool.Area Pool.QC  
## <0 rows> (or 0-length row.names)

The 0 result started suggested that there were no issues there.

FIREPLACES

Similarly, I reasoned that if a house truly does not have a fireplace, then naturally it should not have an accompanying rating.

ames %>% 
  select(Fireplaces, Fireplace.Qu) %>% filter(Fireplaces > 0 & is.na(Fireplace.Qu))

## [1] Fireplaces   Fireplace.Qu
## <0 rows> (or 0-length row.names)

Again, I was satisfied that there were no issues here either.

I decided to relabel the NA variables appropriately. This deficit in the respective data entries can be classified as MNAR: missing not at random. I was now in a position to confidently perform some blanket coding.

# create temporary subset for recoding 

x<- ames[c('Alley','Bsmt.Exposure', 'BsmtFin.Type.1','BsmtFin.Type.2', 'Fireplace.Qu', 'Garage.Type', 'Garage.Finish', 'Garage.Qual', 'Garage.Cond','Pool.QC','Fence','Misc.Feature')] #list of variables to be recoded

# recode specified variables 

x=as.matrix(x)
x[is.na(x)] <-"None"
x=as.data.frame(x)


myvars <- names(ames) %in% c("Alley", "Bsmt.Exposure", "BsmtFin.Type.1", "BsmtFin.Type.2", "Fireplace.Qu", "Garage.Type", "Garage.Finish", "Garage.Qual", "Garage.Cond","Pool.QC","Fence","Misc.Feature")

ames.reduced <- ames[!myvars] #remove specified variables in their original format

ames.Updated<- bind_cols(ames.reduced, x )
# append updated variables to reduced dataset

In aforementioned code chunk, I extracted the variables where the designation NA indicates the absence of the respective feature(s) into a separate data frame, before recoding NA as ‘None’. Thereafter, I removed the original variables, before using dplyr’s ‘bind_cols’ function to append the revised variables back to the original dataset. These manoeuvres significantly reduced my NA burden - the following code block and resultant graphical display illustrates the revised level of missingness.

ames.Updated %>%
  gather(key = "key", value = "val") %>% 
  mutate(is.missing = is.na(val)) %>% 
  group_by(key, is.missing) %>%
  summarise(Absent.data = n())%>%
  filter(is.missing == TRUE, Absent.data > 1) %>% 
  select(-is.missing) %>%
  arrange(desc(Absent.data)) %>% 
  ggplot(aes(x = reorder(key, Absent.data), y = Absent.data, fill = key)) +
  geom_col() +
  coord_flip() +
  xlab("Variable") +
  ylab("Absent values")+
  theme(legend.position='none')+
  geom_text( aes( label = Absent.data), colour ="white", hjust = 1.0, size = 4.0)+
  coord_flip()+
  ISAAC.THEME2

## `summarise()` has grouped output by 'key'. You can override using the `.groups` argument.

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

The coding provided great improvement! My NA burden was reduced considerably.

DETECT EMPTY STRINGS

The is.na function does not detect the presence of missing strings, so I decided to do an independent check for their presence in the dataset and then tabulate the results

EmptyStrings <- colSums(ames.Updated == "", na.rm =TRUE)
rev(stack(EmptyStrings[EmptyStrings > 0]))%>%kable()%>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", full_width = F))

ind	values
Mas.Vnr.Type	23
Bsmt.Qual	1
Bsmt.Cond	1
Electrical	1
Bsmt.Exposure	4
BsmtFin.Type.1	1
BsmtFin.Type.2	2
Garage.Finish	2
Garage.Qual	1
Garage.Cond	1

I decided to recode all the empty strings as NA- if you happen to have Saint-like patience, I’ll show you how I addressed these entries alongside the other NAs in the imputation phase!

ames.Updated <- ames.Updated %>% mutate_all(na_if,"")

I decided to remove the features Order and Parcel identification(PID) from the dataset as these are effectively arbitrary and unlikely to have been useful for any form of meaningful generalisation.

#remove Order, PID variables
ames.Updated<-ames.Updated%>%select(-c(Order,PID))

0.0.6 ADDRESS INCORRECTLY LABELLED VARIABLES

A close inspection of the dataset revealed that the variable MS.SubClass had been incorrectly labelled as an integer. I converted this variable to a factor variable. MS SubClass is a nominal variable that characterises the type of dwelling, from a 1-story building to a duplex style of house . I converted this variable to a factor before recoding the various levels simply using the letters of the alphabet at this stage.

class(ames.Updated$MS.SubClass)

## [1] "integer"

ames.Updated$MS.SubClass<-as.factor(ames.Updated$MS.SubClass)
levels(ames.Updated$MS.SubClass) <- c("A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P")

MONTH SOLD

Month sold (Mo.Sold) was entered as a an integer, however I reclassified it as a factor variable

ames.Updated[["Mo.Sold"]] <- factor(ames.Updated[["Mo.Sold"]])
ames.Updated$Month.Sold <- month.abb[ames.Updated$Mo.Sold]
ames.Updated$Mo.Sold<-NULL
ames.Updated$Month.Sold<-as.factor(ames.Updated$Month.Sold)

0.0.7 FEATURE ENGINEERING

I decided to modify the variables Year.Built, Garage.Yr.Blt and Year.Remod.Add slightly: instead of defining these variables by their respective year, I wanted to extract the number of years elapsed. I accomplished this by simply subtracting the year from 2010 (the year in which the sale prices were recorded). I felt that this would make the data friendlier for usage with the various algorithms that were to be employed in due course.

ames.Updated$Age.of.house <- 2010 - ames.Updated$Year.Built
ames.Updated$Year.Built<-NULL
ames.Updated$Age.of.garage <- 2010 - ames.Updated$Garage.Yr.Blt
ames.Updated$Garage.Yr.Blt <-NULL
ames.Updated$Yrs.since.Remod<-2010-ames.Updated$Year.Remod.Add
ames.Updated$Year.Remod.Add<-NULL

0.0.8 REDUCTION OF EXTRANEOUS CLUTTER

The original dataset contained four variables quantifying the types of bathrooms; I decided to aggregate these variables into a united variable simply entitled ‘Bathroom.Tally’.

ames.Updated$Bathroom.Tally <- 0.5 * ames.Updated$Half.Bath + 0.5 * ames.Updated$Bsmt.Half.Bath + ames.Updated$Bsmt.Full.Bath + ames.Updated$Full.Bath

I decided to combine the interior area related variables, namely Total.Bsmt.SF (total square feet of basement area) and Gr.Liv.Area (above ground living area of the house in square feet) .

ames.Updated$Overall.AREA <- ames.Updated$Gr.Liv.Area + ames.Updated$Total.Bsmt.SF

I then removed the constituent variables of Bathroom.tally, Overall.area.size and Porch area:

ames.Updated<-ames.Updated%>%select(-c(Half.Bath, Bsmt.Half.Bath,Bsmt.Full.Bath,Full.Bath,Gr.Liv.Area,Total.Bsmt.SF))

ELIMINATE NEAR ZERO VARIANCE FACTORS

The nearZeroVar functions evaluates the ratio of the highest frequency value to the next most frequent value(freqCut) . In addition, the function evaluates the percentage of unique values in relation to the overall quantity of samples (uniqueCut). The default settings are freqCut = 19 and uniqueCut = 10. Prior to inspecting the data for near-zero variance predictors , I decided to use dplyr’s drop_na function so that I was examining the dataset using only ‘complete cases’.

ames.Updated.comp<-ames.Updated%>%drop_na()

nzv.data <- nearZeroVar(ames.Updated.comp, saveMetrics = TRUE)
drop.cols <- rownames(nzv.data)[nzv.data$nzv == TRUE]
ames.Updated.2<- ames.Updated[,!names(ames.Updated)%in% drop.cols]

Applying the near zero variance function flagged the following variables: Alley,Street, Land.Contour, Utilities, Land.Slope, Misc.Val, Pool.Area,Screen.Porch, Enclosed.Porch, Open.Porch.SF,Garage.Cond,Kitchen.Abv.Gr,Low.Qual.Fin.SF,Bsmt.Fin.SF.2, Condition.2, Bsmt.Cond, BsmtFin.Type.2,Roof.Matl for removal. All of the suggested variables were removed.

0.0.9 MULTICOLLINEARITY

When independent variables are highly correlated, changes in one variable are associated with changes in the other correlated variables. When conducting multiple linear regression, it becomes challenging to change one feature without simultaneously changing another independent variable, thus the algorithm struggles to discern the unique relationship that exists between particular individual independent variables upon the dependent variable. Unfortunately, the coefficient estimates become unreliable, they are sensitive to small changes in the model. Whilst the presence of multicollinearity influences the coefficient and p-values, it will not impair predictive performance.

I extracted a subset consisting of all complete cases.

ames_complete <- 
ames.Updated.2[complete.cases(ames.Updated.2), ]

NUMERIC AND CATEGORICAL COMPLETE CASES

I then extracted a new subset containing all numeric variables:

amesnumeric.comp <- dplyr::select_if(ames_complete, is.numeric)

I then decided to find out what numeric variables were most correlated with Sale Price

corr_var(amesnumeric.comp, # name of dataset
         SalePrice, # name of variable to focus on
         max_pvalue = 0.05
        ,top = 10)

0.0.10 CORRELATIONS

I removed the dependent variable SalePrice before examining the pairwise correlations:

amesnumeric2 <-amesnumeric.comp[c(-19)]

PAIRWISE CORRELATIONS ACROSS COVARIATES

The caret findCorrelation evaluates pair-wise correlations across all variables, flagging variables that are highly correlated. Of the identified pairs, the function recommends the removal of the variable with the highest average absolute correlation across the dataset. However, I decided to retain the variables that have the highest correlation with the dependent variable i.e. SalePrice.I proceeded to find out which of the variables had the highest correlation with the dependent variable. Firstly, I extracted a subset with the suggested variables for removal, I was then able to easily arrange the correlations with SalePrice in descending order.

library(caret)
identify<-caret::findCorrelation(cor(amesnumeric2),cutoff = 0.75,names = T, verbose = T)

## Compare row 23  and column  9 with corr  0.832 
##   Means:  0.419 vs 0.236 so flagging column 23 
## Compare row 14  and column  15 with corr  0.852 
##   Means:  0.351 vs 0.218 so flagging column 14 
## Compare row 19  and column  20 with corr  0.84 
##   Means:  0.286 vs 0.207 so flagging column 19 
## All correlations <= 0.75

ames.suggestions<- amesnumeric.comp[c('Garage.Cars','Garage.Area','Overall.AREA',  'X1st.Flr.SF','Age.of.garage', 'Age.of.house','SalePrice')]

#evaluate correlations against dependent variable
ames.suggestions %>% corr_var(SalePrice)

I removed the three variables that had the lowest correlation with SalePrice from the amesnumeric.comp subset.

#variable to be removed
minus.vars<-c("Age.of.garage","Age.of.house","X1st.Flr.SF")

#create numeric subset devoid of variables specified in minus.vars
amesnumeric.3 <- amesnumeric.comp[,!names(amesnumeric.comp) %in% minus.vars]

At this stage, I removed the same variables from the main dataset

ames.Updated.3<-ames.Updated.2[,!names(ames.Updated.2)%in% minus.vars]

0.0.11 HIGHEST SIGNIFICANT CORRELATIONS

I decided to evaluate which variables had the highest correlations (using Pearson’s r) with SalePrice. I shall took a subset featuring only complete cases from the previously updated primary dataset:

ames.comp2<-ames.Updated.3[complete.cases(ames.Updated.3),]

I took a further subset containing only the numeric variables from the subset containing only complete cases before evaluating the respective correlations.

amesnumeric.4<-dplyr::select_if(ames.comp2, is.numeric)

corr_var(amesnumeric.4, # name of dataset
         SalePrice, # name of variable to focus on
         max_pvalue = 0.05
        ,top = 10)

0.0.12 FACTOR SELECTION

I wanted a fairly quick way to identify influential factor variables. Here, I decided to the enlist the aid of the Random forest algorithm to help identify influential categorical factors. I created a subset of the factor/character variables with Sale Price. Afterwards, I ran the random forest algorithm.

Factor_subset <- ames_complete[sapply(ames_complete, is.character)|sapply(ames_complete, is.factor)|names(ames_complete)=='SalePrice']

set.seed(174)
Influential.Factors <- ranger(SalePrice ~ ., data = Factor_subset, importance = "permutation")

Extract.influential <- vip(Influential.Factors)
plot(Extract.influential)

0.1 ILLUMINATING VISUALSATIONS

I decided to produce a range of potentially informative visualisations; I was particularly interested in seeing whether I could spot any discernible relationships between the dependent variable and the respective exploratory variables. The adage of ‘a picture speaks a thousand words’ is very often the case in the field of data analysis. Up until this point, throughout the EDA (Exploratory Data Analysis) section, I had relied heavily upon the guidance of the Pearson correlation coefficient. However, as I stated earlier, the Pearson coefficient is a measure of linear correlation, it does not capture non-linear relationships. Informative patterns may well exist between the independent variable(s) and the dependent variable that the Pearson coefficient fails to capture.

0.1.1 NUMERIC VARIABLES

Here, I plotted all the numeric variables against Sale Price

amesnumeric.4%>%
    pivot_longer(-SalePrice, names_to = "Feature", values_to = "Value") %>%
    ggplot() +
    geom_point(mapping=aes(x = Value, y = SalePrice, color = Feature)) + 
    scale_y_continuous(labels = scales::comma)+
    facet_wrap(~ Feature, scales = "free", ncol = 4) +
    scale_x_continuous(n.breaks = 2)+
    theme(legend.position = "",
          plot.title.position = "plot")+
    labs(x = "Numeric Feature Value",
         title = "Ames Housing Numeric Variable versus Sale Price")+
  ISAAC.THEME

Overall Area is clearly moving in tandem with Sale Price, whereas Years since remodeled appears to exhibit a clearly discernible negative correlation with Sale Price.

0.1.2 PRICE VARIATION OVER TIME

ggplot(data = ames_complete, 
       mapping = aes(x = Yrs.since.Remod, y = SalePrice)) +
    geom_line() +
    facet_wrap(vars(Neighborhood)) +
    labs(title = "Price vs Time since remodeled/Neighbourhood")+
  theme_economist()

Please note that the variable time since remodeled ( Yrs.since.Remod) translates as the time since construction if remodelling/additions have not been carried out.

0.1.3 FACTOR VARIABLES (Boxplots)

Factor_subset%>%gather(-SalePrice, key = "var", value = "value") %>%ggplot(aes(x = value, y = SalePrice)) +
    geom_violin() +
    facet_wrap(~ var, scales = "free") +
    theme_bw()+
  ISAAC.THEME2

0.1.4 DENSITY PLOTS

I decided to expore the distribution densities of the exploratory variables.

melt(ames_complete) %>%
  ggplot(aes(x= value)) +
    geom_density(fill='#fdae6b') + facet_wrap(~variable, scales = 'free')+
ISAAC.THEME

## Using MS.SubClass, MS.Zoning, Lot.Shape, Lot.Config, Neighborhood, Condition.1, Bldg.Type, House.Style, Roof.Style, Exterior.1st, Exterior.2nd, Mas.Vnr.Type, Exter.Qual, Exter.Cond, Foundation, Bsmt.Qual, Heating.QC, Electrical, Kitchen.Qual, Paved.Drive, Sale.Type, Sale.Condition, Bsmt.Exposure, BsmtFin.Type.1, Fireplace.Qu, Garage.Type, Garage.Finish, Garage.Qual, Fence, Month.Sold as id variables

0.1.5 DISTRIBUTION OF SALE PRICE

ggplot(ames.Updated.3, aes(x=SalePrice, y=..density..)) + 
  geom_histogram(bins=50, fill="#8F0C90", col="#DAF7A6") +
  geom_density(size = 1, colour = "#FFFAE5") +
  labs(title = "Houses prices", x = 'Price', y = "Frequency") +
  scale_x_continuous(labels=scales::dollar_format())+
  theme(plot.title = element_text(hjust = 0.5))+
  ISAAC.THEME

The distribution of house prices is clearly right-skewed. I felt that it may be worth considering logarithmic transformations of sale price (the dependent variable), especially for linear modelling.

0.1.6 NEIGHBOURHOOD VS SALE PRICE

My hunch was that prices would vary considerably across neighbourhoods, indeed the visualisation featured below clearly indicates that this was the case.

library(ggridges)
ggplot(ames.comp2, aes(x = `SalePrice`, y = Neighborhood, fill = stat(x))) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_viridis_c(name = "Prices", option = "C") +
  labs(title = 'Prices across Neighbourhoods')

## Picking joint bandwidth of 15500

0.2 SUMMARY TABLES/DESCRIPTIVE STATS

I was keen to obtain an overview of some basic descriptive summary statistics, my primary motivation was to identify factor levels containing a low number of entries. Factor levels with a low level of entries be problematic for running many algorithms.

0.2.1 BASEMENT QUALITY

ames.Updated.3%>%group_by(Bsmt.Qual)%>%summarize(n=n())%>%mutate(percent = round(n / sum(n) * 100, 2))%>%kable()%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Bsmt.Qual	n	percent
Ex	258	8.81
Fa	88	3.00
Gd	1219	41.60
Po	2	0.07
TA	1283	43.79
NA	80	2.73

0.2.2 EXTERNAL QUALITY

ames.Updated.3%>%group_by(Exter.Qual)%>%summarize(n=n())%>%mutate(prop=n/sum(n))%>%kable()%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Exter.Qual	n	prop
Ex	107	0.0365188
Fa	35	0.0119454
Gd	989	0.3375427
TA	1799	0.6139932

0.2.3 KITCHEN QUALITY

ames.Updated.3%>%group_by(Kitchen.Qual)%>%summarize(n=n())%>%mutate(percent = round(n / sum(n) * 100, 2))%>%knitr::kable(
    format = "html",
    booktabs = TRUE, # For LaTeX
    col.names = c("Quality", "Count", "Percentage"),
    caption = "Kitchen Quality Proportions",
    digits = 1
  ) %>%kable_styling(bootstrap_options = "striped")

Kitchen Quality Proportions
Quality	Count	Percentage
Ex	205	7.0
Fa	70	2.4
Gd	1160	39.6
Po	1	0.0
TA	1494	51.0

0.2.4 FIREPLACE QUALITY

ames.Updated.3%>%group_by(Fireplace.Qu)%>%summarize(n=n())%>%mutate(percent = round(n / sum(n) * 100, 2))%>%knitr::kable(
    format = "html",
    booktabs = TRUE, # For LaTeX
    col.names = c("Quality", "Count", "Percentage"),
    caption = "Distribution of Fireplace Quality",
    digits = 1
  ) %>%kable_styling(bootstrap_options = "striped")

Distribution of Fireplace Quality
Quality	Count	Percentage
Ex	43	1.5
Fa	75	2.6
Gd	744	25.4
None	1422	48.5
Po	46	1.6
TA	600	20.5

0.2.5 SALE PRICE SUMMARIES

library(kableExtra)
ames.Updated.3%>%
   summarise(average = mean(SalePrice), median=median(SalePrice), minimum = min(SalePrice), maximum=max(SalePrice), standard.deviation=sd(SalePrice)) %>% 
    kable()%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

average	median	minimum	maximum	standard.deviation
180796.1	160000	12789	755000	79886.69

0.2.6 ESSENTIAL RECODING

A number of the variables in this dataset were of the ordinal type; I decided to recode these as numeric variables, primarily to ensure that they would be suitable for the modelling process.

EXTERNAL QUALITY

ames.Updated.3$Exter.Qual <- as.numeric(factor(ames.Updated.3$Exter.Qual, 
                                  levels = c("Fa","TA", "Gd", "Ex"),
                                  labels = c(1,2,3,4) ,ordered = TRUE))

BASEMENT QUALITY

ames.Updated.3$Bsmt.Qual <- as.numeric(factor(ames.Updated.3$Bsmt.Qual, 
levels = c("Po","Fa", "TA", "Gd", "Ex"), labels = c(1,2,3,4,5) ,ordered = TRUE))

KITCHEN QUALITY

When I was perusing through the distribution of levels, I realized that there is only one entry for a classification of Poor(Po), therefore I shall assimilate this into the nearest rank of Fair(Fa). :

ames.Updated.3[ames.Updated.3$Kitchen.Qual=="Po",]$Kitchen.Qual <- "Fa"

ames.Updated.3$Kitchen.Qual <- as.numeric(factor(ames.Updated.3$Kitchen.Qual, 
                                  levels = c("Fa","TA", "Gd", "Ex"),
                                  labels = c(1,2,3,4) ,ordered = TRUE))

FIREPLACE QUALITY

ames.Updated.3$Fireplace.Qu <- as.numeric(factor(ames.Updated.3$Fireplace.Qu, 
                                  levels = c("None","Po","Fa", "TA", "Gd", "Ex"),
                                  labels = c(1,2,3,4,5,6) ,ordered = TRUE))

Now that I had introduced some new numeric variables into the dataset, I decided to examine their correlation with sale price.

ames.ordinal<- ames.Updated.3[c('Exter.Qual','Fireplace.Qu','Bsmt.Qual',  'Kitchen.Qual','SalePrice')]
ames.ordinal%>%drop_na()%>%corr_var(SalePrice, max_pvalue = 0.05)

0.2.7 REDUCED DATASET

At this stage, I decided to reduce the evolving dataset considerably by selecting the twenty exploratory variables that had thus far indicated clear predictive promise. I made the decision to include the top ten variables with the highest correlation with Sale Price, in addition to the top ten most important features extracted from the random forest algorithm that I ran earlier using the factor variables to predict house prices. The model was entitled Influential.Factors. I extracted the relevant subset from ames.Updated.3.

ames.parsimonious6<-ames.Updated.3%>%select(Overall.Qual,Overall.AREA,Garage.Cars, Garage.Area, Bathroom.Tally, Mas.Vnr.Area, Yrs.since.Remod,TotRms.AbvGrd,Fireplaces,BsmtFin.SF.1,Bsmt.Qual,Foundation,Exter.Qual,Kitchen.Qual,Neighborhood,Fireplace.Qu, Garage.Type, Garage.Finish, MS.SubClass, Exterior.1st,SalePrice)

0.2.8 ENSURE FACTOR VARIABLES ARE SUITABLE FOR MODELLING:

I soon realized that I still needed to do a little more preprocessing before the data was ready for the modelling process. A number of the independent variables had levels with few entries. I decided to recruit the Forcats library in order to relevel levels with fewer than 50 entries into a new category simply labelled ‘other’. I initially attempted to run a number of the models without including this step, however a number of the models failed to function as intended.

library(forcats)

ames.parsimonious6 <- ames.parsimonious6 %>%mutate(Neighborhood = fct_lump_min(Neighborhood, min=50, other_level = "other"))

ames.parsimonious6 <- ames.parsimonious6 %>%mutate(MS.SubClass = fct_lump_min(MS.SubClass, min=50, other_level = "other"))

ames.parsimonious6 <- ames.parsimonious6 %>%mutate(Exterior.1st = fct_lump_min(Exterior.1st, min=50, other_level = "other"))

ames.parsimonious6 <- ames.parsimonious6 %>%mutate(Garage.Type = fct_lump_min(Garage.Type, min=50, other_level = "other"))

In the basement quality variable, only two of the entries were marked as Poor(Po); I decided to change these entries to Fair(Fa). This would increase the likelihood of algorithms running smoothly.

#convert few entries marked '1' as '2'

ames.parsimonious6$Bsmt.Qual[ames.parsimonious6$Bsmt.Qual==1]<-2

0.2.9 RECRUITING TIDYMODELS

It was time to begin the modelling process. Firstly, I needed to divide the primary dataset into a training and testing dataset.

library(doParallel)

## Loading required package: parallel

# doParallel
cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cores = cl)

#split data into training/testing sets
set.seed(174)
data_split <- initial_split(ames.parsimonious6, strata = "SalePrice", prop = 0.75)

ames_train <- training(data_split)
ames_test  <- testing(data_split)

I decided to recruit the recipes-package for some final preprocessing of the evolving dataset. A recipe allows one to craft a sequence of preprocessing operations on specified datasets. Thereafter, the recipe can be independently applied to other sets, including validation sets. One of the primary benefits of using the tidymodels/recipe framework is that it allows a data analyst to easily avoid the leakage of data between sets. A classic mistake can occur when certain preprocessing steps like centering and scaling are applied to the entire dataset before it is split into train/test or cross-validation folds. It is important to apply certain data preparation steps after splitting the data into respective train/test/folds in order to avoid biasing predictive performance. An example of this form of data leakage would occur if the analyst attempted to normalize columns of a dataset before splitting it into train/test subsets. If the analyst decided to apply min-max normalization, the process would entail calculating the global maximum(s)/minimum(s) in order to apply to every calculation. If this was done before splitting, the resultant training would have some insight into the distribution of the test data set. This type of issue can also arise with various forms of imputation.

In the following recipe I imputed missing data with K-nearest neighbours(KNN)-this algorithm, using Euclidean distance, recruits the k nearest samples in the sample space and imputes the mean of the specified neighbours. In addition, I converted all categorical variables to dummy variables in order to make them machine learning friendly.

model_rec <- recipe(
  SalePrice ~ .,
  data= ames_train)%>%step_dummy(all_nominal())%>%step_knnimpute(all_predictors(), neighbors = 10)%>%prep(training=ames_train, retain=TRUE, verbose=TRUE)

## oper 1 step dummy [training] 
## oper 2 step knnimpute [training] 
## The retained training set is ~ 1.01 Mb  in memory.

Here, I applied the recipe model_rec to the ames_train data

trainSet.prep<-bake(model_rec,new_data=ames_train, composition='matrix')
trainSet = as.data.frame((trainSet.prep))

Similarly, I used the function bake to apply the same recipe to the ames_test set.

testSet.prep<-bake(model_rec,new_data=ames_test, composition='matrix')
testSet = as.data.frame((testSet.prep))

0.2.10 LINEAR MODELLING

I decided to run a simple linear model to begin with. Linear regression is a good place to start testing models. Having visually examined the scatter plots of the independent variables against the dependent variable in the visualisation section, my hunch was that linear modelling would prove to be an inappropriate method. The scatterplots suggested that some of the continuous variables appeared to be violating one of the cardinal assumptions of linear regression- Linearity! This assumption posits that the dependent variable has a roughly linear relationship with each of the covariates.

Here, I constructed a simple linear model using all of the variables in the model.


```r
set.seed(741)
Linear.model<-lm(SalePrice~., data=trainSet)

I wanted to test the linear model on the test set.

predictions <- predict(Linear.model, newdata = testSet)

I calculated the residuals using the formula residuals = actual - prediction.

residuals <- testSet$SalePrice - predictions

I was keen to find out how well the model performed. Here, I simply wanted to see the root-mean-square error(RMSE).

sqrt(mean(Linear.model$residuals^2))

## [1] 30651.1

0.2.11 ASSESSMENT OF RESIDUALS

The adage “A picture is worth a thousand words” is an apt expression when it comes assessing the residuals of linear models.

par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(Linear.model)

The linear model appeared to be poorly predicting prices in the upper price ranges: the residuals versus fitted plot indicates increased variance as the price ascends. The quantile-quantile(Q-Q) plot was particularly revealing: the residuals are overdispersed. The distribution is heavier tailed than a normal distribution: this generally signifies that there are more extreme values than would be expected if the underlying distribution was normal. The p-values and confidence intervals will be potentially biased and too narrow, which may result in an increase in type I errors (claiming that x affects y, when it does not). I decided to seek some corroboration via some trusted allies: the Shapiro, Breusch-Pagan and Anderson-Darling tests were all brought to the fore.

SHAPIRO TEST

I commenced my investigation with the Shapiro test. I wanted to find out whether the residuals/errors were normally distributed. Whilst it is not a prerequisite that the residuals are normally distributed for ordinary least squares to generate unbiased estimates of the underlying population parameters, meeting this condition facilitates the production of trustworthy prediction and confidence intervals.

p_val <- shapiro.test(Linear.model$residuals)$p.value;
if(p_val>=0.05) {
  print("The data is likely to be normally distributed")
} else {
  print("Given that p<0.05, the data does not appear to be normally distributed. The calculated
coeffient estimates cannot be trusted")
}

## [1] "Given that p<0.05, the data does not appear to be normally distributed. The calculated\ncoeffient estimates cannot be trusted"

The null hypothesis of the shapiro test is that the observed data is normally distributed. As p<0.05, in keeping with custom, I concluded that the residuals were not normally distributed.

BREUSCH-PAGAN TEST

I decided to perform a homoscedastic test. The Breusch-Pagan test evaluates the homoscedasticity of the residuals. Linear regression is particularly suitable when the error/residuals are homoscedastic across all the predicted values of the dependent variable. This type of assessment gauges whether a regression model’s ability to predict the dependent variable is consistent across all ranges of the respective variable.

library(lmtest)
bptest(Linear.model)

## 
##  studentized Breusch-Pagan test
## 
## data:  Linear.model
## BP = 879.11, df = 60, p-value < 0.00000000000000022

if(p_val>=0.05) {
  print("The data can be considered homoscesdastic")
} else {
  print("Given that p<0.05,  heteroskedasticity appears to be present.")
}

## [1] "Given that p<0.05,  heteroskedasticity appears to be present."

ANDERSON DARLING TEST

# Anderson-Darling normality test
library(nortest)
ad.test(Linear.model$residuals)

## 
##  Anderson-Darling normality test
## 
## data:  Linear.model$residuals
## A = 64.433, p-value < 0.00000000000000022

if(p_val>=0.05) {
  print("The data can be considered to be normally distributed")
} else {
  print("The result suggests that the residuals are NOT normally distibuted ")
}

## [1] "The result suggests that the residuals are NOT normally distibuted "

Non-normality and heteroscedasticity were clearly present in the distribution of the residuals. If heteroscedasticity is present and is the only principal assumption of linear regression that is not met, this does not bias the regression coefficients. However, it does bias the standard errors and test-statistics. Unfortunately, this can produce inaccurate standard errors that result in confidence intervals that cannot be relied on.

OUTLIERS

I decided to look a little further into the presence of the outliers that were showing up in the diagnostic plots. Outliers can be found by inspecting the standardized residuals. Standardized residuals indicate the number of standard errors removed from the regression line. It is customary to consider standardized residuals exceeding 3 as possible outliers.

Whilst not all outliers are deemed to be influential, they are generally worth exploring further. Cook’s distance is a metric that seeks to capture the influence of a value: it is governed by a mixture of leverage and residual size.

plot(Linear.model, 4, id.n = 6)

plot(Linear.model, 5, id.n = 6)

extra.metrics <- augment(Linear.model)

# Add observations indices and
# drop some columns (.se.fit, .sigma) for simplification
extra.metrics <- extra.metrics %>%
  mutate(index = 1:nrow(extra.metrics))

extra.metrics %>%
  top_n(6, wt = .cooksd)

## # A tibble: 6 x 69
##   SalePrice Overall.Qual Overall.AREA Garage.Cars Garage.Area Bathroom.Tally
##       <dbl>        <dbl>        <dbl>       <dbl>       <dbl>          <dbl>
## 1    160000           10        11752           2        1418            4.5
## 2    745000           10         6872           3         813            4.5
## 3    755000           10         6760           3         832            4  
## 4    183850           10        10190           3        1154            4  
## 5    184750           10         7814           3         884            4.5
## 6    475000           10         4715           3         840            2.5
## # ... with 63 more variables: Mas.Vnr.Area <dbl>, Yrs.since.Remod <dbl>,
## #   TotRms.AbvGrd <dbl>, Fireplaces <dbl>, BsmtFin.SF.1 <dbl>, Bsmt.Qual <dbl>,
## #   Exter.Qual <dbl>, Kitchen.Qual <dbl>, Fireplace.Qu <dbl>,
## #   Foundation_CBlock <dbl>, Foundation_PConc <dbl>, Foundation_Slab <dbl>,
## #   Foundation_Stone <dbl>, Foundation_Wood <dbl>, Neighborhood_CollgCr <dbl>,
## #   Neighborhood_Crawfor <dbl>, Neighborhood_Edwards <dbl>,
## #   Neighborhood_Gilbert <dbl>, Neighborhood_IDOTRR <dbl>,
## #   Neighborhood_Mitchel <dbl>, Neighborhood_NAmes <dbl>,
## #   Neighborhood_NoRidge <dbl>, Neighborhood_NridgHt <dbl>,
## #   Neighborhood_NWAmes <dbl>, Neighborhood_OldTown <dbl>,
## #   Neighborhood_Sawyer <dbl>, Neighborhood_SawyerW <dbl>,
## #   Neighborhood_Somerst <dbl>, Neighborhood_StoneBr <dbl>,
## #   Neighborhood_Timber <dbl>, Neighborhood_other <dbl>,
## #   Garage.Type_BuiltIn <dbl>, Garage.Type_Detchd <dbl>,
## #   Garage.Type_None <dbl>, Garage.Type_other <dbl>, Garage.Finish_None <dbl>,
## #   Garage.Finish_RFn <dbl>, Garage.Finish_Unf <dbl>, MS.SubClass_B <dbl>,
## #   MS.SubClass_E <dbl>, MS.SubClass_F <dbl>, MS.SubClass_G <dbl>,
## #   MS.SubClass_I <dbl>, MS.SubClass_K <dbl>, MS.SubClass_L <dbl>,
## #   MS.SubClass_N <dbl>, MS.SubClass_P <dbl>, MS.SubClass_other <dbl>,
## #   Exterior.1st_CemntBd <dbl>, Exterior.1st_HdBoard <dbl>,
## #   Exterior.1st_MetalSd <dbl>, Exterior.1st_Plywood <dbl>,
## #   Exterior.1st_VinylSd <dbl>, Exterior.1st_Wd.Sdng <dbl>,
## #   Exterior.1st_WdShing <dbl>, Exterior.1st_other <dbl>, .fitted <dbl>,
## #   .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>,
## #   index <int>

It was noteworthy that the top five outliers with high Cook’s distance scores shared a common Overall.Quality score of 10. I decided that I would consider revisiting these observations at a later stage.

PREDICTED/ACTUAL PRICES

I wanted to visualise the consistency of the predictions across the sample space.

library(auditor)
lm_audit <- audit(Linear.model, data = trainSet, y =trainSet$SalePrice)

## Preparation of a new explainer is initiated
##   -> model label       :  lm  ( [33m default [39m )
##   -> data              :  2199  rows  62  cols 
##   -> target variable   :  2199  values 
##   -> predict function  :  yhat.lm  will be used ( [33m default [39m )
##   -> predicted values  :  No value for predict function target column. ( [33m default [39m )
##   -> model_info        :  package stats , ver. 4.0.4 , task regression ( [33m default [39m ) 
##   -> predicted values  :  numerical, min =  7432.059 , mean =  181401.5 , max =  572141.6  
##   -> residual function :  difference between y and yhat ( [33m default [39m )
##   -> residuals         :  numerical, min =  -412141.6 , mean =  -0.000000001694649 , max =  275067.8  
##  [32m A new explainer has been created! [39m

# validate a model with auditor
mr_lm <- model_residual(lm_audit)

# plot results
plot_prediction(mr_lm, abline = TRUE)

Here, it was evident that as price increases, the model’s predictive accuracy diminishes.

I was keen to explore the performance of a variety of algorithms that are known to work well with non-linear data. However, before I abandoned my exploration of linear regression, I wanted to find out what variables the linear model suggested were the most important in determining sale price.

library(vip)
vip(Linear.model)

0.2.12 EXPERIMENTING WITH VARIOUS MODELS

I decided to experiment with a variety of algorithms. I simply wanted to get a sense of what algorithm(s) may be well-suited to modelling Sale Price in relation to the variables I had at my disposal.

library(parallel)      
library(doParallel)

Mycluster =makeCluster(detectCores()-1)
registerDoParallel(Mycluster)

myControl = trainControl(method = 'cv', number = 10, 
verboseIter = FALSE, savePredictions = TRUE,allowParallel = T)



set.seed(174)
Linear.Model = train(SalePrice ~., data = trainSet, metric = 'RMSE', method = 'lm',preProcess = c('center', 'scale'),trControl = myControl)


set.seed(174)
Glmnet.Model = train(SalePrice ~ ., data = trainSet , metric = 'RMSE', method = 'glmnet',preProcess = c('center', 'scale'), trControl = myControl)


set.seed(174)
Rapid.Ranger = train(SalePrice ~ ., data = trainSet, metric = 'RMSE', method = 'ranger',preProcess = c('center', 'scale'),trControl = myControl)


set.seed(174)
Basic.Knn <- train(SalePrice ~ .,
             method     = "knn",
             tuneGrid   = expand.grid(k =1:3), trControl  = myControl, metric= "RMSE", data = trainSet)


set.seed(174)
Xgb.Super <- train(SalePrice~.,method = "xgbTree", tuneLength = 4,trControl = myControl,metric= "RMSE", data = trainSet)

## [17:45:46] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.

suite.of.models = list("LINEAR.MODEL"=Linear.Model,"GLMNET.MODEL"=Glmnet.Model, "RANGER.QUEST"= Rapid.Ranger, "KNN.SIMPLE" = Basic.Knn, "XGB.SUPER"= Xgb.Super)
resamps = resamples(suite.of.models) 
dotplot(resamps, metric = 'RMSE')

0.2.12.1 TESTING MODELS ON TEST SET

Evaluate.Prediction <- function(model, model.label, testData, ytest, grid = NULL) {
 
  #capture prediction time
  ptm <- proc.time()
  # use test data to make predictions
  pred <- predict(model, testData)
  tm <- proc.time() - ptm
  
  Pred.metric<- postResample(pred = pred, obs = ytest)
  RMSE.test <- c(Pred.metric[[1]])
  RSquared.test <- c(Pred.metric[[2]])
  MAE.test <- c(Pred.metric[[3]])
  
  
  Summarised.results = NULL
  if (is.null(grid)) { 
    Summarised.results = data.frame(predictor = c(model.label) ,  RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]))
  } else {
    .grid = data.frame(predictor = c(model.label) , RMSE = RMSE.test , RSquared = RSquared.test, MAE = MAE.test, time = c(tm[[3]]))
    Summarised.results = rbind(grid, .grid)}
  
  
  Summarised.results }

METRIC.GRID <- Evaluate.Prediction (Rapid.Ranger, "RAPID.QUEST", testSet, testSet$SalePrice, grid=NULL)

METRIC.GRID <- Evaluate.Prediction (Glmnet.Model, "GLMNET.MODEL", testSet, testSet$SalePrice, grid=METRIC.GRID)

METRIC.GRID <- Evaluate.Prediction (Basic.Knn, "KNN.SIMPLE", testSet, testSet$SalePrice, grid=METRIC.GRID)

METRIC.GRID <- Evaluate.Prediction (Linear.Model, "LINEAR.MODEL", testSet, testSet$SalePrice, grid=METRIC.GRID)

METRIC.GRID <- Evaluate.Prediction (Xgb.Super, "XGB.SUPER", testSet, testSet$SalePrice, grid=METRIC.GRID)

kable(METRIC.GRID[order(METRIC.GRID$RMSE, decreasing=F),])%>%kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

	predictor	RMSE	RSquared	MAE	time
1	RAPID.QUEST	22946.09	0.9042345	14757.51	0.17
5	XGB.SUPER	23845.53	0.8992469	15554.32	0.16
2	GLMNET.MODEL	24783.40	0.8883612	17368.92	0.06
4	LINEAR.MODEL	24826.47	0.8883348	17432.18	0.03
3	KNN.SIMPLE	37875.20	0.7548226	26322.29	0.19

The random forest (Rapid.Ranger) model and extreme gradient boosting (XGB.SUPER) models showed the most promise with the lowest average predictive errors (RMSE). I decided to explore these models further.

0.2.13 TUNING A RANDOM FOREST MODEL USING TIDYMODELS FRAME WORK

I decided to construct the random forest model using the R’s Tidymodels framework. This is a superb package that beautifully streamlines the predictive modelling process.

To divide the available data, I recruited the initial_split() function: it is designed to split a data set into a training and testing set. In the default stance, 0.75 of the data is allocated to the training set with the remainder being reserved for testing. The default setting can be modified by changing the prop argument. Instead of a data frame, this function produces an rsplit object.

set.seed(741)
ames_split <- initial_split(ames.parsimonious6, prop = 0.80, strata = SalePrice)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

I opted to perform cross validation on the train set. Cross-validation, also known as rotation estimation, is a validation strategy used for evaluating how well the results of a statistical analysis are likely to generalize to an unseen dataset. Many data analysts employ k-fold cross-validation by employing k=5 or k=10, these numbers have demonstrated an ideal balance between minimising bias and variance.

set.seed(174)
random.f_cv <- vfold_cv(ames_train,v=5)

The following object simple_rec is a recipe/sequence of instructions that can been applied to datasets. The specified operations will be applied to the train and test sets that are used throughout this particular exploration of the random forest algorithm

simple_rec <-
recipe(SalePrice ~ .,
  data= ames_train)%>%
  step_dummy(all_nominal())%>%step_knnimpute(all_predictors(), neighbors=10)

In the code chunk specified below, the argument engine indicates what machine learning algorithm will be used to fit the model. Tidymodels is a wrapper for various machine learning packages. This is simply a specification, no data is being fed to the algorithm at this stage .


```r
ranger_model <- rand_forest(
    trees = tune(),
    mtry = tune()) %>%
  set_engine("ranger", importance = "permutation") %>% 
  set_mode("regression")

Here, I recruited a workflow: this is a container that integrates a preprocessing recipe and a model specification.

random.f_workflow <- workflow()%>% add_recipe(simple_rec)%>% 
add_model(ranger_model)

In the following code chunk, I specified a simple grid to be used in the quest for optimal parameters. I used the following grid numerous times with varying values. However, as I progressively experimented with varying values, I did not re-run the previously tried values in order to minimise computational cost. Thus, the following grid is a curtailed version of the values I have experimented with.

random.f_grid <- expand_grid(mtry = 33:36, trees = seq(1400, 1700, by = 100))

It was time to initialise the exploration of the specified hyperparameter space.

tidy_grid_results <- random.f_workflow %>% tune_grid(
    resamples = random.f_cv,
    grid = random.f_grid)

0.2.14 COLLECT METRICS OF RANGER MODEL

Here, I visualised the movement of RMSE throughout the search

autoplot(tidy_grid_results, metric = "rmse")

I extracted the parameters that generated the lowest RMSE

param_final <- tidy_grid_results %>%select_best(metric = "rmse")

It was time to update the workflow: I added the optimised parameters to the workflow using the finalize_workflow() function.

random.f_workflow <- random.f_workflow %>%
finalize_workflow(param_final)

It was time to finally fit the model

rf_fit <- random.f_workflow %>%
fit(ames_train)

I decided to find out which features demonstrated the most importance according to the random forest model

library(vip)
rf_fit %>% 
  pull_workflow_fit() %>% 
  vip::vip()

0.2.15 PERFORMANCE ON TEST SET:

It was time to test the fitted model on the validation/test set. The predict function` in the tidymodels framework applies the same preprocessing operations that were defined in the recipe and previously applied to the training sets (cross-validation splits). However, this time these operations are applied to the test set before the newly preprocessed data is passed to the predict function.

predict(rf_fit, new_data = ames_test)

## # A tibble: 584 x 1
##      .pred
##      <dbl>
##  1 123570.
##  2 176195.
##  3 175797.
##  4 179798.
##  5 152288.
##  6 122690.
##  7 356318.
##  8 236323.
##  9 313051.
## 10 164239.
## # ... with 574 more rows

Below, I plotted the predicted versus observed values produced by the random forest model

ranger_pred <- predict(rf_fit, new_data = ames_test %>% select(-SalePrice))

ranger_pred <- bind_cols(ranger_pred, ames_test %>% select(SalePrice))

ggplot(ranger_pred, aes(x = SalePrice, y = .pred)) + 
  # Create a diagonal line:
  geom_abline(lty = 2) + 
  geom_point(alpha = 0.5) + 
  labs(y = "Predicted Sale Price", x = "SalePrice") +
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()

0.2.16 TEST RESULTS:

I was now in a position to trial the constructed model on the test set.

ames_metrics <- metric_set(yardstick::rmse, yardstick:: mae)

ames_metrics(ranger_pred, truth = SalePrice, estimate = .pred)

## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      25416.
## 2 mae     standard      15981.

0.3 XGBOOST MODELLING:

It was time to construct a model using the extreme gradient boosting algorithm.

Again, I started by dividing the dataset into train/test sets

# divide dataset, stratify by SalePrice
set.seed(174)
ames_split.2 <-initial_split(ames.parsimonious6, prop = 0.75, strata = SalePrice)
ames_train2 <- training(ames_split.2)
ames_test2  <- testing(ames_split.2)

Again, I called a recipe to perform some final preprocessing.

XGB_rec <-
recipe(SalePrice ~ .,
  data= ames_train2)%>%
  step_dummy(all_nominal())%>%step_knnimpute(all_predictors(), neighbors=10)%>%prep()

Here, I applied cross-validation to randomly divide the training data into more training and test subsets.

set.seed(174)
cv_folds <-recipes::bake(
    XGB_rec, 
    new_data = ames_train2)%>%  
  rsample::vfold_cv(v = 5)

In the following code block, I transformed the train/tests using the recipe (XGB_rec). I did not include this step when working with the previous random forest model. This step would not have been necessary if I had not gone on to construct an explainer with modelStudio (featured at the end of this document).

train.ready<-juice(XGB_rec)
test.ready<-bake(XGB_rec, new_data = ames_test2)

Here, I defined the XGBoost model specification, alongside the hyperparameters to be tuned

Model.XGB <- 
  boost_tree(
    mode = "regression",
    trees = 1000,
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune()
  ) %>%set_engine("xgboost", objective = "reg:squarederror")

I employed the tidymodels dials package to specify the parameters.

# grid specification 
XGB.aspects <- 
  dials::parameters(
    min_n(),
    tree_depth(),
    learn_rate(),
    loss_reduction())

At this stage, I set up the grid space of exploration by recruiting the dials::grid_max_entropy() grid function which covers the specified hyperparameter space.

xgboost_grid <- 
dials::grid_max_entropy(
XGB.aspects, size = 200)
knitr::kable(head(xgboost_grid))

min_n	tree_depth	learn_rate	loss_reduction
40	7	0.0000001	0.0077783
37	4	0.0000537	0.0009884
10	13	0.0000000	0.0000000
17	3	0.0000046	17.8336015
37	3	0.0037961	0.0000094
16	7	0.0013587	0.0000000

At this stage, I recruited the tidymodels workflows package in order to apply a basic formula to the XGBoost model specified above.

xgboost_wf <- 
workflows::workflow() %>%
add_model(Model.XGB) %>% 
add_formula(SalePrice ~ .)

I used the following tune_grid() to specify the search space for optimal hyperparameters

# hyperparameter tuning
TUNE.XGB <- tune::tune_grid(
  object = xgboost_wf,
  resamples = cv_folds,
  grid = xgboost_grid,
  metrics = yardstick::metric_set(yardstick::rmse, yardstick::rsq, yardstick::rsq_trad, yardstick::mae),
  control = tune::control_grid(verbose = FALSE))

Here, I extracted the parameters that yielded the minimal “rmse”

param_final <- TUNE.XGB %>%select_best(metric = "rmse")

I finalized the XGBoost model to use the tuning parameters that produced the lowest RMSE.

xgboost_wf2 <- xgboost_wf%>%
finalize_workflow(param_final)

It was time to fit the final extreme gradient boosting model

XGB.model <- xgboost_wf2 %>%
fit(train.ready)

Here, I extracted influential variables

library(vip)
XGB.model %>% 
  pull_workflow_fit() %>% 
  vip::vip()

XGB PREDICT ON TEST SET:

I was now ready to make predictions on the unseen test set. Thereafter, I wanted to inspect the consistency of the predictions produced by the XGB.model:

XGB_res <- predict(XGB.model, new_data = test.ready %>% select(-SalePrice))

XGB_res <- bind_cols(XGB_res, test.ready %>% select(SalePrice))

XGB_metrics <- metric_set(yardstick::rmse, yardstick:: mae)

XGB_metrics(XGB_res, truth = SalePrice, estimate = .pred)

## # A tibble: 2 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      23724.
## 2 mae     standard      15045.

ggplot(XGB_res, aes(x = SalePrice, y = .pred)) + 
    # Create a diagonal line:
    geom_abline(lty = 2) + 
    geom_point(alpha = 0.5) + 
    labs(y = "Predicted Sale Price", x = "SalePrice") +
    # Scale and size the x- and y-axis uniformly:
    coord_obs_pred()

0.3.1 EXPLAINER:

Finally, I constructed an explainer using the fitted extreme gradient boosting model. The explainer provides detailed descriptions of what direction particular features are likely to be impacting the prices of particular houses. In the following interactive display of boxes, explanations for the first two houses in the test set are offered . For a thorough explanation of the types of plots used below, I would recommend the curious reader to consult this excellent resource.

explainer<- explain_tidymodels(XGB.model,
                                data = test.ready,
                                y = test.ready$SalePrice,
                                label = "tidymodels")

## Preparation of a new explainer is initiated
##   -> model label       :  tidymodels 
##   -> data              :  731  rows  62  cols 
##   -> data              :  tibble converted into a data.frame 
##   -> target variable   :  731  values 
##   -> predict function  :  yhat.workflow  will be used ( [33m default [39m )
##   -> predicted values  :  No value for predict function target column. ( [33m default [39m )
##   -> model_info        :  package tidymodels , ver. 0.1.2 , task regression ( [33m default [39m ) 
##   -> predicted values  :  numerical, min =  56540.02 , mean =  177666.1 , max =  524786.9  
##   -> residual function :  difference between y and yhat ( [33m default [39m )
##   -> residuals         :  numerical, min =  -193005 , mean =  1308.632 , max =  168033.1  
##  [32m A new explainer has been created! [39m

new_obs <- test.ready[1:2,]
rownames(new_obs) <- c("case 1", "case 2")
modelStudio(explainer, new_obs)