DATA 605 Final Exam: House Prices

Background

The Kaggle House Prices: Advanced Regression Techniques competition provides a dataset containing final sale prices and 79 explanatory variables describing aspects of residential homes in Ames, Iowa. This dataset is utilized to apply the learnings and techniques taught in the Fundamentals of Computational Math class.

For deeper investigation, the final sale price (in dollars) is defined as the dependent variable \(Y\) and the above grade living area (in square feet) is defined as the independent variable \(X\). Both of these variables are skewed to the right, as illustrated in the Statistics section below.

Probability

After setting the values \(x\) and \(y\) are set to the 3rd quartile of \(X\) and 2nd quartile of \(Y\), respectively, the below probabilities are investigated.

\[\begin{align} P(X>x | Y>y) &= \frac{P(X > x \ \cap \ Y > y)}{P(Y > y)} \\ &= \frac{\frac{315}{1460}} {\frac{728}{1460}} = 0.4327 \end{align}\]

Given that a house’s sale price is above the median price, there is a 43.27% chance that its living area is above the 75th percentile.

\[\begin{align} P(X>x, Y>y) &= P(X > x \ \cap \ Y > y) \\ &= \frac{315}{1460} = 0.2158 \end{align}\]

Among all houses, there is a 21.58% chance that the living area is above the 75th percentile and the sale price is above the median.

\[\begin{align} P(X<x | Y>y) &= \frac{P(X < x \ \cap \ Y > y)}{P(Y > y)} \\ &= \frac{\frac{413}{1460}} {\frac{728}{1460}} = 0.5673 \end{align}\]

Given that a house’s sale price is above the median price, there is a 56.73% chance that its living area is below the 75th percentile.

	Y <= y	Y > y	Total
X <= x	682	413	1095
X > x	50	315	365
Total	732	728	1460

Independence

Splitting the variables by quantile such that \(A\) is the subset of \(X\) where \(X > x\) and \(B\) is the subet of \(Y\) where \(Y > y\), these variables can be tested for independence.

The value of \(P(A|B)\) is equivalent to \(P(X>x | Y>y)\), which was shown above to be 0.4327. To test for independence, this is compared to the product of the probabilities of \(A\) and \(B\): \[\begin{align} P(A) \times P(B) &= P(X>x) \times P(Y>y) \\ &= \frac{365}{1460} \times \frac{728}{1460} = 0.1247 \end{align}\]

Since \(P(A|B) \neq P(A)P(B)\), the variables \(A\) and \(B\) are not independent.

This can be verified further using the chisq.test function to run a \(\chi^2\) test for the variables \(X\) and \(Y\).

The p-value of this \(\chi^2\) test with 1459 degrees of freedom is essentially zero, so the null hypothesis that the variables are independent is rejected.

Descriptive and Inferential Statistics

Summary statistics for \(X\) and \(Y\) are provided in the table below:

	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.	Std. Dev.
X	334	1130	1464	1515	1777	5642	525
Y	34900	130000	163000	180900	214000	755000	79443

The plots below show the univariate and joint distributions of the two variables

Using the t.test function, a 95% confidence interval for the difference in the means of \(X\) and \(Y\) is given by [-183,465, -175,346.4]. The p-value associated with this hypothesis test is essentially zero, so the null hypothesis that there is no difference between the means is rejected at any significance level.

A corrleation matrix for the two variables is presented below:

	X	Y
X	1	0.7086
Y	0.7086	1

To test the hypothesis that there is no correlation between \(X\) and \(Y\), the cor.test function is used. Using this function, a 99% confidence interval for the correlation between the variables is given by [0.6733974, 0.7406408]. The p-value associated with this hypothesis test is near-zero, so the null hypothesis that there is no correlation between the variables is rejected.

This means that it can be said with greater than 99.99% confidence that living area and sale price of homes are correlated, and it can be said with 99% confidence that the value of this correlation lies between 0.6734 and 0.7406.

Linear Algebra and Correlation

As shown above, the correlation matrix is given by \[C = \left[ \begin{array}{cc} 1 & 0.7086245 \\0.7086245 & 1 \end{array} \right]\]

To get the precision matrix, this matrix is inverted using the solve function:

This gives a precision matrix of \[P = C^{-1} = \left[ \begin{array}{cc} 2.0086317 & -1.4233656 \\-1.4233656 & 2.0086317 \end{array} \right]\]

Multiplying the correlation and matrices together, in either order, gives the identity matrix: \[C P = \left[ \begin{array}{cc}1 & 0 \\0 & 1\end{array} \right] = P C\]

Principal Components Analysis

In order to conduct principal components analysis, numeric explanatory values are considered. The LotFrontage variable contains a number of missing variables, but no values of zero – these are imputed with 0, as these properties may not have any street frontage (such as condominiums). A simlar approach is taken for the MasVnrArea variable.

The GarageYrBlt variable also has a number of missing values – these, similarly, likely represent homes with no garages. These values are filled based on the YearBuilt variable and the median difference between YearBuilt and GarageYrBlt (which happens to be zero).

As shown in the chart above, the first principal component explains roughly 20% of the total variance. The first seven components explain 50% of the variance, with 95% of the variance explained by the first 25 components.

Calculus-Based Probability & Statistics

Using the fitdistr function from the MASS package, an exponential distribution is fit to the living area variable \(X\). No transformation of the variable is needed, as all values of \(X\) are positive. This fitting results in an optimal parameter of \(\lambda = 6.599\times 10^{-4}\)

1000 samples from an exponential distribution with \(\lambda = 6.599\times 10^{-4}\) are simulated using the rexp function. This simulation produces the following distribution, which is contrasted to the distribution of the observed \(X\) data:

The difference in these distributions is apparent – the simulated data have the highest frequency close to zero, while the observed data seem to show a roughly normal (albeit strongly right-skewed) distribution centered around 1500. These differences are further highlighted by the differences in the percentiles of the distribution and the data – the 5th and 95th percentiles are presented in the table below.

	5%	95%
Simulated	77.73	4540
Observed	848	2466

Since the observed \(X\) data appears roughly normal, the mean \(\bar{X} = 1515.5\), standard deviation \(s_X = 525.5\) and size \(n = 1460\) of the sample are used to generate a 95% confidence interval for the mean of all homes, given by \([1488.51, 1542.42]\).

Modeling

In order to predict the sale price of houses using information about the characteristics of a house, two regression models are constructed and their results entered into the Kaggle competition.

Best Subsets and BIC

The first model considers only the numeric variables examined in previous section. The leaps package is utilized to determine the minimum Bayesian Information Criterion (BIC) to select the best model – this occurs with 15 predictors, as shown below.

# use numeric variables from PCA; include target variable
model_data <- cbind(pca_data, SalePrice = train$SalePrice)
# use regsubsets to get best subsets per number of variables
library(leaps)
regfit <- regsubsets(SalePrice~., data = model_data, nvmax = 20)

Reordering variables and trying again:

The parameters of this model are presented below:

	Estimate	Std. Error	t value	Pr(>\|t\|)
LotArea	0.6162	0.1077	5.721	1.286e-08
OverallQual	24431	1123	21.75	5.986e-91
OverallCond	5557	1007	5.517	4.085e-08
YearBuilt	273.4	71.76	3.81	0.0001446
MasVnrArea	39.46	6.257	6.307	3.779e-10
BsmtFinSF1	19.94	2.592	7.692	2.657e-14
X1stFlrSF	26.85	3.475	7.726	2.06e-14
BsmtHalfBath	-4250	4211	-1.009	0.313
FullBath	6379	2583	2.47	0.01363
TotRmsAbvGrd	8901	828.8	10.74	6.092e-26
Fireplaces	8832	1820	4.853	1.346e-06
GarageYrBlt	203.1	72.73	2.792	0.005306
WoodDeckSF	32.95	8.486	3.883	0.000108
EnclosedPorch	25.93	17.95	1.445	0.1488
MiscVal	-1.491	2.008	-0.7423	0.458
(Intercept)	-1066852	113116	-9.431	1.553e-20

(Dispersion parameter for gaussian family taken to be 1432794491 )

Null deviance:	9.208e+12 on 1459 degrees of freedom
Residual deviance:	2.069e+12 on 1444 degrees of freedom

Stepwise Regression and AIC

The second model considers all variables without missing values. Beginning with a full regression of these variables, a backwards stepwise regression is run, with the optimal model selected based on the Akaike Information Criterion (AIC) to select the best model. Since this model includes categorical values and associated dummy values, the number of predictors is far higher – the final model includes 132 variables.

The parameters of this model are presented below:

# select variables with no NA values
nona <- train[, which(sapply(train, function(x) sum(is.na(x))) == 0)]
# prepare stepwise model using all non-NA variables
stepback <- step(lm(SalePrice ~ ., nona), direction = 'backward', trace = FALSE)

	Estimate	Std. Error	t value	Pr(>\|t\|)
MSZoningFV	34396	11791	2.917	0.003591
MSZoningRH	26476	11791	2.245	0.02491
MSZoningRL	30218	10034	3.011	0.002649
MSZoningRM	28287	9402	3.009	0.002672
LotArea	0.6974	0.09838	7.089	2.188e-12
StreetPave	38039	11831	3.215	0.001335
LandContourHLS	15534	5063	3.068	0.0022
LandContourLow	-3677	6157	-0.5972	0.5505
LandContourLvl	7725	3591	2.151	0.03165
UtilitiesNoSeWa	-43196	25395	-1.701	0.08918
LotConfigCulDSac	7390	3149	2.347	0.01907
LotConfigFR2	-4582	4007	-1.143	0.2531
LotConfigFR3	-11788	12885	-0.9149	0.3604
LotConfigInside	-1115	1743	-0.6395	0.5226
LandSlopeMod	10063	3876	2.596	0.00953
LandSlopeSev	-30530	10504	-2.907	0.003715
NeighborhoodBlueste	-5758	19052	-0.3022	0.7625
NeighborhoodBrDale	-2318	10827	-0.2141	0.8305
NeighborhoodBrkSide	-5364	9149	-0.5863	0.5578
NeighborhoodClearCr	-13976	9174	-1.524	0.1279
NeighborhoodCollgCr	-12091	7202	-1.679	0.09343
NeighborhoodCrawfor	3727	8424	0.4424	0.6582
NeighborhoodEdwards	-18723	7913	-2.366	0.01812
NeighborhoodGilbert	-16003	7568	-2.115	0.03465
NeighborhoodIDOTRR	-9879	10464	-0.9441	0.3453
NeighborhoodMeadowV	-6090	11002	-0.5535	0.58
NeighborhoodMitchel	-21725	8069	-2.692	0.007185
NeighborhoodNAmes	-18990	7748	-2.451	0.01438
NeighborhoodNoRidge	24498	8318	2.945	0.003284
NeighborhoodNPkVill	7451	10933	0.6815	0.4957
NeighborhoodNridgHt	19007	7262	2.617	0.008962
NeighborhoodNWAmes	-26813	7982	-3.359	0.0008033
NeighborhoodOldTown	-17464	9408	-1.856	0.06363
NeighborhoodSawyer	-13964	8071	-1.73	0.08383
NeighborhoodSawyerW	-8871	7722	-1.149	0.2508
NeighborhoodSomerst	-4323	8828	-0.4897	0.6245
NeighborhoodStoneBr	34632	8151	4.249	2.302e-05
NeighborhoodSWISU	-10380	9457	-1.098	0.2726
NeighborhoodTimber	-8941	8159	-1.096	0.2733
NeighborhoodVeenker	-1366	10461	-0.1306	0.8961
Condition1Feedr	3808	4868	0.7824	0.4341
Condition1Norm	11938	4022	2.968	0.003049
Condition1PosA	7539	9942	0.7583	0.4484
Condition1PosN	8063	7333	1.099	0.2718
Condition1RRAe	-15681	8735	-1.795	0.07284
Condition1RRAn	7633	6763	1.129	0.2592
Condition1RRNe	-7162	17920	-0.3996	0.6895
Condition1RRNn	7281	12603	0.5777	0.5636
Condition2Feedr	-9495	22010	-0.4314	0.6662
Condition2Norm	-7157	18716	-0.3824	0.7022
Condition2PosA	24013	31619	0.7594	0.4477
Condition2PosN	-237998	26816	-8.875	2.209e-18
Condition2RRAe	-118270	43569	-2.715	0.006723
Condition2RRAn	-9097	30839	-0.295	0.7681
Condition2RRNn	-8431	25978	-0.3246	0.7456
BldgType2fmCon	-5392	5612	-0.9608	0.3368
BldgTypeDuplex	-324.7	5822	-0.05576	0.9555
BldgTypeTwnhs	-29394	5369	-5.474	5.25e-08
BldgTypeTwnhsE	-24961	3692	-6.76	2.059e-11
OverallQual	7386	969.3	7.619	4.829e-14
OverallCond	5600	788.2	7.105	1.956e-12
YearBuilt	342.2	64.42	5.313	1.266e-07
YearRemodAdd	121.2	52.47	2.31	0.02107
RoofStyleGable	-1174	18288	-0.06422	0.9488
RoofStyleGambrel	5620	19799	0.2839	0.7766
RoofStyleHip	142.7	18358	0.007775	0.9938
RoofStyleMansard	16065	21044	0.7634	0.4454
RoofStyleShed	90844	34791	2.611	0.009126
RoofMatlCompShg	670691	30697	21.85	1.199e-90
RoofMatlMembran	759771	45562	16.68	7.92e-57
RoofMatlMetal	726075	45064	16.11	1.766e-53
RoofMatlRoll	665150	39567	16.81	1.214e-57
RoofMatlTar&Grv	672100	35731	18.81	3.71e-70
RoofMatlWdShake	653044	34237	19.07	7.124e-72
RoofMatlWdShngl	750671	31760	23.64	2.305e-103
Exterior1stAsphShn	-3109	25590	-0.1215	0.9033
Exterior1stBrkComm	-4618	19395	-0.2381	0.8118
Exterior1stBrkFace	13392	6935	1.931	0.05368
Exterior1stCBlock	-25488	26833	-0.9499	0.3423
Exterior1stCemntBd	61.36	7314	0.00839	0.9933
Exterior1stHdBoard	-5745	6273	-0.9159	0.3599
Exterior1stImStucc	-28407	25201	-1.127	0.2598
Exterior1stMetalSd	416.5	6113	0.06813	0.9457
Exterior1stPlywood	-9249	6630	-1.395	0.1632
Exterior1stStone	-21446	19990	-1.073	0.2835
Exterior1stStucco	59.44	7680	0.00774	0.9938
Exterior1stVinylSd	-1409	6168	-0.2284	0.8194
Exterior1stWd Sdng	-1592	6080	-0.2618	0.7935
Exterior1stWdShing	-795.6	7679	-0.1036	0.9175
MasVnrArea	18.26	4.705	3.882	0.0001088
ExterQualFa	-5766	10153	-0.5679	0.5702
ExterQualGd	-29201	4749	-6.149	1.031e-09
ExterQualTA	-30176	5289	-5.705	1.431e-08
FoundationCBlock	795.8	3037	0.2621	0.7933
FoundationPConc	4463	3366	1.326	0.185
FoundationSlab	10100	7159	1.411	0.1585
FoundationStone	3039	10561	0.2878	0.7736
FoundationWood	-27078	14600	-1.855	0.06387
BsmtFinSF1	36.24	3.858	9.392	2.471e-20
BsmtFinSF2	25.03	5.399	4.636	3.904e-06
BsmtUnfSF	14.02	3.806	3.683	0.0002396
X1stFlrSF	57.36	4.952	11.58	1.275e-29
X2ndFlrSF	54.7	3.388	16.15	1.129e-53
FullBath	2907	1983	1.466	0.1428
BedroomAbvGr	-5963	1319	-4.522	6.68e-06
KitchenAbvGr	-18045	5308	-3.4	0.000695
KitchenQualFa	-21622	5963	-3.626	0.0002988
KitchenQualGd	-28688	3412	-8.409	1.058e-16
KitchenQualTA	-26886	3878	-6.932	6.451e-12
TotRmsAbvGrd	1326	927.5	1.43	0.1529
FunctionalMaj2	-258.6	13104	-0.01973	0.9843
FunctionalMin1	3639	8304	0.4383	0.6613
FunctionalMin2	8806	8153	1.08	0.2803
FunctionalMod	-3456	9769	-0.3538	0.7235
FunctionalSev	-73267	26888	-2.725	0.006516
FunctionalTyp	18628	7105	2.622	0.008842
Fireplaces	2383	1333	1.788	0.07393
GarageCars	3278	2138	1.533	0.1255
GarageArea	15.65	7.3	2.144	0.03221
WoodDeckSF	11.53	5.787	1.992	0.04661
OpenPorchSF	16.16	11.3	1.429	0.1532
ScreenPorch	31.2	12.15	2.569	0.01031
PoolArea	88.3	17.35	5.089	4.117e-07
MoSold	-581.4	245.6	-2.367	0.01807
SaleTypeCon	38440	18103	2.123	0.0339
SaleTypeConLD	17942	9487	1.891	0.05882
SaleTypeConLI	8824	11722	0.7528	0.4517
SaleTypeConLw	2918	11906	0.2451	0.8064
SaleTypeCWD	22334	12942	1.726	0.08463
SaleTypeNew	21167	4876	4.341	1.525e-05
SaleTypeOth	14982	14714	1.018	0.3087
SaleTypeWD	2878	4026	0.7148	0.4749
(Intercept)	-1625001	158993	-10.22	1.175e-23

Fitting linear model: SalePrice ~ MSZoning + LotArea + Street + LandContour + Utilities + LotConfig + LandSlope + Neighborhood + Condition1 + Condition2 + BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd + RoofStyle + RoofMatl + Exterior1st + MasVnrArea + ExterQual + Foundation + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + FullBath + BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + Functional + Fireplaces + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea + MoSold + SaleType
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
1460	23852	0.918	0.9099

Prediction

Both models are used to predict the sale price of homes in the test data set provided in the competition. This data set is read in, and the same imputation applied to the training data set is applied to it prior to prediction.

Predictions of NA are imputed with the median of the sale price in the training data set. The predictions generated each contain some negative numbers; as a result, all values are shifted so that the minimum value predicted is zero.

The distribution of predictions from the two models, as well as a scatterplot showing the relationship between predictions by both models, are presented below:

Results

The results of the two models’ predictions are output to csv files and submitted to the Kaggle competition under the username dsmilo. The scores of these results are posted below: