Background

The Kaggle House Prices: Advanced Regression Techniques competition provides a dataset containing final sale prices and 79 explanatory variables describing aspects of residential homes in Ames, Iowa. This dataset is utilized to apply the learnings and techniques taught in the Fundamentals of Computational Math class.

For deeper investigation, the final sale price (in dollars) is defined as the dependent variable \(Y\) and the above grade living area (in square feet) is defined as the independent variable \(X\). Both of these variables are skewed to the right, as illustrated in the Statistics section below.

Probability

After setting the values \(x\) and \(y\) are set to the 3rd quartile of \(X\) and 2nd quartile of \(Y\), respectively, the below probabilities are investigated.

\[\begin{align} P(X>x | Y>y) &= \frac{P(X > x \ \cap \ Y > y)}{P(Y > y)} \\ &= \frac{\frac{315}{1460}} {\frac{728}{1460}} = 0.4327 \end{align}\]

Given that a house’s sale price is above the median price, there is a 43.27% chance that its living area is above the 75th percentile.

\[\begin{align} P(X>x, Y>y) &= P(X > x \ \cap \ Y > y) \\ &= \frac{315}{1460} = 0.2158 \end{align}\]

Among all houses, there is a 21.58% chance that the living area is above the 75th percentile and the sale price is above the median.

\[\begin{align} P(X<x | Y>y) &= \frac{P(X < x \ \cap \ Y > y)}{P(Y > y)} \\ &= \frac{\frac{413}{1460}} {\frac{728}{1460}} = 0.5673 \end{align}\]

Given that a house’s sale price is above the median price, there is a 56.73% chance that its living area is below the 75th percentile.

  Y <= y Y > y Total
X <= x 682 413 1095
X > x 50 315 365
Total 732 728 1460

Independence

Splitting the variables by quantile such that \(A\) is the subset of \(X\) where \(X > x\) and \(B\) is the subet of \(Y\) where \(Y > y\), these variables can be tested for independence.

The value of \(P(A|B)\) is equivalent to \(P(X>x | Y>y)\), which was shown above to be 0.4327. To test for independence, this is compared to the product of the probabilities of \(A\) and \(B\): \[\begin{align} P(A) \times P(B) &= P(X>x) \times P(Y>y) \\ &= \frac{365}{1460} \times \frac{728}{1460} = 0.1247 \end{align}\]

Since \(P(A|B) \neq P(A)P(B)\), the variables \(A\) and \(B\) are not independent.

This can be verified further using the chisq.test function to run a \(\chi^2\) test for the variables \(X\) and \(Y\).

The p-value of this \(\chi^2\) test with 1459 degrees of freedom is essentially zero, so the null hypothesis that the variables are independent is rejected.

Descriptive and Inferential Statistics

Summary statistics for \(X\) and \(Y\) are provided in the table below:

  Min. 1st Qu. Median Mean 3rd Qu. Max. Std. Dev.
X 334 1130 1464 1515 1777 5642 525
Y 34900 130000 163000 180900 214000 755000 79443

The plots below show the univariate and joint distributions of the two variables

Using the t.test function, a 95% confidence interval for the difference in the means of \(X\) and \(Y\) is given by [-183,465, -175,346.4]. The p-value associated with this hypothesis test is essentially zero, so the null hypothesis that there is no difference between the means is rejected at any significance level.

A corrleation matrix for the two variables is presented below:

  X Y
X 1 0.7086
Y 0.7086 1

To test the hypothesis that there is no correlation between \(X\) and \(Y\), the cor.test function is used. Using this function, a 99% confidence interval for the correlation between the variables is given by [0.6733974, 0.7406408]. The p-value associated with this hypothesis test is near-zero, so the null hypothesis that there is no correlation between the variables is rejected.

This means that it can be said with greater than 99.99% confidence that living area and sale price of homes are correlated, and it can be said with 99% confidence that the value of this correlation lies between 0.6734 and 0.7406.

Linear Algebra and Correlation

As shown above, the correlation matrix is given by \[C = \left[ \begin{array}{cc} 1 & 0.7086245 \\0.7086245 & 1 \end{array} \right]\]

To get the precision matrix, this matrix is inverted using the solve function:

This gives a precision matrix of \[P = C^{-1} = \left[ \begin{array}{cc} 2.0086317 & -1.4233656 \\-1.4233656 & 2.0086317 \end{array} \right]\]

Multiplying the correlation and matrices together, in either order, gives the identity matrix: \[C P = \left[ \begin{array}{cc}1 & 0 \\0 & 1\end{array} \right] = P C\]

Principal Components Analysis

In order to conduct principal components analysis, numeric explanatory values are considered. The LotFrontage variable contains a number of missing variables, but no values of zero – these are imputed with 0, as these properties may not have any street frontage (such as condominiums). A simlar approach is taken for the MasVnrArea variable.

The GarageYrBlt variable also has a number of missing values – these, similarly, likely represent homes with no garages. These values are filled based on the YearBuilt variable and the median difference between YearBuilt and GarageYrBlt (which happens to be zero).

As shown in the chart above, the first principal component explains roughly 20% of the total variance. The first seven components explain 50% of the variance, with 95% of the variance explained by the first 25 components.

Calculus-Based Probability & Statistics

Using the fitdistr function from the MASS package, an exponential distribution is fit to the living area variable \(X\). No transformation of the variable is needed, as all values of \(X\) are positive. This fitting results in an optimal parameter of \(\lambda = 6.599\times 10^{-4}\)

1000 samples from an exponential distribution with \(\lambda = 6.599\times 10^{-4}\) are simulated using the rexp function. This simulation produces the following distribution, which is contrasted to the distribution of the observed \(X\) data:

The difference in these distributions is apparent – the simulated data have the highest frequency close to zero, while the observed data seem to show a roughly normal (albeit strongly right-skewed) distribution centered around 1500. These differences are further highlighted by the differences in the percentiles of the distribution and the data – the 5th and 95th percentiles are presented in the table below.

  5% 95%
Simulated 77.73 4540
Observed 848 2466

Since the observed \(X\) data appears roughly normal, the mean \(\bar{X} = 1515.5\), standard deviation \(s_X = 525.5\) and size \(n = 1460\) of the sample are used to generate a 95% confidence interval for the mean of all homes, given by \([1488.51, 1542.42]\).

Modeling

In order to predict the sale price of houses using information about the characteristics of a house, two regression models are constructed and their results entered into the Kaggle competition.

Best Subsets and BIC

The first model considers only the numeric variables examined in previous section. The leaps package is utilized to determine the minimum Bayesian Information Criterion (BIC) to select the best model – this occurs with 15 predictors, as shown below.

# use numeric variables from PCA; include target variable
model_data <- cbind(pca_data, SalePrice = train$SalePrice)
# use regsubsets to get best subsets per number of variables
library(leaps)
regfit <- regsubsets(SalePrice~., data = model_data, nvmax = 20)
Reordering variables and trying again:

The parameters of this model are presented below:

  Estimate Std. Error t value Pr(>|t|)
LotArea 0.6162 0.1077 5.721 1.286e-08
OverallQual 24431 1123 21.75 5.986e-91
OverallCond 5557 1007 5.517 4.085e-08
YearBuilt 273.4 71.76 3.81 0.0001446
MasVnrArea 39.46 6.257 6.307 3.779e-10
BsmtFinSF1 19.94 2.592 7.692 2.657e-14
X1stFlrSF 26.85 3.475 7.726 2.06e-14
BsmtHalfBath -4250 4211 -1.009 0.313
FullBath 6379 2583 2.47 0.01363
TotRmsAbvGrd 8901 828.8 10.74 6.092e-26
Fireplaces 8832 1820 4.853 1.346e-06
GarageYrBlt 203.1 72.73 2.792 0.005306
WoodDeckSF 32.95 8.486 3.883 0.000108
EnclosedPorch 25.93 17.95 1.445 0.1488
MiscVal -1.491 2.008 -0.7423 0.458
(Intercept) -1066852 113116 -9.431 1.553e-20

(Dispersion parameter for gaussian family taken to be 1432794491 )

Null deviance: 9.208e+12 on 1459 degrees of freedom
Residual deviance: 2.069e+12 on 1444 degrees of freedom

Stepwise Regression and AIC

The second model considers all variables without missing values. Beginning with a full regression of these variables, a backwards stepwise regression is run, with the optimal model selected based on the Akaike Information Criterion (AIC) to select the best model. Since this model includes categorical values and associated dummy values, the number of predictors is far higher – the final model includes 132 variables.

The parameters of this model are presented below:

# select variables with no NA values
nona <- train[, which(sapply(train, function(x) sum(is.na(x))) == 0)]
# prepare stepwise model using all non-NA variables
stepback <- step(lm(SalePrice ~ ., nona), direction = 'backward', trace = FALSE)
  Estimate Std. Error t value Pr(>|t|)
MSZoningFV 34396 11791 2.917 0.003591
MSZoningRH 26476 11791 2.245 0.02491
MSZoningRL 30218 10034 3.011 0.002649
MSZoningRM 28287 9402 3.009 0.002672
LotArea 0.6974 0.09838 7.089 2.188e-12
StreetPave 38039 11831 3.215 0.001335
LandContourHLS 15534 5063 3.068 0.0022
LandContourLow -3677 6157 -0.5972 0.5505
LandContourLvl 7725 3591 2.151 0.03165
UtilitiesNoSeWa -43196 25395 -1.701 0.08918
LotConfigCulDSac 7390 3149 2.347 0.01907
LotConfigFR2 -4582 4007 -1.143 0.2531
LotConfigFR3 -11788 12885 -0.9149 0.3604
LotConfigInside -1115 1743 -0.6395 0.5226
LandSlopeMod 10063 3876 2.596 0.00953
LandSlopeSev -30530 10504 -2.907 0.003715
NeighborhoodBlueste -5758 19052 -0.3022 0.7625
NeighborhoodBrDale -2318 10827 -0.2141 0.8305
NeighborhoodBrkSide -5364 9149 -0.5863 0.5578
NeighborhoodClearCr -13976 9174 -1.524 0.1279
NeighborhoodCollgCr -12091 7202 -1.679 0.09343
NeighborhoodCrawfor 3727 8424 0.4424 0.6582
NeighborhoodEdwards -18723 7913 -2.366 0.01812
NeighborhoodGilbert -16003 7568 -2.115 0.03465
NeighborhoodIDOTRR -9879 10464 -0.9441 0.3453
NeighborhoodMeadowV -6090 11002 -0.5535 0.58
NeighborhoodMitchel -21725 8069 -2.692 0.007185
NeighborhoodNAmes -18990 7748 -2.451 0.01438
NeighborhoodNoRidge 24498 8318 2.945 0.003284
NeighborhoodNPkVill 7451 10933 0.6815 0.4957
NeighborhoodNridgHt 19007 7262 2.617 0.008962
NeighborhoodNWAmes -26813 7982 -3.359 0.0008033
NeighborhoodOldTown -17464 9408 -1.856 0.06363
NeighborhoodSawyer -13964 8071 -1.73 0.08383
NeighborhoodSawyerW -8871 7722 -1.149 0.2508
NeighborhoodSomerst -4323 8828 -0.4897 0.6245
NeighborhoodStoneBr 34632 8151 4.249 2.302e-05
NeighborhoodSWISU -10380 9457 -1.098 0.2726
NeighborhoodTimber -8941 8159 -1.096 0.2733
NeighborhoodVeenker -1366 10461 -0.1306 0.8961
Condition1Feedr 3808 4868 0.7824 0.4341
Condition1Norm 11938 4022 2.968 0.003049
Condition1PosA 7539 9942 0.7583 0.4484
Condition1PosN 8063 7333 1.099 0.2718
Condition1RRAe -15681 8735 -1.795 0.07284
Condition1RRAn 7633 6763 1.129 0.2592
Condition1RRNe -7162 17920 -0.3996 0.6895
Condition1RRNn 7281 12603 0.5777 0.5636
Condition2Feedr -9495 22010 -0.4314 0.6662
Condition2Norm -7157 18716 -0.3824 0.7022
Condition2PosA 24013 31619 0.7594 0.4477
Condition2PosN -237998 26816 -8.875 2.209e-18
Condition2RRAe -118270 43569 -2.715 0.006723
Condition2RRAn -9097 30839 -0.295 0.7681
Condition2RRNn -8431 25978 -0.3246 0.7456
BldgType2fmCon -5392 5612 -0.9608 0.3368
BldgTypeDuplex -324.7 5822 -0.05576 0.9555
BldgTypeTwnhs -29394 5369 -5.474 5.25e-08
BldgTypeTwnhsE -24961 3692 -6.76 2.059e-11
OverallQual 7386 969.3 7.619 4.829e-14
OverallCond 5600 788.2 7.105 1.956e-12
YearBuilt 342.2 64.42 5.313 1.266e-07
YearRemodAdd 121.2 52.47 2.31 0.02107
RoofStyleGable -1174 18288 -0.06422 0.9488
RoofStyleGambrel 5620 19799 0.2839 0.7766
RoofStyleHip 142.7 18358 0.007775 0.9938
RoofStyleMansard 16065 21044 0.7634 0.4454
RoofStyleShed 90844 34791 2.611 0.009126
RoofMatlCompShg 670691 30697 21.85 1.199e-90
RoofMatlMembran 759771 45562 16.68 7.92e-57
RoofMatlMetal 726075 45064 16.11 1.766e-53
RoofMatlRoll 665150 39567 16.81 1.214e-57
RoofMatlTar&Grv 672100 35731 18.81 3.71e-70
RoofMatlWdShake 653044 34237 19.07 7.124e-72
RoofMatlWdShngl 750671 31760 23.64 2.305e-103
Exterior1stAsphShn -3109 25590 -0.1215 0.9033
Exterior1stBrkComm -4618 19395 -0.2381 0.8118
Exterior1stBrkFace 13392 6935 1.931 0.05368
Exterior1stCBlock -25488 26833 -0.9499 0.3423
Exterior1stCemntBd 61.36 7314 0.00839 0.9933
Exterior1stHdBoard -5745 6273 -0.9159 0.3599
Exterior1stImStucc -28407 25201 -1.127 0.2598
Exterior1stMetalSd 416.5 6113 0.06813 0.9457
Exterior1stPlywood -9249 6630 -1.395 0.1632
Exterior1stStone -21446 19990 -1.073 0.2835
Exterior1stStucco 59.44 7680 0.00774 0.9938
Exterior1stVinylSd -1409 6168 -0.2284 0.8194
Exterior1stWd Sdng -1592 6080 -0.2618 0.7935
Exterior1stWdShing -795.6 7679 -0.1036 0.9175
MasVnrArea 18.26 4.705 3.882 0.0001088
ExterQualFa -5766 10153 -0.5679 0.5702
ExterQualGd -29201 4749 -6.149 1.031e-09
ExterQualTA -30176 5289 -5.705 1.431e-08
FoundationCBlock 795.8 3037 0.2621 0.7933
FoundationPConc 4463 3366 1.326 0.185
FoundationSlab 10100 7159 1.411 0.1585
FoundationStone 3039 10561 0.2878 0.7736
FoundationWood -27078 14600 -1.855 0.06387
BsmtFinSF1 36.24 3.858 9.392 2.471e-20
BsmtFinSF2 25.03 5.399 4.636 3.904e-06
BsmtUnfSF 14.02 3.806 3.683 0.0002396
X1stFlrSF 57.36 4.952 11.58 1.275e-29
X2ndFlrSF 54.7 3.388 16.15 1.129e-53
FullBath 2907 1983 1.466 0.1428
BedroomAbvGr -5963 1319 -4.522 6.68e-06
KitchenAbvGr -18045 5308 -3.4 0.000695
KitchenQualFa -21622 5963 -3.626 0.0002988
KitchenQualGd -28688 3412 -8.409 1.058e-16
KitchenQualTA -26886 3878 -6.932 6.451e-12
TotRmsAbvGrd 1326 927.5 1.43 0.1529
FunctionalMaj2 -258.6 13104 -0.01973 0.9843
FunctionalMin1 3639 8304 0.4383 0.6613
FunctionalMin2 8806 8153 1.08 0.2803
FunctionalMod -3456 9769 -0.3538 0.7235
FunctionalSev -73267 26888 -2.725 0.006516
FunctionalTyp 18628 7105 2.622 0.008842
Fireplaces 2383 1333 1.788 0.07393
GarageCars 3278 2138 1.533 0.1255
GarageArea 15.65 7.3 2.144 0.03221
WoodDeckSF 11.53 5.787 1.992 0.04661
OpenPorchSF 16.16 11.3 1.429 0.1532
ScreenPorch 31.2 12.15 2.569 0.01031
PoolArea 88.3 17.35 5.089 4.117e-07
MoSold -581.4 245.6 -2.367 0.01807
SaleTypeCon 38440 18103 2.123 0.0339
SaleTypeConLD 17942 9487 1.891 0.05882
SaleTypeConLI 8824 11722 0.7528 0.4517
SaleTypeConLw 2918 11906 0.2451 0.8064
SaleTypeCWD 22334 12942 1.726 0.08463
SaleTypeNew 21167 4876 4.341 1.525e-05
SaleTypeOth 14982 14714 1.018 0.3087
SaleTypeWD 2878 4026 0.7148 0.4749
(Intercept) -1625001 158993 -10.22 1.175e-23
Fitting linear model: SalePrice ~ MSZoning + LotArea + Street + LandContour + Utilities + LotConfig + LandSlope + Neighborhood + Condition1 + Condition2 + BldgType + OverallQual + OverallCond + YearBuilt + YearRemodAdd + RoofStyle + RoofMatl + Exterior1st + MasVnrArea + ExterQual + Foundation + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + FullBath + BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + Functional + Fireplaces + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea + MoSold + SaleType
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
1460 23852 0.918 0.9099

Prediction

Both models are used to predict the sale price of homes in the test data set provided in the competition. This data set is read in, and the same imputation applied to the training data set is applied to it prior to prediction.

Predictions of NA are imputed with the median of the sale price in the training data set. The predictions generated each contain some negative numbers; as a result, all values are shifted so that the minimum value predicted is zero.

The distribution of predictions from the two models, as well as a scatterplot showing the relationship between predictions by both models, are presented below:

Results

The results of the two models’ predictions are output to csv files and submitted to the Kaggle competition under the username dsmilo. The scores of these results are posted below: