The Kaggle House Prices: Advanced Regression Techniques competition provides a dataset containing final sale prices and 79 explanatory variables describing aspects of residential homes in Ames, Iowa. This dataset is utilized to apply the learnings and techniques taught in the Fundamentals of Computational Math class.
For deeper investigation, the final sale price (in dollars) is defined as the dependent variable \(Y\) and the above grade living area (in square feet) is defined as the independent variable \(X\). Both of these variables are skewed to the right, as illustrated in the Statistics section below.
After setting the values \(x\) and \(y\) are set to the 3rd quartile of \(X\) and 2nd quartile of \(Y\), respectively, the below probabilities are investigated.
\[\begin{align} P(X>x | Y>y) &= \frac{P(X > x \ \cap \ Y > y)}{P(Y > y)} \\ &= \frac{\frac{315}{1460}} {\frac{728}{1460}} = 0.4327 \end{align}\]
Given that a house’s sale price is above the median price, there is a 43.27% chance that its living area is above the 75th percentile.
\[\begin{align} P(X>x, Y>y) &= P(X > x \ \cap \ Y > y) \\ &= \frac{315}{1460} = 0.2158 \end{align}\]
Among all houses, there is a 21.58% chance that the living area is above the 75th percentile and the sale price is above the median.
\[\begin{align} P(X<x | Y>y) &= \frac{P(X < x \ \cap \ Y > y)}{P(Y > y)} \\ &= \frac{\frac{413}{1460}} {\frac{728}{1460}} = 0.5673 \end{align}\]
Given that a house’s sale price is above the median price, there is a 56.73% chance that its living area is below the 75th percentile.
| Â | Y <= y | Y > y | Total |
|---|---|---|---|
| X <= x | 682 | 413 | 1095 |
| X > x | 50 | 315 | 365 |
| Total | 732 | 728 | 1460 |
Splitting the variables by quantile such that \(A\) is the subset of \(X\) where \(X > x\) and \(B\) is the subet of \(Y\) where \(Y > y\), these variables can be tested for independence.
The value of \(P(A|B)\) is equivalent to \(P(X>x | Y>y)\), which was shown above to be 0.4327. To test for independence, this is compared to the product of the probabilities of \(A\) and \(B\): \[\begin{align} P(A) \times P(B) &= P(X>x) \times P(Y>y) \\ &= \frac{365}{1460} \times \frac{728}{1460} = 0.1247 \end{align}\]
Since \(P(A|B) \neq P(A)P(B)\), the variables \(A\) and \(B\) are not independent.
This can be verified further using the chisq.test function to run a \(\chi^2\) test for the variables \(X\) and \(Y\).
The p-value of this \(\chi^2\) test with 1459 degrees of freedom is essentially zero, so the null hypothesis that the variables are independent is rejected.
Summary statistics for \(X\) and \(Y\) are provided in the table below:
| Â | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Std. Dev. |
|---|---|---|---|---|---|---|---|
| X | 334 | 1130 | 1464 | 1515 | 1777 | 5642 | 525 |
| Y | 34900 | 130000 | 163000 | 180900 | 214000 | 755000 | 79443 |
The plots below show the univariate and joint distributions of the two variables
Using the t.test function, a 95% confidence interval for the difference in the means of \(X\) and \(Y\) is given by [-183,465, -175,346.4]. The p-value associated with this hypothesis test is essentially zero, so the null hypothesis that there is no difference between the means is rejected at any significance level.
A corrleation matrix for the two variables is presented below:
| Â | X | Y |
|---|---|---|
| X | 1 | 0.7086 |
| Y | 0.7086 | 1 |
To test the hypothesis that there is no correlation between \(X\) and \(Y\), the cor.test function is used. Using this function, a 99% confidence interval for the correlation between the variables is given by [0.6733974, 0.7406408]. The p-value associated with this hypothesis test is near-zero, so the null hypothesis that there is no correlation between the variables is rejected.
This means that it can be said with greater than 99.99% confidence that living area and sale price of homes are correlated, and it can be said with 99% confidence that the value of this correlation lies between 0.6734 and 0.7406.
As shown above, the correlation matrix is given by \[C = \left[ \begin{array}{cc} 1 & 0.7086245 \\0.7086245 & 1 \end{array} \right]\]
To get the precision matrix, this matrix is inverted using the solve function:
This gives a precision matrix of \[P = C^{-1} = \left[ \begin{array}{cc} 2.0086317 & -1.4233656 \\-1.4233656 & 2.0086317 \end{array} \right]\]
Multiplying the correlation and matrices together, in either order, gives the identity matrix: \[C P = \left[ \begin{array}{cc}1 & 0 \\0 & 1\end{array} \right] = P C\]
In order to conduct principal components analysis, numeric explanatory values are considered. The LotFrontage variable contains a number of missing variables, but no values of zero – these are imputed with 0, as these properties may not have any street frontage (such as condominiums). A simlar approach is taken for the MasVnrArea variable.
The GarageYrBlt variable also has a number of missing values – these, similarly, likely represent homes with no garages. These values are filled based on the YearBuilt variable and the median difference between YearBuilt and GarageYrBlt (which happens to be zero).
As shown in the chart above, the first principal component explains roughly 20% of the total variance. The first seven components explain 50% of the variance, with 95% of the variance explained by the first 25 components.
Using the fitdistr function from the MASS package, an exponential distribution is fit to the living area variable \(X\). No transformation of the variable is needed, as all values of \(X\) are positive. This fitting results in an optimal parameter of \(\lambda = 6.599\times 10^{-4}\)
1000 samples from an exponential distribution with \(\lambda = 6.599\times 10^{-4}\) are simulated using the rexp function. This simulation produces the following distribution, which is contrasted to the distribution of the observed \(X\) data:
The difference in these distributions is apparent – the simulated data have the highest frequency close to zero, while the observed data seem to show a roughly normal (albeit strongly right-skewed) distribution centered around 1500. These differences are further highlighted by the differences in the percentiles of the distribution and the data – the 5th and 95th percentiles are presented in the table below.
| Â | 5% | 95% |
|---|---|---|
| Simulated | 77.73 | 4540 |
| Observed | 848 | 2466 |
Since the observed \(X\) data appears roughly normal, the mean \(\bar{X} = 1515.5\), standard deviation \(s_X = 525.5\) and size \(n = 1460\) of the sample are used to generate a 95% confidence interval for the mean of all homes, given by \([1488.51, 1542.42]\).
In order to predict the sale price of houses using information about the characteristics of a house, two regression models are constructed and their results entered into the Kaggle competition.
The first model considers only the numeric variables examined in previous section. The leaps package is utilized to determine the minimum Bayesian Information Criterion (BIC) to select the best model – this occurs with 15 predictors, as shown below.
# use numeric variables from PCA; include target variable
model_data <- cbind(pca_data, SalePrice = train$SalePrice)
# use regsubsets to get best subsets per number of variables
library(leaps)
regfit <- regsubsets(SalePrice~., data = model_data, nvmax = 20)Reordering variables and trying again:
The parameters of this model are presented below:
| Â | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| LotArea | 0.6162 | 0.1077 | 5.721 | 1.286e-08 |
| OverallQual | 24431 | 1123 | 21.75 | 5.986e-91 |
| OverallCond | 5557 | 1007 | 5.517 | 4.085e-08 |
| YearBuilt | 273.4 | 71.76 | 3.81 | 0.0001446 |
| MasVnrArea | 39.46 | 6.257 | 6.307 | 3.779e-10 |
| BsmtFinSF1 | 19.94 | 2.592 | 7.692 | 2.657e-14 |
| X1stFlrSF | 26.85 | 3.475 | 7.726 | 2.06e-14 |
| BsmtHalfBath | -4250 | 4211 | -1.009 | 0.313 |
| FullBath | 6379 | 2583 | 2.47 | 0.01363 |
| TotRmsAbvGrd | 8901 | 828.8 | 10.74 | 6.092e-26 |
| Fireplaces | 8832 | 1820 | 4.853 | 1.346e-06 |
| GarageYrBlt | 203.1 | 72.73 | 2.792 | 0.005306 |
| WoodDeckSF | 32.95 | 8.486 | 3.883 | 0.000108 |
| EnclosedPorch | 25.93 | 17.95 | 1.445 | 0.1488 |
| MiscVal | -1.491 | 2.008 | -0.7423 | 0.458 |
| (Intercept) | -1066852 | 113116 | -9.431 | 1.553e-20 |
(Dispersion parameter for gaussian family taken to be 1432794491 )
| Null deviance: | 9.208e+12 on 1459 degrees of freedom |
| Residual deviance: | 2.069e+12 on 1444 degrees of freedom |
The second model considers all variables without missing values. Beginning with a full regression of these variables, a backwards stepwise regression is run, with the optimal model selected based on the Akaike Information Criterion (AIC) to select the best model. Since this model includes categorical values and associated dummy values, the number of predictors is far higher – the final model includes 132 variables.
The parameters of this model are presented below:
# select variables with no NA values
nona <- train[, which(sapply(train, function(x) sum(is.na(x))) == 0)]
# prepare stepwise model using all non-NA variables
stepback <- step(lm(SalePrice ~ ., nona), direction = 'backward', trace = FALSE)| Â | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| MSZoningFV | 34396 | 11791 | 2.917 | 0.003591 |
| MSZoningRH | 26476 | 11791 | 2.245 | 0.02491 |
| MSZoningRL | 30218 | 10034 | 3.011 | 0.002649 |
| MSZoningRM | 28287 | 9402 | 3.009 | 0.002672 |
| LotArea | 0.6974 | 0.09838 | 7.089 | 2.188e-12 |
| StreetPave | 38039 | 11831 | 3.215 | 0.001335 |
| LandContourHLS | 15534 | 5063 | 3.068 | 0.0022 |
| LandContourLow | -3677 | 6157 | -0.5972 | 0.5505 |
| LandContourLvl | 7725 | 3591 | 2.151 | 0.03165 |
| UtilitiesNoSeWa | -43196 | 25395 | -1.701 | 0.08918 |
| LotConfigCulDSac | 7390 | 3149 | 2.347 | 0.01907 |
| LotConfigFR2 | -4582 | 4007 | -1.143 | 0.2531 |
| LotConfigFR3 | -11788 | 12885 | -0.9149 | 0.3604 |
| LotConfigInside | -1115 | 1743 | -0.6395 | 0.5226 |
| LandSlopeMod | 10063 | 3876 | 2.596 | 0.00953 |
| LandSlopeSev | -30530 | 10504 | -2.907 | 0.003715 |
| NeighborhoodBlueste | -5758 | 19052 | -0.3022 | 0.7625 |
| NeighborhoodBrDale | -2318 | 10827 | -0.2141 | 0.8305 |
| NeighborhoodBrkSide | -5364 | 9149 | -0.5863 | 0.5578 |
| NeighborhoodClearCr | -13976 | 9174 | -1.524 | 0.1279 |
| NeighborhoodCollgCr | -12091 | 7202 | -1.679 | 0.09343 |
| NeighborhoodCrawfor | 3727 | 8424 | 0.4424 | 0.6582 |
| NeighborhoodEdwards | -18723 | 7913 | -2.366 | 0.01812 |
| NeighborhoodGilbert | -16003 | 7568 | -2.115 | 0.03465 |
| NeighborhoodIDOTRR | -9879 | 10464 | -0.9441 | 0.3453 |
| NeighborhoodMeadowV | -6090 | 11002 | -0.5535 | 0.58 |
| NeighborhoodMitchel | -21725 | 8069 | -2.692 | 0.007185 |
| NeighborhoodNAmes | -18990 | 7748 | -2.451 | 0.01438 |
| NeighborhoodNoRidge | 24498 | 8318 | 2.945 | 0.003284 |
| NeighborhoodNPkVill | 7451 | 10933 | 0.6815 | 0.4957 |
| NeighborhoodNridgHt | 19007 | 7262 | 2.617 | 0.008962 |
| NeighborhoodNWAmes | -26813 | 7982 | -3.359 | 0.0008033 |
| NeighborhoodOldTown | -17464 | 9408 | -1.856 | 0.06363 |
| NeighborhoodSawyer | -13964 | 8071 | -1.73 | 0.08383 |
| NeighborhoodSawyerW | -8871 | 7722 | -1.149 | 0.2508 |
| NeighborhoodSomerst | -4323 | 8828 | -0.4897 | 0.6245 |
| NeighborhoodStoneBr | 34632 | 8151 | 4.249 | 2.302e-05 |
| NeighborhoodSWISU | -10380 | 9457 | -1.098 | 0.2726 |
| NeighborhoodTimber | -8941 | 8159 | -1.096 | 0.2733 |
| NeighborhoodVeenker | -1366 | 10461 | -0.1306 | 0.8961 |
| Condition1Feedr | 3808 | 4868 | 0.7824 | 0.4341 |
| Condition1Norm | 11938 | 4022 | 2.968 | 0.003049 |
| Condition1PosA | 7539 | 9942 | 0.7583 | 0.4484 |
| Condition1PosN | 8063 | 7333 | 1.099 | 0.2718 |
| Condition1RRAe | -15681 | 8735 | -1.795 | 0.07284 |
| Condition1RRAn | 7633 | 6763 | 1.129 | 0.2592 |
| Condition1RRNe | -7162 | 17920 | -0.3996 | 0.6895 |
| Condition1RRNn | 7281 | 12603 | 0.5777 | 0.5636 |
| Condition2Feedr | -9495 | 22010 | -0.4314 | 0.6662 |
| Condition2Norm | -7157 | 18716 | -0.3824 | 0.7022 |
| Condition2PosA | 24013 | 31619 | 0.7594 | 0.4477 |
| Condition2PosN | -237998 | 26816 | -8.875 | 2.209e-18 |
| Condition2RRAe | -118270 | 43569 | -2.715 | 0.006723 |
| Condition2RRAn | -9097 | 30839 | -0.295 | 0.7681 |
| Condition2RRNn | -8431 | 25978 | -0.3246 | 0.7456 |
| BldgType2fmCon | -5392 | 5612 | -0.9608 | 0.3368 |
| BldgTypeDuplex | -324.7 | 5822 | -0.05576 | 0.9555 |
| BldgTypeTwnhs | -29394 | 5369 | -5.474 | 5.25e-08 |
| BldgTypeTwnhsE | -24961 | 3692 | -6.76 | 2.059e-11 |
| OverallQual | 7386 | 969.3 | 7.619 | 4.829e-14 |
| OverallCond | 5600 | 788.2 | 7.105 | 1.956e-12 |
| YearBuilt | 342.2 | 64.42 | 5.313 | 1.266e-07 |
| YearRemodAdd | 121.2 | 52.47 | 2.31 | 0.02107 |
| RoofStyleGable | -1174 | 18288 | -0.06422 | 0.9488 |
| RoofStyleGambrel | 5620 | 19799 | 0.2839 | 0.7766 |
| RoofStyleHip | 142.7 | 18358 | 0.007775 | 0.9938 |
| RoofStyleMansard | 16065 | 21044 | 0.7634 | 0.4454 |
| RoofStyleShed | 90844 | 34791 | 2.611 | 0.009126 |
| RoofMatlCompShg | 670691 | 30697 | 21.85 | 1.199e-90 |
| RoofMatlMembran | 759771 | 45562 | 16.68 | 7.92e-57 |
| RoofMatlMetal | 726075 | 45064 | 16.11 | 1.766e-53 |
| RoofMatlRoll | 665150 | 39567 | 16.81 | 1.214e-57 |
| RoofMatlTar&Grv | 672100 | 35731 | 18.81 | 3.71e-70 |
| RoofMatlWdShake | 653044 | 34237 | 19.07 | 7.124e-72 |
| RoofMatlWdShngl | 750671 | 31760 | 23.64 | 2.305e-103 |
| Exterior1stAsphShn | -3109 | 25590 | -0.1215 | 0.9033 |
| Exterior1stBrkComm | -4618 | 19395 | -0.2381 | 0.8118 |
| Exterior1stBrkFace | 13392 | 6935 | 1.931 | 0.05368 |
| Exterior1stCBlock | -25488 | 26833 | -0.9499 | 0.3423 |
| Exterior1stCemntBd | 61.36 | 7314 | 0.00839 | 0.9933 |
| Exterior1stHdBoard | -5745 | 6273 | -0.9159 | 0.3599 |
| Exterior1stImStucc | -28407 | 25201 | -1.127 | 0.2598 |
| Exterior1stMetalSd | 416.5 | 6113 | 0.06813 | 0.9457 |
| Exterior1stPlywood | -9249 | 6630 | -1.395 | 0.1632 |
| Exterior1stStone | -21446 | 19990 | -1.073 | 0.2835 |
| Exterior1stStucco | 59.44 | 7680 | 0.00774 | 0.9938 |
| Exterior1stVinylSd | -1409 | 6168 | -0.2284 | 0.8194 |
| Exterior1stWd Sdng | -1592 | 6080 | -0.2618 | 0.7935 |
| Exterior1stWdShing | -795.6 | 7679 | -0.1036 | 0.9175 |
| MasVnrArea | 18.26 | 4.705 | 3.882 | 0.0001088 |
| ExterQualFa | -5766 | 10153 | -0.5679 | 0.5702 |
| ExterQualGd | -29201 | 4749 | -6.149 | 1.031e-09 |
| ExterQualTA | -30176 | 5289 | -5.705 | 1.431e-08 |
| FoundationCBlock | 795.8 | 3037 | 0.2621 | 0.7933 |
| FoundationPConc | 4463 | 3366 | 1.326 | 0.185 |
| FoundationSlab | 10100 | 7159 | 1.411 | 0.1585 |
| FoundationStone | 3039 | 10561 | 0.2878 | 0.7736 |
| FoundationWood | -27078 | 14600 | -1.855 | 0.06387 |
| BsmtFinSF1 | 36.24 | 3.858 | 9.392 | 2.471e-20 |
| BsmtFinSF2 | 25.03 | 5.399 | 4.636 | 3.904e-06 |
| BsmtUnfSF | 14.02 | 3.806 | 3.683 | 0.0002396 |
| X1stFlrSF | 57.36 | 4.952 | 11.58 | 1.275e-29 |
| X2ndFlrSF | 54.7 | 3.388 | 16.15 | 1.129e-53 |
| FullBath | 2907 | 1983 | 1.466 | 0.1428 |
| BedroomAbvGr | -5963 | 1319 | -4.522 | 6.68e-06 |
| KitchenAbvGr | -18045 | 5308 | -3.4 | 0.000695 |
| KitchenQualFa | -21622 | 5963 | -3.626 | 0.0002988 |
| KitchenQualGd | -28688 | 3412 | -8.409 | 1.058e-16 |
| KitchenQualTA | -26886 | 3878 | -6.932 | 6.451e-12 |
| TotRmsAbvGrd | 1326 | 927.5 | 1.43 | 0.1529 |
| FunctionalMaj2 | -258.6 | 13104 | -0.01973 | 0.9843 |
| FunctionalMin1 | 3639 | 8304 | 0.4383 | 0.6613 |
| FunctionalMin2 | 8806 | 8153 | 1.08 | 0.2803 |
| FunctionalMod | -3456 | 9769 | -0.3538 | 0.7235 |
| FunctionalSev | -73267 | 26888 | -2.725 | 0.006516 |
| FunctionalTyp | 18628 | 7105 | 2.622 | 0.008842 |
| Fireplaces | 2383 | 1333 | 1.788 | 0.07393 |
| GarageCars | 3278 | 2138 | 1.533 | 0.1255 |
| GarageArea | 15.65 | 7.3 | 2.144 | 0.03221 |
| WoodDeckSF | 11.53 | 5.787 | 1.992 | 0.04661 |
| OpenPorchSF | 16.16 | 11.3 | 1.429 | 0.1532 |
| ScreenPorch | 31.2 | 12.15 | 2.569 | 0.01031 |
| PoolArea | 88.3 | 17.35 | 5.089 | 4.117e-07 |
| MoSold | -581.4 | 245.6 | -2.367 | 0.01807 |
| SaleTypeCon | 38440 | 18103 | 2.123 | 0.0339 |
| SaleTypeConLD | 17942 | 9487 | 1.891 | 0.05882 |
| SaleTypeConLI | 8824 | 11722 | 0.7528 | 0.4517 |
| SaleTypeConLw | 2918 | 11906 | 0.2451 | 0.8064 |
| SaleTypeCWD | 22334 | 12942 | 1.726 | 0.08463 |
| SaleTypeNew | 21167 | 4876 | 4.341 | 1.525e-05 |
| SaleTypeOth | 14982 | 14714 | 1.018 | 0.3087 |
| SaleTypeWD | 2878 | 4026 | 0.7148 | 0.4749 |
| (Intercept) | -1625001 | 158993 | -10.22 | 1.175e-23 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 1460 | 23852 | 0.918 | 0.9099 |
Both models are used to predict the sale price of homes in the test data set provided in the competition. This data set is read in, and the same imputation applied to the training data set is applied to it prior to prediction.
Predictions of NA are imputed with the median of the sale price in the training data set. The predictions generated each contain some negative numbers; as a result, all values are shifted so that the minimum value predicted is zero.
The distribution of predictions from the two models, as well as a scatterplot showing the relationship between predictions by both models, are presented below:
The results of the two models’ predictions are output to csv files and submitted to the Kaggle competition under the username dsmilo. The scores of these results are posted below: