House Prices: Advanced Regression Techniques competition.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
library(dplyr)
library(knitr)
library(tidyr)
library(moments)
library(psych)
library(Matrix)
# Assume file train.csv has been locally downloaded.
df = read.csv("train.csv")
dim(df)
## [1] 1460 81
- Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!
- Pick the dependent variable and define it as Y.
# To pick an independent variable with right-skewness, we can examine histograms
# of a few variables.
par(mfrow=c(1,2))
# Overall Quality
hist(df$OverallQual, main = "OverallQual")
# Lot Area
hist(df$LotArea, main = "LotArea")
# LotArea is clearly right-skewed.
X = df$LotArea
# The target variable we are trying to predict is SalePrice, the
# property's sale price in dollars.
Y = df$SalePrice
# Show histogram of SalePrice (target).
hist(df$SalePrice, main = "SalePrice")
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
- P(X > x | Y > y)
- P(X > x, Y > y)
- P(X < x | Y > y)
# Print summaries of X (independent var.) and Y (target).
d2 = df %>% dplyr::select(LotArea, SalePrice)
summary(d2)
## LotArea SalePrice
## Min. : 1300 Min. : 34900
## 1st Qu.: 7554 1st Qu.:129975
## Median : 9478 Median :163000
## Mean : 10517 Mean :180921
## 3rd Qu.: 11602 3rd Qu.:214000
## Max. :215245 Max. :755000
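# Note: in older R versions summary() returns values rounded to 4 significant
# digits, so y_1q below is 130000 rather than 129975; quantile(Y, 0.25)
# would return the unrounded quartile.
x_1q = summary(X)["1st Qu."]
y_1q = summary(Y)["1st Qu."]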
cat("Lot_Area.1st_Quartile = ", x_1q, "; Sale_Price.1st_Quartile = ", y_1q, "\n")
## Lot_Area.1st_Quartile = 7554 ; Sale_Price.1st_Quartile = 130000
# Count the number of observations above the 1st quartile for X and Y.
cat("Number of observations above the 1st quartile for X =", sum(X > x_1q), "\n")
## Number of observations above the 1st quartile for X = 1095
cat("Number of observations above the 1st quartile for Y =", sum(Y > y_1q), "\n")
## Number of observations above the 1st quartile for Y = 1084
# Now calculate the required probabilities.
X1 = X > x_1q
Y1 = Y > y_1q
# a. P(X>x | Y>y)
d = sum(Y1) # denominator, instances where Y>y
n = sum(X[Y1] > x_1q) # numerator
p1 = n/d
cat("P(X>x | Y>y) =", n, "/", d, "=", p1, "\n")
## P(X>x | Y>y) = 893 / 1084 = 0.8238007
# b. P(X>x, Y>y)
d = length(Y) # denominator
n = sum((X > x_1q) & (Y > y_1q)) # numerator
p2 = n/d
cat("P(X>x, Y>y) =", n, "/", d, "=", p2, "\n")
## P(X>x, Y>y) = 893 / 1460 = 0.6116438
# c. P(X<x | Y>y)
d = sum(Y1) # denominator
n = sum(X[Y1] < x_1q) # numerator
p3 = n/d
cat("P(X<x | Y>y) =", n, "/", d, "=", p3, "\n")
## P(X<x | Y>y) = 191 / 1084 = 0.1761993
Interpretation: (a) among houses whose sale price is above the first-quartile threshold, about 82.4% also have a lot area above its first quartile; (b) about 61.2% of all houses are above the first quartile on both lot area and sale price; (c) among houses priced above the threshold, about 17.6% have a lot area below its first quartile.
| x/y | Y <= 1st quartile | Y > 1st quartile | Total |
|---|---|---|---|
| X <= 1st quartile | 174 | 191 | 365 |
| X > 1st quartile | 202 | 893 | 1095 |
| Total | 376 | 1084 | 1460 |
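The three probabilities and the table of counts can be cross-checked from four numbers printed above: 1460 observations, 1095 with X > x, 1084 with Y > y, and 893 with both. The following is a minimal sketch in base R. The object names (`counts`, `p_a`, `p_b`, `p_c`) are illustrative and not part of the original analysis.

```r
# Counts taken from the printed output above.
n_total <- 1460   # all observations
n_x_hi  <- 1095   # X > x
n_y_hi  <- 1084   # Y > y
n_both  <- 893    # X > x and Y > y

# Fill in the 2 x 2 table by subtraction (matrix() fills column by column).
counts <- matrix(
  c(n_total - n_x_hi - n_y_hi + n_both, n_x_hi - n_both,
    n_y_hi - n_both,                    n_both),
  nrow = 2,
  dimnames = list(X = c("<= Q1", "> Q1"), Y = c("<= Q1", "> Q1"))
)
print(addmargins(counts))

p_a <- counts["> Q1", "> Q1"] / sum(counts[, "> Q1"])    # P(X > x | Y > y)
p_b <- counts["> Q1", "> Q1"] / sum(counts)              # P(X > x, Y > y)
p_c <- counts["<= Q1", "> Q1"] / sum(counts[, "> Q1"])   # P(X <= x | Y > y)
print(c(a = p_a, b = p_b, c = p_c))
```

With the raw vectors available, `table(X > x_1q, Y > y_1q)` followed by `addmargins()` should give the same layout directly.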
Does splitting the training data in this fashion make them independent?
No, it does not. As shown in the next section, \(P\left(AB\right) \neq P\left(A\right)P\left(B\right)\), so the two indicator variables are not independent.
Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
A = X > x_1q
B = Y > y_1q
# Calculate P(AB) as the share of observations where both A and B hold.
P_AB = sum(A & B)/length(Y)
cat("P(AB) = ", P_AB, "\n")
## P(AB) = 0.6116438
# Calculate P(A) * P(B)
P_A = sum(A)/length(Y)
P_B = sum(B)/length(Y)
cat("P(A)P(B) = ", P_A*P_B, "\n")
## P(A)P(B) = 0.5568493
The above shows that \(P\left(AB\right) \neq P\left(A\right)P\left(B\right)\) (0.612 versus 0.557), i.e. that A and B are not independent in this sample.
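Running a Chi-Square test for association. Note that the call below cross-tabulates the raw LotArea and SalePrice values in d2 rather than the indicators A and B: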
d2 = df %>% dplyr::select(LotArea, SalePrice)
ptest = chisq.test(table(d2))
## Warning in chisq.test(table(d2)): Chi-squared approximation may be
## incorrect
print(ptest)
##
## Pearson's Chi-squared test
##
## data: table(d2)
## X-squared = 735090, df = 709660, p-value < 2.2e-16
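The p-value is far below 0.05, which points to statistical dependence between the two variables. However, because the table was built from raw values, most of its cells are empty, which is what triggers the warning that the Chi-squared approximation may be incorrect. A test on the 2 x 2 table of A and B is the appropriate check for the question asked.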
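A Chi-Square test on the indicators A and B themselves can be sketched from the counts already printed (1460 observations, 1095 with A, 1084 with B, 893 with both). This is an illustrative sketch, not the original author's run. The names `ab_counts` and `ab_test` are hypothetical, and no output is reproduced here.

```r
# 2 x 2 counts for A (X > x) and B (Y > y), derived by subtraction from the
# printed totals. matrix() fills column by column.
ab_counts <- matrix(
  c(174, 202,
    191, 893),
  nrow = 2,
  dimnames = list(A = c("FALSE", "TRUE"), B = c("FALSE", "TRUE"))
)

# For 2 x 2 tables chisq.test() applies a continuity correction by default.
ab_test <- chisq.test(ab_counts)
print(ab_test)
print(ab_test$expected)   # counts expected if A and B were independent
```

With the data loaded, `chisq.test(table(A, B))` would be the equivalent call.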
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y.
summary(d2)
## LotArea SalePrice
## Min. : 1300 Min. : 34900
## 1st Qu.: 7554 1st Qu.:129975
## Median : 9478 Median :163000
## Mean : 10517 Mean :180921
## 3rd Qu.: 11602 3rd Qu.:214000
## Max. :215245 Max. :755000
plot(d2, type="p", main="Scatter Plot: Sale Price vs. Lot Area",
xlab="Lot Area (sq ft)", ylab="Sale Price")
skewness(X)
## [1] 12.19514
skewness(Y)
## [1] 1.880941
kurtosis(X)
## [1] 205.5438
kurtosis(Y)
## [1] 9.509812
The skewness of X (Lot Area) is 12.20, indicating that it is extremely skewed to the right.
The skewness of Y (Sale Price, the target variable) is 1.88, indicating that it is also clearly skewed to the right, though far less than X.
The kurtosis values of X and Y are 205.5 and 9.5 respectively. Both are well above 3, the value for a normal distribution, so both X and Y are leptokurtic (heavy-tailed).
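The skewness and kurtosis figures can be reproduced by hand from central moments. The sketch below assumes the moment-based definitions (third and fourth central moments scaled by powers of the second, with no bias correction), under which a normal distribution has kurtosis 3. The helper names and the simulated sample are hypothetical.

```r
# Moment-based skewness and kurtosis (population moments, no bias correction).
central_moment <- function(v, k) mean((v - mean(v))^k)
skew_manual <- function(v) central_moment(v, 3) / central_moment(v, 2)^1.5
kurt_manual <- function(v) central_moment(v, 4) / central_moment(v, 2)^2

# Hypothetical right-skewed sample, used only for illustration.
set.seed(1)
toy <- rexp(500, rate = 1 / 10000)

skew_manual(toy)   # positive for a right-skewed sample
kurt_manual(toy)   # compare against 3, the value for a normal distribution
```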
Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval. Discuss the meaning of your analysis.
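The following 3 variables are selected for correlation testing. read.csv renames 1stFlrSF to X1stFlrSF because a syntactically valid R name cannot start with a digit.
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- 1stFlrSF: First floor area in square feet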
df3 = df %>% dplyr::select(WoodDeckSF, OpenPorchSF, X1stFlrSF)
corr.mat = cor(df3)
# Print Correlation matrix and Correlation hypothesis
# test for 92% conf. interval.
corr.test(df3, alpha=0.08)
## Call:corr.test(x = df3, alpha = 0.08)
## Correlation matrix
## WoodDeckSF OpenPorchSF X1stFlrSF
## WoodDeckSF 1.00 0.06 0.24
## OpenPorchSF 0.06 1.00 0.21
## X1stFlrSF 0.24 0.21 1.00
## Sample Size
## [1] 1460
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## WoodDeckSF OpenPorchSF X1stFlrSF
## WoodDeckSF 0.00 0.02 0
## OpenPorchSF 0.02 0.00 0
## X1stFlrSF 0.00 0.00 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
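The output above stops short of the 92% confidence intervals. One way to obtain an interval for a single pair is base R's `cor.test()` with `conf.level = 0.92`. The sketch below runs on simulated data, so the vectors `deck` and `floor1` are hypothetical stand-ins for two columns of df3 and the numbers it prints say nothing about the housing data.

```r
# Hypothetical data standing in for two columns of df3.
set.seed(42)
deck   <- rnorm(200)
floor1 <- 0.3 * deck + rnorm(200)

ct <- cor.test(deck, floor1, conf.level = 0.92)   # Pearson correlation by default
ct$estimate   # sample correlation
ct$conf.int   # 92% confidence interval
ct$p.value    # test of H0: correlation = 0
```

On the real data the analogous call would be `cor.test(df3$WoodDeckSF, df3$X1stFlrSF, conf.level = 0.92)`, repeated for each pair. Alternatively, the corr.test result can be printed with `short = FALSE`, as the message in the output suggests.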
Would you be worried about familywise error? Why or why not?
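Yes, to some degree. Familywise error arises from testing several hypotheses together, not from the size of the correlations: with three tests each at the 0.08 level, the chance of at least one false rejection can be as high as \(1 - 0.92^3 \approx 0.22\) if the tests are independent. The concern is limited here because the p-values above the diagonal in the corr.test output are already adjusted for multiple tests and remain small (0.02 or less), so all three correlations would still be judged different from zero.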
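The familywise error rate for several tests at a common level, and the effect of a multiple-testing adjustment, can be computed directly. The sketch below uses base R's `p.adjust()`. The raw p-values are hypothetical and chosen only to show how the adjustment behaves.

```r
alpha   <- 0.08
n_tests <- 3

# Chance of at least one false rejection if the three tests were independent.
fwer <- 1 - (1 - alpha)^n_tests
fwer

# Hypothetical raw p-values, for illustration only.
p_raw <- c(0.001, 0.020, 0.030)
p.adjust(p_raw, method = "holm")
p.adjust(p_raw, method = "bonferroni")
```

Adjusted p-values are never smaller than the raw ones, so a correlation that stays below 0.08 after adjustment is significant under the familywise criterion as well.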
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
corr.mat = cor(df3)
precision.mat = solve(corr.mat)
print(precision.mat)
## WoodDeckSF OpenPorchSF X1stFlrSF
## WoodDeckSF 1.058786157 -0.009777408 -0.2472307
## OpenPorchSF -0.009777408 1.046996622 -0.2193169
## X1stFlrSF -0.247230734 -0.219316883 1.1046357
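The statement that the diagonal of the precision matrix holds variance inflation factors can be checked numerically. The sketch below rebuilds the correlation matrix from the rounded values printed earlier, so its results only approximate the precision matrix above. It compares the diagonal of the inverse with \(1/(1 - R_j^2)\), where \(R_j^2\) is the squared multiple correlation of variable j with the other two. All object names are illustrative.

```r
# Correlation matrix rebuilt from the rounded values printed above.
vars <- c("WoodDeckSF", "OpenPorchSF", "X1stFlrSF")
R <- matrix(c(1.00, 0.06, 0.24,
              0.06, 1.00, 0.21,
              0.24, 0.21, 1.00),
            nrow = 3, dimnames = list(vars, vars))

vif_from_inverse <- diag(solve(R))

# R^2 of each variable regressed on the other two, from the correlations alone.
r_squared <- sapply(seq_along(vars), function(j) {
  r_j <- R[-j, j]
  drop(t(r_j) %*% solve(R[-j, -j]) %*% r_j)
})
vif_from_r2 <- 1 / (1 - r_squared)

cbind(vif_from_inverse, vif_from_r2)
```

Values this close to 1 indicate very little collinearity among the three variables.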
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
product.1 = corr.mat %*% precision.mat
# print rounded to 12 significant digits
print(round(product.1, 12))
## WoodDeckSF OpenPorchSF X1stFlrSF
## WoodDeckSF 1 0 0
## OpenPorchSF 0 1 0
## X1stFlrSF 0 0 1
product.2 = precision.mat %*% corr.mat
# print rounded to 12 significant digits
print(round(product.2, 12))
## WoodDeckSF OpenPorchSF X1stFlrSF
## WoodDeckSF 1 0 0
## OpenPorchSF 0 1 0
## X1stFlrSF 0 0 1
The two products are numerically equal to within rounding error (12 decimal places here), and both equal the identity matrix \(I_3\), as expected for a matrix multiplied by its inverse in either order.
LU Decomposition:
l = lu(precision.mat)
el = expand(l)
print(el)
## $L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.000000000 . .
## [2,] -0.009234544 1.000000000 .
## [3,] -0.233503935 -0.211671225 1.000000000
##
## $U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.058786157 -0.009777408 -0.247230734
## [2,] . 1.046906332 -0.221599946
## [3,] . . 1.000000000
##
## $P
## 3 x 3 sparse Matrix of class "pMatrix"
##
## [1,] | . .
## [2,] . | .
## [3,] . . |
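Because the permutation matrix P above is the identity, the factors can be reproduced with plain Gaussian elimination without pivoting. The function below is a hand-written Doolittle sketch, not the algorithm used inside the Matrix package. The name `lu_doolittle` is hypothetical, and the input values are copied from the printed precision matrix, so they carry its rounding.

```r
# Precision matrix values copied from the printed output above.
prec <- matrix(c( 1.058786157, -0.009777408, -0.247230734,
                 -0.009777408,  1.046996622, -0.219316883,
                 -0.247230734, -0.219316883,  1.104635700),
               nrow = 3, byrow = TRUE)

# Doolittle LU without pivoting: L is unit lower triangular, U is upper triangular.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]           # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]    # zero out entry (i, k)
    }
  }
  list(L = L, U = U)
}

f <- lu_doolittle(prec)
f$L
f$U
max(abs(f$L %*% f$U - prec))   # reconstruction error
```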
Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary.
skewness(X)
## [1] 12.19514
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10520 11600 215200
The minimum value of X is 1300, so X is already strictly positive (\(X > 0\)) and no shift is needed.
Then load the MASS package and run fitdistr() to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html).
library(MASS)
exp_params = fitdistr(X, "exponential")
print(exp_params)
## rate
## 9.508570e-05
## (2.488507e-06)
lambda = as.double(exp_params$estimate)
print(lambda)
## [1] 9.50857e-05
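For the exponential distribution, the maximum likelihood estimate of the rate has a closed form, the reciprocal of the sample mean, which is consistent with the fitted rate above. The sketch below checks this on a simulated sample by maximising the log-likelihood numerically. The sample, the rate of 2, and the object names are all hypothetical.

```r
# Hypothetical positive sample; the rate of 2 is arbitrary.
set.seed(7)
toy <- rexp(1000, rate = 2)

# Closed-form maximum likelihood estimate of the exponential rate.
lambda_closed <- 1 / mean(toy)

# Numerical check: maximise the log-likelihood n*log(rate) - rate*sum(x).
loglik <- function(rate) length(toy) * log(rate) - rate * sum(toy)
lambda_numeric <- optimize(loglik, interval = c(0.01, 100),
                           maximum = TRUE, tol = 1e-10)$maximum

c(closed = lambda_closed, numeric = lambda_numeric)

# With the LotArea mean reported by summary() above (10517, rounded):
1 / 10517
```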
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
exponential.dist = rexp(1000, lambda)
Plot a histogram and compare it with a histogram of your original variable.
# Compare the two histograms side-by-side.
par(mfrow=c(1, 2))
hist(X, main="X (Lot Area)")
hist(exponential.dist, main="Exponential Distr for X")
Using the exponential pdf, find the \(5^{th}\) and \(95^{th}\) percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.
# Sample percentiles of the 1000 simulated exponential draws.
quantile(exponential.dist, probs=c(0.05, 0.95))
## 5% 95%
## 608.5401 32128.1678
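The percentiles above come from the simulated sample, so they vary from run to run. The prompt asks for percentiles from the CDF, which for the exponential distribution can be inverted exactly with `qexp()`. The sketch below uses the rate printed by fitdistr() above. The name `lambda_hat` is illustrative.

```r
lambda_hat <- 9.50857e-05   # rate printed by fitdistr() above

# Percentiles from the inverse CDF of the exponential distribution.
q_exp <- qexp(c(0.05, 0.95), rate = lambda_hat)
q_exp

# Same values from the closed form: F(q) = 1 - exp(-lambda * q) = p.
-log(1 - c(0.05, 0.95)) / lambda_hat
```

These theoretical percentiles should be close to, but not identical with, the sample percentiles of the 1000 simulated draws.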
# Generate 95% C.I. for the empirical data:
xsd = sd(X)
xmean = mean(X)
n = length(X)
err = qnorm(0.975)*xsd/sqrt(n)
left = xmean - err
right = xmean + err
cat("A 95% confidence interval for the mean Lot Area is [", left, ",", right, "]")
## A 95% confidence interval for the mean Lot Area is [ 10004.84 , 11028.81 ]
Finally, provide the empirical \(5^{th}\) percentile and \(95^{th}\) percentile of the data. Discuss.
quantile(X, probs=c(0.05, 0.95))
## 5% 95%
## 3311.70 17401.15
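The exponential fit does not describe LotArea well. The simulated exponential sample has 5th and 95th percentiles of about 609 and 32128, whereas the empirical percentiles are 3312 and 17401: the exponential model places far too much mass near zero and puts its 95th percentile well above the observed one. The normal-theory interval [10004.84, 11028.81] is much narrower because it describes uncertainty in the mean of LotArea, not the spread of individual values.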
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
check_model <- function(m) {
  # Print the model summary, then summarise and plot the residuals.
  print(summary(m))
  res = residuals(m)
  print(summary(res))
  hist(res)
  plot(fitted(m), res)
}
par(mfrow = c(1, 1))
# Full training data set
df.train = df
# Reduce to Dataframe with selected feature sets
df.train = df.train %>% dplyr::select(
  SalePrice,
  BldgType,
  BsmtCond,
  BsmtExposure,
  BsmtQual,
  CentralAir,
  GarageArea,
  GarageCars,
  # Exterior1st,
  ExterQual,
  Fence,
  Fireplaces,
  FireplaceQu,
  Foundation,
  HouseStyle,
  KitchenQual,
  LandContour,
  LandSlope,
  LotArea,
  MasVnrArea,
  MiscVal,
  Neighborhood,
  OverallCond,
  OverallQual,
  PoolArea,
  # PoolQC,
  RoofStyle,
  # Street,
  YearBuilt,
  YearRemodAdd
)
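# Passing a data frame to lm() uses its first column (SalePrice) as the
# response and all remaining columns as predictors.
regr = lm(df.train)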
check_model(regr)
##
## Call:
## lm(formula = df.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55818 -13267 0 13165 57224
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.622e+05 1.226e+06 0.458 0.648297
## BldgType2fmCon 6.908e+03 3.349e+04 0.206 0.837318
## BldgTypeTwnhs 6.399e+04 5.658e+04 1.131 0.262684
## BldgTypeTwnhsE 3.972e+04 5.705e+04 0.696 0.489005
## BsmtCondGd 3.283e+04 4.652e+04 0.706 0.483160
## BsmtCondTA 1.054e+04 4.105e+04 0.257 0.798187
## BsmtExposureGd -3.779e+04 2.016e+04 -1.874 0.065817 .
## BsmtExposureMn -5.524e+04 2.136e+04 -2.586 0.012203 *
## BsmtExposureNo -5.128e+04 2.080e+04 -2.465 0.016613 *
## BsmtQualFa 5.315e+04 8.118e+04 0.655 0.515187
## BsmtQualGd -5.749e+04 3.272e+04 -1.757 0.084086 .
## BsmtQualTA -3.081e+04 3.722e+04 -0.828 0.411143
## CentralAirY 3.072e+04 2.755e+04 1.115 0.269348
## GarageArea 3.770e+01 5.191e+01 0.726 0.470564
## GarageCars 3.571e+03 1.278e+04 0.279 0.780919
## ExterQualFa -3.487e+04 9.487e+04 -0.368 0.714484
## ExterQualGd -4.552e+04 5.381e+04 -0.846 0.401030
## ExterQualTA -8.199e+04 5.456e+04 -1.503 0.138235
## FenceGdWo 1.511e+04 1.204e+04 1.256 0.214237
## FenceMnPrv 2.358e+04 9.733e+03 2.423 0.018482 *
## FenceMnWw -2.630e+03 2.532e+04 -0.104 0.917600
## Fireplaces 3.213e+04 9.888e+03 3.249 0.001914 **
## FireplaceQuFa 5.772e+04 3.326e+04 1.735 0.087881 .
## FireplaceQuGd 6.748e+04 3.176e+04 2.125 0.037807 *
## FireplaceQuPo 5.046e+04 3.400e+04 1.484 0.143074
## FireplaceQuTA 8.673e+04 3.327e+04 2.606 0.011564 *
## FoundationCBlock 3.114e+03 2.572e+04 0.121 0.904064
## FoundationPConc 1.926e+04 2.439e+04 0.790 0.432794
## FoundationStone NA NA NA NA
## HouseStyle1.5Unf 8.463e+03 6.203e+04 0.136 0.891946
## HouseStyle1Story -2.295e+04 1.790e+04 -1.282 0.204779
## HouseStyle2.5Fin 3.214e+04 5.320e+04 0.604 0.548070
## HouseStyle2.5Unf -4.242e+04 4.092e+04 -1.037 0.304125
## HouseStyle2Story -1.808e+04 1.737e+04 -1.041 0.302231
## HouseStyleSFoyer -5.482e+04 3.006e+04 -1.823 0.073308 .
## HouseStyleSLvl -6.718e+04 2.079e+04 -3.232 0.002013 **
## KitchenQualFa -1.807e+05 4.373e+04 -4.132 0.000115 ***
## KitchenQualGd -7.231e+04 2.048e+04 -3.530 0.000812 ***
## KitchenQualTA -7.639e+04 2.219e+04 -3.443 0.001065 **
## LandContourHLS 4.463e+03 5.153e+04 0.087 0.931280
## LandContourLvl 1.515e+04 2.354e+04 0.644 0.522289
## LandSlopeMod 3.936e+04 2.082e+04 1.891 0.063596 .
## LotArea 4.778e+00 1.447e+00 3.303 0.001629 **
## MasVnrArea -8.486e-01 2.644e+01 -0.032 0.974507
## MiscVal -1.549e+01 1.086e+01 -1.426 0.159114
## NeighborhoodBrkSide -3.721e+04 4.938e+04 -0.754 0.454136
## NeighborhoodClearCr -2.834e+04 4.614e+04 -0.614 0.541444
## NeighborhoodCollgCr 2.136e+04 5.191e+04 0.411 0.682255
## NeighborhoodCrawfor 1.902e+04 4.186e+04 0.454 0.651259
## NeighborhoodEdwards -1.733e+04 4.170e+04 -0.416 0.679156
## NeighborhoodGilbert 3.936e+04 4.938e+04 0.797 0.428605
## NeighborhoodIDOTRR -3.081e+04 6.452e+04 -0.478 0.634712
## NeighborhoodMeadowV 3.315e+04 3.937e+04 0.842 0.403100
## NeighborhoodMitchel -2.393e+03 4.274e+04 -0.056 0.955540
## NeighborhoodNAmes 6.620e+03 3.869e+04 0.171 0.864718
## NeighborhoodNoRidge 1.378e+05 5.017e+04 2.747 0.007954 **
## NeighborhoodNWAmes 2.257e+04 4.130e+04 0.547 0.586763
## NeighborhoodOldTown -7.865e+03 4.644e+04 -0.169 0.866094
## NeighborhoodSawyer 8.498e+03 3.977e+04 0.214 0.831560
## NeighborhoodSawyerW 2.946e+04 4.212e+04 0.700 0.486955
## NeighborhoodSomerst 6.423e+04 5.570e+04 1.153 0.253544
## NeighborhoodSWISU -1.850e+04 5.167e+04 -0.358 0.721548
## NeighborhoodVeenker NA NA NA NA
## OverallCond 1.149e+04 4.414e+03 2.602 0.011698 *
## OverallQual 1.024e+04 5.776e+03 1.773 0.081416 .
## PoolArea 1.023e+02 3.234e+01 3.164 0.002457 **
## RoofStyleGable 8.690e+02 3.125e+04 0.028 0.977908
## RoofStyleGambrel 2.449e+04 3.947e+04 0.620 0.537404
## RoofStyleHip 7.238e+03 3.202e+04 0.226 0.821924
## YearBuilt 9.312e+01 5.017e+02 0.186 0.853395
## YearRemodAdd -3.623e+02 3.375e+02 -1.073 0.287439
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30010 on 59 degrees of freedom
## (1332 observations deleted due to missingness)
## Multiple R-squared: 0.9299, Adjusted R-squared: 0.8491
## F-statistic: 11.51 on 68 and 59 DF, p-value: < 2.2e-16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -55820 -13270 0 0 13160 57220
qqnorm(residuals(regr))
summary(df.train)
## SalePrice BldgType BsmtCond BsmtExposure BsmtQual
## Min. : 34900 1Fam :1220 Fa : 45 Av :221 Ex :121
## 1st Qu.:129975 2fmCon: 31 Gd : 65 Gd :134 Fa : 35
## Median :163000 Duplex: 52 Po : 2 Mn :114 Gd :618
## Mean :180921 Twnhs : 43 TA :1311 No :953 TA :649
## 3rd Qu.:214000 TwnhsE: 114 NA's: 37 NA's: 38 NA's: 37
## Max. :755000
##
## CentralAir GarageArea GarageCars ExterQual Fence
## N: 95 Min. : 0.0 Min. :0.000 Ex: 52 GdPrv: 59
## Y:1365 1st Qu.: 334.5 1st Qu.:1.000 Fa: 14 GdWo : 54
## Median : 480.0 Median :2.000 Gd:488 MnPrv: 157
## Mean : 473.0 Mean :1.767 TA:906 MnWw : 11
## 3rd Qu.: 576.0 3rd Qu.:2.000 NA's :1179
## Max. :1418.0 Max. :4.000
##
## Fireplaces FireplaceQu Foundation HouseStyle KitchenQual
## Min. :0.000 Ex : 24 BrkTil:146 1Story :726 Ex:100
## 1st Qu.:0.000 Fa : 33 CBlock:634 2Story :445 Fa: 39
## Median :1.000 Gd :380 PConc :647 1.5Fin :154 Gd:586
## Mean :0.613 Po : 20 Slab : 24 SLvl : 65 TA:735
## 3rd Qu.:1.000 TA :313 Stone : 6 SFoyer : 37
## Max. :3.000 NA's:690 Wood : 3 1.5Unf : 14
## (Other): 19
## LandContour LandSlope LotArea MasVnrArea
## Bnk: 63 Gtl:1382 Min. : 1300 Min. : 0.0
## HLS: 50 Mod: 65 1st Qu.: 7554 1st Qu.: 0.0
## Low: 36 Sev: 13 Median : 9478 Median : 0.0
## Lvl:1311 Mean : 10517 Mean : 103.7
## 3rd Qu.: 11602 3rd Qu.: 166.0
## Max. :215245 Max. :1600.0
## NA's :8
## MiscVal Neighborhood OverallCond OverallQual
## Min. : 0.00 NAmes :225 Min. :1.000 Min. : 1.000
## 1st Qu.: 0.00 CollgCr:150 1st Qu.:5.000 1st Qu.: 5.000
## Median : 0.00 OldTown:113 Median :5.000 Median : 6.000
## Mean : 43.49 Edwards:100 Mean :5.575 Mean : 6.099
## 3rd Qu.: 0.00 Somerst: 86 3rd Qu.:6.000 3rd Qu.: 7.000
## Max. :15500.00 Gilbert: 79 Max. :9.000 Max. :10.000
## (Other):707
## PoolArea RoofStyle YearBuilt YearRemodAdd
## Min. : 0.000 Flat : 13 Min. :1872 Min. :1950
## 1st Qu.: 0.000 Gable :1141 1st Qu.:1954 1st Qu.:1967
## Median : 0.000 Gambrel: 11 Median :1973 Median :1994
## Mean : 2.759 Hip : 286 Mean :1971 Mean :1985
## 3rd Qu.: 0.000 Mansard: 7 3rd Qu.:2000 3rd Qu.:2004
## Max. :738.000 Shed : 2 Max. :2010 Max. :2010
##
From the above we see that the Adjusted R-squared value is high (0.8491), but the residual degrees of freedom are low (59). This is because 1332 of the 1460 observations were dropped due to missing values. From the summary of the dataframe, we see that the fields “Fence” and “FireplaceQu” have a high number of NAs (1179 and 690 respectively). Therefore we exclude them from the model.
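One way to spot such columns without reading the full summary is to count NAs per column and the number of complete rows. The sketch below uses a small made-up data frame. Its values are hypothetical and only the column names echo the real data.

```r
# Small hypothetical data frame with missing values, for illustration only.
toy <- data.frame(
  SalePrice   = c(200000, 150000, 180000, 120000),
  Fence       = c(NA, "MnPrv", NA, NA),
  FireplaceQu = c("Gd", "TA", NA, "TA"),
  LotArea     = c(8450, 9600, 11250, 9550)
)

na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_counts                   # NAs per column, largest first
sum(complete.cases(toy))    # rows lm() keeps under its default NA handling
```

Applied to df.train, the same two calls would show which predictors are responsible for most of the dropped rows.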
df.train = df.train %>% dplyr::select(-Fence, -FireplaceQu)
regr = lm(df.train)
check_model(regr)
##
## Call:
## lm(formula = df.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -311457 -15589 -831 13665 273845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.193e+05 1.996e+05 -2.601 0.009392 **
## BldgType2fmCon -2.050e+03 6.670e+03 -0.307 0.758600
## BldgTypeDuplex 5.642e+03 5.997e+03 0.941 0.346961
## BldgTypeTwnhs -4.096e+04 6.790e+03 -6.032 2.10e-09 ***
## BldgTypeTwnhsE -3.821e+04 4.290e+03 -8.907 < 2e-16 ***
## BsmtCondGd 3.527e+03 7.049e+03 0.500 0.616874
## BsmtCondPo 4.139e+04 2.604e+04 1.589 0.112207
## BsmtCondTA 5.689e+03 5.541e+03 1.027 0.304756
## BsmtExposureGd 2.218e+04 4.033e+03 5.500 4.54e-08 ***
## BsmtExposureMn -9.737e+02 4.124e+03 -0.236 0.813384
## BsmtExposureNo -8.296e+03 2.946e+03 -2.816 0.004932 **
## BsmtQualFa -4.094e+04 8.292e+03 -4.937 8.92e-07 ***
## BsmtQualGd -3.097e+04 4.414e+03 -7.015 3.64e-12 ***
## BsmtQualTA -3.507e+04 5.407e+03 -6.485 1.24e-10 ***
## CentralAirY 2.649e+03 4.746e+03 0.558 0.576848
## GarageArea 1.402e+01 9.539e+00 1.470 0.141865
## GarageCars 1.028e+04 2.842e+03 3.618 0.000308 ***
## ExterQualFa -2.298e+04 1.291e+04 -1.779 0.075430 .
## ExterQualGd -1.854e+04 6.356e+03 -2.917 0.003591 **
## ExterQualTA -2.377e+04 7.011e+03 -3.390 0.000719 ***
## Fireplaces 1.210e+04 1.717e+03 7.046 2.94e-12 ***
## FoundationCBlock 5.228e+03 4.166e+03 1.255 0.209713
## FoundationPConc 8.146e+03 4.621e+03 1.763 0.078152 .
## FoundationStone 1.273e+04 1.420e+04 0.896 0.370232
## FoundationWood 1.071e+04 1.994e+04 0.537 0.591206
## HouseStyle1.5Unf -2.081e+04 9.418e+03 -2.210 0.027302 *
## HouseStyle1Story -1.128e+04 3.594e+03 -3.139 0.001730 **
## HouseStyle2.5Fin 4.422e+04 1.266e+04 3.493 0.000493 ***
## HouseStyle2.5Unf 4.544e+03 1.091e+04 0.417 0.677108
## HouseStyle2Story -1.960e+03 3.742e+03 -0.524 0.600613
## HouseStyleSFoyer -3.070e+04 7.374e+03 -4.163 3.34e-05 ***
## HouseStyleSLvl -2.131e+04 5.665e+03 -3.762 0.000176 ***
## KitchenQualFa -3.274e+04 8.142e+03 -4.021 6.11e-05 ***
## KitchenQualGd -3.329e+04 4.591e+03 -7.250 7.05e-13 ***
## KitchenQualTA -3.789e+04 5.204e+03 -7.280 5.69e-13 ***
## LandContourHLS 1.137e+04 6.821e+03 1.667 0.095702 .
## LandContourLow 2.026e+03 8.488e+03 0.239 0.811375
## LandContourLvl 1.735e+04 4.886e+03 3.551 0.000396 ***
## LandSlopeMod 1.151e+04 5.243e+03 2.196 0.028268 *
## LandSlopeSev -3.051e+04 1.327e+04 -2.299 0.021652 *
## LotArea 7.985e-01 1.228e-01 6.505 1.10e-10 ***
## MasVnrArea 2.046e+01 6.218e+00 3.290 0.001028 **
## MiscVal -1.009e+00 1.897e+00 -0.532 0.595098
## NeighborhoodBlueste -7.660e+03 2.517e+04 -0.304 0.760882
## NeighborhoodBrDale -1.076e+04 1.353e+04 -0.795 0.426585
## NeighborhoodBrkSide -1.782e+04 1.112e+04 -1.603 0.109243
## NeighborhoodClearCr 9.918e+02 1.192e+04 0.083 0.933716
## NeighborhoodCollgCr -1.167e+04 9.329e+03 -1.251 0.211224
## NeighborhoodCrawfor 1.446e+04 1.087e+04 1.331 0.183572
## NeighborhoodEdwards -2.510e+04 1.017e+04 -2.468 0.013695 *
## NeighborhoodGilbert -2.076e+04 9.948e+03 -2.087 0.037053 *
## NeighborhoodIDOTRR -3.351e+04 1.172e+04 -2.860 0.004303 **
## NeighborhoodMeadowV 4.281e+03 1.257e+04 0.341 0.733495
## NeighborhoodMitchel -2.174e+04 1.043e+04 -2.084 0.037361 *
## NeighborhoodNAmes -1.759e+04 9.900e+03 -1.777 0.075817 .
## NeighborhoodNoRidge 6.186e+04 1.061e+04 5.830 6.96e-09 ***
## NeighborhoodNPkVill 6.038e+03 1.434e+04 0.421 0.673799
## NeighborhoodNridgHt 2.007e+04 9.679e+03 2.073 0.038347 *
## NeighborhoodNWAmes -1.315e+04 1.019e+04 -1.290 0.197446
## NeighborhoodOldTown -2.699e+04 1.076e+04 -2.508 0.012254 *
## NeighborhoodSawyer -1.769e+04 1.039e+04 -1.702 0.088988 .
## NeighborhoodSawyerW -3.910e+03 9.972e+03 -0.392 0.695086
## NeighborhoodSomerst 1.948e+03 9.461e+03 0.206 0.836855
## NeighborhoodStoneBr 5.363e+04 1.074e+04 4.993 6.72e-07 ***
## NeighborhoodSWISU -1.045e+04 1.248e+04 -0.838 0.402405
## NeighborhoodTimber -1.369e+04 1.061e+04 -1.290 0.197413
## NeighborhoodVeenker 1.202e+04 1.327e+04 0.906 0.365254
## OverallCond 3.752e+03 1.083e+03 3.465 0.000548 ***
## OverallQual 1.381e+04 1.263e+03 10.934 < 2e-16 ***
## PoolArea 6.747e+01 2.246e+01 3.004 0.002710 **
## RoofStyleGable -1.891e+03 1.105e+04 -0.171 0.864087
## RoofStyleGambrel 3.983e+03 1.507e+04 0.264 0.791594
## RoofStyleHip 4.768e+03 1.124e+04 0.424 0.671411
## RoofStyleMansard 1.733e+04 1.698e+04 1.020 0.307775
## RoofStyleShed 3.800e+04 2.684e+04 1.416 0.157091
## YearBuilt 1.175e+02 8.670e+01 1.355 0.175574
## YearRemodAdd 2.018e+02 7.077e+01 2.851 0.004424 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32710 on 1337 degrees of freedom
## (46 observations deleted due to missingness)
## Multiple R-squared: 0.8389, Adjusted R-squared: 0.8297
## F-statistic: 91.6 on 76 and 1337 DF, p-value: < 2.2e-16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -311500 -15590 -831 0 13660 273800
qqnorm(residuals(regr))
As a result the Adjusted R-squared value is now lower (0.8297 versus 0.8491) but the residual degrees of freedom are much higher (1337 versus 59). This is preferable because the model is now fitted on 1414 of the 1460 training observations instead of only 128.
Next, we remove several variables that remain in the model and whose coefficients have high p-values.
df.train = df.train %>% dplyr::select(-RoofStyle, -Foundation, -BsmtCond, -MiscVal)
regr = lm(df.train)
check_model(regr)
##
## Call:
## lm(formula = df.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -310149 -16203 -846 13252 277035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.766e+05 1.912e+05 -3.016 0.002613 **
## BldgType2fmCon -8.140e+02 6.634e+03 -0.123 0.902363
## BldgTypeDuplex 5.789e+03 5.874e+03 0.986 0.324520
## BldgTypeTwnhs -4.118e+04 6.797e+03 -6.059 1.77e-09 ***
## BldgTypeTwnhsE -3.833e+04 4.285e+03 -8.945 < 2e-16 ***
## BsmtExposureGd 2.233e+04 3.988e+03 5.601 2.58e-08 ***
## BsmtExposureMn -9.110e+02 4.114e+03 -0.221 0.824780
## BsmtExposureNo -8.097e+03 2.945e+03 -2.750 0.006044 **
## BsmtQualFa -4.098e+04 8.080e+03 -5.072 4.49e-07 ***
## BsmtQualGd -3.210e+04 4.386e+03 -7.317 4.33e-13 ***
## BsmtQualTA -3.594e+04 5.319e+03 -6.757 2.09e-11 ***
## CentralAirY 3.736e+03 4.670e+03 0.800 0.423936
## GarageArea 1.387e+01 9.527e+00 1.456 0.145734
## GarageCars 1.037e+04 2.837e+03 3.655 0.000268 ***
## ExterQualFa -2.227e+04 1.274e+04 -1.749 0.080578 .
## ExterQualGd -2.007e+04 6.319e+03 -3.176 0.001525 **
## ExterQualTA -2.618e+04 6.954e+03 -3.764 0.000174 ***
## Fireplaces 1.199e+04 1.705e+03 7.030 3.27e-12 ***
## HouseStyle1.5Unf -2.219e+04 9.398e+03 -2.361 0.018373 *
## HouseStyle1Story -9.547e+03 3.507e+03 -2.723 0.006562 **
## HouseStyle2.5Fin 4.400e+04 1.265e+04 3.479 0.000519 ***
## HouseStyle2.5Unf 4.402e+03 1.089e+04 0.404 0.686233
## HouseStyle2Story -6.899e+02 3.685e+03 -0.187 0.851535
## HouseStyleSFoyer -3.019e+04 7.308e+03 -4.131 3.83e-05 ***
## HouseStyleSLvl -1.986e+04 5.605e+03 -3.543 0.000409 ***
## KitchenQualFa -3.282e+04 8.117e+03 -4.044 5.56e-05 ***
## KitchenQualGd -3.330e+04 4.590e+03 -7.254 6.79e-13 ***
## KitchenQualTA -3.806e+04 5.187e+03 -7.338 3.73e-13 ***
## LandContourHLS 1.172e+04 6.777e+03 1.730 0.083882 .
## LandContourLow 3.150e+03 8.402e+03 0.375 0.707779
## LandContourLvl 1.722e+04 4.872e+03 3.534 0.000423 ***
## LandSlopeMod 1.175e+04 5.195e+03 2.262 0.023859 *
## LandSlopeSev -2.414e+04 1.228e+04 -1.966 0.049476 *
## LotArea 7.619e-01 1.186e-01 6.425 1.82e-10 ***
## MasVnrArea 2.286e+01 6.127e+00 3.730 0.000199 ***
## NeighborhoodBlueste -9.145e+03 2.510e+04 -0.364 0.715705
## NeighborhoodBrDale -1.329e+04 1.343e+04 -0.990 0.322272
## NeighborhoodBrkSide -1.951e+04 1.107e+04 -1.762 0.078312 .
## NeighborhoodClearCr 9.989e+02 1.175e+04 0.085 0.932274
## NeighborhoodCollgCr -1.229e+04 9.334e+03 -1.317 0.188066
## NeighborhoodCrawfor 1.440e+04 1.088e+04 1.324 0.185760
## NeighborhoodEdwards -2.487e+04 1.018e+04 -2.444 0.014663 *
## NeighborhoodGilbert -2.073e+04 9.943e+03 -2.085 0.037239 *
## NeighborhoodIDOTRR -3.214e+04 1.171e+04 -2.745 0.006123 **
## NeighborhoodMeadowV 4.433e+03 1.255e+04 0.353 0.723894
## NeighborhoodMitchel -2.148e+04 1.043e+04 -2.059 0.039663 *
## NeighborhoodNAmes -1.703e+04 9.850e+03 -1.729 0.084077 .
## NeighborhoodNoRidge 6.218e+04 1.062e+04 5.855 5.97e-09 ***
## NeighborhoodNPkVill 4.340e+03 1.418e+04 0.306 0.759657
## NeighborhoodNridgHt 1.936e+04 9.681e+03 1.999 0.045761 *
## NeighborhoodNWAmes -1.266e+04 1.012e+04 -1.251 0.211068
## NeighborhoodOldTown -2.717e+04 1.075e+04 -2.529 0.011562 *
## NeighborhoodSawyer -1.602e+04 1.034e+04 -1.548 0.121744
## NeighborhoodSawyerW -4.886e+03 9.973e+03 -0.490 0.624304
## NeighborhoodSomerst 7.319e+02 9.460e+03 0.077 0.938339
## NeighborhoodStoneBr 5.310e+04 1.075e+04 4.938 8.86e-07 ***
## NeighborhoodSWISU -1.179e+04 1.245e+04 -0.947 0.343724
## NeighborhoodTimber -1.350e+04 1.060e+04 -1.274 0.203008
## NeighborhoodVeenker 1.158e+04 1.327e+04 0.873 0.382800
## OverallCond 3.596e+03 1.034e+03 3.478 0.000521 ***
## OverallQual 1.418e+04 1.244e+03 11.393 < 2e-16 ***
## PoolArea 6.898e+01 2.215e+01 3.115 0.001880 **
## YearBuilt 1.563e+02 8.022e+01 1.949 0.051551 .
## YearRemodAdd 1.973e+02 6.934e+01 2.846 0.004496 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32770 on 1350 degrees of freedom
## (46 observations deleted due to missingness)
## Multiple R-squared: 0.8367, Adjusted R-squared: 0.8291
## F-statistic: 109.8 on 63 and 1350 DF, p-value: < 2.2e-16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -310100.0 -16200.0 -845.9 0.0 13250.0 277000.0
qqnorm(residuals(regr))
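Dropping variables based on individual coefficient p-values can be cross-checked with a partial F-test on the nested models. The sketch below illustrates the idea on R's built-in mtcars data, which is unrelated to the housing data. The model formulas are hypothetical and chosen only so that the example runs on its own.

```r
# Built-in mtcars data as a stand-in; the variables are unrelated to housing.
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: H0 is that the dropped terms (qsec, drat) add nothing.
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared of each model, for a side-by-side check.
c(full = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

Both models must be fitted to the same rows for the comparison to be valid, which matters when removing a variable changes how many observations are dropped for missingness.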
Perform predictions using the test.csv file and create the Kaggle submission CSV file.
df.test = read.csv("test.csv")
# Keep Id plus the predictors used by the final model.
df.test = df.test %>% dplyr::select(
  Id,
  BldgType,
  # BsmtCond,
  BsmtExposure,
  BsmtQual,
  CentralAir,
  GarageArea,
  GarageCars,
  # Exterior1st,  (not a predictor in the model)
  ExterQual,
  # Fence,
  Fireplaces,
  # FireplaceQu,
  # Foundation,
  HouseStyle,
  KitchenQual,
  LandContour,
  LandSlope,
  LotArea,
  MasVnrArea,
  # MiscVal,
  Neighborhood,
  OverallCond,
  OverallQual,
  PoolArea,
  # PoolQC,
  # RoofStyle,
  # Street,
  YearBuilt,
  YearRemodAdd
)
#regr2 = update(regr, na.action=na.exclude)
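# Predict on the test set; rows with missing predictor values yield NA.
predictions = predict(regr, df.test %>% dplyr::select(-Id))
# Replace NA predictions with the mean of the available predictions.
predictions[is.na(predictions)] = mean(predictions, na.rm=TRUE)
pred.df = data.frame(df.test$Id, as.numeric(predictions))
colnames(pred.df) = c("Id", "SalePrice")
write.csv(pred.df, file="submission.csv", row.names=FALSE)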