Data605 Final Exam

House Prices: Advanced Regression Techniques competition.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques.

library(dplyr)
library(knitr)
library(tidyr)
library(moments)
library(psych)
library(Matrix)


# Assume file train.csv has been locally downloaded.

df = read.csv("train.csv")
dim(df)

## [1] 1460   81

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!

Pick the dependent variable and define it as Y.

# To pick an independent variable with right-skewness, we can examine histograms
# of a few variables.
par(mfrow=c(1,2))
# Overall Quality
hist(df$OverallQual, main = "OverallQual")

# Lot Area
hist(df$LotArea, main = "LotArea")

# LotArea is clearly right-skewed.
X = df$LotArea

# The target variable we are trying to predict is SalePrice, the
# property's sale price in dollars.

Y = df$SalePrice

# Show histogram of SalePrice (target).
# SalePrice
hist(df$SalePrice, main = "SalePrice")

Probability.

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.

P(X > x | Y > y)

P(X > x, Y > y)

P(X < x | Y > y)

# Print summaries of X (independent var.) and Y (target).

d2 = df %>% dplyr::select(LotArea, SalePrice)
summary(d2)

##     LotArea         SalePrice     
##  Min.   :  1300   Min.   : 34900  
##  1st Qu.:  7554   1st Qu.:129975  
##  Median :  9478   Median :163000  
##  Mean   : 10517   Mean   :180921  
##  3rd Qu.: 11602   3rd Qu.:214000  
##  Max.   :215245   Max.   :755000

x_1q = summary(X)["1st Qu."]
y_1q = summary(Y)["1st Qu."]

cat("Lot_Area.1st_Quartile = ", x_1q, "; Sale_Price.1st_Quartile = ", y_1q, "\n")

## Lot_Area.1st_Quartile =  7554 ; Sale_Price.1st_Quartile =  130000

# Count the number of observations above the 1st quartile for X and Y.

cat("Number of observations above the 1st quartile for X =", sum(X > x_1q), "\n")

## Number of observations above the 1st quartile for X = 1095

cat("Number of observations above the 1st quartile for Y =", sum(Y > y_1q), "\n")

## Number of observations above the 1st quartile for Y = 1084

# Now calculate the required probabilities.

X1 = X > x_1q
Y1 = Y > y_1q

# a. P(X>x | Y>y)
d = sum(Y1)             # denominator, instances where Y>y
n = sum(X[Y1] > x_1q)   # numerator
p1 = n/d

cat("P(X>x | Y>y) =", n, "/", d, "=", p1, "\n")

## P(X>x | Y>y) = 893 / 1084 = 0.8238007

# b. P(X>x, Y>y)
d = length(Y)                       # denominator
n = sum((X > x_1q) & (Y > y_1q))    # numerator
p2 = n/d

cat("P(X>x, Y>y) =", n, "/", d, "=", p2, "\n")

## P(X>x, Y>y) = 893 / 1460 = 0.6116438

# c. P(X<x | Y>y)
d = sum(Y1)           # denominator
n = sum(X[Y1] < x_1q)  # numerator
p3 = n/d

cat("P(X<x | Y>y) =", n, "/", d, "=", p3, "\n")

## P(X<x | Y>y) = 191 / 1084 = 0.1761993

x/y	<= 1st quartile	> 1st quartile	total
<= 1st quartile	365/376	1095/376	1460/752
> 1st quartile	365/1084	1095/1084	1460/2168
total	730/1460	2190/1460	2920/2920

Does splitting the training data in this fashion make them independent?

No, it does not. As shown in the next section, we find that \(P\left(AB\right) \neq P\left(A\right)P\left(B\right)\)

Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

A = X > x_1q
B = Y > y_1q

# Calculate P(AB)
P_AB = sum(A[B])
cat("P(AB) = ", P_AB/length(Y), "\n")

## P(AB) =  0.6116438

# Calculate P(A) * P(B)
P_A = sum(A)/length(Y)
P_B = sum(B)/length(Y)
cat("P(A)P(B) = ", P_A*P_B, "\n")

## P(A)P(B) =  0.5568493

The above shows that \(P\left(AB\right) \neq P\left(A\right)P\left(B\right)\), i.e. that A and B are not independent.

Running a Chi-Square test on A, B for association:

d2 = df %>% dplyr::select(LotArea, SalePrice)
ptest = chisq.test(table(d2))

## Warning in chisq.test(table(d2)): Chi-squared approximation may be
## incorrect

print(ptest)

## 
##  Pearson's Chi-squared test
## 
## data:  table(d2)
## X-squared = 735090, df = 709660, p-value < 2.2e-16

The p-value indicates the two variables are statistically dependent.

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y.

summary(d2)

##     LotArea         SalePrice     
##  Min.   :  1300   Min.   : 34900  
##  1st Qu.:  7554   1st Qu.:129975  
##  Median :  9478   Median :163000  
##  Mean   : 10517   Mean   :180921  
##  3rd Qu.: 11602   3rd Qu.:214000  
##  Max.   :215245   Max.   :755000

plot(d2, type="p", main="Scatter Plot: Sale Price vs. Lot Area",
     xlab="Lot Area (sq ft)", ylab="Sale Price")

skewness(X)

## [1] 12.19514

skewness(Y)

## [1] 1.880941

kurtosis(X)

## [1] 205.5438

kurtosis(Y)

## [1] 9.509812

The skewness of X (Lot Area variable) is 12.20 indicating that it is extremely skewed to the right.

The skewness of Y (Sale Price, target variable) is 1.88 indicating that it is highly skewed to the right.

The kurtosis of X and Y are 205.5 and 9.5 respectively, implying that both X and Y are leptokurtic.

Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval. Discuss the meaning of your analysis.

The following 3 variables are selected for correlation testing.

WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
1stFlrSF: First Floor square feet

df3 = df %>% dplyr::select(WoodDeckSF, OpenPorchSF, X1stFlrSF)

corr.mat = cor(df3)

# Print Correlation matrix and Correlation hypothesis
# test for 92% conf. interval.
corr.test(df3, alpha=0.08)

## Call:corr.test(x = df3, alpha = 0.08)
## Correlation matrix 
##             WoodDeckSF OpenPorchSF X1stFlrSF
## WoodDeckSF        1.00        0.06      0.24
## OpenPorchSF       0.06        1.00      0.21
## X1stFlrSF         0.24        0.21      1.00
## Sample Size 
## [1] 1460
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##             WoodDeckSF OpenPorchSF X1stFlrSF
## WoodDeckSF        0.00        0.02         0
## OpenPorchSF       0.02        0.00         0
## X1stFlrSF         0.00        0.00         0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

Would you be worried about familywise error? Why or why not?

The 3 variables exhibit weak correlation as evidenced by their correlation matrix. Therefore there is less chance of familywise error.

Linear Algebra and Correlation.

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

corr.mat = cor(df3)
precision.mat = solve(corr.mat)
print(precision.mat)

##               WoodDeckSF  OpenPorchSF  X1stFlrSF
## WoodDeckSF   1.058786157 -0.009777408 -0.2472307
## OpenPorchSF -0.009777408  1.046996622 -0.2193169
## X1stFlrSF   -0.247230734 -0.219316883  1.1046357

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

product.1 = corr.mat %*% precision.mat
# print rounded to 12 significant digits
print(round(product.1, 12))

##             WoodDeckSF OpenPorchSF X1stFlrSF
## WoodDeckSF           1           0         0
## OpenPorchSF          0           1         0
## X1stFlrSF            0           0         1

product.2 = precision.mat %*% corr.mat
# print rounded to 12 significant digits
print(round(product.2, 12))

##             WoodDeckSF OpenPorchSF X1stFlrSF
## WoodDeckSF           1           0         0
## OpenPorchSF          0           1         0
## X1stFlrSF            0           0         1

The two matrices are numerically equal within a reasonable level of precision that would be used. They are both equal to the identity matrix \(I_3\).

LU Decomposition:

l = lu(precision.mat)
el = expand(l)
print(el)

## $L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]         [,2]         [,3]        
## [1,]  1.000000000            .            .
## [2,] -0.009234544  1.000000000            .
## [3,] -0.233503935 -0.211671225  1.000000000
## 
## $U
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]         [,2]         [,3]        
## [1,]  1.058786157 -0.009777408 -0.247230734
## [2,]            .  1.046906332 -0.221599946
## [3,]            .            .  1.000000000
## 
## $P
## 3 x 3 sparse Matrix of class "pMatrix"
##           
## [1,] | . .
## [2,] . | .
## [3,] . . |

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary.

skewness(X)

## [1] 12.19514

summary(X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10520   11600  215200

The minimum value of X is 1300 and so we already have \(X \geq 0\).

Then load the MASS package and run fitdistr() to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html).

library(MASS)
exp_params = fitdistr(X, "exponential")
print(exp_params)

##        rate    
##   9.508570e-05 
##  (2.488507e-06)

lambda = as.double(exp_params$estimate)
print(lambda)

## [1] 9.50857e-05

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).

exponential.dist = rexp(1000, lambda)

Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the \(5^{th}\) and \(95^{th}\) percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.

# Compare the two histograms side-by-side.
par(mfrow=c(1, 2))
hist(X, main="X (Lot Area)")
hist(exponential.dist, main="Exponential Distr for X")

Using the exponential pdf, find the \(5^{th}\) and \(95^{th}\) percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.

quantile(exponential.dist, probs=c(0.05, 0.95))

##         5%        95% 
##   608.5401 32128.1678

# Generate 95% C.I. for the empirical data:

xsd = sd(X)
xmean = mean(X)
n = length(X)

err = qnorm(0.975)*xsd/sqrt(n)
left = xmean - err
right = xmean + err

cat("A 95% confidence interval for Lot Area is [", left, ",", right, "]")

## A 95% confidence interval for Lot Area is [ 10004.84 , 11028.81 ]

Finally, provide the empirical \(5^{th}\) percentile and \(95^{th}\) percentile of the data. Discuss.

quantile(X, probs=c(0.05, 0.95))

##       5%      95% 
##  3311.70 17401.15

The difference in the empirical data and the exponential fit for LotArea indicates that the assumption that LotArea follows an exponential distribution does not fit the observed data very well.

Modeling.

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

check_model <- function(m) {
    print(summary(m))
    res = residuals(m)
    print(summary(res))
    hist(res)
    plot(fitted(m), resid(m))
}
par(mfrow = c(1, 1))

# Full training data set
df.train = df

# Reduce to Dataframe with selected feature sets
df.train = df.train %>% dplyr::select(SalePrice,
                                      BldgType,
                                      BsmtCond,
                                      BsmtExposure,
                                      BsmtQual,
                                      CentralAir,
                                      GarageArea,
                                      GarageCars,
                                      # Exterior1st,
                                      ExterQual,
                                      Fence,
                                      Fireplaces,
                                      FireplaceQu,
                                      Foundation,
                                      HouseStyle,
                                      KitchenQual,
                                      LandContour,
                                      LandSlope,
                                      LotArea,
                                      MasVnrArea,
                                      MiscVal,
                                      Neighborhood,
                                      OverallCond,
                                      OverallQual,
                                      PoolArea,
                                      # # PoolQC,
                                      RoofStyle,
                                      # # Street,
                                      YearBuilt,
                                      YearRemodAdd)
regr = lm(df.train)
check_model(regr)

## 
## Call:
## lm(formula = df.train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55818 -13267      0  13165  57224 
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.622e+05  1.226e+06   0.458 0.648297    
## BldgType2fmCon       6.908e+03  3.349e+04   0.206 0.837318    
## BldgTypeTwnhs        6.399e+04  5.658e+04   1.131 0.262684    
## BldgTypeTwnhsE       3.972e+04  5.705e+04   0.696 0.489005    
## BsmtCondGd           3.283e+04  4.652e+04   0.706 0.483160    
## BsmtCondTA           1.054e+04  4.105e+04   0.257 0.798187    
## BsmtExposureGd      -3.779e+04  2.016e+04  -1.874 0.065817 .  
## BsmtExposureMn      -5.524e+04  2.136e+04  -2.586 0.012203 *  
## BsmtExposureNo      -5.128e+04  2.080e+04  -2.465 0.016613 *  
## BsmtQualFa           5.315e+04  8.118e+04   0.655 0.515187    
## BsmtQualGd          -5.749e+04  3.272e+04  -1.757 0.084086 .  
## BsmtQualTA          -3.081e+04  3.722e+04  -0.828 0.411143    
## CentralAirY          3.072e+04  2.755e+04   1.115 0.269348    
## GarageArea           3.770e+01  5.191e+01   0.726 0.470564    
## GarageCars           3.571e+03  1.278e+04   0.279 0.780919    
## ExterQualFa         -3.487e+04  9.487e+04  -0.368 0.714484    
## ExterQualGd         -4.552e+04  5.381e+04  -0.846 0.401030    
## ExterQualTA         -8.199e+04  5.456e+04  -1.503 0.138235    
## FenceGdWo            1.511e+04  1.204e+04   1.256 0.214237    
## FenceMnPrv           2.358e+04  9.733e+03   2.423 0.018482 *  
## FenceMnWw           -2.630e+03  2.532e+04  -0.104 0.917600    
## Fireplaces           3.213e+04  9.888e+03   3.249 0.001914 ** 
## FireplaceQuFa        5.772e+04  3.326e+04   1.735 0.087881 .  
## FireplaceQuGd        6.748e+04  3.176e+04   2.125 0.037807 *  
## FireplaceQuPo        5.046e+04  3.400e+04   1.484 0.143074    
## FireplaceQuTA        8.673e+04  3.327e+04   2.606 0.011564 *  
## FoundationCBlock     3.114e+03  2.572e+04   0.121 0.904064    
## FoundationPConc      1.926e+04  2.439e+04   0.790 0.432794    
## FoundationStone             NA         NA      NA       NA    
## HouseStyle1.5Unf     8.463e+03  6.203e+04   0.136 0.891946    
## HouseStyle1Story    -2.295e+04  1.790e+04  -1.282 0.204779    
## HouseStyle2.5Fin     3.214e+04  5.320e+04   0.604 0.548070    
## HouseStyle2.5Unf    -4.242e+04  4.092e+04  -1.037 0.304125    
## HouseStyle2Story    -1.808e+04  1.737e+04  -1.041 0.302231    
## HouseStyleSFoyer    -5.482e+04  3.006e+04  -1.823 0.073308 .  
## HouseStyleSLvl      -6.718e+04  2.079e+04  -3.232 0.002013 ** 
## KitchenQualFa       -1.807e+05  4.373e+04  -4.132 0.000115 ***
## KitchenQualGd       -7.231e+04  2.048e+04  -3.530 0.000812 ***
## KitchenQualTA       -7.639e+04  2.219e+04  -3.443 0.001065 ** 
## LandContourHLS       4.463e+03  5.153e+04   0.087 0.931280    
## LandContourLvl       1.515e+04  2.354e+04   0.644 0.522289    
## LandSlopeMod         3.936e+04  2.082e+04   1.891 0.063596 .  
## LotArea              4.778e+00  1.447e+00   3.303 0.001629 ** 
## MasVnrArea          -8.486e-01  2.644e+01  -0.032 0.974507    
## MiscVal             -1.549e+01  1.086e+01  -1.426 0.159114    
## NeighborhoodBrkSide -3.721e+04  4.938e+04  -0.754 0.454136    
## NeighborhoodClearCr -2.834e+04  4.614e+04  -0.614 0.541444    
## NeighborhoodCollgCr  2.136e+04  5.191e+04   0.411 0.682255    
## NeighborhoodCrawfor  1.902e+04  4.186e+04   0.454 0.651259    
## NeighborhoodEdwards -1.733e+04  4.170e+04  -0.416 0.679156    
## NeighborhoodGilbert  3.936e+04  4.938e+04   0.797 0.428605    
## NeighborhoodIDOTRR  -3.081e+04  6.452e+04  -0.478 0.634712    
## NeighborhoodMeadowV  3.315e+04  3.937e+04   0.842 0.403100    
## NeighborhoodMitchel -2.393e+03  4.274e+04  -0.056 0.955540    
## NeighborhoodNAmes    6.620e+03  3.869e+04   0.171 0.864718    
## NeighborhoodNoRidge  1.378e+05  5.017e+04   2.747 0.007954 ** 
## NeighborhoodNWAmes   2.257e+04  4.130e+04   0.547 0.586763    
## NeighborhoodOldTown -7.865e+03  4.644e+04  -0.169 0.866094    
## NeighborhoodSawyer   8.498e+03  3.977e+04   0.214 0.831560    
## NeighborhoodSawyerW  2.946e+04  4.212e+04   0.700 0.486955    
## NeighborhoodSomerst  6.423e+04  5.570e+04   1.153 0.253544    
## NeighborhoodSWISU   -1.850e+04  5.167e+04  -0.358 0.721548    
## NeighborhoodVeenker         NA         NA      NA       NA    
## OverallCond          1.149e+04  4.414e+03   2.602 0.011698 *  
## OverallQual          1.024e+04  5.776e+03   1.773 0.081416 .  
## PoolArea             1.023e+02  3.234e+01   3.164 0.002457 ** 
## RoofStyleGable       8.690e+02  3.125e+04   0.028 0.977908    
## RoofStyleGambrel     2.449e+04  3.947e+04   0.620 0.537404    
## RoofStyleHip         7.238e+03  3.202e+04   0.226 0.821924    
## YearBuilt            9.312e+01  5.017e+02   0.186 0.853395    
## YearRemodAdd        -3.623e+02  3.375e+02  -1.073 0.287439    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30010 on 59 degrees of freedom
##   (1332 observations deleted due to missingness)
## Multiple R-squared:  0.9299, Adjusted R-squared:  0.8491 
## F-statistic: 11.51 on 68 and 59 DF,  p-value: < 2.2e-16
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -55820  -13270       0       0   13160   57220

qqnorm(residuals(regr))

summary(df.train)

##    SalePrice        BldgType    BsmtCond    BsmtExposure BsmtQual  
##  Min.   : 34900   1Fam  :1220   Fa  :  45   Av  :221     Ex  :121  
##  1st Qu.:129975   2fmCon:  31   Gd  :  65   Gd  :134     Fa  : 35  
##  Median :163000   Duplex:  52   Po  :   2   Mn  :114     Gd  :618  
##  Mean   :180921   Twnhs :  43   TA  :1311   No  :953     TA  :649  
##  3rd Qu.:214000   TwnhsE: 114   NA's:  37   NA's: 38     NA's: 37  
##  Max.   :755000                                                    
##                                                                    
##  CentralAir   GarageArea       GarageCars    ExterQual   Fence     
##  N:  95     Min.   :   0.0   Min.   :0.000   Ex: 52    GdPrv:  59  
##  Y:1365     1st Qu.: 334.5   1st Qu.:1.000   Fa: 14    GdWo :  54  
##             Median : 480.0   Median :2.000   Gd:488    MnPrv: 157  
##             Mean   : 473.0   Mean   :1.767   TA:906    MnWw :  11  
##             3rd Qu.: 576.0   3rd Qu.:2.000             NA's :1179  
##             Max.   :1418.0   Max.   :4.000                         
##                                                                    
##    Fireplaces    FireplaceQu  Foundation    HouseStyle  KitchenQual
##  Min.   :0.000   Ex  : 24    BrkTil:146   1Story :726   Ex:100     
##  1st Qu.:0.000   Fa  : 33    CBlock:634   2Story :445   Fa: 39     
##  Median :1.000   Gd  :380    PConc :647   1.5Fin :154   Gd:586     
##  Mean   :0.613   Po  : 20    Slab  : 24   SLvl   : 65   TA:735     
##  3rd Qu.:1.000   TA  :313    Stone :  6   SFoyer : 37              
##  Max.   :3.000   NA's:690    Wood  :  3   1.5Unf : 14              
##                                           (Other): 19              
##  LandContour LandSlope     LotArea         MasVnrArea    
##  Bnk:  63    Gtl:1382   Min.   :  1300   Min.   :   0.0  
##  HLS:  50    Mod:  65   1st Qu.:  7554   1st Qu.:   0.0  
##  Low:  36    Sev:  13   Median :  9478   Median :   0.0  
##  Lvl:1311               Mean   : 10517   Mean   : 103.7  
##                         3rd Qu.: 11602   3rd Qu.: 166.0  
##                         Max.   :215245   Max.   :1600.0  
##                                          NA's   :8       
##     MiscVal          Neighborhood  OverallCond     OverallQual    
##  Min.   :    0.00   NAmes  :225   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:    0.00   CollgCr:150   1st Qu.:5.000   1st Qu.: 5.000  
##  Median :    0.00   OldTown:113   Median :5.000   Median : 6.000  
##  Mean   :   43.49   Edwards:100   Mean   :5.575   Mean   : 6.099  
##  3rd Qu.:    0.00   Somerst: 86   3rd Qu.:6.000   3rd Qu.: 7.000  
##  Max.   :15500.00   Gilbert: 79   Max.   :9.000   Max.   :10.000  
##                     (Other):707                                   
##     PoolArea         RoofStyle      YearBuilt     YearRemodAdd 
##  Min.   :  0.000   Flat   :  13   Min.   :1872   Min.   :1950  
##  1st Qu.:  0.000   Gable  :1141   1st Qu.:1954   1st Qu.:1967  
##  Median :  0.000   Gambrel:  11   Median :1973   Median :1994  
##  Mean   :  2.759   Hip    : 286   Mean   :1971   Mean   :1985  
##  3rd Qu.:  0.000   Mansard:   7   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :738.000   Shed   :   2   Max.   :2010   Max.   :2010  
##

From the above we see that the Adjusted R-squared value is high, but the degrees of freedom is low (49). This is due to the fact that a number of observations have been dropped. From the summary of the dataframe, we see that the fields “Fence” and “FireplaceQu” have a high number of NAs. Therefore we exclude them from the model.

df.train = df.train %>% dplyr::select(-Fence, -FireplaceQu)
regr = lm(df.train)
check_model(regr)

## 
## Call:
## lm(formula = df.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -311457  -15589    -831   13665  273845 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.193e+05  1.996e+05  -2.601 0.009392 ** 
## BldgType2fmCon      -2.050e+03  6.670e+03  -0.307 0.758600    
## BldgTypeDuplex       5.642e+03  5.997e+03   0.941 0.346961    
## BldgTypeTwnhs       -4.096e+04  6.790e+03  -6.032 2.10e-09 ***
## BldgTypeTwnhsE      -3.821e+04  4.290e+03  -8.907  < 2e-16 ***
## BsmtCondGd           3.527e+03  7.049e+03   0.500 0.616874    
## BsmtCondPo           4.139e+04  2.604e+04   1.589 0.112207    
## BsmtCondTA           5.689e+03  5.541e+03   1.027 0.304756    
## BsmtExposureGd       2.218e+04  4.033e+03   5.500 4.54e-08 ***
## BsmtExposureMn      -9.737e+02  4.124e+03  -0.236 0.813384    
## BsmtExposureNo      -8.296e+03  2.946e+03  -2.816 0.004932 ** 
## BsmtQualFa          -4.094e+04  8.292e+03  -4.937 8.92e-07 ***
## BsmtQualGd          -3.097e+04  4.414e+03  -7.015 3.64e-12 ***
## BsmtQualTA          -3.507e+04  5.407e+03  -6.485 1.24e-10 ***
## CentralAirY          2.649e+03  4.746e+03   0.558 0.576848    
## GarageArea           1.402e+01  9.539e+00   1.470 0.141865    
## GarageCars           1.028e+04  2.842e+03   3.618 0.000308 ***
## ExterQualFa         -2.298e+04  1.291e+04  -1.779 0.075430 .  
## ExterQualGd         -1.854e+04  6.356e+03  -2.917 0.003591 ** 
## ExterQualTA         -2.377e+04  7.011e+03  -3.390 0.000719 ***
## Fireplaces           1.210e+04  1.717e+03   7.046 2.94e-12 ***
## FoundationCBlock     5.228e+03  4.166e+03   1.255 0.209713    
## FoundationPConc      8.146e+03  4.621e+03   1.763 0.078152 .  
## FoundationStone      1.273e+04  1.420e+04   0.896 0.370232    
## FoundationWood       1.071e+04  1.994e+04   0.537 0.591206    
## HouseStyle1.5Unf    -2.081e+04  9.418e+03  -2.210 0.027302 *  
## HouseStyle1Story    -1.128e+04  3.594e+03  -3.139 0.001730 ** 
## HouseStyle2.5Fin     4.422e+04  1.266e+04   3.493 0.000493 ***
## HouseStyle2.5Unf     4.544e+03  1.091e+04   0.417 0.677108    
## HouseStyle2Story    -1.960e+03  3.742e+03  -0.524 0.600613    
## HouseStyleSFoyer    -3.070e+04  7.374e+03  -4.163 3.34e-05 ***
## HouseStyleSLvl      -2.131e+04  5.665e+03  -3.762 0.000176 ***
## KitchenQualFa       -3.274e+04  8.142e+03  -4.021 6.11e-05 ***
## KitchenQualGd       -3.329e+04  4.591e+03  -7.250 7.05e-13 ***
## KitchenQualTA       -3.789e+04  5.204e+03  -7.280 5.69e-13 ***
## LandContourHLS       1.137e+04  6.821e+03   1.667 0.095702 .  
## LandContourLow       2.026e+03  8.488e+03   0.239 0.811375    
## LandContourLvl       1.735e+04  4.886e+03   3.551 0.000396 ***
## LandSlopeMod         1.151e+04  5.243e+03   2.196 0.028268 *  
## LandSlopeSev        -3.051e+04  1.327e+04  -2.299 0.021652 *  
## LotArea              7.985e-01  1.228e-01   6.505 1.10e-10 ***
## MasVnrArea           2.046e+01  6.218e+00   3.290 0.001028 ** 
## MiscVal             -1.009e+00  1.897e+00  -0.532 0.595098    
## NeighborhoodBlueste -7.660e+03  2.517e+04  -0.304 0.760882    
## NeighborhoodBrDale  -1.076e+04  1.353e+04  -0.795 0.426585    
## NeighborhoodBrkSide -1.782e+04  1.112e+04  -1.603 0.109243    
## NeighborhoodClearCr  9.918e+02  1.192e+04   0.083 0.933716    
## NeighborhoodCollgCr -1.167e+04  9.329e+03  -1.251 0.211224    
## NeighborhoodCrawfor  1.446e+04  1.087e+04   1.331 0.183572    
## NeighborhoodEdwards -2.510e+04  1.017e+04  -2.468 0.013695 *  
## NeighborhoodGilbert -2.076e+04  9.948e+03  -2.087 0.037053 *  
## NeighborhoodIDOTRR  -3.351e+04  1.172e+04  -2.860 0.004303 ** 
## NeighborhoodMeadowV  4.281e+03  1.257e+04   0.341 0.733495    
## NeighborhoodMitchel -2.174e+04  1.043e+04  -2.084 0.037361 *  
## NeighborhoodNAmes   -1.759e+04  9.900e+03  -1.777 0.075817 .  
## NeighborhoodNoRidge  6.186e+04  1.061e+04   5.830 6.96e-09 ***
## NeighborhoodNPkVill  6.038e+03  1.434e+04   0.421 0.673799    
## NeighborhoodNridgHt  2.007e+04  9.679e+03   2.073 0.038347 *  
## NeighborhoodNWAmes  -1.315e+04  1.019e+04  -1.290 0.197446    
## NeighborhoodOldTown -2.699e+04  1.076e+04  -2.508 0.012254 *  
## NeighborhoodSawyer  -1.769e+04  1.039e+04  -1.702 0.088988 .  
## NeighborhoodSawyerW -3.910e+03  9.972e+03  -0.392 0.695086    
## NeighborhoodSomerst  1.948e+03  9.461e+03   0.206 0.836855    
## NeighborhoodStoneBr  5.363e+04  1.074e+04   4.993 6.72e-07 ***
## NeighborhoodSWISU   -1.045e+04  1.248e+04  -0.838 0.402405    
## NeighborhoodTimber  -1.369e+04  1.061e+04  -1.290 0.197413    
## NeighborhoodVeenker  1.202e+04  1.327e+04   0.906 0.365254    
## OverallCond          3.752e+03  1.083e+03   3.465 0.000548 ***
## OverallQual          1.381e+04  1.263e+03  10.934  < 2e-16 ***
## PoolArea             6.747e+01  2.246e+01   3.004 0.002710 ** 
## RoofStyleGable      -1.891e+03  1.105e+04  -0.171 0.864087    
## RoofStyleGambrel     3.983e+03  1.507e+04   0.264 0.791594    
## RoofStyleHip         4.768e+03  1.124e+04   0.424 0.671411    
## RoofStyleMansard     1.733e+04  1.698e+04   1.020 0.307775    
## RoofStyleShed        3.800e+04  2.684e+04   1.416 0.157091    
## YearBuilt            1.175e+02  8.670e+01   1.355 0.175574    
## YearRemodAdd         2.018e+02  7.077e+01   2.851 0.004424 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32710 on 1337 degrees of freedom
##   (46 observations deleted due to missingness)
## Multiple R-squared:  0.8389, Adjusted R-squared:  0.8297 
## F-statistic:  91.6 on 76 and 1337 DF,  p-value: < 2.2e-16
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -311500  -15590    -831       0   13660  273800

qqnorm(residuals(regr))

As a result the Adjusted R-squared value is now lower but the degrees of freedom is higher. This is preferable because it allows us to use more of the training observations.

Next, we can remove a number of variables that remain in the model and that have high p-values.

df.train = df.train %>% dplyr::select(-RoofStyle, -Foundation, -BsmtCond, -MiscVal)
regr = lm(df.train)
check_model(regr)

## 
## Call:
## lm(formula = df.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -310149  -16203    -846   13252  277035 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.766e+05  1.912e+05  -3.016 0.002613 ** 
## BldgType2fmCon      -8.140e+02  6.634e+03  -0.123 0.902363    
## BldgTypeDuplex       5.789e+03  5.874e+03   0.986 0.324520    
## BldgTypeTwnhs       -4.118e+04  6.797e+03  -6.059 1.77e-09 ***
## BldgTypeTwnhsE      -3.833e+04  4.285e+03  -8.945  < 2e-16 ***
## BsmtExposureGd       2.233e+04  3.988e+03   5.601 2.58e-08 ***
## BsmtExposureMn      -9.110e+02  4.114e+03  -0.221 0.824780    
## BsmtExposureNo      -8.097e+03  2.945e+03  -2.750 0.006044 ** 
## BsmtQualFa          -4.098e+04  8.080e+03  -5.072 4.49e-07 ***
## BsmtQualGd          -3.210e+04  4.386e+03  -7.317 4.33e-13 ***
## BsmtQualTA          -3.594e+04  5.319e+03  -6.757 2.09e-11 ***
## CentralAirY          3.736e+03  4.670e+03   0.800 0.423936    
## GarageArea           1.387e+01  9.527e+00   1.456 0.145734    
## GarageCars           1.037e+04  2.837e+03   3.655 0.000268 ***
## ExterQualFa         -2.227e+04  1.274e+04  -1.749 0.080578 .  
## ExterQualGd         -2.007e+04  6.319e+03  -3.176 0.001525 ** 
## ExterQualTA         -2.618e+04  6.954e+03  -3.764 0.000174 ***
## Fireplaces           1.199e+04  1.705e+03   7.030 3.27e-12 ***
## HouseStyle1.5Unf    -2.219e+04  9.398e+03  -2.361 0.018373 *  
## HouseStyle1Story    -9.547e+03  3.507e+03  -2.723 0.006562 ** 
## HouseStyle2.5Fin     4.400e+04  1.265e+04   3.479 0.000519 ***
## HouseStyle2.5Unf     4.402e+03  1.089e+04   0.404 0.686233    
## HouseStyle2Story    -6.899e+02  3.685e+03  -0.187 0.851535    
## HouseStyleSFoyer    -3.019e+04  7.308e+03  -4.131 3.83e-05 ***
## HouseStyleSLvl      -1.986e+04  5.605e+03  -3.543 0.000409 ***
## KitchenQualFa       -3.282e+04  8.117e+03  -4.044 5.56e-05 ***
## KitchenQualGd       -3.330e+04  4.590e+03  -7.254 6.79e-13 ***
## KitchenQualTA       -3.806e+04  5.187e+03  -7.338 3.73e-13 ***
## LandContourHLS       1.172e+04  6.777e+03   1.730 0.083882 .  
## LandContourLow       3.150e+03  8.402e+03   0.375 0.707779    
## LandContourLvl       1.722e+04  4.872e+03   3.534 0.000423 ***
## LandSlopeMod         1.175e+04  5.195e+03   2.262 0.023859 *  
## LandSlopeSev        -2.414e+04  1.228e+04  -1.966 0.049476 *  
## LotArea              7.619e-01  1.186e-01   6.425 1.82e-10 ***
## MasVnrArea           2.286e+01  6.127e+00   3.730 0.000199 ***
## NeighborhoodBlueste -9.145e+03  2.510e+04  -0.364 0.715705    
## NeighborhoodBrDale  -1.329e+04  1.343e+04  -0.990 0.322272    
## NeighborhoodBrkSide -1.951e+04  1.107e+04  -1.762 0.078312 .  
## NeighborhoodClearCr  9.989e+02  1.175e+04   0.085 0.932274    
## NeighborhoodCollgCr -1.229e+04  9.334e+03  -1.317 0.188066    
## NeighborhoodCrawfor  1.440e+04  1.088e+04   1.324 0.185760    
## NeighborhoodEdwards -2.487e+04  1.018e+04  -2.444 0.014663 *  
## NeighborhoodGilbert -2.073e+04  9.943e+03  -2.085 0.037239 *  
## NeighborhoodIDOTRR  -3.214e+04  1.171e+04  -2.745 0.006123 ** 
## NeighborhoodMeadowV  4.433e+03  1.255e+04   0.353 0.723894    
## NeighborhoodMitchel -2.148e+04  1.043e+04  -2.059 0.039663 *  
## NeighborhoodNAmes   -1.703e+04  9.850e+03  -1.729 0.084077 .  
## NeighborhoodNoRidge  6.218e+04  1.062e+04   5.855 5.97e-09 ***
## NeighborhoodNPkVill  4.340e+03  1.418e+04   0.306 0.759657    
## NeighborhoodNridgHt  1.936e+04  9.681e+03   1.999 0.045761 *  
## NeighborhoodNWAmes  -1.266e+04  1.012e+04  -1.251 0.211068    
## NeighborhoodOldTown -2.717e+04  1.075e+04  -2.529 0.011562 *  
## NeighborhoodSawyer  -1.602e+04  1.034e+04  -1.548 0.121744    
## NeighborhoodSawyerW -4.886e+03  9.973e+03  -0.490 0.624304    
## NeighborhoodSomerst  7.319e+02  9.460e+03   0.077 0.938339    
## NeighborhoodStoneBr  5.310e+04  1.075e+04   4.938 8.86e-07 ***
## NeighborhoodSWISU   -1.179e+04  1.245e+04  -0.947 0.343724    
## NeighborhoodTimber  -1.350e+04  1.060e+04  -1.274 0.203008    
## NeighborhoodVeenker  1.158e+04  1.327e+04   0.873 0.382800    
## OverallCond          3.596e+03  1.034e+03   3.478 0.000521 ***
## OverallQual          1.418e+04  1.244e+03  11.393  < 2e-16 ***
## PoolArea             6.898e+01  2.215e+01   3.115 0.001880 ** 
## YearBuilt            1.563e+02  8.022e+01   1.949 0.051551 .  
## YearRemodAdd         1.973e+02  6.934e+01   2.846 0.004496 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32770 on 1350 degrees of freedom
##   (46 observations deleted due to missingness)
## Multiple R-squared:  0.8367, Adjusted R-squared:  0.8291 
## F-statistic: 109.8 on 63 and 1350 DF,  p-value: < 2.2e-16
## 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -310100.0  -16200.0    -845.9       0.0   13250.0  277000.0

qqnorm(residuals(regr))

Perform predictions using the test.csv file and create Kaggle submission csv file.

df.test = read.csv("test.csv")

df.test = df.test %>% dplyr::select(Id,
                                    BldgType,
                                    # BsmtCond,
                                    BsmtExposure,
                                    BsmtQual,
                                    CentralAir,
                                    GarageArea,
                                    GarageCars,
                                    Exterior1st,
                                    ExterQual,
                                    # Fence,
                                    Fireplaces,
                                    # FireplaceQu,
                                    # Foundation,
                                    HouseStyle,
                                    KitchenQual,
                                    LandContour,
                                    LandSlope,
                                    LotArea,
                                    MasVnrArea,
                                    # MiscVal,
                                    Neighborhood,
                                    OverallCond,
                                    OverallQual,
                                    PoolArea,
                                    # # PoolQC,
                                    # RoofStyle,
                                    # # Street,
                                    YearBuilt,
                                    YearRemodAdd)

#regr2 = update(regr, na.action=na.exclude)
predictions = predict(regr, df.test %>% dplyr::select(-Id))

predictions[is.na(predictions)] = mean(predictions, na.rm=T)

pred.df = data.frame(df.test$Id, as.numeric(predictions))
colnames(pred.df) = c("Id", "SalePrice")

write.csv(pred.df, file="submission.csv", row.names=F)