Final Exam

Problem 1.

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=\frac{N+1}{2}\).

Probabilities

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

c. \(P(X < x\text{ and } X > y )\)

## The requested probability is 0.42675

Marginal, Joint Probabilities

Investigate whether \(P(X>x\text{ and } Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.

Contingency tables, one with counts and one with probabilities, are below.

Y > y Y <= y
X > x 3760 1240
X <= x 3740 1260

Marginal probabilities:

\(P(Y>y)=\) 0.75.
\(P(Y\le y)=\) 0.25.
\(P(X>x)=\) 0.5.
\(P(X\le x)=\) 0.5.

Joint probabilities:

\(P(X>x\text{ and }Y>y)=\) 0.376.
\(P(X>x\text{ and }Y\le y)=\) 0.124.
\(P(X\le x\text{ and }Y>y)=\) 0.374.
\(P(X\le x\text{ and }Y\le y)=\) 0.126.

These probabilities are confirmed in the table below.

Y > y Y <= y
X > x 0.376 0.124
X <= x 0.374 0.126

Further, the joint probability \(P(X>x\text{ and }Y>y)=\) 0.376.

The product of marginal probabilities

\(P(X>x)=\) 0.5 and
\(P(Y>y)=\) 0.75 is

\(P(X>x)P(Y>y)=\) 0.375.

Fisher’s, Chi-squared

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Fisher’s results:

## 
##  Fisher's Exact Test for Count Data
## 
## data:  XY_matrix
## p-value = 0.6608
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.932132 1.119534
## sample estimates:
## odds ratio 
##   1.021566

Chi-squared results:

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tb_XY
## X-squared = 0.19253, df = 1, p-value = 0.6608

Fisher’s Test is for 2-by-2 contingency tables; chi-square is appropriate when comparing two categorical variables 2 or more categories each.

Since the p-value for both tests are equivalent and both greater than 0.05, we do not reject the null hypothesis of the test (the distribution of X is independent of the distribution of Y) at the 5% significance level. This is the expected result, as the variables were generated randomly and independently.

Problem 2.

Descriptives

Provide univariate descriptive statistics and appropriate plots for the training data set.

Using datatable from DT:

Scatterplot Matrix

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

Using the plot above we selected three variables OverallQual, TotRmsAbvGrd, GrLivArea, and the dependent variable, SalePrice.

Correlation Matrix

Derive a correlation matrix for any three quantitative variables in the dataset.

Let’s test for normality in the three variables. We’ll render histograms and qqplots for the three variables, and follow with Shapiro-Wilkes and Kolmogorov Smirov tests for normality.

GrLivArea clearly does not follow a normal distribution; while OverallQual seems to approximate a discrete normal distribution, TotRmsAbvGrd seems to deviate significantly. Test results below will confirm these observations.

## OverallQual:
## 
##  Shapiro-Wilk normality test
## 
## data:  hpnum_train$OverallQual
## W = 0.94801, p-value < 2.2e-16
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  hpnum_train$OverallQual
## D = 0.99523, p-value < 2.2e-16
## alternative hypothesis: two-sided
## GrLivArea:
## 
##  Shapiro-Wilk normality test
## 
## data:  hpnum_train$GrLivArea
## W = 0.92798, p-value < 2.2e-16
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  hpnum_train$GrLivArea
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
## TotRmsAbvGrd:
## 
##  Shapiro-Wilk normality test
## 
## data:  hpnum_train$TotRmsAbvGrd
## W = 0.94228, p-value < 2.2e-16
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  hpnum_train$TotRmsAbvGrd
## D = 0.99797, p-value < 2.2e-16
## alternative hypothesis: two-sided

According to our analyses, the variables are not normally distributed. This determines the method of the correlation tests; we’ll use method = "kendall", a nonparametric test.

Correlation Tests

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

Because we are using a nonparametric test method, cor.test will test for significance at the .95 level.

## Testing OverallQual and GrLivArea,    p-val: 6.870239e-130    We reject the null hypothesis.
## Testing OverallQual and TotRmsAbvGrd, p-val: 3.056094e-64     We reject the null hypothesis.
## Testing TotRmsAbvGrd and GrLivArea,   p-val: 8.066909e-285    We reject the null hypothesis.

To produce the desired confidence intervals for Kendall tau we use kendall.ci from the package NSM3 without the bootstrap option.

## OverallQual and GrLivArea:
## 
## 1 - alpha = 0.8 two-sided CI for tau:
## 0.447, 0.481
## OverallQual and TotRmsAbvGrd:
## 
## 1 - alpha = 0.8 two-sided CI for tau:
## 0.331, 0.371
## TotRmsAbvGrd and GrLivArea:
## 
## 1 - alpha = 0.8 two-sided CI for tau:
## 0.672, 0.697

Discussion

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

As expected, each of the confidence intervals do not span zero, confirming our rejection of the null hypothesis. Each of the pairs of variables are correlated. Given the meaning of each variable this result is logical. With these findings I would not be concerned with familywise error; even using the Bonferroni correction method, each of the p-values are much less than \(0.05/3\approx 0.015\), confirming our original result.

LU Decomposition

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Let’s check that the determinate does not equal zero.

## [1] 0.4161713

Since the determinate does not equal zero, the inverse exists. We use the library matlib to find the inverse. The product of the correlation matrix and the precision matrix:

##              [,1]          [,2]          [,3]
## [1,] 1.000000e+00  1.526603e-09 -6.066037e-11
## [2,] 5.717334e-09  1.000000e+00 -4.510915e-09
## [3,] 5.234534e-09 -3.307733e-09  1.000000e+00
## which is effectively
##      [,1] [,2] [,3]
## [1,] 1    0    0   
## [2,] 0    1    0   
## [3,] 0    0    1

The product of the precision matrix and the correlation matrix:

##               [,1]          [,2]          [,3]
## [1,]  1.000000e+00  5.717334e-09  5.234534e-09
## [2,]  1.526603e-09  1.000000e+00 -3.307733e-09
## [3,] -6.066037e-11 -4.510915e-09  1.000000e+00
## which is effectively
##      [,1] [,2] [,3]
## [1,] 1    0    0   
## [2,] 0    1    0   
## [3,] 0    0    1

We’ll obtain the lower and upper decompositions of the matrix and its inverse using the package pracma.

## Decomposition of the correlation matrix:
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.3513492 1.0000000    0
## [3,] 0.4641893 0.5944486    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.3513492 0.4641893
## [2,]    0 0.8765538 0.5210662
## [3,]    0 0.0000000 0.4747812
## Decomposition of the inverse:
## $L
##             [,1]       [,2] [,3]
## [1,]  1.00000000  0.0000000    0
## [2,] -0.06348622  1.0000000    0
## [3,] -0.42075466 -0.6841587    1
## 
## $U
##          [,1]        [,2]       [,3]
## [1,] 1.278144 -0.08114454 -0.5377851
## [2,] 0.000000  1.87995768 -1.2861894
## [3,] 0.000000  0.00000000  1.0000000

Exponential PDF

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html).

According to the histogram of GrLivArea above, the variable is skewed to the right. The MASS package was loaded at an earlier point in this markdown. Let’s test that the mean is greater than the median in the training set and the minimum value of the variable is greater than zero.

## [1] TRUE
## [1] TRUE

We’ll use fitdistr to fit the pdf.

##        rate    
##   6.598640e-04 
##  (1.726943e-05)

\(\lambda\) Histogram

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable.

The distributions look very different with the axes set for optimum comparison. To continue experimentation, I would try a truncated negative binomial distribution to approximate the original variable.

CDF Percentiles

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

For \(x < 0, x=0\); for \(x \ge 0\) we have:

\[\begin{aligned} F(x;\lambda)= P(X\le x)&=\int_{0}^{x} \lambda e^{-\lambda x}\\ &=\bigg[-e^{-\lambda x}\bigg]_0^{x}\\ &=1 - e^{-\lambda x}\\ \end{aligned}\] To find the 5th percentile:

\[\begin{aligned} F(x;0.000659864 )= P(X\le x)&=1 - e^{-0.000659864 x}=.05\\ \implies 0.95 &= e^{-0.000659864 x}\\ \log(0.95)&=-0.000659864 x\\ x&=\frac{\log(95)}{-0.000659864}\\ x&=77.73313\\ \end{aligned}\]

To find the 95th percentile:

\[\begin{aligned} F(x;0.000659864 )= P(X\le x)&=1 - e^{-0.000659864 x}=.95\\ \implies 0.05 &= e^{-0.000659864 x}\\ \log(0.05)&=-0.000659864 x\\ x&=\frac{\log(05)}{-0.000659864}\\ x&=4539.924\\ \end{aligned}\]

Programmatically:

## The 5th percentile: 77.73313
## The 95th percentile: 4539.924

Empirical CI

Also generate a 95% confidence interval from the empirical data, assuming normality.

The data in hp_train$GrLivArea are not normally distributed; however, we will assume normality.

## Assuming normality, the 95% confidence interval is ( 1488.509 , 1542.418 )

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

## The empirical 5th and 95th percentiles are 848 and 2466.1 respectively.

The empirical 5th and 95th percentiles (848 and 2466.1) are quite different than the confidence intervals calculated while assuming normality (1488.509 and 1542.418). Our variable, GrLivArea, is not normally distributed. The mean is greater than the median; further, error, mean and standard deviation should be recalculated to estimate proper distribution, as there are many more observations with value lesser than the mean than there are greater than the mean.

Let’s try the negative binomial distribution instead. A comparison of the histograms is below.

The fit is much better and the histograms closely resemble one another. Let’s use these parameters to calculate the confidence interval.

## Using the negative binomial parameters, the 95% confidence interval is ( 607.709 , 2423.218 )

Our estimated confidence interval (607.709, 2423.218) much more closely resembles the percentiles from the empirical data, 848 and 2466.1. Model estimation and fit improves drastically when the appropriate model is selected.

Regression Result

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Let’s use the descdist rendering to suggest a distribution for the dependent variable.

## summary statistics
## ------
## min:  34900   max:  755000 
## median:  163000 
## mean:  180921.2 
## estimated sd:  79442.5 
## estimated skewness:  1.882876 
## estimated kurtosis:  9.536282

From the figure above, it looks as if the lognormal distribution most closely approximates the empirical values in our data. Let’s compare with negative binomial.

We’ll select a lognormal distribution, as the negative binomial distribution fit underperforms.

The transformation results in a nearly normal distribution. Let’s fit a model with the numeric variables.

Method: “stepAIC”

We begin by starting over with a fresh dataset and imputing missing values in the numeric variables.

Before imputing missing data we assess the percentage of missing in each variable and omit those missing more than 7% in observations.

##    MSSubClass      MSZoning   LotFrontage       LotArea        Street 
##    0.00000000    0.00000000   17.73972603    0.00000000    0.00000000 
##         Alley      LotShape   LandContour     Utilities     LotConfig 
##   93.76712329    0.00000000    0.00000000    0.00000000    0.00000000 
##     LandSlope  Neighborhood    Condition1    Condition2      BldgType 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##    HouseStyle   OverallQual   OverallCond     YearBuilt  YearRemodAdd 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##     RoofStyle      RoofMatl   Exterior1st   Exterior2nd    MasVnrType 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.54794521 
##    MasVnrArea     ExterQual     ExterCond    Foundation      BsmtQual 
##    0.54794521    0.00000000    0.00000000    0.00000000    2.53424658 
##      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1  BsmtFinType2 
##    2.53424658    2.60273973    2.53424658    0.00000000    2.60273973 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating     HeatingQC 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF  LowQualFinSF 
##    0.00000000    0.06849315    0.00000000    0.00000000    0.00000000 
##     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd    Functional 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##    Fireplaces   FireplaceQu    GarageType   GarageYrBlt  GarageFinish 
##    0.00000000   47.26027397    5.54794521    5.54794521    5.54794521 
##    GarageCars    GarageArea    GarageQual    GarageCond    PavedDrive 
##    0.00000000    0.00000000    5.54794521    5.54794521    0.00000000 
##    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch   ScreenPorch 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
##      PoolArea        PoolQC         Fence   MiscFeature       MiscVal 
##    0.00000000   99.52054795   80.75342466   96.30136986    0.00000000 
##        MoSold        YrSold      SaleType SaleCondition     SalePrice 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000

Let’s eliminate the variables missing more than 7%.

Let’s impute the missing data using the MICE package.

## 
##  iter imp variable
##   1   1  MasVnrArea*  GarageYrBlt*
##   1   2  MasVnrArea*  GarageYrBlt*
##   1   3  MasVnrArea  GarageYrBlt
##   1   4  MasVnrArea  GarageYrBlt*
##   1   5  MasVnrArea*  GarageYrBlt*
##   2   1  MasVnrArea  GarageYrBlt
##   2   2  MasVnrArea  GarageYrBlt*
##   2   3  MasVnrArea*  GarageYrBlt*
##   2   4  MasVnrArea  GarageYrBlt
##   2   5  MasVnrArea  GarageYrBlt*
##   3   1  MasVnrArea*  GarageYrBlt*
##   3   2  MasVnrArea*  GarageYrBlt*
##   3   3  MasVnrArea  GarageYrBlt*
##   3   4  MasVnrArea  GarageYrBlt
##   3   5  MasVnrArea  GarageYrBlt*
##   4   1  MasVnrArea*  GarageYrBlt*
##   4   2  MasVnrArea*  GarageYrBlt*
##   4   3  MasVnrArea  GarageYrBlt*
##   4   4  MasVnrArea*  GarageYrBlt*
##   4   5  MasVnrArea*  GarageYrBlt*
##   5   1  MasVnrArea*  GarageYrBlt
##   5   2  MasVnrArea*  GarageYrBlt*
##   5   3  MasVnrArea*  GarageYrBlt*
##   5   4  MasVnrArea  GarageYrBlt*
##   5   5  MasVnrArea*  GarageYrBlt*
##   6   1  MasVnrArea  GarageYrBlt*
##   6   2  MasVnrArea*  GarageYrBlt
##   6   3  MasVnrArea  GarageYrBlt*
##   6   4  MasVnrArea  GarageYrBlt*
##   6   5  MasVnrArea*  GarageYrBlt*
##   7   1  MasVnrArea  GarageYrBlt
##   7   2  MasVnrArea*  GarageYrBlt*
##   7   3  MasVnrArea*  GarageYrBlt
##   7   4  MasVnrArea  GarageYrBlt
##   7   5  MasVnrArea  GarageYrBlt
##   8   1  MasVnrArea  GarageYrBlt
##   8   2  MasVnrArea*  GarageYrBlt*
##   8   3  MasVnrArea*  GarageYrBlt*
##   8   4  MasVnrArea*  GarageYrBlt*
##   8   5  MasVnrArea*  GarageYrBlt
##   9   1  MasVnrArea*  GarageYrBlt*
##   9   2  MasVnrArea*  GarageYrBlt*
##   9   3  MasVnrArea*  GarageYrBlt
##   9   4  MasVnrArea  GarageYrBlt
##   9   5  MasVnrArea  GarageYrBlt*
##   10   1  MasVnrArea  GarageYrBlt
##   10   2  MasVnrArea  GarageYrBlt*
##   10   3  MasVnrArea*  GarageYrBlt*
##   10   4  MasVnrArea  GarageYrBlt
##   10   5  MasVnrArea  GarageYrBlt
##   11   1  MasVnrArea  GarageYrBlt*
##   11   2  MasVnrArea*  GarageYrBlt*
##   11   3  MasVnrArea*  GarageYrBlt*
##   11   4  MasVnrArea  GarageYrBlt*
##   11   5  MasVnrArea  GarageYrBlt
##   12   1  MasVnrArea*  GarageYrBlt*
##   12   2  MasVnrArea*  GarageYrBlt
##   12   3  MasVnrArea  GarageYrBlt*
##   12   4  MasVnrArea  GarageYrBlt*
##   12   5  MasVnrArea*  GarageYrBlt*
##   13   1  MasVnrArea  GarageYrBlt*
##   13   2  MasVnrArea*  GarageYrBlt*
##   13   3  MasVnrArea  GarageYrBlt
##   13   4  MasVnrArea  GarageYrBlt*
##   13   5  MasVnrArea*  GarageYrBlt*
##   14   1  MasVnrArea*  GarageYrBlt*
##   14   2  MasVnrArea*  GarageYrBlt*
##   14   3  MasVnrArea*  GarageYrBlt
##   14   4  MasVnrArea*  GarageYrBlt*
##   14   5  MasVnrArea  GarageYrBlt
##   15   1  MasVnrArea  GarageYrBlt
##   15   2  MasVnrArea  GarageYrBlt*
##   15   3  MasVnrArea*  GarageYrBlt*
##   15   4  MasVnrArea  GarageYrBlt
##   15   5  MasVnrArea  GarageYrBlt*
##   16   1  MasVnrArea  GarageYrBlt
##   16   2  MasVnrArea*  GarageYrBlt*
##   16   3  MasVnrArea*  GarageYrBlt
##   16   4  MasVnrArea*  GarageYrBlt*
##   16   5  MasVnrArea*  GarageYrBlt*
##   17   1  MasVnrArea*  GarageYrBlt*
##   17   2  MasVnrArea*  GarageYrBlt*
##   17   3  MasVnrArea  GarageYrBlt*
##   17   4  MasVnrArea  GarageYrBlt*
##   17   5  MasVnrArea*  GarageYrBlt*
##   18   1  MasVnrArea*  GarageYrBlt
##   18   2  MasVnrArea  GarageYrBlt*
##   18   3  MasVnrArea  GarageYrBlt*
##   18   4  MasVnrArea*  GarageYrBlt*
##   18   5  MasVnrArea*  GarageYrBlt*
##   19   1  MasVnrArea  GarageYrBlt*
##   19   2  MasVnrArea*  GarageYrBlt*
##   19   3  MasVnrArea*  GarageYrBlt*
##   19   4  MasVnrArea  GarageYrBlt
##   19   5  MasVnrArea  GarageYrBlt
##   20   1  MasVnrArea  GarageYrBlt*
##   20   2  MasVnrArea*  GarageYrBlt*
##   20   3  MasVnrArea  GarageYrBlt*
##   20   4  MasVnrArea*  GarageYrBlt*
##   20   5  MasVnrArea*  GarageYrBlt*
##   21   1  MasVnrArea  GarageYrBlt*
##   21   2  MasVnrArea*  GarageYrBlt*
##   21   3  MasVnrArea*  GarageYrBlt*
##   21   4  MasVnrArea  GarageYrBlt*
##   21   5  MasVnrArea*  GarageYrBlt*
##   22   1  MasVnrArea*  GarageYrBlt*
##   22   2  MasVnrArea  GarageYrBlt
##   22   3  MasVnrArea  GarageYrBlt
##   22   4  MasVnrArea*  GarageYrBlt
##   22   5  MasVnrArea*  GarageYrBlt*
##   23   1  MasVnrArea*  GarageYrBlt*
##   23   2  MasVnrArea  GarageYrBlt
##   23   3  MasVnrArea*  GarageYrBlt*
##   23   4  MasVnrArea*  GarageYrBlt*
##   23   5  MasVnrArea*  GarageYrBlt*
##   24   1  MasVnrArea  GarageYrBlt*
##   24   2  MasVnrArea*  GarageYrBlt*
##   24   3  MasVnrArea  GarageYrBlt*
##   24   4  MasVnrArea  GarageYrBlt*
##   24   5  MasVnrArea*  GarageYrBlt*
##   25   1  MasVnrArea*  GarageYrBlt*
##   25   2  MasVnrArea  GarageYrBlt*
##   25   3  MasVnrArea  GarageYrBlt
##   25   4  MasVnrArea*  GarageYrBlt
##   25   5  MasVnrArea*  GarageYrBlt
##   26   1  MasVnrArea*  GarageYrBlt*
##   26   2  MasVnrArea*  GarageYrBlt*
##   26   3  MasVnrArea*  GarageYrBlt
##   26   4  MasVnrArea*  GarageYrBlt
##   26   5  MasVnrArea  GarageYrBlt
##   27   1  MasVnrArea  GarageYrBlt
##   27   2  MasVnrArea*  GarageYrBlt*
##   27   3  MasVnrArea  GarageYrBlt*
##   27   4  MasVnrArea*  GarageYrBlt*
##   27   5  MasVnrArea  GarageYrBlt*
##   28   1  MasVnrArea  GarageYrBlt*
##   28   2  MasVnrArea*  GarageYrBlt*
##   28   3  MasVnrArea*  GarageYrBlt*
##   28   4  MasVnrArea  GarageYrBlt
##   28   5  MasVnrArea  GarageYrBlt
##   29   1  MasVnrArea  GarageYrBlt
##   29   2  MasVnrArea*  GarageYrBlt
##   29   3  MasVnrArea  GarageYrBlt
##   29   4  MasVnrArea*  GarageYrBlt*
##   29   5  MasVnrArea*  GarageYrBlt*
##   30   1  MasVnrArea  GarageYrBlt*
##   30   2  MasVnrArea  GarageYrBlt*
##   30   3  MasVnrArea*  GarageYrBlt*
##   30   4  MasVnrArea*  GarageYrBlt
##   30   5  MasVnrArea  GarageYrBlt
##   31   1  MasVnrArea*  GarageYrBlt*
##   31   2  MasVnrArea  GarageYrBlt*
##   31   3  MasVnrArea*  GarageYrBlt*
##   31   4  MasVnrArea  GarageYrBlt
##   31   5  MasVnrArea*  GarageYrBlt*
##   32   1  MasVnrArea  GarageYrBlt*
##   32   2  MasVnrArea*  GarageYrBlt*
##   32   3  MasVnrArea  GarageYrBlt
##   32   4  MasVnrArea  GarageYrBlt
##   32   5  MasVnrArea  GarageYrBlt
##   33   1  MasVnrArea*  GarageYrBlt*
##   33   2  MasVnrArea*  GarageYrBlt*
##   33   3  MasVnrArea*  GarageYrBlt*
##   33   4  MasVnrArea  GarageYrBlt
##   33   5  MasVnrArea*  GarageYrBlt*
##   34   1  MasVnrArea  GarageYrBlt
##   34   2  MasVnrArea*  GarageYrBlt*
##   34   3  MasVnrArea*  GarageYrBlt*
##   34   4  MasVnrArea*  GarageYrBlt*
##   34   5  MasVnrArea  GarageYrBlt
##   35   1  MasVnrArea*  GarageYrBlt*
##   35   2  MasVnrArea*  GarageYrBlt*
##   35   3  MasVnrArea*  GarageYrBlt*
##   35   4  MasVnrArea*  GarageYrBlt*
##   35   5  MasVnrArea*  GarageYrBlt
##   36   1  MasVnrArea  GarageYrBlt*
##   36   2  MasVnrArea  GarageYrBlt*
##   36   3  MasVnrArea*  GarageYrBlt*
##   36   4  MasVnrArea*  GarageYrBlt*
##   36   5  MasVnrArea*  GarageYrBlt*
##   37   1  MasVnrArea*  GarageYrBlt*
##   37   2  MasVnrArea*  GarageYrBlt*
##   37   3  MasVnrArea  GarageYrBlt*
##   37   4  MasVnrArea  GarageYrBlt*
##   37   5  MasVnrArea*  GarageYrBlt*
##   38   1  MasVnrArea*  GarageYrBlt*
##   38   2  MasVnrArea*  GarageYrBlt*
##   38   3  MasVnrArea*  GarageYrBlt
##   38   4  MasVnrArea  GarageYrBlt*
##   38   5  MasVnrArea  GarageYrBlt
##   39   1  MasVnrArea*  GarageYrBlt
##   39   2  MasVnrArea  GarageYrBlt*
##   39   3  MasVnrArea*  GarageYrBlt*
##   39   4  MasVnrArea*  GarageYrBlt
##   39   5  MasVnrArea*  GarageYrBlt*
##   40   1  MasVnrArea*  GarageYrBlt
##   40   2  MasVnrArea*  GarageYrBlt
##   40   3  MasVnrArea*  GarageYrBlt*
##   40   4  MasVnrArea*  GarageYrBlt*
##   40   5  MasVnrArea*  GarageYrBlt
##   41   1  MasVnrArea*  GarageYrBlt*
##   41   2  MasVnrArea*  GarageYrBlt*
##   41   3  MasVnrArea*  GarageYrBlt
##   41   4  MasVnrArea  GarageYrBlt*
##   41   5  MasVnrArea*  GarageYrBlt*
##   42   1  MasVnrArea*  GarageYrBlt*
##   42   2  MasVnrArea*  GarageYrBlt*
##   42   3  MasVnrArea  GarageYrBlt*
##   42   4  MasVnrArea*  GarageYrBlt
##   42   5  MasVnrArea*  GarageYrBlt*
##   43   1  MasVnrArea  GarageYrBlt
##   43   2  MasVnrArea  GarageYrBlt
##   43   3  MasVnrArea*  GarageYrBlt*
##   43   4  MasVnrArea*  GarageYrBlt*
##   43   5  MasVnrArea*  GarageYrBlt*
##   44   1  MasVnrArea  GarageYrBlt
##   44   2  MasVnrArea*  GarageYrBlt*
##   44   3  MasVnrArea*  GarageYrBlt*
##   44   4  MasVnrArea  GarageYrBlt*
##   44   5  MasVnrArea*  GarageYrBlt
##   45   1  MasVnrArea  GarageYrBlt*
##   45   2  MasVnrArea  GarageYrBlt
##   45   3  MasVnrArea  GarageYrBlt
##   45   4  MasVnrArea*  GarageYrBlt*
##   45   5  MasVnrArea*  GarageYrBlt
##   46   1  MasVnrArea  GarageYrBlt*
##   46   2  MasVnrArea  GarageYrBlt
##   46   3  MasVnrArea*  GarageYrBlt*
##   46   4  MasVnrArea*  GarageYrBlt*
##   46   5  MasVnrArea*  GarageYrBlt*
##   47   1  MasVnrArea*  GarageYrBlt*
##   47   2  MasVnrArea  GarageYrBlt
##   47   3  MasVnrArea  GarageYrBlt
##   47   4  MasVnrArea  GarageYrBlt*
##   47   5  MasVnrArea*  GarageYrBlt
##   48   1  MasVnrArea*  GarageYrBlt*
##   48   2  MasVnrArea  GarageYrBlt*
##   48   3  MasVnrArea  GarageYrBlt*
##   48   4  MasVnrArea  GarageYrBlt
##   48   5  MasVnrArea*  GarageYrBlt*
##   49   1  MasVnrArea  GarageYrBlt*
##   49   2  MasVnrArea*  GarageYrBlt*
##   49   3  MasVnrArea  GarageYrBlt*
##   49   4  MasVnrArea  GarageYrBlt*
##   49   5  MasVnrArea*  GarageYrBlt*
##   50   1  MasVnrArea*  GarageYrBlt
##   50   2  MasVnrArea  GarageYrBlt*
##   50   3  MasVnrArea*  GarageYrBlt*
##   50   4  MasVnrArea  GarageYrBlt*
##   50   5  MasVnrArea*  GarageYrBlt*
##  * Please inspect the loggedEvents
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##    MSSubClass       LotArea   OverallQual   OverallCond     YearBuilt 
##            ""            ""            ""            ""            "" 
##  YearRemodAdd    MasVnrArea    BsmtFinSF1    BsmtFinSF2     BsmtUnfSF 
##            ""         "pmm"            ""            ""            "" 
##   TotalBsmtSF     X1stFlrSF     X2ndFlrSF  LowQualFinSF     GrLivArea 
##            ""            ""            ""            ""            "" 
##  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath  BedroomAbvGr 
##            ""            ""            ""            ""            "" 
##  KitchenAbvGr  TotRmsAbvGrd    Fireplaces   GarageYrBlt    GarageCars 
##            ""            ""            ""         "pmm"            "" 
##    GarageArea    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##            ""            ""            ""            ""            "" 
##   ScreenPorch      PoolArea       MiscVal        MoSold        YrSold 
##            ""            ""            ""            ""            "" 
## log_SalePrice 
##            "" 
## PredictorMatrix:
##              MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd
## MSSubClass            0       1           1           1         1            1
## LotArea               1       0           1           1         1            1
## OverallQual           1       1           0           1         1            1
## OverallCond           1       1           1           0         1            1
## YearBuilt             1       1           1           1         0            1
## YearRemodAdd          1       1           1           1         1            0
##              MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF
## MSSubClass            1          1          1         1           1         1
## LotArea               1          1          1         1           1         1
## OverallQual           1          1          1         1           1         1
## OverallCond           1          1          1         1           1         1
## YearBuilt             1          1          1         1           1         1
## YearRemodAdd          1          1          1         1           1         1
##              X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## MSSubClass           1            1         1            1            1
## LotArea              1            1         1            1            1
## OverallQual          1            1         1            1            1
## OverallCond          1            1         1            1            1
## YearBuilt            1            1         1            1            1
## YearRemodAdd         1            1         1            1            1
##              FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## MSSubClass          1        1            1            1            1
## LotArea             1        1            1            1            1
## OverallQual         1        1            1            1            1
## OverallCond         1        1            1            1            1
## YearBuilt           1        1            1            1            1
## YearRemodAdd        1        1            1            1            1
##              Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF
## MSSubClass            1           1          1          1          1
## LotArea               1           1          1          1          1
## OverallQual           1           1          1          1          1
## OverallCond           1           1          1          1          1
## YearBuilt             1           1          1          1          1
## YearRemodAdd          1           1          1          1          1
##              OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal
## MSSubClass             1             1          1           1        1       1
## LotArea                1             1          1           1        1       1
## OverallQual            1             1          1           1        1       1
## OverallCond            1             1          1           1        1       1
## YearBuilt              1             1          1           1        1       1
## YearRemodAdd           1             1          1           1        1       1
##              MoSold YrSold log_SalePrice
## MSSubClass        1      1             1
## LotArea           1      1             1
## OverallQual       1      1             1
## OverallCond       1      1             1
## YearBuilt         1      1             1
## YearRemodAdd      1      1             1
## Number of logged events:  816 
##   it im         dep meth
## 1  1  1  MasVnrArea  pmm
## 2  1  1  MasVnrArea  pmm
## 3  1  1 GarageYrBlt  pmm
## 4  1  1 GarageYrBlt  pmm
## 5  1  2  MasVnrArea  pmm
## 6  1  2  MasVnrArea  pmm
##                                                                                                                                                                                                                                                                                                                       out
## 1                                                                                                                                                                                                                                                                                                BsmtFinSF1, BsmtFullBath
## 2 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model
## 3                                                                                                                                                                                                                                                                                                BsmtFinSF1, BsmtFullBath
## 4 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model
## 5                                                                                                                                                                                                                                                                                                BsmtFinSF1, BsmtFullBath
## 6 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model

Rename the dataframe.

Train the model using lm.

Method: “stepAIC”, Numeric Variables

## 
## Call:
## lm(formula = log_SalePrice ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtFinSF2 + 
##     BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + 
##     FullBath + HalfBath + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + 
##     GarageCars + WoodDeckSF + EnclosedPorch + X3SsnPorch + ScreenPorch + 
##     PoolArea + YrSold, data = hpnum_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92973 -0.06766  0.00243  0.07404  0.55321 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.722e+01  5.869e+00   2.933  0.00341 ** 
## MSSubClass    -6.174e-04  1.087e-04  -5.682 1.61e-08 ***
## LotArea        1.853e-06  4.229e-07   4.382 1.26e-05 ***
## OverallQual    8.434e-02  4.902e-03  17.205  < 2e-16 ***
## OverallCond    4.935e-02  4.258e-03  11.589  < 2e-16 ***
## YearBuilt      2.893e-03  2.534e-04  11.416  < 2e-16 ***
## YearRemodAdd   1.091e-03  2.717e-04   4.014 6.27e-05 ***
## BsmtFinSF1     8.116e-05  1.919e-05   4.228 2.51e-05 ***
## BsmtFinSF2     6.976e-05  2.947e-05   2.367  0.01804 *  
## BsmtUnfSF      5.298e-05  1.752e-05   3.024  0.00254 ** 
## X1stFlrSF      1.964e-04  2.390e-05   8.216 4.67e-16 ***
## X2ndFlrSF      1.694e-04  2.035e-05   8.322  < 2e-16 ***
## LowQualFinSF   1.648e-04  8.281e-05   1.991  0.04671 *  
## BsmtFullBath   5.979e-02  1.046e-02   5.717 1.31e-08 ***
## FullBath       3.772e-02  1.166e-02   3.236  0.00124 ** 
## HalfBath       2.089e-02  1.113e-02   1.876  0.06088 .  
## KitchenAbvGr  -5.147e-02  2.177e-02  -2.364  0.01822 *  
## TotRmsAbvGrd   1.466e-02  4.628e-03   3.167  0.00157 ** 
## Fireplaces     4.566e-02  7.346e-03   6.216 6.68e-10 ***
## GarageCars     7.362e-02  7.101e-03  10.367  < 2e-16 ***
## WoodDeckSF     1.270e-04  3.340e-05   3.801  0.00015 ***
## EnclosedPorch  1.709e-04  7.070e-05   2.417  0.01579 *  
## X3SsnPorch     2.272e-04  1.319e-04   1.723  0.08512 .  
## ScreenPorch    3.642e-04  7.223e-05   5.043 5.18e-07 ***
## PoolArea      -3.863e-04  9.888e-05  -3.907 9.77e-05 ***
## YrSold        -7.223e-03  2.919e-03  -2.475  0.01344 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1464 on 1434 degrees of freedom
## Multiple R-squared:  0.868,  Adjusted R-squared:  0.8657 
## F-statistic: 377.2 on 25 and 1434 DF,  p-value: < 2.2e-16

Method: “leapBackward”, Numeric Variables

## Reordering variables and trying again:
##   nvmax      RMSE  Rsquared        MAE     RMSESD RsquaredSD       MAESD
## 1    20 0.1622507 0.8330429 0.10989897 0.02771617 0.05307579 0.012968060
## 2    21 0.1562831 0.8446512 0.10436740 0.02858296 0.05400624 0.007308561
## 3    22 0.1551216 0.8468942 0.10302751 0.02886732 0.05437192 0.006277085
## 4    23 0.1544610 0.8476513 0.10334411 0.02853211 0.05412154 0.007394441
## 5    24 0.1514962 0.8539444 0.09958089 0.02917332 0.05350025 0.005759485
## 6    25 0.1512278 0.8543889 0.09928569 0.02945537 0.05394382 0.005579948
## 7    26 0.1515411 0.8539754 0.09924665 0.02946447 0.05380142 0.004701382

Let’s eliminate some string variables from the larger dataset and recode them as numeric. Recoding will be accomplished with fastDummies function dummy_cols.

Method: “bayesglm” with Recoded Variables

With all recoding done we’ll rerun the models to assess whether we can achieve a better fit. With the bayesglm method from caret:

##   parameter      RMSE  Rsquared        MAE     RMSESD RsquaredSD       MAESD
## 1      none 0.1471863 0.8506474 0.08454199 0.04800956 0.08227343 0.007384579

RMSE is smaller, and \(R^2\) is quite high; with this many variables we’ve probably overfit the model, but for RMSE we have an improvement over the model that included only the numeric variables.

Final Model

Method: “stepAIC” with Recoded Variables

## 
## Call:
## lm(formula = log_SalePrice ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtFinSF2 + 
##     BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + 
##     FullBath + HalfBath + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + 
##     GarageCars + GarageArea + WoodDeckSF + EnclosedPorch + X3SsnPorch + 
##     ScreenPorch + PoolArea + MSZoning_RM + MSZoning_CAll + Street_Grvl + 
##     LotShape_IR1 + LandContour_Bnk + LandContour_Low + Utilities_NoSeWa + 
##     LotConfig_Corner + LotConfig_CulDSac + LandSlope_Mod + LandSlope_Sev + 
##     Neighborhood_Veenker + Neighborhood_Crawfor + Neighborhood_NoRidge + 
##     Neighborhood_Mitchel + Neighborhood_Somerst + Neighborhood_BrkSide + 
##     Neighborhood_NridgHt + Neighborhood_MeadowV + Neighborhood_Edwards + 
##     Neighborhood_StoneBr + Neighborhood_ClearCr + Condition1_Feedr + 
##     Condition1_Artery + Condition1_RRAe + Condition1_RRAn + Condition2_PosN + 
##     Condition2_PosA + Condition2_RRAe + BldgType_2fmCon + BldgType_Twnhs + 
##     HouseStyle_1.5Fin + HouseStyle_SLvl + HouseStyle_2.5Unf + 
##     RoofStyle_Shed + RoofMatl_Metal + RoofMatl_Membran + RoofMatl_ClyTile + 
##     Exterior1st_MetalSd + Exterior1st_WdSdng + Exterior1st_BrkFace + 
##     Exterior1st_CemntBd + Exterior1st_BrkComm + Exterior2nd_HdBoard + 
##     Exterior2nd_WdSdng + Exterior2nd_CmentBd + Exterior2nd_BrkFace + 
##     Exterior2nd_AsbShng + ExterCond_Gd + ExterCond_Fa + Foundation_CBlock + 
##     Foundation_BrkTil + Foundation_Wood + Foundation_Slab + BsmtQual_Ex + 
##     BsmtCond_Po + BsmtExposure_Gd + BsmtExposure_Av + BsmtFinType1_GLQ + 
##     BsmtFinType2_BLQ + Heating_GasW + Heating_Grav + HeatingQC_Gd + 
##     HeatingQC_TA + HeatingQC_Fa + CentralAir_N + Electrical_Mix + 
##     KitchenQual_Ex + Functional_Min1 + Functional_Maj1 + Functional_Min2 + 
##     Functional_Mod + Functional_Maj2 + Functional_Sev + GarageQual_Fa + 
##     GarageQual_Ex + GarageCond_Fa + GarageCond_Po + GarageCond_Ex + 
##     SaleType_New + SaleType_ConLD + SaleType_CWD + SaleCondition_Abnorml + 
##     SaleCondition_Family + HouseStyle_2.5Fin, data = hp_train_final)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71268 -0.04811  0.00260  0.05437  0.71268 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.873e+00  5.775e-01  10.170  < 2e-16 ***
## MSSubClass            -3.750e-04  1.120e-04  -3.349 0.000835 ***
## LotArea                3.175e-06  4.246e-07   7.479 1.34e-13 ***
## OverallQual            4.444e-02  4.081e-03  10.890  < 2e-16 ***
## OverallCond            3.745e-02  3.543e-03  10.570  < 2e-16 ***
## YearBuilt              1.884e-03  2.530e-04   7.448 1.69e-13 ***
## YearRemodAdd           6.757e-04  2.158e-04   3.131 0.001777 ** 
## BsmtFinSF1             1.421e-04  1.734e-05   8.198 5.63e-16 ***
## BsmtFinSF2             1.220e-04  2.384e-05   5.117 3.56e-07 ***
## BsmtUnfSF              8.164e-05  1.575e-05   5.182 2.52e-07 ***
## X1stFlrSF              2.385e-04  1.976e-05  12.071  < 2e-16 ***
## X2ndFlrSF              2.343e-04  1.636e-05  14.325  < 2e-16 ***
## LowQualFinSF           1.964e-04  7.581e-05   2.590 0.009688 ** 
## BsmtFullBath           2.416e-02  7.808e-03   3.095 0.002011 ** 
## FullBath               1.664e-02  8.772e-03   1.897 0.058075 .  
## HalfBath               2.109e-02  8.420e-03   2.505 0.012355 *  
## KitchenAbvGr          -5.672e-02  1.705e-02  -3.328 0.000899 ***
## TotRmsAbvGrd           7.031e-03  3.510e-03   2.003 0.045402 *  
## Fireplaces             2.453e-02  5.494e-03   4.466 8.65e-06 ***
## GarageCars             2.637e-02  9.030e-03   2.920 0.003553 ** 
## GarageArea             1.126e-04  3.098e-05   3.633 0.000291 ***
## WoodDeckSF             9.546e-05  2.452e-05   3.893 0.000104 ***
## EnclosedPorch          1.307e-04  5.189e-05   2.518 0.011904 *  
## X3SsnPorch             1.546e-04  9.517e-05   1.624 0.104503    
## ScreenPorch            2.394e-04  5.306e-05   4.512 6.97e-06 ***
## PoolArea               1.461e-04  7.390e-05   1.977 0.048285 *  
## MSZoning_RM           -5.718e-02  1.035e-02  -5.524 3.96e-08 ***
## MSZoning_CAll         -4.080e-01  3.729e-02 -10.943  < 2e-16 ***
## Street_Grvl           -1.065e-01  4.668e-02  -2.281 0.022730 *  
## LotShape_IR1          -9.330e-03  6.457e-03  -1.445 0.148693    
## LandContour_Bnk       -2.204e-02  1.492e-02  -1.477 0.139785    
## LandContour_Low       -4.567e-02  2.306e-02  -1.980 0.047864 *  
## Utilities_NoSeWa      -1.448e-01  1.062e-01  -1.364 0.172938    
## LotConfig_Corner       1.875e-02  7.426e-03   2.525 0.011687 *  
## LotConfig_CulDSac      4.482e-02  1.223e-02   3.666 0.000256 ***
## LandSlope_Mod          3.511e-02  1.620e-02   2.168 0.030368 *  
## LandSlope_Sev         -2.161e-01  4.641e-02  -4.656 3.54e-06 ***
## Neighborhood_Veenker   6.434e-02  3.291e-02   1.955 0.050830 .  
## Neighborhood_Crawfor   1.342e-01  1.706e-02   7.864 7.55e-15 ***
## Neighborhood_NoRidge   4.863e-02  1.892e-02   2.570 0.010264 *  
## Neighborhood_Mitchel  -3.592e-02  1.606e-02  -2.236 0.025515 *  
## Neighborhood_Somerst   5.685e-02  1.395e-02   4.074 4.90e-05 ***
## Neighborhood_BrkSide   4.362e-02  1.587e-02   2.748 0.006072 ** 
## Neighborhood_NridgHt   8.866e-02  1.580e-02   5.613 2.41e-08 ***
## Neighborhood_MeadowV  -1.237e-01  3.344e-02  -3.701 0.000224 ***
## Neighborhood_Edwards  -5.303e-02  1.209e-02  -4.385 1.25e-05 ***
## Neighborhood_StoneBr   1.346e-01  2.332e-02   5.772 9.69e-09 ***
## Neighborhood_ClearCr   4.990e-02  2.380e-02   2.097 0.036162 *  
## Condition1_Feedr      -5.572e-02  1.259e-02  -4.424 1.05e-05 ***
## Condition1_Artery     -9.888e-02  1.689e-02  -5.853 6.04e-09 ***
## Condition1_RRAe       -1.237e-01  3.190e-02  -3.879 0.000110 ***
## Condition1_RRAn       -3.194e-02  2.113e-02  -1.512 0.130848    
## Condition2_PosN       -8.820e-01  7.785e-02 -11.329  < 2e-16 ***
## Condition2_PosA        2.606e-01  1.132e-01   2.302 0.021463 *  
## Condition2_RRAe       -5.176e-01  1.572e-01  -3.293 0.001016 ** 
## BldgType_2fmCon        4.499e-02  2.569e-02   1.751 0.080157 .  
## BldgType_Twnhs        -5.161e-02  1.936e-02  -2.666 0.007766 ** 
## HouseStyle_1.5Fin      2.314e-02  1.096e-02   2.112 0.034906 *  
## HouseStyle_SLvl        2.600e-02  1.519e-02   1.711 0.087315 .  
## HouseStyle_2.5Unf      5.335e-02  3.583e-02   1.489 0.136671    
## RoofStyle_Shed         3.578e-01  1.140e-01   3.139 0.001734 ** 
## RoofMatl_Metal         2.604e-01  1.158e-01   2.249 0.024696 *  
## RoofMatl_Membran       4.159e-01  1.181e-01   3.522 0.000443 ***
## RoofMatl_ClyTile      -2.622e+00  1.302e-01 -20.141  < 2e-16 ***
## Exterior1st_MetalSd    1.246e-02  9.107e-03   1.368 0.171489    
## Exterior1st_WdSdng    -4.830e-02  1.747e-02  -2.765 0.005771 ** 
## Exterior1st_BrkFace    7.566e-02  2.300e-02   3.290 0.001029 ** 
## Exterior1st_CemntBd   -1.061e-01  6.389e-02  -1.661 0.097019 .  
## Exterior1st_BrkComm   -1.707e-01  7.910e-02  -2.158 0.031115 *  
## Exterior2nd_HdBoard   -1.760e-02  9.020e-03  -1.951 0.051247 .  
## Exterior2nd_WdSdng     3.752e-02  1.744e-02   2.151 0.031675 *  
## Exterior2nd_CmentBd    1.211e-01  6.425e-02   1.885 0.059578 .  
## Exterior2nd_BrkFace   -4.244e-02  3.051e-02  -1.391 0.164355    
## Exterior2nd_AsbShng   -3.821e-02  2.544e-02  -1.502 0.133352    
## ExterCond_Gd          -2.200e-02  9.914e-03  -2.220 0.026618 *  
## ExterCond_Fa          -3.882e-02  2.325e-02  -1.670 0.095176 .  
## Foundation_CBlock     -2.692e-02  9.091e-03  -2.961 0.003117 ** 
## Foundation_BrkTil     -4.136e-02  1.424e-02  -2.903 0.003751 ** 
## Foundation_Wood       -1.646e-01  6.161e-02  -2.672 0.007640 ** 
## Foundation_Slab       -4.804e-02  2.892e-02  -1.661 0.096955 .  
## BsmtQual_Ex            3.266e-02  1.355e-02   2.410 0.016081 *  
## BsmtCond_Po            3.123e-01  1.226e-01   2.546 0.010996 *  
## BsmtExposure_Gd        4.851e-02  1.192e-02   4.069 5.00e-05 ***
## BsmtExposure_Av        1.564e-02  8.681e-03   1.801 0.071907 .  
## BsmtFinType1_GLQ       1.964e-02  8.887e-03   2.210 0.027279 *  
## BsmtFinType2_BLQ      -4.887e-02  1.939e-02  -2.521 0.011832 *  
## Heating_GasW           7.139e-02  2.796e-02   2.553 0.010793 *  
## Heating_Grav          -1.854e-01  4.648e-02  -3.989 7.00e-05 ***
## HeatingQC_Gd          -2.281e-02  8.670e-03  -2.631 0.008609 ** 
## HeatingQC_TA          -3.738e-02  8.347e-03  -4.478 8.16e-06 ***
## HeatingQC_Fa          -2.584e-02  1.849e-02  -1.397 0.162579    
## CentralAir_N          -5.592e-02  1.484e-02  -3.769 0.000171 ***
## Electrical_Mix        -3.383e-01  1.745e-01  -1.938 0.052818 .  
## KitchenQual_Ex         5.188e-02  1.401e-02   3.704 0.000221 ***
## Functional_Min1       -3.875e-02  1.977e-02  -1.960 0.050189 .  
## Functional_Maj1       -7.187e-02  3.094e-02  -2.323 0.020324 *  
## Functional_Min2       -3.607e-02  1.932e-02  -1.867 0.062174 .  
## Functional_Mod        -1.386e-01  3.014e-02  -4.597 4.68e-06 ***
## Functional_Maj2       -3.153e-01  5.348e-02  -5.895 4.72e-09 ***
## Functional_Sev        -3.562e-01  1.051e-01  -3.391 0.000716 ***
## GarageQual_Fa         -3.858e-02  1.919e-02  -2.011 0.044516 *  
## GarageQual_Ex          3.888e-01  1.166e-01   3.334 0.000878 ***
## GarageCond_Fa         -3.157e-02  2.125e-02  -1.486 0.137526    
## GarageCond_Po          1.203e-01  4.819e-02   2.496 0.012695 *  
## GarageCond_Ex         -3.474e-01  1.382e-01  -2.515 0.012032 *  
## SaleType_New           4.990e-02  1.187e-02   4.202 2.82e-05 ***
## SaleType_ConLD         1.326e-01  3.718e-02   3.567 0.000373 ***
## SaleType_CWD           8.259e-02  5.339e-02   1.547 0.122145    
## SaleCondition_Abnorml -6.188e-02  1.147e-02  -5.395 8.07e-08 ***
## SaleCondition_Family  -5.699e-02  2.389e-02  -2.386 0.017172 *  
## HouseStyle_2.5Fin     -7.336e-02  5.039e-02  -1.456 0.145691    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1022 on 1349 degrees of freedom
## Multiple R-squared:  0.9395, Adjusted R-squared:  0.9346 
## F-statistic: 190.5 on 110 and 1349 DF,  p-value: < 2.2e-16

This is the bestmodel by far, tho it includes several nonsignificant coefficients from recoded string variables. \(R^2\) at 94% indicates adequate fit and residuals are symmetric. P-value is significant, but the degrees of freedom are quite low.

The goal was to allow R to select the model programatically. Let’s test these two models with the test set.

# allowing strings to be imported as factors
hp_test <- read.csv('https://raw.githubusercontent.com/sigmasigmaiota/DATA605/master/test.csv', stringsAsFactors = FALSE)

# create data frame with numeric variables
nums2 <- unlist(lapply(hp_test, is.numeric))
# replace NAs with 0s
hp_test[is.na(hp_test)] <- 0
hpnum_test <- hp_test[,nums2]

# create list of character variables 
is_char2 <- colnames(hp_test[,sapply(hp_test, is.character)])

# remove all spaces from character variables in dataframe
hpchar_test <- as.data.frame(apply(hp_test[,is_char2],2,function(x)gsub('\\s+', '',x)),stringsAsFactors = FALSE)

# reassemble
hp_test <- data.frame(hpnum_test, hpchar_test)

# create dummy variables for each, eliminating the referant, defined as the most frequent of each nominal level 
hp_test2 <- dummy_columns(hp_test, select_columns = is_char2, ignore_na = FALSE)

names(hp_test2)[names(hp_test2)=="MSZoning_C(all)"] <- "MSZoning_CAll"
names(hp_test2)[names(hp_test2)=="RoofMatl_Tar&Grv"] <- "RoofMatl_TarGrv"

# some variable levels found in the training set are not present in the testing set. In order to proceed, we must include these variables and set value to zero.
hp_test2$Condition2_RRAe <- 0
hp_test2$Condition2_RRAn <- 0
hp_test2$HouseStyle_2.5Fin <- 0
hp_test2$RoofMatl_Metal <- 0
hp_test2$RoofMatl_Membran <- 0
hp_test2$RoofMatl_ClyTile <- 0
hp_test2$Electrical_Mix <- 0
hp_test2$GarageQual_Ex <- 0
#hp_test2$PoolQC_Fa <- 0
hp_test2$Utilities_NoSeWa <- 0
hp_test2$Condition2_RRNn <- 0
hp_test2$RoofMatl_TarGrv <- 0
hp_test2$RoofMatl_Roll <- 0
hp_test2$Exterior1st_Stone <- 0
hp_test2$Exterior2nd_Other <- 0
hp_test2$Heating_OthW <- 0
hp_test2$Heating_Floor <- 0
hp_test2$Electrical_0 <- 0
#hp_test2$MiscFeature_TenC <- 0
hp_test2$Exterior1st_ImStucc <- 0

# remove al variables not found in the model
#keep_test1 <- names(coef(step.model2$finalModel, step.model2$bestTune$nvmax))[-1]
keep_test1 <- names(coef(step.model2$finalModel))[-1]
keep_test2 <- coef(step.model3$finalModel)[-1]

# predict with the model
prd1 <- predict(step.model2, hp_test2)
prd2 <- predict(step.model3, hp_test2)

We’ll bind the predictions with the Id variable from the test set.

Kaggle username: stephensjones; Display name: Stephen S Jones

My first submission score was 0.14179, number 3009 on the leaderboard. A second attempt yielded a score of 0.14343; wrong direction! Third attempt: 0.18616; wrong direction. Fourth (and final) attempt: 0.13288, 2317 on the leaderboard.

The stepAIC model with lm yielded the best results, though caret was much faster with parallel processing options improving efficiency. Imputation amond numeric variables decreased residuals slightly.

The final regression model is included above.

Stephen (Scott) Jones

12/12/2019