Final Exam
Problem 1.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=\frac{N+1}{2}\).
N = 46
set.seed(1973)
# random variable X
X <- runif(10000, min = 1, max = N)
# random variable Y
Y <- rnorm(10000, mean = (N+1)/2, sd = (N+1)/2)
# x is median of X
x <- median(X)
# y is first quartile of Y
y <- quantile(Y)[[2]]
Probabilities
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
a. \(P(X > x | X > y )\)
library(tidyverse)
library(dplyr)
Xdf <- data.frame(value = as.numeric(X), stringsAsFactors = FALSE)
cond_prob <- sum(Xdf > x & Xdf > y)/sum(Xdf > y)
cat("The requested probability is",cond_prob[[1]])## The requested probability is 0.5858231
b. \(P(X > x\text{ and } Y > y)\)
Ydf <- data.frame(value = as.numeric(Y), stringsAsFactors = FALSE)
res_prob <- mean(Xdf$value > x) * mean(Ydf$value > y)
cat("The requested probability is",res_prob)## The requested probability is 0.375
c. \(P(X < x\text{ and } X > y )\)
cond_prob2 <- mean(Xdf$value < x) * mean(Xdf$value > y)
cat("The requested probability is",cond_prob2[[1]])## The requested probability is 0.42675
Marginal, Joint Probabilities
Investigate whether \(P(X>x\text{ and } Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
Contingency tables, one with counts and one with probabilities, are below.
XYdf <- data.frame(cbind(Xdf$value,Ydf$value))
colnames(XYdf) <- c('xval','yval')
# compose dataframe with sums
tb_XY <- data.frame(c(sum(XYdf$xval > x & XYdf$yval > y),
sum(XYdf$xval <= x & XYdf$yval > y)),
c(sum(XYdf$xval > x & XYdf$yval <= y),
sum(XYdf$xval <= x & XYdf$yval <= y)),
row.names=c('X > x','X <= x'))
colnames(tb_XY) <- c('Y > y','Y <= y')
library(kableExtra)
tb_XY %>%
kable(align = "c", format = "html")%>%
kable_styling(bootstrap_options = c("striped", "hover"),full_width = F)%>%
column_spec(1, bold = T, border_right = T)%>%
column_spec(2, border_right = T)

|        | Y > y | Y <= y |
|--------|-------|--------|
| X > x  | 3760  | 1240   |
| X <= x | 3740  | 1260   |
Marginal probabilities:
\(P(Y>y)=\) 0.75.
\(P(Y\le y)=\) 0.25.
\(P(X>x)=\) 0.5.
\(P(X\le x)=\) 0.5.
Joint probabilities:
\(P(X>x\text{ and }Y>y)=\) 0.376.
\(P(X>x\text{ and }Y\le y)=\) 0.124.
\(P(X\le x\text{ and }Y>y)=\) 0.374.
\(P(X\le x\text{ and }Y\le y)=\) 0.126.
These probabilities are confirmed in the table below.
# calculate proportion
tb_prop_XY <- tb_XY/10000
tb_prop_XY %>%
kable(align = "c", format = "html",
table.attr = 'class="table table-striped table-hover"')%>%
kable_styling(bootstrap_options = c("striped", "hover"),full_width = F)%>%
column_spec(1, bold = T, border_right = T)%>%
column_spec(2, border_right = T)

|        | Y > y | Y <= y |
|--------|-------|--------|
| X > x  | 0.376 | 0.124  |
| X <= x | 0.374 | 0.126  |
Further, the joint probability \(P(X>x\text{ and }Y>y)=\) 0.376, while the product of the marginal probabilities \(P(X>x)=\) 0.5 and \(P(Y>y)=\) 0.75 is \(P(X>x)P(Y>y)=\) 0.375. The two values are nearly identical, consistent with independence.
Fisher’s, Chi-squared
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher’s results:
library(stats)
XY_matrix <- as.matrix(tb_XY)
fisher.test(XY_matrix, alternative = "two.sided", hybrid = TRUE)
##
## Fisher's Exact Test for Count Data
##
## data: XY_matrix
## p-value = 0.6608
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.932132 1.119534
## sample estimates:
## odds ratio
## 1.021566
Chi-squared results:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tb_XY
## X-squared = 0.19253, df = 1, p-value = 0.6608
Fisher’s Exact Test computes an exact p-value and is typically used for 2-by-2 contingency tables, especially when cell counts are small; the chi-squared test relies on a large-sample approximation and applies to two categorical variables with two or more categories each, provided the expected cell counts are sufficiently large. With 10,000 observations in a 2-by-2 table, either is defensible, and the chi-squared test is the more conventional choice here.
Since the p-values of the two tests are essentially identical and both are greater than 0.05, we do not reject the null hypothesis of the test (the distribution of X is independent of the distribution of Y) at the 5% significance level. This is the expected result, as the variables were generated randomly and independently.
Problem 2.
Descriptives
Provide univariate descriptive statistics and appropriate plots for the training data set.
# do not allow strings to be imported as factors
hp_train <- read.csv('https://raw.githubusercontent.com/sigmasigmaiota/DATA605/master/train.csv', stringsAsFactors = FALSE)
# omit the iterative ID column
hp_train[,1] <- NULL
Using datatable from DT:
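The interactive table itself does not survive in this rendering; a minimal sketch of the call, where the display options are assumptions:
library(DT)
# browseable table of the training data; pageLength and scrollX are assumed options
datatable(hp_train, options = list(pageLength = 5, scrollX = TRUE))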
Scatterplot Matrix
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
Using the plot above, we selected three independent variables (OverallQual, TotRmsAbvGrd, and GrLivArea) along with the dependent variable, SalePrice.
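The scatterplot matrix itself is not reproduced here; a minimal sketch using base R’s pairs() on the selected variables (the plotting function is an assumption, not necessarily the one used to draw the original figure):
# scatterplot matrix of the three predictors and the dependent variable
pairs(hp_train[, c("OverallQual", "TotRmsAbvGrd", "GrLivArea", "SalePrice")],
      pch = 20, col = rgb(0, 0, 1, 0.25))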
Correlation Matrix
Derive a correlation matrix for any three quantitative variables in the dataset.
Let’s test for normality in the three variables. We’ll render histograms and Q-Q plots for each of them, followed by Shapiro-Wilk and Kolmogorov-Smirnov tests for normality.
# create data frame with numeric variables
nums <- unlist(lapply(hp_train, is.numeric))
# replace NAs with 0s
hp_train[is.na(hp_train)] <- 0
hpnum_train <- hp_train[,nums]
library(ggpubr)
library(ggplot2)
library(cowplot)
p1<-ggplot(data=hpnum_train, aes(x=OverallQual)) +
geom_histogram()+
labs(title="OverallQual")
p2<-ggplot(data=hpnum_train, aes(x=GrLivArea)) +
geom_histogram()+
labs(title="GrLivArea")
p3<-ggplot(data=hpnum_train, aes(x=TotRmsAbvGrd)) +
geom_histogram()+
labs(title="TotRmsAbvGrd")
p4<-ggqqplot(hpnum_train$OverallQual, ylab="OverallQual")
p5<-ggqqplot(hpnum_train$GrLivArea, ylab="GrLivArea")
p6<-ggqqplot(hpnum_train$TotRmsAbvGrd, ylab="TotRmsAbvGrd")
plot_grid(p1,p2,p3,p4,p5,p6,nrow=2,ncol=3)
GrLivArea clearly does not follow a normal distribution; while OverallQual seems to approximate a discrete normal distribution, TotRmsAbvGrd seems to deviate significantly. Test results below will confirm these observations.
## OverallQual:
##
## Shapiro-Wilk normality test
##
## data: hpnum_train$OverallQual
## W = 0.94801, p-value < 2.2e-16
##
## One-sample Kolmogorov-Smirnov test
##
## data: hpnum_train$OverallQual
## D = 0.99523, p-value < 2.2e-16
## alternative hypothesis: two-sided
## GrLivArea:
##
## Shapiro-Wilk normality test
##
## data: hpnum_train$GrLivArea
## W = 0.92798, p-value < 2.2e-16
##
## One-sample Kolmogorov-Smirnov test
##
## data: hpnum_train$GrLivArea
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
## TotRmsAbvGrd:
##
## Shapiro-Wilk normality test
##
## data: hpnum_train$TotRmsAbvGrd
## W = 0.94228, p-value < 2.2e-16
##
## One-sample Kolmogorov-Smirnov test
##
## data: hpnum_train$TotRmsAbvGrd
## D = 0.99797, p-value < 2.2e-16
## alternative hypothesis: two-sided
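One caveat on the Kolmogorov-Smirnov output above: when no mean or standard deviation is supplied, ks.test compares against a standard normal, which is why the D statistics are so close to 1. A sketch of the parameterized comparison (illustrative only; ties in the data will still trigger a warning because the KS test assumes a continuous distribution):
# KS test against a normal with the sample mean and standard deviation
ks.test(hpnum_train$GrLivArea, "pnorm",
        mean = mean(hpnum_train$GrLivArea), sd = sd(hpnum_train$GrLivArea))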
According to our analyses, the variables are not normally distributed. This determines the method of the correlation tests; we’ll use method = "kendall", a nonparametric test.
library(corrplot)
# obtain correlations
TT <-cor(hpnum_train[c("OverallQual","TotRmsAbvGrd","GrLivArea")], method = "kendall")
# plot correlation plot
cex.before <- par("cex")
par(cex = 0.9)
corrplot(TT, insig = "blank", method = "color",
addCoef.col="grey",
order = "AOE", tl.cex = 1/par("cex"),
cl.cex = 1/par("cex"), addCoefasPercent = TRUE)
Correlation Tests
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
Because we are using the nonparametric Kendall method, cor.test tests the null hypothesis that the true tau is zero (here at the 5% significance level) but does not return a confidence interval; the 80% intervals are computed separately below.
c_OG <- cor.test(hpnum_train$OverallQual, hpnum_train$GrLivArea, method = "kendall", use = "complete.obs")
c_OT <- cor.test(hpnum_train$OverallQual, hpnum_train$TotRmsAbvGrd, method = "kendall", use = "complete.obs")
c_TG <- cor.test(hpnum_train$TotRmsAbvGrd, hpnum_train$GrLivArea, method = "kendall", use = "complete.obs")
cat("Testing OverallQual and GrLivArea, p-val:",c_OG$p.value," We reject the null hypothesis.\n")## Testing OverallQual and GrLivArea, p-val: 6.870239e-130 We reject the null hypothesis.
cat("Testing OverallQual and TotRmsAbvGrd, p-val:",c_OT$p.value," We reject the null hypothesis.\n")## Testing OverallQual and TotRmsAbvGrd, p-val: 3.056094e-64 We reject the null hypothesis.
## Testing TotRmsAbvGrd and GrLivArea, p-val: 8.066909e-285 We reject the null hypothesis.
To produce the desired confidence intervals for Kendall tau we use kendall.ci from the package NSM3 without the bootstrap option.
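The calls themselves are not displayed in this rendering; a minimal sketch, assuming kendall.ci’s alpha and bootstrap arguments behave as documented in NSM3 (alpha = 0.2 yields an 80% interval):
library(NSM3)
# 80% two-sided confidence intervals for Kendall's tau, without bootstrapping
kendall.ci(hpnum_train$OverallQual, hpnum_train$GrLivArea, alpha = 0.2, bootstrap = FALSE)
kendall.ci(hpnum_train$OverallQual, hpnum_train$TotRmsAbvGrd, alpha = 0.2, bootstrap = FALSE)
kendall.ci(hpnum_train$TotRmsAbvGrd, hpnum_train$GrLivArea, alpha = 0.2, bootstrap = FALSE)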
## OverallQual and GrLivArea:
##
## 1 - alpha = 0.8 two-sided CI for tau:
## 0.447, 0.481
## OverallQual and TotRmsAbvGrd:
##
## 1 - alpha = 0.8 two-sided CI for tau:
## 0.331, 0.371
## TotRmsAbvGrd and GrLivArea:
##
## 1 - alpha = 0.8 two-sided CI for tau:
## 0.672, 0.697
Discussion
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
As expected, none of the confidence intervals spans zero, confirming our rejection of the null hypotheses: each pair of variables is correlated. Given the meaning of each variable, this result is logical. With these findings I would not be worried about familywise error; even using the Bonferroni correction, each of the p-values is far below \(0.05/3\approx 0.0167\), confirming our original result.
LU Decomposition
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Let’s check that the determinant does not equal zero.
## [1] 0.4161713
Since the determinant does not equal zero, the inverse exists. We use the library matlib to find the inverse. The product of the correlation matrix and the precision matrix:
## [,1] [,2] [,3]
## [1,] 1.000000e+00 1.526603e-09 -6.066037e-11
## [2,] 5.717334e-09 1.000000e+00 -4.510915e-09
## [3,] 5.234534e-09 -3.307733e-09 1.000000e+00
## which is effectively
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
The product of the precision matrix and the correlation matrix:
## [,1] [,2] [,3]
## [1,] 1.000000e+00 5.717334e-09 5.234534e-09
## [2,] 1.526603e-09 1.000000e+00 -3.307733e-09
## [3,] -6.066037e-11 -4.510915e-09 1.000000e+00
## which is effectively
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
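The code behind these checks is not shown in this rendering; a minimal sketch using base R’s solve() (equivalent to inverting with matlib), assuming TT is the Kendall correlation matrix from the previous section:
# determinant check, precision matrix, and both products rounded for readability
det(TT)
PM <- solve(TT)       # precision matrix: inverse of the correlation matrix
round(TT %*% PM, 10)  # correlation %*% precision, effectively the identity
round(PM %*% TT, 10)  # precision %*% correlation, effectively the identity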
We’ll obtain the lower and upper decompositions of the matrix and its inverse using the package pracma.
## Decomposition of the correlation matrix:
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.3513492 1.0000000 0
## [3,] 0.4641893 0.5944486 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.3513492 0.4641893
## [2,] 0 0.8765538 0.5210662
## [3,] 0 0.0000000 0.4747812
## Decomposition of the inverse:
## $L
## [,1] [,2] [,3]
## [1,] 1.00000000 0.0000000 0
## [2,] -0.06348622 1.0000000 0
## [3,] -0.42075466 -0.6841587 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1.278144 -0.08114454 -0.5377851
## [2,] 0.000000 1.87995768 -1.2861894
## [3,] 0.000000 0.00000000 1.0000000
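A sketch of the decomposition calls, assuming pracma’s lu() and the TT and PM objects defined in the sketch above:
library(pracma)
lu(TT)  # lower/upper decomposition of the correlation matrix
lu(PM)  # lower/upper decomposition of the precision matrix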
Exponential PDF
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html).
According to the histogram of GrLivArea above, the variable is skewed to the right. The MASS package was loaded at an earlier point in this markdown. Let’s test that the mean is greater than the median in the training set and the minimum value of the variable is greater than zero.
## [1] TRUE
## [1] TRUE
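The checks that produced these two results are not shown; they take the following form (a sketch):
# right skew: the mean exceeds the median; support is strictly above zero
mean(hp_train$GrLivArea) > median(hp_train$GrLivArea)
min(hp_train$GrLivArea) > 0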
We’ll use fitdistr to fit the pdf.
## rate
## 6.598640e-04
## (1.726943e-05)
\(\lambda\) Histogram
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable.
fit <- fitdistr(hp_train$GrLivArea, "exponential")
set.seed(1974)
est_data <- rexp(1000, fit$estimate[1]) # sample from the exponential using the fitted rate parameter
par(mfrow=c(1,2))
hist(hp_train$GrLivArea, pch=20, breaks=50, prob=TRUE, col = "lightblue", ylim = c(0,.0008), xlim = c(0,9000))
hist(est_data, pch=20, breaks=50, prob=TRUE, col = "orange", ylim = c(0,.0008), xlim = c(0,9000))
The distributions look very different with the axes set for optimum comparison. To continue experimentation, I would try a truncated negative binomial distribution to approximate the original variable.
CDF Percentiles
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
For \(x < 0\), \(F(x;\lambda)=0\); for \(x \ge 0\) we have:
\[\begin{aligned} F(x;\lambda)= P(X\le x)&=\int_{0}^{x} \lambda e^{-\lambda t}\,dt\\ &=\bigg[-e^{-\lambda t}\bigg]_0^{x}\\ &=1 - e^{-\lambda x}\\ \end{aligned}\] To find the 5th percentile:
\[\begin{aligned} F(x;0.000659864 )= P(X\le x)&=1 - e^{-0.000659864 x}=.05\\ \implies 0.95 &= e^{-0.000659864 x}\\ \log(0.95)&=-0.000659864 x\\ x&=\frac{\log(0.95)}{-0.000659864}\\ x&=77.73313\\ \end{aligned}\]
To find the 95th percentile:
\[\begin{aligned} F(x;0.000659864 )= P(X\le x)&=1 - e^{-0.000659864 x}=.95\\ \implies 0.05 &= e^{-0.000659864 x}\\ \log(0.05)&=-0.000659864 x\\ x&=\frac{\log(0.05)}{-0.000659864}\\ x&=4539.924\\ \end{aligned}\]
Programmatically:
q <- seq(0.001,0.999,0.001)
qexpFIT <- data.frame(Q=q, Quantile=qexp(q, rate=fit$estimate[1]))
cat("The 5th percentile:",qexpFIT[50,2],"\n")## The 5th percentile: 77.73313
## The 95th percentile: 4539.924
Empirical CI
Also generate a 95% confidence interval from the empirical data, assuming normality.
The data in hp_train$GrLivArea are not normally distributed; however, we will assume normality.
m <- mean(hp_train$GrLivArea)
sdev <- sd(hp_train$GrLivArea)
n <- length(hp_train$GrLivArea)
error <- qnorm(0.975)*sdev/sqrt(n)
lower <- m-error
upper <- m+error
cat("Assuming normality, the 95% confidence interval is (",lower,",",upper,")\n")## Assuming normality, the 95% confidence interval is ( 1488.509 , 1542.418 )
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
elow <- quantile(hp_train$GrLivArea, probs = 0.05)
eupp <- quantile(hp_train$GrLivArea, probs = 0.95)
cat("The empirical 5th and 95th percentiles are",elow,"and",eupp,"respectively.\n")## The empirical 5th and 95th percentiles are 848 and 2466.1 respectively.
The empirical 5th and 95th percentiles (848 and 2466.1) are quite different from the interval calculated under the normality assumption (1488.509, 1542.418). Part of the difference is expected because the interval above is a confidence interval for the mean rather than a range covering 90% of the observations, but the mismatch also reflects the fact that GrLivArea is not normally distributed. The mean exceeds the median, and far more observations fall below the mean than above it, so the mean, standard deviation, and error should be re-estimated under a distribution that reflects this right skew.
Let’s try the negative binomial distribution instead. A comparison of the histograms is below.
fit2 <- fitdistr(hp_train$GrLivArea, "negative binomial")
set.seed(1975)
est2_data <- rnegbin(1000, fit2$estimate[2],fit2$estimate[1]) # simulate using the fitted mu and size (theta) parameters
par(mfrow=c(1,2))
hist(hp_train$GrLivArea, pch=20, breaks=50, prob=TRUE, col = "lightblue", ylim = c(0,.0008), xlim = c(0,9000))
hist(est2_data, pch=20, breaks=50, prob=TRUE, col = "red", ylim = c(0,.0008), xlim = c(0,9000))
The fit is much better and the histograms closely resemble one another. Let’s use these parameters to calculate the confidence interval.
# mean is calculated from the estimated parameter of the proposed model
m2 <- fit2$estimate[2]
# standard deviation is calculated from the estimated parameter of the proposed model
sdev2 <- fit2$sd[2]
n2 <- length(hp_train$GrLivArea)
# we are estimating error following a negative binomial distribution, not a normal distribution
error2 <- qnbinom(0.975, mu = m2, size = fit2$estimate[1])*sdev2/sqrt(n2)
lower2 <- m2-error2
upper2 <- m2+error2
cat("Using the negative binomial parameters, the 95% confidence interval is (",lower2,",",upper2,")\n")## Using the negative binomial parameters, the 95% confidence interval is ( 607.709 , 2423.218 )
Our estimated confidence interval (607.709, 2423.218) much more closely resembles the percentiles from the empirical data, 848 and 2466.1. Model estimation and fit improve drastically when the appropriate model is selected.
Regression Result
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
p7<-ggplot(data=hp_train, aes(x=SalePrice)) +
geom_histogram()+
labs(title="SalePrice")
p8<-ggqqplot(hp_train$SalePrice, ylab="SalePrice")
plot_grid(p7,p8)
Let’s use the descdist rendering to suggest a distribution for the dependent variable.
## summary statistics
## ------
## min: 34900 max: 755000
## median: 163000
## mean: 180921.2
## estimated sd: 79442.5
## estimated skewness: 1.882876
## estimated kurtosis: 9.536282
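The descdist call (from the fitdistrplus package) is not displayed in this rendering; a minimal sketch:
library(fitdistrplus)
# Cullen and Frey graph plus the summary statistics printed above
descdist(hp_train$SalePrice)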
From the figure above, it looks as if the lognormal distribution most closely approximates the empirical values in our data. Let’s compare with negative binomial.
fit_ln <- fitdist(hp_train$SalePrice, "lnorm")
fit_nb <- fitdist(hp_train$SalePrice, "nbinom")
par(mfrow=c(2,2))
plot.legend <- c("lognormal", "neg bin")
denscomp(list(fit_ln, fit_nb), legendtext = plot.legend)
cdfcomp (list(fit_ln, fit_nb), legendtext = plot.legend)
qqcomp (list(fit_ln, fit_nb), legendtext = plot.legend)
ppcomp (list(fit_ln, fit_nb), legendtext = plot.legend)
We’ll select a lognormal distribution, as the negative binomial distribution fit underperforms.
# create transformed variable in the numeric-only dataset and the original dataset
hpnum_train$log_SalePrice <- log1p(hpnum_train$SalePrice)
hp_train$log_SalePrice <- log1p(hp_train$SalePrice)
hpnum_train$SalePrice <- NULL
hp_train$SalePrice <- NULL
p9<-ggplot(data=hpnum_train, aes(x=log_SalePrice))+
geom_histogram()+
labs(title="Log SalePrice")
p10<-ggqqplot(hpnum_train$log_SalePrice, ylab="Log SalePrice")
plot_grid(p9,p10)
The transformation results in a nearly normal distribution. Let’s fit a model with the numeric variables.
Method: “stepAIC”
We begin by starting over with a fresh dataset and imputing missing values in the numeric variables.
# do not allow strings to be imported as factors
hp_train <- read.csv('https://raw.githubusercontent.com/sigmasigmaiota/DATA605/master/train.csv', stringsAsFactors = FALSE)
# omit the iterative ID column
hp_train[,1] <- NULL
Before imputing missing data we assess the percentage of missing values in each variable and omit those with more than 7% of observations missing.
options(max.print=1000000)
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(hp_train,2,pMiss)
## MSSubClass MSZoning LotFrontage LotArea Street
## 0.00000000 0.00000000 17.73972603 0.00000000 0.00000000
## Alley LotShape LandContour Utilities LotConfig
## 93.76712329 0.00000000 0.00000000 0.00000000 0.00000000
## LandSlope Neighborhood Condition1 Condition2 BldgType
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 0.00000000 0.00000000 0.00000000 0.00000000 0.54794521
## MasVnrArea ExterQual ExterCond Foundation BsmtQual
## 0.54794521 0.00000000 0.00000000 0.00000000 2.53424658
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 2.53424658 2.60273973 2.53424658 0.00000000 2.60273973
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 0.00000000 0.06849315 0.00000000 0.00000000 0.00000000
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## 0.00000000 47.26027397 5.54794521 5.54794521 5.54794521
## GarageCars GarageArea GarageQual GarageCond PavedDrive
## 0.00000000 0.00000000 5.54794521 5.54794521 0.00000000
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## PoolArea PoolQC Fence MiscFeature MiscVal
## 0.00000000 99.52054795 80.75342466 96.30136986 0.00000000
## MoSold YrSold SaleType SaleCondition SalePrice
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
Let’s eliminate the variables missing more than 7%.
drop <- c('FireplaceQu','GarageType','PoolQC','MiscFeature','Fence','LotFrontage','Alley')
hp_train = hp_train[,!(names(hp_train) %in% drop)]
# create data frame with numeric variables
nums <- unlist(lapply(hp_train, is.numeric))
hpnum_train <- hp_train[,nums]
# create transformed variable in the numeric-only dataset and the original dataset
hpnum_train$log_SalePrice <- log1p(hpnum_train$SalePrice)
hp_train$log_SalePrice <- log1p(hp_train$SalePrice)
hpnum_train$SalePrice <- NULL
hp_train$SalePrice <- NULL
Let’s impute the missing data using the MICE package.
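The mice call that produced the log below is not shown; a minimal sketch consistent with that log (5 imputations, 50 iterations, predictive mean matching), where the argument values and the seed are assumptions:
library(mice)
# impute missing numeric values; m, maxit, method, and seed are assumed here
imp <- mice(hpnum_train, m = 5, maxit = 50, method = "pmm", seed = 1976)
summary(imp)
# carry one completed dataset forward for modeling
hpnum_train <- complete(imp)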
##
## iter imp variable
## 1 1 MasVnrArea* GarageYrBlt*
## 1 2 MasVnrArea* GarageYrBlt*
## 1 3 MasVnrArea GarageYrBlt
## 1 4 MasVnrArea GarageYrBlt*
## 1 5 MasVnrArea* GarageYrBlt*
## 2 1 MasVnrArea GarageYrBlt
## 2 2 MasVnrArea GarageYrBlt*
## 2 3 MasVnrArea* GarageYrBlt*
## 2 4 MasVnrArea GarageYrBlt
## 2 5 MasVnrArea GarageYrBlt*
## 3 1 MasVnrArea* GarageYrBlt*
## 3 2 MasVnrArea* GarageYrBlt*
## 3 3 MasVnrArea GarageYrBlt*
## 3 4 MasVnrArea GarageYrBlt
## 3 5 MasVnrArea GarageYrBlt*
## 4 1 MasVnrArea* GarageYrBlt*
## 4 2 MasVnrArea* GarageYrBlt*
## 4 3 MasVnrArea GarageYrBlt*
## 4 4 MasVnrArea* GarageYrBlt*
## 4 5 MasVnrArea* GarageYrBlt*
## 5 1 MasVnrArea* GarageYrBlt
## 5 2 MasVnrArea* GarageYrBlt*
## 5 3 MasVnrArea* GarageYrBlt*
## 5 4 MasVnrArea GarageYrBlt*
## 5 5 MasVnrArea* GarageYrBlt*
## 6 1 MasVnrArea GarageYrBlt*
## 6 2 MasVnrArea* GarageYrBlt
## 6 3 MasVnrArea GarageYrBlt*
## 6 4 MasVnrArea GarageYrBlt*
## 6 5 MasVnrArea* GarageYrBlt*
## 7 1 MasVnrArea GarageYrBlt
## 7 2 MasVnrArea* GarageYrBlt*
## 7 3 MasVnrArea* GarageYrBlt
## 7 4 MasVnrArea GarageYrBlt
## 7 5 MasVnrArea GarageYrBlt
## 8 1 MasVnrArea GarageYrBlt
## 8 2 MasVnrArea* GarageYrBlt*
## 8 3 MasVnrArea* GarageYrBlt*
## 8 4 MasVnrArea* GarageYrBlt*
## 8 5 MasVnrArea* GarageYrBlt
## 9 1 MasVnrArea* GarageYrBlt*
## 9 2 MasVnrArea* GarageYrBlt*
## 9 3 MasVnrArea* GarageYrBlt
## 9 4 MasVnrArea GarageYrBlt
## 9 5 MasVnrArea GarageYrBlt*
## 10 1 MasVnrArea GarageYrBlt
## 10 2 MasVnrArea GarageYrBlt*
## 10 3 MasVnrArea* GarageYrBlt*
## 10 4 MasVnrArea GarageYrBlt
## 10 5 MasVnrArea GarageYrBlt
## 11 1 MasVnrArea GarageYrBlt*
## 11 2 MasVnrArea* GarageYrBlt*
## 11 3 MasVnrArea* GarageYrBlt*
## 11 4 MasVnrArea GarageYrBlt*
## 11 5 MasVnrArea GarageYrBlt
## 12 1 MasVnrArea* GarageYrBlt*
## 12 2 MasVnrArea* GarageYrBlt
## 12 3 MasVnrArea GarageYrBlt*
## 12 4 MasVnrArea GarageYrBlt*
## 12 5 MasVnrArea* GarageYrBlt*
## 13 1 MasVnrArea GarageYrBlt*
## 13 2 MasVnrArea* GarageYrBlt*
## 13 3 MasVnrArea GarageYrBlt
## 13 4 MasVnrArea GarageYrBlt*
## 13 5 MasVnrArea* GarageYrBlt*
## 14 1 MasVnrArea* GarageYrBlt*
## 14 2 MasVnrArea* GarageYrBlt*
## 14 3 MasVnrArea* GarageYrBlt
## 14 4 MasVnrArea* GarageYrBlt*
## 14 5 MasVnrArea GarageYrBlt
## 15 1 MasVnrArea GarageYrBlt
## 15 2 MasVnrArea GarageYrBlt*
## 15 3 MasVnrArea* GarageYrBlt*
## 15 4 MasVnrArea GarageYrBlt
## 15 5 MasVnrArea GarageYrBlt*
## 16 1 MasVnrArea GarageYrBlt
## 16 2 MasVnrArea* GarageYrBlt*
## 16 3 MasVnrArea* GarageYrBlt
## 16 4 MasVnrArea* GarageYrBlt*
## 16 5 MasVnrArea* GarageYrBlt*
## 17 1 MasVnrArea* GarageYrBlt*
## 17 2 MasVnrArea* GarageYrBlt*
## 17 3 MasVnrArea GarageYrBlt*
## 17 4 MasVnrArea GarageYrBlt*
## 17 5 MasVnrArea* GarageYrBlt*
## 18 1 MasVnrArea* GarageYrBlt
## 18 2 MasVnrArea GarageYrBlt*
## 18 3 MasVnrArea GarageYrBlt*
## 18 4 MasVnrArea* GarageYrBlt*
## 18 5 MasVnrArea* GarageYrBlt*
## 19 1 MasVnrArea GarageYrBlt*
## 19 2 MasVnrArea* GarageYrBlt*
## 19 3 MasVnrArea* GarageYrBlt*
## 19 4 MasVnrArea GarageYrBlt
## 19 5 MasVnrArea GarageYrBlt
## 20 1 MasVnrArea GarageYrBlt*
## 20 2 MasVnrArea* GarageYrBlt*
## 20 3 MasVnrArea GarageYrBlt*
## 20 4 MasVnrArea* GarageYrBlt*
## 20 5 MasVnrArea* GarageYrBlt*
## 21 1 MasVnrArea GarageYrBlt*
## 21 2 MasVnrArea* GarageYrBlt*
## 21 3 MasVnrArea* GarageYrBlt*
## 21 4 MasVnrArea GarageYrBlt*
## 21 5 MasVnrArea* GarageYrBlt*
## 22 1 MasVnrArea* GarageYrBlt*
## 22 2 MasVnrArea GarageYrBlt
## 22 3 MasVnrArea GarageYrBlt
## 22 4 MasVnrArea* GarageYrBlt
## 22 5 MasVnrArea* GarageYrBlt*
## 23 1 MasVnrArea* GarageYrBlt*
## 23 2 MasVnrArea GarageYrBlt
## 23 3 MasVnrArea* GarageYrBlt*
## 23 4 MasVnrArea* GarageYrBlt*
## 23 5 MasVnrArea* GarageYrBlt*
## 24 1 MasVnrArea GarageYrBlt*
## 24 2 MasVnrArea* GarageYrBlt*
## 24 3 MasVnrArea GarageYrBlt*
## 24 4 MasVnrArea GarageYrBlt*
## 24 5 MasVnrArea* GarageYrBlt*
## 25 1 MasVnrArea* GarageYrBlt*
## 25 2 MasVnrArea GarageYrBlt*
## 25 3 MasVnrArea GarageYrBlt
## 25 4 MasVnrArea* GarageYrBlt
## 25 5 MasVnrArea* GarageYrBlt
## 26 1 MasVnrArea* GarageYrBlt*
## 26 2 MasVnrArea* GarageYrBlt*
## 26 3 MasVnrArea* GarageYrBlt
## 26 4 MasVnrArea* GarageYrBlt
## 26 5 MasVnrArea GarageYrBlt
## 27 1 MasVnrArea GarageYrBlt
## 27 2 MasVnrArea* GarageYrBlt*
## 27 3 MasVnrArea GarageYrBlt*
## 27 4 MasVnrArea* GarageYrBlt*
## 27 5 MasVnrArea GarageYrBlt*
## 28 1 MasVnrArea GarageYrBlt*
## 28 2 MasVnrArea* GarageYrBlt*
## 28 3 MasVnrArea* GarageYrBlt*
## 28 4 MasVnrArea GarageYrBlt
## 28 5 MasVnrArea GarageYrBlt
## 29 1 MasVnrArea GarageYrBlt
## 29 2 MasVnrArea* GarageYrBlt
## 29 3 MasVnrArea GarageYrBlt
## 29 4 MasVnrArea* GarageYrBlt*
## 29 5 MasVnrArea* GarageYrBlt*
## 30 1 MasVnrArea GarageYrBlt*
## 30 2 MasVnrArea GarageYrBlt*
## 30 3 MasVnrArea* GarageYrBlt*
## 30 4 MasVnrArea* GarageYrBlt
## 30 5 MasVnrArea GarageYrBlt
## 31 1 MasVnrArea* GarageYrBlt*
## 31 2 MasVnrArea GarageYrBlt*
## 31 3 MasVnrArea* GarageYrBlt*
## 31 4 MasVnrArea GarageYrBlt
## 31 5 MasVnrArea* GarageYrBlt*
## 32 1 MasVnrArea GarageYrBlt*
## 32 2 MasVnrArea* GarageYrBlt*
## 32 3 MasVnrArea GarageYrBlt
## 32 4 MasVnrArea GarageYrBlt
## 32 5 MasVnrArea GarageYrBlt
## 33 1 MasVnrArea* GarageYrBlt*
## 33 2 MasVnrArea* GarageYrBlt*
## 33 3 MasVnrArea* GarageYrBlt*
## 33 4 MasVnrArea GarageYrBlt
## 33 5 MasVnrArea* GarageYrBlt*
## 34 1 MasVnrArea GarageYrBlt
## 34 2 MasVnrArea* GarageYrBlt*
## 34 3 MasVnrArea* GarageYrBlt*
## 34 4 MasVnrArea* GarageYrBlt*
## 34 5 MasVnrArea GarageYrBlt
## 35 1 MasVnrArea* GarageYrBlt*
## 35 2 MasVnrArea* GarageYrBlt*
## 35 3 MasVnrArea* GarageYrBlt*
## 35 4 MasVnrArea* GarageYrBlt*
## 35 5 MasVnrArea* GarageYrBlt
## 36 1 MasVnrArea GarageYrBlt*
## 36 2 MasVnrArea GarageYrBlt*
## 36 3 MasVnrArea* GarageYrBlt*
## 36 4 MasVnrArea* GarageYrBlt*
## 36 5 MasVnrArea* GarageYrBlt*
## 37 1 MasVnrArea* GarageYrBlt*
## 37 2 MasVnrArea* GarageYrBlt*
## 37 3 MasVnrArea GarageYrBlt*
## 37 4 MasVnrArea GarageYrBlt*
## 37 5 MasVnrArea* GarageYrBlt*
## 38 1 MasVnrArea* GarageYrBlt*
## 38 2 MasVnrArea* GarageYrBlt*
## 38 3 MasVnrArea* GarageYrBlt
## 38 4 MasVnrArea GarageYrBlt*
## 38 5 MasVnrArea GarageYrBlt
## 39 1 MasVnrArea* GarageYrBlt
## 39 2 MasVnrArea GarageYrBlt*
## 39 3 MasVnrArea* GarageYrBlt*
## 39 4 MasVnrArea* GarageYrBlt
## 39 5 MasVnrArea* GarageYrBlt*
## 40 1 MasVnrArea* GarageYrBlt
## 40 2 MasVnrArea* GarageYrBlt
## 40 3 MasVnrArea* GarageYrBlt*
## 40 4 MasVnrArea* GarageYrBlt*
## 40 5 MasVnrArea* GarageYrBlt
## 41 1 MasVnrArea* GarageYrBlt*
## 41 2 MasVnrArea* GarageYrBlt*
## 41 3 MasVnrArea* GarageYrBlt
## 41 4 MasVnrArea GarageYrBlt*
## 41 5 MasVnrArea* GarageYrBlt*
## 42 1 MasVnrArea* GarageYrBlt*
## 42 2 MasVnrArea* GarageYrBlt*
## 42 3 MasVnrArea GarageYrBlt*
## 42 4 MasVnrArea* GarageYrBlt
## 42 5 MasVnrArea* GarageYrBlt*
## 43 1 MasVnrArea GarageYrBlt
## 43 2 MasVnrArea GarageYrBlt
## 43 3 MasVnrArea* GarageYrBlt*
## 43 4 MasVnrArea* GarageYrBlt*
## 43 5 MasVnrArea* GarageYrBlt*
## 44 1 MasVnrArea GarageYrBlt
## 44 2 MasVnrArea* GarageYrBlt*
## 44 3 MasVnrArea* GarageYrBlt*
## 44 4 MasVnrArea GarageYrBlt*
## 44 5 MasVnrArea* GarageYrBlt
## 45 1 MasVnrArea GarageYrBlt*
## 45 2 MasVnrArea GarageYrBlt
## 45 3 MasVnrArea GarageYrBlt
## 45 4 MasVnrArea* GarageYrBlt*
## 45 5 MasVnrArea* GarageYrBlt
## 46 1 MasVnrArea GarageYrBlt*
## 46 2 MasVnrArea GarageYrBlt
## 46 3 MasVnrArea* GarageYrBlt*
## 46 4 MasVnrArea* GarageYrBlt*
## 46 5 MasVnrArea* GarageYrBlt*
## 47 1 MasVnrArea* GarageYrBlt*
## 47 2 MasVnrArea GarageYrBlt
## 47 3 MasVnrArea GarageYrBlt
## 47 4 MasVnrArea GarageYrBlt*
## 47 5 MasVnrArea* GarageYrBlt
## 48 1 MasVnrArea* GarageYrBlt*
## 48 2 MasVnrArea GarageYrBlt*
## 48 3 MasVnrArea GarageYrBlt*
## 48 4 MasVnrArea GarageYrBlt
## 48 5 MasVnrArea* GarageYrBlt*
## 49 1 MasVnrArea GarageYrBlt*
## 49 2 MasVnrArea* GarageYrBlt*
## 49 3 MasVnrArea GarageYrBlt*
## 49 4 MasVnrArea GarageYrBlt*
## 49 5 MasVnrArea* GarageYrBlt*
## 50 1 MasVnrArea* GarageYrBlt
## 50 2 MasVnrArea GarageYrBlt*
## 50 3 MasVnrArea* GarageYrBlt*
## 50 4 MasVnrArea GarageYrBlt*
## 50 5 MasVnrArea* GarageYrBlt*
## * Please inspect the loggedEvents
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## MSSubClass LotArea OverallQual OverallCond YearBuilt
## "" "" "" "" ""
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## "" "pmm" "" "" ""
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## "" "" "" "" ""
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## "" "" "" "" ""
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## "" "" "" "pmm" ""
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## "" "" "" "" ""
## ScreenPorch PoolArea MiscVal MoSold YrSold
## "" "" "" "" ""
## log_SalePrice
## ""
## PredictorMatrix:
## MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd
## MSSubClass 0 1 1 1 1 1
## LotArea 1 0 1 1 1 1
## OverallQual 1 1 0 1 1 1
## OverallCond 1 1 1 0 1 1
## YearBuilt 1 1 1 1 0 1
## YearRemodAdd 1 1 1 1 1 0
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF
## MSSubClass 1 1 1 1 1 1
## LotArea 1 1 1 1 1 1
## OverallQual 1 1 1 1 1 1
## OverallCond 1 1 1 1 1 1
## YearBuilt 1 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1 1
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## MSSubClass 1 1 1 1 1
## LotArea 1 1 1 1 1
## OverallQual 1 1 1 1 1
## OverallCond 1 1 1 1 1
## YearBuilt 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1
## FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## MSSubClass 1 1 1 1 1
## LotArea 1 1 1 1 1
## OverallQual 1 1 1 1 1
## OverallCond 1 1 1 1 1
## YearBuilt 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1
## Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF
## MSSubClass 1 1 1 1 1
## LotArea 1 1 1 1 1
## OverallQual 1 1 1 1 1
## OverallCond 1 1 1 1 1
## YearBuilt 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal
## MSSubClass 1 1 1 1 1 1
## LotArea 1 1 1 1 1 1
## OverallQual 1 1 1 1 1 1
## OverallCond 1 1 1 1 1 1
## YearBuilt 1 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1 1
## MoSold YrSold log_SalePrice
## MSSubClass 1 1 1
## LotArea 1 1 1
## OverallQual 1 1 1
## OverallCond 1 1 1
## YearBuilt 1 1 1
## YearRemodAdd 1 1 1
## Number of logged events: 816
## it im dep meth
## 1 1 1 MasVnrArea pmm
## 2 1 1 MasVnrArea pmm
## 3 1 1 GarageYrBlt pmm
## 4 1 1 GarageYrBlt pmm
## 5 1 2 MasVnrArea pmm
## 6 1 2 MasVnrArea pmm
## out
## 1 BsmtFinSF1, BsmtFullBath
## 2 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model
## 3 BsmtFinSF1, BsmtFullBath
## 4 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model
## 5 BsmtFinSF1, BsmtFullBath
## 6 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model
Rename the dataframe.
Train the model using lm.
Method: “stepAIC”, Numeric Variables
# Fit the full model using stepAIC
model1 <- lm(log_SalePrice ~., data = hpnum_train)
# Stepwise regression model
model1_step <- stepAIC(model1, direction = "both",
trace = FALSE)
summary(model1_step)
##
## Call:
## lm(formula = log_SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + HalfBath + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + WoodDeckSF + EnclosedPorch + X3SsnPorch + ScreenPorch +
## PoolArea + YrSold, data = hpnum_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92973 -0.06766 0.00243 0.07404 0.55321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.722e+01 5.869e+00 2.933 0.00341 **
## MSSubClass -6.174e-04 1.087e-04 -5.682 1.61e-08 ***
## LotArea 1.853e-06 4.229e-07 4.382 1.26e-05 ***
## OverallQual 8.434e-02 4.902e-03 17.205 < 2e-16 ***
## OverallCond 4.935e-02 4.258e-03 11.589 < 2e-16 ***
## YearBuilt 2.893e-03 2.534e-04 11.416 < 2e-16 ***
## YearRemodAdd 1.091e-03 2.717e-04 4.014 6.27e-05 ***
## BsmtFinSF1 8.116e-05 1.919e-05 4.228 2.51e-05 ***
## BsmtFinSF2 6.976e-05 2.947e-05 2.367 0.01804 *
## BsmtUnfSF 5.298e-05 1.752e-05 3.024 0.00254 **
## X1stFlrSF 1.964e-04 2.390e-05 8.216 4.67e-16 ***
## X2ndFlrSF 1.694e-04 2.035e-05 8.322 < 2e-16 ***
## LowQualFinSF 1.648e-04 8.281e-05 1.991 0.04671 *
## BsmtFullBath 5.979e-02 1.046e-02 5.717 1.31e-08 ***
## FullBath 3.772e-02 1.166e-02 3.236 0.00124 **
## HalfBath 2.089e-02 1.113e-02 1.876 0.06088 .
## KitchenAbvGr -5.147e-02 2.177e-02 -2.364 0.01822 *
## TotRmsAbvGrd 1.466e-02 4.628e-03 3.167 0.00157 **
## Fireplaces 4.566e-02 7.346e-03 6.216 6.68e-10 ***
## GarageCars 7.362e-02 7.101e-03 10.367 < 2e-16 ***
## WoodDeckSF 1.270e-04 3.340e-05 3.801 0.00015 ***
## EnclosedPorch 1.709e-04 7.070e-05 2.417 0.01579 *
## X3SsnPorch 2.272e-04 1.319e-04 1.723 0.08512 .
## ScreenPorch 3.642e-04 7.223e-05 5.043 5.18e-07 ***
## PoolArea -3.863e-04 9.888e-05 -3.907 9.77e-05 ***
## YrSold -7.223e-03 2.919e-03 -2.475 0.01344 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1464 on 1434 degrees of freedom
## Multiple R-squared: 0.868, Adjusted R-squared: 0.8657
## F-statistic: 377.2 on 25 and 1434 DF, p-value: < 2.2e-16
Method: “leapBackward”, Numeric Variables
set.seed(1977)
library(caret)
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
nv <- ncol(hpnum_train)-10
# Train the model
step.model <- train(log_SalePrice ~., data = hpnum_train,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 20:nv),
trControl = train.control
)
## Reordering variables and trying again:
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 20 0.1622507 0.8330429 0.10989897 0.02771617 0.05307579 0.012968060
## 2 21 0.1562831 0.8446512 0.10436740 0.02858296 0.05400624 0.007308561
## 3 22 0.1551216 0.8468942 0.10302751 0.02886732 0.05437192 0.006277085
## 4 23 0.1544610 0.8476513 0.10334411 0.02853211 0.05412154 0.007394441
## 5 24 0.1514962 0.8539444 0.09958089 0.02917332 0.05350025 0.005759485
## 6 25 0.1512278 0.8543889 0.09928569 0.02945537 0.05394382 0.005579948
## 7 26 0.1515411 0.8539754 0.09924665 0.02946447 0.05380142 0.004701382
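To see which subset size the cross-validation selected, and its coefficients, the usual caret and leaps accessors apply (a sketch mirroring the commented-out line near the end of this document); the parallel workers can also be released at this point:
# best subset size chosen by cross-validation and its coefficient estimates
step.model$bestTune
coef(step.model$finalModel, step.model$bestTune$nvmax)
# release the parallel workers started above
stopCluster(cluster)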
Let’s recode the string variables in the larger dataset as numeric dummy indicators and then drop the original character columns. Recoding will be accomplished with the fastDummies function dummy_cols.
library(fastDummies)
# create list of character variables
is_char <- unlist(colnames(hp_train[,sapply(hp_train, is.character)]))
# remove all spaces from character variables in dataframe
hpchar_train <- as.data.frame(apply(hp_train[,is_char],2,function(x)gsub('\\s+', '',x)),stringsAsFactors = FALSE)
# reassemble
hp_train <- data.frame(hpnum_train, hpchar_train)
# create dummy variables for each, eliminating the referent, defined as the most frequent level of each nominal variable
hp_train2 <- dummy_columns(hp_train, select_columns = is_char, remove_most_frequent_dummy = TRUE, ignore_na = FALSE)
names(hp_train2)[names(hp_train2)=="MSZoning_C(all)"] <- "MSZoning_CAll"
names(hp_train2)[names(hp_train2)=="RoofMatl_Tar&Grv"] <- "RoofMatl_TarGrv"
# remove the character variables from the dataset
hp_train_final <- hp_train2[,!sapply(hp_train2, is.character)]
Method: “bayesglm” with Recoded Variables
With all recoding done we’ll rerun the models to assess whether we can achieve a better fit. With the bayesglm method from caret:
set.seed(1981)
# parallel processing set again for clarity in code
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 5, savePred = TRUE, allowParallel = TRUE)
# Train the model
step.model2 <- train(log_SalePrice ~., data = hp_train_final,
method = "bayesglm",
trControl = train.control,
na.action = na.pass
)
step.model2$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 0.1471863 0.8506474 0.08454199 0.04800956 0.08227343 0.007384579
RMSE is smaller, and \(R^2\) is quite high; with this many variables we’ve probably overfit the model, but for RMSE we have an improvement over the model that included only the numeric variables.
Final Model
Method: “stepAIC” with Recoded Variables
hp_train_final[is.na(hp_train_final)] <- 0
set.seed(1980)
# Fit the full model using stepAIC
model3 <- lm(log_SalePrice ~., data = hp_train_final)
# Stepwise regression model
step.model3 <- stepAIC(model3, direction = "both",
trace = FALSE)
summary(step.model3)
##
## Call:
## lm(formula = log_SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + HalfBath + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + GarageArea + WoodDeckSF + EnclosedPorch + X3SsnPorch +
## ScreenPorch + PoolArea + MSZoning_RM + MSZoning_CAll + Street_Grvl +
## LotShape_IR1 + LandContour_Bnk + LandContour_Low + Utilities_NoSeWa +
## LotConfig_Corner + LotConfig_CulDSac + LandSlope_Mod + LandSlope_Sev +
## Neighborhood_Veenker + Neighborhood_Crawfor + Neighborhood_NoRidge +
## Neighborhood_Mitchel + Neighborhood_Somerst + Neighborhood_BrkSide +
## Neighborhood_NridgHt + Neighborhood_MeadowV + Neighborhood_Edwards +
## Neighborhood_StoneBr + Neighborhood_ClearCr + Condition1_Feedr +
## Condition1_Artery + Condition1_RRAe + Condition1_RRAn + Condition2_PosN +
## Condition2_PosA + Condition2_RRAe + BldgType_2fmCon + BldgType_Twnhs +
## HouseStyle_1.5Fin + HouseStyle_SLvl + HouseStyle_2.5Unf +
## RoofStyle_Shed + RoofMatl_Metal + RoofMatl_Membran + RoofMatl_ClyTile +
## Exterior1st_MetalSd + Exterior1st_WdSdng + Exterior1st_BrkFace +
## Exterior1st_CemntBd + Exterior1st_BrkComm + Exterior2nd_HdBoard +
## Exterior2nd_WdSdng + Exterior2nd_CmentBd + Exterior2nd_BrkFace +
## Exterior2nd_AsbShng + ExterCond_Gd + ExterCond_Fa + Foundation_CBlock +
## Foundation_BrkTil + Foundation_Wood + Foundation_Slab + BsmtQual_Ex +
## BsmtCond_Po + BsmtExposure_Gd + BsmtExposure_Av + BsmtFinType1_GLQ +
## BsmtFinType2_BLQ + Heating_GasW + Heating_Grav + HeatingQC_Gd +
## HeatingQC_TA + HeatingQC_Fa + CentralAir_N + Electrical_Mix +
## KitchenQual_Ex + Functional_Min1 + Functional_Maj1 + Functional_Min2 +
## Functional_Mod + Functional_Maj2 + Functional_Sev + GarageQual_Fa +
## GarageQual_Ex + GarageCond_Fa + GarageCond_Po + GarageCond_Ex +
## SaleType_New + SaleType_ConLD + SaleType_CWD + SaleCondition_Abnorml +
## SaleCondition_Family + HouseStyle_2.5Fin, data = hp_train_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71268 -0.04811 0.00260 0.05437 0.71268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.873e+00 5.775e-01 10.170 < 2e-16 ***
## MSSubClass -3.750e-04 1.120e-04 -3.349 0.000835 ***
## LotArea 3.175e-06 4.246e-07 7.479 1.34e-13 ***
## OverallQual 4.444e-02 4.081e-03 10.890 < 2e-16 ***
## OverallCond 3.745e-02 3.543e-03 10.570 < 2e-16 ***
## YearBuilt 1.884e-03 2.530e-04 7.448 1.69e-13 ***
## YearRemodAdd 6.757e-04 2.158e-04 3.131 0.001777 **
## BsmtFinSF1 1.421e-04 1.734e-05 8.198 5.63e-16 ***
## BsmtFinSF2 1.220e-04 2.384e-05 5.117 3.56e-07 ***
## BsmtUnfSF 8.164e-05 1.575e-05 5.182 2.52e-07 ***
## X1stFlrSF 2.385e-04 1.976e-05 12.071 < 2e-16 ***
## X2ndFlrSF 2.343e-04 1.636e-05 14.325 < 2e-16 ***
## LowQualFinSF 1.964e-04 7.581e-05 2.590 0.009688 **
## BsmtFullBath 2.416e-02 7.808e-03 3.095 0.002011 **
## FullBath 1.664e-02 8.772e-03 1.897 0.058075 .
## HalfBath 2.109e-02 8.420e-03 2.505 0.012355 *
## KitchenAbvGr -5.672e-02 1.705e-02 -3.328 0.000899 ***
## TotRmsAbvGrd 7.031e-03 3.510e-03 2.003 0.045402 *
## Fireplaces 2.453e-02 5.494e-03 4.466 8.65e-06 ***
## GarageCars 2.637e-02 9.030e-03 2.920 0.003553 **
## GarageArea 1.126e-04 3.098e-05 3.633 0.000291 ***
## WoodDeckSF 9.546e-05 2.452e-05 3.893 0.000104 ***
## EnclosedPorch 1.307e-04 5.189e-05 2.518 0.011904 *
## X3SsnPorch 1.546e-04 9.517e-05 1.624 0.104503
## ScreenPorch 2.394e-04 5.306e-05 4.512 6.97e-06 ***
## PoolArea 1.461e-04 7.390e-05 1.977 0.048285 *
## MSZoning_RM -5.718e-02 1.035e-02 -5.524 3.96e-08 ***
## MSZoning_CAll -4.080e-01 3.729e-02 -10.943 < 2e-16 ***
## Street_Grvl -1.065e-01 4.668e-02 -2.281 0.022730 *
## LotShape_IR1 -9.330e-03 6.457e-03 -1.445 0.148693
## LandContour_Bnk -2.204e-02 1.492e-02 -1.477 0.139785
## LandContour_Low -4.567e-02 2.306e-02 -1.980 0.047864 *
## Utilities_NoSeWa -1.448e-01 1.062e-01 -1.364 0.172938
## LotConfig_Corner 1.875e-02 7.426e-03 2.525 0.011687 *
## LotConfig_CulDSac 4.482e-02 1.223e-02 3.666 0.000256 ***
## LandSlope_Mod 3.511e-02 1.620e-02 2.168 0.030368 *
## LandSlope_Sev -2.161e-01 4.641e-02 -4.656 3.54e-06 ***
## Neighborhood_Veenker 6.434e-02 3.291e-02 1.955 0.050830 .
## Neighborhood_Crawfor 1.342e-01 1.706e-02 7.864 7.55e-15 ***
## Neighborhood_NoRidge 4.863e-02 1.892e-02 2.570 0.010264 *
## Neighborhood_Mitchel -3.592e-02 1.606e-02 -2.236 0.025515 *
## Neighborhood_Somerst 5.685e-02 1.395e-02 4.074 4.90e-05 ***
## Neighborhood_BrkSide 4.362e-02 1.587e-02 2.748 0.006072 **
## Neighborhood_NridgHt 8.866e-02 1.580e-02 5.613 2.41e-08 ***
## Neighborhood_MeadowV -1.237e-01 3.344e-02 -3.701 0.000224 ***
## Neighborhood_Edwards -5.303e-02 1.209e-02 -4.385 1.25e-05 ***
## Neighborhood_StoneBr 1.346e-01 2.332e-02 5.772 9.69e-09 ***
## Neighborhood_ClearCr 4.990e-02 2.380e-02 2.097 0.036162 *
## Condition1_Feedr -5.572e-02 1.259e-02 -4.424 1.05e-05 ***
## Condition1_Artery -9.888e-02 1.689e-02 -5.853 6.04e-09 ***
## Condition1_RRAe -1.237e-01 3.190e-02 -3.879 0.000110 ***
## Condition1_RRAn -3.194e-02 2.113e-02 -1.512 0.130848
## Condition2_PosN -8.820e-01 7.785e-02 -11.329 < 2e-16 ***
## Condition2_PosA 2.606e-01 1.132e-01 2.302 0.021463 *
## Condition2_RRAe -5.176e-01 1.572e-01 -3.293 0.001016 **
## BldgType_2fmCon 4.499e-02 2.569e-02 1.751 0.080157 .
## BldgType_Twnhs -5.161e-02 1.936e-02 -2.666 0.007766 **
## HouseStyle_1.5Fin 2.314e-02 1.096e-02 2.112 0.034906 *
## HouseStyle_SLvl 2.600e-02 1.519e-02 1.711 0.087315 .
## HouseStyle_2.5Unf 5.335e-02 3.583e-02 1.489 0.136671
## RoofStyle_Shed 3.578e-01 1.140e-01 3.139 0.001734 **
## RoofMatl_Metal 2.604e-01 1.158e-01 2.249 0.024696 *
## RoofMatl_Membran 4.159e-01 1.181e-01 3.522 0.000443 ***
## RoofMatl_ClyTile -2.622e+00 1.302e-01 -20.141 < 2e-16 ***
## Exterior1st_MetalSd 1.246e-02 9.107e-03 1.368 0.171489
## Exterior1st_WdSdng -4.830e-02 1.747e-02 -2.765 0.005771 **
## Exterior1st_BrkFace 7.566e-02 2.300e-02 3.290 0.001029 **
## Exterior1st_CemntBd -1.061e-01 6.389e-02 -1.661 0.097019 .
## Exterior1st_BrkComm -1.707e-01 7.910e-02 -2.158 0.031115 *
## Exterior2nd_HdBoard -1.760e-02 9.020e-03 -1.951 0.051247 .
## Exterior2nd_WdSdng 3.752e-02 1.744e-02 2.151 0.031675 *
## Exterior2nd_CmentBd 1.211e-01 6.425e-02 1.885 0.059578 .
## Exterior2nd_BrkFace -4.244e-02 3.051e-02 -1.391 0.164355
## Exterior2nd_AsbShng -3.821e-02 2.544e-02 -1.502 0.133352
## ExterCond_Gd -2.200e-02 9.914e-03 -2.220 0.026618 *
## ExterCond_Fa -3.882e-02 2.325e-02 -1.670 0.095176 .
## Foundation_CBlock -2.692e-02 9.091e-03 -2.961 0.003117 **
## Foundation_BrkTil -4.136e-02 1.424e-02 -2.903 0.003751 **
## Foundation_Wood -1.646e-01 6.161e-02 -2.672 0.007640 **
## Foundation_Slab -4.804e-02 2.892e-02 -1.661 0.096955 .
## BsmtQual_Ex 3.266e-02 1.355e-02 2.410 0.016081 *
## BsmtCond_Po 3.123e-01 1.226e-01 2.546 0.010996 *
## BsmtExposure_Gd 4.851e-02 1.192e-02 4.069 5.00e-05 ***
## BsmtExposure_Av 1.564e-02 8.681e-03 1.801 0.071907 .
## BsmtFinType1_GLQ 1.964e-02 8.887e-03 2.210 0.027279 *
## BsmtFinType2_BLQ -4.887e-02 1.939e-02 -2.521 0.011832 *
## Heating_GasW 7.139e-02 2.796e-02 2.553 0.010793 *
## Heating_Grav -1.854e-01 4.648e-02 -3.989 7.00e-05 ***
## HeatingQC_Gd -2.281e-02 8.670e-03 -2.631 0.008609 **
## HeatingQC_TA -3.738e-02 8.347e-03 -4.478 8.16e-06 ***
## HeatingQC_Fa -2.584e-02 1.849e-02 -1.397 0.162579
## CentralAir_N -5.592e-02 1.484e-02 -3.769 0.000171 ***
## Electrical_Mix -3.383e-01 1.745e-01 -1.938 0.052818 .
## KitchenQual_Ex 5.188e-02 1.401e-02 3.704 0.000221 ***
## Functional_Min1 -3.875e-02 1.977e-02 -1.960 0.050189 .
## Functional_Maj1 -7.187e-02 3.094e-02 -2.323 0.020324 *
## Functional_Min2 -3.607e-02 1.932e-02 -1.867 0.062174 .
## Functional_Mod -1.386e-01 3.014e-02 -4.597 4.68e-06 ***
## Functional_Maj2 -3.153e-01 5.348e-02 -5.895 4.72e-09 ***
## Functional_Sev -3.562e-01 1.051e-01 -3.391 0.000716 ***
## GarageQual_Fa -3.858e-02 1.919e-02 -2.011 0.044516 *
## GarageQual_Ex 3.888e-01 1.166e-01 3.334 0.000878 ***
## GarageCond_Fa -3.157e-02 2.125e-02 -1.486 0.137526
## GarageCond_Po 1.203e-01 4.819e-02 2.496 0.012695 *
## GarageCond_Ex -3.474e-01 1.382e-01 -2.515 0.012032 *
## SaleType_New 4.990e-02 1.187e-02 4.202 2.82e-05 ***
## SaleType_ConLD 1.326e-01 3.718e-02 3.567 0.000373 ***
## SaleType_CWD 8.259e-02 5.339e-02 1.547 0.122145
## SaleCondition_Abnorml -6.188e-02 1.147e-02 -5.395 8.07e-08 ***
## SaleCondition_Family -5.699e-02 2.389e-02 -2.386 0.017172 *
## HouseStyle_2.5Fin -7.336e-02 5.039e-02 -1.456 0.145691
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1022 on 1349 degrees of freedom
## Multiple R-squared: 0.9395, Adjusted R-squared: 0.9346
## F-statistic: 190.5 on 110 and 1349 DF, p-value: < 2.2e-16
This is the best model by far, though it includes several nonsignificant coefficients from the recoded string variables. \(R^2\) of roughly 94% indicates a strong fit, and the residuals are roughly symmetric. The overall F-test is highly significant, though the large number of predictors leaves fewer residual degrees of freedom.
The goal was to allow R to select the model programmatically. Let’s test these two models with the test set.
# do not allow strings to be imported as factors
hp_test <- read.csv('https://raw.githubusercontent.com/sigmasigmaiota/DATA605/master/test.csv', stringsAsFactors = FALSE)
# create data frame with numeric variables
nums2 <- unlist(lapply(hp_test, is.numeric))
# replace NAs with 0s
hp_test[is.na(hp_test)] <- 0
hpnum_test <- hp_test[,nums2]
# create list of character variables
is_char2 <- colnames(hp_test[,sapply(hp_test, is.character)])
# remove all spaces from character variables in dataframe
hpchar_test <- as.data.frame(apply(hp_test[,is_char2],2,function(x)gsub('\\s+', '',x)),stringsAsFactors = FALSE)
# reassemble
hp_test <- data.frame(hpnum_test, hpchar_test)
# create dummy variables for each, eliminating the referant, defined as the most frequent of each nominal level
hp_test2 <- dummy_columns(hp_test, select_columns = is_char2, ignore_na = FALSE)
names(hp_test2)[names(hp_test2)=="MSZoning_C(all)"] <- "MSZoning_CAll"
names(hp_test2)[names(hp_test2)=="RoofMatl_Tar&Grv"] <- "RoofMatl_TarGrv"
# some variable levels found in the training set are not present in the testing set. In order to proceed, we must include these variables and set value to zero.
hp_test2$Condition2_RRAe <- 0
hp_test2$Condition2_RRAn <- 0
hp_test2$HouseStyle_2.5Fin <- 0
hp_test2$RoofMatl_Metal <- 0
hp_test2$RoofMatl_Membran <- 0
hp_test2$RoofMatl_ClyTile <- 0
hp_test2$Electrical_Mix <- 0
hp_test2$GarageQual_Ex <- 0
#hp_test2$PoolQC_Fa <- 0
hp_test2$Utilities_NoSeWa <- 0
hp_test2$Condition2_RRNn <- 0
hp_test2$RoofMatl_TarGrv <- 0
hp_test2$RoofMatl_Roll <- 0
hp_test2$Exterior1st_Stone <- 0
hp_test2$Exterior2nd_Other <- 0
hp_test2$Heating_OthW <- 0
hp_test2$Heating_Floor <- 0
hp_test2$Electrical_0 <- 0
#hp_test2$MiscFeature_TenC <- 0
hp_test2$Exterior1st_ImStucc <- 0
# remove all variables not found in the model
#keep_test1 <- names(coef(step.model2$finalModel, step.model2$bestTune$nvmax))[-1]
keep_test1 <- names(coef(step.model2$finalModel))[-1]
keep_test2 <- names(coef(step.model3))[-1]
# predict with the model
prd1 <- predict(step.model2, hp_test2)
prd2 <- predict(step.model3, hp_test2)
We’ll bind the predictions with the Id variable from the test set.
# since we are predicting log1p-transformed data, the results must be back-transformed (expm1 inverts log1p)
sol1 <- data.frame(cbind(hp_test2$Id,expm1(prd1)))
colnames(sol1) <- c("Id","SalePrice")
# formerly 4
write.csv(sol1, "C:/Users/steph/Documents/KaggleR/SJPredictions_hp4.csv", row.names = FALSE)
# since we are predicting log1p-transformed data, the results must be back-transformed (expm1 inverts log1p)
sol2 <- data.frame(cbind(hp_test2$Id,expm1(prd2)))
colnames(sol2) <- c("Id","SalePrice")
# formerly 3
write.csv(sol2, "C:/Users/steph/Documents/KaggleR/SJPredictions_hp5.csv", row.names = FALSE)
Kaggle username: stephensjones; Display name: Stephen S Jones
My first submission score was 0.14179, number 3009 on the leaderboard. A second attempt yielded a score of 0.14343; wrong direction! Third attempt: 0.18616; wrong direction. Fourth (and final) attempt: 0.13288, 2317 on the leaderboard.
The stepAIC model with lm yielded the best results, though caret was much faster, with its parallel processing options improving efficiency. Imputation among the numeric variables decreased residuals slightly.
The final regression model is included above.