Final Exam
Problem 1.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=\frac{N+1}{2}\).
N = 46
set.seed(1973)
# random variable X
X <- runif(10000, min = 1, max = N)
# random variable Y
Y <- rnorm(10000, mean = (N+1)/2, sd = (N+1)/2)
# x is median of X
x <- median(X)
# y is first quartile of Y
y <- quantile(Y)[[2]]
Probabilities
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
a. \(P(X > x | X > y )\)
library(tidyverse)
library(dplyr)
Xdf <- data.frame(value = as.numeric(X), stringsAsFactors = FALSE)
cond_prob <- sum(Xdf > x & Xdf > y)/sum(Xdf > y)
cat("The requested probability is",cond_prob[[1]])## The requested probability is 0.5858231
b. \(P(X > x\text{ and } Y > y)\)
Ydf <- data.frame(value = as.numeric(Y), stringsAsFactors = FALSE)
res_prob <- mean(Xdf$value > x) * mean(Ydf$value > y)
cat("The requested probability is",res_prob)## The requested probability is 0.375
c. \(P(X < x\text{ and } X > y )\)
cond_prob2 <- mean(Xdf$value < x) * mean(Xdf$value > y)
cat("The requested probability is",cond_prob2[[1]])## The requested probability is 0.42675
Marginal, Joint Probabilities
Investigate whether \(P(X>x\text{ and } Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
Contingency tables, one with counts and one with probabilities, are below.
XYdf <- data.frame(cbind(Xdf$value,Ydf$value))
colnames(XYdf) <- c('xval','yval')
# compose dataframe with sums
tb_XY <- data.frame(c(sum(XYdf$xval > x & XYdf$yval > y),
sum(XYdf$xval <= x & XYdf$yval > y)),
c(sum(XYdf$xval > x & XYdf$yval <= y),
sum(XYdf$xval <= x & XYdf$yval <= y)),
row.names=c('X > x','X <= x'))
colnames(tb_XY) <- c('Y > y','Y <= y')
library(kableExtra)
tb_XY %>%
kable(align = "c", format = "html")%>%
kable_styling(bootstrap_options = c("striped", "hover"),full_width = F)%>%
column_spec(1, bold = T, border_right = T)%>%
column_spec(2, border_right = T)

|        | Y > y | Y <= y |
|--------|-------|--------|
| X > x  | 3760  | 1240   |
| X <= x | 3740  | 1260   |
Marginal probabilities:
\(P(Y>y)=\) 0.75.
\(P(Y\le y)=\) 0.25.
\(P(X>x)=\) 0.5.
\(P(X\le x)=\) 0.5.
Joint probabilities:
\(P(X>x\text{ and }Y>y)=\) 0.376.
\(P(X>x\text{ and }Y\le y)=\) 0.124.
\(P(X\le x\text{ and }Y>y)=\) 0.374.
\(P(X\le x\text{ and }Y\le y)=\) 0.126.
These probabilities are confirmed in the table below.
# calculate proportion
tb_prop_XY <- tb_XY/10000
tb_prop_XY %>%
kable(align = "c", format = "html",
table.attr = 'class="table table-striped table-hover"')%>%
kable_styling(bootstrap_options = c("striped", "hover"),full_width = F)%>%
column_spec(1, bold = T, border_right = T)%>%
column_spec(2, border_right = T)

|        | Y > y | Y <= y |
|--------|-------|--------|
| X > x  | 0.376 | 0.124  |
| X <= x | 0.374 | 0.126  |
Further, the joint probability \(P(X>x\text{ and }Y>y)=\) 0.376, while the product of the marginal probabilities \(P(X>x)=\) 0.5 and \(P(Y>y)=\) 0.75 is \(P(X>x)P(Y>y)=\) 0.375. The two values are nearly identical, consistent with independence.
Fisher’s, Chi-squared
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher’s results:
library(stats)
XY_matrix <- as.matrix(tb_XY)
fisher.test(XY_matrix, alternative = "two.sided", hybrid = TRUE)
##
## Fisher's Exact Test for Count Data
##
## data: XY_matrix
## p-value = 0.6608
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.932132 1.119534
## sample estimates:
## odds ratio
## 1.021566
Chi-squared results:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tb_XY
## X-squared = 0.19253, df = 1, p-value = 0.6608
Fisher’s Exact Test computes an exact p-value and is typically used for 2-by-2 contingency tables, especially when cell counts are small; the chi-squared test relies on a large-sample approximation and applies to two categorical variables with two or more categories each, provided the expected cell counts are sufficiently large. With 10,000 observations in a 2-by-2 table, either is defensible, and the chi-squared test is the more conventional choice here.
Since the p-values of the two tests are essentially identical and both are greater than 0.05, we do not reject the null hypothesis of the test (the distribution of X is independent of the distribution of Y) at the 5% significance level. This is the expected result, as the variables were generated randomly and independently.
Problem 2.
Descriptives
Provide univariate descriptive statistics and appropriate plots for the training data set.
# do not allow strings to be imported as factors
hp_train <- read.csv('https://raw.githubusercontent.com/sigmasigmaiota/DATA605/master/train.csv', stringsAsFactors = FALSE)
# omit the iterative ID column
hp_train[,1] <- NULL
Using datatable from DT:
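The interactive table itself does not survive in this rendering; a minimal sketch of the call, where the display options are assumptions:
library(DT)
# browseable table of the training data; pageLength and scrollX are assumed options
datatable(hp_train, options = list(pageLength = 5, scrollX = TRUE))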
Scatterplot Matrix
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
Using the plot above, we selected three independent variables (OverallQual, TotRmsAbvGrd, and GrLivArea) along with the dependent variable, SalePrice.
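The scatterplot matrix itself is not reproduced here; a minimal sketch using base R’s pairs() on the selected variables (the plotting function is an assumption, not necessarily the one used to draw the original figure):
# scatterplot matrix of the three predictors and the dependent variable
pairs(hp_train[, c("OverallQual", "TotRmsAbvGrd", "GrLivArea", "SalePrice")],
      pch = 20, col = rgb(0, 0, 1, 0.25))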
Correlation Matrix
Derive a correlation matrix for any three quantitative variables in the dataset.
Let’s test for normality in the three variables. We’ll render histograms and Q-Q plots for each of them, followed by Shapiro-Wilk and Kolmogorov-Smirnov tests for normality.
# create data frame with numeric variables
nums <- unlist(lapply(hp_train, is.numeric))
# replace NAs with 0s
hp_train[is.na(hp_train)] <- 0
hpnum_train <- hp_train[,nums]
library(ggpubr)
library(ggplot2)
library(cowplot)
p1<-ggplot(data=hpnum_train, aes(x=OverallQual)) +
geom_histogram()+
labs(title="OverallQual")
p2<-ggplot(data=hpnum_train, aes(x=GrLivArea)) +
geom_histogram()+
labs(title="GrLivArea")
p3<-ggplot(data=hpnum_train, aes(x=TotRmsAbvGrd)) +
geom_histogram()+
labs(title="TotRmsAbvGrd")
p4<-ggqqplot(hpnum_train$OverallQual, ylab="OverallQual")
p5<-ggqqplot(hpnum_train$GrLivArea, ylab="GrLivArea")
p6<-ggqqplot(hpnum_train$TotRmsAbvGrd, ylab="TotRmsAbvGrd")
plot_grid(p1,p2,p3,p4,p5,p6,nrow=2,ncol=3)
GrLivArea clearly does not follow a normal distribution; while OverallQual seems to approximate a discrete normal distribution, TotRmsAbvGrd seems to deviate significantly. Test results below will confirm these observations.
## OverallQual:
##
## Shapiro-Wilk normality test
##
## data: hpnum_train$OverallQual
## W = 0.94801, p-value < 2.2e-16
##
## One-sample Kolmogorov-Smirnov test
##
## data: hpnum_train$OverallQual
## D = 0.99523, p-value < 2.2e-16
## alternative hypothesis: two-sided
## GrLivArea:
##
## Shapiro-Wilk normality test
##
## data: hpnum_train$GrLivArea
## W = 0.92798, p-value < 2.2e-16
##
## One-sample Kolmogorov-Smirnov test
##
## data: hpnum_train$GrLivArea
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
## TotRmsAbvGrd:
##
## Shapiro-Wilk normality test
##
## data: hpnum_train$TotRmsAbvGrd
## W = 0.94228, p-value < 2.2e-16
##
## One-sample Kolmogorov-Smirnov test
##
## data: hpnum_train$TotRmsAbvGrd
## D = 0.99797, p-value < 2.2e-16
## alternative hypothesis: two-sided
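One caveat on the Kolmogorov-Smirnov output above: when no mean or standard deviation is supplied, ks.test compares against a standard normal, which is why the D statistics are so close to 1. A sketch of the parameterized comparison (illustrative only; ties in the data will still trigger a warning because the KS test assumes a continuous distribution):
# KS test against a normal with the sample mean and standard deviation
ks.test(hpnum_train$GrLivArea, "pnorm",
        mean = mean(hpnum_train$GrLivArea), sd = sd(hpnum_train$GrLivArea))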
According to our analyses, the variables are not normally distributed. This determines the method of the correlation tests; we’ll use method = "kendall", a nonparametric test.
library(corrplot)
# obtain correlations
TT <-cor(hpnum_train[c("OverallQual","TotRmsAbvGrd","GrLivArea")], method = "kendall")
# plot correlation plot
cex.before <- par("cex")
par(cex = 0.9)
corrplot(TT, insig = "blank", method = "color",
addCoef.col="grey",
order = "AOE", tl.cex = 1/par("cex"),
cl.cex = 1/par("cex"), addCoefasPercent = TRUE)
Correlation Tests
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
Because we are using the nonparametric Kendall method, cor.test tests the null hypothesis that the true tau is zero (here at the 5% significance level) but does not return a confidence interval; the 80% intervals are computed separately below.
c_OG <- cor.test(hpnum_train$OverallQual, hpnum_train$GrLivArea, method = "kendall", use = "complete.obs")
c_OT <- cor.test(hpnum_train$OverallQual, hpnum_train$TotRmsAbvGrd, method = "kendall", use = "complete.obs")
c_TG <- cor.test(hpnum_train$TotRmsAbvGrd, hpnum_train$GrLivArea, method = "kendall", use = "complete.obs")
cat("Testing OverallQual and GrLivArea, p-val:",c_OG$p.value," We reject the null hypothesis.\n")## Testing OverallQual and GrLivArea, p-val: 6.870239e-130 We reject the null hypothesis.
cat("Testing OverallQual and TotRmsAbvGrd, p-val:",c_OT$p.value," We reject the null hypothesis.\n")## Testing OverallQual and TotRmsAbvGrd, p-val: 3.056094e-64 We reject the null hypothesis.
## Testing TotRmsAbvGrd and GrLivArea, p-val: 8.066909e-285 We reject the null hypothesis.
To produce the desired confidence intervals for Kendall tau we use kendall.ci from the package NSM3 without the bootstrap option.
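The calls themselves are not displayed in this rendering; a minimal sketch, assuming kendall.ci’s alpha and bootstrap arguments behave as documented in NSM3 (alpha = 0.2 yields an 80% interval):
library(NSM3)
# 80% two-sided confidence intervals for Kendall's tau, without bootstrapping
kendall.ci(hpnum_train$OverallQual, hpnum_train$GrLivArea, alpha = 0.2, bootstrap = FALSE)
kendall.ci(hpnum_train$OverallQual, hpnum_train$TotRmsAbvGrd, alpha = 0.2, bootstrap = FALSE)
kendall.ci(hpnum_train$TotRmsAbvGrd, hpnum_train$GrLivArea, alpha = 0.2, bootstrap = FALSE)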
## OverallQual and GrLivArea:
##
## 1 - alpha = 0.8 two-sided CI for tau:
## 0.447, 0.481
## OverallQual and TotRmsAbvGrd:
##
## 1 - alpha = 0.8 two-sided CI for tau:
## 0.331, 0.371
## TotRmsAbvGrd and GrLivArea:
##
## 1 - alpha = 0.8 two-sided CI for tau:
## 0.672, 0.697
Discussion
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
As expected, none of the confidence intervals spans zero, confirming our rejection of the null hypotheses: each pair of variables is correlated. Given the meaning of each variable, this result is logical. With these findings I would not be worried about familywise error; even using the Bonferroni correction, each of the p-values is far below \(0.05/3\approx 0.0167\), confirming our original result.
LU Decomposition
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Let’s check that the determinant does not equal zero.
## [1] 0.4161713
Since the determinant does not equal zero, the inverse exists. We use the library matlib to find the inverse. The product of the correlation matrix and the precision matrix:
## [,1] [,2] [,3]
## [1,] 1.000000e+00 1.526603e-09 -6.066037e-11
## [2,] 5.717334e-09 1.000000e+00 -4.510915e-09
## [3,] 5.234534e-09 -3.307733e-09 1.000000e+00
## which is effectively
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
The product of the precision matrix and the correlation matrix:
## [,1] [,2] [,3]
## [1,] 1.000000e+00 5.717334e-09 5.234534e-09
## [2,] 1.526603e-09 1.000000e+00 -3.307733e-09
## [3,] -6.066037e-11 -4.510915e-09 1.000000e+00
## which is effectively
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
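The code behind these checks is not shown in this rendering; a minimal sketch using base R’s solve() (equivalent to inverting with matlib), assuming TT is the Kendall correlation matrix from the previous section:
# determinant check, precision matrix, and both products rounded for readability
det(TT)
PM <- solve(TT)       # precision matrix: inverse of the correlation matrix
round(TT %*% PM, 10)  # correlation %*% precision, effectively the identity
round(PM %*% TT, 10)  # precision %*% correlation, effectively the identity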
We’ll obtain the lower and upper decompositions of the matrix and its inverse using the package pracma.
## Decomposition of the correlation matrix:
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.3513492 1.0000000 0
## [3,] 0.4641893 0.5944486 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.3513492 0.4641893
## [2,] 0 0.8765538 0.5210662
## [3,] 0 0.0000000 0.4747812
## Decomposition of the inverse:
## $L
## [,1] [,2] [,3]
## [1,] 1.00000000 0.0000000 0
## [2,] -0.06348622 1.0000000 0
## [3,] -0.42075466 -0.6841587 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1.278144 -0.08114454 -0.5377851
## [2,] 0.000000 1.87995768 -1.2861894
## [3,] 0.000000 0.00000000 1.0000000
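A sketch of the decomposition calls, assuming pracma’s lu() and the TT and PM objects defined in the sketch above:
library(pracma)
lu(TT)  # lower/upper decomposition of the correlation matrix
lu(PM)  # lower/upper decomposition of the precision matrix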
Exponential PDF
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html).
According to the histogram of GrLivArea above, the variable is skewed to the right. The MASS package was loaded at an earlier point in this markdown. Let’s test that the mean is greater than the median in the training set and the minimum value of the variable is greater than zero.
## [1] TRUE
## [1] TRUE
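The checks that produced these two results are not shown; they take the following form (a sketch):
# right skew: the mean exceeds the median; support is strictly above zero
mean(hp_train$GrLivArea) > median(hp_train$GrLivArea)
min(hp_train$GrLivArea) > 0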
We’ll use fitdistr to fit the pdf.
## rate
## 6.598640e-04
## (1.726943e-05)
\(\lambda\) Histogram
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable.
fit <- fitdistr(hp_train$GrLivArea, "exponential")
set.seed(1974)
est_data <- rexp(1000, fit$estimate[1]) # sample from the exponential using the fitted rate parameter
par(mfrow=c(1,2))
hist(hp_train$GrLivArea, pch=20, breaks=50, prob=TRUE, col = "lightblue", ylim = c(0,.0008), xlim = c(0,9000))
hist(est_data, pch=20, breaks=50, prob=TRUE, col = "orange", ylim = c(0,.0008), xlim = c(0,9000))
The distributions look very different with the axes set for optimum comparison. To continue experimentation, I would try a truncated negative binomial distribution to approximate the original variable.
CDF Percentiles
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
For \(x < 0\), \(F(x;\lambda)=0\); for \(x \ge 0\) we have:
\[\begin{aligned} F(x;\lambda)= P(X\le x)&=\int_{0}^{x} \lambda e^{-\lambda t}\,dt\\ &=\bigg[-e^{-\lambda t}\bigg]_0^{x}\\ &=1 - e^{-\lambda x}\\ \end{aligned}\] To find the 5th percentile:
\[\begin{aligned} F(x;0.000659864 )= P(X\le x)&=1 - e^{-0.000659864 x}=.05\\ \implies 0.95 &= e^{-0.000659864 x}\\ \log(0.95)&=-0.000659864 x\\ x&=\frac{\log(0.95)}{-0.000659864}\\ x&=77.73313\\ \end{aligned}\]
To find the 95th percentile:
\[\begin{aligned} F(x;0.000659864 )= P(X\le x)&=1 - e^{-0.000659864 x}=.95\\ \implies 0.05 &= e^{-0.000659864 x}\\ \log(0.05)&=-0.000659864 x\\ x&=\frac{\log(0.05)}{-0.000659864}\\ x&=4539.924\\ \end{aligned}\]
Programmatically:
q <- seq(0.001,0.999,0.001)
qexpFIT <- data.frame(Q=q, Quantile=qexp(q, rate=fit$estimate[1]))
cat("The 5th percentile:",qexpFIT[50,2],"\n")## The 5th percentile: 77.73313
## The 95th percentile: 4539.924
Empirical CI
Also generate a 95% confidence interval from the empirical data, assuming normality.
The data in hp_train$GrLivArea are not normally distributed; however, we will assume normality.
m <- mean(hp_train$GrLivArea)
sdev <- sd(hp_train$GrLivArea)
n <- length(hp_train$GrLivArea)
error <- qnorm(0.975)*sdev/sqrt(n)
lower <- m-error
upper <- m+error
cat("Assuming normality, the 95% confidence interval is (",lower,",",upper,")\n")## Assuming normality, the 95% confidence interval is ( 1488.509 , 1542.418 )
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
elow <- quantile(hp_train$GrLivArea, probs = 0.05)
eupp <- quantile(hp_train$GrLivArea, probs = 0.95)
cat("The empirical 5th and 95th percentiles are",elow,"and",eupp,"respectively.\n")## The empirical 5th and 95th percentiles are 848 and 2466.1 respectively.
The empirical 5th and 95th percentiles (848 and 2466.1) are quite different from the interval calculated under the normality assumption (1488.509, 1542.418). Part of the difference is expected because the interval above is a confidence interval for the mean rather than a range covering 90% of the observations, but the mismatch also reflects the fact that GrLivArea is not normally distributed. The mean exceeds the median, and far more observations fall below the mean than above it, so the mean, standard deviation, and error should be re-estimated under a distribution that reflects this right skew.
Let’s try the negative binomial distribution instead. A comparison of the histograms is below.
fit2 <- fitdistr(hp_train$GrLivArea, "negative binomial")
set.seed(1975)
est2_data <- rnegbin(1000, fit2$estimate[2],fit2$estimate[1]) # simulate using the fitted mu and size (theta) parameters
par(mfrow=c(1,2))
hist(hp_train$GrLivArea, pch=20, breaks=50, prob=TRUE, col = "lightblue", ylim = c(0,.0008), xlim = c(0,9000))
hist(est2_data, pch=20, breaks=50, prob=TRUE, col = "red", ylim = c(0,.0008), xlim = c(0,9000))
The fit is much better and the histograms closely resemble one another. Let’s use these parameters to calculate the confidence interval.
# mean is calculated from the estimated parameter of the proposed model
m2 <- fit2$estimate[2]
# standard deviation is calculated from the estimated parameter of the proposed model
sdev2 <- fit2$sd[2]
n2 <- length(hp_train$GrLivArea)
# we are estimating error following a negative binomial distribution, not a normal distribution
error2 <- qnbinom(0.975, mu = m2, size = fit2$estimate[1])*sdev2/sqrt(n2)
lower2 <- m2-error2
upper2 <- m2+error2
cat("Using the negative binomial parameters, the 95% confidence interval is (",lower2,",",upper2,")\n")## Using the negative binomial parameters, the 95% confidence interval is ( 607.709 , 2423.218 )
Our estimated confidence interval (607.709, 2423.218) much more closely resembles the percentiles from the empirical data, 848 and 2466.1. Model estimation and fit improve drastically when the appropriate model is selected.
Regression Result
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
p7<-ggplot(data=hp_train, aes(x=SalePrice)) +
geom_histogram()+
labs(title="SalePrice")
p8<-ggqqplot(hp_train$SalePrice, ylab="SalePrice")
plot_grid(p7,p8)
Let’s use the descdist rendering to suggest a distribution for the dependent variable.
## summary statistics
## ------
## min: 34900 max: 755000
## median: 163000
## mean: 180921.2
## estimated sd: 79442.5
## estimated skewness: 1.882876
## estimated kurtosis: 9.536282
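The descdist call (from the fitdistrplus package) is not displayed in this rendering; a minimal sketch:
library(fitdistrplus)
# Cullen and Frey graph plus the summary statistics printed above
descdist(hp_train$SalePrice)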
From the figure above, it looks as if the lognormal distribution most closely approximates the empirical values in our data. Let’s compare with negative binomial.
fit_ln <- fitdist(hp_train$SalePrice, "lnorm")
fit_nb <- fitdist(hp_train$SalePrice, "nbinom")
par(mfrow=c(2,2))
plot.legend <- c("lognormal", "neg bin")
denscomp(list(fit_ln, fit_nb), legendtext = plot.legend)
cdfcomp (list(fit_ln, fit_nb), legendtext = plot.legend)
qqcomp (list(fit_ln, fit_nb), legendtext = plot.legend)
ppcomp (list(fit_ln, fit_nb), legendtext = plot.legend)
We’ll select a lognormal distribution, as the negative binomial distribution fit underperforms.
# create transformed variable in the numeric-only dataset and the original dataset
hpnum_train$log_SalePrice <- log1p(hpnum_train$SalePrice)
hp_train$log_SalePrice <- log1p(hp_train$SalePrice)
hpnum_train$SalePrice <- NULL
hp_train$SalePrice <- NULL
p9<-ggplot(data=hpnum_train, aes(x=log_SalePrice))+
geom_histogram()+
labs(title="Log SalePrice")
p10<-ggqqplot(hpnum_train$log_SalePrice, ylab="Log SalePrice")
plot_grid(p9,p10)
The transformation results in a nearly normal distribution. Let’s fit a model with the numeric variables.
Method: “stepAIC”
We begin by starting over with a fresh dataset and imputing missing values in the numeric variables.
# do not allow strings to be imported as factors
hp_train <- read.csv('https://raw.githubusercontent.com/sigmasigmaiota/DATA605/master/train.csv', stringsAsFactors = FALSE)
# omit the iterative ID column
hp_train[,1] <- NULL
Before imputing missing data we assess the percentage of missing values in each variable and omit those with more than 7% of observations missing.
options(max.print=1000000)
pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(hp_train,2,pMiss)
## MSSubClass MSZoning LotFrontage LotArea Street
## 0.00000000 0.00000000 17.73972603 0.00000000 0.00000000
## Alley LotShape LandContour Utilities LotConfig
## 93.76712329 0.00000000 0.00000000 0.00000000 0.00000000
## LandSlope Neighborhood Condition1 Condition2 BldgType
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 0.00000000 0.00000000 0.00000000 0.00000000 0.54794521
## MasVnrArea ExterQual ExterCond Foundation BsmtQual
## 0.54794521 0.00000000 0.00000000 0.00000000 2.53424658
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 2.53424658 2.60273973 2.53424658 0.00000000 2.60273973
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 0.00000000 0.06849315 0.00000000 0.00000000 0.00000000
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## 0.00000000 47.26027397 5.54794521 5.54794521 5.54794521
## GarageCars GarageArea GarageQual GarageCond PavedDrive
## 0.00000000 0.00000000 5.54794521 5.54794521 0.00000000
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## PoolArea PoolQC Fence MiscFeature MiscVal
## 0.00000000 99.52054795 80.75342466 96.30136986 0.00000000
## MoSold YrSold SaleType SaleCondition SalePrice
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
Let’s eliminate the variables missing more than 7%.
drop <- c('FireplaceQu','GarageType','PoolQC','MiscFeature','Fence','LotFrontage','Alley')
hp_train = hp_train[,!(names(hp_train) %in% drop)]
# create data frame with numeric variables
nums <- unlist(lapply(hp_train, is.numeric))
hpnum_train <- hp_train[,nums]
# create transformed variable in the numeric-only dataset and the original dataset
hpnum_train$log_SalePrice <- log1p(hpnum_train$SalePrice)
hp_train$log_SalePrice <- log1p(hp_train$SalePrice)
hpnum_train$SalePrice <- NULL
hp_train$SalePrice <- NULL
Let’s impute the missing data using the MICE package.
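The mice call that produced the log below is not shown; a minimal sketch consistent with that log (5 imputations, 50 iterations, predictive mean matching), where the argument values and the seed are assumptions:
library(mice)
# impute missing numeric values; m, maxit, method, and seed are assumed here
imp <- mice(hpnum_train, m = 5, maxit = 50, method = "pmm", seed = 1976)
summary(imp)
# carry one completed dataset forward for modeling
hpnum_train <- complete(imp)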
##
## iter imp variable
## 1 1 MasVnrArea* GarageYrBlt*
## 1 2 MasVnrArea* GarageYrBlt*
## 1 3 MasVnrArea GarageYrBlt
## 1 4 MasVnrArea GarageYrBlt*
## 1 5 MasVnrArea* GarageYrBlt*
## 2 1 MasVnrArea GarageYrBlt
## 2 2 MasVnrArea GarageYrBlt*
## 2 3 MasVnrArea* GarageYrBlt*
## 2 4 MasVnrArea GarageYrBlt
## 2 5 MasVnrArea GarageYrBlt*
## 3 1 MasVnrArea* GarageYrBlt*
## 3 2 MasVnrArea* GarageYrBlt*
## 3 3 MasVnrArea GarageYrBlt*
## 3 4 MasVnrArea GarageYrBlt
## 3 5 MasVnrArea GarageYrBlt*
## 4 1 MasVnrArea* GarageYrBlt*
## 4 2 MasVnrArea* GarageYrBlt*
## 4 3 MasVnrArea GarageYrBlt*
## 4 4 MasVnrArea* GarageYrBlt*
## 4 5 MasVnrArea* GarageYrBlt*
## 5 1 MasVnrArea* GarageYrBlt
## 5 2 MasVnrArea* GarageYrBlt*
## 5 3 MasVnrArea* GarageYrBlt*
## 5 4 MasVnrArea GarageYrBlt*
## 5 5 MasVnrArea* GarageYrBlt*
## 6 1 MasVnrArea GarageYrBlt*
## 6 2 MasVnrArea* GarageYrBlt
## 6 3 MasVnrArea GarageYrBlt*
## 6 4 MasVnrArea GarageYrBlt*
## 6 5 MasVnrArea* GarageYrBlt*
## 7 1 MasVnrArea GarageYrBlt
## 7 2 MasVnrArea* GarageYrBlt*
## 7 3 MasVnrArea* GarageYrBlt
## 7 4 MasVnrArea GarageYrBlt
## 7 5 MasVnrArea GarageYrBlt
## 8 1 MasVnrArea GarageYrBlt
## 8 2 MasVnrArea* GarageYrBlt*
## 8 3 MasVnrArea* GarageYrBlt*
## 8 4 MasVnrArea* GarageYrBlt*
## 8 5 MasVnrArea* GarageYrBlt
## 9 1 MasVnrArea* GarageYrBlt*
## 9 2 MasVnrArea* GarageYrBlt*
## 9 3 MasVnrArea* GarageYrBlt
## 9 4 MasVnrArea GarageYrBlt
## 9 5 MasVnrArea GarageYrBlt*
## 10 1 MasVnrArea GarageYrBlt
## 10 2 MasVnrArea GarageYrBlt*
## 10 3 MasVnrArea* GarageYrBlt*
## 10 4 MasVnrArea GarageYrBlt
## 10 5 MasVnrArea GarageYrBlt
## 11 1 MasVnrArea GarageYrBlt*
## 11 2 MasVnrArea* GarageYrBlt*
## 11 3 MasVnrArea* GarageYrBlt*
## 11 4 MasVnrArea GarageYrBlt*
## 11 5 MasVnrArea GarageYrBlt
## 12 1 MasVnrArea* GarageYrBlt*
## 12 2 MasVnrArea* GarageYrBlt
## 12 3 MasVnrArea GarageYrBlt*
## 12 4 MasVnrArea GarageYrBlt*
## 12 5 MasVnrArea* GarageYrBlt*
## 13 1 MasVnrArea GarageYrBlt*
## 13 2 MasVnrArea* GarageYrBlt*
## 13 3 MasVnrArea GarageYrBlt
## 13 4 MasVnrArea GarageYrBlt*
## 13 5 MasVnrArea* GarageYrBlt*
## 14 1 MasVnrArea* GarageYrBlt*
## 14 2 MasVnrArea* GarageYrBlt*
## 14 3 MasVnrArea* GarageYrBlt
## 14 4 MasVnrArea* GarageYrBlt*
## 14 5 MasVnrArea GarageYrBlt
## 15 1 MasVnrArea GarageYrBlt
## 15 2 MasVnrArea GarageYrBlt*
## 15 3 MasVnrArea* GarageYrBlt*
## 15 4 MasVnrArea GarageYrBlt
## 15 5 MasVnrArea GarageYrBlt*
## 16 1 MasVnrArea GarageYrBlt
## 16 2 MasVnrArea* GarageYrBlt*
## 16 3 MasVnrArea* GarageYrBlt
## 16 4 MasVnrArea* GarageYrBlt*
## 16 5 MasVnrArea* GarageYrBlt*
## 17 1 MasVnrArea* GarageYrBlt*
## 17 2 MasVnrArea* GarageYrBlt*
## 17 3 MasVnrArea GarageYrBlt*
## 17 4 MasVnrArea GarageYrBlt*
## 17 5 MasVnrArea* GarageYrBlt*
## 18 1 MasVnrArea* GarageYrBlt
## 18 2 MasVnrArea GarageYrBlt*
## 18 3 MasVnrArea GarageYrBlt*
## 18 4 MasVnrArea* GarageYrBlt*
## 18 5 MasVnrArea* GarageYrBlt*
## 19 1 MasVnrArea GarageYrBlt*
## 19 2 MasVnrArea* GarageYrBlt*
## 19 3 MasVnrArea* GarageYrBlt*
## 19 4 MasVnrArea GarageYrBlt
## 19 5 MasVnrArea GarageYrBlt
## 20 1 MasVnrArea GarageYrBlt*
## 20 2 MasVnrArea* GarageYrBlt*
## 20 3 MasVnrArea GarageYrBlt*
## 20 4 MasVnrArea* GarageYrBlt*
## 20 5 MasVnrArea* GarageYrBlt*
## 21 1 MasVnrArea GarageYrBlt*
## 21 2 MasVnrArea* GarageYrBlt*
## 21 3 MasVnrArea* GarageYrBlt*
## 21 4 MasVnrArea GarageYrBlt*
## 21 5 MasVnrArea* GarageYrBlt*
## 22 1 MasVnrArea* GarageYrBlt*
## 22 2 MasVnrArea GarageYrBlt
## 22 3 MasVnrArea GarageYrBlt
## 22 4 MasVnrArea* GarageYrBlt
## 22 5 MasVnrArea* GarageYrBlt*
## 23 1 MasVnrArea* GarageYrBlt*
## 23 2 MasVnrArea GarageYrBlt
## 23 3 MasVnrArea* GarageYrBlt*
## 23 4 MasVnrArea* GarageYrBlt*
## 23 5 MasVnrArea* GarageYrBlt*
## 24 1 MasVnrArea GarageYrBlt*
## 24 2 MasVnrArea* GarageYrBlt*
## 24 3 MasVnrArea GarageYrBlt*
## 24 4 MasVnrArea GarageYrBlt*
## 24 5 MasVnrArea* GarageYrBlt*
## 25 1 MasVnrArea* GarageYrBlt*
## 25 2 MasVnrArea GarageYrBlt*
## 25 3 MasVnrArea GarageYrBlt
## 25 4 MasVnrArea* GarageYrBlt
## 25 5 MasVnrArea* GarageYrBlt
## 26 1 MasVnrArea* GarageYrBlt*
## 26 2 MasVnrArea* GarageYrBlt*
## 26 3 MasVnrArea* GarageYrBlt
## 26 4 MasVnrArea* GarageYrBlt
## 26 5 MasVnrArea GarageYrBlt
## 27 1 MasVnrArea GarageYrBlt
## 27 2 MasVnrArea* GarageYrBlt*
## 27 3 MasVnrArea GarageYrBlt*
## 27 4 MasVnrArea* GarageYrBlt*
## 27 5 MasVnrArea GarageYrBlt*
## 28 1 MasVnrArea GarageYrBlt*
## 28 2 MasVnrArea* GarageYrBlt*
## 28 3 MasVnrArea* GarageYrBlt*
## 28 4 MasVnrArea GarageYrBlt
## 28 5 MasVnrArea GarageYrBlt
## 29 1 MasVnrArea GarageYrBlt
## 29 2 MasVnrArea* GarageYrBlt
## 29 3 MasVnrArea GarageYrBlt
## 29 4 MasVnrArea* GarageYrBlt*
## 29 5 MasVnrArea* GarageYrBlt*
## 30 1 MasVnrArea GarageYrBlt*
## 30 2 MasVnrArea GarageYrBlt*
## 30 3 MasVnrArea* GarageYrBlt*
## 30 4 MasVnrArea* GarageYrBlt
## 30 5 MasVnrArea GarageYrBlt
## 31 1 MasVnrArea* GarageYrBlt*
## 31 2 MasVnrArea GarageYrBlt*
## 31 3 MasVnrArea* GarageYrBlt*
## 31 4 MasVnrArea GarageYrBlt
## 31 5 MasVnrArea* GarageYrBlt*
## 32 1 MasVnrArea GarageYrBlt*
## 32 2 MasVnrArea* GarageYrBlt*
## 32 3 MasVnrArea GarageYrBlt
## 32 4 MasVnrArea GarageYrBlt
## 32 5 MasVnrArea GarageYrBlt
## 33 1 MasVnrArea* GarageYrBlt*
## 33 2 MasVnrArea* GarageYrBlt*
## 33 3 MasVnrArea* GarageYrBlt*
## 33 4 MasVnrArea GarageYrBlt
## 33 5 MasVnrArea* GarageYrBlt*
## 34 1 MasVnrArea GarageYrBlt
## 34 2 MasVnrArea* GarageYrBlt*
## 34 3 MasVnrArea* GarageYrBlt*
## 34 4 MasVnrArea* GarageYrBlt*
## 34 5 MasVnrArea GarageYrBlt
## 35 1 MasVnrArea* GarageYrBlt*
## 35 2 MasVnrArea* GarageYrBlt*
## 35 3 MasVnrArea* GarageYrBlt*
## 35 4 MasVnrArea* GarageYrBlt*
## 35 5 MasVnrArea* GarageYrBlt
## 36 1 MasVnrArea GarageYrBlt*
## 36 2 MasVnrArea GarageYrBlt*
## 36 3 MasVnrArea* GarageYrBlt*
## 36 4 MasVnrArea* GarageYrBlt*
## 36 5 MasVnrArea* GarageYrBlt*
## 37 1 MasVnrArea* GarageYrBlt*
## 37 2 MasVnrArea* GarageYrBlt*
## 37 3 MasVnrArea GarageYrBlt*
## 37 4 MasVnrArea GarageYrBlt*
## 37 5 MasVnrArea* GarageYrBlt*
## 38 1 MasVnrArea* GarageYrBlt*
## 38 2 MasVnrArea* GarageYrBlt*
## 38 3 MasVnrArea* GarageYrBlt
## 38 4 MasVnrArea GarageYrBlt*
## 38 5 MasVnrArea GarageYrBlt
## 39 1 MasVnrArea* GarageYrBlt
## 39 2 MasVnrArea GarageYrBlt*
## 39 3 MasVnrArea* GarageYrBlt*
## 39 4 MasVnrArea* GarageYrBlt
## 39 5 MasVnrArea* GarageYrBlt*
## 40 1 MasVnrArea* GarageYrBlt
## 40 2 MasVnrArea* GarageYrBlt
## 40 3 MasVnrArea* GarageYrBlt*
## 40 4 MasVnrArea* GarageYrBlt*
## 40 5 MasVnrArea* GarageYrBlt
## 41 1 MasVnrArea* GarageYrBlt*
## 41 2 MasVnrArea* GarageYrBlt*
## 41 3 MasVnrArea* GarageYrBlt
## 41 4 MasVnrArea GarageYrBlt*
## 41 5 MasVnrArea* GarageYrBlt*
## 42 1 MasVnrArea* GarageYrBlt*
## 42 2 MasVnrArea* GarageYrBlt*
## 42 3 MasVnrArea GarageYrBlt*
## 42 4 MasVnrArea* GarageYrBlt
## 42 5 MasVnrArea* GarageYrBlt*
## 43 1 MasVnrArea GarageYrBlt
## 43 2 MasVnrArea GarageYrBlt
## 43 3 MasVnrArea* GarageYrBlt*
## 43 4 MasVnrArea* GarageYrBlt*
## 43 5 MasVnrArea* GarageYrBlt*
## 44 1 MasVnrArea GarageYrBlt
## 44 2 MasVnrArea* GarageYrBlt*
## 44 3 MasVnrArea* GarageYrBlt*
## 44 4 MasVnrArea GarageYrBlt*
## 44 5 MasVnrArea* GarageYrBlt
## 45 1 MasVnrArea GarageYrBlt*
## 45 2 MasVnrArea GarageYrBlt
## 45 3 MasVnrArea GarageYrBlt
## 45 4 MasVnrArea* GarageYrBlt*
## 45 5 MasVnrArea* GarageYrBlt
## 46 1 MasVnrArea GarageYrBlt*
## 46 2 MasVnrArea GarageYrBlt
## 46 3 MasVnrArea* GarageYrBlt*
## 46 4 MasVnrArea* GarageYrBlt*
## 46 5 MasVnrArea* GarageYrBlt*
## 47 1 MasVnrArea* GarageYrBlt*
## 47 2 MasVnrArea GarageYrBlt
## 47 3 MasVnrArea GarageYrBlt
## 47 4 MasVnrArea GarageYrBlt*
## 47 5 MasVnrArea* GarageYrBlt
## 48 1 MasVnrArea* GarageYrBlt*
## 48 2 MasVnrArea GarageYrBlt*
## 48 3 MasVnrArea GarageYrBlt*
## 48 4 MasVnrArea GarageYrBlt
## 48 5 MasVnrArea* GarageYrBlt*
## 49 1 MasVnrArea GarageYrBlt*
## 49 2 MasVnrArea* GarageYrBlt*
## 49 3 MasVnrArea GarageYrBlt*
## 49 4 MasVnrArea GarageYrBlt*
## 49 5 MasVnrArea* GarageYrBlt*
## 50 1 MasVnrArea* GarageYrBlt
## 50 2 MasVnrArea GarageYrBlt*
## 50 3 MasVnrArea* GarageYrBlt*
## 50 4 MasVnrArea GarageYrBlt*
## 50 5 MasVnrArea* GarageYrBlt*
## * Please inspect the loggedEvents
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## MSSubClass LotArea OverallQual OverallCond YearBuilt
## "" "" "" "" ""
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## "" "pmm" "" "" ""
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## "" "" "" "" ""
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## "" "" "" "" ""
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## "" "" "" "pmm" ""
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## "" "" "" "" ""
## ScreenPorch PoolArea MiscVal MoSold YrSold
## "" "" "" "" ""
## log_SalePrice
## ""
## PredictorMatrix:
## MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd
## MSSubClass 0 1 1 1 1 1
## LotArea 1 0 1 1 1 1
## OverallQual 1 1 0 1 1 1
## OverallCond 1 1 1 0 1 1
## YearBuilt 1 1 1 1 0 1
## YearRemodAdd 1 1 1 1 1 0
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF
## MSSubClass 1 1 1 1 1 1
## LotArea 1 1 1 1 1 1
## OverallQual 1 1 1 1 1 1
## OverallCond 1 1 1 1 1 1
## YearBuilt 1 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1 1
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## MSSubClass 1 1 1 1 1
## LotArea 1 1 1 1 1
## OverallQual 1 1 1 1 1
## OverallCond 1 1 1 1 1
## YearBuilt 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1
## FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## MSSubClass 1 1 1 1 1
## LotArea 1 1 1 1 1
## OverallQual 1 1 1 1 1
## OverallCond 1 1 1 1 1
## YearBuilt 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1
## Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF
## MSSubClass 1 1 1 1 1
## LotArea 1 1 1 1 1
## OverallQual 1 1 1 1 1
## OverallCond 1 1 1 1 1
## YearBuilt 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal
## MSSubClass 1 1 1 1 1 1
## LotArea 1 1 1 1 1 1
## OverallQual 1 1 1 1 1 1
## OverallCond 1 1 1 1 1 1
## YearBuilt 1 1 1 1 1 1
## YearRemodAdd 1 1 1 1 1 1
## MoSold YrSold log_SalePrice
## MSSubClass 1 1 1
## LotArea 1 1 1
## OverallQual 1 1 1
## OverallCond 1 1 1
## YearBuilt 1 1 1
## YearRemodAdd 1 1 1
## Number of logged events: 816
## it im dep meth
## 1 1 1 MasVnrArea pmm
## 2 1 1 MasVnrArea pmm
## 3 1 1 GarageYrBlt pmm
## 4 1 1 GarageYrBlt pmm
## 5 1 2 MasVnrArea pmm
## 6 1 2 MasVnrArea pmm
## out
## 1 BsmtFinSF1, BsmtFullBath
## 2 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model
## 3 BsmtFinSF1, BsmtFullBath
## 4 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model
## 5 BsmtFinSF1, BsmtFullBath
## 6 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model
Rename the dataframe.
Train the model using lm.
Method: “stepAIC”, Numeric Variables
# Fit the full model using stepAIC
model1 <- lm(log_SalePrice ~., data = hpnum_train)
# Stepwise regression model
model1_step <- stepAIC(model1, direction = "both",
trace = FALSE)
summary(model1_step)
##
## Call:
## lm(formula = log_SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + HalfBath + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + WoodDeckSF + EnclosedPorch + X3SsnPorch + ScreenPorch +
## PoolArea + YrSold, data = hpnum_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92973 -0.06766 0.00243 0.07404 0.55321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.722e+01 5.869e+00 2.933 0.00341 **
## MSSubClass -6.174e-04 1.087e-04 -5.682 1.61e-08 ***
## LotArea 1.853e-06 4.229e-07 4.382 1.26e-05 ***
## OverallQual 8.434e-02 4.902e-03 17.205 < 2e-16 ***
## OverallCond 4.935e-02 4.258e-03 11.589 < 2e-16 ***
## YearBuilt 2.893e-03 2.534e-04 11.416 < 2e-16 ***
## YearRemodAdd 1.091e-03 2.717e-04 4.014 6.27e-05 ***
## BsmtFinSF1 8.116e-05 1.919e-05 4.228 2.51e-05 ***
## BsmtFinSF2 6.976e-05 2.947e-05 2.367 0.01804 *
## BsmtUnfSF 5.298e-05 1.752e-05 3.024 0.00254 **
## X1stFlrSF 1.964e-04 2.390e-05 8.216 4.67e-16 ***
## X2ndFlrSF 1.694e-04 2.035e-05 8.322 < 2e-16 ***
## LowQualFinSF 1.648e-04 8.281e-05 1.991 0.04671 *
## BsmtFullBath 5.979e-02 1.046e-02 5.717 1.31e-08 ***
## FullBath 3.772e-02 1.166e-02 3.236 0.00124 **
## HalfBath 2.089e-02 1.113e-02 1.876 0.06088 .
## KitchenAbvGr -5.147e-02 2.177e-02 -2.364 0.01822 *
## TotRmsAbvGrd 1.466e-02 4.628e-03 3.167 0.00157 **
## Fireplaces 4.566e-02 7.346e-03 6.216 6.68e-10 ***
## GarageCars 7.362e-02 7.101e-03 10.367 < 2e-16 ***
## WoodDeckSF 1.270e-04 3.340e-05 3.801 0.00015 ***
## EnclosedPorch 1.709e-04 7.070e-05 2.417 0.01579 *
## X3SsnPorch 2.272e-04 1.319e-04 1.723 0.08512 .
## ScreenPorch 3.642e-04 7.223e-05 5.043 5.18e-07 ***
## PoolArea -3.863e-04 9.888e-05 -3.907 9.77e-05 ***
## YrSold -7.223e-03 2.919e-03 -2.475 0.01344 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1464 on 1434 degrees of freedom
## Multiple R-squared: 0.868, Adjusted R-squared: 0.8657
## F-statistic: 377.2 on 25 and 1434 DF, p-value: < 2.2e-16
Method: “leapBackward”, Numeric Variables
set.seed(1977)
library(caret)
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
nv <- ncol(hpnum_train)-10
# Train the model
step.model <- train(log_SalePrice ~., data = hpnum_train,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 20:nv),
trControl = train.control
)
## Reordering variables and trying again:
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 20 0.1622507 0.8330429 0.10989897 0.02771617 0.05307579 0.012968060
## 2 21 0.1562831 0.8446512 0.10436740 0.02858296 0.05400624 0.007308561
## 3 22 0.1551216 0.8468942 0.10302751 0.02886732 0.05437192 0.006277085
## 4 23 0.1544610 0.8476513 0.10334411 0.02853211 0.05412154 0.007394441
## 5 24 0.1514962 0.8539444 0.09958089 0.02917332 0.05350025 0.005759485
## 6 25 0.1512278 0.8543889 0.09928569 0.02945537 0.05394382 0.005579948
## 7 26 0.1515411 0.8539754 0.09924665 0.02946447 0.05380142 0.004701382
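To see which subset size the cross-validation selected, and its coefficients, the usual caret and leaps accessors apply (a sketch mirroring the commented-out line near the end of this document); the parallel workers can also be released at this point:
# best subset size chosen by cross-validation and its coefficient estimates
step.model$bestTune
coef(step.model$finalModel, step.model$bestTune$nvmax)
# release the parallel workers started above
stopCluster(cluster)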
Let’s recode the string variables in the larger dataset as numeric dummy indicators and then drop the original character columns. Recoding will be accomplished with the fastDummies function dummy_cols.
library(fastDummies)
# create list of character variables
is_char <- unlist(colnames(hp_train[,sapply(hp_train, is.character)]))
# remove all spaces from character variables in dataframe
hpchar_train <- as.data.frame(apply(hp_train[,is_char],2,function(x)gsub('\\s+', '',x)),stringsAsFactors = FALSE)
# reassemble
hp_train <- data.frame(hpnum_train, hpchar_train)
# create dummy variables for each, eliminating the referent, defined as the most frequent level of each nominal variable
hp_train2 <- dummy_columns(hp_train, select_columns = is_char, remove_most_frequent_dummy = TRUE, ignore_na = FALSE)
names(hp_train2)[names(hp_train2)=="MSZoning_C(all)"] <- "MSZoning_CAll"
names(hp_train2)[names(hp_train2)=="RoofMatl_Tar&Grv"] <- "RoofMatl_TarGrv"
# remove the character variables from the dataset
hp_train_final <- hp_train2[,!sapply(hp_train2, is.character)]
Method: “bayesglm” with Recoded Variables
With all recoding done we’ll rerun the models to assess whether we can achieve a better fit. With the bayesglm method from caret:
set.seed(1981)
# parallel processing set again for clarity in code
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
# Set up repeated k-fold cross-validation
train.control <- trainControl(method = "cv", number = 5, savePred = TRUE, allowParallel = TRUE)
# Train the model
step.model2 <- train(log_SalePrice ~., data = hp_train_final,
method = "bayesglm",
trControl = train.control,
na.action = na.pass
)
step.model2$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 0.1471863 0.8506474 0.08454199 0.04800956 0.08227343 0.007384579
RMSE is smaller, and \(R^2\) is quite high; with this many variables we’ve probably overfit the model, but for RMSE we have an improvement over the model that included only the numeric variables.
Final Model
Method: “stepAIC” with Recoded Variables
hp_train_final[is.na(hp_train_final)] <- 0
set.seed(1980)
# Fit the full model using stepAIC
model3 <- lm(log_SalePrice ~., data = hp_train_final)
# Stepwise regression model
step.model3 <- stepAIC(model3, direction = "both",
trace = FALSE)
summary(step.model3)
##
## Call:
## lm(formula = log_SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## FullBath + HalfBath + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + GarageArea + WoodDeckSF + EnclosedPorch + X3SsnPorch +
## ScreenPorch + PoolArea + MSZoning_RM + MSZoning_CAll + Street_Grvl +
## LotShape_IR1 + LandContour_Bnk + LandContour_Low + Utilities_NoSeWa +
## LotConfig_Corner + LotConfig_CulDSac + LandSlope_Mod + LandSlope_Sev +
## Neighborhood_Veenker + Neighborhood_Crawfor + Neighborhood_NoRidge +
## Neighborhood_Mitchel + Neighborhood_Somerst + Neighborhood_BrkSide +
## Neighborhood_NridgHt + Neighborhood_MeadowV + Neighborhood_Edwards +
## Neighborhood_StoneBr + Neighborhood_ClearCr + Condition1_Feedr +
## Condition1_Artery + Condition1_RRAe + Condition1_RRAn + Condition2_PosN +
## Condition2_PosA + Condition2_RRAe + BldgType_2fmCon + BldgType_Twnhs +
## HouseStyle_1.5Fin + HouseStyle_SLvl + HouseStyle_2.5Unf +
## RoofStyle_Shed + RoofMatl_Metal + RoofMatl_Membran + RoofMatl_ClyTile +
## Exterior1st_MetalSd + Exterior1st_WdSdng + Exterior1st_BrkFace +
## Exterior1st_CemntBd + Exterior1st_BrkComm + Exterior2nd_HdBoard +
## Exterior2nd_WdSdng + Exterior2nd_CmentBd + Exterior2nd_BrkFace +
## Exterior2nd_AsbShng + ExterCond_Gd + ExterCond_Fa + Foundation_CBlock +
## Foundation_BrkTil + Foundation_Wood + Foundation_Slab + BsmtQual_Ex +
## BsmtCond_Po + BsmtExposure_Gd + BsmtExposure_Av + BsmtFinType1_GLQ +
## BsmtFinType2_BLQ + Heating_GasW + Heating_Grav + HeatingQC_Gd +
## HeatingQC_TA + HeatingQC_Fa + CentralAir_N + Electrical_Mix +
## KitchenQual_Ex + Functional_Min1 + Functional_Maj1 + Functional_Min2 +
## Functional_Mod + Functional_Maj2 + Functional_Sev + GarageQual_Fa +
## GarageQual_Ex + GarageCond_Fa + GarageCond_Po + GarageCond_Ex +
## SaleType_New + SaleType_ConLD + SaleType_CWD + SaleCondition_Abnorml +
## SaleCondition_Family + HouseStyle_2.5Fin, data = hp_train_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71268 -0.04811 0.00260 0.05437 0.71268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.873e+00 5.775e-01 10.170 < 2e-16 ***
## MSSubClass -3.750e-04 1.120e-04 -3.349 0.000835 ***
## LotArea 3.175e-06 4.246e-07 7.479 1.34e-13 ***
## OverallQual 4.444e-02 4.081e-03 10.890 < 2e-16 ***
## OverallCond 3.745e-02 3.543e-03 10.570 < 2e-16 ***
## YearBuilt 1.884e-03 2.530e-04 7.448 1.69e-13 ***
## YearRemodAdd 6.757e-04 2.158e-04 3.131 0.001777 **
## BsmtFinSF1 1.421e-04 1.734e-05 8.198 5.63e-16 ***
## BsmtFinSF2 1.220e-04 2.384e-05 5.117 3.56e-07 ***
## BsmtUnfSF 8.164e-05 1.575e-05 5.182 2.52e-07 ***
## X1stFlrSF 2.385e-04 1.976e-05 12.071 < 2e-16 ***
## X2ndFlrSF 2.343e-04 1.636e-05 14.325 < 2e-16 ***
## LowQualFinSF 1.964e-04 7.581e-05 2.590 0.009688 **
## BsmtFullBath 2.416e-02 7.808e-03 3.095 0.002011 **
## FullBath 1.664e-02 8.772e-03 1.897 0.058075 .
## HalfBath 2.109e-02 8.420e-03 2.505 0.012355 *
## KitchenAbvGr -5.672e-02 1.705e-02 -3.328 0.000899 ***
## TotRmsAbvGrd 7.031e-03 3.510e-03 2.003 0.045402 *
## Fireplaces 2.453e-02 5.494e-03 4.466 8.65e-06 ***
## GarageCars 2.637e-02 9.030e-03 2.920 0.003553 **
## GarageArea 1.126e-04 3.098e-05 3.633 0.000291 ***
## WoodDeckSF 9.546e-05 2.452e-05 3.893 0.000104 ***
## EnclosedPorch 1.307e-04 5.189e-05 2.518 0.011904 *
## X3SsnPorch 1.546e-04 9.517e-05 1.624 0.104503
## ScreenPorch 2.394e-04 5.306e-05 4.512 6.97e-06 ***
## PoolArea 1.461e-04 7.390e-05 1.977 0.048285 *
## MSZoning_RM -5.718e-02 1.035e-02 -5.524 3.96e-08 ***
## MSZoning_CAll -4.080e-01 3.729e-02 -10.943 < 2e-16 ***
## Street_Grvl -1.065e-01 4.668e-02 -2.281 0.022730 *
## LotShape_IR1 -9.330e-03 6.457e-03 -1.445 0.148693
## LandContour_Bnk -2.204e-02 1.492e-02 -1.477 0.139785
## LandContour_Low -4.567e-02 2.306e-02 -1.980 0.047864 *
## Utilities_NoSeWa -1.448e-01 1.062e-01 -1.364 0.172938
## LotConfig_Corner 1.875e-02 7.426e-03 2.525 0.011687 *
## LotConfig_CulDSac 4.482e-02 1.223e-02 3.666 0.000256 ***
## LandSlope_Mod 3.511e-02 1.620e-02 2.168 0.030368 *
## LandSlope_Sev -2.161e-01 4.641e-02 -4.656 3.54e-06 ***
## Neighborhood_Veenker 6.434e-02 3.291e-02 1.955 0.050830 .
## Neighborhood_Crawfor 1.342e-01 1.706e-02 7.864 7.55e-15 ***
## Neighborhood_NoRidge 4.863e-02 1.892e-02 2.570 0.010264 *
## Neighborhood_Mitchel -3.592e-02 1.606e-02 -2.236 0.025515 *
## Neighborhood_Somerst 5.685e-02 1.395e-02 4.074 4.90e-05 ***
## Neighborhood_BrkSide 4.362e-02 1.587e-02 2.748 0.006072 **
## Neighborhood_NridgHt 8.866e-02 1.580e-02 5.613 2.41e-08 ***
## Neighborhood_MeadowV -1.237e-01 3.344e-02 -3.701 0.000224 ***
## Neighborhood_Edwards -5.303e-02 1.209e-02 -4.385 1.25e-05 ***
## Neighborhood_StoneBr 1.346e-01 2.332e-02 5.772 9.69e-09 ***
## Neighborhood_ClearCr 4.990e-02 2.380e-02 2.097 0.036162 *
## Condition1_Feedr -5.572e-02 1.259e-02 -4.424 1.05e-05 ***
## Condition1_Artery -9.888e-02 1.689e-02 -5.853 6.04e-09 ***
## Condition1_RRAe -1.237e-01 3.190e-02 -3.879 0.000110 ***
## Condition1_RRAn -3.194e-02 2.113e-02 -1.512 0.130848
## Condition2_PosN -8.820e-01 7.785e-02 -11.329 < 2e-16 ***
## Condition2_PosA 2.606e-01 1.132e-01 2.302 0.021463 *
## Condition2_RRAe -5.176e-01 1.572e-01 -3.293 0.001016 **
## BldgType_2fmCon 4.499e-02 2.569e-02 1.751 0.080157 .
## BldgType_Twnhs -5.161e-02 1.936e-02 -2.666 0.007766 **
## HouseStyle_1.5Fin 2.314e-02 1.096e-02 2.112 0.034906 *
## HouseStyle_SLvl 2.600e-02 1.519e-02 1.711 0.087315 .
## HouseStyle_2.5Unf 5.335e-02 3.583e-02 1.489 0.136671
## RoofStyle_Shed 3.578e-01 1.140e-01 3.139 0.001734 **
## RoofMatl_Metal 2.604e-01 1.158e-01 2.249 0.024696 *
## RoofMatl_Membran 4.159e-01 1.181e-01 3.522 0.000443 ***
## RoofMatl_ClyTile -2.622e+00 1.302e-01 -20.141 < 2e-16 ***
## Exterior1st_MetalSd 1.246e-02 9.107e-03 1.368 0.171489
## Exterior1st_WdSdng -4.830e-02 1.747e-02 -2.765 0.005771 **
## Exterior1st_BrkFace 7.566e-02 2.300e-02 3.290 0.001029 **
## Exterior1st_CemntBd -1.061e-01 6.389e-02 -1.661 0.097019 .
## Exterior1st_BrkComm -1.707e-01 7.910e-02 -2.158 0.031115 *
## Exterior2nd_HdBoard -1.760e-02 9.020e-03 -1.951 0.051247 .
## Exterior2nd_WdSdng 3.752e-02 1.744e-02 2.151 0.031675 *
## Exterior2nd_CmentBd 1.211e-01 6.425e-02 1.885 0.059578 .
## Exterior2nd_BrkFace -4.244e-02 3.051e-02 -1.391 0.164355
## Exterior2nd_AsbShng -3.821e-02 2.544e-02 -1.502 0.133352
## ExterCond_Gd -2.200e-02 9.914e-03 -2.220 0.026618 *
## ExterCond_Fa -3.882e-02 2.325e-02 -1.670 0.095176 .
## Foundation_CBlock -2.692e-02 9.091e-03 -2.961 0.003117 **
## Foundation_BrkTil -4.136e-02 1.424e-02 -2.903 0.003751 **
## Foundation_Wood -1.646e-01 6.161e-02 -2.672 0.007640 **
## Foundation_Slab -4.804e-02 2.892e-02 -1.661 0.096955 .
## BsmtQual_Ex 3.266e-02 1.355e-02 2.410 0.016081 *
## BsmtCond_Po 3.123e-01 1.226e-01 2.546 0.010996 *
## BsmtExposure_Gd 4.851e-02 1.192e-02 4.069 5.00e-05 ***
## BsmtExposure_Av 1.564e-02 8.681e-03 1.801 0.071907 .
## BsmtFinType1_GLQ 1.964e-02 8.887e-03 2.210 0.027279 *
## BsmtFinType2_BLQ -4.887e-02 1.939e-02 -2.521 0.011832 *
## Heating_GasW 7.139e-02 2.796e-02 2.553 0.010793 *
## Heating_Grav -1.854e-01 4.648e-02 -3.989 7.00e-05 ***
## HeatingQC_Gd -2.281e-02 8.670e-03 -2.631 0.008609 **
## HeatingQC_TA -3.738e-02 8.347e-03 -4.478 8.16e-06 ***
## HeatingQC_Fa -2.584e-02 1.849e-02 -1.397 0.162579
## CentralAir_N -5.592e-02 1.484e-02 -3.769 0.000171 ***
## Electrical_Mix -3.383e-01 1.745e-01 -1.938 0.052818 .
## KitchenQual_Ex 5.188e-02 1.401e-02 3.704 0.000221 ***
## Functional_Min1 -3.875e-02 1.977e-02 -1.960 0.050189 .
## Functional_Maj1 -7.187e-02 3.094e-02 -2.323 0.020324 *
## Functional_Min2 -3.607e-02 1.932e-02 -1.867 0.062174 .
## Functional_Mod -1.386e-01 3.014e-02 -4.597 4.68e-06 ***
## Functional_Maj2 -3.153e-01 5.348e-02 -5.895 4.72e-09 ***
## Functional_Sev -3.562e-01 1.051e-01 -3.391 0.000716 ***
## GarageQual_Fa -3.858e-02 1.919e-02 -2.011 0.044516 *
## GarageQual_Ex 3.888e-01 1.166e-01 3.334 0.000878 ***
## GarageCond_Fa -3.157e-02 2.125e-02 -1.486 0.137526
## GarageCond_Po 1.203e-01 4.819e-02 2.496 0.012695 *
## GarageCond_Ex -3.474e-01 1.382e-01 -2.515 0.012032 *
## SaleType_New 4.990e-02 1.187e-02 4.202 2.82e-05 ***
## SaleType_ConLD 1.326e-01 3.718e-02 3.567 0.000373 ***
## SaleType_CWD 8.259e-02 5.339e-02 1.547 0.122145
## SaleCondition_Abnorml -6.188e-02 1.147e-02 -5.395 8.07e-08 ***
## SaleCondition_Family -5.699e-02 2.389e-02 -2.386 0.017172 *
## HouseStyle_2.5Fin -7.336e-02 5.039e-02 -1.456 0.145691
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1022 on 1349 degrees of freedom
## Multiple R-squared: 0.9395, Adjusted R-squared: 0.9346
## F-statistic: 190.5 on 110 and 1349 DF, p-value: < 2.2e-16
This is the best model by far, though it includes several nonsignificant coefficients from the recoded string variables. \(R^2\) of roughly 94% indicates a strong fit, and the residuals are roughly symmetric. The overall F-test is highly significant, though the large number of predictors leaves fewer residual degrees of freedom.
The goal was to allow R to select the model programmatically. Let’s test these two models with the test set.
# do not allow strings to be imported as factors
hp_test <- read.csv('https://raw.githubusercontent.com/sigmasigmaiota/DATA605/master/test.csv', stringsAsFactors = FALSE)
# create data frame with numeric variables
nums2 <- unlist(lapply(hp_test, is.numeric))
# replace NAs with 0s
hp_test[is.na(hp_test)] <- 0
hpnum_test <- hp_test[,nums2]
# create list of character variables
is_char2 <- colnames(hp_test[,sapply(hp_test, is.character)])
# remove all spaces from character variables in dataframe
hpchar_test <- as.data.frame(apply(hp_test[,is_char2],2,function(x)gsub('\\s+', '',x)),stringsAsFactors = FALSE)
# reassemble
hp_test <- data.frame(hpnum_test, hpchar_test)
# create dummy variables for each, eliminating the referant, defined as the most frequent of each nominal level
hp_test2 <- dummy_columns(hp_test, select_columns = is_char2, ignore_na = FALSE)
names(hp_test2)[names(hp_test2)=="MSZoning_C(all)"] <- "MSZoning_CAll"
names(hp_test2)[names(hp_test2)=="RoofMatl_Tar&Grv"] <- "RoofMatl_TarGrv"
# some variable levels found in the training set are not present in the testing set. In order to proceed, we must include these variables and set value to zero.
hp_test2$Condition2_RRAe <- 0
hp_test2$Condition2_RRAn <- 0
hp_test2$HouseStyle_2.5Fin <- 0
hp_test2$RoofMatl_Metal <- 0
hp_test2$RoofMatl_Membran <- 0
hp_test2$RoofMatl_ClyTile <- 0
hp_test2$Electrical_Mix <- 0
hp_test2$GarageQual_Ex <- 0
#hp_test2$PoolQC_Fa <- 0
hp_test2$Utilities_NoSeWa <- 0
hp_test2$Condition2_RRNn <- 0
hp_test2$RoofMatl_TarGrv <- 0
hp_test2$RoofMatl_Roll <- 0
hp_test2$Exterior1st_Stone <- 0
hp_test2$Exterior2nd_Other <- 0
hp_test2$Heating_OthW <- 0
hp_test2$Heating_Floor <- 0
hp_test2$Electrical_0 <- 0
#hp_test2$MiscFeature_TenC <- 0
hp_test2$Exterior1st_ImStucc <- 0
# remove all variables not found in the model
#keep_test1 <- names(coef(step.model2$finalModel, step.model2$bestTune$nvmax))[-1]
keep_test1 <- names(coef(step.model2$finalModel))[-1]
keep_test2 <- names(coef(step.model3))[-1]
# predict with the model
prd1 <- predict(step.model2, hp_test2)
prd2 <- predict(step.model3, hp_test2)
We’ll bind the predictions with the Id variable from the test set.
# since we are predicting log1p-transformed data, the results must be back-transformed (expm1 inverts log1p)
sol1 <- data.frame(cbind(hp_test2$Id,expm1(prd1)))
colnames(sol1) <- c("Id","SalePrice")
# formerly 4
write.csv(sol1, "C:/Users/steph/Documents/KaggleR/SJPredictions_hp4.csv", row.names = FALSE)
# since we are predicting log1p-transformed data, the results must be back-transformed (expm1 inverts log1p)
sol2 <- data.frame(cbind(hp_test2$Id,expm1(prd2)))
colnames(sol2) <- c("Id","SalePrice")
# formerly 3
write.csv(sol2, "C:/Users/steph/Documents/KaggleR/SJPredictions_hp5.csv", row.names = FALSE)
Kaggle username: stephensjones; Display name: Stephen S Jones
My first submission score was 0.14179, number 3009 on the leaderboard. A second attempt yielded a score of 0.14343; wrong direction! Third attempt: 0.18616; wrong direction. Fourth (and final) attempt: 0.13288, 2317 on the leaderboard.
The stepAIC model with lm yielded the best results, though caret was much faster, with its parallel processing options improving efficiency. Imputation among the numeric variables decreased residuals slightly.
The final regression model is included above.