Problem 1

Generate Distributions

Probability Density 1: X~Gamma. Using R, generate a random variable \(X\) that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by \(n\) (a size parameter) and \(\lambda\) (a shape parameter). Choose any \(n\) greater than 3 and an expected value (\(\lambda\)) between 2 and 10.

set.seed(123)
lambda <- 2
n <- 8

X <- rgamma(10000, n, lambda)

hist(X, main = "Randomely Generated Gamma Distribution")

Probability Density 2: Y~Sum of Exponentials. Generate 10,000 observations from the sum of \(n\) exponential pdfs with a rate/shape parameter (\(\lambda\)). The \(n\) and \(\lambda\) must be the same as in the previous case.

set.seed(123)
Y <- rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda)

hist(Y)

Probability Density 3: Z~Exponential. Generate 10,000 observations from a single exponential pdf with rate/shape parameter (\(\lambda\)).

set.seed(123)
Z <- rexp(10000, lambda)

hist(Z, main = "Randomely Generated Exponential Distribution")

Empirical and Calculated Means and Variance

1a. Calculate the empirical expected value (means) and variances of all three pdfs.

mean(X)
## [1] 3.974504
var(X)
## [1] 1.968511
mean(Y)
## [1] 3.995146
var(Y)
## [1] 1.986324
mean(Z)
## [1] 0.5018906
var(Z)
## [1] 0.2499411

1b. Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponential, calculate the expected value of the single exponential (Z) and the sum of exponential (Y).

Gamma Distribution:

\(E(x) = {n \over \lambda}\) and \(\sigma^2 = {n \over \lambda^2}\).

n/lambda
## [1] 4
n/lambda^2
## [1] 2

Single Exponential:

\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases} \]

Moment Generating Function:

\(M_{x}(t) = E(e^{xt}) = \int_{-\infty}^{+\infty} e^{xt} f(x) dx = \int_{-\infty}^{0} e^{xt} 0 dx + \int_{0}^{+\infty} e^{xt} \lambda e^{-\lambda x} dx\)

\(\hspace{28pt} = \lambda \int_{0}^{\infty} e^{(t - \lambda)x} dx\)

\(\hspace{28pt} = {\lambda \over t - \lambda} [e^{(t - \lambda)x}]_{0}^{\infty}\)

\(\hspace{28pt} = {\lambda \over t - \lambda} [0-1]\)

\(\hspace{28pt} = \boxed {{\lambda \over \lambda - t}}\)

Expected Value and Variance:

\(M_{x}^{'}(t) = \lambda (\lambda - t)^{-2}\)

\(M_{x}^{'}(0) = \lambda (\lambda - 0)^{-2}\)

\(\hspace{28pt} = \lambda (\lambda)^{-2}\)

\(\hspace{28pt} = \boxed {\lambda^{-1} = E(X)}\)

\(M_{x}^{''}(t) = 2 \lambda (\lambda - t)^{-3}\)

\(M_{x}^{''}(0) = 2 \lambda (\lambda - 0)^{-3}\)

\(\hspace{28pt} = 2 \lambda (\lambda)^{-3}\)

\(\hspace{28pt} = 2\lambda^{-2} = E(X^2)\)

\(\sigma^{2} = 2\lambda^{-2} - (\lambda^{-1})^2\)

\(\hspace{13pt} = {2 \over \lambda^2} - {1 \over \lambda^2}\)

\(\hspace{13pt} = {1 \over \lambda^2}\)

\(\hspace{13pt} = \boxed {\lambda^{-2} = \sigma^{2}}\)

# single exponential
lambda**(-1)
## [1] 0.5
lambda**(-2)
## [1] 0.25

Sum of Exponentials

For the moment generating function of the sum of \(n\) exponentials, multiply the moment generating function of the single exponential \(n\) times.

Moment Generating Function:

\(M_{x}(t) = ({\lambda \over \lambda - t})^{8}\)

Expected Value and Variance:

\(M_{x}'(t) = 8({\lambda \over \lambda - t})^{7} \cdot \lambda (\lambda - t)^{-2}\)

\(M_{x}'(0) = 8({\lambda \over \lambda - 0})^{7} \cdot \lambda (\lambda - 0)^{-2}\)

\(\hspace{28pt} = \boxed{8 \lambda^{-1} = E(x)}\)

\(M_{x}''(t) = [8({\lambda \over \lambda - t})^{7}][{2 \over \lambda^2}] + [\lambda (\lambda - t)^{-2}][56({\lambda \over \lambda - t})^{6} \cdot \lambda (\lambda - t)^{-2}]\)

\(M_{x}''(0) = [8({\lambda \over \lambda - 0})^{7}][{2 \over \lambda^2}] + [\lambda (\lambda - 0)^{-2}][56({\lambda \over \lambda - 0})^{8} \cdot \lambda (\lambda - 0)^{-2}]\)

\(\hspace{28pt} = {16 \over \lambda^2} + {56 \over \lambda^2}\)

\(\hspace{28pt} = 72 \lambda^{-2} = E(X^2)\)

\(\sigma^2 = 72 \lambda^{-2} - (8 \lambda^{-1})^2\)

\(\hspace{13pt} = \boxed{8 \lambda^{-2} = \sigma^2}\)

n*(lambda**-1)
## [1] 4
n*(lambda**-2)
## [1] 2

Probability

1c-e. Probability. For pdf Z (exponential), calculate empirically probabilities a-c. Then evaluate through calculus whether the memoryless property holds.

\[ P(A|B) = {P(A \hspace{2pt} \cap \hspace{2pt} B) \over P(B)} = {P(B|A)P(A) \over P(B)} \]

a. \(P(Z > \lambda | Z > {\lambda \over 2})\)

\(P(Z > \lambda | Z > {\lambda \over 2}) = {P(Z > {\lambda \over 2} | Z > \lambda)P(Z > \lambda) \over P(Z > {\lambda \over 2})}\)

\(P(Z > {\lambda \over 2} \hspace{2pt} | \hspace{2pt} Z > \lambda) = 1\)

\(P(Z > \lambda | Z > {\lambda \over 2}) = {P(Z > \lambda) \over P(Z > {\lambda \over 2})}\)

Rather than calculate for \(P(Z > \lambda)\) and \(P(Z > {\lambda \over 2})\), calculate for \(1 - P(Z \le \lambda)\) and \(1 - P(Z \le {\lambda \over 2})\)

I also calculated each of these based on the definition of the exponential distribution.

p_a <- length(which(Z>lambda)) / length(Z)
p_b <- length(which(Z>lambda/2)) / length(Z)

p_a / p_b
## [1] 0.1300813
(1-pexp(lambda, lambda)) / (1-pexp(lambda/2, lambda))
## [1] 0.1353353

b. \(P(Z > 2 \lambda | Z > \lambda)\)

\(P(Z > 2 \lambda | Z > \lambda) = {P(Z > \lambda | Z > 2 \lambda)P(Z > 2 \lambda) \over P(Z > \lambda)}\)

\(P(Z > \lambda \hspace{2pt} | \hspace{2pt} Z > 2 \lambda) = 1\)

\(P(Z > 2 \lambda | Z > \lambda) = {P(Z > 2 \lambda) \over P(Z > \lambda)} = {1 - P(Z \le 2 \lambda) \over 1 - P(Z \le \lambda)}\)

p_a <- length(which(Z>2*lambda)) / length(Z)
p_b <- length(which(Z>lambda)) / length(Z)

p_a / p_b
## [1] 0.01704545
(1-pexp(2*lambda, lambda)) / (1-pexp(lambda, lambda))
## [1] 0.01831564

c. \(P(Z > 3 \lambda | Z > \lambda)\)

\(P(Z > 3 \lambda | Z > \lambda) = {P(Z > 3 \lambda) \over P(Z > \lambda)} = {1 - P(Z \le 3 \lambda) \over 1 - P(Z \le \lambda)}\)

p_a <- length(which(Z>3*lambda)) / length(Z)
p_b <- length(which(Z>lambda)) / length(Z)

p_a / p_b
## [1] 0
(1-pexp(3*lambda, lambda)) / (1-pexp(lambda, lambda))
## [1] 0.0003354626

Memoryless Property:

The memoryless property says that \(P(X > r + t \hspace{2pt} | \hspace{2pt} X > r) = P(X > t)\).

Let’s use a randomly defined \(t = 3\). \(r\) is the inverse of the rate, so for this problem \(r = {1 \over \lambda}\).

\(P(X > n + t \hspace{2pt} | \hspace{2pt} X > n)\)

t <- 3

(1-pexp((1/lambda)+t, lambda)) / (1-pexp((1/lambda), lambda))
## [1] 0.002478752

\(P(X > t)\)

(1-pexp(t, lambda))
## [1] 0.002478752

The probabilities are the same, so the memoryless property holds.

Independence Tests

Loosely investigate whether \(P(YZ) = P(Y)P(Z)\) by building a table with quartiles and evaluating the marginal and joint probabilities.

# find quantiles
quantile_Y <- matrix(c(quantile(Y, probs = c(0, .25, .5, .75, 1))))
quantile_Z <- matrix(c(quantile(Z, probs = c(0, .25, .5, .75, 1))))

cont_table <- table(cut(Y, quantile_Y), cut(Z, quantile_Z))

joint_prob <- prop.table(cont_table)

row_sums <- rowSums(joint_prob)
col_sums <- c(colSums(joint_prob), 0)

joint_prob_marg <- cbind(joint_prob, row_sums)
joint_prob_marg <- rbind(joint_prob_marg, col_sums)
joint_prob_marg[5,5] <- sum(joint_prob)

colnames(joint_prob_marg) <- c("Quant 1 Y", "Quant 2 Y", "Quant 3 Y", "Quant 4 Y", "Sum")
rownames(joint_prob_marg) <- c("Quant 1 Z", "Quant 2 Z", "Quant 3 Z", "Quant 4 Z", "Sum")

joint_prob_marg
##            Quant 1 Y  Quant 2 Y Quant 3 Y  Quant 4 Y     Sum
## Quant 1 Z 0.09231846 0.07751550 0.0550110 0.02500500 0.24985
## Quant 2 Z 0.06551310 0.06691338 0.0690138 0.04860972 0.25005
## Quant 3 Z 0.05031006 0.05771154 0.0675135 0.07451490 0.25005
## Quant 4 Z 0.04180836 0.04780956 0.0585117 0.10192038 0.25005
## Sum       0.24994999 0.24994999 0.2500500 0.25005001 1.00000
matrix(joint_prob_marg[1:4,5]) %*% t(matrix(joint_prob_marg[5,1:4]))
##         [,1]    [,2]       [,3]       [,4]
## [1,] 0.06245 0.06245 0.06247499 0.06247499
## [2,] 0.06250 0.06250 0.06252501 0.06252501
## [3,] 0.06250 0.06250 0.06252501 0.06252501
## [4,] 0.06250 0.06250 0.06252501 0.06252501

We can see that \(P(YZ) \ne P(Y)P(Z)\) which suggest dependence for Y and Z.

Check to see if independence holds by using Fisher’s Exact Test and Chi Square Test. What is the difference between the two? Which is more appropriate?

#fisher
fisher.test(cont_table, simulate.p.value = T)
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  cont_table
## p-value = 0.0004998
## alternative hypothesis: two.sided
# chi square
chisq.test(cont_table)
## 
##  Pearson's Chi-squared test
## 
## data:  cont_table
## X-squared = 863.19, df = 9, p-value < 2.2e-16

Based on the p-value of both of these tests, we can see that there is dependence.

Fisher test performs better for smaller samples while Chi Square is better used for larger samples and a larger contingency table. Chi Square would be more appropriate here as it is a larger contingency table and a large sample size.

Problem 2

Data taken from https://www.kaggle.com/c/house-prices-advanced-regression-techniques.

train_data <- read.csv(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_605/FinalProj/train.csv"))

Let’s first split the training data into training and validating data sets.

# 80% train data
set.seed(123)
num_fit <- .8 * nrow(train_data)

sample <- sample(1:nrow(train_data), num_fit, replace = F)

train <- train_data[sample,]
validate <- train_data[-sample,]

rownames(train) <- train$Id

train <- train |>
  dplyr::select(-Id)

rmarkdown::paged_table(train, options = list(rows.print = 5))

Descriptive and Inferential Statistics

Provide univariate descriptive statics and appropriate plots for the training data set.

glimpse(train)
## Rows: 1,168
## Columns: 80
## $ MSSubClass    <int> 60, 20, 20, 20, 20, 60, 60, 60, 20, 20, 60, 70, 20, 50, …
## $ MSZoning      <chr> "RL", "RL", "RL", "FV", "RL", "RL", "RL", "RL", "RL", "R…
## $ LotFrontage   <int> 59, 60, 63, 62, 60, 75, NA, 107, 62, 89, NA, 51, 49, 50,…
## $ LotArea       <int> 11228, 8281, 17423, 7500, 7180, 9675, 10304, 10186, 9858…
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ Alley         <chr> NA, NA, NA, "Pave", NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ LotShape      <chr> "IR2", "IR1", "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", …
## $ LandContour   <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", …
## $ Utilities     <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu…
## $ LotConfig     <chr> "CulDSac", "Inside", "CulDSac", "Inside", "Inside", "Ins…
## $ LandSlope     <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", …
## $ Neighborhood  <chr> "SawyerW", "Sawyer", "StoneBr", "Somerst", "CollgCr", "S…
## $ Condition1    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "PosN", …
## $ Condition2    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
## $ BldgType      <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", …
## $ HouseStyle    <chr> "2Story", "1Story", "1Story", "1Story", "1Story", "2Stor…
## $ OverallQual   <int> 7, 5, 9, 7, 5, 7, 5, 7, 5, 9, 8, 5, 8, 6, 7, 5, 5, 5, 6,…
## $ OverallCond   <int> 5, 5, 5, 5, 7, 5, 7, 5, 6, 5, 5, 8, 5, 6, 5, 5, 7, 8, 5,…
## $ YearBuilt     <int> 1993, 1965, 2008, 2005, 1972, 2005, 1976, 1992, 1968, 20…
## $ YearRemodAdd  <int> 1993, 1965, 2009, 2005, 1972, 2005, 1976, 1992, 1968, 20…
## $ RoofStyle     <chr> "Gable", "Gable", "Hip", "Gable", "Hip", "Gable", "Gable…
## $ RoofMatl      <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "…
## $ Exterior1st   <chr> "VinylSd", "MetalSd", "VinylSd", "VinylSd", "HdBoard", "…
## $ Exterior2nd   <chr> "VinylSd", "MetalSd", "VinylSd", "VinylSd", "HdBoard", "…
## $ MasVnrType    <chr> "None", "None", "Stone", "None", "None", "None", "BrkFac…
## $ MasVnrArea    <int> 0, 0, 748, 0, 0, 0, 44, 0, 0, 0, 396, 0, 0, 0, 0, 0, 0, …
## $ ExterQual     <chr> "Gd", "TA", "Ex", "Gd", "TA", "Gd", "TA", "Gd", "TA", "E…
## $ ExterCond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "Gd", "TA", "TA", "T…
## $ Foundation    <chr> "PConc", "CBlock", "PConc", "PConc", "CBlock", "PConc", …
## $ BsmtQual      <chr> "Gd", "TA", "Ex", "Gd", "TA", "Gd", "TA", "Gd", "TA", "E…
## $ BsmtCond      <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ BsmtExposure  <chr> "No", "No", "No", "No", "Av", "Mn", "No", "No", "No", "G…
## $ BsmtFinType1  <chr> "BLQ", "Rec", "GLQ", "Unf", "ALQ", "GLQ", "ALQ", "GLQ", …
## $ BsmtFinSF1    <int> 50, 553, 1904, 0, 390, 341, 381, 674, 510, 0, 0, 0, 1721…
## $ BsmtFinType2  <chr> "GLQ", "BLQ", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", …
## $ BsmtFinSF2    <int> 531, 311, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BsmtUnfSF     <int> 499, 0, 312, 1257, 474, 772, 399, 76, 354, 2002, 1055, 9…
## $ TotalBsmtSF   <int> 1080, 864, 2216, 1257, 864, 1113, 780, 750, 864, 2002, 1…
## $ Heating       <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
## $ HeatingQC     <chr> "Ex", "Gd", "Ex", "Ex", "TA", "Ex", "Ex", "Ex", "TA", "E…
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ Electrical    <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S…
## $ X1stFlrSF     <int> 1080, 864, 2234, 1266, 864, 1113, 1088, 1061, 874, 2018,…
## $ X2ndFlrSF     <int> 1017, 0, 0, 0, 0, 858, 780, 862, 0, 0, 1208, 574, 0, 595…
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GrLivArea     <int> 2097, 864, 2234, 1266, 864, 1971, 1868, 1923, 874, 2018,…
## $ BsmtFullBath  <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ BsmtHalfBath  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ FullBath      <int> 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2,…
## $ HalfBath      <int> 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,…
## $ BedroomAbvGr  <int> 3, 3, 1, 3, 3, 3, 4, 3, 3, 3, 3, 4, 1, 3, 3, 4, 2, 2, 3,…
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ KitchenQual   <chr> "Gd", "TA", "Ex", "Gd", "TA", "Gd", "Gd", "Gd", "TA", "E…
## $ TotRmsAbvGrd  <int> 9, 5, 9, 6, 5, 8, 9, 8, 5, 10, 7, 8, 8, 6, 8, 7, 4, 5, 6…
## $ Functional    <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", …
## $ Fireplaces    <int> 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 2,…
## $ FireplaceQu   <chr> "TA", "Po", "Gd", "TA", NA, "Gd", "TA", "TA", NA, "Gd", …
## $ GarageType    <chr> "Attchd", "Detchd", "Attchd", "Attchd", "Detchd", "Attch…
## $ GarageYrBlt   <int> 1993, 1965, 2009, 2005, 1989, 2005, 1976, 1992, 1968, 20…
## $ GarageFinish  <chr> "Unf", "Unf", "Fin", "Unf", "Unf", "RFn", "Unf", "RFn", …
## $ GarageCars    <int> 3, 1, 3, 2, 1, 2, 2, 2, 1, 3, 2, 1, 3, 1, 2, 1, 1, 1, 2,…
## $ GarageArea    <int> 678, 360, 1166, 453, 352, 689, 484, 564, 288, 746, 905, …
## $ GarageQual    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ GarageCond    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ PavedDrive    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ WoodDeckSF    <int> 196, 0, 0, 38, 0, 0, 448, 240, 33, 144, 0, 24, 192, 0, 1…
## $ OpenPorchSF   <int> 187, 0, 60, 144, 0, 48, 96, 39, 0, 76, 45, 0, 267, 162, …
## $ EnclosedPorch <int> 0, 236, 0, 0, 0, 0, 0, 0, 0, 0, 0, 150, 0, 0, 0, 108, 0,…
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 189, 0, 0, 126, 0, 0, 0, 0…
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PoolQC        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Fence         <chr> NA, "GdWo", NA, NA, NA, NA, NA, NA, "GdWo", NA, NA, NA, …
## $ MiscFeature   <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Shed", NA, NA, NA, NA, …
## $ MiscVal       <int> 0, 0, 0, 0, 0, 0, 0, 0, 600, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ MoSold        <int> 12, 12, 7, 4, 5, 2, 10, 6, 11, 5, 9, 5, 1, 12, 7, 8, 8, …
## $ YrSold        <int> 2008, 2009, 2009, 2006, 2008, 2009, 2009, 2010, 2009, 20…
## $ SaleType      <chr> "WD", "WD", "New", "WD", "WD", "WD", "WD", "WD", "WD", "…
## $ SaleCondition <chr> "Normal", "Normal", "Partial", "Normal", "Normal", "Norm…
## $ SalePrice     <int> 228000, 62383, 501837, 176000, 127000, 253000, 197500, 1…

Most of the variable are categorical.

Let’s take a look at the distributions for the numeric columns first.

numeric_train <- train[,sapply(train, is.numeric)]

# descriptive statistics
summary(numeric_train)
##    MSSubClass      LotFrontage        LotArea        OverallQual    
##  Min.   : 20.00   Min.   : 21.00   Min.   :  1300   Min.   : 1.000  
##  1st Qu.: 20.00   1st Qu.: 59.25   1st Qu.:  7552   1st Qu.: 5.000  
##  Median : 50.00   Median : 70.00   Median :  9504   Median : 6.000  
##  Mean   : 56.47   Mean   : 70.37   Mean   : 10385   Mean   : 6.104  
##  3rd Qu.: 70.00   3rd Qu.: 80.00   3rd Qu.: 11604   3rd Qu.: 7.000  
##  Max.   :190.00   Max.   :313.00   Max.   :164660   Max.   :10.000  
##                   NA's   :214                                       
##   OverallCond      YearBuilt     YearRemodAdd    MasVnrArea    
##  Min.   :1.000   Min.   :1872   Min.   :1950   Min.   :   0.0  
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   1st Qu.:   0.0  
##  Median :5.000   Median :1973   Median :1993   Median :   0.0  
##  Mean   :5.555   Mean   :1972   Mean   :1985   Mean   : 103.3  
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   3rd Qu.: 168.0  
##  Max.   :9.000   Max.   :2010   Max.   :2010   Max.   :1600.0  
##                                                NA's   :7       
##    BsmtFinSF1       BsmtFinSF2        BsmtUnfSF       TotalBsmtSF  
##  Min.   :   0.0   Min.   :   0.00   Min.   :   0.0   Min.   :   0  
##  1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 230.0   1st Qu.: 799  
##  Median : 385.5   Median :   0.00   Median : 477.5   Median :1004  
##  Mean   : 448.3   Mean   :  48.22   Mean   : 569.1   Mean   :1066  
##  3rd Qu.: 719.0   3rd Qu.:   0.00   3rd Qu.: 810.2   3rd Qu.:1312  
##  Max.   :5644.0   Max.   :1474.00   Max.   :2336.0   Max.   :6110  
##                                                                    
##    X1stFlrSF        X2ndFlrSF       LowQualFinSF       GrLivArea   
##  Min.   : 334.0   Min.   :   0.0   Min.   :  0.000   Min.   : 334  
##  1st Qu.: 887.5   1st Qu.:   0.0   1st Qu.:  0.000   1st Qu.:1144  
##  Median :1094.5   Median :   0.0   Median :  0.000   Median :1456  
##  Mean   :1168.6   Mean   : 344.8   Mean   :  6.235   Mean   :1520  
##  3rd Qu.:1402.8   3rd Qu.: 728.2   3rd Qu.:  0.000   3rd Qu.:1779  
##  Max.   :4692.0   Max.   :2065.0   Max.   :572.000   Max.   :5642  
##                                                                    
##   BsmtFullBath     BsmtHalfBath        FullBath        HalfBath     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :2.000   Median :0.0000  
##  Mean   :0.4238   Mean   :0.05308   Mean   :1.564   Mean   :0.3818  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :2.0000   Max.   :2.00000   Max.   :3.000   Max.   :2.0000  
##                                                                     
##   BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd     Fireplaces    
##  Min.   :0.000   Min.   :0.000   Min.   : 2.00   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.00   1st Qu.:0.0000  
##  Median :3.000   Median :1.000   Median : 6.00   Median :1.0000  
##  Mean   :2.882   Mean   :1.051   Mean   : 6.54   Mean   :0.6199  
##  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.00   3rd Qu.:1.0000  
##  Max.   :8.000   Max.   :3.000   Max.   :14.00   Max.   :3.0000  
##                                                                  
##   GarageYrBlt     GarageCars      GarageArea       WoodDeckSF    
##  Min.   :1900   Min.   :0.000   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:1962   1st Qu.:1.000   1st Qu.: 337.5   1st Qu.:  0.00  
##  Median :1980   Median :2.000   Median : 480.0   Median :  0.00  
##  Mean   :1979   Mean   :1.772   Mean   : 475.0   Mean   : 92.97  
##  3rd Qu.:2002   3rd Qu.:2.000   3rd Qu.: 576.0   3rd Qu.:168.00  
##  Max.   :2010   Max.   :4.000   Max.   :1418.0   Max.   :857.00  
##  NA's   :65                                                      
##   OpenPorchSF     EnclosedPorch   X3SsnPorch       ScreenPorch    
##  Min.   :  0.00   Min.   :  0   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0   1st Qu.:  0.000   1st Qu.:  0.00  
##  Median : 25.00   Median :  0   Median :  0.000   Median :  0.00  
##  Mean   : 47.55   Mean   : 22   Mean   :  3.627   Mean   : 15.12  
##  3rd Qu.: 68.00   3rd Qu.:  0   3rd Qu.:  0.000   3rd Qu.:  0.00  
##  Max.   :547.00   Max.   :552   Max.   :508.000   Max.   :480.00  
##                                                                   
##     PoolArea          MiscVal             MoSold           YrSold    
##  Min.   :  0.000   Min.   :    0.00   Min.   : 1.000   Min.   :2006  
##  1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 4.000   1st Qu.:2007  
##  Median :  0.000   Median :    0.00   Median : 6.000   Median :2008  
##  Mean   :  2.955   Mean   :   43.57   Mean   : 6.264   Mean   :2008  
##  3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009  
##  Max.   :738.000   Max.   :15500.00   Max.   :12.000   Max.   :2010  
##                                                                      
##    SalePrice     
##  Min.   : 39300  
##  1st Qu.:129900  
##  Median :162700  
##  Mean   :180833  
##  3rd Qu.:213500  
##  Max.   :755000  
## 

Distribution of sale prices:

train |>
  ggplot(aes(x = SalePrice)) +
  geom_histogram(fill = "steelblue", bins = 30) +
  scale_x_continuous(label = scales::comma) +
  labs(title = "Distribution of House Prices", x = "Sale Price", y = "Count")

Sale prices are skewed to the right and have a median of $163K.

par(mfrow=c(3,3))
par(mai=c(.3,.3,.3,.3))

variables <- names(numeric_train)

for (i in 1:(length(variables)-1)) {
  hist(numeric_train[[variables[i]]], main = variables[i], col = "lightblue")
}

There are a number of skewed variables. We can try log transformations to some variables to try and normalize.

Now let’s look at the categorical variables.

char_train <- train[,sapply(train, is.character)]
char_train$SalePrice <- train$SalePrice

variables <- names(char_train)

for (i in 1:(length(variables)-1)) {
  char_train[[variables[i]]] <- as.factor(char_train[[variables[i]]])
}

char_train |>
  dplyr::select(-SalePrice) |>
  summary()
##     MSZoning    Street      Alley      LotShape  LandContour  Utilities   
##  C (all):  7   Grvl:   4   Grvl:  38   IR1:388   Bnk:  52    AllPub:1167  
##  FV     : 51   Pave:1164   Pave:  31   IR2: 30   HLS:  43    NoSeWa:   1  
##  RH     : 14               NA's:1099   IR3:  6   Low:  31                 
##  RL     :929                           Reg:744   Lvl:1042                 
##  RM     :167                                                              
##                                                                           
##                                                                           
##    LotConfig   LandSlope   Neighborhood   Condition1    Condition2  
##  Corner :215   Gtl:1104   NAmes  :178   Norm   :1003   Artery:   2  
##  CulDSac: 69   Mod:  55   CollgCr:116   Feedr  :  68   Feedr :   4  
##  FR2    : 41   Sev:   9   OldTown: 90   Artery :  39   Norm  :1157  
##  FR3    :  2              Edwards: 85   RRAn   :  20   PosA  :   1  
##  Inside :841              Somerst: 69   PosN   :  16   PosN  :   2  
##                           Sawyer : 64   RRAe   :  11   RRAn  :   1  
##                           (Other):566   (Other):  11   RRNn  :   1  
##    BldgType     HouseStyle    RoofStyle      RoofMatl     Exterior1st 
##  1Fam  :973   1Story :594   Flat   : 11   ClyTile:   1   VinylSd:408  
##  2fmCon: 24   2Story :360   Gable  :905   CompShg:1147   HdBoard:181  
##  Duplex: 44   1.5Fin :117   Gambrel:  9   Metal  :   1   MetalSd:173  
##  Twnhs : 28   SLvl   : 47   Hip    :237   Roll   :   1   Wd Sdng:161  
##  TwnhsE: 99   SFoyer : 26   Mansard:  5   Tar&Grv:   9   Plywood: 89  
##               1.5Unf : 10   Shed   :  1   WdShake:   3   CemntBd: 53  
##               (Other): 14                 WdShngl:   6   (Other):103  
##   Exterior2nd    MasVnrType  ExterQual ExterCond  Foundation  BsmtQual  
##  VinylSd:399   BrkCmn : 15   Ex: 43    Ex:   2   BrkTil:110   Ex  : 98  
##  HdBoard:170   BrkFace:352   Fa: 11    Fa:  23   CBlock:520   Fa  : 29  
##  MetalSd:168   None   :689   Gd:393    Gd: 116   PConc :510   Gd  :483  
##  Wd Sdng:158   Stone  :105   TA:721    Po:   1   Slab  : 19   TA  :530  
##  Plywood:116   NA's   :  7             TA:1026   Stone :  6   NA's: 28  
##  CmentBd: 52                                     Wood  :  3             
##  (Other):105                                                            
##  BsmtCond    BsmtExposure BsmtFinType1 BsmtFinType2  Heating     HeatingQC
##  Fa  :  34   Av  :176     ALQ :177     ALQ : 15     Floor:   1   Ex:593   
##  Gd  :  46   Gd  :105     BLQ :117     BLQ : 32     GasA :1143   Fa: 45   
##  Po  :   1   Mn  : 87     GLQ :331     GLQ : 11     GasW :  13   Gd:186   
##  TA  :1059   No  :771     LwQ : 61     LwQ : 41     Grav :   6   Po:  1   
##  NA's:  28   NA's: 29     Rec :115     Rec : 43     OthW :   2   TA:343   
##                           Unf :339     Unf :998     Wall :   3            
##                           NA's: 28     NA's: 28                           
##  CentralAir Electrical   KitchenQual Functional  FireplaceQu   GarageType 
##  N:  73     FuseA:  71   Ex: 83      Maj1:  13   Ex  : 23    2Types :  6  
##  Y:1095     FuseF:  20   Fa: 29      Maj2:   4   Fa  : 25    Attchd :702  
##             FuseP:   3   Gd:460      Min1:  27   Gd  :294    Basment: 16  
##             SBrkr:1073   TA:596      Min2:  25   Po  : 17    BuiltIn: 67  
##             NA's :   1               Mod :  11   TA  :256    CarPort:  7  
##                                      Sev :   1   NA's:553    Detchd :305  
##                                      Typ :1087               NA's   : 65  
##  GarageFinish GarageQual  GarageCond  PavedDrive  PoolQC       Fence    
##  Fin :285     Ex  :   3   Ex  :   2   N:  65     Ex  :   2   GdPrv: 44  
##  RFn :343     Fa  :  34   Fa  :  29   P:  24     Fa  :   2   GdWo : 47  
##  Unf :475     Gd  :  10   Gd  :   7   Y:1079     Gd  :   2   MnPrv:125  
##  NA's: 65     Po  :   2   Po  :   4              NA's:1162   MnWw : 10  
##               TA  :1054   TA  :1061                          NA's :942  
##               NA's:  65   NA's:  65                                     
##                                                                         
##  MiscFeature    SaleType    SaleCondition
##  Gar2:   1   WD     :1008   Abnorml: 83  
##  Othr:   2   New    : 102   AdjLand:  4  
##  Shed:  43   COD    :  36   Alloca : 10  
##  TenC:   1   ConLD  :   7   Family : 18  
##  NA's:1121   ConLI  :   4   Normal :948  
##              ConLw  :   4   Partial:105  
##              (Other):   7

There are a number of variables with NA values. Specifically, Alley, PoolQC, MiscFeature, Fence, and FireplaceQu have a lot of NA values.

par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))

for (i in 1:(length(variables)-1)) {
  boxplot(char_train$SalePrice~char_train[[variables[i]]], main = variables[i])
}

MSZoning may have an affect on SalePrice. There is also a lot of variability in price based on Neighborhood, ExterQUal, BsmntQual, KitchenQual, and CentralAir.

Scatterplot Matrix

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

Let’s make a scatterplot matrix for some of the numeric varaibles.

subset_train <- train |>
  dplyr::select(SalePrice, OverallQual, LotArea, X1stFlrSF, GrLivArea, FullBath, TotRmsAbvGrd, GarageArea)

pairs(subset_train, gap = 0.5)

Correlation Matrix

Derive a correlation matrix for any three quantitative variables in the data set.

cor_subset <- subset_train |>
  dplyr::select(SalePrice, GrLivArea, FullBath)

par(mfrow = c(1,3))

plot(SalePrice~GrLivArea, cor_subset) +
  abline(lm(SalePrice~GrLivArea, cor_subset), col = "red")
## integer(0)
plot(SalePrice~FullBath, cor_subset) +
  abline(lm(SalePrice~FullBath, cor_subset), col = "red")
## integer(0)
plot(GrLivArea~FullBath, cor_subset) +
  abline(lm(GrLivArea~FullBath, cor_subset), col = "red")

## integer(0)
(cor_matrix <- cor(cor_subset))
##           SalePrice GrLivArea  FullBath
## SalePrice 1.0000000 0.6961226 0.5462469
## GrLivArea 0.6961226 1.0000000 0.6231547
## FullBath  0.5462469 0.6231547 1.0000000

Test the hypothesis that the correlation between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

cor.test(cor_subset$SalePrice, cor_subset$GrLivArea, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  cor_subset$SalePrice and cor_subset$GrLivArea
## t = 33.11, df = 1166, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6762607 0.7149732
## sample estimates:
##       cor 
## 0.6961226

The correlation is 0.696 with a p-value of less than 2.2e-16, so this is a significant correlation.

CI[0.676, 0.715]: We are 80% confident that the true correlation lies within the range of 0.676 and 0.715.

cor.test(cor_subset$SalePrice, cor_subset$FullBath, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  cor_subset$SalePrice and cor_subset$FullBath
## t = 22.268, df = 1166, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5193648 0.5720490
## sample estimates:
##       cor 
## 0.5462469

The correlation is 0.546 with a p-value of less than 2.2e-16, so this is a significant correlation.

CI[0.519, 0.572]: We are 80% confident that the true correlation lies within the range of 0.519 and 0.572.

cor.test(cor_subset$GrLivArea, cor_subset$FullBath, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  cor_subset$GrLivArea and cor_subset$FullBath
## t = 27.207, df = 1166, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5996492 0.6455859
## sample estimates:
##       cor 
## 0.6231547

The correlation is 0.623 with a p-value of less than 2.2e-16, so this is a significant correlation.

CI[0.600, .646]: We are 80% confident that the true correlation lies within the range of 0.600 and 0.646.

All three pairwise sets have significant correlations. There is a strong correlation between sale price and above grade living area (71% correlated). There is a moderate correlation between sale price and total number of full bathrooms (56% correlated) and between above grade living area and total number of full bathrooms (63% correlated).

Family Wise Error Rate (FWE):

The formula for calculating family wise error can be found here.

\(FWE \le 1 - (1 - \alpha)^c\) where \(\alpha\) is your chosen confidence level and \(c\) is the number of correlation tests.

We use an 80% confidence interval, so \(\alpha = 0.2\), and we conducted three tests.

a <- 0.2
c <- 3

1-(1-a)**c
## [1] 0.488

The family wise error here is 0.488, so there is a 48.8% chance that we made a Type I error in at least one of the correlation tests above. This is a very high likelihood and so I would be very concerned about familywise error here.

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as a precision matrix and contains variance inflation factors on the diagonal.)

# precision matrix - inverse of correlation matrix
(prec_matrix <- solve(cor_matrix))
##            SalePrice  GrLivArea   FullBath
## SalePrice  2.0212682 -1.1754845 -0.3716028
## GrLivArea -1.1754845  2.3184589 -0.8026538
## FullBath  -0.3716028 -0.8026538  1.7031644

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

prec_matrix %*% cor_matrix
##               SalePrice    GrLivArea     FullBath
## SalePrice  1.000000e+00 1.387779e-16 1.110223e-16
## GrLivArea  5.551115e-17 1.000000e+00 2.220446e-16
## FullBath  -1.110223e-16 2.220446e-16 1.000000e+00
cor_matrix %*% prec_matrix
##              SalePrice    GrLivArea     FullBath
## SalePrice 1.000000e+00 5.551115e-17 0.000000e+00
## GrLivArea 1.110223e-16 1.000000e+00 2.220446e-16
## FullBath  5.551115e-17 2.220446e-16 1.000000e+00

Conduct LU decomposition on the matrix.

lu.decomposition(prec_matrix)
## $L
##            [,1]       [,2] [,3]
## [1,]  1.0000000  0.0000000    0
## [2,] -0.5815579  1.0000000    0
## [3,] -0.1838463 -0.6231547    1
## 
## $U
##          [,1]      [,2]       [,3]
## [1,] 2.021268 -1.175484 -0.3716028
## [2,] 0.000000  1.634847 -1.0187624
## [3,] 0.000000  0.000000  1.0000000

Calculus-Based Probability & Statistics:

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr() to fit an exponential probability density function.

Chosen variable: SalePrice

# histogram original
hist(train$SalePrice)

# fit exponential pdf
(fit <- fitdistr(train$SalePrice, "exponential"))
##        rate    
##   5.529968e-06 
##  (1.618084e-07)

Find the optimal value of \(\lambda\) for this distribution, and then take 1,000 samples from this exponential distribution using this value (e.e, rexp(1000, \(\lambda\))).

lambda <- fit$estimate

# exponential dist
exp_dist <- rexp(1000, lambda)

Plot a histogram and compare it with the histogram of your original variable.

par(mfrow = c(1,2))
hist(train$SalePrice, main = "Histogram of SalePrice")
hist(exp_dist, main = "Histogram of Exponential")

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

# percentiles using CDF
quantile(exp_dist, c(.05, .95))
##        5%       95% 
##  10979.05 577624.65
# conf interval of empirical data
t.test(train$SalePrice)$conf.int
## [1] 176274.6 185391.2
## attr(,"conf.level")
## [1] 0.95
# empirical percentiles
quantile(train$SalePrice, c(.05, .95))
##       5%      95% 
##  88350.0 325510.6

There is a signicant difference between the quantiles for the generated exponential distribution and the quantiles for the empirical data. This suggests that the SalePrices does not closely follow the exponential distribution.

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with your analysis. Report your Kaggle.com username and score.

kdepairs(subset_train)

trans_train <- subset_train |>
  mutate(log_LotArea = log(LotArea))

par(mfrow = c(1,2))
hist(subset_train$LotArea, main = "Original")
hist(trans_train$log_LotArea, main = "Log Transformed")

# remove extreme outliers from LotArea
iqr <- IQR(subset_train$LotArea)

summary(subset_train$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7552    9504   10385   11604  164660
trans_train <- trans_train |>
  mutate(LotArea_filtered = ifelse(LotArea >= 11602 + (1.5*iqr) | LotArea <= 7554 - (1.5*iqr), NA, LotArea))

par(mfrow = c(1,2))
boxplot(subset_train$LotArea, main = "Original")
boxplot(trans_train$LotArea_filtered, main = "Filtered for Outliers")

summary(trans_train$LotArea_filtered)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1491    7406    9308    9288   11240   17671      54
par(mfrow = c(1,2))

plot(trans_train$SalePrice~trans_train$log_LotArea, main = "Log Transformed") +
  abline(lm(trans_train$SalePrice~trans_train$log_LotArea), col = "red")
## integer(0)
plot(trans_train$SalePrice~trans_train$LotArea_filtered, main = "Filtered") +
  abline(lm(trans_train$SalePrice~trans_train$LotArea_filtered), col = "red")

## integer(0)

Apply log transformation to normalize X1stFlrSF and GrLivArea and check to see how correlation changes.

trans_train |>
  dplyr::select(-LotArea_filtered) |>
  mutate(log_X1stFlrSF = log(X1stFlrSF),
         log_GrLivArea = log(GrLivArea)) |>
  kdepairs()

The transformed variables are less correlated than the original variables. This is probably because SalePrice itself is skewed. The log transformed LotArea is more correlated than the original.

Build Initial Model

test_lm <- lm(SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF + FullBath + TotRmsAbvGrd + GarageArea + MSZoning + Neighborhood + ExterQual + KitchenQual + CentralAir + CentralAir*GrLivArea, train)

summary(test_lm)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + log(LotArea) + GrLivArea + 
##     X1stFlrSF + FullBath + TotRmsAbvGrd + GarageArea + MSZoning + 
##     Neighborhood + ExterQual + KitchenQual + CentralAir + CentralAir * 
##     GrLivArea, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -380441  -14476     954   13485  265461 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -93566.310  35719.521  -2.619 0.008925 ** 
## OverallQual            14021.509   1382.783  10.140  < 2e-16 ***
## log(LotArea)           17804.752   3213.588   5.540 3.76e-08 ***
## GrLivArea                 19.918      9.170   2.172 0.030057 *  
## X1stFlrSF                 13.529      3.889   3.479 0.000522 ***
## FullBath                1621.942   2792.776   0.581 0.561516    
## TotRmsAbvGrd             836.023   1172.229   0.713 0.475876    
## GarageArea                27.934      6.488   4.306 1.81e-05 ***
## MSZoningFV             22955.955  18656.000   1.230 0.218772    
## MSZoningRH             20312.852  18392.598   1.104 0.269654    
## MSZoningRL             24608.622  15974.211   1.541 0.123715    
## MSZoningRM             25392.318  15247.046   1.665 0.096113 .  
## NeighborhoodBlueste    -4303.792  35827.966  -0.120 0.904407    
## NeighborhoodBrDale    -20215.540  14704.791  -1.375 0.169480    
## NeighborhoodBrkSide   -20700.447  11956.998  -1.731 0.083683 .  
## NeighborhoodClearCr    -7028.895  13182.335  -0.533 0.593997    
## NeighborhoodCollgCr    -9071.965  10263.380  -0.884 0.376931    
## NeighborhoodCrawfor     5384.623  11436.329   0.471 0.637850    
## NeighborhoodEdwards   -35785.593  10951.508  -3.268 0.001117 ** 
## NeighborhoodGilbert   -12269.440  11015.405  -1.114 0.265584    
## NeighborhoodIDOTRR    -30727.873  14064.010  -2.185 0.029105 *  
## NeighborhoodMeadowV   -13761.151  14602.756  -0.942 0.346207    
## NeighborhoodMitchel   -19310.723  11623.973  -1.661 0.096935 .  
## NeighborhoodNAmes     -22697.774  10539.909  -2.154 0.031491 *  
## NeighborhoodNoRidge    45745.668  11758.679   3.890 0.000106 ***
## NeighborhoodNPkVill    -6913.853  15505.768  -0.446 0.655763    
## NeighborhoodNridgHt    30002.388  10717.326   2.799 0.005207 ** 
## NeighborhoodNWAmes    -21615.329  11045.522  -1.957 0.050603 .  
## NeighborhoodOldTown   -36614.716  12011.911  -3.048 0.002356 ** 
## NeighborhoodSawyer    -21392.211  11215.315  -1.907 0.056722 .  
## NeighborhoodSawyerW   -16347.813  11026.624  -1.483 0.138467    
## NeighborhoodSomerst     4583.337  12594.375   0.364 0.715987    
## NeighborhoodStoneBr    42138.086  11967.047   3.521 0.000447 ***
## NeighborhoodSWISU     -31713.802  13369.090  -2.372 0.017851 *  
## NeighborhoodTimber       232.308  11664.563   0.020 0.984114    
## NeighborhoodVeenker    17735.515  14381.748   1.233 0.217760    
## ExterQualFa           -27903.972  14811.121  -1.884 0.059825 .  
## ExterQualGd           -28441.041   6732.932  -4.224 2.59e-05 ***
## ExterQualTA           -32819.015   7487.320  -4.383 1.28e-05 ***
## KitchenQualFa         -43290.222   9510.320  -4.552 5.89e-06 ***
## KitchenQualGd         -35902.600   5151.835  -6.969 5.43e-12 ***
## KitchenQualTA         -44814.028   5705.035  -7.855 9.28e-15 ***
## CentralAirY            -7304.536  12457.281  -0.586 0.557747    
## GrLivArea:CentralAirY     15.336      8.329   1.841 0.065850 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33890 on 1124 degrees of freedom
## Multiple R-squared:  0.8245, Adjusted R-squared:  0.8178 
## F-statistic: 122.8 on 43 and 1124 DF,  p-value: < 2.2e-16

The model accounts for about 82% of the variability within the data.

Improve Model

Let’s see if we can improve the model using a backward stepwise approach.

better_lm <- step(test_lm, direction = "backward")
## Start:  AIC=24409.78
## SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF + 
##     FullBath + TotRmsAbvGrd + GarageArea + MSZoning + Neighborhood + 
##     ExterQual + KitchenQual + CentralAir + CentralAir * GrLivArea
## 
##                        Df  Sum of Sq        RSS   AIC
## - MSZoning              4 3.4105e+09 1.2945e+12 24405
## - FullBath              1 3.8742e+08 1.2915e+12 24408
## - TotRmsAbvGrd          1 5.8424e+08 1.2917e+12 24408
## <none>                               1.2911e+12 24410
## - GrLivArea:CentralAir  1 3.8941e+09 1.2950e+12 24411
## - X1stFlrSF             1 1.3905e+10 1.3050e+12 24420
## - ExterQual             3 2.3162e+10 1.3142e+12 24425
## - GarageArea            1 2.1295e+10 1.3124e+12 24427
## - log(LotArea)          1 3.5259e+10 1.3263e+12 24439
## - KitchenQual           3 7.1440e+10 1.3625e+12 24467
## - OverallQual           1 1.1810e+11 1.4092e+12 24510
## - Neighborhood         24 2.3525e+11 1.5263e+12 24557
## 
## Step:  AIC=24404.86
## SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF + 
##     FullBath + TotRmsAbvGrd + GarageArea + Neighborhood + ExterQual + 
##     KitchenQual + CentralAir + GrLivArea:CentralAir
## 
##                        Df  Sum of Sq        RSS   AIC
## - FullBath              1 4.0329e+08 1.2949e+12 24403
## - TotRmsAbvGrd          1 5.8545e+08 1.2951e+12 24403
## <none>                               1.2945e+12 24405
## - GrLivArea:CentralAir  1 5.4185e+09 1.2999e+12 24408
## - X1stFlrSF             1 1.4578e+10 1.3091e+12 24416
## - ExterQual             3 2.2920e+10 1.3174e+12 24419
## - GarageArea            1 2.0088e+10 1.3146e+12 24421
## - log(LotArea)          1 3.7780e+10 1.3323e+12 24437
## - KitchenQual           3 7.2176e+10 1.3667e+12 24462
## - OverallQual           1 1.2162e+11 1.4161e+12 24508
## - Neighborhood         24 2.4948e+11 1.5440e+12 24563
## 
## Step:  AIC=24403.23
## SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF + 
##     TotRmsAbvGrd + GarageArea + Neighborhood + ExterQual + KitchenQual + 
##     CentralAir + GrLivArea:CentralAir
## 
##                        Df  Sum of Sq        RSS   AIC
## - TotRmsAbvGrd          1 7.5483e+08 1.2956e+12 24402
## <none>                               1.2949e+12 24403
## - GrLivArea:CentralAir  1 5.5269e+09 1.3004e+12 24406
## - X1stFlrSF             1 1.4816e+10 1.3097e+12 24415
## - ExterQual             3 2.2997e+10 1.3179e+12 24418
## - GarageArea            1 2.0146e+10 1.3150e+12 24419
## - log(LotArea)          1 3.7377e+10 1.3323e+12 24435
## - KitchenQual           3 7.1789e+10 1.3667e+12 24460
## - OverallQual           1 1.2164e+11 1.4165e+12 24506
## - Neighborhood         24 2.5045e+11 1.5453e+12 24562
## 
## Step:  AIC=24401.91
## SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF + 
##     GarageArea + Neighborhood + ExterQual + KitchenQual + CentralAir + 
##     GrLivArea:CentralAir
## 
##                        Df  Sum of Sq        RSS   AIC
## <none>                               1.2956e+12 24402
## - GrLivArea:CentralAir  1 5.1948e+09 1.3008e+12 24405
## - X1stFlrSF             1 1.4163e+10 1.3098e+12 24413
## - ExterQual             3 2.3395e+10 1.3190e+12 24417
## - GarageArea            1 1.9824e+10 1.3155e+12 24418
## - log(LotArea)          1 3.8443e+10 1.3341e+12 24434
## - KitchenQual           3 7.1622e+10 1.3673e+12 24459
## - OverallQual           1 1.2089e+11 1.4165e+12 24504
## - Neighborhood         24 2.4992e+11 1.5456e+12 24560
summary(better_lm)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + log(LotArea) + GrLivArea + 
##     X1stFlrSF + GarageArea + Neighborhood + ExterQual + KitchenQual + 
##     CentralAir + GrLivArea:CentralAir, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -387144  -14201     992   13555  262258 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -61642.091  30483.155  -2.022 0.043394 *  
## OverallQual            14047.778   1368.116  10.268  < 2e-16 ***
## log(LotArea)           17724.160   3060.970   5.790 9.09e-09 ***
## GrLivArea                 21.287      8.024   2.653 0.008089 ** 
## X1stFlrSF                 13.328      3.792   3.515 0.000458 ***
## GarageArea                26.777      6.440   4.158 3.45e-05 ***
## NeighborhoodBlueste    -5424.106  35415.396  -0.153 0.878302    
## NeighborhoodBrDale    -20290.037  13751.054  -1.476 0.140350    
## NeighborhoodBrkSide   -21149.019  11439.908  -1.849 0.064761 .  
## NeighborhoodClearCr    -8003.552  13032.100  -0.614 0.539246    
## NeighborhoodCollgCr    -9309.326  10188.900  -0.914 0.361083    
## NeighborhoodCrawfor     4206.713  11285.834   0.373 0.709410    
## NeighborhoodEdwards   -36507.616  10821.288  -3.374 0.000767 ***
## NeighborhoodGilbert   -12237.630  10956.952  -1.117 0.264282    
## NeighborhoodIDOTRR    -35823.681  12272.328  -2.919 0.003580 ** 
## NeighborhoodMeadowV   -14146.120  13510.743  -1.047 0.295311    
## NeighborhoodMitchel   -19466.049  11502.373  -1.692 0.090855 .  
## NeighborhoodNAmes     -23390.708  10437.545  -2.241 0.025219 *  
## NeighborhoodNoRidge    44371.009  11630.723   3.815 0.000144 ***
## NeighborhoodNPkVill    -6431.449  15438.866  -0.417 0.677068    
## NeighborhoodNridgHt    30186.883  10681.791   2.826 0.004796 ** 
## NeighborhoodNWAmes    -21536.259  10991.326  -1.959 0.050313 .  
## NeighborhoodOldTown   -36729.475  10682.853  -3.438 0.000607 ***
## NeighborhoodSawyer    -21946.900  11102.977  -1.977 0.048322 *  
## NeighborhoodSawyerW   -17056.809  10930.070  -1.561 0.118912    
## NeighborhoodSomerst     3220.497  10302.584   0.313 0.754649    
## NeighborhoodStoneBr    41433.879  11918.487   3.476 0.000527 ***
## NeighborhoodSWISU     -32945.671  13123.483  -2.510 0.012197 *  
## NeighborhoodTimber        96.435  11598.986   0.008 0.993368    
## NeighborhoodVeenker    16507.523  14261.676   1.157 0.247323    
## ExterQualFa           -34600.650  14332.786  -2.414 0.015933 *  
## ExterQualGd           -28658.259   6713.786  -4.269 2.13e-05 ***
## ExterQualTA           -33150.043   7459.453  -4.444 9.70e-06 ***
## KitchenQualFa         -42853.255   9484.451  -4.518 6.89e-06 ***
## KitchenQualGd         -35576.532   5101.680  -6.973 5.25e-12 ***
## KitchenQualTA         -44663.908   5672.493  -7.874 8.03e-15 ***
## CentralAirY            -9831.940  12288.695  -0.800 0.423833    
## GrLivArea:CentralAirY     17.356      8.154   2.129 0.033507 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33860 on 1130 degrees of freedom
## Multiple R-squared:  0.8239, Adjusted R-squared:  0.8181 
## F-statistic: 142.9 on 37 and 1130 DF,  p-value: < 2.2e-16

The model performs the same but removed GrLivArea, FullBath, and MSZOning.

par(mfrow = c(1,2))
plot(better_lm, which = c(1,2))

The fitted vs residuals plot has a slight curve and there are some outlying residuals. There also seems issues in the tails of the residuals.

vif(better_lm, type = "predictor")
## GVIFs computed for predictors
##                      GVIF Df GVIF^(1/(2*Df)) Interacts With
## OverallQual  3.641133e+00  1        1.908175           --  
## LotArea      1.865968e+07  0             Inf           --  
## GrLivArea    3.645335e+00  3        1.240575     CentralAir
## X1stFlrSF    2.161214e+00  1        1.470107           --  
## GarageArea   1.944068e+00  1        1.394299           --  
## Neighborhood 1.456642e+01 24        1.057393           --  
## ExterQual    6.621240e+00  3        1.370324           --  
## KitchenQual  5.333864e+00  3        1.321824           --  
## CentralAir   3.645335e+00  3        1.240575      GrLivArea
##                                                                                                      Other Predictors
## OverallQual               LotArea, GrLivArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, KitchenQual, CentralAir
## LotArea      OverallQual, LotArea, GrLivArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, KitchenQual, CentralAir
## GrLivArea                           OverallQual, LotArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, KitchenQual
## X1stFlrSF               OverallQual, LotArea, GrLivArea, GarageArea, Neighborhood, ExterQual, KitchenQual, CentralAir
## GarageArea               OverallQual, LotArea, GrLivArea, X1stFlrSF, Neighborhood, ExterQual, KitchenQual, CentralAir
## Neighborhood               OverallQual, LotArea, GrLivArea, X1stFlrSF, GarageArea, ExterQual, KitchenQual, CentralAir
## ExterQual               OverallQual, LotArea, GrLivArea, X1stFlrSF, GarageArea, Neighborhood, KitchenQual, CentralAir
## KitchenQual               OverallQual, LotArea, GrLivArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, CentralAir
## CentralAir                          OverallQual, LotArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, KitchenQual

There is very high collinearity for LotArea. Let’s remove this predictor.

update_lm <- update(better_lm, .~. - log(LotArea))

summary(update_lm)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + 
##     GarageArea + Neighborhood + ExterQual + KitchenQual + CentralAir + 
##     GrLivArea:CentralAir, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -391541  -14865     364   13126  252622 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            76162.161  19320.268   3.942 8.57e-05 ***
## OverallQual            13451.468   1383.713   9.721  < 2e-16 ***
## GrLivArea                 23.152      8.132   2.847  0.00449 ** 
## X1stFlrSF                 17.899      3.762   4.758 2.21e-06 ***
## GarageArea                34.268      6.399   5.356 1.03e-07 ***
## NeighborhoodBlueste   -20349.196  35825.808  -0.568  0.57015    
## NeighborhoodBrDale    -27987.049  13882.079  -2.016  0.04403 *  
## NeighborhoodBrkSide    -7497.397  11354.168  -0.660  0.50918    
## NeighborhoodClearCr    19758.186  12291.144   1.608  0.10822    
## NeighborhoodCollgCr     8506.740   9851.914   0.863  0.38807    
## NeighborhoodCrawfor    21635.041  11032.379   1.961  0.05012 .  
## NeighborhoodEdwards   -19229.188  10550.254  -1.823  0.06862 .  
## NeighborhoodGilbert     8168.136  10522.895   0.776  0.43778    
## NeighborhoodIDOTRR    -20357.406  12149.160  -1.676  0.09409 .  
## NeighborhoodMeadowV   -19351.773  13673.284  -1.415  0.15726    
## NeighborhoodMitchel      -37.827  11159.221  -0.003  0.99730    
## NeighborhoodNAmes      -5737.676  10124.910  -0.567  0.57104    
## NeighborhoodNoRidge    61837.465  11393.136   5.428 6.98e-08 ***
## NeighborhoodNPkVill    -8075.745  15656.660  -0.516  0.60609    
## NeighborhoodNridgHt    44864.782  10524.821   4.263 2.19e-05 ***
## NeighborhoodNWAmes     -2856.533  10657.272  -0.268  0.78872    
## NeighborhoodOldTown   -23087.174  10568.590  -2.185  0.02913 *  
## NeighborhoodSawyer     -2710.166  10745.568  -0.252  0.80092    
## NeighborhoodSawyerW      539.343  10649.042   0.051  0.95962    
## NeighborhoodSomerst    15951.403  10208.966   1.562  0.11845    
## NeighborhoodStoneBr    54112.621  11882.907   4.554 5.84e-06 ***
## NeighborhoodSWISU     -20043.962  13117.614  -1.528  0.12679    
## NeighborhoodTimber     22179.147  11110.503   1.996  0.04615 *  
## NeighborhoodVeenker    38922.428  13922.251   2.796  0.00527 ** 
## ExterQualFa           -34320.778  14537.353  -2.361  0.01840 *  
## ExterQualGd           -28974.362   6809.423  -4.255 2.26e-05 ***
## ExterQualTA           -32442.786   7564.948  -4.289 1.95e-05 ***
## KitchenQualFa         -43197.807   9619.685  -4.491 7.83e-06 ***
## KitchenQualGd         -37526.497   5163.239  -7.268 6.79e-13 ***
## KitchenQualTA         -45390.853   5752.078  -7.891 7.03e-15 ***
## CentralAirY           -12940.470  12452.259  -1.039  0.29893    
## GrLivArea:CentralAirY     19.467      8.262   2.356  0.01864 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34340 on 1131 degrees of freedom
## Multiple R-squared:  0.8187, Adjusted R-squared:  0.8129 
## F-statistic: 141.8 on 36 and 1131 DF,  p-value: < 2.2e-16

The \(R^2\) is not much changed.

vif(update_lm)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                           GVIF Df GVIF^(1/(2*Df))
## OverallQual           3.620502  1        1.902762
## GrLivArea            18.162544  1        4.261754
## X1stFlrSF             2.067551  1        1.437898
## GarageArea            1.865621  1        1.365877
## Neighborhood          8.280477 24        1.045024
## ExterQual             6.605269  3        1.369772
## KitchenQual           5.289657  3        1.319992
## CentralAir            8.996460  1        2.999410
## GrLivArea:CentralAir 26.701587  1        5.167358

It seems we have gotten rid of some collinearity issues.

par(mfrow = c(1,2))
plot(update_lm, which = c(1,2))

Validate Model

Let’s see how the model performs so far.

validate$predictSalePrice <- predict(update_lm, validate)

se <- (validate$predictSalePrice - validate$SalePrice)**2

(RMSE <- sqrt(mean(se)))
## [1] 29711.92

Let’s look back at the data and see if there is anything we can add to improve the model.

par(mfrow = c(1,3))
boxplot(train$SalePrice~as.factor(train$MSSubClass))
boxplot(train$SalePrice~as.factor(train$MoSold))
boxplot(train$SalePrice~as.factor(train$YrSold))

update_lm <- lm(SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + TotRmsAbvGrd + GarageArea + Neighborhood + ExterQual + KitchenQual + GrLivArea:CentralAir + HouseStyle + MSSubClass, train)

summary(update_lm)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + 
##     TotRmsAbvGrd + GarageArea + Neighborhood + ExterQual + KitchenQual + 
##     GrLivArea:CentralAir + HouseStyle + MSSubClass, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -361206  -14251      86   13097  241451 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            68198.758  18126.033   3.762 0.000177 ***
## OverallQual            13707.155   1384.171   9.903  < 2e-16 ***
## GrLivArea                 55.933      7.611   7.349 3.85e-13 ***
## X1stFlrSF                -12.390      7.664  -1.617 0.106246    
## TotRmsAbvGrd            1209.810   1148.328   1.054 0.292320    
## GarageArea                29.947      6.339   4.724 2.60e-06 ***
## NeighborhoodBlueste   -10052.371  35100.199  -0.286 0.774632    
## NeighborhoodBrDale    -16731.311  13775.393  -1.215 0.224782    
## NeighborhoodBrkSide   -17618.012  11675.781  -1.509 0.131596    
## NeighborhoodClearCr     6904.093  12351.192   0.559 0.576285    
## NeighborhoodCollgCr    -5928.162  10053.383  -0.590 0.555532    
## NeighborhoodCrawfor    10001.348  11158.344   0.896 0.370279    
## NeighborhoodEdwards   -31655.288  10697.378  -2.959 0.003149 ** 
## NeighborhoodGilbert    -5697.059  10740.744  -0.530 0.595929    
## NeighborhoodIDOTRR    -29146.888  12401.486  -2.350 0.018932 *  
## NeighborhoodMeadowV   -15041.472  13608.110  -1.105 0.269253    
## NeighborhoodMitchel   -13242.853  11183.157  -1.184 0.236593    
## NeighborhoodNAmes     -21751.195  10326.830  -2.106 0.035401 *  
## NeighborhoodNoRidge    46380.747  11570.683   4.008 6.51e-05 ***
## NeighborhoodNPkVill     -701.074  15368.956  -0.046 0.963624    
## NeighborhoodNridgHt    37018.256  10446.339   3.544 0.000411 ***
## NeighborhoodNWAmes    -18236.836  10824.559  -1.685 0.092312 .  
## NeighborhoodOldTown   -35583.367  10766.760  -3.305 0.000980 ***
## NeighborhoodSawyer    -19966.946  10949.998  -1.823 0.068499 .  
## NeighborhoodSawyerW   -11267.435  10730.692  -1.050 0.293935    
## NeighborhoodSomerst     7886.497  10222.247   0.772 0.440571    
## NeighborhoodStoneBr    49332.870  11678.564   4.224 2.59e-05 ***
## NeighborhoodSWISU     -29633.897  13272.885  -2.233 0.025768 *  
## NeighborhoodTimber      7667.237  11228.430   0.683 0.494848    
## NeighborhoodVeenker    28748.378  13830.796   2.079 0.037883 *  
## ExterQualFa           -36671.004  14246.307  -2.574 0.010178 *  
## ExterQualGd           -28340.402   6681.796  -4.241 2.40e-05 ***
## ExterQualTA           -32235.980   7408.118  -4.351 1.48e-05 ***
## KitchenQualFa         -39435.862   9390.630  -4.199 2.89e-05 ***
## KitchenQualGd         -35367.325   5069.389  -6.977 5.15e-12 ***
## KitchenQualTA         -42562.160   5658.814  -7.521 1.11e-13 ***
## HouseStyle1.5Unf       10043.242  11756.104   0.854 0.393121    
## HouseStyle1Story       18564.773   5530.655   3.357 0.000815 ***
## HouseStyle2.5Fin      -13002.369  15903.886  -0.818 0.413782    
## HouseStyle2.5Unf          89.984  13083.985   0.007 0.994514    
## HouseStyle2Story       -3067.525   4650.244  -0.660 0.509615    
## HouseStyleSFoyer       29513.507   8443.084   3.496 0.000491 ***
## HouseStyleSLvl         21337.933   6763.755   3.155 0.001649 ** 
## MSSubClass              -191.196     33.302  -5.741 1.21e-08 ***
## GrLivArea:CentralAirY      8.456      3.235   2.614 0.009080 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33570 on 1123 degrees of freedom
## Multiple R-squared:  0.828,  Adjusted R-squared:  0.8213 
## F-statistic: 122.9 on 44 and 1123 DF,  p-value: < 2.2e-16
best_lm <- step(update_lm, direction = "backward")
## Start:  AIC=24388.28
## SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + TotRmsAbvGrd + 
##     GarageArea + Neighborhood + ExterQual + KitchenQual + GrLivArea:CentralAir + 
##     HouseStyle + MSSubClass
## 
##                        Df  Sum of Sq        RSS   AIC
## - TotRmsAbvGrd          1 1.2506e+09 1.2666e+12 24387
## <none>                               1.2653e+12 24388
## - X1stFlrSF             1 2.9446e+09 1.2683e+12 24389
## - GrLivArea:CentralAir  1 7.6966e+09 1.2730e+12 24393
## - HouseStyle            7 2.3144e+10 1.2885e+12 24395
## - ExterQual             3 2.2343e+10 1.2877e+12 24403
## - GarageArea            1 2.5145e+10 1.2905e+12 24409
## - MSSubClass            1 3.7140e+10 1.3025e+12 24420
## - KitchenQual           3 6.5553e+10 1.3309e+12 24441
## - OverallQual           1 1.1050e+11 1.3758e+12 24484
## - Neighborhood         24 2.6681e+11 1.5322e+12 24564
## 
## Step:  AIC=24387.43
## SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + GarageArea + 
##     Neighborhood + ExterQual + KitchenQual + HouseStyle + MSSubClass + 
##     GrLivArea:CentralAir
## 
##                        Df  Sum of Sq        RSS   AIC
## <none>                               1.2666e+12 24387
## - X1stFlrSF             1 2.8051e+09 1.2694e+12 24388
## - GrLivArea:CentralAir  1 7.2050e+09 1.2738e+12 24392
## - HouseStyle            7 2.2394e+10 1.2890e+12 24394
## - ExterQual             3 2.2844e+10 1.2894e+12 24402
## - GarageArea            1 2.4727e+10 1.2913e+12 24408
## - MSSubClass            1 3.9175e+10 1.3058e+12 24421
## - KitchenQual           3 6.5369e+10 1.3320e+12 24440
## - OverallQual           1 1.0925e+11 1.3759e+12 24482
## - Neighborhood         24 2.6649e+11 1.5331e+12 24563
vif(best_lm)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                           GVIF Df GVIF^(1/(2*Df))
## OverallQual           3.756865  1        1.938263
## GrLivArea            14.107881  1        3.756046
## X1stFlrSF             8.970489  1        2.995078
## GarageArea            1.913683  1        1.383359
## Neighborhood         18.383918 24        1.062533
## ExterQual             6.664807  3        1.371823
## KitchenQual           5.351207  3        1.322539
## HouseStyle           20.936987  7        1.242656
## MSSubClass            2.004517  1        1.415810
## GrLivArea:CentralAir  4.251176  1        2.061838
validate$predictSalePrice <- predict(best_lm, validate)

se <- (validate$predictSalePrice - validate$SalePrice)**2

(RMSE <- sqrt(mean(se)))
## [1] 29541.11

We reduced the RMSE slightly.

This was my initial model for submission, however I noticed that there were 2 NA values in the predictions from this model. Upon examination, there was one column with an NA value for KitchenQual and one column with an NA value for GarageArea. I removed these variables from the model.

update_lm <- update(best_lm, .~. - KitchenQual - GarageArea)

best_lm <- step(update_lm, direction = "backward")
## Start:  AIC=24459.68
## SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + Neighborhood + 
##     ExterQual + HouseStyle + MSSubClass + GrLivArea:CentralAir
## 
##                        Df  Sum of Sq        RSS   AIC
## - X1stFlrSF             1 9.7606e+08 1.3577e+12 24459
## <none>                               1.3567e+12 24460
## - GrLivArea:CentralAir  1 9.8073e+09 1.3665e+12 24466
## - HouseStyle            7 3.1525e+10 1.3882e+12 24473
## - MSSubClass            1 5.4153e+10 1.4108e+12 24503
## - ExterQual             3 6.7172e+10 1.4239e+12 24510
## - OverallQual           1 1.5213e+11 1.5088e+12 24582
## - Neighborhood         24 3.3015e+11 1.6868e+12 24666
## 
## Step:  AIC=24458.52
## SalePrice ~ OverallQual + GrLivArea + Neighborhood + ExterQual + 
##     HouseStyle + MSSubClass + GrLivArea:CentralAir
## 
##                        Df  Sum of Sq        RSS   AIC
## <none>                               1.3577e+12 24459
## - GrLivArea:CentralAir  1 9.9975e+09 1.3677e+12 24465
## - HouseStyle            7 4.9124e+10 1.4068e+12 24486
## - MSSubClass            1 5.4976e+10 1.4126e+12 24503
## - ExterQual             3 6.6421e+10 1.4241e+12 24508
## - OverallQual           1 1.5132e+11 1.5090e+12 24580
## - Neighborhood         24 3.2958e+11 1.6872e+12 24664
summary(best_lm)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + Neighborhood + 
##     ExterQual + HouseStyle + MSSubClass + GrLivArea:CentralAir, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -370362  -14845      43   14253  258513 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            46270.318  17157.272   2.697 0.007104 ** 
## OverallQual            15561.502   1387.226  11.218  < 2e-16 ***
## GrLivArea                 57.355      4.266  13.446  < 2e-16 ***
## NeighborhoodBlueste   -10989.630  36176.839  -0.304 0.761355    
## NeighborhoodBrDale    -21615.428  14129.279  -1.530 0.126338    
## NeighborhoodBrkSide   -21722.688  11947.940  -1.818 0.069312 .  
## NeighborhoodClearCr     4232.710  12698.208   0.333 0.738946    
## NeighborhoodCollgCr    -7136.464  10346.292  -0.690 0.490487    
## NeighborhoodCrawfor     2781.582  11442.578   0.243 0.807979    
## NeighborhoodEdwards   -36629.538  10967.201  -3.340 0.000865 ***
## NeighborhoodGilbert   -11565.287  11050.088  -1.047 0.295497    
## NeighborhoodIDOTRR    -31584.378  12760.490  -2.475 0.013463 *  
## NeighborhoodMeadowV   -15944.531  13970.465  -1.141 0.253986    
## NeighborhoodMitchel   -16950.480  11488.809  -1.475 0.140386    
## NeighborhoodNAmes     -26354.293  10606.942  -2.485 0.013113 *  
## NeighborhoodNoRidge    42560.578  11811.045   3.603 0.000328 ***
## NeighborhoodNPkVill    -2452.881  15823.996  -0.155 0.876841    
## NeighborhoodNridgHt    45748.791  10717.850   4.268 2.13e-05 ***
## NeighborhoodNWAmes    -23467.127  11140.805  -2.106 0.035389 *  
## NeighborhoodOldTown   -37599.611  11053.571  -3.402 0.000693 ***
## NeighborhoodSawyer    -24974.210  11259.605  -2.218 0.026751 *  
## NeighborhoodSawyerW   -14936.962  11048.656  -1.352 0.176670    
## NeighborhoodSomerst    10249.210  10493.931   0.977 0.328937    
## NeighborhoodStoneBr    56635.193  11985.225   4.725 2.59e-06 ***
## NeighborhoodSWISU     -37388.368  13566.310  -2.756 0.005946 ** 
## NeighborhoodTimber      7991.847  11563.979   0.691 0.489646    
## NeighborhoodVeenker    29507.478  14213.933   2.076 0.038124 *  
## ExterQualFa           -54124.786  13767.874  -3.931 8.97e-05 ***
## ExterQualGd           -45228.513   6357.954  -7.114 2.00e-12 ***
## ExterQualTA           -52276.548   7193.553  -7.267 6.84e-13 ***
## HouseStyle1.5Unf        7776.145  11805.811   0.659 0.510241    
## HouseStyle1Story       16835.153   4282.803   3.931 8.98e-05 ***
## HouseStyle2.5Fin       -5761.278  15471.165  -0.372 0.709674    
## HouseStyle2.5Unf       -3783.669  13311.841  -0.284 0.776284    
## HouseStyle2Story        2204.908   4286.453   0.514 0.607080    
## HouseStyleSFoyer       35052.163   8250.932   4.248 2.33e-05 ***
## HouseStyleSLvl         24116.517   6529.907   3.693 0.000232 ***
## MSSubClass              -228.877     33.851  -6.761 2.19e-11 ***
## GrLivArea:CentralAirY      9.400      3.260   2.883 0.004009 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34680 on 1129 degrees of freedom
## Multiple R-squared:  0.8155, Adjusted R-squared:  0.8093 
## F-statistic: 131.3 on 38 and 1129 DF,  p-value: < 2.2e-16
validate$predictSalePrice <- predict(best_lm, validate)

se <- (validate$predictSalePrice - validate$SalePrice)**2

(RMSE <- sqrt(mean(se)))
## [1] 30172.79

The \(R^2\) value decreased slightly and the RMSE increased slightly, but not by too much for the model to be much affected.

Test Model - Predictions

test <- read.csv(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_605/FinalProj/test.csv"))

test$SalePrice <- predict(best_lm, test)

Results:

predictions <- data.frame("Id" = test$Id, 
                          "SalePrice" = test$SalePrice)

write.csv(predictions, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_605/FinalProj/predictions.csv", row.names = F)