Probability Density 1: X~Gamma. Using R, generate a random variable \(X\) that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by \(n\) (a size parameter) and \(\lambda\) (a shape parameter). Choose any \(n\) greater than 3 and an expected value (\(\lambda\)) between 2 and 10.
set.seed(123)
lambda <- 2
n <- 8
X <- rgamma(10000, n, lambda)
hist(X, main = "Randomely Generated Gamma Distribution")
Probability Density 2: Y~Sum of Exponentials. Generate 10,000 observations from the sum of \(n\) exponential pdfs with a rate/shape parameter (\(\lambda\)). The \(n\) and \(\lambda\) must be the same as in the previous case.
set.seed(123)
Y <- rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda)
hist(Y)
Probability Density 3: Z~Exponential. Generate 10,000 observations from a single exponential pdf with rate/shape parameter (\(\lambda\)).
set.seed(123)
Z <- rexp(10000, lambda)
hist(Z, main = "Randomely Generated Exponential Distribution")
1a. Calculate the empirical expected value (means) and variances of all three pdfs.
mean(X)
## [1] 3.974504
var(X)
## [1] 1.968511
mean(Y)
## [1] 3.995146
var(Y)
## [1] 1.986324
mean(Z)
## [1] 0.5018906
var(Z)
## [1] 0.2499411
1b. Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponential, calculate the expected value of the single exponential (Z) and the sum of exponential (Y).
Gamma Distribution:
\(E(x) = {n \over \lambda}\) and \(\sigma^2 = {n \over \lambda^2}\).
n/lambda
## [1] 4
n/lambda^2
## [1] 2
Single Exponential:
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases} \]
Moment Generating Function:
\(M_{x}(t) = E(e^{xt}) = \int_{-\infty}^{+\infty} e^{xt} f(x) dx = \int_{-\infty}^{0} e^{xt} 0 dx + \int_{0}^{+\infty} e^{xt} \lambda e^{-\lambda x} dx\)
\(\hspace{28pt} = \lambda \int_{0}^{\infty} e^{(t - \lambda)x} dx\)
\(\hspace{28pt} = {\lambda \over t - \lambda} [e^{(t - \lambda)x}]_{0}^{\infty}\)
\(\hspace{28pt} = {\lambda \over t - \lambda} [0-1]\)
\(\hspace{28pt} = \boxed {{\lambda \over \lambda - t}}\)
Expected Value and Variance:
\(M_{x}^{'}(t) = \lambda (\lambda - t)^{-2}\)
\(M_{x}^{'}(0) = \lambda (\lambda - 0)^{-2}\)
\(\hspace{28pt} = \lambda (\lambda)^{-2}\)
\(\hspace{28pt} = \boxed {\lambda^{-1} = E(X)}\)
\(M_{x}^{''}(t) = 2 \lambda (\lambda - t)^{-3}\)
\(M_{x}^{''}(0) = 2 \lambda (\lambda - 0)^{-3}\)
\(\hspace{28pt} = 2 \lambda (\lambda)^{-3}\)
\(\hspace{28pt} = 2\lambda^{-2} = E(X^2)\)
\(\sigma^{2} = 2\lambda^{-2} - (\lambda^{-1})^2\)
\(\hspace{13pt} = {2 \over \lambda^2} - {1 \over \lambda^2}\)
\(\hspace{13pt} = {1 \over \lambda^2}\)
\(\hspace{13pt} = \boxed {\lambda^{-2} = \sigma^{2}}\)
# single exponential
lambda**(-1)
## [1] 0.5
lambda**(-2)
## [1] 0.25
For the moment generating function of the sum of \(n\) exponentials, multiply the moment generating function of the single exponential \(n\) times.
Moment Generating Function:
\(M_{x}(t) = ({\lambda \over \lambda - t})^{8}\)
Expected Value and Variance:
\(M_{x}'(t) = 8({\lambda \over \lambda - t})^{7} \cdot \lambda (\lambda - t)^{-2}\)
\(M_{x}'(0) = 8({\lambda \over \lambda - 0})^{7} \cdot \lambda (\lambda - 0)^{-2}\)
\(\hspace{28pt} = \boxed{8 \lambda^{-1} = E(x)}\)
\(M_{x}''(t) = [8({\lambda \over \lambda - t})^{7}][{2 \over \lambda^2}] + [\lambda (\lambda - t)^{-2}][56({\lambda \over \lambda - t})^{6} \cdot \lambda (\lambda - t)^{-2}]\)
\(M_{x}''(0) = [8({\lambda \over \lambda - 0})^{7}][{2 \over \lambda^2}] + [\lambda (\lambda - 0)^{-2}][56({\lambda \over \lambda - 0})^{8} \cdot \lambda (\lambda - 0)^{-2}]\)
\(\hspace{28pt} = {16 \over \lambda^2} + {56 \over \lambda^2}\)
\(\hspace{28pt} = 72 \lambda^{-2} = E(X^2)\)
\(\sigma^2 = 72 \lambda^{-2} - (8 \lambda^{-1})^2\)
\(\hspace{13pt} = \boxed{8 \lambda^{-2} = \sigma^2}\)
n*(lambda**-1)
## [1] 4
n*(lambda**-2)
## [1] 2
1c-e. Probability. For pdf Z (exponential), calculate empirically probabilities a-c. Then evaluate through calculus whether the memoryless property holds.
\[ P(A|B) = {P(A \hspace{2pt} \cap \hspace{2pt} B) \over P(B)} = {P(B|A)P(A) \over P(B)} \]
a. \(P(Z > \lambda | Z > {\lambda \over 2})\)
\(P(Z > \lambda | Z > {\lambda \over 2}) = {P(Z > {\lambda \over 2} | Z > \lambda)P(Z > \lambda) \over P(Z > {\lambda \over 2})}\)
\(P(Z > {\lambda \over 2} \hspace{2pt} | \hspace{2pt} Z > \lambda) = 1\)
\(P(Z > \lambda | Z > {\lambda \over 2}) = {P(Z > \lambda) \over P(Z > {\lambda \over 2})}\)
Rather than calculate for \(P(Z > \lambda)\) and \(P(Z > {\lambda \over 2})\), calculate for \(1 - P(Z \le \lambda)\) and \(1 - P(Z \le {\lambda \over 2})\)
I also calculated each of these based on the definition of the exponential distribution.
p_a <- length(which(Z>lambda)) / length(Z)
p_b <- length(which(Z>lambda/2)) / length(Z)
p_a / p_b
## [1] 0.1300813
(1-pexp(lambda, lambda)) / (1-pexp(lambda/2, lambda))
## [1] 0.1353353
b. \(P(Z > 2 \lambda | Z > \lambda)\)
\(P(Z > 2 \lambda | Z > \lambda) = {P(Z > \lambda | Z > 2 \lambda)P(Z > 2 \lambda) \over P(Z > \lambda)}\)
\(P(Z > \lambda \hspace{2pt} | \hspace{2pt} Z > 2 \lambda) = 1\)
\(P(Z > 2 \lambda | Z > \lambda) = {P(Z > 2 \lambda) \over P(Z > \lambda)} = {1 - P(Z \le 2 \lambda) \over 1 - P(Z \le \lambda)}\)
p_a <- length(which(Z>2*lambda)) / length(Z)
p_b <- length(which(Z>lambda)) / length(Z)
p_a / p_b
## [1] 0.01704545
(1-pexp(2*lambda, lambda)) / (1-pexp(lambda, lambda))
## [1] 0.01831564
c. \(P(Z > 3 \lambda | Z > \lambda)\)
\(P(Z > 3 \lambda | Z > \lambda) = {P(Z > 3 \lambda) \over P(Z > \lambda)} = {1 - P(Z \le 3 \lambda) \over 1 - P(Z \le \lambda)}\)
p_a <- length(which(Z>3*lambda)) / length(Z)
p_b <- length(which(Z>lambda)) / length(Z)
p_a / p_b
## [1] 0
(1-pexp(3*lambda, lambda)) / (1-pexp(lambda, lambda))
## [1] 0.0003354626
Memoryless Property:
The memoryless property says that \(P(X > r + t \hspace{2pt} | \hspace{2pt} X > r) = P(X > t)\).
Let’s use a randomly defined \(t = 3\). \(r\) is the inverse of the rate, so for this problem \(r = {1 \over \lambda}\).
\(P(X > n + t \hspace{2pt} | \hspace{2pt} X > n)\)
t <- 3
(1-pexp((1/lambda)+t, lambda)) / (1-pexp((1/lambda), lambda))
## [1] 0.002478752
\(P(X > t)\)
(1-pexp(t, lambda))
## [1] 0.002478752
The probabilities are the same, so the memoryless property holds.
Loosely investigate whether \(P(YZ) = P(Y)P(Z)\) by building a table with quartiles and evaluating the marginal and joint probabilities.
# find quantiles
quantile_Y <- matrix(c(quantile(Y, probs = c(0, .25, .5, .75, 1))))
quantile_Z <- matrix(c(quantile(Z, probs = c(0, .25, .5, .75, 1))))
cont_table <- table(cut(Y, quantile_Y), cut(Z, quantile_Z))
joint_prob <- prop.table(cont_table)
row_sums <- rowSums(joint_prob)
col_sums <- c(colSums(joint_prob), 0)
joint_prob_marg <- cbind(joint_prob, row_sums)
joint_prob_marg <- rbind(joint_prob_marg, col_sums)
joint_prob_marg[5,5] <- sum(joint_prob)
colnames(joint_prob_marg) <- c("Quant 1 Y", "Quant 2 Y", "Quant 3 Y", "Quant 4 Y", "Sum")
rownames(joint_prob_marg) <- c("Quant 1 Z", "Quant 2 Z", "Quant 3 Z", "Quant 4 Z", "Sum")
joint_prob_marg
## Quant 1 Y Quant 2 Y Quant 3 Y Quant 4 Y Sum
## Quant 1 Z 0.09231846 0.07751550 0.0550110 0.02500500 0.24985
## Quant 2 Z 0.06551310 0.06691338 0.0690138 0.04860972 0.25005
## Quant 3 Z 0.05031006 0.05771154 0.0675135 0.07451490 0.25005
## Quant 4 Z 0.04180836 0.04780956 0.0585117 0.10192038 0.25005
## Sum 0.24994999 0.24994999 0.2500500 0.25005001 1.00000
matrix(joint_prob_marg[1:4,5]) %*% t(matrix(joint_prob_marg[5,1:4]))
## [,1] [,2] [,3] [,4]
## [1,] 0.06245 0.06245 0.06247499 0.06247499
## [2,] 0.06250 0.06250 0.06252501 0.06252501
## [3,] 0.06250 0.06250 0.06252501 0.06252501
## [4,] 0.06250 0.06250 0.06252501 0.06252501
We can see that \(P(YZ) \ne P(Y)P(Z)\) which suggest dependence for Y and Z.
Check to see if independence holds by using Fisher’s Exact Test and Chi Square Test. What is the difference between the two? Which is more appropriate?
#fisher
fisher.test(cont_table, simulate.p.value = T)
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: cont_table
## p-value = 0.0004998
## alternative hypothesis: two.sided
# chi square
chisq.test(cont_table)
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = 863.19, df = 9, p-value < 2.2e-16
Based on the p-value of both of these tests, we can see that there is dependence.
Fisher test performs better for smaller samples while Chi Square is better used for larger samples and a larger contingency table. Chi Square would be more appropriate here as it is a larger contingency table and a large sample size.
Data taken from https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
train_data <- read.csv(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_605/FinalProj/train.csv"))
Let’s first split the training data into training and validating data sets.
# 80% train data
set.seed(123)
num_fit <- .8 * nrow(train_data)
sample <- sample(1:nrow(train_data), num_fit, replace = F)
train <- train_data[sample,]
validate <- train_data[-sample,]
rownames(train) <- train$Id
train <- train |>
dplyr::select(-Id)
rmarkdown::paged_table(train, options = list(rows.print = 5))
Provide univariate descriptive statics and appropriate plots for the training data set.
glimpse(train)
## Rows: 1,168
## Columns: 80
## $ MSSubClass <int> 60, 20, 20, 20, 20, 60, 60, 60, 20, 20, 60, 70, 20, 50, …
## $ MSZoning <chr> "RL", "RL", "RL", "FV", "RL", "RL", "RL", "RL", "RL", "R…
## $ LotFrontage <int> 59, 60, 63, 62, 60, 75, NA, 107, 62, 89, NA, 51, 49, 50,…
## $ LotArea <int> 11228, 8281, 17423, 7500, 7180, 9675, 10304, 10186, 9858…
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ Alley <chr> NA, NA, NA, "Pave", NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ LotShape <chr> "IR2", "IR1", "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", …
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", …
## $ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu…
## $ LotConfig <chr> "CulDSac", "Inside", "CulDSac", "Inside", "Inside", "Ins…
## $ LandSlope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", …
## $ Neighborhood <chr> "SawyerW", "Sawyer", "StoneBr", "Somerst", "CollgCr", "S…
## $ Condition1 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "PosN", …
## $ Condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
## $ BldgType <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", …
## $ HouseStyle <chr> "2Story", "1Story", "1Story", "1Story", "1Story", "2Stor…
## $ OverallQual <int> 7, 5, 9, 7, 5, 7, 5, 7, 5, 9, 8, 5, 8, 6, 7, 5, 5, 5, 6,…
## $ OverallCond <int> 5, 5, 5, 5, 7, 5, 7, 5, 6, 5, 5, 8, 5, 6, 5, 5, 7, 8, 5,…
## $ YearBuilt <int> 1993, 1965, 2008, 2005, 1972, 2005, 1976, 1992, 1968, 20…
## $ YearRemodAdd <int> 1993, 1965, 2009, 2005, 1972, 2005, 1976, 1992, 1968, 20…
## $ RoofStyle <chr> "Gable", "Gable", "Hip", "Gable", "Hip", "Gable", "Gable…
## $ RoofMatl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "…
## $ Exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "VinylSd", "HdBoard", "…
## $ Exterior2nd <chr> "VinylSd", "MetalSd", "VinylSd", "VinylSd", "HdBoard", "…
## $ MasVnrType <chr> "None", "None", "Stone", "None", "None", "None", "BrkFac…
## $ MasVnrArea <int> 0, 0, 748, 0, 0, 0, 44, 0, 0, 0, 396, 0, 0, 0, 0, 0, 0, …
## $ ExterQual <chr> "Gd", "TA", "Ex", "Gd", "TA", "Gd", "TA", "Gd", "TA", "E…
## $ ExterCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "Gd", "TA", "TA", "T…
## $ Foundation <chr> "PConc", "CBlock", "PConc", "PConc", "CBlock", "PConc", …
## $ BsmtQual <chr> "Gd", "TA", "Ex", "Gd", "TA", "Gd", "TA", "Gd", "TA", "E…
## $ BsmtCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ BsmtExposure <chr> "No", "No", "No", "No", "Av", "Mn", "No", "No", "No", "G…
## $ BsmtFinType1 <chr> "BLQ", "Rec", "GLQ", "Unf", "ALQ", "GLQ", "ALQ", "GLQ", …
## $ BsmtFinSF1 <int> 50, 553, 1904, 0, 390, 341, 381, 674, 510, 0, 0, 0, 1721…
## $ BsmtFinType2 <chr> "GLQ", "BLQ", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", …
## $ BsmtFinSF2 <int> 531, 311, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BsmtUnfSF <int> 499, 0, 312, 1257, 474, 772, 399, 76, 354, 2002, 1055, 9…
## $ TotalBsmtSF <int> 1080, 864, 2216, 1257, 864, 1113, 780, 750, 864, 2002, 1…
## $ Heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
## $ HeatingQC <chr> "Ex", "Gd", "Ex", "Ex", "TA", "Ex", "Ex", "Ex", "TA", "E…
## $ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ Electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S…
## $ X1stFlrSF <int> 1080, 864, 2234, 1266, 864, 1113, 1088, 1061, 874, 2018,…
## $ X2ndFlrSF <int> 1017, 0, 0, 0, 0, 858, 780, 862, 0, 0, 1208, 574, 0, 595…
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GrLivArea <int> 2097, 864, 2234, 1266, 864, 1971, 1868, 1923, 874, 2018,…
## $ BsmtFullBath <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ BsmtHalfBath <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ FullBath <int> 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2,…
## $ HalfBath <int> 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,…
## $ BedroomAbvGr <int> 3, 3, 1, 3, 3, 3, 4, 3, 3, 3, 3, 4, 1, 3, 3, 4, 2, 2, 3,…
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ KitchenQual <chr> "Gd", "TA", "Ex", "Gd", "TA", "Gd", "Gd", "Gd", "TA", "E…
## $ TotRmsAbvGrd <int> 9, 5, 9, 6, 5, 8, 9, 8, 5, 10, 7, 8, 8, 6, 8, 7, 4, 5, 6…
## $ Functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", …
## $ Fireplaces <int> 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 2,…
## $ FireplaceQu <chr> "TA", "Po", "Gd", "TA", NA, "Gd", "TA", "TA", NA, "Gd", …
## $ GarageType <chr> "Attchd", "Detchd", "Attchd", "Attchd", "Detchd", "Attch…
## $ GarageYrBlt <int> 1993, 1965, 2009, 2005, 1989, 2005, 1976, 1992, 1968, 20…
## $ GarageFinish <chr> "Unf", "Unf", "Fin", "Unf", "Unf", "RFn", "Unf", "RFn", …
## $ GarageCars <int> 3, 1, 3, 2, 1, 2, 2, 2, 1, 3, 2, 1, 3, 1, 2, 1, 1, 1, 2,…
## $ GarageArea <int> 678, 360, 1166, 453, 352, 689, 484, 564, 288, 746, 905, …
## $ GarageQual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ GarageCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
## $ PavedDrive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ WoodDeckSF <int> 196, 0, 0, 38, 0, 0, 448, 240, 33, 144, 0, 24, 192, 0, 1…
## $ OpenPorchSF <int> 187, 0, 60, 144, 0, 48, 96, 39, 0, 76, 45, 0, 267, 162, …
## $ EnclosedPorch <int> 0, 236, 0, 0, 0, 0, 0, 0, 0, 0, 0, 150, 0, 0, 0, 108, 0,…
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 189, 0, 0, 126, 0, 0, 0, 0…
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PoolQC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Fence <chr> NA, "GdWo", NA, NA, NA, NA, NA, NA, "GdWo", NA, NA, NA, …
## $ MiscFeature <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Shed", NA, NA, NA, NA, …
## $ MiscVal <int> 0, 0, 0, 0, 0, 0, 0, 0, 600, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ MoSold <int> 12, 12, 7, 4, 5, 2, 10, 6, 11, 5, 9, 5, 1, 12, 7, 8, 8, …
## $ YrSold <int> 2008, 2009, 2009, 2006, 2008, 2009, 2009, 2010, 2009, 20…
## $ SaleType <chr> "WD", "WD", "New", "WD", "WD", "WD", "WD", "WD", "WD", "…
## $ SaleCondition <chr> "Normal", "Normal", "Partial", "Normal", "Normal", "Norm…
## $ SalePrice <int> 228000, 62383, 501837, 176000, 127000, 253000, 197500, 1…
Most of the variable are categorical.
Let’s take a look at the distributions for the numeric columns first.
numeric_train <- train[,sapply(train, is.numeric)]
# descriptive statistics
summary(numeric_train)
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.00 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.00 1st Qu.: 59.25 1st Qu.: 7552 1st Qu.: 5.000
## Median : 50.00 Median : 70.00 Median : 9504 Median : 6.000
## Mean : 56.47 Mean : 70.37 Mean : 10385 Mean : 6.104
## 3rd Qu.: 70.00 3rd Qu.: 80.00 3rd Qu.: 11604 3rd Qu.: 7.000
## Max. :190.00 Max. :313.00 Max. :164660 Max. :10.000
## NA's :214
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1993 Median : 0.0
## Mean :5.555 Mean :1972 Mean :1985 Mean : 103.3
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 168.0
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## NA's :7
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 230.0 1st Qu.: 799
## Median : 385.5 Median : 0.00 Median : 477.5 Median :1004
## Mean : 448.3 Mean : 48.22 Mean : 569.1 Mean :1066
## 3rd Qu.: 719.0 3rd Qu.: 0.00 3rd Qu.: 810.2 3rd Qu.:1312
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334.0 Min. : 0.0 Min. : 0.000 Min. : 334
## 1st Qu.: 887.5 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:1144
## Median :1094.5 Median : 0.0 Median : 0.000 Median :1456
## Mean :1168.6 Mean : 344.8 Mean : 6.235 Mean :1520
## 3rd Qu.:1402.8 3rd Qu.: 728.2 3rd Qu.: 0.000 3rd Qu.:1779
## Max. :4692.0 Max. :2065.0 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4238 Mean :0.05308 Mean :1.564 Mean :0.3818
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :2.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.00 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.00 1st Qu.:0.0000
## Median :3.000 Median :1.000 Median : 6.00 Median :1.0000
## Mean :2.882 Mean :1.051 Mean : 6.54 Mean :0.6199
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.00 3rd Qu.:1.0000
## Max. :8.000 Max. :3.000 Max. :14.00 Max. :3.0000
##
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1962 1st Qu.:1.000 1st Qu.: 337.5 1st Qu.: 0.00
## Median :1980 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.772 Mean : 475.0 Mean : 92.97
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## NA's :65
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 25.00 Median : 0 Median : 0.000 Median : 0.00
## Mean : 47.55 Mean : 22 Mean : 3.627 Mean : 15.12
## 3rd Qu.: 68.00 3rd Qu.: 0 3rd Qu.: 0.000 3rd Qu.: 0.00
## Max. :547.00 Max. :552 Max. :508.000 Max. :480.00
##
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 4.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 2.955 Mean : 43.57 Mean : 6.264 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :738.000 Max. :15500.00 Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 39300
## 1st Qu.:129900
## Median :162700
## Mean :180833
## 3rd Qu.:213500
## Max. :755000
##
Distribution of sale prices:
train |>
ggplot(aes(x = SalePrice)) +
geom_histogram(fill = "steelblue", bins = 30) +
scale_x_continuous(label = scales::comma) +
labs(title = "Distribution of House Prices", x = "Sale Price", y = "Count")
Sale prices are skewed to the right and have a median of $163K.
par(mfrow=c(3,3))
par(mai=c(.3,.3,.3,.3))
variables <- names(numeric_train)
for (i in 1:(length(variables)-1)) {
hist(numeric_train[[variables[i]]], main = variables[i], col = "lightblue")
}
There are a number of skewed variables. We can try log transformations to some variables to try and normalize.
Now let’s look at the categorical variables.
char_train <- train[,sapply(train, is.character)]
char_train$SalePrice <- train$SalePrice
variables <- names(char_train)
for (i in 1:(length(variables)-1)) {
char_train[[variables[i]]] <- as.factor(char_train[[variables[i]]])
}
char_train |>
dplyr::select(-SalePrice) |>
summary()
## MSZoning Street Alley LotShape LandContour Utilities
## C (all): 7 Grvl: 4 Grvl: 38 IR1:388 Bnk: 52 AllPub:1167
## FV : 51 Pave:1164 Pave: 31 IR2: 30 HLS: 43 NoSeWa: 1
## RH : 14 NA's:1099 IR3: 6 Low: 31
## RL :929 Reg:744 Lvl:1042
## RM :167
##
##
## LotConfig LandSlope Neighborhood Condition1 Condition2
## Corner :215 Gtl:1104 NAmes :178 Norm :1003 Artery: 2
## CulDSac: 69 Mod: 55 CollgCr:116 Feedr : 68 Feedr : 4
## FR2 : 41 Sev: 9 OldTown: 90 Artery : 39 Norm :1157
## FR3 : 2 Edwards: 85 RRAn : 20 PosA : 1
## Inside :841 Somerst: 69 PosN : 16 PosN : 2
## Sawyer : 64 RRAe : 11 RRAn : 1
## (Other):566 (Other): 11 RRNn : 1
## BldgType HouseStyle RoofStyle RoofMatl Exterior1st
## 1Fam :973 1Story :594 Flat : 11 ClyTile: 1 VinylSd:408
## 2fmCon: 24 2Story :360 Gable :905 CompShg:1147 HdBoard:181
## Duplex: 44 1.5Fin :117 Gambrel: 9 Metal : 1 MetalSd:173
## Twnhs : 28 SLvl : 47 Hip :237 Roll : 1 Wd Sdng:161
## TwnhsE: 99 SFoyer : 26 Mansard: 5 Tar&Grv: 9 Plywood: 89
## 1.5Unf : 10 Shed : 1 WdShake: 3 CemntBd: 53
## (Other): 14 WdShngl: 6 (Other):103
## Exterior2nd MasVnrType ExterQual ExterCond Foundation BsmtQual
## VinylSd:399 BrkCmn : 15 Ex: 43 Ex: 2 BrkTil:110 Ex : 98
## HdBoard:170 BrkFace:352 Fa: 11 Fa: 23 CBlock:520 Fa : 29
## MetalSd:168 None :689 Gd:393 Gd: 116 PConc :510 Gd :483
## Wd Sdng:158 Stone :105 TA:721 Po: 1 Slab : 19 TA :530
## Plywood:116 NA's : 7 TA:1026 Stone : 6 NA's: 28
## CmentBd: 52 Wood : 3
## (Other):105
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Heating HeatingQC
## Fa : 34 Av :176 ALQ :177 ALQ : 15 Floor: 1 Ex:593
## Gd : 46 Gd :105 BLQ :117 BLQ : 32 GasA :1143 Fa: 45
## Po : 1 Mn : 87 GLQ :331 GLQ : 11 GasW : 13 Gd:186
## TA :1059 No :771 LwQ : 61 LwQ : 41 Grav : 6 Po: 1
## NA's: 28 NA's: 29 Rec :115 Rec : 43 OthW : 2 TA:343
## Unf :339 Unf :998 Wall : 3
## NA's: 28 NA's: 28
## CentralAir Electrical KitchenQual Functional FireplaceQu GarageType
## N: 73 FuseA: 71 Ex: 83 Maj1: 13 Ex : 23 2Types : 6
## Y:1095 FuseF: 20 Fa: 29 Maj2: 4 Fa : 25 Attchd :702
## FuseP: 3 Gd:460 Min1: 27 Gd :294 Basment: 16
## SBrkr:1073 TA:596 Min2: 25 Po : 17 BuiltIn: 67
## NA's : 1 Mod : 11 TA :256 CarPort: 7
## Sev : 1 NA's:553 Detchd :305
## Typ :1087 NA's : 65
## GarageFinish GarageQual GarageCond PavedDrive PoolQC Fence
## Fin :285 Ex : 3 Ex : 2 N: 65 Ex : 2 GdPrv: 44
## RFn :343 Fa : 34 Fa : 29 P: 24 Fa : 2 GdWo : 47
## Unf :475 Gd : 10 Gd : 7 Y:1079 Gd : 2 MnPrv:125
## NA's: 65 Po : 2 Po : 4 NA's:1162 MnWw : 10
## TA :1054 TA :1061 NA's :942
## NA's: 65 NA's: 65
##
## MiscFeature SaleType SaleCondition
## Gar2: 1 WD :1008 Abnorml: 83
## Othr: 2 New : 102 AdjLand: 4
## Shed: 43 COD : 36 Alloca : 10
## TenC: 1 ConLD : 7 Family : 18
## NA's:1121 ConLI : 4 Normal :948
## ConLw : 4 Partial:105
## (Other): 7
There are a number of variables with NA values. Specifically,
Alley, PoolQC, MiscFeature,
Fence, and FireplaceQu have a lot of NA
values.
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
for (i in 1:(length(variables)-1)) {
boxplot(char_train$SalePrice~char_train[[variables[i]]], main = variables[i])
}
MSZoning may have an affect on SalePrice.
There is also a lot of variability in price based on
Neighborhood, ExterQUal,
BsmntQual, KitchenQual, and
CentralAir.
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
Let’s make a scatterplot matrix for some of the numeric varaibles.
subset_train <- train |>
dplyr::select(SalePrice, OverallQual, LotArea, X1stFlrSF, GrLivArea, FullBath, TotRmsAbvGrd, GarageArea)
pairs(subset_train, gap = 0.5)
Derive a correlation matrix for any three quantitative variables in the data set.
cor_subset <- subset_train |>
dplyr::select(SalePrice, GrLivArea, FullBath)
par(mfrow = c(1,3))
plot(SalePrice~GrLivArea, cor_subset) +
abline(lm(SalePrice~GrLivArea, cor_subset), col = "red")
## integer(0)
plot(SalePrice~FullBath, cor_subset) +
abline(lm(SalePrice~FullBath, cor_subset), col = "red")
## integer(0)
plot(GrLivArea~FullBath, cor_subset) +
abline(lm(GrLivArea~FullBath, cor_subset), col = "red")
## integer(0)
(cor_matrix <- cor(cor_subset))
## SalePrice GrLivArea FullBath
## SalePrice 1.0000000 0.6961226 0.5462469
## GrLivArea 0.6961226 1.0000000 0.6231547
## FullBath 0.5462469 0.6231547 1.0000000
Test the hypothesis that the correlation between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
cor.test(cor_subset$SalePrice, cor_subset$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: cor_subset$SalePrice and cor_subset$GrLivArea
## t = 33.11, df = 1166, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6762607 0.7149732
## sample estimates:
## cor
## 0.6961226
The correlation is 0.696 with a p-value of less than 2.2e-16, so this is a significant correlation.
CI[0.676, 0.715]: We are 80% confident that the true correlation lies within the range of 0.676 and 0.715.
cor.test(cor_subset$SalePrice, cor_subset$FullBath, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: cor_subset$SalePrice and cor_subset$FullBath
## t = 22.268, df = 1166, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5193648 0.5720490
## sample estimates:
## cor
## 0.5462469
The correlation is 0.546 with a p-value of less than 2.2e-16, so this is a significant correlation.
CI[0.519, 0.572]: We are 80% confident that the true correlation lies within the range of 0.519 and 0.572.
cor.test(cor_subset$GrLivArea, cor_subset$FullBath, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: cor_subset$GrLivArea and cor_subset$FullBath
## t = 27.207, df = 1166, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5996492 0.6455859
## sample estimates:
## cor
## 0.6231547
The correlation is 0.623 with a p-value of less than 2.2e-16, so this is a significant correlation.
CI[0.600, .646]: We are 80% confident that the true correlation lies within the range of 0.600 and 0.646.
All three pairwise sets have significant correlations. There is a strong correlation between sale price and above grade living area (71% correlated). There is a moderate correlation between sale price and total number of full bathrooms (56% correlated) and between above grade living area and total number of full bathrooms (63% correlated).
Family Wise Error Rate (FWE):
The formula for calculating family wise error can be found here.
\(FWE \le 1 - (1 - \alpha)^c\) where \(\alpha\) is your chosen confidence level and \(c\) is the number of correlation tests.
We use an 80% confidence interval, so \(\alpha = 0.2\), and we conducted three tests.
a <- 0.2
c <- 3
1-(1-a)**c
## [1] 0.488
The family wise error here is 0.488, so there is a 48.8% chance that we made a Type I error in at least one of the correlation tests above. This is a very high likelihood and so I would be very concerned about familywise error here.
Invert your correlation matrix from above. (This is known as a precision matrix and contains variance inflation factors on the diagonal.)
# precision matrix - inverse of correlation matrix
(prec_matrix <- solve(cor_matrix))
## SalePrice GrLivArea FullBath
## SalePrice 2.0212682 -1.1754845 -0.3716028
## GrLivArea -1.1754845 2.3184589 -0.8026538
## FullBath -0.3716028 -0.8026538 1.7031644
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
prec_matrix %*% cor_matrix
## SalePrice GrLivArea FullBath
## SalePrice 1.000000e+00 1.387779e-16 1.110223e-16
## GrLivArea 5.551115e-17 1.000000e+00 2.220446e-16
## FullBath -1.110223e-16 2.220446e-16 1.000000e+00
cor_matrix %*% prec_matrix
## SalePrice GrLivArea FullBath
## SalePrice 1.000000e+00 5.551115e-17 0.000000e+00
## GrLivArea 1.110223e-16 1.000000e+00 2.220446e-16
## FullBath 5.551115e-17 2.220446e-16 1.000000e+00
Conduct LU decomposition on the matrix.
lu.decomposition(prec_matrix)
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] -0.5815579 1.0000000 0
## [3,] -0.1838463 -0.6231547 1
##
## $U
## [,1] [,2] [,3]
## [1,] 2.021268 -1.175484 -0.3716028
## [2,] 0.000000 1.634847 -1.0187624
## [3,] 0.000000 0.000000 1.0000000
Many times, it makes sense to fit a closed form distribution to data.
Select a variable in the Kaggle.com training dataset that is
skewed to the right, shift it so that the minimum value is absolutely
above zero if necessary. Then load the MASS package and run
fitdistr() to fit an exponential probability density
function.
Chosen variable: SalePrice
# histogram original
hist(train$SalePrice)
# fit exponential pdf
(fit <- fitdistr(train$SalePrice, "exponential"))
## rate
## 5.529968e-06
## (1.618084e-07)
Find the optimal value of \(\lambda\) for this distribution, and then take 1,000 samples from this exponential distribution using this value (e.e, rexp(1000, \(\lambda\))).
lambda <- fit$estimate
# exponential dist
exp_dist <- rexp(1000, lambda)
Plot a histogram and compare it with the histogram of your original variable.
par(mfrow = c(1,2))
hist(train$SalePrice, main = "Histogram of SalePrice")
hist(exp_dist, main = "Histogram of Exponential")
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
# percentiles using CDF
quantile(exp_dist, c(.05, .95))
## 5% 95%
## 10979.05 577624.65
# conf interval of empirical data
t.test(train$SalePrice)$conf.int
## [1] 176274.6 185391.2
## attr(,"conf.level")
## [1] 0.95
# empirical percentiles
quantile(train$SalePrice, c(.05, .95))
## 5% 95%
## 88350.0 325510.6
There is a signicant difference between the quantiles for the
generated exponential distribution and the quantiles for the empirical
data. This suggests that the SalePrices does not closely
follow the exponential distribution.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with your analysis. Report your Kaggle.com username and score.
kdepairs(subset_train)
trans_train <- subset_train |>
mutate(log_LotArea = log(LotArea))
par(mfrow = c(1,2))
hist(subset_train$LotArea, main = "Original")
hist(trans_train$log_LotArea, main = "Log Transformed")
# remove extreme outliers from LotArea
iqr <- IQR(subset_train$LotArea)
summary(subset_train$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7552 9504 10385 11604 164660
trans_train <- trans_train |>
mutate(LotArea_filtered = ifelse(LotArea >= 11602 + (1.5*iqr) | LotArea <= 7554 - (1.5*iqr), NA, LotArea))
par(mfrow = c(1,2))
boxplot(subset_train$LotArea, main = "Original")
boxplot(trans_train$LotArea_filtered, main = "Filtered for Outliers")
summary(trans_train$LotArea_filtered)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1491 7406 9308 9288 11240 17671 54
par(mfrow = c(1,2))
plot(trans_train$SalePrice~trans_train$log_LotArea, main = "Log Transformed") +
abline(lm(trans_train$SalePrice~trans_train$log_LotArea), col = "red")
## integer(0)
plot(trans_train$SalePrice~trans_train$LotArea_filtered, main = "Filtered") +
abline(lm(trans_train$SalePrice~trans_train$LotArea_filtered), col = "red")
## integer(0)
Apply log transformation to normalize X1stFlrSF and
GrLivArea and check to see how correlation changes.
trans_train |>
dplyr::select(-LotArea_filtered) |>
mutate(log_X1stFlrSF = log(X1stFlrSF),
log_GrLivArea = log(GrLivArea)) |>
kdepairs()
The transformed variables are less correlated than the original
variables. This is probably because SalePrice itself is
skewed. The log transformed LotArea is more correlated than
the original.
test_lm <- lm(SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF + FullBath + TotRmsAbvGrd + GarageArea + MSZoning + Neighborhood + ExterQual + KitchenQual + CentralAir + CentralAir*GrLivArea, train)
summary(test_lm)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + log(LotArea) + GrLivArea +
## X1stFlrSF + FullBath + TotRmsAbvGrd + GarageArea + MSZoning +
## Neighborhood + ExterQual + KitchenQual + CentralAir + CentralAir *
## GrLivArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -380441 -14476 954 13485 265461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -93566.310 35719.521 -2.619 0.008925 **
## OverallQual 14021.509 1382.783 10.140 < 2e-16 ***
## log(LotArea) 17804.752 3213.588 5.540 3.76e-08 ***
## GrLivArea 19.918 9.170 2.172 0.030057 *
## X1stFlrSF 13.529 3.889 3.479 0.000522 ***
## FullBath 1621.942 2792.776 0.581 0.561516
## TotRmsAbvGrd 836.023 1172.229 0.713 0.475876
## GarageArea 27.934 6.488 4.306 1.81e-05 ***
## MSZoningFV 22955.955 18656.000 1.230 0.218772
## MSZoningRH 20312.852 18392.598 1.104 0.269654
## MSZoningRL 24608.622 15974.211 1.541 0.123715
## MSZoningRM 25392.318 15247.046 1.665 0.096113 .
## NeighborhoodBlueste -4303.792 35827.966 -0.120 0.904407
## NeighborhoodBrDale -20215.540 14704.791 -1.375 0.169480
## NeighborhoodBrkSide -20700.447 11956.998 -1.731 0.083683 .
## NeighborhoodClearCr -7028.895 13182.335 -0.533 0.593997
## NeighborhoodCollgCr -9071.965 10263.380 -0.884 0.376931
## NeighborhoodCrawfor 5384.623 11436.329 0.471 0.637850
## NeighborhoodEdwards -35785.593 10951.508 -3.268 0.001117 **
## NeighborhoodGilbert -12269.440 11015.405 -1.114 0.265584
## NeighborhoodIDOTRR -30727.873 14064.010 -2.185 0.029105 *
## NeighborhoodMeadowV -13761.151 14602.756 -0.942 0.346207
## NeighborhoodMitchel -19310.723 11623.973 -1.661 0.096935 .
## NeighborhoodNAmes -22697.774 10539.909 -2.154 0.031491 *
## NeighborhoodNoRidge 45745.668 11758.679 3.890 0.000106 ***
## NeighborhoodNPkVill -6913.853 15505.768 -0.446 0.655763
## NeighborhoodNridgHt 30002.388 10717.326 2.799 0.005207 **
## NeighborhoodNWAmes -21615.329 11045.522 -1.957 0.050603 .
## NeighborhoodOldTown -36614.716 12011.911 -3.048 0.002356 **
## NeighborhoodSawyer -21392.211 11215.315 -1.907 0.056722 .
## NeighborhoodSawyerW -16347.813 11026.624 -1.483 0.138467
## NeighborhoodSomerst 4583.337 12594.375 0.364 0.715987
## NeighborhoodStoneBr 42138.086 11967.047 3.521 0.000447 ***
## NeighborhoodSWISU -31713.802 13369.090 -2.372 0.017851 *
## NeighborhoodTimber 232.308 11664.563 0.020 0.984114
## NeighborhoodVeenker 17735.515 14381.748 1.233 0.217760
## ExterQualFa -27903.972 14811.121 -1.884 0.059825 .
## ExterQualGd -28441.041 6732.932 -4.224 2.59e-05 ***
## ExterQualTA -32819.015 7487.320 -4.383 1.28e-05 ***
## KitchenQualFa -43290.222 9510.320 -4.552 5.89e-06 ***
## KitchenQualGd -35902.600 5151.835 -6.969 5.43e-12 ***
## KitchenQualTA -44814.028 5705.035 -7.855 9.28e-15 ***
## CentralAirY -7304.536 12457.281 -0.586 0.557747
## GrLivArea:CentralAirY 15.336 8.329 1.841 0.065850 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33890 on 1124 degrees of freedom
## Multiple R-squared: 0.8245, Adjusted R-squared: 0.8178
## F-statistic: 122.8 on 43 and 1124 DF, p-value: < 2.2e-16
The model accounts for about 82% of the variability within the data.
Let’s see if we can improve the model using a backward stepwise approach.
better_lm <- step(test_lm, direction = "backward")
## Start: AIC=24409.78
## SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF +
## FullBath + TotRmsAbvGrd + GarageArea + MSZoning + Neighborhood +
## ExterQual + KitchenQual + CentralAir + CentralAir * GrLivArea
##
## Df Sum of Sq RSS AIC
## - MSZoning 4 3.4105e+09 1.2945e+12 24405
## - FullBath 1 3.8742e+08 1.2915e+12 24408
## - TotRmsAbvGrd 1 5.8424e+08 1.2917e+12 24408
## <none> 1.2911e+12 24410
## - GrLivArea:CentralAir 1 3.8941e+09 1.2950e+12 24411
## - X1stFlrSF 1 1.3905e+10 1.3050e+12 24420
## - ExterQual 3 2.3162e+10 1.3142e+12 24425
## - GarageArea 1 2.1295e+10 1.3124e+12 24427
## - log(LotArea) 1 3.5259e+10 1.3263e+12 24439
## - KitchenQual 3 7.1440e+10 1.3625e+12 24467
## - OverallQual 1 1.1810e+11 1.4092e+12 24510
## - Neighborhood 24 2.3525e+11 1.5263e+12 24557
##
## Step: AIC=24404.86
## SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF +
## FullBath + TotRmsAbvGrd + GarageArea + Neighborhood + ExterQual +
## KitchenQual + CentralAir + GrLivArea:CentralAir
##
## Df Sum of Sq RSS AIC
## - FullBath 1 4.0329e+08 1.2949e+12 24403
## - TotRmsAbvGrd 1 5.8545e+08 1.2951e+12 24403
## <none> 1.2945e+12 24405
## - GrLivArea:CentralAir 1 5.4185e+09 1.2999e+12 24408
## - X1stFlrSF 1 1.4578e+10 1.3091e+12 24416
## - ExterQual 3 2.2920e+10 1.3174e+12 24419
## - GarageArea 1 2.0088e+10 1.3146e+12 24421
## - log(LotArea) 1 3.7780e+10 1.3323e+12 24437
## - KitchenQual 3 7.2176e+10 1.3667e+12 24462
## - OverallQual 1 1.2162e+11 1.4161e+12 24508
## - Neighborhood 24 2.4948e+11 1.5440e+12 24563
##
## Step: AIC=24403.23
## SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF +
## TotRmsAbvGrd + GarageArea + Neighborhood + ExterQual + KitchenQual +
## CentralAir + GrLivArea:CentralAir
##
## Df Sum of Sq RSS AIC
## - TotRmsAbvGrd 1 7.5483e+08 1.2956e+12 24402
## <none> 1.2949e+12 24403
## - GrLivArea:CentralAir 1 5.5269e+09 1.3004e+12 24406
## - X1stFlrSF 1 1.4816e+10 1.3097e+12 24415
## - ExterQual 3 2.2997e+10 1.3179e+12 24418
## - GarageArea 1 2.0146e+10 1.3150e+12 24419
## - log(LotArea) 1 3.7377e+10 1.3323e+12 24435
## - KitchenQual 3 7.1789e+10 1.3667e+12 24460
## - OverallQual 1 1.2164e+11 1.4165e+12 24506
## - Neighborhood 24 2.5045e+11 1.5453e+12 24562
##
## Step: AIC=24401.91
## SalePrice ~ OverallQual + log(LotArea) + GrLivArea + X1stFlrSF +
## GarageArea + Neighborhood + ExterQual + KitchenQual + CentralAir +
## GrLivArea:CentralAir
##
## Df Sum of Sq RSS AIC
## <none> 1.2956e+12 24402
## - GrLivArea:CentralAir 1 5.1948e+09 1.3008e+12 24405
## - X1stFlrSF 1 1.4163e+10 1.3098e+12 24413
## - ExterQual 3 2.3395e+10 1.3190e+12 24417
## - GarageArea 1 1.9824e+10 1.3155e+12 24418
## - log(LotArea) 1 3.8443e+10 1.3341e+12 24434
## - KitchenQual 3 7.1622e+10 1.3673e+12 24459
## - OverallQual 1 1.2089e+11 1.4165e+12 24504
## - Neighborhood 24 2.4992e+11 1.5456e+12 24560
summary(better_lm)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + log(LotArea) + GrLivArea +
## X1stFlrSF + GarageArea + Neighborhood + ExterQual + KitchenQual +
## CentralAir + GrLivArea:CentralAir, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -387144 -14201 992 13555 262258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -61642.091 30483.155 -2.022 0.043394 *
## OverallQual 14047.778 1368.116 10.268 < 2e-16 ***
## log(LotArea) 17724.160 3060.970 5.790 9.09e-09 ***
## GrLivArea 21.287 8.024 2.653 0.008089 **
## X1stFlrSF 13.328 3.792 3.515 0.000458 ***
## GarageArea 26.777 6.440 4.158 3.45e-05 ***
## NeighborhoodBlueste -5424.106 35415.396 -0.153 0.878302
## NeighborhoodBrDale -20290.037 13751.054 -1.476 0.140350
## NeighborhoodBrkSide -21149.019 11439.908 -1.849 0.064761 .
## NeighborhoodClearCr -8003.552 13032.100 -0.614 0.539246
## NeighborhoodCollgCr -9309.326 10188.900 -0.914 0.361083
## NeighborhoodCrawfor 4206.713 11285.834 0.373 0.709410
## NeighborhoodEdwards -36507.616 10821.288 -3.374 0.000767 ***
## NeighborhoodGilbert -12237.630 10956.952 -1.117 0.264282
## NeighborhoodIDOTRR -35823.681 12272.328 -2.919 0.003580 **
## NeighborhoodMeadowV -14146.120 13510.743 -1.047 0.295311
## NeighborhoodMitchel -19466.049 11502.373 -1.692 0.090855 .
## NeighborhoodNAmes -23390.708 10437.545 -2.241 0.025219 *
## NeighborhoodNoRidge 44371.009 11630.723 3.815 0.000144 ***
## NeighborhoodNPkVill -6431.449 15438.866 -0.417 0.677068
## NeighborhoodNridgHt 30186.883 10681.791 2.826 0.004796 **
## NeighborhoodNWAmes -21536.259 10991.326 -1.959 0.050313 .
## NeighborhoodOldTown -36729.475 10682.853 -3.438 0.000607 ***
## NeighborhoodSawyer -21946.900 11102.977 -1.977 0.048322 *
## NeighborhoodSawyerW -17056.809 10930.070 -1.561 0.118912
## NeighborhoodSomerst 3220.497 10302.584 0.313 0.754649
## NeighborhoodStoneBr 41433.879 11918.487 3.476 0.000527 ***
## NeighborhoodSWISU -32945.671 13123.483 -2.510 0.012197 *
## NeighborhoodTimber 96.435 11598.986 0.008 0.993368
## NeighborhoodVeenker 16507.523 14261.676 1.157 0.247323
## ExterQualFa -34600.650 14332.786 -2.414 0.015933 *
## ExterQualGd -28658.259 6713.786 -4.269 2.13e-05 ***
## ExterQualTA -33150.043 7459.453 -4.444 9.70e-06 ***
## KitchenQualFa -42853.255 9484.451 -4.518 6.89e-06 ***
## KitchenQualGd -35576.532 5101.680 -6.973 5.25e-12 ***
## KitchenQualTA -44663.908 5672.493 -7.874 8.03e-15 ***
## CentralAirY -9831.940 12288.695 -0.800 0.423833
## GrLivArea:CentralAirY 17.356 8.154 2.129 0.033507 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33860 on 1130 degrees of freedom
## Multiple R-squared: 0.8239, Adjusted R-squared: 0.8181
## F-statistic: 142.9 on 37 and 1130 DF, p-value: < 2.2e-16
The model performs the same but removed GrLivArea,
FullBath, and MSZOning.
par(mfrow = c(1,2))
plot(better_lm, which = c(1,2))
The fitted vs residuals plot has a slight curve and there are some outlying residuals. There also seems issues in the tails of the residuals.
vif(better_lm, type = "predictor")
## GVIFs computed for predictors
## GVIF Df GVIF^(1/(2*Df)) Interacts With
## OverallQual 3.641133e+00 1 1.908175 --
## LotArea 1.865968e+07 0 Inf --
## GrLivArea 3.645335e+00 3 1.240575 CentralAir
## X1stFlrSF 2.161214e+00 1 1.470107 --
## GarageArea 1.944068e+00 1 1.394299 --
## Neighborhood 1.456642e+01 24 1.057393 --
## ExterQual 6.621240e+00 3 1.370324 --
## KitchenQual 5.333864e+00 3 1.321824 --
## CentralAir 3.645335e+00 3 1.240575 GrLivArea
## Other Predictors
## OverallQual LotArea, GrLivArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, KitchenQual, CentralAir
## LotArea OverallQual, LotArea, GrLivArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, KitchenQual, CentralAir
## GrLivArea OverallQual, LotArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, KitchenQual
## X1stFlrSF OverallQual, LotArea, GrLivArea, GarageArea, Neighborhood, ExterQual, KitchenQual, CentralAir
## GarageArea OverallQual, LotArea, GrLivArea, X1stFlrSF, Neighborhood, ExterQual, KitchenQual, CentralAir
## Neighborhood OverallQual, LotArea, GrLivArea, X1stFlrSF, GarageArea, ExterQual, KitchenQual, CentralAir
## ExterQual OverallQual, LotArea, GrLivArea, X1stFlrSF, GarageArea, Neighborhood, KitchenQual, CentralAir
## KitchenQual OverallQual, LotArea, GrLivArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, CentralAir
## CentralAir OverallQual, LotArea, X1stFlrSF, GarageArea, Neighborhood, ExterQual, KitchenQual
There is very high collinearity for LotArea. Let’s
remove this predictor.
update_lm <- update(better_lm, .~. - log(LotArea))
summary(update_lm)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + X1stFlrSF +
## GarageArea + Neighborhood + ExterQual + KitchenQual + CentralAir +
## GrLivArea:CentralAir, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -391541 -14865 364 13126 252622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76162.161 19320.268 3.942 8.57e-05 ***
## OverallQual 13451.468 1383.713 9.721 < 2e-16 ***
## GrLivArea 23.152 8.132 2.847 0.00449 **
## X1stFlrSF 17.899 3.762 4.758 2.21e-06 ***
## GarageArea 34.268 6.399 5.356 1.03e-07 ***
## NeighborhoodBlueste -20349.196 35825.808 -0.568 0.57015
## NeighborhoodBrDale -27987.049 13882.079 -2.016 0.04403 *
## NeighborhoodBrkSide -7497.397 11354.168 -0.660 0.50918
## NeighborhoodClearCr 19758.186 12291.144 1.608 0.10822
## NeighborhoodCollgCr 8506.740 9851.914 0.863 0.38807
## NeighborhoodCrawfor 21635.041 11032.379 1.961 0.05012 .
## NeighborhoodEdwards -19229.188 10550.254 -1.823 0.06862 .
## NeighborhoodGilbert 8168.136 10522.895 0.776 0.43778
## NeighborhoodIDOTRR -20357.406 12149.160 -1.676 0.09409 .
## NeighborhoodMeadowV -19351.773 13673.284 -1.415 0.15726
## NeighborhoodMitchel -37.827 11159.221 -0.003 0.99730
## NeighborhoodNAmes -5737.676 10124.910 -0.567 0.57104
## NeighborhoodNoRidge 61837.465 11393.136 5.428 6.98e-08 ***
## NeighborhoodNPkVill -8075.745 15656.660 -0.516 0.60609
## NeighborhoodNridgHt 44864.782 10524.821 4.263 2.19e-05 ***
## NeighborhoodNWAmes -2856.533 10657.272 -0.268 0.78872
## NeighborhoodOldTown -23087.174 10568.590 -2.185 0.02913 *
## NeighborhoodSawyer -2710.166 10745.568 -0.252 0.80092
## NeighborhoodSawyerW 539.343 10649.042 0.051 0.95962
## NeighborhoodSomerst 15951.403 10208.966 1.562 0.11845
## NeighborhoodStoneBr 54112.621 11882.907 4.554 5.84e-06 ***
## NeighborhoodSWISU -20043.962 13117.614 -1.528 0.12679
## NeighborhoodTimber 22179.147 11110.503 1.996 0.04615 *
## NeighborhoodVeenker 38922.428 13922.251 2.796 0.00527 **
## ExterQualFa -34320.778 14537.353 -2.361 0.01840 *
## ExterQualGd -28974.362 6809.423 -4.255 2.26e-05 ***
## ExterQualTA -32442.786 7564.948 -4.289 1.95e-05 ***
## KitchenQualFa -43197.807 9619.685 -4.491 7.83e-06 ***
## KitchenQualGd -37526.497 5163.239 -7.268 6.79e-13 ***
## KitchenQualTA -45390.853 5752.078 -7.891 7.03e-15 ***
## CentralAirY -12940.470 12452.259 -1.039 0.29893
## GrLivArea:CentralAirY 19.467 8.262 2.356 0.01864 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34340 on 1131 degrees of freedom
## Multiple R-squared: 0.8187, Adjusted R-squared: 0.8129
## F-statistic: 141.8 on 36 and 1131 DF, p-value: < 2.2e-16
The \(R^2\) is not much changed.
vif(update_lm)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## OverallQual 3.620502 1 1.902762
## GrLivArea 18.162544 1 4.261754
## X1stFlrSF 2.067551 1 1.437898
## GarageArea 1.865621 1 1.365877
## Neighborhood 8.280477 24 1.045024
## ExterQual 6.605269 3 1.369772
## KitchenQual 5.289657 3 1.319992
## CentralAir 8.996460 1 2.999410
## GrLivArea:CentralAir 26.701587 1 5.167358
It seems we have gotten rid of some collinearity issues.
par(mfrow = c(1,2))
plot(update_lm, which = c(1,2))
Let’s see how the model performs so far.
validate$predictSalePrice <- predict(update_lm, validate)
se <- (validate$predictSalePrice - validate$SalePrice)**2
(RMSE <- sqrt(mean(se)))
## [1] 29711.92
Let’s look back at the data and see if there is anything we can add to improve the model.
par(mfrow = c(1,3))
boxplot(train$SalePrice~as.factor(train$MSSubClass))
boxplot(train$SalePrice~as.factor(train$MoSold))
boxplot(train$SalePrice~as.factor(train$YrSold))
update_lm <- lm(SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + TotRmsAbvGrd + GarageArea + Neighborhood + ExterQual + KitchenQual + GrLivArea:CentralAir + HouseStyle + MSSubClass, train)
summary(update_lm)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + X1stFlrSF +
## TotRmsAbvGrd + GarageArea + Neighborhood + ExterQual + KitchenQual +
## GrLivArea:CentralAir + HouseStyle + MSSubClass, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -361206 -14251 86 13097 241451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68198.758 18126.033 3.762 0.000177 ***
## OverallQual 13707.155 1384.171 9.903 < 2e-16 ***
## GrLivArea 55.933 7.611 7.349 3.85e-13 ***
## X1stFlrSF -12.390 7.664 -1.617 0.106246
## TotRmsAbvGrd 1209.810 1148.328 1.054 0.292320
## GarageArea 29.947 6.339 4.724 2.60e-06 ***
## NeighborhoodBlueste -10052.371 35100.199 -0.286 0.774632
## NeighborhoodBrDale -16731.311 13775.393 -1.215 0.224782
## NeighborhoodBrkSide -17618.012 11675.781 -1.509 0.131596
## NeighborhoodClearCr 6904.093 12351.192 0.559 0.576285
## NeighborhoodCollgCr -5928.162 10053.383 -0.590 0.555532
## NeighborhoodCrawfor 10001.348 11158.344 0.896 0.370279
## NeighborhoodEdwards -31655.288 10697.378 -2.959 0.003149 **
## NeighborhoodGilbert -5697.059 10740.744 -0.530 0.595929
## NeighborhoodIDOTRR -29146.888 12401.486 -2.350 0.018932 *
## NeighborhoodMeadowV -15041.472 13608.110 -1.105 0.269253
## NeighborhoodMitchel -13242.853 11183.157 -1.184 0.236593
## NeighborhoodNAmes -21751.195 10326.830 -2.106 0.035401 *
## NeighborhoodNoRidge 46380.747 11570.683 4.008 6.51e-05 ***
## NeighborhoodNPkVill -701.074 15368.956 -0.046 0.963624
## NeighborhoodNridgHt 37018.256 10446.339 3.544 0.000411 ***
## NeighborhoodNWAmes -18236.836 10824.559 -1.685 0.092312 .
## NeighborhoodOldTown -35583.367 10766.760 -3.305 0.000980 ***
## NeighborhoodSawyer -19966.946 10949.998 -1.823 0.068499 .
## NeighborhoodSawyerW -11267.435 10730.692 -1.050 0.293935
## NeighborhoodSomerst 7886.497 10222.247 0.772 0.440571
## NeighborhoodStoneBr 49332.870 11678.564 4.224 2.59e-05 ***
## NeighborhoodSWISU -29633.897 13272.885 -2.233 0.025768 *
## NeighborhoodTimber 7667.237 11228.430 0.683 0.494848
## NeighborhoodVeenker 28748.378 13830.796 2.079 0.037883 *
## ExterQualFa -36671.004 14246.307 -2.574 0.010178 *
## ExterQualGd -28340.402 6681.796 -4.241 2.40e-05 ***
## ExterQualTA -32235.980 7408.118 -4.351 1.48e-05 ***
## KitchenQualFa -39435.862 9390.630 -4.199 2.89e-05 ***
## KitchenQualGd -35367.325 5069.389 -6.977 5.15e-12 ***
## KitchenQualTA -42562.160 5658.814 -7.521 1.11e-13 ***
## HouseStyle1.5Unf 10043.242 11756.104 0.854 0.393121
## HouseStyle1Story 18564.773 5530.655 3.357 0.000815 ***
## HouseStyle2.5Fin -13002.369 15903.886 -0.818 0.413782
## HouseStyle2.5Unf 89.984 13083.985 0.007 0.994514
## HouseStyle2Story -3067.525 4650.244 -0.660 0.509615
## HouseStyleSFoyer 29513.507 8443.084 3.496 0.000491 ***
## HouseStyleSLvl 21337.933 6763.755 3.155 0.001649 **
## MSSubClass -191.196 33.302 -5.741 1.21e-08 ***
## GrLivArea:CentralAirY 8.456 3.235 2.614 0.009080 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33570 on 1123 degrees of freedom
## Multiple R-squared: 0.828, Adjusted R-squared: 0.8213
## F-statistic: 122.9 on 44 and 1123 DF, p-value: < 2.2e-16
best_lm <- step(update_lm, direction = "backward")
## Start: AIC=24388.28
## SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + TotRmsAbvGrd +
## GarageArea + Neighborhood + ExterQual + KitchenQual + GrLivArea:CentralAir +
## HouseStyle + MSSubClass
##
## Df Sum of Sq RSS AIC
## - TotRmsAbvGrd 1 1.2506e+09 1.2666e+12 24387
## <none> 1.2653e+12 24388
## - X1stFlrSF 1 2.9446e+09 1.2683e+12 24389
## - GrLivArea:CentralAir 1 7.6966e+09 1.2730e+12 24393
## - HouseStyle 7 2.3144e+10 1.2885e+12 24395
## - ExterQual 3 2.2343e+10 1.2877e+12 24403
## - GarageArea 1 2.5145e+10 1.2905e+12 24409
## - MSSubClass 1 3.7140e+10 1.3025e+12 24420
## - KitchenQual 3 6.5553e+10 1.3309e+12 24441
## - OverallQual 1 1.1050e+11 1.3758e+12 24484
## - Neighborhood 24 2.6681e+11 1.5322e+12 24564
##
## Step: AIC=24387.43
## SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + GarageArea +
## Neighborhood + ExterQual + KitchenQual + HouseStyle + MSSubClass +
## GrLivArea:CentralAir
##
## Df Sum of Sq RSS AIC
## <none> 1.2666e+12 24387
## - X1stFlrSF 1 2.8051e+09 1.2694e+12 24388
## - GrLivArea:CentralAir 1 7.2050e+09 1.2738e+12 24392
## - HouseStyle 7 2.2394e+10 1.2890e+12 24394
## - ExterQual 3 2.2844e+10 1.2894e+12 24402
## - GarageArea 1 2.4727e+10 1.2913e+12 24408
## - MSSubClass 1 3.9175e+10 1.3058e+12 24421
## - KitchenQual 3 6.5369e+10 1.3320e+12 24440
## - OverallQual 1 1.0925e+11 1.3759e+12 24482
## - Neighborhood 24 2.6649e+11 1.5331e+12 24563
vif(best_lm)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## GVIF Df GVIF^(1/(2*Df))
## OverallQual 3.756865 1 1.938263
## GrLivArea 14.107881 1 3.756046
## X1stFlrSF 8.970489 1 2.995078
## GarageArea 1.913683 1 1.383359
## Neighborhood 18.383918 24 1.062533
## ExterQual 6.664807 3 1.371823
## KitchenQual 5.351207 3 1.322539
## HouseStyle 20.936987 7 1.242656
## MSSubClass 2.004517 1 1.415810
## GrLivArea:CentralAir 4.251176 1 2.061838
validate$predictSalePrice <- predict(best_lm, validate)
se <- (validate$predictSalePrice - validate$SalePrice)**2
(RMSE <- sqrt(mean(se)))
## [1] 29541.11
We reduced the RMSE slightly.
This was my initial model for submission, however I noticed that
there were 2 NA values in the predictions from this model. Upon
examination, there was one column with an NA value for
KitchenQual and one column with an NA value for
GarageArea. I removed these variables from the model.
update_lm <- update(best_lm, .~. - KitchenQual - GarageArea)
best_lm <- step(update_lm, direction = "backward")
## Start: AIC=24459.68
## SalePrice ~ OverallQual + GrLivArea + X1stFlrSF + Neighborhood +
## ExterQual + HouseStyle + MSSubClass + GrLivArea:CentralAir
##
## Df Sum of Sq RSS AIC
## - X1stFlrSF 1 9.7606e+08 1.3577e+12 24459
## <none> 1.3567e+12 24460
## - GrLivArea:CentralAir 1 9.8073e+09 1.3665e+12 24466
## - HouseStyle 7 3.1525e+10 1.3882e+12 24473
## - MSSubClass 1 5.4153e+10 1.4108e+12 24503
## - ExterQual 3 6.7172e+10 1.4239e+12 24510
## - OverallQual 1 1.5213e+11 1.5088e+12 24582
## - Neighborhood 24 3.3015e+11 1.6868e+12 24666
##
## Step: AIC=24458.52
## SalePrice ~ OverallQual + GrLivArea + Neighborhood + ExterQual +
## HouseStyle + MSSubClass + GrLivArea:CentralAir
##
## Df Sum of Sq RSS AIC
## <none> 1.3577e+12 24459
## - GrLivArea:CentralAir 1 9.9975e+09 1.3677e+12 24465
## - HouseStyle 7 4.9124e+10 1.4068e+12 24486
## - MSSubClass 1 5.4976e+10 1.4126e+12 24503
## - ExterQual 3 6.6421e+10 1.4241e+12 24508
## - OverallQual 1 1.5132e+11 1.5090e+12 24580
## - Neighborhood 24 3.2958e+11 1.6872e+12 24664
summary(best_lm)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + Neighborhood +
## ExterQual + HouseStyle + MSSubClass + GrLivArea:CentralAir,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -370362 -14845 43 14253 258513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46270.318 17157.272 2.697 0.007104 **
## OverallQual 15561.502 1387.226 11.218 < 2e-16 ***
## GrLivArea 57.355 4.266 13.446 < 2e-16 ***
## NeighborhoodBlueste -10989.630 36176.839 -0.304 0.761355
## NeighborhoodBrDale -21615.428 14129.279 -1.530 0.126338
## NeighborhoodBrkSide -21722.688 11947.940 -1.818 0.069312 .
## NeighborhoodClearCr 4232.710 12698.208 0.333 0.738946
## NeighborhoodCollgCr -7136.464 10346.292 -0.690 0.490487
## NeighborhoodCrawfor 2781.582 11442.578 0.243 0.807979
## NeighborhoodEdwards -36629.538 10967.201 -3.340 0.000865 ***
## NeighborhoodGilbert -11565.287 11050.088 -1.047 0.295497
## NeighborhoodIDOTRR -31584.378 12760.490 -2.475 0.013463 *
## NeighborhoodMeadowV -15944.531 13970.465 -1.141 0.253986
## NeighborhoodMitchel -16950.480 11488.809 -1.475 0.140386
## NeighborhoodNAmes -26354.293 10606.942 -2.485 0.013113 *
## NeighborhoodNoRidge 42560.578 11811.045 3.603 0.000328 ***
## NeighborhoodNPkVill -2452.881 15823.996 -0.155 0.876841
## NeighborhoodNridgHt 45748.791 10717.850 4.268 2.13e-05 ***
## NeighborhoodNWAmes -23467.127 11140.805 -2.106 0.035389 *
## NeighborhoodOldTown -37599.611 11053.571 -3.402 0.000693 ***
## NeighborhoodSawyer -24974.210 11259.605 -2.218 0.026751 *
## NeighborhoodSawyerW -14936.962 11048.656 -1.352 0.176670
## NeighborhoodSomerst 10249.210 10493.931 0.977 0.328937
## NeighborhoodStoneBr 56635.193 11985.225 4.725 2.59e-06 ***
## NeighborhoodSWISU -37388.368 13566.310 -2.756 0.005946 **
## NeighborhoodTimber 7991.847 11563.979 0.691 0.489646
## NeighborhoodVeenker 29507.478 14213.933 2.076 0.038124 *
## ExterQualFa -54124.786 13767.874 -3.931 8.97e-05 ***
## ExterQualGd -45228.513 6357.954 -7.114 2.00e-12 ***
## ExterQualTA -52276.548 7193.553 -7.267 6.84e-13 ***
## HouseStyle1.5Unf 7776.145 11805.811 0.659 0.510241
## HouseStyle1Story 16835.153 4282.803 3.931 8.98e-05 ***
## HouseStyle2.5Fin -5761.278 15471.165 -0.372 0.709674
## HouseStyle2.5Unf -3783.669 13311.841 -0.284 0.776284
## HouseStyle2Story 2204.908 4286.453 0.514 0.607080
## HouseStyleSFoyer 35052.163 8250.932 4.248 2.33e-05 ***
## HouseStyleSLvl 24116.517 6529.907 3.693 0.000232 ***
## MSSubClass -228.877 33.851 -6.761 2.19e-11 ***
## GrLivArea:CentralAirY 9.400 3.260 2.883 0.004009 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34680 on 1129 degrees of freedom
## Multiple R-squared: 0.8155, Adjusted R-squared: 0.8093
## F-statistic: 131.3 on 38 and 1129 DF, p-value: < 2.2e-16
validate$predictSalePrice <- predict(best_lm, validate)
se <- (validate$predictSalePrice - validate$SalePrice)**2
(RMSE <- sqrt(mean(se)))
## [1] 30172.79
The \(R^2\) value decreased slightly and the RMSE increased slightly, but not by too much for the model to be much affected.
test <- read.csv(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_605/FinalProj/test.csv"))
test$SalePrice <- predict(best_lm, test)
Results:
predictions <- data.frame("Id" = test$Id,
"SalePrice" = test$SalePrice)
write.csv(predictions, "C:/Users/Shoshana/Documents/CUNY SPS/cuny-sps/DATA_605/FinalProj/predictions.csv", row.names = F)