library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.2
library(stats)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(mice)
## Warning: package 'mice' was built under R version 4.3.2
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
a. P(X>x | Y>y)
b. P(X>x, Y>y)
c. P(X<x | Y>y)
x/y             <=2d quartile   >2d quartile   Total
<=3d quartile
>3d quartile
Total
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
First, load the data.
train <- read.csv("C:\\Users\\shaya\\OneDrive\\Documents\\repos\\Data605\\Final_Project\\house-prices-advanced-regression-techniques\\train.csv")
eval <- read.csv("C:\\Users\\shaya\\OneDrive\\Documents\\repos\\Data605\\Final_Project\\house-prices-advanced-regression-techniques\\test.csv")
Since the variable is required to be skewed to the right, I will plot the density of the numeric columns to determine an appropriate one to use.
# Plot the density of numeric columns
train |>
  keep(is.numeric) |>
  gather(key = "variable", value = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = '#4E79A7', color = 'black') +
  stat_density(geom = "line", color = "red") +
  facet_wrap(~ variable, scales = 'free') +
  theme(strip.text = element_text(size = 5))
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_density()`).
Based on these plots, I will select the TotalBsmtSF variable as X and the SalePrice variable as Y. To ensure that the TotalBsmtSF variable is skewed to the right, I will calculate the skewness using the e1071 package.
# Store the result under a different name so the skewness() function is not shadowed
x_skewness <- skewness(train$TotalBsmtSF)
x_skewness
## [1] 1.521124
Our X variable has a skewness of 1.52. The positive sign confirms the right skew, and a common rule of thumb treats an absolute skewness above 1 as highly skewed.
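As a cross-check on the package result, skewness can also be computed by hand from the sample moments. The sketch below is only illustrative: `sizes` is a simulated, hypothetical stand-in for a size variable (not the housing data), and `moment_skewness()` is a helper name introduced here. It uses the plain estimator m3 / m2^(3/2); the default estimator in e1071 applies a slightly different small-sample scaling, so the two need not agree to the last digit.

```r
# Hypothetical right-skewed sample standing in for a size variable
set.seed(42)
sizes <- rexp(1000, rate = 1 / 1000)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5
moment_skewness <- function(v) {
  v <- v[!is.na(v)]
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

moment_skewness(sizes)    # positive for a right-skewed sample
moment_skewness(-sizes)   # same magnitude with a negative sign (left skew)
```

Flipping the sign of the data flips the sign of the statistic, which is why a positive value is read as a right skew.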
Next, I will calculate the probabilities a, b, and c as requested.
# Define X and Y
X <- train$TotalBsmtSF
Y <- train$SalePrice
# Find the quartiles
x <- quantile(X, 0.75) # 3rd quartile of X
y <- quantile(Y, 0.50) # 2nd quartile of Y
# Store counts
A <- as.numeric(X > x)
B <- as.numeric(Y > y)
# Calculate probabilities
# a. P(X > x | Y > y)
prob_a <- sum(X > x & Y > y) / sum(Y > y)
# b. P(X > x, Y > y)
prob_b <- sum(X > x & Y > y) / nrow(train)
# c. P(X < x | Y > y)
prob_c <- sum(X < x & Y > y) / sum(Y > y)
# Create a table of counts
counts_table <- table(X > x, Y > y)
# Rename columns and rows
colnames(counts_table) <- c("<=2d quartile", ">2d quartile")
rownames(counts_table) <- c("<=3d quartile", ">3d quartile")
# Add total row and column
counts_table <- addmargins(counts_table, margin = 1)
counts_table <- addmargins(counts_table, margin = 2)
# Print the probabilities
cat("Probabilities:\n")
## Probabilities:
cat("a. P(X>x | Y>y) =", prob_a, "\n")
## a. P(X>x | Y>y) = 0.4519231
cat("b. P(X>x, Y>y) =", prob_b, "\n")
## b. P(X>x, Y>y) = 0.2253425
cat("c. P(X<x | Y>y) =", prob_c, "\n")
## c. P(X<x | Y>y) = 0.5480769
# Print the table of counts with totals
cat("\nTable of counts:\n")
##
## Table of counts:
print(counts_table)
##
## <=2d quartile >2d quartile Sum
## <=3d quartile 696 399 1095
## >3d quartile 36 329 365
## Sum 732 728 1460
The probabilities are interpreted as follows:
a. P(X>x | Y>y) = 0.4519231: among houses whose SalePrice is above the 2nd quartile (163000), about 45.19% have a TotalBsmtSF above the 3rd quartile (1298.25).
b. P(X>x, Y>y) = 0.2253425: about 22.53% of all houses have both a TotalBsmtSF above the 3rd quartile (1298.25) and a SalePrice above the 2nd quartile (163000).
c. P(X<x | Y>y) = 0.5480769: among houses whose SalePrice is above the 2nd quartile (163000), about 54.81% have a TotalBsmtSF below the 3rd quartile (1298.25).
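The same probabilities can be read directly off the table of counts. The sketch below retypes the four cell counts printed above into a matrix (the object names are introduced here for illustration) and uses `prop.table()` to get joint and conditional proportions.

```r
# Cell counts copied from the table of counts above
# rows: X <= Q3 / X > Q3, columns: Y <= Q2 / Y > Q2
counts <- matrix(c(696, 36, 399, 329), nrow = 2,
                 dimnames = list(X = c("<=Q3", ">Q3"),
                                 Y = c("<=Q2", ">Q2")))

# Joint proportions: each cell divided by the grand total
joint <- prop.table(counts)

# Conditional on Y: each cell divided by its column total
cond_on_y <- prop.table(counts, margin = 2)

cond_on_y[">Q3", ">Q2"]   # a. P(X>x | Y>y)
joint[">Q3", ">Q2"]       # b. P(X>x, Y>y)
cond_on_y["<=Q3", ">Q2"]  # c. equals P(X<x | Y>y) here because no X equals x
```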
Probabilities a and c are both conditioned on Y > y and sum to 1, so the remaining case, P(X=x | Y>y), must be 0. To confirm this, I calculate that probability (call it d) and check that a, c, and d sum to 1.
prob_d <- sum(X == x & Y > y) / sum(Y > y)
prob_d
## [1] 0
# check if prob_a + prob_c + prob_d = 1 (all.equal tolerates floating-point rounding)
isTRUE(all.equal(prob_a + prob_c + prob_d, 1))
## [1] TRUE
Since they sum to 1, the three conditional probabilities are consistent with one another.
Next, I will check if splitting the training data in this fashion makes them independent. I will create new variables A and B based on the conditions provided and then check if P(A|B) = P(A)P(B) both mathematically and using a Chi-Square test for association.
# Create new variables A and B
A <- as.numeric(X > x)
B <- as.numeric(Y > y)
# Calculate P(A|B) and P(A)P(B)
P_A_given_B <- sum(A == 1 & B == 1) / sum(B == 1)
P_A <- sum(A == 1) / nrow(train)
P_B <- sum(B == 1) / nrow(train)
# Check if P(A|B) = P(A)P(B)
P_A_given_B == P_A * P_B
## [1] FALSE
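The result is FALSE: P(A|B) = 329/728 ≈ 0.4519, while P(A)P(B) = (365/1460)(728/1460) ≈ 0.1247. Strictly speaking, independence requires P(A|B) = P(A), or equivalently P(A, B) = P(A)P(B), and that also fails here because P(A) = 0.25. So A and B are not independent. To confirm this, I will perform a Chi-Square test for association.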
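For reference, the usual definition of independence can be checked in two equivalent ways: P(A|B) = P(A), or P(A, B) = P(A)P(B). The sketch below does both using the counts from the table of counts above; the variable names are introduced here, and `all.equal()` is used instead of `==` so that floating-point rounding cannot affect the comparison.

```r
# Counts taken from the table of counts above
n_total <- 1460
n_a     <- 365   # X above its 3rd quartile
n_b     <- 728   # Y above its 2nd quartile
n_ab    <- 329   # both conditions hold

p_a         <- n_a / n_total
p_b         <- n_b / n_total
p_ab        <- n_ab / n_total
p_a_given_b <- n_ab / n_b

# Independence would require both comparisons to be TRUE
isTRUE(all.equal(p_a_given_b, p_a))
isTRUE(all.equal(p_ab, p_a * p_b))
```

Both comparisons fail for these counts, which agrees with the conclusion that A and B are dependent.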
# Create a contingency table
contingency_table <- table(A, B)
# Perform a Chi-Square test for association
chisq.test(contingency_table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table
## X-squared = 313.61, df = 1, p-value < 2.2e-16
The p-value is extremely low (< 2.2e-16), so we reject the null hypothesis that A and B are independent. Therefore, A and B are dependent.
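To see where the test statistic comes from, it can help to look at the counts that independence would predict. The sketch below retypes the 2x2 counts from the table of counts (object names are introduced here) and computes the expected counts as row total times column total over the grand total, then compares them with what `chisq.test()` reports.

```r
# Observed counts, laid out like table(A, B): rows are A = 0/1, columns are B = 0/1
observed <- matrix(c(696, 36, 399, 329), nrow = 2,
                   dimnames = list(A = c("0", "1"), B = c("0", "1")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# chisq.test() applies Yates' continuity correction to 2x2 tables by default
result <- chisq.test(observed)
result$expected
result$statistic
```

Under independence only 182 houses would be expected in the cell where both A and B equal 1, against 329 observed, and that gap is what drives the large statistic.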
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
First, I will provide univariate descriptive statistics and appropriate plots for the training data set. This is usually the first step in all my data analysis projects. I typically use the str() and summary() functions to get a summary of the data.
# Display the structure of the data
str(train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
The above output gives us the structure of the data set, including the number of observations and variables, the names of the variables, and the data types of the variables.
# Descriptive statistics
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
The above output gives the minimum, quartiles, median, mean, and maximum of each numeric column, along with the number of missing values where there are any. It does not report standard deviations, and for character columns it shows only the length, class, and mode rather than category counts. We will further analyze these distributions using descriptive plots.
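Since summary() does not print standard deviations or category frequencies, those can be obtained separately. The sketch below is a minimal illustration on a tiny made-up data frame (`homes` and its values are hypothetical; only the column names mimic the housing data).

```r
# Hypothetical miniature data frame; column names mimic the housing data
homes <- data.frame(
  SalePrice = c(208500, 181500, 223500, 140000, NA),
  MSZoning  = c("RL", "RL", "RM", "RL", "RM"),
  stringsAsFactors = FALSE
)

# Standard deviation of every numeric column, ignoring missing values
numeric_cols <- homes[sapply(homes, is.numeric)]
sapply(numeric_cols, sd, na.rm = TRUE)

# Frequency of each category; useNA = "ifany" also counts missing entries
table(homes$MSZoning, useNA = "ifany")
```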
Earlier, I plotted the density of the numeric columns to determine an appropriate variable to use as X. I will repeat that here to help get an overview of the data as a whole.
# Plot the density of numeric columns
train |>
  keep(is.numeric) |>
  gather(key = "variable", value = "value") |>
  ggplot(aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = '#4E79A7', color = 'black') +
  stat_density(geom = "line", color = "red") +
  facet_wrap(~ variable, scales = 'free') +
  theme(strip.text = element_text(size = 5))
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_density()`).
We see many right-skewed variables. This is logical since many of these
variables are related to the size of the house, and it is common for
house sizes to have a right-skewed distribution. The Id column doesn’t
impart any knowledge and really should not be treated as a numeric
column. The year columns (YearBuilt, YearRemodAdd, GarageYrBlt, YrSold)
are also stored as integers, although they could arguably be treated as
factors.
Similar to above, I will create boxplots for the numeric columns to get a better understanding of the data.
# Create boxplots
train |>
  keep(is.numeric) |>
  gather(key = "variable", value = "value") |>
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = 'free') +
  theme(strip.text = element_text(size = 5))
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The above plots show us the distribution of the numeric columns in the
data set. We can see that many of the variables have many outliers,
which is common in real-world data sets.
Next, I will plot the distribution of the categorical variables in the data set.
# Plot the distribution of categorical variables
train |>
  select_if(~ is.character(.)) |>
  gather(key = "variable", value = "value") |>
  ggplot(aes(x = value)) +
  geom_bar(fill = "#4E79A7") +
  facet_wrap(~ variable, scales = 'free') +
  theme(strip.text = element_text(size = 5)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
We see all different types of distributions among the categorical
variables. Many of the variables seem to have one predominant category,
while others have a more even distribution.
We now check for missing data and visualize the values by plotting the missing data count per column in descending order.
missing_data <- train %>%
  select_if(~ any(is.na(.))) %>%
  summarise_all(~ sum(is.na(.))) %>%
  gather(key = "variable", value = "missing_count") %>%
  arrange(desc(missing_count))  # descending, to match the description above
# Plot the missing data pattern
missing_data %>%
  ggplot(aes(x = reorder(variable, missing_count), y = missing_count)) +
  geom_bar(stat = "identity", fill = "#4E79A7") +
  labs(title = "Missing Data Pattern", x = "Variable", y = "Missing Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()
We can see that the columns have varying degrees of missingness. These
columns will need to be treated before they can be used in the analysis.
Luckily, the columns I chose as the X and Y variables do not have any
missing data.
Next, I will create a pairplot of the numeric variables to visualize the relationships between them. Since there are too many variables to plot all of them and still be able to interpret the result, I will plot only the 6 numeric variables with the highest correlation with the SalePrice variable.
# Find the 6 numeric variables with the highest correlation with SalePrice
correlation_matrix <- cor(train[, sapply(train, is.numeric)])
correlation_with_saleprice <- correlation_matrix["SalePrice", ]
top_correlated_variables <- names(sort(correlation_with_saleprice, decreasing = TRUE)[2:7])
# Create a pairplot of the top correlated variables plus the SalePrice variable
train |>
  # all_of() avoids the tidyselect deprecation warning for external character vectors
  dplyr::select(all_of(c(top_correlated_variables, "SalePrice"))) |>
  ggpairs()
The pairplot shows the relationships among the 6 numeric variables most correlated with SalePrice. Each has a positive, roughly linear relationship with SalePrice. Some of the predictors are also strongly related to one another; this is expected, since many of them measure the size of the house, but it can lead to multicollinearity in a model.
Next, I will plot the scatterplot of X, TotalBsmtSF, and Y, SalePrice.
# Plot the scatterplot of X and Y with a linear regression line
ggplot(train, aes(x = TotalBsmtSF, y = SalePrice)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot of TotalBsmtSF and SalePrice", x = "TotalBsmtSF", y = "SalePrice")
## `geom_smooth()` using formula = 'y ~ x'
We see a clearly positive linear relationship between TotalBsmtSF and
SalePrice. We also see how the linear regression line seems to be
drifting further from the data points as the TotalBsmtSF increases. This
illustrates the risks of predicting outside the range of the data.
I am unsure what exactly is meant by the difference in the mean of the variables. I will simply calculate the difference in means between TotalBsmtSF and SalePrice and calculate the 95% confidence interval for this difference. But since these variables are on different scales, I am not sure how meaningful this analysis will be.
Afterwards, I will calculate the difference in the mean of TotalBsmtSF between houses above and below the median SalePrice, along with its 95% confidence interval. This will give us a better understanding of the relationship between the two variables and seems like a more likely interpretation of the assignment.
# Calculate the 95% confidence interval for the difference in the mean of the variables
X_mean <- mean(X)
Y_mean <- mean(Y)
X_sd <- sd(X)
Y_sd <- sd(Y)
n <- length(X)
m <- length(Y)
# Calculate the standard error
SE <- sqrt((X_sd^2 / n) + (Y_sd^2 / m))
# Calculate the margin of error
ME <- qt(0.975, df = n + m - 2) * SE
# Calculate the confidence interval
CI <- c((X_mean - Y_mean) - ME, (X_mean - Y_mean) + ME)
CI
## [1] -183940.5 -175787.0
The 95% Confidence Interval for the difference in means is between -183940.5 and -175787.0. This means that we are 95% confident that the difference in means between TotalBsmtSF and SalePrice is between -183940.5 and -175787.0.
I will double check my work using the t-test function in R.
# Perform t-test
t_test_result <- t.test(X, Y)
# Calculate the confidence interval for the difference in means
ci <- t_test_result$conf.int
# Print the confidence interval
ci
## [1] -183942.2 -175785.3
## attr(,"conf.level")
## [1] 0.95
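We achieved nearly the same results using the t-test function in R. The small difference in the last digits arises because t.test() uses Welch degrees of freedom by default, whereas my manual calculation used n + m - 2.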
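The manual interval and the t.test() interval above agree only to within a couple of units (-183940.5 versus -183942.2 at the lower end). A likely reason is the degrees of freedom: t.test() defaults to the Welch approximation rather than n + m - 2. The sketch below illustrates this on simulated data (`u` and `v` are hypothetical stand-ins, not the housing variables): once the Welch-Satterthwaite degrees of freedom are plugged into the manual formula, the two intervals coincide.

```r
# Simulated stand-ins for two variables on very different scales
set.seed(1)
u <- rnorm(200, mean = 1000, sd = 400)
v <- rnorm(200, mean = 180000, sd = 80000)

# Squared standard errors of each mean, and the combined standard error
se_u2 <- var(u) / length(u)
se_v2 <- var(v) / length(v)
se    <- sqrt(se_u2 + se_v2)

# Welch-Satterthwaite degrees of freedom
df_welch <- (se_u2 + se_v2)^2 /
  (se_u2^2 / (length(u) - 1) + se_v2^2 / (length(v) - 1))

diff_means <- mean(u) - mean(v)
ci_manual  <- diff_means + c(-1, 1) * qt(0.975, df = df_welch) * se
ci_ttest   <- as.numeric(t.test(u, v)$conf.int)

all.equal(ci_manual, ci_ttest)
```

When one variable has a far larger variance than the other, the Welch degrees of freedom fall close to the size of that one sample minus 1, so the critical value is slightly larger than with n + m - 2.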
As mentioned above, I will calculate the difference in means of the TotalBsmtSF when above and below the median of SalePrice.
# Calculate the difference in means of TotalBsmtSF when above and below the median of SalePrice
median_saleprice <- median(Y)
# Calculate the 95% confidence interval for the difference in the mean of TotalBsmtSF when above and below the median of SalePrice
X_below_median <- X[Y <= median_saleprice]
X_above_median <- X[Y > median_saleprice]
n_below <- length(X_below_median)
n_above <- length(X_above_median)
mean_below <- mean(X_below_median)
mean_above <- mean(X_above_median)
sd_below <- sd(X_below_median)
sd_above <- sd(X_above_median)
# Calculate the standard error
SE <- sqrt((sd_below^2 / n_below) + (sd_above^2 / n_above))
# Calculate the margin of error
ME <- qt(0.975, df = n_below + n_above - 2) * SE
# Calculate the confidence interval
CI <- c((mean_below - mean_above) - ME, (mean_below - mean_above) + ME)
CI
## [1] -424.9548 -343.9242
# Observed difference in means
mean_below - mean_above
## [1] -384.4395
Again, I will double check my work using the t-test function in R.
# Perform t-test
t_test_result <- t.test(X_below_median, X_above_median)
# Calculate the confidence interval for the difference in means
ci <- t_test_result$conf.int
# Print the confidence interval
ci
## [1] -424.9554 -343.9236
## attr(,"conf.level")
## [1] 0.95
The 95% confidence interval for the difference in mean TotalBsmtSF between houses at or below the median SalePrice and houses above it is between -424.96 and -343.92. In other words, we are 95% confident that houses at or below the median SalePrice have, on average, between 343.92 and 424.96 fewer square feet of basement than houses above it. This tracks with the observed difference in means of -384.44.
Next, I will derive a correlation matrix for two of the quantitative variables I selected and test the hypothesis that the correlation between these variables is 0. I will also calculate the 99% confidence interval for the correlation.
# Derive a correlation matrix for two of the quantitative variables you selected
correlation_matrix <- cor(train[, c("TotalBsmtSF", "SalePrice")])
correlation_matrix
## TotalBsmtSF SalePrice
## TotalBsmtSF 1.0000000 0.6135806
## SalePrice 0.6135806 1.0000000
The correlation between TotalBsmtSF and SalePrice is 0.61, which indicates a strong positive linear relationship between the two variables. This reinforces what we saw in the scatterplot.
# Calculate correlation coefficient
correlation_coefficient <- cor(X, Y)
# Sample size
n <- length(X)
# Degrees of freedom
df <- n - 2
# Manual calculation of t-statistic
t_statistic <- correlation_coefficient * sqrt(df) / sqrt(1 - correlation_coefficient^2)
# Manual calculation of p-value
p_value <- 2 * pt(-abs(t_statistic), df)
# Calculate 99% confidence interval using Fisher transformation to get the z-score
r_transform <- 0.5 * log((1 + correlation_coefficient) / (1 - correlation_coefficient))
CI_lower <- tanh(r_transform - qnorm(0.995) / sqrt(n - 3))
CI_upper <- tanh(r_transform + qnorm(0.995) / sqrt(n - 3))
# Print the results
cat("Correlation Coefficient:", correlation_coefficient, "\n")
## Correlation Coefficient: 0.6135806
cat("t-statistic:", t_statistic, "\n")
## t-statistic: 29.67055
cat("p-value:", p_value, "\n")
## p-value: 9.484229e-152
cat("99% Confidence Interval:", CI_lower, "-", CI_upper, "\n")
## 99% Confidence Interval: 0.5697562 - 0.6539251
Here, I verified that the correlation coefficient is 0.61, which is the same as the correlation matrix. The t-statistic is 29.67, and the p-value is extremely low, which means we reject the null hypothesis that the correlation between TotalBsmtSF and SalePrice is 0. The 99% confidence interval for the correlation is between 0.57 and 0.65, which means we are 99% confident that the correlation between TotalBsmtSF and SalePrice is between 0.57 and 0.65. This fits our observed correlation of 0.61.
Now, I double check my work using the cor.test function in R.
# Calculate the correlation coefficient
correlation <- cor(X, Y)
# Perform the hypothesis test and obtain the 99% confidence interval
test_result <- cor.test(X, Y, method = "pearson", conf.level = 0.99)
# Print the test result
print(test_result)
##
## Pearson's product-moment correlation
##
## data: X and Y
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.5697562 0.6539251
## sample estimates:
## cor
## 0.6135806
We achieved the same results using the cor.test function in R. The p-value only looks different because the printed output reports any p-value below machine epsilon (about 2.2e-16) as "< 2.2e-16"; the t-statistic, correlation coefficient, and confidence interval all match, and both p-values are extremely low.
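The difference between the manual p-value and the printed one appears to be a display matter: as far as I can tell, the print method for test results formats very small p-values with a "<" prefix once they drop below `.Machine$double.eps`, while the unrounded value stays available in the `p.value` element. The sketch below shows this on simulated data (`a` and `b` are hypothetical, not the housing variables).

```r
# Simulated, strongly correlated data (hypothetical, not the housing data)
set.seed(7)
a <- rnorm(500)
b <- a + rnorm(500)

fit <- cor.test(a, b, method = "pearson")

# Manual two-sided p-value from the t-statistic, as in the calculation above
r        <- cor(a, b)
deg_free <- length(a) - 2
t_stat   <- r * sqrt(deg_free) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_stat), deg_free)

fit$p.value                           # unrounded value stored in the result
p_manual / fit$p.value                # ratio should be very close to 1
fit$p.value < .Machine$double.eps     # below the display threshold
format.pval(fit$p.value, digits = 4)  # shown with a "<" prefix
```

Applied to the housing data, `test_result$p.value` would be the value to compare against the manually computed p-value.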
I will repeat the above analysis for each of the 6 variables most correlated with the SalePrice variable.
# Calculate the correlation matrix for the top correlated variables with SalePrice
correlation_matrix_top <- cor(train[, c(top_correlated_variables, "SalePrice")])
correlation_matrix_top
## OverallQual GrLivArea GarageCars GarageArea TotalBsmtSF X1stFlrSF
## OverallQual 1.0000000 0.5930074 0.6006707 0.5620218 0.5378085 0.4762238
## GrLivArea 0.5930074 1.0000000 0.4672474 0.4689975 0.4548682 0.5660240
## GarageCars 0.6006707 0.4672474 1.0000000 0.8824754 0.4345848 0.4393168
## GarageArea 0.5620218 0.4689975 0.8824754 1.0000000 0.4866655 0.4897817
## TotalBsmtSF 0.5378085 0.4548682 0.4345848 0.4866655 1.0000000 0.8195300
## X1stFlrSF 0.4762238 0.5660240 0.4393168 0.4897817 0.8195300 1.0000000
## SalePrice 0.7909816 0.7086245 0.6404092 0.6234314 0.6135806 0.6058522
## SalePrice
## OverallQual 0.7909816
## GrLivArea 0.7086245
## GarageCars 0.6404092
## GarageArea 0.6234314
## TotalBsmtSF 0.6135806
## X1stFlrSF 0.6058522
## SalePrice 1.0000000
The correlation matrix shows the correlations among the top 6 variables and the SalePrice variable. All six variables are positively correlated with SalePrice, which is expected since they were selected for that reason. Several of them are also strongly correlated with each other, most notably GarageCars with GarageArea (0.88) and TotalBsmtSF with X1stFlrSF (0.82). Overall, the matrix reinforces what we saw in the pairplot.
# Create an empty dataframe to store the results
results_df <- data.frame(variable = character(), observed_correlation = numeric(),
CI_lower = numeric(), CI_upper = numeric(), stringsAsFactors = FALSE)
# Create a dataframe to store the correlation coefficients for the top correlated variables
correlation_coefficients <- cor(train[, c(top_correlated_variables, "SalePrice")])
# Iterate over each variable and perform hypothesis testing
for (variable in top_correlated_variables) {
# Perform Pearson correlation test
test_result <- cor.test(train[[variable]], train$SalePrice, method = "pearson", conf.level = 0.99)
# Extract information from test result
observed_correlation <- correlation_coefficients[variable, "SalePrice"]
CI_lower <- test_result$conf.int[1]
CI_upper <- test_result$conf.int[2]
# Append results to dataframe
results_df <- rbind(results_df, data.frame(variable = variable,
observed_correlation = observed_correlation,
CI_lower = CI_lower,
CI_upper = CI_upper))
}
# Print the results dataframe
print(results_df)
## variable observed_correlation CI_lower CI_upper
## 1 OverallQual 0.7909816 0.7643382 0.8149288
## 2 GrLivArea 0.7086245 0.6733974 0.7406408
## 3 GarageCars 0.6404092 0.5988712 0.6785107
## 4 GarageArea 0.6234314 0.5804338 0.6629623
## 5 TotalBsmtSF 0.6135806 0.5697562 0.6539251
## 6 X1stFlrSF 0.6058522 0.5613896 0.6468270
The results dataframe shows the observed correlation between each of the top 6 variables and the SalePrice variable, as well as the 99% confidence interval for the correlation. All six variables have a strong positive correlation with SalePrice, and the confidence intervals are relatively tight: each is less than 0.09 wide and lies well above 0. This reinforces what we saw in the pairplot and the correlation matrix.
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.
First, I will invert the correlation matrix to obtain the precision matrix. The diagonal of the precision matrix contains the variance inflation factors, which for two variables equal 1 / (1 - r^2).
# Invert the correlation matrix
precision_matrix <- solve(correlation_matrix)
precision_matrix
## TotalBsmtSF SalePrice
## TotalBsmtSF 1.6038006 -0.9840609
## SalePrice -0.9840609 1.6038006
Next, I will multiply the correlation matrix by the precision matrix. This requires the matrix product operator %*%, since * multiplies the matrices element by element.
# Multiply the correlation matrix by the precision matrix (matrix product)
correlation_matrix %*% precision_matrix
Finally, I will multiply the precision matrix by the correlation matrix.
# Multiply the precision matrix by the correlation matrix (matrix product)
precision_matrix %*% correlation_matrix
Because the precision matrix is the inverse of the correlation matrix, both products equal the 2 x 2 identity matrix, up to floating-point rounding.
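In R, `*` multiplies matrices element by element, while `%*%` is the matrix product; only the matrix product returns the identity when a matrix is multiplied by its inverse. The sketch below rebuilds the 2 x 2 correlation matrix from the r = 0.6135806 reported earlier and contrasts the two operators. The names `R_mat` and `P_mat` are illustrative, not objects from the analysis.

```r
r <- 0.6135806
vars <- c("TotalBsmtSF", "SalePrice")
R_mat <- matrix(c(1, r, r, 1), nrow = 2, dimnames = list(vars, vars))

# Precision matrix = inverse of the correlation matrix
P_mat <- solve(R_mat)

# Its diagonal is the variance inflation factor, 1 / (1 - r^2) for two variables
diag(P_mat)
1 / (1 - r^2)

# Matrix products in either order give the identity (up to rounding)
round(R_mat %*% P_mat, 10)
round(P_mat %*% R_mat, 10)

# Element-wise product, for contrast: this is not the identity
R_mat * P_mat
```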
Next, I will conduct principal components analysis (PCA) on TotalBsmtSF and SalePrice. PCA reduces the dimensionality of the data by rotating it into a new coordinate system whose axes, the principal components, are the directions in which the data varies the most. I will use the prcomp function in R with scale = TRUE, which standardizes the variables and is therefore equivalent to PCA on the correlation matrix.
# Perform PCA
pca_result <- prcomp(train[, c("TotalBsmtSF", "SalePrice")], scale = TRUE)
# Print the PCA result
summary(pca_result)
## Importance of components:
## PC1 PC2
## Standard deviation 1.2703 0.6216
## Proportion of Variance 0.8068 0.1932
## Cumulative Proportion 0.8068 1.0000
pca_result
## Standard deviations (1, .., p=2):
## [1] 1.2702679 0.6216265
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## TotalBsmtSF 0.7071068 -0.7071068
## SalePrice 0.7071068 0.7071068
The summary of the PCA result shows the standard deviation of each principal component, the proportion of variance it explains, and the cumulative proportion of variance explained. The first principal component explains 80.68% of the variance, while the second explains the remaining 19.32%. With two standardized variables, the component variances are 1 + r and 1 - r (1.2703^2 is about 1.61 and 0.6216^2 is about 0.39 for r = 0.61), so the variance shares follow directly from the correlation.
The rotation matrix shows how the original variables load on each principal component. Every loading has magnitude 0.7071 (1 divided by the square root of 2). The first principal component weights TotalBsmtSF and SalePrice equally, so it captures the two variables moving together and maximizes the variance. The second principal component is orthogonal to the first, contrasts the two variables, and captures the remaining variance.
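The link between the correlation and the PCA output can be checked with an eigendecomposition of the 2 x 2 correlation matrix. This is a small sketch using the r = 0.6135806 reported earlier; `R_mat` and `eig` are illustrative names.

```r
r <- 0.6135806
R_mat <- matrix(c(1, r, r, 1), nrow = 2)

# Eigenvalues of a 2 x 2 correlation matrix are 1 + r and 1 - r
eig <- eigen(R_mat)
eig$values

# Their square roots are the standard deviations reported by prcomp
sqrt(eig$values)

# Proportion of variance explained by each component
eig$values / sum(eig$values)

# Every loading has magnitude 1 / sqrt(2); signs are arbitrary
abs(eig$vectors)
```

The signs of the loadings can differ between `eigen` and `prcomp`, which does not affect the interpretation.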
Now I will perform PCA on all the numeric columns of the dataframe. I will exclude any binary variables from the PCA.
PCA requires complete rows, so before continuing I will impute the missing values in the numeric columns. This will also help when building the models later on. I will use the mice function from the MICE package. Before imputing, I will split the data into training and testing sets to avoid data leakage.
Earlier, I noticed some columns with a majority of their values missing. I will remove all the columns with more than 50% missing values from the imputation process and not use any of these columns in the modelling either. I will also remove all the character columns from the data, since PCA only works with numeric data (and this simplifies the imputation process).
# Remove columns with more than 50% missing values
missing_values <- sapply(train, function(x) sum(is.na(x)))
cols_to_remove <- names(missing_values[missing_values > 0.5 * nrow(train)])
train <- train[, !colnames(train) %in% cols_to_remove]
eval <- eval[, !colnames(eval) %in% cols_to_remove]
# Remove all the character columns
train <- train[, sapply(train, is.numeric)]
eval <- eval[, sapply(eval, is.numeric)]
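# Split the data into training and testing sets
set.seed(1125)
train_index <- createDataPartition(train$SalePrice, p = 0.8, list = FALSE, times = 1)
# Take the hold-out rows before overwriting train; otherwise test would be drawn from the training rows
test <- train[-train_index, ]
train <- train[train_index, ]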
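When splitting, the order of the two subsetting steps matters: the hold-out rows have to be taken while the data frame still contains every row, because the negative index refers to positions in the full data. A minimal base-R sketch on a toy data frame illustrates this. The names `toy`, `idx`, `toy_train`, and `toy_test` are hypothetical, and `sample` stands in for `createDataPartition`.

```r
set.seed(1125)
toy <- data.frame(Id = 1:100,
                  SalePrice = rnorm(100, mean = 180000, sd = 60000))

# Draw 80% of the row positions for training
idx <- sample(nrow(toy), size = 0.8 * nrow(toy))

# Take the hold-out rows first, while `toy` is still the full data
toy_test  <- toy[-idx, ]
toy_train <- toy[idx, ]

nrow(toy_train)
nrow(toy_test)

# No row appears in both sets
length(intersect(toy_train$Id, toy_test$Id))
```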
The MICE package provides a flexible and easy-to-use method for imputing missing values in a dataset. I will use the mice function to impute the missing values in the numeric columns of the training set. Since the target variable should not be used for imputation, I will exclude the SalePrice variable from the imputation process.
# Remove the SalePrice variable from the training and testing sets
train_no_target <- train[, !colnames(train) %in% c("SalePrice")]
test_no_target <- test[, !colnames(test) %in% c("SalePrice")]
# Combine for imputation
combined_data <- rbind(train_no_target, test_no_target, eval)
# Create a data type variable to indicate whether the data is from the training, testing, or evaluation set
data_type <- c(rep("train", nrow(train)),
rep("test", nrow(test)),
rep("eval", nrow(eval)))
# Function to impute missing values
impute_func <- function(data, data_type) {
  # Rows outside the training set are still imputed, but are ignored when fitting the imputation models
  ignore_rows <- data_type != "train"
  ini <- mice(data, maxit = 0, ignore = ignore_rows)
  meth <- ini$method
  imputed_object <- mice(data, method = meth, m = 1, maxit = 30, seed = 1125,
                         ignore = ignore_rows, printFlag = FALSE)
  imputed_data <- complete(imputed_object, 1)
  print(meth)
  return(list(imputed_object = imputed_object, imputed_data = imputed_data))
}
# Call function
results <- impute_func(combined_data, data_type)
## Warning: Number of logged events: 416
## Id MSSubClass LotFrontage LotArea OverallQual
## "" "" "pmm" "" ""
## OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## "" "" "" "pmm" "pmm"
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF X2ndFlrSF
## "pmm" "pmm" "pmm" "" ""
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## "" "" "pmm" "pmm" ""
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## "" "" "" "" ""
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
## "pmm" "pmm" "pmm" "" ""
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal
## "" "" "" "" ""
## MoSold YrSold
## "" ""
# Reintegrate the target variable
reintegrate_targets <- function(imputed_data, original_data, target_vars) {
target_data <- original_data[, target_vars, drop = FALSE]
cbind(imputed_data, target_data)
}
# Combine original data
full_combined_data <- rbind(train, test)
# Reintegrate target
imputed_data_with_targets <- reintegrate_targets(results$imputed_data[data_type != "eval", ], full_combined_data, "SalePrice")
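# Split the data back into training, testing, and evaluation sets
# imputed_data_with_targets holds only the train and test rows, so index it with the matching labels
labelled_type <- data_type[data_type != "eval"]
train_imputed <- imputed_data_with_targets[labelled_type == "train", ]
test_imputed <- imputed_data_with_targets[labelled_type == "test", ]
eval_imputed <- results$imputed_data[data_type == "eval", ]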
# Remove the extra columns
cols_to_remove <- c(".imp", ".id")
train_imputed <- train_imputed[, !colnames(train_imputed) %in% cols_to_remove]
test_imputed <- test_imputed[, !colnames(test_imputed) %in% cols_to_remove]
eval_imputed <- eval_imputed[, !colnames(eval_imputed) %in% cols_to_remove]
To confirm that the imputation was successful, I will check the dimensions of the imputed dataframes and compare them to the original dataframes.
dim(train_imputed)
## [1] 1169 38
dim(test_imputed)
## [1] 229 38
dim(eval_imputed)
## [1] 1459 37
dim(train)
## [1] 1169 38
dim(test)
## [1] 229 38
dim(eval)
## [1] 1459 37
The dimensions of the imputed dataframes match the dimensions of the original dataframes, so no rows or columns were lost during imputation.
Next, I will check whether any missing values remain in the imputed dataframes.
# Check for missing values in the imputed dataframes
sum(is.na(train_imputed))
## [1] 0
sum(is.na(test_imputed))
## [1] 0
sum(is.na(eval_imputed))
## [1] 0
There are now no missing values in any of the imputed dataframes, which confirms that the imputation was successful. Note that the character columns and the columns with more than 50% missing values were removed from all the dataframes before the imputation process.
pca_result_all <- train_imputed %>%
select_if(function(col) length(unique(col)) > 2) %>% # Exclude binary variables
prcomp(scale = TRUE) # Perform PCA
# Summary of PCA result
summary(pca_result_all)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.8105 1.81751 1.59373 1.43613 1.24698 1.09429 1.09332
## Proportion of Variance 0.2079 0.08693 0.06684 0.05428 0.04092 0.03151 0.03146
## Cumulative Proportion 0.2079 0.29480 0.36164 0.41591 0.45683 0.48835 0.51980
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.07343 1.05978 1.04906 1.04087 1.01899 1.00310 0.99072
## Proportion of Variance 0.03032 0.02956 0.02896 0.02851 0.02732 0.02648 0.02583
## Cumulative Proportion 0.55013 0.57968 0.60864 0.63715 0.66448 0.69096 0.71679
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.96551 0.96057 0.92801 0.91377 0.90517 0.89065 0.84954
## Proportion of Variance 0.02453 0.02428 0.02266 0.02197 0.02156 0.02088 0.01899
## Cumulative Proportion 0.74132 0.76560 0.78826 0.81024 0.83180 0.85267 0.87167
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.8246 0.78998 0.75063 0.70955 0.6434 0.62427 0.55898
## Proportion of Variance 0.0179 0.01642 0.01483 0.01325 0.0109 0.01026 0.00822
## Cumulative Proportion 0.8896 0.90598 0.92081 0.93406 0.9450 0.95521 0.96344
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.52646 0.51266 0.44767 0.44632 0.36959 0.35559 0.31314
## Proportion of Variance 0.00729 0.00692 0.00527 0.00524 0.00359 0.00333 0.00258
## Cumulative Proportion 0.97073 0.97765 0.98292 0.98816 0.99176 0.99508 0.99766
## PC36 PC37 PC38
## Standard deviation 0.29797 1.248e-15 4.226e-16
## Proportion of Variance 0.00234 0.000e+00 0.000e+00
## Cumulative Proportion 1.00000 1.000e+00 1.000e+00
The summary of the PCA result shows that 18 principal components capture 80% of the variance, 23 capture 90%, 27 capture 95%, and 33 capture 99%. This means that we can reduce the dimensionality of the data from 38 variables to 18, 23, 27, or 33 principal components while retaining the respective percentage of variance. The last two components have standard deviations that are numerically zero, which points to two exact linear dependencies among the columns, likely because TotalBsmtSF and GrLivArea are each the sum of other area columns. Note also that the 38 input columns include Id and SalePrice, so the components are not built from predictors alone; this should be kept in mind when they are used to model SalePrice. I will store the PCA components for each of the above thresholds for potential use in later modeling.
# Store the PCA components for each threshold
pca_components_80 <- pca_result_all$x[, 1:18]
pca_components_90 <- pca_result_all$x[, 1:23]
pca_components_95 <- pca_result_all$x[, 1:27]
pca_components_99 <- pca_result_all$x[, 1:33]
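The component counts for each threshold can also be derived programmatically, and new rows can be projected onto fitted components with `predict`. The sketch below is an illustration on the built-in `mtcars` data, not the housing data: `mpg` plays the role of the target and is left out of the PCA, and all object names are hypothetical.

```r
# Stand-in data: mpg is treated as the target and excluded from the PCA
predictors <- setdiff(names(mtcars), "mpg")
train_x <- mtcars[1:24, predictors]
test_x  <- mtcars[25:32, predictors]

# Fit the PCA on the training predictors only
pca_fit <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Smallest number of components that reaches each variance threshold
cum_var <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)
thresholds <- c(0.80, 0.90, 0.95, 0.99)
n_comp <- sapply(thresholds, function(th) which(cum_var >= th)[1])
n_comp

# Project new rows using the training centering, scaling, and rotation
test_scores <- predict(pca_fit, newdata = test_x)
dim(test_scores)
```

Fitting on predictors only keeps the target out of the components, and `predict` guarantees that training and new rows share the same coordinate system.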
Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
First, I will shift the variable that is skewed to the right so that the minimum value is above zero. I have saved the TotalBsmtSF variable as X.
# Shift the variable so that the minimum value is above zero
X_shifted <- X - min(X) + 0.1
Next, I will fit an exponential probability density function to the shifted variable using the fitdistr function from the MASS package. I will find the optimal value of \(\lambda\) for this distribution.
# Fit an exponential probability density function to the shifted variable
fit <- fitdistr(X_shifted, densfun = "exponential")
# Optimal value of lambda
lambda <- fit$estimate
lambda
## rate
## 0.0009456001
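For the exponential distribution, the maximum likelihood estimate of the rate is 1 divided by the sample mean, so the fitted \(\lambda\) can be verified by hand. The following sketch uses simulated data, not TotalBsmtSF; `x_sim` and `fit_sim` are illustrative names.

```r
library(MASS)

set.seed(1125)
# Simulated positive, right-skewed data standing in for the shifted variable
x_sim <- rexp(500, rate = 0.001) + 0.1

# Fit the exponential distribution by maximum likelihood
fit_sim <- fitdistr(x_sim, densfun = "exponential")

# The estimated rate equals 1 / mean
unname(fit_sim$estimate)
1 / mean(x_sim)
```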
Next, I will take 1000 samples from this exponential distribution using the optimal value of \(\lambda\) and plot a histogram to compare it with a histogram of the original variable. I will add vertical lines for the mean and mode of both the original variable and the exponential distribution. This will aid in visualizing the differences between the two distributions.
# Take 1000 samples from the exponential distribution
set.seed(1125)
samples <- rexp(1000, lambda)
# Calculate mean and mode of the original variable
mean_original <- mean(X_shifted)
mode_original <- density(X_shifted)$x[which.max(density(X_shifted)$y)]
# Calculate mean and mode of the exponential distribution
mean_exponential <- 1 / lambda # Mean of exponential distribution is 1 / lambda
mode_exponential <- 0 # Mode of exponential distribution is always 0
# Plot a histogram of the original variable and the exponential distribution
ggplot() +
  geom_histogram(aes(X_shifted, fill = "Original Variable"), bins = 30, alpha = 0.7) +
  geom_histogram(aes(samples, fill = "Exponential Distribution"), bins = 30, alpha = 0.7) +
  geom_vline(xintercept = mean_original, linetype = "dashed", color = "blue", linewidth = 1) + # Mean line (original)
  geom_vline(xintercept = mode_original, linetype = "dashed", color = "red", linewidth = 1) + # Mode line (original)
  geom_vline(xintercept = mean_exponential, linetype = "dotted", color = "green", linewidth = 1) + # Mean line (exponential)
  geom_vline(xintercept = mode_exponential, linetype = "dotted", color = "purple", linewidth = 1) + # Mode line (exponential)
  labs(title = "Histogram of Original Variable and Exponential Distribution", x = "Value", y = "Frequency", fill = NULL) +
  # Named values keep each colour tied to the correct legend entry
  scale_fill_manual(values = c("Original Variable" = "#4E79A7",
                               "Exponential Distribution" = "#F28E2B"))
The above plot shows the histograms of the original variable and of the samples drawn from the fitted exponential distribution. The blue dashed line marks the mean of the original variable, the red dashed line its mode (estimated from a kernel density), the green dotted line the mean of the exponential distribution, and the purple dotted line its mode. The two means coincide by construction, since the fitted \(\lambda\) is 1 divided by the sample mean. The modes, however, are very different: the density of an exponential distribution always peaks at 0, whereas the original variable peaks far from 0.
Based on the above plot, the exponential distribution does not fit the original variable well. The mean is preserved, but the overall shape of the distribution is very different.
Next, I will use the exponential probability density function to find the 5th and 95th percentiles using the cumulative distribution function (CDF).
# Find the 5th and 95th percentiles using the exponential CDF
percentile_5 <- qexp(0.05, rate = lambda)
percentile_95 <- qexp(0.95, rate = lambda)
percentile_5
## [1] 54.24417
percentile_95
## [1] 3168.075
The 5th percentile of the exponential distribution is 54.24 and the 95th percentile is 3168.08. This means that 90% of the exponential distribution falls between 54.24 and 3168.08.
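The percentiles of an exponential distribution have a closed form obtained by inverting the CDF \(F(x) = 1 - e^{-\lambda x}\), which gives \(x_p = -\ln(1 - p) / \lambda\). The sketch below checks this against `qexp`, using the rate printed by `fitdistr` above; `lambda_hat`, `q_closed`, and `q_builtin` are illustrative names.

```r
# Rate copied from the fitdistr output above
lambda_hat <- 0.0009456001
p <- c(0.05, 0.95)

# Closed-form quantiles from inverting the exponential CDF
q_closed <- -log(1 - p) / lambda_hat

# Built-in quantile function
q_builtin <- qexp(p, rate = lambda_hat)

round(q_closed, 2)
all.equal(q_closed, q_builtin)

# Applying the CDF to the quantiles recovers 0.05 and 0.95
pexp(q_builtin, rate = lambda_hat)
```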
Next, I will generate a 95% confidence interval for the mean from the empirical data, assuming normality. For this section, I will use the original variable X, not the shifted variable X_shifted. The choice barely matters: the shift adds only 0.1 to every value, which moves both endpoints of the interval by 0.1 and leaves its width unchanged.
# Calculate the standard error
SE <- sd(X) / sqrt(length(X))
# Calculate the margin of error
ME <- qt(0.975, df = length(X) - 1) * SE
# Calculate the confidence interval
CI <- c(mean(X) - ME, mean(X) + ME)
CI
## [1] 1034.908 1079.951
The 95% confidence interval for the mean of the original variable X runs from 1034.91 to 1079.95. This means that we are 95% confident that the true mean of X lies between 1034.91 and 1079.95. As before, I will double check the result using the t.test function in R.
# Generate a 95% confidence interval assuming normality
conf_interval <- t.test(X)$conf.int
# Print the confidence interval
cat("95% Confidence Interval (Normality assumption): [", conf_interval[1], ",", conf_interval[2], "]\n")
## 95% Confidence Interval (Normality assumption): [ 1034.908 , 1079.951 ]
The results are the same as before.
Finally, I will calculate the empirical 5th and 95th percentiles of the original variable X.
# Calculate the 5th and 95th percentiles for variable "X"
percentile_5th <- quantile(X, probs = 0.05)
percentile_95th <- quantile(X, probs = 0.95)
percentile_5th
## 5%
## 519.3
percentile_95th
## 95%
## 1753
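The empirical 5th percentile of the original variable X is 519.3 and the 95th percentile is 1753. This means that 90% of the data falls between 519.3 and 1753. This range is much narrower than the 54.24 to 3168.08 implied by the fitted exponential distribution, which again indicates that the exponential model is a poor fit for TotalBsmtSF. The confidence interval of 1034.91 to 1079.95 is narrower still because it describes the uncertainty in the mean, not the spread of individual values.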
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
For this question, I will build a linear regression model to predict the SalePrice variable using the TotalBsmtSF variable. These were selected in the previous sections as the X and Y variables. I will then build a regression model using all the numeric variables, a regression model using the top correlated variables, and regression models using the PCA components. I will predict the SalePrice variable against the testing set and calculate the RMSE and R-squared values for each model.
# Build a linear regression model
model <- lm(SalePrice ~ TotalBsmtSF, train_imputed)
# Print the model summary
summary(model)
##
## Call:
## lm(formula = SalePrice ~ TotalBsmtSF, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -176177 -40006 -14862 35293 403093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51726.570 4876.258 10.61 <2e-16 ***
## TotalBsmtSF 122.823 4.299 28.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61630 on 1167 degrees of freedom
## Multiple R-squared: 0.4116, Adjusted R-squared: 0.4111
## F-statistic: 816.4 on 1 and 1167 DF, p-value: < 2.2e-16
# Build a regression model using all the numeric variables
model_all <- lm(SalePrice ~ ., train_imputed)
# Print the model summary
summary(model_all)
##
## Call:
## lm(formula = SalePrice ~ ., data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138346 -15955 -1147 14172 217521
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.300e+05 1.338e+06 -0.321 0.74806
## Id -2.309e-02 2.069e+00 -0.011 0.99110
## MSSubClass -1.382e+02 2.680e+01 -5.158 2.95e-07 ***
## LotFrontage 5.356e+01 4.771e+01 1.123 0.26185
## LotArea 4.872e-01 9.231e-02 5.278 1.57e-07 ***
## OverallQual 1.530e+04 1.147e+03 13.344 < 2e-16 ***
## OverallCond 4.706e+03 9.553e+02 4.926 9.64e-07 ***
## YearBuilt 3.544e+02 6.718e+01 5.275 1.59e-07 ***
## YearRemodAdd 1.584e+02 6.469e+01 2.449 0.01446 *
## MasVnrArea 3.315e+01 5.704e+00 5.813 7.98e-09 ***
## BsmtFinSF1 4.155e+01 4.535e+00 9.162 < 2e-16 ***
## BsmtFinSF2 1.827e+01 6.642e+00 2.751 0.00603 **
## BsmtUnfSF 2.097e+01 3.958e+00 5.298 1.41e-07 ***
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 7.026e+01 5.631e+00 12.477 < 2e-16 ***
## X2ndFlrSF 7.863e+01 4.748e+00 16.560 < 2e-16 ***
## LowQualFinSF 2.244e+01 1.749e+01 1.283 0.19974
## GrLivArea NA NA NA NA
## BsmtFullBath 1.210e+03 2.471e+03 0.489 0.62461
## BsmtHalfBath -8.845e+02 3.792e+03 -0.233 0.81559
## FullBath -3.003e+03 2.662e+03 -1.128 0.25948
## HalfBath -6.784e+03 2.535e+03 -2.676 0.00755 **
## BedroomAbvGr -1.375e+04 1.625e+03 -8.460 < 2e-16 ***
## KitchenAbvGr -1.119e+04 4.906e+03 -2.280 0.02279 *
## TotRmsAbvGrd 2.969e+03 1.194e+03 2.488 0.01300 *
## Fireplaces 2.713e+01 1.688e+03 0.016 0.98718
## GarageYrBlt -7.529e+00 7.052e+01 -0.107 0.91500
## GarageCars 2.039e+03 2.724e+03 0.749 0.45425
## GarageArea 1.344e+01 9.504e+00 1.414 0.15761
## WoodDeckSF 1.631e+01 7.579e+00 2.152 0.03158 *
## OpenPorchSF 1.838e+01 1.433e+01 1.283 0.19982
## EnclosedPorch -1.422e+01 1.634e+01 -0.870 0.38453
## X3SsnPorch -4.598e-01 2.754e+01 -0.017 0.98668
## ScreenPorch 2.071e+01 1.617e+01 1.280 0.20067
## PoolArea 5.610e+01 2.089e+01 2.685 0.00735 **
## MiscVal -1.505e+00 1.581e+00 -0.952 0.34139
## MoSold -1.331e+02 3.276e+02 -0.406 0.68454
## YrSold -3.106e+02 6.663e+02 -0.466 0.64118
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29400 on 1133 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.866
## F-statistic: 216.7 on 35 and 1133 DF, p-value: < 2.2e-16
# Build a regression model using the top correlated variables
model_top <- lm(SalePrice ~ ., train_imputed[, c(top_correlated_variables, "SalePrice")])
# Print the model summary
summary(model_top)
##
## Call:
## lm(formula = SalePrice ~ ., data = train_imputed[, c(top_correlated_variables,
## "SalePrice")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -146424 -19747 -1269 17147 250127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.138e+05 5.012e+03 -22.707 < 2e-16 ***
## OverallQual 2.252e+04 1.109e+03 20.302 < 2e-16 ***
## GrLivArea 5.560e+01 2.737e+00 20.314 < 2e-16 ***
## GarageCars 4.061e+03 3.114e+03 1.304 0.192471
## GarageArea 3.974e+01 1.061e+01 3.747 0.000187 ***
## TotalBsmtSF 3.657e+01 4.378e+00 8.352 < 2e-16 ***
## X1stFlrSF 7.743e+00 5.014e+00 1.544 0.122807
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35180 on 1162 degrees of freedom
## Multiple R-squared: 0.8091, Adjusted R-squared: 0.8081
## F-statistic: 820.8 on 6 and 1162 DF, p-value: < 2.2e-16
# Build a regression model using the first 18 PCA components (80% of the variance)
pca_80 <- cbind(SalePrice = train_imputed$SalePrice, pca_components_80)
model_pca_80 <- lm(SalePrice ~ ., as.data.frame(pca_80))
# Print the model summary
summary(model_pca_80)
##
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_80))
##
## Residuals:
## Min 1Q Median 3Q Max
## -111363 -16440 -1587 14109 221191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181183.60 848.23 213.603 < 2e-16 ***
## PC1 25951.12 301.93 85.950 < 2e-16 ***
## PC2 1583.71 466.90 3.392 0.000717 ***
## PC3 -4335.87 532.46 -8.143 9.94e-16 ***
## PC4 5096.21 590.89 8.625 < 2e-16 ***
## PC5 -6243.57 680.52 -9.175 < 2e-16 ***
## PC6 -307.08 775.47 -0.396 0.692185
## PC7 -3608.17 776.16 -4.649 3.73e-06 ***
## PC8 -7004.98 790.54 -8.861 < 2e-16 ***
## PC9 4050.73 800.72 5.059 4.91e-07 ***
## PC10 2065.12 808.90 2.553 0.010809 *
## PC11 839.59 815.27 1.030 0.303305
## PC12 4047.54 832.78 4.860 1.33e-06 ***
## PC13 1534.67 845.97 1.814 0.069923 .
## PC14 1425.41 856.54 1.664 0.096356 .
## PC15 1616.89 878.90 1.840 0.066073 .
## PC16 -3068.96 883.42 -3.474 0.000532 ***
## PC17 64.63 914.42 0.071 0.943661
## PC18 2135.54 928.67 2.300 0.021651 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29000 on 1150 degrees of freedom
## Multiple R-squared: 0.8716, Adjusted R-squared: 0.8696
## F-statistic: 433.8 on 18 and 1150 DF, p-value: < 2.2e-16
# Build a regression model using the first 23 PCA components (90% of the variance)
pca_90 <- cbind(SalePrice = train_imputed$SalePrice, pca_components_90)
model_pca_90 <- lm(SalePrice ~ ., as.data.frame(pca_90))
# Print the model summary
summary(model_pca_90)
##
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_90))
##
## Residuals:
## Min 1Q Median 3Q Max
## -115735 -15300 -1922 13700 204129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181183.60 823.39 220.047 < 2e-16 ***
## PC1 25951.12 293.09 88.542 < 2e-16 ***
## PC2 1583.71 453.22 3.494 0.000493 ***
## PC3 -4335.87 516.86 -8.389 < 2e-16 ***
## PC4 5096.21 573.58 8.885 < 2e-16 ***
## PC5 -6243.57 660.59 -9.452 < 2e-16 ***
## PC6 -307.08 752.76 -0.408 0.683397
## PC7 -3608.17 753.43 -4.789 1.90e-06 ***
## PC8 -7004.98 767.39 -9.128 < 2e-16 ***
## PC9 4050.73 777.27 5.211 2.22e-07 ***
## PC10 2065.12 785.22 2.630 0.008653 **
## PC11 839.59 791.40 1.061 0.288959
## PC12 4047.54 808.39 5.007 6.40e-07 ***
## PC13 1534.67 821.20 1.869 0.061902 .
## PC14 1425.41 831.46 1.714 0.086736 .
## PC15 1616.89 853.16 1.895 0.058321 .
## PC16 -3068.96 857.55 -3.579 0.000360 ***
## PC17 64.63 887.64 0.073 0.941965
## PC18 2135.54 901.47 2.369 0.018004 *
## PC19 1480.49 910.04 1.627 0.104046
## PC20 1952.95 924.87 2.112 0.034937 *
## PC21 -5620.63 969.63 -5.797 8.74e-09 ***
## PC22 -5760.86 998.91 -5.767 1.04e-08 ***
## PC23 1260.73 1042.74 1.209 0.226889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28150 on 1145 degrees of freedom
## Multiple R-squared: 0.8796, Adjusted R-squared: 0.8771
## F-statistic: 363.5 on 23 and 1145 DF, p-value: < 2.2e-16
# Build a regression model using the first 27 PCA components (95% of the variance)
pca_95 <- cbind(SalePrice = train_imputed$SalePrice, pca_components_95)
model_pca_95 <- lm(SalePrice ~ ., as.data.frame(pca_95))
# Print the model summary
summary(model_pca_95)
##
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_95))
##
## Residuals:
## Min 1Q Median 3Q Max
## -110355 -13723 -989 12948 162820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181183.60 716.46 252.886 < 2e-16 ***
## PC1 25951.12 255.03 101.756 < 2e-16 ***
## PC2 1583.71 394.37 4.016 6.31e-05 ***
## PC3 -4335.87 449.75 -9.641 < 2e-16 ***
## PC4 5096.21 499.10 10.211 < 2e-16 ***
## PC5 -6243.57 574.81 -10.862 < 2e-16 ***
## PC6 -307.08 655.01 -0.469 0.639291
## PC7 -3608.17 655.59 -5.504 4.59e-08 ***
## PC8 -7004.98 667.74 -10.491 < 2e-16 ***
## PC9 4050.73 676.34 5.989 2.82e-09 ***
## PC10 2065.12 683.25 3.022 0.002563 **
## PC11 839.59 688.63 1.219 0.223010
## PC12 4047.54 703.41 5.754 1.12e-08 ***
## PC13 1534.67 714.56 2.148 0.031946 *
## PC14 1425.41 723.49 1.970 0.049058 *
## PC15 1616.89 742.37 2.178 0.029610 *
## PC16 -3068.96 746.19 -4.113 4.19e-05 ***
## PC17 64.63 772.37 0.084 0.933323
## PC18 2135.54 784.41 2.722 0.006578 **
## PC19 1480.49 791.87 1.870 0.061791 .
## PC20 1952.95 804.77 2.427 0.015390 *
## PC21 -5620.63 843.71 -6.662 4.20e-11 ***
## PC22 -5760.86 869.19 -6.628 5.24e-11 ***
## PC23 1260.73 907.33 1.389 0.164953
## PC24 -252.87 954.89 -0.265 0.791197
## PC25 3477.78 1010.17 3.443 0.000597 ***
## PC26 20663.82 1113.96 18.550 < 2e-16 ***
## PC27 4481.66 1148.17 3.903 0.000100 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24500 on 1141 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.907
## F-statistic: 422.8 on 27 and 1141 DF, p-value: < 2.2e-16
# Build a regression model using the first 33 PCA components (99% of the variance)
pca_99 <- cbind(SalePrice = train_imputed$SalePrice, pca_components_99)
model_pca_99 <- lm(SalePrice ~ ., as.data.frame(pca_99))
# Print the model summary
summary(model_pca_99)
##
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_99))
##
## Residuals:
## Min 1Q Median 3Q Max
## -83702 -10816 82 10788 140491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181183.60 559.70 323.716 < 2e-16 ***
## PC1 25951.12 199.23 130.257 < 2e-16 ***
## PC2 1583.71 308.08 5.141 3.22e-07 ***
## PC3 -4335.87 351.34 -12.341 < 2e-16 ***
## PC4 5096.21 389.89 13.071 < 2e-16 ***
## PC5 -6243.57 449.04 -13.904 < 2e-16 ***
## PC6 -307.08 511.69 -0.600 0.548542
## PC7 -3608.17 512.14 -7.045 3.20e-12 ***
## PC8 -7004.98 521.63 -13.429 < 2e-16 ***
## PC9 4050.73 528.35 7.667 3.78e-14 ***
## PC10 2065.12 533.75 3.869 0.000115 ***
## PC11 839.59 537.95 1.561 0.118869
## PC12 4047.54 549.51 7.366 3.38e-13 ***
## PC13 1534.67 558.21 2.749 0.006067 **
## PC14 1425.41 565.18 2.522 0.011804 *
## PC15 1616.89 579.94 2.788 0.005391 **
## PC16 -3068.96 582.92 -5.265 1.68e-07 ***
## PC17 64.63 603.38 0.107 0.914711
## PC18 2135.54 612.78 3.485 0.000511 ***
## PC19 1480.49 618.60 2.393 0.016860 *
## PC20 1952.95 628.68 3.106 0.001941 **
## PC21 -5620.63 659.11 -8.528 < 2e-16 ***
## PC22 -5760.86 679.01 -8.484 < 2e-16 ***
## PC23 1260.73 708.80 1.779 0.075560 .
## PC24 -252.87 745.95 -0.339 0.734680
## PC25 3477.78 789.14 4.407 1.15e-05 ***
## PC26 20663.82 870.22 23.745 < 2e-16 ***
## PC27 4481.66 896.94 4.997 6.75e-07 ***
## PC28 6428.26 1001.72 6.417 2.03e-10 ***
## PC29 -2626.69 1063.59 -2.470 0.013671 *
## PC30 -17238.56 1092.23 -15.783 < 2e-16 ***
## PC31 -10383.98 1250.79 -8.302 2.88e-16 ***
## PC32 8577.79 1254.58 6.837 1.32e-11 ***
## PC33 27212.41 1515.03 17.962 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19140 on 1135 degrees of freedom
## Multiple R-squared: 0.9448, Adjusted R-squared: 0.9432
## F-statistic: 589 on 33 and 1135 DF, p-value: < 2.2e-16
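Before predicting with the PCA models, I need to apply the PCA fitted on the training set to the testing set. I will project the testing set onto the training components, using the training centering, scaling, and rotation, rather than fitting a new PCA.
# Project the testing set onto the principal components fitted on the training set
pca_test <- predict(pca_result_all, newdata = test_imputed)
# Add the PCA components to the testing set
test_imputed <- cbind(test_imputed, pca_test)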
# Predict the SalePrice using the linear regression model
predictions <- predict(model, test_imputed)
# Calculate the RMSE
rmse_model1 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
predictions <- predict(model_all, test_imputed)
# Calculate the RMSE
rmse_model_all <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
predictions <- predict(model_top, test_imputed)
# Calculate the RMSE
rmse_model_top <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
predictions <- predict(model_pca_80, test_imputed)
# Calculate the RMSE
rmse_model_pca_80 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
predictions <- predict(model_pca_90, test_imputed)
# Calculate the RMSE
rmse_model_pca_90 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
predictions <- predict(model_pca_95, test_imputed)
# Calculate the RMSE
rmse_model_pca_95 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
predictions <- predict(model_pca_99, test_imputed)
# Calculate the RMSE
rmse_model_pca_99 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
# Combine each model's RMSE, R-squared, and AIC into one data frame
model_results <- data.frame(
  Model = c("TotalBsmtSF", "All Numeric Variables", "Top Correlated Variables", "PCA 80", "PCA 90", "PCA 95", "PCA 99"),
  RMSE = c(rmse_model1, rmse_model_all, rmse_model_top, rmse_model_pca_80, rmse_model_pca_90, rmse_model_pca_95, rmse_model_pca_99),
  R_squared = c(summary(model)$r.squared, summary(model_all)$r.squared, summary(model_top)$r.squared, summary(model_pca_80)$r.squared, summary(model_pca_90)$r.squared, summary(model_pca_95)$r.squared, summary(model_pca_99)$r.squared),
  AIC = c(AIC(model), AIC(model_all), AIC(model_top), AIC(model_pca_80), AIC(model_pca_90), AIC(model_pca_95), AIC(model_pca_99)))
model_results
## Model RMSE R_squared AIC
## 1 TotalBsmtSF 61202.28 0.4116120 29107.16
## 2 All Numeric Variables 27467.64 0.8700424 27409.76
## 3 Top Correlated Variables 33713.83 0.8090931 27801.33
## 4 PCA 80 147295.33 0.8716175 27361.50
## 5 PCA 90 147595.97 0.8795522 27296.92
## 6 PCA 95 148919.64 0.9091219 26975.62
## 7 PCA 99 150560.87 0.9448317 26404.14
The table above shows that adding PCA components beyond those explaining 80% of the variance does not improve the test RMSE, which rises slightly from 147,295 to 150,561, even though the training R-squared and AIC keep improving. This allows us to eliminate the extra PCA models. The model with the lowest RMSE was the model using all the numeric variables, and the model using the top correlated variables had the second-lowest RMSE. The PCA models had the highest R-squared and the lowest AIC, but their test RMSE is roughly five times that of the all-numeric model. I will work on improving all three of these models (all numeric variables, top correlated variables, and PCA) and see which we can improve the most.
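The comparison above repeats the same predict-then-RMSE step for every model. A minimal sketch of wrapping that step in a helper is shown below; the function name rmse_on and the mtcars stand-in data are assumptions for illustration, not part of this analysis.

```r
# Hypothetical helper: test-set RMSE for one fitted model
rmse_on <- function(model, newdata, response) {
  sqrt(mean((newdata[[response]] - predict(model, newdata))^2))
}

# Illustrative stand-in data and models (not the housing data)
set.seed(1)
idx <- sample(nrow(mtcars), 22)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

models <- list(
  "wt only" = lm(mpg ~ wt, data = train),
  "wt + hp" = lm(mpg ~ wt + hp, data = train)
)

# One row per model: test RMSE, training R-squared, and AIC
results <- data.frame(
  Model     = names(models),
  RMSE      = vapply(models, rmse_on, numeric(1),
                     newdata = test, response = "mpg", USE.NAMES = FALSE),
  R_squared = vapply(models, function(m) summary(m)$r.squared,
                     numeric(1), USE.NAMES = FALSE),
  AIC       = vapply(models, AIC, numeric(1), USE.NAMES = FALSE)
)
results
```

Keeping RMSE (measured on held-out rows) next to R-squared and AIC (measured on training rows) makes it easier to spot models that fit the training data well but predict poorly.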
We start by applying backward elimination to the model using all the numeric variables. This should simplify what is likely the most complicated model, the one with the most variables. Dropping redundant predictors can also reduce multicollinearity, which lessens the need for PCA.
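As a small illustration of backward elimination, the sketch below runs step() on the built-in mtcars data (a stand-in, not the housing data) with trace = 0 to suppress the per-step output. It also shows why the AIC values printed by step() sit on a different scale from those returned by AIC(): step() relies on extractAIC(), which for lm fits omits a constant that depends only on the number of observations.

```r
# Full model on stand-in data
full_fit <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC; trace = 0 hides the step-by-step tables
back_fit <- step(full_fit, direction = "backward", trace = 0)
formula(back_fit)

# AIC() uses the full Gaussian log-likelihood and counts the error variance
# as a parameter; extractAIC() uses n * log(RSS / n) + 2 * edf.
n <- nrow(mtcars)
aic_gap <- AIC(full_fit) - extractAIC(full_fit)[2]
all.equal(aic_gap, n * (1 + log(2 * pi)) + 2)
```

Because the offset is the same for every model fit to the same rows, both scales rank the candidate models identically. With the 1,169 training rows used here, the offset works out to about 3,319.48, which matches the gap between the AIC of 27409.76 in the results table and the starting AIC of 24090.28 in the step() trace.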
# Perform backwards elimination on the model using all the numeric variables
model_all_backwards <- step(model_all)
## Start: AIC=24090.28
## SalePrice ~ Id + MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF +
## LowQualFinSF + GrLivArea + BsmtFullBath + BsmtHalfBath +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch +
## PoolArea + MiscVal + MoSold + YrSold
##
##
## Step: AIC=24090.28
## SalePrice ~ Id + MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF +
## LowQualFinSF + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF +
## EnclosedPorch + X3SsnPorch + ScreenPorch + PoolArea + MiscVal +
## MoSold + YrSold
##
##
## Step: AIC=24090.28
## SalePrice ~ Id + MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt +
## GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch +
## X3SsnPorch + ScreenPorch + PoolArea + MiscVal + MoSold +
## YrSold
##
## Df Sum of Sq RSS AIC
## - Id 1 1.0760e+05 9.7911e+11 24088
## - Fireplaces 1 2.2312e+05 9.7911e+11 24088
## - X3SsnPorch 1 2.4093e+05 9.7911e+11 24088
## - GarageYrBlt 1 9.8503e+06 9.7912e+11 24088
## - BsmtHalfBath 1 4.7025e+07 9.7916e+11 24088
## - MoSold 1 1.4271e+08 9.7925e+11 24089
## - YrSold 1 1.8780e+08 9.7930e+11 24089
## - BsmtFullBath 1 2.0703e+08 9.7932e+11 24089
## - GarageCars 1 4.8430e+08 9.7959e+11 24089
## - EnclosedPorch 1 6.5397e+08 9.7976e+11 24089
## - MiscVal 1 7.8292e+08 9.7989e+11 24089
## - LotFrontage 1 1.0891e+09 9.8020e+11 24090
## - FullBath 1 1.0999e+09 9.8021e+11 24090
## - ScreenPorch 1 1.4167e+09 9.8053e+11 24090
## - OpenPorchSF 1 1.4221e+09 9.8053e+11 24090
## - LowQualFinSF 1 1.4226e+09 9.8053e+11 24090
## <none> 9.7911e+11 24090
## - GarageArea 1 1.7280e+09 9.8084e+11 24090
## - WoodDeckSF 1 4.0035e+09 9.8311e+11 24093
## - KitchenAbvGr 1 4.4923e+09 9.8360e+11 24094
## - YearRemodAdd 1 5.1843e+09 9.8429e+11 24095
## - TotRmsAbvGrd 1 5.3478e+09 9.8446e+11 24095
## - HalfBath 1 6.1897e+09 9.8530e+11 24096
## - PoolArea 1 6.2311e+09 9.8534e+11 24096
## - BsmtFinSF2 1 6.5422e+09 9.8565e+11 24096
## - OverallCond 1 2.0970e+10 1.0001e+12 24113
## - MSSubClass 1 2.2988e+10 1.0021e+12 24115
## - YearBuilt 1 2.4047e+10 1.0032e+12 24117
## - LotArea 1 2.4070e+10 1.0032e+12 24117
## - BsmtUnfSF 1 2.4254e+10 1.0034e+12 24117
## - MasVnrArea 1 2.9199e+10 1.0083e+12 24123
## - BedroomAbvGr 1 6.1854e+10 1.0410e+12 24160
## - BsmtFinSF1 1 7.2542e+10 1.0517e+12 24172
## - X1stFlrSF 1 1.3453e+11 1.1136e+12 24239
## - OverallQual 1 1.5387e+11 1.1330e+12 24259
## - X2ndFlrSF 1 2.3698e+11 1.2161e+12 24342
##
## Step: AIC=24088.28
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt +
## GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch +
## X3SsnPorch + ScreenPorch + PoolArea + MiscVal + MoSold +
## YrSold
##
## Df Sum of Sq RSS AIC
## - X3SsnPorch 1 2.2817e+05 9.7911e+11 24086
## - Fireplaces 1 2.2972e+05 9.7911e+11 24086
## - GarageYrBlt 1 9.9144e+06 9.7912e+11 24086
## - BsmtHalfBath 1 4.7023e+07 9.7916e+11 24086
## - MoSold 1 1.4315e+08 9.7925e+11 24087
## - YrSold 1 1.8791e+08 9.7930e+11 24087
## - BsmtFullBath 1 2.0693e+08 9.7932e+11 24087
## - GarageCars 1 4.8419e+08 9.7959e+11 24087
## - EnclosedPorch 1 6.5389e+08 9.7976e+11 24087
## - MiscVal 1 7.8291e+08 9.7989e+11 24087
## - LotFrontage 1 1.0908e+09 9.8020e+11 24088
## - FullBath 1 1.1002e+09 9.8021e+11 24088
## - ScreenPorch 1 1.4169e+09 9.8053e+11 24088
## - OpenPorchSF 1 1.4225e+09 9.8053e+11 24088
## - LowQualFinSF 1 1.4285e+09 9.8054e+11 24088
## <none> 9.7911e+11 24088
## - GarageArea 1 1.7281e+09 9.8084e+11 24088
## - WoodDeckSF 1 4.0133e+09 9.8312e+11 24091
## - KitchenAbvGr 1 4.5008e+09 9.8361e+11 24092
## - YearRemodAdd 1 5.1884e+09 9.8430e+11 24093
## - TotRmsAbvGrd 1 5.3483e+09 9.8446e+11 24093
## - HalfBath 1 6.2064e+09 9.8532e+11 24094
## - PoolArea 1 6.2525e+09 9.8536e+11 24094
## - BsmtFinSF2 1 6.5558e+09 9.8567e+11 24094
## - OverallCond 1 2.0995e+10 1.0001e+12 24111
## - MSSubClass 1 2.2997e+10 1.0021e+12 24113
## - YearBuilt 1 2.4050e+10 1.0032e+12 24115
## - LotArea 1 2.4098e+10 1.0032e+12 24115
## - BsmtUnfSF 1 2.4271e+10 1.0034e+12 24115
## - MasVnrArea 1 2.9284e+10 1.0084e+12 24121
## - BedroomAbvGr 1 6.1931e+10 1.0410e+12 24158
## - BsmtFinSF1 1 7.2623e+10 1.0517e+12 24170
## - X1stFlrSF 1 1.3503e+11 1.1141e+12 24237
## - OverallQual 1 1.5439e+11 1.1335e+12 24257
## - X2ndFlrSF 1 2.3702e+11 1.2161e+12 24340
##
## Step: AIC=24086.28
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt +
## GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch +
## ScreenPorch + PoolArea + MiscVal + MoSold + YrSold
##
## Df Sum of Sq RSS AIC
## - Fireplaces 1 2.3774e+05 9.7911e+11 24084
## - GarageYrBlt 1 9.9313e+06 9.7912e+11 24084
## - BsmtHalfBath 1 4.7199e+07 9.7916e+11 24084
## - MoSold 1 1.4381e+08 9.7925e+11 24085
## - YrSold 1 1.8865e+08 9.7930e+11 24085
## - BsmtFullBath 1 2.0681e+08 9.7932e+11 24085
## - GarageCars 1 4.8397e+08 9.7959e+11 24085
## - EnclosedPorch 1 6.5409e+08 9.7976e+11 24085
## - MiscVal 1 7.8306e+08 9.7989e+11 24085
## - LotFrontage 1 1.0978e+09 9.8021e+11 24086
## - FullBath 1 1.1024e+09 9.8021e+11 24086
## - ScreenPorch 1 1.4207e+09 9.8053e+11 24086
## - OpenPorchSF 1 1.4257e+09 9.8054e+11 24086
## - LowQualFinSF 1 1.4283e+09 9.8054e+11 24086
## <none> 9.7911e+11 24086
## - GarageArea 1 1.7304e+09 9.8084e+11 24086
## - WoodDeckSF 1 4.0290e+09 9.8314e+11 24089
## - KitchenAbvGr 1 4.5025e+09 9.8361e+11 24090
## - YearRemodAdd 1 5.1882e+09 9.8430e+11 24091
## - TotRmsAbvGrd 1 5.3556e+09 9.8447e+11 24091
## - HalfBath 1 6.2082e+09 9.8532e+11 24092
## - PoolArea 1 6.2534e+09 9.8536e+11 24092
## - BsmtFinSF2 1 6.5683e+09 9.8568e+11 24092
## - OverallCond 1 2.1005e+10 1.0001e+12 24109
## - MSSubClass 1 2.3000e+10 1.0021e+12 24111
## - YearBuilt 1 2.4056e+10 1.0032e+12 24113
## - LotArea 1 2.4104e+10 1.0032e+12 24113
## - BsmtUnfSF 1 2.4271e+10 1.0034e+12 24113
## - MasVnrArea 1 2.9286e+10 1.0084e+12 24119
## - BedroomAbvGr 1 6.2029e+10 1.0411e+12 24156
## - BsmtFinSF1 1 7.2626e+10 1.0517e+12 24168
## - X1stFlrSF 1 1.3534e+11 1.1144e+12 24236
## - OverallQual 1 1.5441e+11 1.1335e+12 24256
## - X2ndFlrSF 1 2.3711e+11 1.2162e+12 24338
##
## Step: AIC=24084.28
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + GarageYrBlt + GarageCars +
## GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## PoolArea + MiscVal + MoSold + YrSold
##
## Df Sum of Sq RSS AIC
## - GarageYrBlt 1 1.0335e+07 9.7912e+11 24082
## - BsmtHalfBath 1 4.7089e+07 9.7916e+11 24082
## - MoSold 1 1.4360e+08 9.7925e+11 24083
## - YrSold 1 1.8850e+08 9.7930e+11 24083
## - BsmtFullBath 1 2.0696e+08 9.7932e+11 24083
## - GarageCars 1 4.9098e+08 9.7960e+11 24083
## - EnclosedPorch 1 6.5503e+08 9.7977e+11 24083
## - MiscVal 1 7.8282e+08 9.7989e+11 24083
## - LotFrontage 1 1.1005e+09 9.8021e+11 24084
## - FullBath 1 1.1045e+09 9.8022e+11 24084
## - OpenPorchSF 1 1.4281e+09 9.8054e+11 24084
## - LowQualFinSF 1 1.4281e+09 9.8054e+11 24084
## - ScreenPorch 1 1.4491e+09 9.8056e+11 24084
## <none> 9.7911e+11 24084
## - GarageArea 1 1.7401e+09 9.8085e+11 24084
## - WoodDeckSF 1 4.0549e+09 9.8317e+11 24087
## - KitchenAbvGr 1 4.5869e+09 9.8370e+11 24088
## - YearRemodAdd 1 5.2048e+09 9.8432e+11 24089
## - TotRmsAbvGrd 1 5.3564e+09 9.8447e+11 24089
## - HalfBath 1 6.2229e+09 9.8533e+11 24090
## - PoolArea 1 6.2539e+09 9.8536e+11 24090
## - BsmtFinSF2 1 6.5733e+09 9.8568e+11 24090
## - OverallCond 1 2.1011e+10 1.0001e+12 24107
## - MSSubClass 1 2.3030e+10 1.0021e+12 24110
## - YearBuilt 1 2.4057e+10 1.0032e+12 24111
## - BsmtUnfSF 1 2.4368e+10 1.0035e+12 24111
## - LotArea 1 2.4463e+10 1.0036e+12 24111
## - MasVnrArea 1 2.9287e+10 1.0084e+12 24117
## - BedroomAbvGr 1 6.2722e+10 1.0418e+12 24155
## - BsmtFinSF1 1 7.2653e+10 1.0518e+12 24166
## - X1stFlrSF 1 1.4354e+11 1.1226e+12 24242
## - OverallQual 1 1.5753e+11 1.1366e+12 24257
## - X2ndFlrSF 1 2.4040e+11 1.2195e+12 24339
##
## Step: AIC=24082.29
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + ScreenPorch + PoolArea + MiscVal +
## MoSold + YrSold
##
## Df Sum of Sq RSS AIC
## - BsmtHalfBath 1 4.6204e+07 9.7917e+11 24080
## - MoSold 1 1.4295e+08 9.7926e+11 24081
## - YrSold 1 1.8891e+08 9.7931e+11 24081
## - BsmtFullBath 1 2.0933e+08 9.7933e+11 24081
## - GarageCars 1 4.8691e+08 9.7961e+11 24081
## - EnclosedPorch 1 6.5544e+08 9.7978e+11 24081
## - MiscVal 1 7.8220e+08 9.7990e+11 24081
## - FullBath 1 1.1168e+09 9.8024e+11 24082
## - LotFrontage 1 1.1313e+09 9.8025e+11 24082
## - LowQualFinSF 1 1.4204e+09 9.8054e+11 24082
## - OpenPorchSF 1 1.4252e+09 9.8055e+11 24082
## - ScreenPorch 1 1.4565e+09 9.8058e+11 24082
## <none> 9.7912e+11 24082
## - GarageArea 1 1.8013e+09 9.8092e+11 24082
## - WoodDeckSF 1 4.0523e+09 9.8317e+11 24085
## - KitchenAbvGr 1 4.5766e+09 9.8370e+11 24086
## - TotRmsAbvGrd 1 5.3475e+09 9.8447e+11 24087
## - YearRemodAdd 1 5.3901e+09 9.8451e+11 24087
## - HalfBath 1 6.2129e+09 9.8533e+11 24088
## - PoolArea 1 6.2820e+09 9.8540e+11 24088
## - BsmtFinSF2 1 6.5733e+09 9.8569e+11 24088
## - OverallCond 1 2.1183e+10 1.0003e+12 24105
## - MSSubClass 1 2.3113e+10 1.0022e+12 24108
## - BsmtUnfSF 1 2.4366e+10 1.0035e+12 24109
## - LotArea 1 2.4520e+10 1.0036e+12 24109
## - MasVnrArea 1 2.9422e+10 1.0085e+12 24115
## - YearBuilt 1 3.2040e+10 1.0112e+12 24118
## - BedroomAbvGr 1 6.2713e+10 1.0418e+12 24153
## - BsmtFinSF1 1 7.2706e+10 1.0518e+12 24164
## - X1stFlrSF 1 1.4483e+11 1.1240e+12 24242
## - OverallQual 1 1.5752e+11 1.1366e+12 24255
## - X2ndFlrSF 1 2.4108e+11 1.2202e+12 24338
##
## Step: AIC=24080.35
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF +
## EnclosedPorch + ScreenPorch + PoolArea + MiscVal + MoSold +
## YrSold
##
## Df Sum of Sq RSS AIC
## - MoSold 1 1.4456e+08 9.7931e+11 24079
## - YrSold 1 1.8021e+08 9.7935e+11 24079
## - BsmtFullBath 1 2.9603e+08 9.7946e+11 24079
## - GarageCars 1 4.8616e+08 9.7965e+11 24079
## - EnclosedPorch 1 6.6299e+08 9.7983e+11 24079
## - MiscVal 1 7.7039e+08 9.7994e+11 24079
## - FullBath 1 1.0886e+09 9.8026e+11 24080
## - LotFrontage 1 1.1102e+09 9.8028e+11 24080
## - LowQualFinSF 1 1.4231e+09 9.8059e+11 24080
## - OpenPorchSF 1 1.4278e+09 9.8059e+11 24080
## - ScreenPorch 1 1.4521e+09 9.8062e+11 24080
## <none> 9.7917e+11 24080
## - GarageArea 1 1.8071e+09 9.8097e+11 24081
## - WoodDeckSF 1 4.0239e+09 9.8319e+11 24083
## - KitchenAbvGr 1 4.5612e+09 9.8373e+11 24084
## - YearRemodAdd 1 5.3545e+09 9.8452e+11 24085
## - TotRmsAbvGrd 1 5.3805e+09 9.8455e+11 24085
## - HalfBath 1 6.1830e+09 9.8535e+11 24086
## - PoolArea 1 6.2674e+09 9.8543e+11 24086
## - BsmtFinSF2 1 6.5314e+09 9.8570e+11 24086
## - OverallCond 1 2.1178e+10 1.0003e+12 24103
## - MSSubClass 1 2.3375e+10 1.0025e+12 24106
## - BsmtUnfSF 1 2.4384e+10 1.0036e+12 24107
## - LotArea 1 2.4475e+10 1.0036e+12 24107
## - MasVnrArea 1 2.9385e+10 1.0086e+12 24113
## - YearBuilt 1 3.1997e+10 1.0112e+12 24116
## - BedroomAbvGr 1 6.3496e+10 1.0427e+12 24152
## - BsmtFinSF1 1 7.3768e+10 1.0529e+12 24163
## - X1stFlrSF 1 1.4491e+11 1.1241e+12 24240
## - OverallQual 1 1.5791e+11 1.1371e+12 24253
## - X2ndFlrSF 1 2.4126e+11 1.2204e+12 24336
##
## Step: AIC=24078.52
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF +
## EnclosedPorch + ScreenPorch + PoolArea + MiscVal + YrSold
##
## Df Sum of Sq RSS AIC
## - YrSold 1 1.4199e+08 9.7945e+11 24077
## - BsmtFullBath 1 3.0597e+08 9.7962e+11 24077
## - GarageCars 1 4.8538e+08 9.7980e+11 24077
## - EnclosedPorch 1 6.3805e+08 9.7995e+11 24077
## - MiscVal 1 7.6489e+08 9.8008e+11 24077
## - LotFrontage 1 1.0837e+09 9.8040e+11 24078
## - FullBath 1 1.0959e+09 9.8041e+11 24078
## - OpenPorchSF 1 1.3977e+09 9.8071e+11 24078
## - ScreenPorch 1 1.4338e+09 9.8075e+11 24078
## - LowQualFinSF 1 1.4473e+09 9.8076e+11 24078
## <none> 9.7931e+11 24079
## - GarageArea 1 1.8358e+09 9.8115e+11 24079
## - WoodDeckSF 1 4.0024e+09 9.8331e+11 24081
## - KitchenAbvGr 1 4.6376e+09 9.8395e+11 24082
## - YearRemodAdd 1 5.3462e+09 9.8466e+11 24083
## - TotRmsAbvGrd 1 5.5200e+09 9.8483e+11 24083
## - HalfBath 1 6.1500e+09 9.8546e+11 24084
## - PoolArea 1 6.3592e+09 9.8567e+11 24084
## - BsmtFinSF2 1 6.5866e+09 9.8590e+11 24084
## - OverallCond 1 2.1147e+10 1.0005e+12 24102
## - MSSubClass 1 2.3338e+10 1.0026e+12 24104
## - BsmtUnfSF 1 2.4574e+10 1.0039e+12 24106
## - LotArea 1 2.4581e+10 1.0039e+12 24106
## - MasVnrArea 1 2.9398e+10 1.0087e+12 24111
## - YearBuilt 1 3.2019e+10 1.0113e+12 24114
## - BedroomAbvGr 1 6.3950e+10 1.0433e+12 24151
## - BsmtFinSF1 1 7.3922e+10 1.0532e+12 24162
## - X1stFlrSF 1 1.4479e+11 1.1241e+12 24238
## - OverallQual 1 1.5783e+11 1.1371e+12 24251
## - X2ndFlrSF 1 2.4112e+11 1.2204e+12 24334
##
## Step: AIC=24076.69
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF +
## EnclosedPorch + ScreenPorch + PoolArea + MiscVal
##
## Df Sum of Sq RSS AIC
## - BsmtFullBath 1 2.7259e+08 9.7973e+11 24075
## - GarageCars 1 5.0704e+08 9.7996e+11 24075
## - EnclosedPorch 1 6.3891e+08 9.8009e+11 24076
## - MiscVal 1 7.6499e+08 9.8022e+11 24076
## - LotFrontage 1 1.0513e+09 9.8050e+11 24076
## - FullBath 1 1.1057e+09 9.8056e+11 24076
## - ScreenPorch 1 1.4087e+09 9.8086e+11 24076
## - OpenPorchSF 1 1.4253e+09 9.8088e+11 24076
## - LowQualFinSF 1 1.4702e+09 9.8092e+11 24076
## <none> 9.7945e+11 24077
## - GarageArea 1 1.8249e+09 9.8128e+11 24077
## - WoodDeckSF 1 3.9721e+09 9.8343e+11 24079
## - KitchenAbvGr 1 4.7314e+09 9.8419e+11 24080
## - YearRemodAdd 1 5.2827e+09 9.8474e+11 24081
## - TotRmsAbvGrd 1 5.5575e+09 9.8501e+11 24081
## - HalfBath 1 6.1792e+09 9.8563e+11 24082
## - PoolArea 1 6.5283e+09 9.8598e+11 24083
## - BsmtFinSF2 1 6.5619e+09 9.8602e+11 24083
## - OverallCond 1 2.1120e+10 1.0006e+12 24100
## - MSSubClass 1 2.3263e+10 1.0027e+12 24102
## - LotArea 1 2.4653e+10 1.0041e+12 24104
## - BsmtUnfSF 1 2.4679e+10 1.0041e+12 24104
## - MasVnrArea 1 2.9408e+10 1.0089e+12 24109
## - YearBuilt 1 3.2013e+10 1.0115e+12 24112
## - BedroomAbvGr 1 6.3818e+10 1.0433e+12 24149
## - BsmtFinSF1 1 7.4470e+10 1.0539e+12 24160
## - X1stFlrSF 1 1.4483e+11 1.1243e+12 24236
## - OverallQual 1 1.5772e+11 1.1372e+12 24249
## - X2ndFlrSF 1 2.4108e+11 1.2205e+12 24332
##
## Step: AIC=24075.02
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch +
## ScreenPorch + PoolArea + MiscVal
##
## Df Sum of Sq RSS AIC
## - GarageCars 1 5.1100e+08 9.8024e+11 24074
## - EnclosedPorch 1 6.0998e+08 9.8034e+11 24074
## - MiscVal 1 8.1104e+08 9.8054e+11 24074
## - LotFrontage 1 1.0507e+09 9.8078e+11 24074
## - FullBath 1 1.2665e+09 9.8099e+11 24075
## - ScreenPorch 1 1.4168e+09 9.8114e+11 24075
## - LowQualFinSF 1 1.4619e+09 9.8119e+11 24075
## - OpenPorchSF 1 1.4728e+09 9.8120e+11 24075
## <none> 9.7973e+11 24075
## - GarageArea 1 1.8496e+09 9.8158e+11 24075
## - WoodDeckSF 1 4.1210e+09 9.8385e+11 24078
## - KitchenAbvGr 1 4.6559e+09 9.8438e+11 24079
## - YearRemodAdd 1 5.5699e+09 9.8530e+11 24080
## - TotRmsAbvGrd 1 5.6451e+09 9.8537e+11 24080
## - HalfBath 1 6.3062e+09 9.8603e+11 24081
## - PoolArea 1 6.5583e+09 9.8628e+11 24081
## - BsmtFinSF2 1 7.4581e+09 9.8718e+11 24082
## - OverallCond 1 2.0915e+10 1.0006e+12 24098
## - MSSubClass 1 2.3001e+10 1.0027e+12 24100
## - BsmtUnfSF 1 2.4672e+10 1.0044e+12 24102
## - LotArea 1 2.4965e+10 1.0047e+12 24102
## - MasVnrArea 1 2.9163e+10 1.0089e+12 24107
## - YearBuilt 1 3.2275e+10 1.0120e+12 24111
## - BedroomAbvGr 1 6.3859e+10 1.0436e+12 24147
## - BsmtFinSF1 1 9.4538e+10 1.0743e+12 24181
## - X1stFlrSF 1 1.4455e+11 1.1243e+12 24234
## - OverallQual 1 1.5758e+11 1.1373e+12 24247
## - X2ndFlrSF 1 2.4082e+11 1.2205e+12 24330
##
## Step: AIC=24073.63
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## PoolArea + MiscVal
##
## Df Sum of Sq RSS AIC
## - EnclosedPorch 1 6.3871e+08 9.8088e+11 24072
## - MiscVal 1 8.6575e+08 9.8110e+11 24073
## - LotFrontage 1 1.0372e+09 9.8127e+11 24073
## - FullBath 1 1.1803e+09 9.8142e+11 24073
## - LowQualFinSF 1 1.3410e+09 9.8158e+11 24073
## - OpenPorchSF 1 1.3559e+09 9.8159e+11 24073
## - ScreenPorch 1 1.4518e+09 9.8169e+11 24073
## <none> 9.8024e+11 24074
## - WoodDeckSF 1 4.1121e+09 9.8435e+11 24077
## - KitchenAbvGr 1 4.6725e+09 9.8491e+11 24077
## - YearRemodAdd 1 5.6542e+09 9.8589e+11 24078
## - TotRmsAbvGrd 1 5.7904e+09 9.8603e+11 24079
## - HalfBath 1 6.1375e+09 9.8637e+11 24079
## - PoolArea 1 6.5401e+09 9.8678e+11 24079
## - BsmtFinSF2 1 7.2690e+09 9.8751e+11 24080
## - GarageArea 1 1.0144e+10 9.9038e+11 24084
## - OverallCond 1 2.0750e+10 1.0010e+12 24096
## - MSSubClass 1 2.2828e+10 1.0031e+12 24099
## - BsmtUnfSF 1 2.4506e+10 1.0047e+12 24101
## - LotArea 1 2.5329e+10 1.0056e+12 24101
## - MasVnrArea 1 2.9067e+10 1.0093e+12 24106
## - YearBuilt 1 3.3344e+10 1.0136e+12 24111
## - BedroomAbvGr 1 6.3998e+10 1.0442e+12 24146
## - BsmtFinSF1 1 9.4043e+10 1.0743e+12 24179
## - X1stFlrSF 1 1.4524e+11 1.1255e+12 24233
## - OverallQual 1 1.6090e+11 1.1411e+12 24249
## - X2ndFlrSF 1 2.4078e+11 1.2210e+12 24328
##
## Step: AIC=24072.39
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## GarageArea + WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea +
## MiscVal
##
## Df Sum of Sq RSS AIC
## - MiscVal 1 8.9586e+08 9.8177e+11 24072
## - LotFrontage 1 9.7591e+08 9.8185e+11 24072
## - FullBath 1 1.1984e+09 9.8207e+11 24072
## - LowQualFinSF 1 1.3741e+09 9.8225e+11 24072
## - OpenPorchSF 1 1.4508e+09 9.8233e+11 24072
## <none> 9.8088e+11 24072
## - ScreenPorch 1 1.7835e+09 9.8266e+11 24073
## - WoodDeckSF 1 4.4151e+09 9.8529e+11 24076
## - KitchenAbvGr 1 4.6344e+09 9.8551e+11 24076
## - YearRemodAdd 1 5.4613e+09 9.8634e+11 24077
## - TotRmsAbvGrd 1 5.9966e+09 9.8687e+11 24078
## - HalfBath 1 6.0607e+09 9.8694e+11 24078
## - PoolArea 1 6.2921e+09 9.8717e+11 24078
## - BsmtFinSF2 1 7.0914e+09 9.8797e+11 24079
## - GarageArea 1 1.0016e+10 9.9089e+11 24082
## - OverallCond 1 2.1989e+10 1.0029e+12 24096
## - MSSubClass 1 2.2738e+10 1.0036e+12 24097
## - BsmtUnfSF 1 2.4262e+10 1.0051e+12 24099
## - LotArea 1 2.5707e+10 1.0066e+12 24101
## - MasVnrArea 1 2.9339e+10 1.0102e+12 24105
## - YearBuilt 1 4.1019e+10 1.0219e+12 24118
## - BedroomAbvGr 1 6.3794e+10 1.0447e+12 24144
## - BsmtFinSF1 1 9.3655e+10 1.0745e+12 24177
## - X1stFlrSF 1 1.4497e+11 1.1258e+12 24232
## - OverallQual 1 1.6033e+11 1.1412e+12 24247
## - X2ndFlrSF 1 2.4020e+11 1.2211e+12 24326
##
## Step: AIC=24071.45
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## GarageArea + WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea
##
## Df Sum of Sq RSS AIC
## - LotFrontage 1 9.6947e+08 9.8274e+11 24071
## - FullBath 1 1.1728e+09 9.8294e+11 24071
## - LowQualFinSF 1 1.3945e+09 9.8317e+11 24071
## - OpenPorchSF 1 1.4823e+09 9.8325e+11 24071
## <none> 9.8177e+11 24072
## - ScreenPorch 1 1.6942e+09 9.8347e+11 24072
## - WoodDeckSF 1 4.4369e+09 9.8621e+11 24075
## - KitchenAbvGr 1 4.9784e+09 9.8675e+11 24075
## - YearRemodAdd 1 5.4975e+09 9.8727e+11 24076
## - TotRmsAbvGrd 1 5.7800e+09 9.8755e+11 24076
## - HalfBath 1 6.0322e+09 9.8780e+11 24077
## - PoolArea 1 6.1365e+09 9.8791e+11 24077
## - BsmtFinSF2 1 7.0270e+09 9.8880e+11 24078
## - GarageArea 1 1.0167e+10 9.9194e+11 24082
## - OverallCond 1 2.1471e+10 1.0032e+12 24095
## - MSSubClass 1 2.2454e+10 1.0042e+12 24096
## - BsmtUnfSF 1 2.4034e+10 1.0058e+12 24098
## - LotArea 1 2.5355e+10 1.0071e+12 24099
## - MasVnrArea 1 2.9541e+10 1.0113e+12 24104
## - YearBuilt 1 4.0633e+10 1.0224e+12 24117
## - BedroomAbvGr 1 6.3187e+10 1.0450e+12 24142
## - BsmtFinSF1 1 9.3139e+10 1.0749e+12 24175
## - X1stFlrSF 1 1.4613e+11 1.1279e+12 24232
## - OverallQual 1 1.6083e+11 1.1426e+12 24247
## - X2ndFlrSF 1 2.4004e+11 1.2218e+12 24325
##
## Step: AIC=24070.61
## SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + FullBath +
## HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea +
## WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea
##
## Df Sum of Sq RSS AIC
## - FullBath 1 1.2151e+09 9.8396e+11 24070
## - OpenPorchSF 1 1.5064e+09 9.8425e+11 24070
## - LowQualFinSF 1 1.5244e+09 9.8427e+11 24070
## - ScreenPorch 1 1.5677e+09 9.8431e+11 24071
## <none> 9.8274e+11 24071
## - WoodDeckSF 1 4.1882e+09 9.8693e+11 24074
## - KitchenAbvGr 1 4.8071e+09 9.8755e+11 24074
## - YearRemodAdd 1 5.4456e+09 9.8819e+11 24075
## - TotRmsAbvGrd 1 5.8919e+09 9.8863e+11 24076
## - HalfBath 1 6.0559e+09 9.8880e+11 24076
## - PoolArea 1 6.5366e+09 9.8928e+11 24076
## - BsmtFinSF2 1 6.9364e+09 9.8968e+11 24077
## - GarageArea 1 1.1211e+10 9.9395e+11 24082
## - OverallCond 1 2.1911e+10 1.0047e+12 24094
## - BsmtUnfSF 1 2.3447e+10 1.0062e+12 24096
## - MasVnrArea 1 2.9602e+10 1.0123e+12 24103
## - MSSubClass 1 2.9644e+10 1.0124e+12 24103
## - LotArea 1 3.0055e+10 1.0128e+12 24104
## - YearBuilt 1 4.1017e+10 1.0238e+12 24116
## - BedroomAbvGr 1 6.2225e+10 1.0450e+12 24140
## - BsmtFinSF1 1 9.2492e+10 1.0752e+12 24174
## - X1stFlrSF 1 1.5305e+11 1.1358e+12 24238
## - OverallQual 1 1.6069e+11 1.1434e+12 24246
## - X2ndFlrSF 1 2.4324e+11 1.2260e+12 24327
##
## Step: AIC=24070.05
## SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + HalfBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea +
## WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea
##
## Df Sum of Sq RSS AIC
## - OpenPorchSF 1 1.3346e+09 9.8529e+11 24070
## - LowQualFinSF 1 1.4675e+09 9.8542e+11 24070
## - ScreenPorch 1 1.6399e+09 9.8560e+11 24070
## <none> 9.8396e+11 24070
## - WoodDeckSF 1 4.1964e+09 9.8815e+11 24073
## - HalfBath 1 4.8958e+09 9.8885e+11 24074
## - YearRemodAdd 1 4.9647e+09 9.8892e+11 24074
## - KitchenAbvGr 1 5.7542e+09 9.8971e+11 24075
## - TotRmsAbvGrd 1 5.8498e+09 9.8981e+11 24075
## - PoolArea 1 6.7169e+09 9.9067e+11 24076
## - BsmtFinSF2 1 7.2826e+09 9.9124e+11 24077
## - GarageArea 1 1.1378e+10 9.9533e+11 24082
## - OverallCond 1 2.2538e+10 1.0065e+12 24095
## - BsmtUnfSF 1 2.3626e+10 1.0076e+12 24096
## - LotArea 1 2.9575e+10 1.0135e+12 24103
## - MasVnrArea 1 3.0110e+10 1.0141e+12 24103
## - MSSubClass 1 3.0202e+10 1.0142e+12 24103
## - YearBuilt 1 4.0880e+10 1.0248e+12 24116
## - BedroomAbvGr 1 6.4365e+10 1.0483e+12 24142
## - BsmtFinSF1 1 9.5276e+10 1.0792e+12 24176
## - X1stFlrSF 1 1.5634e+11 1.1403e+12 24240
## - OverallQual 1 1.5951e+11 1.1435e+12 24244
## - X2ndFlrSF 1 2.7446e+11 1.2584e+12 24356
##
## Step: AIC=24069.64
## SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + HalfBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea +
## WoodDeckSF + ScreenPorch + PoolArea
##
## Df Sum of Sq RSS AIC
## - LowQualFinSF 1 1.5645e+09 9.8686e+11 24070
## <none> 9.8529e+11 24070
## - ScreenPorch 1 1.7452e+09 9.8704e+11 24070
## - WoodDeckSF 1 3.9819e+09 9.8927e+11 24072
## - HalfBath 1 4.7083e+09 9.9000e+11 24073
## - YearRemodAdd 1 5.1416e+09 9.9043e+11 24074
## - TotRmsAbvGrd 1 5.7281e+09 9.9102e+11 24074
## - KitchenAbvGr 1 6.0131e+09 9.9130e+11 24075
## - PoolArea 1 6.7659e+09 9.9206e+11 24076
## - BsmtFinSF2 1 7.3686e+09 9.9266e+11 24076
## - GarageArea 1 1.1643e+10 9.9693e+11 24081
## - OverallCond 1 2.2821e+10 1.0081e+12 24094
## - BsmtUnfSF 1 2.4661e+10 1.0100e+12 24097
## - LotArea 1 2.9603e+10 1.0149e+12 24102
## - MasVnrArea 1 2.9628e+10 1.0149e+12 24102
## - MSSubClass 1 3.0414e+10 1.0157e+12 24103
## - YearBuilt 1 4.1262e+10 1.0266e+12 24116
## - BedroomAbvGr 1 6.4487e+10 1.0498e+12 24142
## - BsmtFinSF1 1 9.6950e+10 1.0822e+12 24177
## - X1stFlrSF 1 1.5785e+11 1.1431e+12 24241
## - OverallQual 1 1.6187e+11 1.1472e+12 24246
## - X2ndFlrSF 1 2.8065e+11 1.2659e+12 24361
##
## Step: AIC=24069.49
## SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + GarageArea + WoodDeckSF + ScreenPorch +
## PoolArea
##
## Df Sum of Sq RSS AIC
## <none> 9.8686e+11 24070
## - ScreenPorch 1 1.7909e+09 9.8865e+11 24070
## - WoodDeckSF 1 3.9871e+09 9.9084e+11 24072
## - HalfBath 1 4.8047e+09 9.9166e+11 24073
## - YearRemodAdd 1 5.3353e+09 9.9219e+11 24074
## - KitchenAbvGr 1 6.8138e+09 9.9367e+11 24076
## - TotRmsAbvGrd 1 7.0366e+09 9.9389e+11 24076
## - PoolArea 1 7.2233e+09 9.9408e+11 24076
## - BsmtFinSF2 1 7.4082e+09 9.9426e+11 24076
## - GarageArea 1 1.1531e+10 9.9839e+11 24081
## - OverallCond 1 2.2117e+10 1.0090e+12 24093
## - BsmtUnfSF 1 2.5062e+10 1.0119e+12 24097
## - MasVnrArea 1 2.9075e+10 1.0159e+12 24101
## - MSSubClass 1 2.9449e+10 1.0163e+12 24102
## - LotArea 1 2.9586e+10 1.0164e+12 24102
## - YearBuilt 1 3.9745e+10 1.0266e+12 24114
## - BedroomAbvGr 1 6.4680e+10 1.0515e+12 24142
## - BsmtFinSF1 1 9.7510e+10 1.0844e+12 24178
## - X1stFlrSF 1 1.5643e+11 1.1433e+12 24240
## - OverallQual 1 1.6277e+11 1.1496e+12 24246
## - X2ndFlrSF 1 2.7921e+11 1.2661e+12 24359
# Print the model summary
summary(model_all_backwards)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea +
## WoodDeckSF + ScreenPorch + PoolArea, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138320 -15519 -1104 14104 216346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.033e+06 1.073e+05 -9.625 < 2e-16 ***
## MSSubClass -1.441e+02 2.462e+01 -5.853 6.29e-09 ***
## LotArea 5.140e-01 8.762e-02 5.867 5.81e-09 ***
## OverallQual 1.532e+04 1.114e+03 13.760 < 2e-16 ***
## OverallCond 4.713e+03 9.291e+02 5.072 4.58e-07 ***
## YearBuilt 3.413e+02 5.020e+01 6.800 1.68e-11 ***
## YearRemodAdd 1.535e+02 6.163e+01 2.491 0.012868 *
## MasVnrArea 3.276e+01 5.633e+00 5.816 7.82e-09 ***
## BsmtFinSF1 4.281e+01 4.019e+00 10.650 < 2e-16 ***
## BsmtFinSF2 1.877e+01 6.394e+00 2.936 0.003395 **
## BsmtUnfSF 2.107e+01 3.903e+00 5.399 8.12e-08 ***
## X1stFlrSF 6.942e+01 5.146e+00 13.490 < 2e-16 ***
## X2ndFlrSF 7.630e+01 4.234e+00 18.022 < 2e-16 ***
## HalfBath -5.536e+03 2.342e+03 -2.364 0.018236 *
## BedroomAbvGr -1.371e+04 1.581e+03 -8.674 < 2e-16 ***
## KitchenAbvGr -1.326e+04 4.710e+03 -2.815 0.004955 **
## TotRmsAbvGrd 3.327e+03 1.163e+03 2.861 0.004299 **
## GarageArea 1.992e+01 5.440e+00 3.663 0.000261 ***
## WoodDeckSF 1.596e+01 7.413e+00 2.154 0.031476 *
## ScreenPorch 2.273e+01 1.575e+01 1.443 0.149192
## PoolArea 5.952e+01 2.053e+01 2.899 0.003818 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29320 on 1148 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8667
## F-statistic: 380.8 on 20 and 1148 DF, p-value: < 2.2e-16
This summary is encouraging. The model has only a very slightly lower R-squared value, but the number of variables has been reduced from 37 to 20. This should help with both overfitting and multicollinearity. I will now check the VIF of the model to ensure that multicollinearity is not an issue.
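As a side check on the step() trace above, here is a minimal sketch of how the AIC column can be reproduced. It assumes that step() uses the extractAIC() definition for linear models, n * log(RSS / n) + 2 * k, where k counts the estimated coefficients. The values for rss, n and k are read from the output above; the mtcars model is only an illustrative stand-in and not part of this analysis.

```r
# Values read from the final step() trace and the model summary above
rss <- 9.8686e11   # RSS of the <none> row in the last step
k   <- 21          # 20 predictors plus the intercept
n   <- 1148 + k    # residual degrees of freedom + coefficients = 1169 rows

# AIC as step() reports it for lm objects (assumed extractAIC definition)
aic_by_hand <- n * log(rss / n) + 2 * k
aic_by_hand

# The same identity checked on a built-in data set
fit    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
n_fit  <- nrow(mtcars)
k_fit  <- length(coef(fit))
manual <- n_fit * log(sum(residuals(fit)^2) / n_fit) + 2 * k_fit
all.equal(manual, extractAIC(fit)[2])
```

With the numbers from the trace, the hand calculation comes out at roughly 24069.5, in line with the reported AIC=24069.49. Because the penalty is only 2 per coefficient, a term such as ScreenPorch can stay in the stepwise model even though its p-value is above 0.05.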
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
#Check VIF of the model
vif(model_all_backwards)
## MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd
## 1.492273 1.150386 3.185767 1.525783 3.108975 2.196403
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF X1stFlrSF X2ndFlrSF
## 1.342669 4.108074 1.439863 4.034166 5.011759 4.686271
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd GarageArea WoodDeckSF
## 1.871522 2.297084 1.548946 4.786408 1.816705 1.165230
## ScreenPorch PoolArea
## 1.071343 1.042236
While the VIFs are much better than before, a few are still higher than I would like, including one just barely above 5 (X1stFlrSF at 5.01). However, when I attempted to remove some of these variables, the loss in R-squared was large. Since this model’s goal is overall prediction, and not necessarily understanding the relationships between the variables, I will keep the model as is.
Next, I will perform a backwards elimination manually to see if I reach similar results.
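For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand. The helper name vif_by_hand and the mtcars model are illustrative assumptions, and the helper is only meant for models whose predictors are all numeric, as is the case here.

```r
# Hypothetical helper: VIF for each predictor of a numeric-only lm fit
vif_by_hand <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared  # R^2 of v on the rest
    1 / (1 - r2)
  })
}

# Illustrative model on a built-in data set
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif_by_hand(fit)
```

A VIF of 5 corresponds to an R² of 0.8 between that predictor and the others, which is why 5 is a common rule-of-thumb cutoff.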
# Perform backwards elimination manually on the model using all the numeric variables
model_all_manual <- lm(SalePrice ~ ., train_imputed)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ ., data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138346 -15955 -1147 14172 217521
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.300e+05 1.338e+06 -0.321 0.74806
## Id -2.309e-02 2.069e+00 -0.011 0.99110
## MSSubClass -1.382e+02 2.680e+01 -5.158 2.95e-07 ***
## LotFrontage 5.356e+01 4.771e+01 1.123 0.26185
## LotArea 4.872e-01 9.231e-02 5.278 1.57e-07 ***
## OverallQual 1.530e+04 1.147e+03 13.344 < 2e-16 ***
## OverallCond 4.706e+03 9.553e+02 4.926 9.64e-07 ***
## YearBuilt 3.544e+02 6.718e+01 5.275 1.59e-07 ***
## YearRemodAdd 1.584e+02 6.469e+01 2.449 0.01446 *
## MasVnrArea 3.315e+01 5.704e+00 5.813 7.98e-09 ***
## BsmtFinSF1 4.155e+01 4.535e+00 9.162 < 2e-16 ***
## BsmtFinSF2 1.827e+01 6.642e+00 2.751 0.00603 **
## BsmtUnfSF 2.097e+01 3.958e+00 5.298 1.41e-07 ***
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 7.026e+01 5.631e+00 12.477 < 2e-16 ***
## X2ndFlrSF 7.863e+01 4.748e+00 16.560 < 2e-16 ***
## LowQualFinSF 2.244e+01 1.749e+01 1.283 0.19974
## GrLivArea NA NA NA NA
## BsmtFullBath 1.210e+03 2.471e+03 0.489 0.62461
## BsmtHalfBath -8.845e+02 3.792e+03 -0.233 0.81559
## FullBath -3.003e+03 2.662e+03 -1.128 0.25948
## HalfBath -6.784e+03 2.535e+03 -2.676 0.00755 **
## BedroomAbvGr -1.375e+04 1.625e+03 -8.460 < 2e-16 ***
## KitchenAbvGr -1.119e+04 4.906e+03 -2.280 0.02279 *
## TotRmsAbvGrd 2.969e+03 1.194e+03 2.488 0.01300 *
## Fireplaces 2.713e+01 1.688e+03 0.016 0.98718
## GarageYrBlt -7.529e+00 7.052e+01 -0.107 0.91500
## GarageCars 2.039e+03 2.724e+03 0.749 0.45425
## GarageArea 1.344e+01 9.504e+00 1.414 0.15761
## WoodDeckSF 1.631e+01 7.579e+00 2.152 0.03158 *
## OpenPorchSF 1.838e+01 1.433e+01 1.283 0.19982
## EnclosedPorch -1.422e+01 1.634e+01 -0.870 0.38453
## X3SsnPorch -4.598e-01 2.754e+01 -0.017 0.98668
## ScreenPorch 2.071e+01 1.617e+01 1.280 0.20067
## PoolArea 5.610e+01 2.089e+01 2.685 0.00735 **
## MiscVal -1.505e+00 1.581e+00 -0.952 0.34139
## MoSold -1.331e+02 3.276e+02 -0.406 0.68454
## YrSold -3.106e+02 6.663e+02 -0.466 0.64118
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29400 on 1133 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.866
## F-statistic: 216.7 on 35 and 1133 DF, p-value: < 2.2e-16
I first remove the variables with perfect multicollinearity.
# Remove variables with perfect multicollinearity
model_all_manual <- update(model_all_manual, . ~ . - TotalBsmtSF - GrLivArea)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ Id + MSSubClass + LotFrontage + LotArea +
## OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea +
## BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF +
## LowQualFinSF + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF +
## EnclosedPorch + X3SsnPorch + ScreenPorch + PoolArea + MiscVal +
## MoSold + YrSold, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138346 -15955 -1147 14172 217521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.300e+05 1.338e+06 -0.321 0.74806
## Id -2.309e-02 2.069e+00 -0.011 0.99110
## MSSubClass -1.382e+02 2.680e+01 -5.158 2.95e-07 ***
## LotFrontage 5.356e+01 4.771e+01 1.123 0.26185
## LotArea 4.872e-01 9.231e-02 5.278 1.57e-07 ***
## OverallQual 1.530e+04 1.147e+03 13.344 < 2e-16 ***
## OverallCond 4.706e+03 9.553e+02 4.926 9.64e-07 ***
## YearBuilt 3.544e+02 6.718e+01 5.275 1.59e-07 ***
## YearRemodAdd 1.584e+02 6.469e+01 2.449 0.01446 *
## MasVnrArea 3.315e+01 5.704e+00 5.813 7.98e-09 ***
## BsmtFinSF1 4.155e+01 4.535e+00 9.162 < 2e-16 ***
## BsmtFinSF2 1.827e+01 6.642e+00 2.751 0.00603 **
## BsmtUnfSF 2.097e+01 3.958e+00 5.298 1.41e-07 ***
## X1stFlrSF 7.026e+01 5.631e+00 12.477 < 2e-16 ***
## X2ndFlrSF 7.863e+01 4.748e+00 16.560 < 2e-16 ***
## LowQualFinSF 2.244e+01 1.749e+01 1.283 0.19974
## BsmtFullBath 1.210e+03 2.471e+03 0.489 0.62461
## BsmtHalfBath -8.845e+02 3.792e+03 -0.233 0.81559
## FullBath -3.003e+03 2.662e+03 -1.128 0.25948
## HalfBath -6.784e+03 2.535e+03 -2.676 0.00755 **
## BedroomAbvGr -1.375e+04 1.625e+03 -8.460 < 2e-16 ***
## KitchenAbvGr -1.119e+04 4.906e+03 -2.280 0.02279 *
## TotRmsAbvGrd 2.969e+03 1.194e+03 2.488 0.01300 *
## Fireplaces 2.713e+01 1.688e+03 0.016 0.98718
## GarageYrBlt -7.529e+00 7.052e+01 -0.107 0.91500
## GarageCars 2.039e+03 2.724e+03 0.749 0.45425
## GarageArea 1.344e+01 9.504e+00 1.414 0.15761
## WoodDeckSF 1.631e+01 7.579e+00 2.152 0.03158 *
## OpenPorchSF 1.838e+01 1.433e+01 1.283 0.19982
## EnclosedPorch -1.422e+01 1.634e+01 -0.870 0.38453
## X3SsnPorch -4.598e-01 2.754e+01 -0.017 0.98668
## ScreenPorch 2.071e+01 1.617e+01 1.280 0.20067
## PoolArea 5.610e+01 2.089e+01 2.685 0.00735 **
## MiscVal -1.505e+00 1.581e+00 -0.952 0.34139
## MoSold -1.331e+02 3.276e+02 -0.406 0.68454
## YrSold -3.106e+02 6.663e+02 -0.466 0.64118
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29400 on 1133 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.866
## F-statistic: 216.7 on 35 and 1133 DF, p-value: < 2.2e-16
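The two NA coefficients arise because those columns are exact linear combinations of other predictors. In this data set that is most likely because TotalBsmtSF is the sum of the three basement area columns and GrLivArea is the sum of the three floor area columns. The sketch below uses a small made-up data frame (toy, with invented columns first, second, total and price) to show how alias() reports such a dependency.

```r
set.seed(1)

# Made-up data: 'total' is exactly first + second
toy <- data.frame(first  = runif(50, 500, 1500),
                  second = runif(50, 0, 1000))
toy$total <- toy$first + toy$second
toy$price <- 50 * toy$first + 60 * toy$second + rnorm(50, sd = 5000)

fit <- lm(price ~ first + second + total, data = toy)

coef(fit)            # the coefficient for 'total' is NA
alias(fit)$Complete  # shows 'total' as a combination of the other terms
```

Dropping an aliased term does not change the fitted values, which is why the summary above is identical to the previous one apart from the two NA rows.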
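Next, I remove Id, which has the highest p-value (0.991) and, as a row identifier, should have been dropped right away. I also remove LotArea in this step. Note that LotArea is actually highly significant in the summary above (p = 1.57e-07), so its removal is not justified by its p-value; LotFrontage takes over its role in the models that follow.
# Remove Id (row identifier) and LotArea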
model_all_manual <- update(model_all_manual, . ~ . - Id - LotArea)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt +
## GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch +
## X3SsnPorch + ScreenPorch + PoolArea + MiscVal + MoSold +
## YrSold, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138646 -16210 -1108 14215 216784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.505e+05 1.353e+06 -0.185 0.85316
## MSSubClass -1.343e+02 2.708e+01 -4.960 8.11e-07 ***
## LotFrontage 1.164e+02 4.670e+01 2.492 0.01285 *
## OverallQual 1.479e+04 1.154e+03 12.819 < 2e-16 ***
## OverallCond 4.633e+03 9.653e+02 4.799 1.81e-06 ***
## YearBuilt 3.366e+02 6.785e+01 4.960 8.12e-07 ***
## YearRemodAdd 1.614e+02 6.540e+01 2.467 0.01376 *
## MasVnrArea 3.336e+01 5.760e+00 5.791 9.03e-09 ***
## BsmtFinSF1 4.287e+01 4.577e+00 9.368 < 2e-16 ***
## BsmtFinSF2 2.116e+01 6.688e+00 3.164 0.00160 **
## BsmtUnfSF 2.194e+01 3.997e+00 5.488 5.01e-08 ***
## X1stFlrSF 7.112e+01 5.682e+00 12.517 < 2e-16 ***
## X2ndFlrSF 7.951e+01 4.798e+00 16.571 < 2e-16 ***
## LowQualFinSF 2.182e+01 1.766e+01 1.235 0.21706
## BsmtFullBath 1.947e+03 2.495e+03 0.781 0.43521
## BsmtHalfBath -1.444e+02 3.832e+03 -0.038 0.96996
## FullBath -2.153e+03 2.687e+03 -0.801 0.42323
## HalfBath -6.923e+03 2.561e+03 -2.704 0.00696 **
## BedroomAbvGr -1.376e+04 1.643e+03 -8.378 < 2e-16 ***
## KitchenAbvGr -1.217e+04 4.952e+03 -2.458 0.01412 *
## TotRmsAbvGrd 2.835e+03 1.207e+03 2.349 0.01899 *
## Fireplaces 1.078e+03 1.695e+03 0.636 0.52507
## GarageYrBlt -1.503e+01 7.128e+01 -0.211 0.83303
## GarageCars 2.497e+03 2.754e+03 0.907 0.36478
## GarageArea 1.284e+01 9.609e+00 1.336 0.18171
## WoodDeckSF 1.958e+01 7.632e+00 2.565 0.01044 *
## OpenPorchSF 1.790e+01 1.449e+01 1.235 0.21702
## EnclosedPorch -1.776e+01 1.651e+01 -1.075 0.28249
## X3SsnPorch -2.366e+00 2.782e+01 -0.085 0.93224
## ScreenPorch 2.034e+01 1.636e+01 1.244 0.21383
## PoolArea 5.262e+01 2.107e+01 2.497 0.01268 *
## MiscVal -1.132e+00 1.597e+00 -0.709 0.47860
## MoSold -1.808e+02 3.310e+02 -0.546 0.58497
## YrSold -3.777e+02 6.737e+02 -0.561 0.57519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29730 on 1135 degrees of freedom
## Multiple R-squared: 0.8668, Adjusted R-squared: 0.863
## F-statistic: 223.9 on 33 and 1135 DF, p-value: < 2.2e-16
Notably, the R-squared barely changed (0.8668, down from 0.87). I will now remove the two variables with the highest p-values, which are BsmtHalfBath and X3SsnPorch.
# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - BsmtHalfBath - X3SsnPorch)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea +
## WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## PoolArea + MiscVal + MoSold + YrSold, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138628 -16206 -1146 14263 216779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.485e+05 1.350e+06 -0.184 0.85391
## MSSubClass -1.344e+02 2.699e+01 -4.981 7.30e-07 ***
## LotFrontage 1.159e+02 4.636e+01 2.499 0.01258 *
## OverallQual 1.479e+04 1.152e+03 12.844 < 2e-16 ***
## OverallCond 4.627e+03 9.604e+02 4.818 1.64e-06 ***
## YearBuilt 3.365e+02 6.775e+01 4.967 7.84e-07 ***
## YearRemodAdd 1.612e+02 6.525e+01 2.470 0.01365 *
## MasVnrArea 3.335e+01 5.746e+00 5.804 8.39e-09 ***
## BsmtFinSF1 4.285e+01 4.521e+00 9.478 < 2e-16 ***
## BsmtFinSF2 2.115e+01 6.634e+00 3.189 0.00147 **
## BsmtUnfSF 2.194e+01 3.993e+00 5.493 4.87e-08 ***
## X1stFlrSF 7.110e+01 5.670e+00 12.541 < 2e-16 ***
## X2ndFlrSF 7.951e+01 4.792e+00 16.591 < 2e-16 ***
## LowQualFinSF 2.181e+01 1.765e+01 1.236 0.21682
## BsmtFullBath 1.974e+03 2.384e+03 0.828 0.40789
## FullBath -2.152e+03 2.676e+03 -0.804 0.42142
## HalfBath -6.922e+03 2.556e+03 -2.708 0.00688 **
## BedroomAbvGr -1.376e+04 1.633e+03 -8.426 < 2e-16 ***
## KitchenAbvGr -1.216e+04 4.945e+03 -2.458 0.01411 *
## TotRmsAbvGrd 2.839e+03 1.205e+03 2.357 0.01859 *
## Fireplaces 1.079e+03 1.693e+03 0.637 0.52407
## GarageYrBlt -1.501e+01 7.120e+01 -0.211 0.83306
## GarageCars 2.492e+03 2.751e+03 0.906 0.36515
## GarageArea 1.286e+01 9.597e+00 1.340 0.18041
## WoodDeckSF 1.960e+01 7.605e+00 2.577 0.01009 *
## OpenPorchSF 1.795e+01 1.447e+01 1.240 0.21508
## EnclosedPorch -1.771e+01 1.648e+01 -1.074 0.28283
## ScreenPorch 2.040e+01 1.633e+01 1.249 0.21185
## PoolArea 5.262e+01 2.105e+01 2.499 0.01259 *
## MiscVal -1.131e+00 1.595e+00 -0.709 0.47856
## MoSold -1.820e+02 3.305e+02 -0.551 0.58183
## YrSold -3.784e+02 6.718e+02 -0.563 0.57331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29700 on 1137 degrees of freedom
## Multiple R-squared: 0.8668, Adjusted R-squared: 0.8632
## F-statistic: 238.8 on 31 and 1137 DF, p-value: < 2.2e-16
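Again, the R-squared is unchanged at 0.8668. I will now remove GarageYrBlt and Fireplaces, two of the variables with the highest p-values (0.833 and 0.524). MoSold and YrSold have slightly higher p-values than Fireplaces and are removed in the following step.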
# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - GarageYrBlt - Fireplaces)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF +
## EnclosedPorch + ScreenPorch + PoolArea + MiscVal + MoSold +
## YrSold, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138289 -16444 -1341 14260 216525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.521e+05 1.348e+06 -0.187 0.85167
## MSSubClass -1.340e+02 2.692e+01 -4.976 7.49e-07 ***
## LotFrontage 1.194e+02 4.594e+01 2.598 0.00949 **
## OverallQual 1.489e+04 1.141e+03 13.044 < 2e-16 ***
## OverallCond 4.655e+03 9.571e+02 4.864 1.31e-06 ***
## YearBuilt 3.268e+02 5.795e+01 5.640 2.15e-08 ***
## YearRemodAdd 1.544e+02 6.330e+01 2.439 0.01486 *
## MasVnrArea 3.345e+01 5.734e+00 5.834 7.04e-09 ***
## BsmtFinSF1 4.285e+01 4.517e+00 9.486 < 2e-16 ***
## BsmtFinSF2 2.107e+01 6.628e+00 3.179 0.00152 **
## BsmtUnfSF 2.178e+01 3.983e+00 5.468 5.59e-08 ***
## X1stFlrSF 7.211e+01 5.475e+00 13.171 < 2e-16 ***
## X2ndFlrSF 7.994e+01 4.749e+00 16.834 < 2e-16 ***
## LowQualFinSF 2.097e+01 1.749e+01 1.199 0.23069
## BsmtFullBath 2.003e+03 2.382e+03 0.841 0.40074
## FullBath -2.232e+03 2.670e+03 -0.836 0.40351
## HalfBath -6.812e+03 2.550e+03 -2.672 0.00766 **
## BedroomAbvGr -1.386e+04 1.623e+03 -8.540 < 2e-16 ***
## KitchenAbvGr -1.252e+04 4.892e+03 -2.559 0.01062 *
## TotRmsAbvGrd 2.833e+03 1.203e+03 2.354 0.01872 *
## GarageCars 2.651e+03 2.733e+03 0.970 0.33212
## GarageArea 1.159e+01 9.189e+00 1.262 0.20730
## WoodDeckSF 1.982e+01 7.542e+00 2.628 0.00871 **
## OpenPorchSF 1.816e+01 1.445e+01 1.257 0.20902
## EnclosedPorch -1.802e+01 1.646e+01 -1.094 0.27405
## ScreenPorch 2.185e+01 1.618e+01 1.351 0.17697
## PoolArea 5.285e+01 2.103e+01 2.514 0.01209 *
## MiscVal -1.106e+00 1.593e+00 -0.694 0.48782
## MoSold -1.764e+02 3.301e+02 -0.534 0.59329
## YrSold -3.755e+02 6.713e+02 -0.559 0.57602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29680 on 1139 degrees of freedom
## Multiple R-squared: 0.8668, Adjusted R-squared: 0.8634
## F-statistic: 255.6 on 29 and 1139 DF, p-value: < 2.2e-16
Again, the R-squared didn’t change (0.8668), even though the model has fewer and fewer variables. I will now remove the two variables with the next highest p-values, which are MoSold and YrSold.
# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - MoSold - YrSold)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF +
## EnclosedPorch + ScreenPorch + PoolArea + MiscVal, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137305 -15921 -1231 14168 216965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004e+06 1.248e+05 -8.045 2.14e-15 ***
## MSSubClass -1.335e+02 2.690e+01 -4.963 8.01e-07 ***
## LotFrontage 1.177e+02 4.585e+01 2.567 0.01038 *
## OverallQual 1.485e+04 1.139e+03 13.036 < 2e-16 ***
## OverallCond 4.646e+03 9.564e+02 4.858 1.35e-06 ***
## YearBuilt 3.269e+02 5.792e+01 5.644 2.10e-08 ***
## YearRemodAdd 1.529e+02 6.320e+01 2.420 0.01568 *
## MasVnrArea 3.347e+01 5.730e+00 5.841 6.76e-09 ***
## BsmtFinSF1 4.302e+01 4.506e+00 9.546 < 2e-16 ***
## BsmtFinSF2 2.113e+01 6.621e+00 3.192 0.00145 **
## BsmtUnfSF 2.192e+01 3.976e+00 5.512 4.38e-08 ***
## X1stFlrSF 7.200e+01 5.466e+00 13.171 < 2e-16 ***
## X2ndFlrSF 7.986e+01 4.744e+00 16.835 < 2e-16 ***
## LowQualFinSF 2.141e+01 1.747e+01 1.226 0.22043
## BsmtFullBath 1.932e+03 2.371e+03 0.815 0.41532
## FullBath -2.257e+03 2.668e+03 -0.846 0.39773
## HalfBath -6.806e+03 2.547e+03 -2.672 0.00766 **
## BedroomAbvGr -1.388e+04 1.619e+03 -8.570 < 2e-16 ***
## KitchenAbvGr -1.275e+04 4.877e+03 -2.615 0.00904 **
## TotRmsAbvGrd 2.887e+03 1.200e+03 2.406 0.01629 *
## GarageCars 2.705e+03 2.729e+03 0.991 0.32178
## GarageArea 1.167e+01 9.179e+00 1.271 0.20389
## WoodDeckSF 1.969e+01 7.535e+00 2.614 0.00908 **
## OpenPorchSF 1.810e+01 1.443e+01 1.255 0.20985
## EnclosedPorch -1.767e+01 1.644e+01 -1.075 0.28270
## ScreenPorch 2.145e+01 1.616e+01 1.327 0.18463
## PoolArea 5.405e+01 2.094e+01 2.581 0.00998 **
## MiscVal -1.097e+00 1.592e+00 -0.689 0.49090
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29670 on 1141 degrees of freedom
## Multiple R-squared: 0.8667, Adjusted R-squared: 0.8636
## F-statistic: 274.8 on 27 and 1141 DF, p-value: < 2.2e-16
Again, the R-squared didn’t change much (0.8667). I will now remove the two variables with the next highest p-values, which are MiscVal and BsmtFullBath.
# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - MiscVal - BsmtFullBath)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch +
## ScreenPorch + PoolArea, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137270 -16314 -1359 14160 217278
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.013e+06 1.237e+05 -8.190 6.94e-16 ***
## MSSubClass -1.309e+02 2.677e+01 -4.891 1.15e-06 ***
## LotFrontage 1.177e+02 4.583e+01 2.568 0.010341 *
## OverallQual 1.485e+04 1.138e+03 13.045 < 2e-16 ***
## OverallCond 4.552e+03 9.521e+02 4.781 1.97e-06 ***
## YearBuilt 3.267e+02 5.781e+01 5.651 2.02e-08 ***
## YearRemodAdd 1.583e+02 6.288e+01 2.517 0.011961 *
## MasVnrArea 3.324e+01 5.715e+00 5.817 7.79e-09 ***
## BsmtFinSF1 4.445e+01 4.087e+00 10.876 < 2e-16 ***
## BsmtFinSF2 2.222e+01 6.464e+00 3.437 0.000609 ***
## BsmtUnfSF 2.184e+01 3.973e+00 5.497 4.75e-08 ***
## X1stFlrSF 7.196e+01 5.456e+00 13.189 < 2e-16 ***
## X2ndFlrSF 7.975e+01 4.741e+00 16.823 < 2e-16 ***
## LowQualFinSF 2.147e+01 1.746e+01 1.230 0.219040
## FullBath -2.508e+03 2.647e+03 -0.947 0.343687
## HalfBath -6.890e+03 2.544e+03 -2.708 0.006862 **
## BedroomAbvGr -1.383e+04 1.617e+03 -8.551 < 2e-16 ***
## KitchenAbvGr -1.288e+04 4.860e+03 -2.651 0.008136 **
## TotRmsAbvGrd 2.872e+03 1.197e+03 2.399 0.016581 *
## GarageCars 2.798e+03 2.726e+03 1.027 0.304776
## GarageArea 1.168e+01 9.174e+00 1.273 0.203273
## WoodDeckSF 2.011e+01 7.515e+00 2.676 0.007560 **
## OpenPorchSF 1.869e+01 1.441e+01 1.297 0.194887
## EnclosedPorch -1.741e+01 1.642e+01 -1.061 0.289119
## ScreenPorch 2.106e+01 1.614e+01 1.305 0.192060
## PoolArea 5.373e+01 2.092e+01 2.568 0.010355 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29650 on 1143 degrees of freedom
## Multiple R-squared: 0.8666, Adjusted R-squared: 0.8637
## F-statistic: 297 on 25 and 1143 DF, p-value: < 2.2e-16
Again, the R-squared didn’t change much (0.8666). I will now remove the two variables with the next highest p-values, which are FullBath and GarageCars.
# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - FullBath - GarageCars)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
## HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea +
## WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## PoolArea, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137411 -16451 -1250 13885 214724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.847e+05 1.136e+05 -8.670 < 2e-16 ***
## MSSubClass -1.309e+02 2.674e+01 -4.895 1.12e-06 ***
## LotFrontage 1.181e+02 4.582e+01 2.577 0.010088 *
## OverallQual 1.489e+04 1.130e+03 13.177 < 2e-16 ***
## OverallCond 4.565e+03 9.502e+02 4.804 1.76e-06 ***
## YearBuilt 3.163e+02 5.477e+01 5.775 9.93e-09 ***
## YearRemodAdd 1.539e+02 6.250e+01 2.463 0.013938 *
## MasVnrArea 3.336e+01 5.710e+00 5.843 6.69e-09 ***
## BsmtFinSF1 4.453e+01 4.063e+00 10.959 < 2e-16 ***
## BsmtFinSF2 2.218e+01 6.447e+00 3.440 0.000603 ***
## BsmtUnfSF 2.180e+01 3.971e+00 5.489 4.98e-08 ***
## X1stFlrSF 7.091e+01 5.267e+00 13.463 < 2e-16 ***
## X2ndFlrSF 7.804e+01 4.331e+00 18.020 < 2e-16 ***
## LowQualFinSF 1.977e+01 1.740e+01 1.136 0.256286
## HalfBath -5.944e+03 2.371e+03 -2.507 0.012304 *
## BedroomAbvGr -1.397e+04 1.610e+03 -8.682 < 2e-16 ***
## KitchenAbvGr -1.362e+04 4.798e+03 -2.838 0.004618 **
## TotRmsAbvGrd 2.909e+03 1.196e+03 2.432 0.015160 *
## GarageArea 1.927e+01 5.556e+00 3.468 0.000545 ***
## WoodDeckSF 2.010e+01 7.515e+00 2.675 0.007577 **
## OpenPorchSF 1.674e+01 1.434e+01 1.168 0.243134
## EnclosedPorch -1.800e+01 1.641e+01 -1.097 0.273047
## ScreenPorch 2.172e+01 1.613e+01 1.347 0.178236
## PoolArea 5.415e+01 2.091e+01 2.589 0.009735 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29650 on 1145 degrees of freedom
## Multiple R-squared: 0.8664, Adjusted R-squared: 0.8637
## F-statistic: 322.8 on 23 and 1145 DF, p-value: < 2.2e-16
Again, the R-squared didn’t change much (0.8664). I will now remove the two variables with the next highest p-values, which are EnclosedPorch and LowQualFinSF.
# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - EnclosedPorch - LowQualFinSF)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea +
## WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137057 -16453 -1493 13910 213617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004e+06 1.087e+05 -9.241 < 2e-16 ***
## MSSubClass -1.274e+02 2.661e+01 -4.788 1.91e-06 ***
## LotFrontage 1.193e+02 4.576e+01 2.606 0.009268 **
## OverallQual 1.481e+04 1.126e+03 13.160 < 2e-16 ***
## OverallCond 4.616e+03 9.409e+02 4.905 1.07e-06 ***
## YearBuilt 3.273e+02 5.075e+01 6.450 1.64e-10 ***
## YearRemodAdd 1.524e+02 6.238e+01 2.442 0.014745 *
## MasVnrArea 3.326e+01 5.704e+00 5.831 7.14e-09 ***
## BsmtFinSF1 4.450e+01 4.062e+00 10.956 < 2e-16 ***
## BsmtFinSF2 2.192e+01 6.443e+00 3.403 0.000691 ***
## BsmtUnfSF 2.179e+01 3.968e+00 5.492 4.89e-08 ***
## X1stFlrSF 7.034e+01 5.251e+00 13.395 < 2e-16 ***
## X2ndFlrSF 7.718e+01 4.297e+00 17.960 < 2e-16 ***
## HalfBath -5.930e+03 2.370e+03 -2.502 0.012481 *
## BedroomAbvGr -1.397e+04 1.609e+03 -8.681 < 2e-16 ***
## KitchenAbvGr -1.421e+04 4.769e+03 -2.979 0.002949 **
## TotRmsAbvGrd 3.212e+03 1.177e+03 2.729 0.006440 **
## GarageArea 1.897e+01 5.554e+00 3.416 0.000658 ***
## WoodDeckSF 2.086e+01 7.488e+00 2.786 0.005425 **
## OpenPorchSF 1.804e+01 1.432e+01 1.260 0.207799
## ScreenPorch 2.458e+01 1.596e+01 1.540 0.123767
## PoolArea 5.399e+01 2.082e+01 2.593 0.009642 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29660 on 1147 degrees of freedom
## Multiple R-squared: 0.8661, Adjusted R-squared: 0.8636
## F-statistic: 353.2 on 21 and 1147 DF, p-value: < 2.2e-16
Again, the R-squared didn’t change much (0.8661). I will now remove the variable with the next highest p-value, which is OpenPorchSF.
# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - OpenPorchSF)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea +
## WoodDeckSF + ScreenPorch + PoolArea, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137606 -16434 -1102 13637 214553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.012e+06 1.085e+05 -9.328 < 2e-16 ***
## MSSubClass -1.277e+02 2.661e+01 -4.798 1.82e-06 ***
## LotFrontage 1.199e+02 4.576e+01 2.619 0.008928 **
## OverallQual 1.490e+04 1.124e+03 13.259 < 2e-16 ***
## OverallCond 4.641e+03 9.410e+02 4.933 9.31e-07 ***
## YearBuilt 3.285e+02 5.075e+01 6.474 1.41e-10 ***
## YearRemodAdd 1.550e+02 6.236e+01 2.486 0.013053 *
## MasVnrArea 3.294e+01 5.700e+00 5.780 9.62e-09 ***
## BsmtFinSF1 4.481e+01 4.056e+00 11.049 < 2e-16 ***
## BsmtFinSF2 2.204e+01 6.444e+00 3.419 0.000649 ***
## BsmtUnfSF 2.219e+01 3.956e+00 5.609 2.55e-08 ***
## X1stFlrSF 7.060e+01 5.249e+00 13.450 < 2e-16 ***
## X2ndFlrSF 7.768e+01 4.280e+00 18.148 < 2e-16 ***
## HalfBath -5.817e+03 2.369e+03 -2.456 0.014211 *
## BedroomAbvGr -1.399e+04 1.610e+03 -8.689 < 2e-16 ***
## KitchenAbvGr -1.450e+04 4.764e+03 -3.043 0.002393 **
## TotRmsAbvGrd 3.187e+03 1.177e+03 2.708 0.006878 **
## GarageArea 1.918e+01 5.553e+00 3.455 0.000571 ***
## WoodDeckSF 2.041e+01 7.481e+00 2.729 0.006457 **
## ScreenPorch 2.529e+01 1.595e+01 1.585 0.113174
## PoolArea 5.424e+01 2.083e+01 2.604 0.009329 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29670 on 1148 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8636
## F-statistic: 370.6 on 20 and 1148 DF, p-value: < 2.2e-16
Again, the R-squared didn’t change much (0.8659). I will now remove the variable with the next highest p-value, which is ScreenPorch.
# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - ScreenPorch)
# Print the model summary
summary(model_all_manual)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea +
## WoodDeckSF + PoolArea, data = train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138158 -15879 -1138 13677 218486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.931e+05 1.079e+05 -9.203 < 2e-16 ***
## MSSubClass -1.279e+02 2.663e+01 -4.802 1.78e-06 ***
## LotFrontage 1.165e+02 4.574e+01 2.546 0.011035 *
## OverallQual 1.493e+04 1.124e+03 13.277 < 2e-16 ***
## OverallCond 4.726e+03 9.401e+02 5.027 5.76e-07 ***
## YearBuilt 3.240e+02 5.070e+01 6.390 2.40e-10 ***
## YearRemodAdd 1.497e+02 6.231e+01 2.402 0.016443 *
## MasVnrArea 3.329e+01 5.699e+00 5.841 6.75e-09 ***
## BsmtFinSF1 4.501e+01 4.056e+00 11.095 < 2e-16 ***
## BsmtFinSF2 2.266e+01 6.436e+00 3.520 0.000448 ***
## BsmtUnfSF 2.226e+01 3.959e+00 5.624 2.35e-08 ***
## X1stFlrSF 7.110e+01 5.243e+00 13.560 < 2e-16 ***
## X2ndFlrSF 7.788e+01 4.281e+00 18.192 < 2e-16 ***
## HalfBath -5.488e+03 2.361e+03 -2.324 0.020292 *
## BedroomAbvGr -1.397e+04 1.611e+03 -8.670 < 2e-16 ***
## KitchenAbvGr -1.497e+04 4.758e+03 -3.145 0.001703 **
## TotRmsAbvGrd 3.178e+03 1.178e+03 2.699 0.007064 **
## GarageArea 1.945e+01 5.554e+00 3.501 0.000481 ***
## WoodDeckSF 1.906e+01 7.437e+00 2.563 0.010507 *
## PoolArea 5.593e+01 2.082e+01 2.687 0.007314 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29690 on 1149 degrees of freedom
## Multiple R-squared: 0.8656, Adjusted R-squared: 0.8634
## F-statistic: 389.5 on 19 and 1149 DF, p-value: < 2.2e-16
All the predictors now have p-values below 0.05. The manual model ends up close to the backwards elimination model, but not identical: it keeps LotFrontage where the stepwise model keeps LotArea (a result of removing LotArea early on), and it drops ScreenPorch, which the stepwise model retains. Also notable: the R-squared of the reduced model (0.8656) is practically the same as that of the original model with all the numeric variables (0.87). Even though, as mentioned, this model’s purpose is prediction, a model with fewer variables is generally preferable and less prone to overfitting. I will keep this model as the final iteration of the model using all the numeric variables.
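The manual procedure above can also be written as a loop. The sketch below is one possible version, not the code used in this report: the function name backward_by_p and the mtcars model are illustrative assumptions, it drops one variable per refit rather than two, and it assumes all predictors are numeric so that coefficient names match term names.

```r
# Hypothetical helper: repeatedly drop the predictor with the largest
# p-value until every remaining p-value is at or below alpha
backward_by_p <- function(model, alpha = 0.05) {
  repeat {
    coefs <- summary(model)$coefficients
    # p-values for everything except the intercept, with names preserved
    p <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    if (length(p) == 0 || max(p) <= alpha) break
    worst <- names(p)[which.max(p)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Illustrative run on a built-in data set
full    <- lm(mpg ~ wt + hp + qsec + drat + gear, data = mtcars)
reduced <- backward_by_p(full)
formula(reduced)
```

Refitting after every single removal matters because p-values shift once a correlated variable leaves the model, so the ranking of the remaining variables can change between steps.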
Next, I will work on improving the PCA model.
First, let’s add a PCA model with just the first two components.
# Build a regression model using the first 2 PCA components
pca_2 <- cbind(SalePrice = train_imputed$SalePrice, pca_result_all$x[, 1:2])
model_pca_2 <- lm(SalePrice ~ ., as.data.frame(pca_2))
# Print the model summary
summary(model_pca_2)
##
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -127944 -20243 -3651 15330 271638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181183.6 980.7 184.743 < 2e-16 ***
## PC1 25951.1 349.1 74.337 < 2e-16 ***
## PC2 1583.7 539.8 2.934 0.00342 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33530 on 1166 degrees of freedom
## Multiple R-squared: 0.826, Adjusted R-squared: 0.8257
## F-statistic: 2767 on 2 and 1166 DF, p-value: < 2.2e-16
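The R-squared above measures how well two components explain SalePrice; how much of the predictors' own variance those components carry is a separate question. A minimal sketch of that check, using the built-in mtcars data as a stand-in (the object names `pca_toy`, `var_explained`, and `cum_var` are illustrative, not objects from this analysis):

```r
# Toy illustration on a built-in dataset; this is not the housing data
pca_toy <- prcomp(mtcars[, c("disp", "hp", "drat", "wt", "qsec")], scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)

# Cumulative proportion, in component order
cum_var <- cumsum(var_explained)

# Share of predictor variance captured by the first two components
round(cum_var[1:2], 3)
```

The same calculation on `pca_result_all$sdev` would show how much predictor variance the two-component model keeps.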
Now I will build a PCA model using backwards elimination on all the principal components.
# Perform backwards elimination on the model using all the principal components
model_pca_backwards <- step(model_pca_99)
## Start: AIC=23084.66
## SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +
## PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC17 + PC18 +
## PC19 + PC20 + PC21 + PC22 + PC23 + PC24 + PC25 + PC26 + PC27 +
## PC28 + PC29 + PC30 + PC31 + PC32 + PC33
##
## Df Sum of Sq RSS AIC
## - PC17 1 4.2023e+06 4.1565e+11 23083
## - PC24 1 4.2082e+07 4.1568e+11 23083
## - PC6 1 1.3189e+08 4.1577e+11 23083
## <none> 4.1564e+11 23085
## - PC11 1 8.9201e+08 4.1653e+11 23085
## - PC23 1 1.1586e+09 4.1680e+11 23086
## - PC19 1 2.0976e+09 4.1774e+11 23089
## - PC29 1 2.2335e+09 4.1788e+11 23089
## - PC14 1 2.3293e+09 4.1797e+11 23089
## - PC13 1 2.7680e+09 4.1841e+11 23090
## - PC15 1 2.8466e+09 4.1849e+11 23091
## - PC20 1 3.5338e+09 4.1918e+11 23093
## - PC18 1 4.4477e+09 4.2009e+11 23095
## - PC10 1 5.4819e+09 4.2112e+11 23098
## - PC25 1 7.1124e+09 4.2275e+11 23103
## - PC27 1 9.1426e+09 4.2478e+11 23108
## - PC2 1 9.6772e+09 4.2532e+11 23110
## - PC16 1 1.0151e+10 4.2579e+11 23111
## - PC28 1 1.5081e+10 4.3072e+11 23124
## - PC32 1 1.7119e+10 4.3276e+11 23130
## - PC7 1 1.8177e+10 4.3382e+11 23133
## - PC12 1 1.9868e+10 4.3551e+11 23137
## - PC9 1 2.1525e+10 4.3717e+11 23142
## - PC31 1 2.5240e+10 4.4088e+11 23152
## - PC22 1 2.6360e+10 4.4200e+11 23155
## - PC21 1 2.6631e+10 4.4227e+11 23155
## - PC3 1 5.5773e+10 4.7142e+11 23230
## - PC4 1 6.2564e+10 4.7821e+11 23247
## - PC8 1 6.6040e+10 4.8168e+11 23255
## - PC5 1 7.0799e+10 4.8644e+11 23267
## - PC30 1 9.1222e+10 5.0686e+11 23315
## - PC33 1 1.1814e+11 5.3379e+11 23375
## - PC26 1 2.0648e+11 6.2213e+11 23554
## - PC1 1 6.2134e+12 6.6290e+12 26320
##
## Step: AIC=23082.67
## SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +
## PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC18 + PC19 +
## PC20 + PC21 + PC22 + PC23 + PC24 + PC25 + PC26 + PC27 + PC28 +
## PC29 + PC30 + PC31 + PC32 + PC33
##
## Df Sum of Sq RSS AIC
## - PC24 1 4.2082e+07 4.1569e+11 23081
## - PC6 1 1.3189e+08 4.1578e+11 23081
## <none> 4.1565e+11 23083
## - PC11 1 8.9201e+08 4.1654e+11 23083
## - PC23 1 1.1586e+09 4.1681e+11 23084
## - PC19 1 2.0976e+09 4.1774e+11 23087
## - PC29 1 2.2335e+09 4.1788e+11 23087
## - PC14 1 2.3293e+09 4.1798e+11 23087
## - PC13 1 2.7680e+09 4.1841e+11 23088
## - PC15 1 2.8466e+09 4.1849e+11 23089
## - PC20 1 3.5338e+09 4.1918e+11 23091
## - PC18 1 4.4477e+09 4.2009e+11 23093
## - PC10 1 5.4819e+09 4.2113e+11 23096
## - PC25 1 7.1124e+09 4.2276e+11 23101
## - PC27 1 9.1426e+09 4.2479e+11 23106
## - PC2 1 9.6772e+09 4.2532e+11 23108
## - PC16 1 1.0151e+10 4.2580e+11 23109
## - PC28 1 1.5081e+10 4.3073e+11 23122
## - PC32 1 1.7119e+10 4.3277e+11 23128
## - PC7 1 1.8177e+10 4.3382e+11 23131
## - PC12 1 1.9868e+10 4.3551e+11 23135
## - PC9 1 2.1525e+10 4.3717e+11 23140
## - PC31 1 2.5240e+10 4.4089e+11 23150
## - PC22 1 2.6360e+10 4.4201e+11 23153
## - PC21 1 2.6631e+10 4.4228e+11 23153
## - PC3 1 5.5773e+10 4.7142e+11 23228
## - PC4 1 6.2564e+10 4.7821e+11 23245
## - PC8 1 6.6040e+10 4.8169e+11 23253
## - PC5 1 7.0799e+10 4.8645e+11 23265
## - PC30 1 9.1222e+10 5.0687e+11 23313
## - PC33 1 1.1814e+11 5.3379e+11 23373
## - PC26 1 2.0648e+11 6.2213e+11 23552
## - PC1 1 6.2134e+12 6.6290e+12 26318
##
## Step: AIC=23080.79
## SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +
## PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC18 + PC19 +
## PC20 + PC21 + PC22 + PC23 + PC25 + PC26 + PC27 + PC28 + PC29 +
## PC30 + PC31 + PC32 + PC33
##
## Df Sum of Sq RSS AIC
## - PC6 1 1.3189e+08 4.1582e+11 23079
## <none> 4.1569e+11 23081
## - PC11 1 8.9201e+08 4.1658e+11 23081
## - PC23 1 1.1586e+09 4.1685e+11 23082
## - PC19 1 2.0976e+09 4.1779e+11 23085
## - PC29 1 2.2335e+09 4.1792e+11 23085
## - PC14 1 2.3293e+09 4.1802e+11 23085
## - PC13 1 2.7680e+09 4.1846e+11 23087
## - PC15 1 2.8466e+09 4.1854e+11 23087
## - PC20 1 3.5338e+09 4.1922e+11 23089
## - PC18 1 4.4477e+09 4.2014e+11 23091
## - PC10 1 5.4819e+09 4.2117e+11 23094
## - PC25 1 7.1124e+09 4.2280e+11 23099
## - PC27 1 9.1426e+09 4.2483e+11 23104
## - PC2 1 9.6772e+09 4.2537e+11 23106
## - PC16 1 1.0151e+10 4.2584e+11 23107
## - PC28 1 1.5081e+10 4.3077e+11 23120
## - PC32 1 1.7119e+10 4.3281e+11 23126
## - PC7 1 1.8177e+10 4.3387e+11 23129
## - PC12 1 1.9868e+10 4.3556e+11 23133
## - PC9 1 2.1525e+10 4.3721e+11 23138
## - PC31 1 2.5240e+10 4.4093e+11 23148
## - PC22 1 2.6360e+10 4.4205e+11 23151
## - PC21 1 2.6631e+10 4.4232e+11 23151
## - PC3 1 5.5773e+10 4.7146e+11 23226
## - PC4 1 6.2564e+10 4.7825e+11 23243
## - PC8 1 6.6040e+10 4.8173e+11 23251
## - PC5 1 7.0799e+10 4.8649e+11 23263
## - PC30 1 9.1222e+10 5.0691e+11 23311
## - PC33 1 1.1814e+11 5.3383e+11 23371
## - PC26 1 2.0648e+11 6.2217e+11 23550
## - PC1 1 6.2134e+12 6.6290e+12 26316
##
## Step: AIC=23079.16
## SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7 + PC8 + PC9 + PC10 +
## PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC18 + PC19 + PC20 +
## PC21 + PC22 + PC23 + PC25 + PC26 + PC27 + PC28 + PC29 + PC30 +
## PC31 + PC32 + PC33
##
## Df Sum of Sq RSS AIC
## <none> 4.1582e+11 23079
## - PC11 1 8.9201e+08 4.1671e+11 23080
## - PC23 1 1.1586e+09 4.1698e+11 23080
## - PC19 1 2.0976e+09 4.1792e+11 23083
## - PC29 1 2.2335e+09 4.1805e+11 23083
## - PC14 1 2.3293e+09 4.1815e+11 23084
## - PC13 1 2.7680e+09 4.1859e+11 23085
## - PC15 1 2.8466e+09 4.1867e+11 23085
## - PC20 1 3.5338e+09 4.1935e+11 23087
## - PC18 1 4.4477e+09 4.2027e+11 23090
## - PC10 1 5.4819e+09 4.2130e+11 23093
## - PC25 1 7.1124e+09 4.2293e+11 23097
## - PC27 1 9.1426e+09 4.2496e+11 23103
## - PC2 1 9.6772e+09 4.2550e+11 23104
## - PC16 1 1.0151e+10 4.2597e+11 23105
## - PC28 1 1.5081e+10 4.3090e+11 23119
## - PC32 1 1.7119e+10 4.3294e+11 23124
## - PC7 1 1.8177e+10 4.3400e+11 23127
## - PC12 1 1.9868e+10 4.3569e+11 23132
## - PC9 1 2.1525e+10 4.3735e+11 23136
## - PC31 1 2.5240e+10 4.4106e+11 23146
## - PC22 1 2.6360e+10 4.4218e+11 23149
## - PC21 1 2.6631e+10 4.4245e+11 23150
## - PC3 1 5.5773e+10 4.7159e+11 23224
## - PC4 1 6.2564e+10 4.7838e+11 23241
## - PC8 1 6.6040e+10 4.8186e+11 23250
## - PC5 1 7.0799e+10 4.8662e+11 23261
## - PC30 1 9.1222e+10 5.0704e+11 23309
## - PC33 1 1.1814e+11 5.3397e+11 23370
## - PC26 1 2.0648e+11 6.2230e+11 23549
## - PC1 1 6.2134e+12 6.6292e+12 26314
# Print the model summary
summary(model_pca_backwards)
##
## Call:
## lm(formula = SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7 +
## PC8 + PC9 + PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 +
## PC18 + PC19 + PC20 + PC21 + PC22 + PC23 + PC25 + PC26 + PC27 +
## PC28 + PC29 + PC30 + PC31 + PC32 + PC33, data = as.data.frame(pca_99))
##
## Residuals:
## Min 1Q Median 3Q Max
## -83749 -10669 -153 10814 139969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181183.6 559.1 324.074 < 2e-16 ***
## PC1 25951.1 199.0 130.401 < 2e-16 ***
## PC2 1583.7 307.7 5.146 3.13e-07 ***
## PC3 -4335.9 351.0 -12.355 < 2e-16 ***
## PC4 5096.2 389.5 13.085 < 2e-16 ***
## PC5 -6243.6 448.5 -13.920 < 2e-16 ***
## PC7 -3608.2 511.6 -7.053 3.03e-12 ***
## PC8 -7005.0 521.1 -13.444 < 2e-16 ***
## PC9 4050.7 527.8 7.675 3.54e-14 ***
## PC10 2065.1 533.2 3.873 0.000113 ***
## PC11 839.6 537.4 1.562 0.118462
## PC12 4047.5 548.9 7.374 3.18e-13 ***
## PC13 1534.7 557.6 2.752 0.006012 **
## PC14 1425.4 564.6 2.525 0.011711 *
## PC15 1616.9 579.3 2.791 0.005340 **
## PC16 -3069.0 582.3 -5.271 1.63e-07 ***
## PC18 2135.5 612.1 3.489 0.000504 ***
## PC19 1480.5 617.9 2.396 0.016739 *
## PC20 1953.0 628.0 3.110 0.001918 **
## PC21 -5620.6 658.4 -8.537 < 2e-16 ***
## PC22 -5760.9 678.3 -8.494 < 2e-16 ***
## PC23 1260.7 708.0 1.781 0.075237 .
## PC25 3477.8 788.3 4.412 1.12e-05 ***
## PC26 20663.8 869.3 23.772 < 2e-16 ***
## PC27 4481.7 896.0 5.002 6.56e-07 ***
## PC28 6428.3 1000.6 6.424 1.94e-10 ***
## PC29 -2626.7 1062.4 -2.472 0.013567 *
## PC30 -17238.6 1091.0 -15.800 < 2e-16 ***
## PC31 -10384.0 1249.4 -8.311 2.67e-16 ***
## PC32 8577.8 1253.2 6.845 1.25e-11 ***
## PC33 27212.4 1513.4 17.981 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19120 on 1138 degrees of freedom
## Multiple R-squared: 0.9448, Adjusted R-squared: 0.9434
## F-statistic: 649.4 on 30 and 1138 DF, p-value: < 2.2e-16
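The AIC values in the `step()` trace come from `extractAIC()`, which for a linear model is n * log(RSS / n) + 2 * (number of coefficients). Plugging in the final step's figures (RSS of 4.1582e+11, 1169 observations, 31 coefficients) gives approximately 23079, matching the trace. The `AIC()` function adds constants that `extractAIC()` drops, which is why the two scales differ by a fixed amount for a given sample size. A small sketch verifying both relationships on the built-in mtcars data (`fit_toy` and the other names are illustrative):

```r
# Toy model on a built-in dataset; this is not the housing data
fit_toy <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n   <- nobs(fit_toy)
rss <- sum(residuals(fit_toy)^2)
edf <- n - df.residual(fit_toy)  # number of estimated coefficients

# AIC on the scale that step() prints
aic_step <- n * log(rss / n) + 2 * edf

# AIC() keeps the likelihood constants and counts the error variance as a parameter
aic_full <- aic_step + n * (1 + log(2 * pi)) + 2

all.equal(aic_step, extractAIC(fit_toy)[2])
all.equal(aic_full, AIC(fit_toy))
```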
Let’s remove the principal components with p-values above 0.01.
model_pca_backwards <- update(model_pca_backwards, . ~ . - PC11 - PC14 - PC19 - PC23 - PC29)
# Print the model summary
summary(model_pca_backwards)
##
## Call:
## lm(formula = SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7 +
## PC8 + PC9 + PC10 + PC12 + PC13 + PC15 + PC16 + PC18 + PC20 +
## PC21 + PC22 + PC25 + PC26 + PC27 + PC28 + PC30 + PC31 + PC32 +
## PC33, data = as.data.frame(pca_99))
##
## Residuals:
## Min 1Q Median 3Q Max
## -86159 -11004 89 10732 139446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181183.6 563.7 321.436 < 2e-16 ***
## PC1 25951.1 200.6 129.340 < 2e-16 ***
## PC2 1583.7 310.3 5.104 3.88e-07 ***
## PC3 -4335.9 353.8 -12.254 < 2e-16 ***
## PC4 5096.2 392.7 12.979 < 2e-16 ***
## PC5 -6243.6 452.2 -13.806 < 2e-16 ***
## PC7 -3608.2 515.8 -6.996 4.49e-12 ***
## PC8 -7005.0 525.3 -13.334 < 2e-16 ***
## PC9 4050.7 532.1 7.613 5.60e-14 ***
## PC10 2065.1 537.5 3.842 0.000129 ***
## PC12 4047.5 553.4 7.314 4.87e-13 ***
## PC13 1534.7 562.2 2.730 0.006432 **
## PC15 1616.9 584.1 2.768 0.005724 **
## PC16 -3069.0 587.1 -5.228 2.04e-07 ***
## PC18 2135.5 617.1 3.460 0.000559 ***
## PC20 1953.0 633.1 3.085 0.002088 **
## PC21 -5620.6 663.8 -8.468 < 2e-16 ***
## PC22 -5760.9 683.8 -8.424 < 2e-16 ***
## PC25 3477.8 794.7 4.376 1.32e-05 ***
## PC26 20663.8 876.4 23.578 < 2e-16 ***
## PC27 4481.7 903.3 4.961 8.06e-07 ***
## PC28 6428.3 1008.8 6.372 2.70e-10 ***
## PC30 -17238.6 1100.0 -15.672 < 2e-16 ***
## PC31 -10384.0 1259.7 -8.243 4.55e-16 ***
## PC32 8577.8 1263.5 6.789 1.81e-11 ***
## PC33 27212.4 1525.8 17.835 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19270 on 1143 degrees of freedom
## Multiple R-squared: 0.9437, Adjusted R-squared: 0.9424
## F-statistic: 765.7 on 25 and 1143 DF, p-value: < 2.2e-16
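The terms to drop were picked by reading the summary table. As a hedged alternative, the same selection can be made programmatically from the coefficient table. The sketch below uses the built-in mtcars data and illustrative names (`fit_toy`, `drop_terms`, `fit_reduced`); it assumes all predictors are numeric, so that coefficient names match term labels.

```r
# Toy model on a built-in dataset; this is not the housing data
fit_toy <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# p-values from the coefficient table, indexed by coefficient name
p_vals <- coef(summary(fit_toy))[, "Pr(>|t|)"]

# Predictors whose p-value exceeds the threshold (the intercept is never dropped)
drop_terms <- setdiff(names(p_vals)[p_vals > 0.01], "(Intercept)")

# Refit without those terms; keep the model unchanged if nothing qualifies
fit_reduced <- if (length(drop_terms) > 0) {
  update(fit_toy, as.formula(paste(". ~ . -", paste(drop_terms, collapse = " - "))))
} else {
  fit_toy
}

formula(fit_reduced)
```

With ordinary correlated predictors, removing several terms at once can change the remaining p-values. Principal components are uncorrelated, which is why the coefficient estimates in the two summaries above are identical and only the standard errors shift.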
# Predict on the held-out test set with each new model
predictions <- predict(model_all_manual, test_imputed)
# Calculate the RMSE
rmse_model_all_manual <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
predictions <- predict(model_pca_2, test_imputed)
# Calculate the RMSE
rmse_model_pca_2 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
predictions <- predict(model_pca_backwards, test_imputed)
# Calculate the RMSE
rmse_model_pca_backwards <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
# Combine all the models rmse and r-squared into one dataframe
model_results_new <- data.frame(Model = c("Backwards Elimination", "Top Correlated Variables", "PCA 80", "PCA 2", "PCA Backwards"),
RMSE = c(rmse_model_all_manual, rmse_model_top, rmse_model_pca_80, rmse_model_pca_2, rmse_model_pca_backwards),
R_squared = c(summary(model_all_manual)$r.squared, summary(model_top)$r.squared, summary(model_pca_80)$r.squared, summary(model_pca_2)$r.squared, summary(model_pca_backwards)$r.squared),
F_statistic = c(summary(model_all_manual)$fstatistic[1], summary(model_top)$fstatistic[1], summary(model_pca_80)$fstatistic[1], summary(model_pca_2)$fstatistic[1], summary(model_pca_backwards)$fstatistic[1]),
AIC = c(AIC(model_all_manual), AIC(model_top), AIC(model_pca_80), AIC(model_pca_2), AIC(model_pca_backwards)),
BIC = c(BIC(model_all_manual), BIC(model_top), BIC(model_pca_80), BIC(model_pca_2), BIC(model_pca_backwards)))
model_results_new
## Model RMSE R_squared F_statistic AIC BIC
## 1 Backwards Elimination 27736.90 0.8655953 389.4636 27417.09 27523.43
## 2 Top Correlated Variables 33713.83 0.8090931 820.7894 27801.33 27841.84
## 3 PCA 80 147295.33 0.8716175 433.7558 27361.50 27462.78
## 4 PCA 2 146536.96 0.8259851 2767.2872 27685.03 27705.28
## 5 PCA Backwards 150478.54 0.9436518 765.6635 26412.87 26549.60
Since the primary goal of this model is prediction, the model with the lowest test-set RMSE is the best model. That is the model using backwards elimination on all the numeric variables, with an RMSE of about 27,737. The backwards PCA model has a much higher R-squared and F-statistic than the backwards elimination model, along with the lowest AIC and BIC of all five models, but those metrics are computed on the training data. RMSE on the test set is the most important metric here, and all three PCA models have an RMSE of roughly 147,000 to 150,000, so they perform terribly for prediction. We will select the backwards elimination model as the final model.
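The RMSE used for this comparison is the square root of the mean squared difference between actual and predicted values. A minimal sketch of the calculation as a reusable helper, shown on the built-in mtcars data (the function name `rmse` and the toy train/test split are illustrative, not part of the analysis above):

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy hold-out split on a built-in dataset; this is not the housing data
set.seed(1)
idx <- sample(nrow(mtcars), 22)
fit_toy <- lm(mpg ~ wt + hp, data = mtcars[idx, ])

# Error on the rows the model did not see during fitting
rmse(mtcars$mpg[-idx], predict(fit_toy, mtcars[-idx, ]))
```

Because RMSE is in the units of the response, the values in the table can be read directly as typical dollar errors in SalePrice.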
# Predict the SalePrice using the final model
predictions <- predict(model_all_manual, eval_imputed)
# Save the predictions to a CSV file
submission <- data.frame(Id = eval$Id, SalePrice = predictions)
write.csv(submission, file = "submission.csv", row.names = FALSE)
# Add PCA to the eval set
# Note: this fits a new PCA on the eval set itself; it does not reuse the training rotation
pca_eval <- prcomp(eval_imputed, scale = TRUE)
eval_imputed <- cbind(eval_imputed, pca_eval$x)
# Predict using the PCA model
predictions <- predict(model_pca_backwards, eval_imputed)
# Save the predictions to a CSV file
submission <- data.frame(Id = eval$Id, SalePrice = predictions)
write.csv(submission, file = "submission_pca.csv", row.names = FALSE)
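The code above computes the evaluation set's components with a fresh `prcomp()` call. Components from a separate fit are not guaranteed to line up with the training components in sign, order, or loadings, and this may be one reason the PCA models score so poorly on held-out data. A common alternative is to project new data onto the training rotation with `predict()` on the fitted `prcomp` object. A minimal sketch on the built-in mtcars data, with illustrative names (`pca_train`, `new_scores`, `manual_scores`); it assumes the new data has the same columns the PCA was fitted on:

```r
# Toy split on a built-in dataset; this is not the housing data
set.seed(42)
cols <- c("disp", "hp", "drat", "wt", "qsec")
idx <- sample(nrow(mtcars), 22)
train_x <- mtcars[idx, cols]
new_x <- mtcars[-idx, cols]

# Fit the PCA on the training rows only
pca_train <- prcomp(train_x, scale. = TRUE)

# Project the new rows using the training centering, scaling, and rotation
new_scores <- predict(pca_train, newdata = new_x)

# Equivalent manual projection, to show what predict() does here
manual_scores <- scale(new_x, center = pca_train$center, scale = pca_train$scale) %*%
  pca_train$rotation

all.equal(unname(new_scores), unname(manual_scores))
```

Scores produced this way live in the same coordinate system as the components the regression was trained on, so they can be passed to `predict()` on the fitted model.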
Kaggle user name: Shaya Engelman. The Kaggle score for the final model was 0.36307. That is not a great score, but considering that I didn’t include any of the categorical variables (and neighborhood, for example, is obviously an important predictor of house price), it’s not terrible. I would love to revisit this competition when I have more time to add categorical variables and improve my model.